How to output a Unicode string to RTF (using C#)

Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/1368020/

Time: 2020-08-06 15:44:19  Source: igfitidea

c# unicode rtf codepoint

Asked by Emir

I'm trying to output a Unicode string in RTF format (using C# and WinForms).

From Wikipedia:

If a Unicode escape is required, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode codepoint number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter beh, specifying that older programs which do not have Unicode support should render it as a question mark instead.

I don't know how to convert a Unicode character into its Unicode code point (as in "\u1576"). Converting to UTF-8, UTF-16 and similar encodings is easy, but I don't know how to get the code point.

Scenario in which I use this:

  • I read an existing RTF file into a string (the file is a template)
  • I replace #TOKEN# with MyUnicodeString using string.Replace (the template is populated with data)
  • I write the result into another RTF file.
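
The scenario above can be sketched roughly as follows. This is a minimal sketch, not the asker's code: the `RtfTemplate` class and its `Escape`, `Fill` and `FillFile` methods are hypothetical names, and it assumes the template is plain ASCII RTF and that `#TOKEN#` only appears where ordinary text is expected.

```csharp
using System.Globalization;
using System.IO;
using System.Text;

static class RtfTemplate
{
    // Escapes text for RTF: backslash and braces get a backslash prefix,
    // and every non-ASCII UTF-16 code unit becomes \uN? where N is the
    // code unit written as a signed 16-bit decimal number.
    public static string Escape(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c == '\\' || c == '{' || c == '}')
                sb.Append('\\').Append(c);
            else if (c <= 0x7f)
                sb.Append(c);
            else
                sb.Append("\\u")
                  .Append(unchecked((short)c).ToString(CultureInfo.InvariantCulture))
                  .Append('?');
        }
        return sb.ToString();
    }

    // Replaces every occurrence of the token with the escaped value.
    public static string Fill(string template, string token, string value)
    {
        return template.Replace(token, Escape(value));
    }

    // Hypothetical file-level wrapper; both paths are placeholders.
    // The escaped output is pure ASCII as long as the template is.
    public static void FillFile(string templatePath, string outputPath,
                                string token, string value)
    {
        string template = File.ReadAllText(templatePath);
        File.WriteAllText(outputPath, Fill(template, token, value));
    }
}
```

Escaping the value before the replacement, rather than escaping the whole document afterwards, keeps the template's own control words and braces intact.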

The problem arises when the data contains Unicode characters.

Accepted answer by Eric Smith

Provided that all the characters you're catering for exist in the Basic Multilingual Plane (it's unlikely that you'll need anything more), a simple UTF-16 encoding should suffice.

Wikipedia:

All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use.

The following sample program illustrates doing something along the lines of what you want:

using System;
using System.IO;
using System.Text;

static void Main(string[] args)
{
    // ë (U+00EB), decoded from its UTF-16 little-endian bytes
    char[] ca = Encoding.Unicode.GetChars(new byte[] { 0xeb, 0x00 });
    using (var sw = new StreamWriter(@"c:/helloworld.rtf"))
    {
        sw.WriteLine(@"{\rtf
{\fonttbl {\f0 Times New Roman;}}
\f0\fs60 H" + GetRtfUnicodeEscapedString(new String(ca)) + @"llo, World!
}");
    }
}

static string GetRtfUnicodeEscapedString(string s)
{
    var sb = new StringBuilder();
    foreach (var c in s)
    {
        if (c <= 0x7f)
            sb.Append(c); // ASCII passes through unchanged
        else
            // \uN with N as the decimal UTF-16 value, then "?" as the fallback
            sb.Append("\\u" + Convert.ToUInt32(c) + "?");
    }
    return sb.ToString();
}

The important bit is Convert.ToUInt32(c), which returns the UTF-16 value of the char; for characters in the Basic Multilingual Plane this is the code point value. The RTF escape for Unicode requires a decimal Unicode value. The System.Text.Encoding.Unicode encoding corresponds to UTF-16 as per the MSDN documentation.
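
To make the difference between a `char` and a code point concrete, here is a small sketch. The `RtfCodeUnits` class and its method names are made up for illustration. It assumes the target reader accepts a character outside the BMP as two consecutive `\uN` escapes, one per surrogate, which is worth verifying against the RTF reader you care about.

```csharp
using System.Globalization;
using System.Text;

static class RtfCodeUnits
{
    // Code point that starts at index i; a surrogate pair is combined
    // into a single value above 0xFFFF.
    public static int CodePointAt(string s, int i)
    {
        return char.ConvertToUtf32(s, i);
    }

    // Escapes per UTF-16 code unit, which is what foreach over a string
    // visits. A non-BMP character therefore yields two \uN escapes.
    public static string EscapeEachCodeUnit(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (c == '\\' || c == '{' || c == '}')
                sb.Append('\\').Append(c);
            else if (c <= 0x7f)
                sb.Append(c);
            else
                sb.Append("\\u")
                  .Append(unchecked((short)c).ToString(CultureInfo.InvariantCulture))
                  .Append('?');
        }
        return sb.ToString();
    }
}
```

For the Arabic letter beh, `"\u0628"`, the string has one `char` and `CodePointAt` returns 1576. For the musical G clef, `"\U0001D11E"`, the string has two `char` values (0xD834 and 0xDD1E), `CodePointAt(s, 0)` returns 119070, and `EscapeEachCodeUnit` produces `\u-10188?\u-8930?`.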

Answered by Ian Kemp

You will have to convert the string to a byte[] array (using Encoding.Unicode.GetBytes(string)), then loop through that array and prepend a \ and a u character to all Unicode characters you find. When you then convert the array back to a string, you'd have to leave the Unicode characters as numbers.

For example, if your array looks like this:

byte[] unicodeData = new byte[] { 0x15, 0x76 };

it would become:

// 5c = \, 75 = u
byte[] unicodeData = new byte[] { 0x5c, 0x75, 0x15, 0x76 };
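
A byte-level version of this idea might look like the sketch below, under a few stated assumptions. `Encoding.Unicode.GetBytes` returns UTF-16 little-endian bytes, so U+0628 (decimal 1576) comes back as 0x28, 0x06. Because `\u` takes a decimal number, each byte pair has to be recombined into one value and formatted as text rather than copied through as raw bytes. `RtfFromBytes.EscapeViaBytes` is a hypothetical helper, not part of the answer.

```csharp
using System.Globalization;
using System.Text;

static class RtfFromBytes
{
    public static string EscapeViaBytes(string s)
    {
        // UTF-16LE: two bytes per code unit, low byte first.
        byte[] bytes = Encoding.Unicode.GetBytes(s);
        var sb = new StringBuilder();
        for (int i = 0; i < bytes.Length; i += 2)
        {
            int unit = bytes[i] | (bytes[i + 1] << 8);
            if (unit == '\\' || unit == '{' || unit == '}')
                sb.Append('\\').Append((char)unit);
            else if (unit <= 0x7f)
                sb.Append((char)unit);
            else
                // Signed 16-bit decimal, as RTF control word arguments expect.
                sb.Append("\\u")
                  .Append(unchecked((short)unit).ToString(CultureInfo.InvariantCulture))
                  .Append('?');
        }
        return sb.ToString();
    }
}
```

Since a C# `char` already is a UTF-16 code unit, iterating over the string directly gives the same values without the round trip through bytes.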

Answered by Hogan

Fixed code from the accepted answer - added special character escaping, as described in this link

static string GetRtfUnicodeEscapedString(string s)
{
    var sb = new StringBuilder();
    foreach (var c in s)
    {
        if (c == '\\' || c == '{' || c == '}')
            sb.Append(@"\" + c);
        else if (c <= 0x7f)
            sb.Append(c);
        else
            sb.Append("\\u" + Convert.ToUInt32(c) + "?");
    }
    return sb.ToString();
}

Answered by Yongtao Wang

Based on the specification, here is some Java code that is tested and works:

  public static String escape(String s){
        if (s == null) return s;

        int len = s.length();
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++){
            char c = s.charAt(i);
            if (c >= 0x20 && c < 0x80){
                if (c == '\\' || c == '{' || c == '}'){
                    sb.append('\\');
                }
                sb.append(c);
            }
            else if (c < 0x20 || (c >= 0x80 && c <= 0xFF)){
                // \'hh takes exactly two hex digits, so pad with a leading zero
                sb.append("\\'");
                sb.append(String.format("%02x", (int) c));
            }else{
                sb.append("\\u");
                sb.append((short)c); // signed 16-bit, so negative above 32767
                sb.append("??");//two bytes ignored
            }
        }
        return sb.toString();
 }

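The important thing is that you need to append 2 characters (ones close to the Unicode character, or just ?) after the escaped Unicode value, because the Unicode character occupies 2 bytes. Note that, per the spec quoted below, the number of characters a reader skips is set by the last \ucN value (the default is \uc1), so two fallback characters correspond to \uc2.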

The spec also says you should use a negative value if the code point is greater than 32767, but in my tests it also worked without a negative value.

Here is the spec:

\uN This keyword represents a single Unicode character which has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number. This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.

As with all RTF keywords, a keyword-terminating space may be present (before the ANSI characters) which is not counted in the characters to skip. While this is not likely to occur (or recommended), a \bin keyword, its argument, and the binary data that follows are considered one character for skipping purposes. If an RTF scope delimiter character (that is, an opening or closing brace) is encountered while scanning skippable data, the skippable data is considered to be ended before the delimiter. This makes it possible for a reader to perform some rudimentary error recovery. To include an RTF delimiter in skippable data, it must be represented using the appropriate control symbol (that is, escaped with a backslash,) as in plain text. Any RTF control word or symbol is considered a single character for the purposes of counting skippable characters.

An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.
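
The writer behaviour described in this paragraph could be sketched as below. `RtfUnicodeWriter` is a hypothetical class, not an existing API. It assumes a flat run of text inside a single RTF group (since `\ucN` is scoped to the enclosing group), that the caller supplies a plain ASCII fallback for each character, and that nothing earlier in the group has changed the skip count from its default of 1.

```csharp
using System;
using System.Globalization;
using System.Text;

sealed class RtfUnicodeWriter
{
    private readonly StringBuilder _sb = new StringBuilder();
    private int _skipCount = 1; // assumed starting value: the \uc default

    // Writes one UTF-16 code unit followed by its ASCII fallback,
    // emitting \ucN first whenever the fallback length changes.
    public void Write(char c, string asciiFallback)
    {
        foreach (char f in asciiFallback)
        {
            if (f > 0x7f || f == '\\' || f == '{' || f == '}')
                throw new ArgumentException("Fallback must be plain ASCII text.");
        }
        if (asciiFallback.Length != _skipCount)
        {
            _skipCount = asciiFallback.Length;
            _sb.Append("\\uc").Append(_skipCount.ToString(CultureInfo.InvariantCulture));
        }
        _sb.Append("\\u")
           .Append(unchecked((short)c).ToString(CultureInfo.InvariantCulture));
        // Keyword-terminating space: not counted among the skipped characters,
        // and it keeps a fallback that starts with a digit from being read
        // as part of the number.
        _sb.Append(' ');
        _sb.Append(asciiFallback);
    }

    public override string ToString()
    {
        return _sb.ToString();
    }
}
```

Writing the trade mark sign with the fallback `(TM)` and then the Arabic letter beh with the fallback `?` yields `\uc4\u8482 (TM)\uc1\u1576 ?`.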

RTF control words generally accept signed 16-bit numbers as arguments. For this reason, Unicode values greater than 32767 must be expressed as negative numbers.
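
In C#, the signed form can be obtained by reinterpreting the 16-bit value, as in this minimal sketch (`RtfSigned16` and its method names are hypothetical). The `unchecked` keyword is there so the cast wraps around instead of throwing when the project is compiled with overflow checking enabled.

```csharp
static class RtfSigned16
{
    // UTF-16 code unit -> argument for \uN (range -32768..32767).
    public static int ToRtfArgument(char c)
    {
        return unchecked((short)c);
    }

    // Argument read from \uN -> UTF-16 code unit. Accepts both the signed
    // form and the unsigned form that some writers emit.
    public static char FromRtfArgument(int n)
    {
        return unchecked((char)n);
    }
}
```

For example, the Hangul syllable U+AC00 (decimal 44032) becomes -21504, because 44032 - 65536 = -21504, while U+0628 (decimal 1576) is unchanged.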