Strip byte order mark from a string in C#

Question

Asked by TrueWill

I've read similar posts on this and they don't answer my question.

In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.

So I have a string that starts with \x00EF\x00BB\x00BF, and I want to remove that if it exists. Right now I'm using

if (xml.StartsWith(ByteOrderMarkUtf8))
{
    xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}

but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?

Thank you!

Answer 1

Accepted answer by Martin v. Löwis

If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point. Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).

Answer 2

Answer by Andrew Arnott

Pass the byte buffer (via DownloadData) to string Encoding.UTF8.GetString(byte[]) to get the string, rather than downloading the buffer as a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you're properly decoding it as I suggest here, Unicode characters will probably be misinterpreted, resulting in a corrupted string.

Edit: Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.
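
As a rough illustration of the decoding step discussed above, here is a minimal sketch. The class and method names are hypothetical, and the byte array stands in for whatever DownloadData returns. One thing worth checking in your own code: as far as I know, Encoding.UTF8.GetString does not drop a leading BOM by itself; the three bytes EF BB BF decode to the single character U+FEFF, which matches Martin's point that a BOM in a string is one code point.

```csharp
using System;
using System.Text;

public static class BomDecodeDemo
{
    // Hypothetical helper: decodes UTF-8 bytes exactly as they are.
    // A leading EF BB BF sequence comes out as the single char U+FEFF.
    public static string Decode(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }

    // Hypothetical helper: reports whether the decoded text still
    // begins with the BOM character.
    public static bool StartsWithBomChar(string text)
    {
        return text.Length > 0 && text[0] == '\uFEFF';
    }
}
```

So if the text is going to an API that rejects a leading U+FEFF, that one character would still need to be removed after decoding.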

Answer 3

Answer by TrueWill

I had some incorrect test data, which caused me some confusion. Based on "How to avoid tripping over UTF-8 BOM when reading files", I found that this worked:

private readonly string _byteOrderMarkUtf8 =
    Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

public string GetXmlResponse(Uri resource)
{
    string xml;

    using (var client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;
        xml = client.DownloadString(resource);
    }

    if (xml.StartsWith(_byteOrderMarkUtf8, StringComparison.Ordinal))
    {
        xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
    }

    return xml;
}

Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.
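
Since the BOM is a single character once the text is decoded correctly, the check can also be written against that one char. This is only a sketch with hypothetical names (BomText, StripLeadingBom), not part of the code above. Comparing a char directly also avoids culture-sensitive string matching, which may treat U+FEFF as an ignorable character.

```csharp
using System;

public static class BomText
{
    // Hypothetical helper: drops one leading U+FEFF if it is present,
    // otherwise returns the input unchanged.
    public static string StripLeadingBom(string text)
    {
        if (!string.IsNullOrEmpty(text) && text[0] == '\uFEFF')
        {
            return text.Substring(1);
        }

        return text;
    }
}
```

The result could then be handed to XDocument.Parse, for example XDocument.Parse(BomText.StripLeadingBom(xml)).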

Answer 4

Answer by Vivek Ayer

This works as well:

int index = xmlResponse.IndexOf('<');
if (index > 0)
{
    xmlResponse = xmlResponse.Substring(index);
}

Answer 5

Answer by Steven Oxley

I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF-8 encoding with a byte order mark at the beginning of it. Then this would be the code to solve the problem:

var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);

It's that simple.

If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):

var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);

Answer 6

Answer by PJUK

I recently had issues with the .NET 4 upgrade, but until then the simple answer was

String.Trim()

which removes the BOM up to .NET 3.5. However, in .NET 4 you need to change it slightly:

String.Trim(new char[]{'\uFEFF'});

That will also get rid of the byte order mark, though you may also want to remove the ZERO WIDTH SPACE (U+200B):

String.Trim(new char[]{'\uFEFF','\u200B'});

You could also use this to remove other unwanted characters.

Some further information from http://msdn.microsoft.com/en-us/library/t97s7bs3.aspx:

The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).

Answer 7

Answer by Andrew Thompson

I wrote the following post after coming across this issue.

Essentially instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve.
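
A minimal sketch of that idea follows. The class and method names are hypothetical, and a byte array wrapped in a MemoryStream stands in for the file; the StreamReader(Stream, Encoding, bool) constructor is the one assumed here, with the last argument enabling detection from the byte order mark.

```csharp
using System.IO;
using System.Text;

public static class BomReader
{
    // Hypothetical helper: decodes a byte array through StreamReader,
    // which consumes a leading BOM instead of returning it as text.
    public static string ReadText(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For a file on disk, the same reader could be built from a path instead of a MemoryStream.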

Answer 8

Answer by Tiago Gouvêa

A quick and simple method to remove it directly from a string:

private static string RemoveBom(string p)
{
     string BOMMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
     // Ordinal comparison: a culture-sensitive StartsWith can ignore U+FEFF
     // and report a match even when no BOM is present.
     if (p.StartsWith(BOMMarkUtf8, StringComparison.Ordinal))
         p = p.Remove(0, BOMMarkUtf8.Length);
     return p.Replace("\0", "");
}

How to use:

string yourCleanString = RemoveBom(yourBOMString);

Answer 9

Answer by lucasjam

StreamReader sr = new StreamReader(strFile, true);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(sr);

Answer 10

Answer by Timothy

I ran into this when I had a base-64 encoded file to transform into a string. While I could have saved it to a file and then read it correctly, here's the best solution I could think of to get from the byte[] of the file to the string (based loosely on TrueWill's answer):

public static string GetUTF8String(byte[] data)
{
    byte[] utf8Preamble = Encoding.UTF8.GetPreamble();
    if (data.StartsWith(utf8Preamble))
    {
        // Skip the preamble bytes so the BOM never reaches the string.
        return Encoding.UTF8.GetString(data, utf8Preamble.Length, data.Length - utf8Preamble.Length);
    }
    else
    {
        return Encoding.UTF8.GetString(data);
    }
}

Where StartsWith(byte[]) is the logical extension:

public static bool StartsWith(this byte[] thisArray, byte[] otherArray)
{
   // Handle invalid/unexpected input
   // (nulls, thisArray.Length < otherArray.Length, etc.)
   if (thisArray == null || otherArray == null || thisArray.Length < otherArray.Length)
   {
       return false;
   }

   for (int i = 0; i < otherArray.Length; ++i)
   {
       if (thisArray[i] != otherArray[i])
       {
           return false;
       }
   }

   return true;
}
