C#: How to determine whether a string is XML?

Question

Asked by si618

We have a string field which can contain XML or plain text. The XML contains no <?xml header and no root element, i.e. it is not well-formed.

We need to be able to redact XML data, emptying element and attribute values, leaving just their names, so I need to test if this string is XML before it's redacted.

Currently I'm using this approach:

string redact(string eventDetail)
{
    string detail = eventDetail.Trim();
    if (!detail.StartsWith("<") && !detail.EndsWith(">")) return eventDetail;
    ...

Is there a better way?

Are there any edge cases this approach could miss?

I appreciate I could use XmlDocument.LoadXml and catch XmlException, but this feels like an expensive option, since I already know that a lot of the data will not be XML.

Here's an example of the XML data. Apart from missing a root element (omitted to save space, since there will be a lot of data), we can assume it is well-formed:

<TableName FirstField="Foo" SecondField="Bar" /> 
<TableName FirstField="Foo" SecondField="Bar" /> 
...

Currently we are only using attribute based values, but we may use elements in the future if the data becomes more complex.

SOLUTION

Based on multiple comments (thanks guys!)

string redact(string eventDetail)
{
    if (string.IsNullOrEmpty(eventDetail)) return eventDetail; //+1 for unit tests :)
    string detail = eventDetail.Trim();
    if (!detail.StartsWith("<") && !detail.EndsWith(">")) return eventDetail;
    XmlDocument xml = new XmlDocument();
    try
    {
        xml.LoadXml(string.Format("<Root>{0}</Root>", detail));
    }
    catch (XmlException e)
    {
        log.WarnFormat("Data NOT redacted. Caught {0} loading eventDetail {1}", e.Message, eventDetail);
        return eventDetail;
    }
    ... // redact
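
The redaction step itself is elided above. As a rough sketch only, and assuming LINQ to XML (XElement) is acceptable in place of XmlDocument, the "empty the values, keep the names" step described in the question might look like the following. The class name RedactSketch and the method RedactFragment are hypothetical and not part of the original post.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class RedactSketch
{
    // Hypothetical helper (not from the original post): empties attribute values
    // and removes text content, keeping element and attribute names.
    // Returns the input unchanged if it does not parse as a fragment.
    public static string RedactFragment(string detail)
    {
        if (string.IsNullOrEmpty(detail)) return detail;

        XElement root;
        try
        {
            // Wrap in a temporary root so a rootless fragment can be parsed.
            root = XElement.Parse("<Root>" + detail + "</Root>");
        }
        catch (XmlException)
        {
            return detail; // not parseable: leave the data untouched
        }

        // Blank every attribute value but keep the attribute itself.
        foreach (XAttribute attribute in root.Descendants().Attributes().ToList())
            attribute.Value = string.Empty;

        // Snapshot the text nodes first so the tree is not modified mid-enumeration.
        foreach (XText text in root.DescendantNodes().OfType<XText>().ToList())
            text.Remove();

        // Serialize only the children, dropping the temporary root again.
        return string.Concat(
            root.Nodes().Select(node => node.ToString(SaveOptions.DisableFormatting)));
    }
}
```

Calling RedactSketch.RedactFragment on the sample data from the question would keep TableName, FirstField and SecondField but leave both attribute values empty. Whitespace between the original elements is not preserved by this sketch.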

Answer 1

Accepted answer by Samuel Carrijo

One possibility is to mix both solutions. You can keep the check in your redact method and try to load the string (inside the if). This way, you'll only try to load what is likely to be well-formed XML, and discard most of the non-XML entries.

Answer 2

Answered by lod3n

If you're going to accept not well formed XML in the first place, I think catching the exception is the best way to handle it.

Answer 3

Answered by JaredPar

If your goal is reliability then the best option is to use XmlDocument.LoadXml to determine if it's valid XML or not. A full parse of the data may be expensive but it's the only way to reliably tell if it's valid XML or not. Otherwise any character you don't examine in the buffer could cause the data to be illegal XML.

Answer 4

Answered by Pavel Minaev

If the XML contains no root element (i.e. it's an XML fragment, not a full document), then the following would be a perfectly valid sample as well, but wouldn't match your detector:

foo<bar/>baz

In fact, any text string free of unescaped < and & characters would be a valid XML fragment (consider if the original XML document was just the root element wrapping some text, and you take the root element tags away)!
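
To illustrate the point about fragments, here is a minimal sketch that parses a string with XmlReader at ConformanceLevel.Fragment, which permits top-level text and multiple top-level elements. The class name FragmentCheck and the method IsWellFormedFragment are hypothetical names chosen for this example.

```csharp
using System;
using System.IO;
using System.Xml;

public static class FragmentCheck
{
    // Hypothetical helper: true if the text parses as an XML fragment,
    // i.e. without requiring a single root element.
    public static bool IsWellFormedFragment(string text)
    {
        if (text == null) return false;

        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment
        };

        try
        {
            using (var reader = XmlReader.Create(new StringReader(text), settings))
            {
                // Reading to the end forces the parser to check the whole input.
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

With this sketch, both "foo<bar/>baz" and an ordinary sentence with no markup are accepted, while text with a stray "<" or "&" is rejected. That is why being well-formed as a fragment does not, by itself, distinguish XML data from plain text.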

Answer 5

Answered by Ira Baxter

Depends on how accurate a test you want. Considering that you already don't have the official <?xml declaration, you're already trying to detect something that isn't strictly XML. Ideally you'd parse the text with a full XML parser (LoadXml, as you suggest); anything it rejects isn't XML. The question is, do you care if you accept a non-XML string? For instance, are you OK with accepting

  <the quick brown fox jumped over the lazy dog's back>

as XML and stripping it? If so, your technique is fine. If not, you have to decide how tight a test you want and code a recognizer with that degree of tightness.
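
As one possible reading of "how tight a test", the sketch below accepts a string only if it parses as a fragment, contains at least one element, and has no non-whitespace text at the top level. These three rules are assumptions made for illustration, and the names TightRecognizer and LooksLikeElementData are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;

public static class TightRecognizer
{
    // Hypothetical recognizer: accepts only fragments made of elements,
    // with no loose text outside them. The rules are illustrative assumptions.
    public static bool LooksLikeElementData(string text)
    {
        if (string.IsNullOrEmpty(text)) return false;

        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment
        };

        bool sawElement = false;
        try
        {
            using (var reader = XmlReader.Create(new StringReader(text), settings))
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element)
                        sawElement = true;
                    else if (reader.NodeType == XmlNodeType.Text && reader.Depth == 0)
                        return false; // loose text outside any element
                }
            }
        }
        catch (XmlException)
        {
            return false; // the parser rejected it, so it is not XML
        }
        return sawElement;
    }
}
```

Under these rules the "quick brown fox" string above is rejected, because the parser fails on the attribute-like words that have no values, even though it starts with "<" and ends with ">". The sample rows from the question are accepted.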

Answer 6

Answered by Noon Silk

How is the data coming to you? What is the other type of data surrounding it? Perhaps there is a better way; perhaps you can tokenise the data you control, and then infer that anything that is not within those tokens is XML, but we'd need to know more.

Failing a cute solution like that, I think what you have is fine (for validating that it starts and ends with those characters).

We need to know more about the data format really.

Answer 7

Answered by Evgeny

try
{
    XmlDocument myDoc = new XmlDocument();
    myDoc.LoadXml(myString);
}
catch(XmlException ex)
{
    //take care of the exception
}
