C#: Parsing XML with an ampersand (&)

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1473826/

Time: 2020-08-06 17:53:45  Source: igfitidea

parsing XML with ampersand

c# xml xelement

Asked by paradisontheitroad

I have a string which contains XML. I just want to parse it into an XElement, but it contains an ampersand. I still have a problem parsing it after running it through HtmlDecode. Any suggestions?

string test = "<MyXML><SubXML><XmlEntry Element=\"test\" value=\"wow&\" /></SubXML></MyXML>";

XElement.Parse(HttpUtility.HtmlDecode(test));

I also added these Replace calls to escape those characters, but I am still getting an XmlException.

string encodedXml = test.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;").Replace("'", "&apos;");
XElement myXML = XElement.Parse(encodedXml);

I even tried it with this:

string newContent=  SecurityElement.Escape(test);
XElement myXML = XElement.Parse(newContent);

Answered by Tommy Carlier

If your string is not valid XML, it will not parse. If it contains an ampersand on its own (one that does not start an entity reference such as &amp;), it's not valid XML. Unlike HTML, XML is very strict.
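
To make the strictness concrete, here is a minimal sketch (the class and method names are my own, not from the question) that wraps XElement.Parse and reports whether the input was accepted. It assumes only System.Xml.Linq is available.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the original question.
public static class AmpersandDemo
{
    // Returns true when the text parses as XML,
    // false when XElement.Parse rejects it with an XmlException.
    public static bool TryParse(string xml, out XElement element)
    {
        try
        {
            element = XElement.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            element = null;
            return false;
        }
    }
}
```

Calling TryParse with value="wow&" should return false, while the same markup with value="wow&amp;" should parse, and reading the attribute back gives wow& again.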

Answered by Wim ten Brink

The ampersand makes the XML invalid. This cannot be fixed by a stylesheet, so you need to use some other tool, or write code in VB/C#/PHP/Delphi/Lisp/etc., to remove it or to translate it to &amp;.

Answered by Justin Niessner

Your string doesn't contain valid XML; that's the issue. You need to change your string to:

<MyXML><SubXML><XmlEntry Element="test" value="wow&amp;" /></SubXML></MyXML>

Answered by AlexS

You should 'encode' rather than decode. But calling HttpUtility.HtmlEncode on the whole string will not help you, as it will encode your '<' and '>' symbols as well, and your string will no longer be XML.

I think that for this case the best solution would be to replace '&' with '&amp;'.
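
A small sketch of the difference, under a couple of assumptions: I use System.Net.WebUtility.HtmlEncode here instead of HttpUtility.HtmlEncode so the snippet does not need a System.Web reference, and the class and method names are made up for illustration.

```csharp
using System.Net;

// Hypothetical helper for illustration only.
public static class EncodeWholeDocumentDemo
{
    // Encodes every special character, including the markup itself,
    // so the result is no longer an XML document.
    public static string EncodeEverything(string xml) => WebUtility.HtmlEncode(xml);

    // Escapes only the ampersands and leaves the tags alone.
    // Only safe when the input contains no entities that are already escaped.
    public static string EscapeAmpersandsOnly(string xml) => xml.Replace("&", "&amp;");
}
```

For the input <a v="wow&" />, EncodeEverything returns &lt;a v=&quot;wow&amp;&quot; /&gt; (no tags left to parse), while EscapeAmpersandsOnly returns <a v="wow&amp;" />.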

Answered by Colin

HtmlEncode will not do the trick; it will probably create even more ampersands (for instance, a " might become &quot;, which is an XML entity reference). The predefined XML entity references are the following:

&amp;   & 
&apos;  ' 
&quot;  " 
&lt;    < 
&gt;    > 

But you might also get things like &nbsp;, which is fine in HTML but not in XML. Therefore, as everybody else said, correct the XML first: make sure that any character that is NOT PART OF THE ACTUAL MARKUP OF YOUR XML (that is to say, anything INSIDE your XML as a variable or text) and that occurs in the entity reference list is translated to its corresponding entity (so < would become &lt;). If the text containing the illegal character is text inside an XML node, you could take the easy way and surround the text with a CDATA section; this won't work for attributes, though.
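
If you control how the XML is produced, one way to get this escaping without doing it by hand is to build the elements with LINQ to XML rather than by string concatenation. This is only a sketch of that idea, not something the answer above prescribes; the class, method, and "Note" element names are invented for the example.

```csharp
using System.Xml.Linq;

// Hypothetical helper for illustration only.
public static class BuildXmlDemo
{
    // Attribute values are escaped by the serializer, so a raw '&' is fine here.
    public static XElement BuildEntry(string value) =>
        new XElement("XmlEntry",
            new XAttribute("Element", "test"),
            new XAttribute("value", value));

    // Text content can be wrapped in a CDATA section instead of being escaped.
    public static XElement BuildNote(string text) =>
        new XElement("Note", new XCData(text));
}
```

BuildEntry("wow&").ToString() yields <XmlEntry Element="test" value="wow&amp;" />, and BuildNote("a & b < c").ToString() yields <Note><![CDATA[a & b < c]]></Note>. Both parse back to the original raw text.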

Answered by Ahmad Mageed

Ideally the XML is escaped properly before your code consumes it. If this is beyond your control, you could write a regex. Do not use the String.Replace method unless you're absolutely sure the values do not contain other escaped items.

For example, "wow&amp;".Replace("&", "&amp;") results in wow&amp;amp; which is clearly undesirable.

Regex.Replace can give you more control to avoid this scenario, and can be written to only match "&" symbols that are not part of an existing entity reference, such as &lt;. Something like:

string result = Regex.Replace(test, "&(?!(amp|apos|quot|lt|gt);)", "&amp;");

The above works, but admittedly it doesn't cover the variety of other entities that start with an ampersand, such as &nbsp;, and the list can grow.
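
As a sketch of how the pattern could be extended, the version below also leaves numeric character references (such as &#39; or &#x26;) alone. The extra alternatives and the class name are my own additions, not part of the answer. Note that it still treats named HTML entities like &nbsp; as stray text and escapes their ampersand, which at least keeps the result parseable as XML.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class StrayAmpersandFixer
{
    // Matches '&' unless it starts one of the five predefined XML entities
    // or a decimal/hexadecimal character reference.
    private static readonly Regex StrayAmpersand = new Regex(
        "&(?!(?:amp|apos|quot|lt|gt|#[0-9]+|#x[0-9A-Fa-f]+);)");

    public static string Escape(string xml) => StrayAmpersand.Replace(xml, "&amp;");
}
```

For example, Escape("wow& &amp; &#39; &nbsp;") returns wow&amp; &amp; &#39; &amp;nbsp;.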

A more flexible approach would be to decode the content of the value attribute, then re-encode it. If you have value="&wow&amp;", the decode process would return "&wow&", and re-encoding it would then return "&amp;wow&amp;", which is desirable. To pull this off you could use this:

string result = Regex.Replace(test, @"value=\""(.*?)\""", m => "value=\"" +
    HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups[1].Value)) +
    "\"");
var doc = XElement.Parse(result);

Bear in mind that the above regex only targets the contents of the value attribute. If other areas in the XML structure suffer from the same issue, the regex can be tweaked to match them and replace their content in a similar fashion.
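
For reference, here is a self-contained sketch of the same decode-then-re-encode idea that runs without System.Web. I am assuming System.Net.WebUtility as a stand-in for HttpUtility; the class and method names are made up.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class ValueAttributeNormalizer
{
    // Decodes each value="..." attribute, then encodes it again, so that
    // already-escaped and unescaped ampersands end up escaped exactly once.
    public static string Normalize(string xml) =>
        Regex.Replace(xml, "value=\"(.*?)\"", m =>
            "value=\"" +
            WebUtility.HtmlEncode(WebUtility.HtmlDecode(m.Groups[1].Value)) +
            "\"");
}
```

With the input <a value="&wow&amp;" />, Normalize returns <a value="&amp;wow&amp;" />, which XElement.Parse accepts; the parsed attribute value is &wow&.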



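EDIT: The updated solution below should handle content between tags as well as anything between double quotes. Be sure to test this thoroughly. Manipulating XML/HTML tags with regex is generally a poor choice because it is error prone and overly complicated. Your case is somewhat special, since you need to sanitize the input before you can use it.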

string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
string result = Regex.Replace(test, pattern, m =>
            m.Groups["start"].Value +
            HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
            m.Groups["end"].Value);
var doc = XElement.Parse(result);

Answered by Wilfred Springer

Perhaps consider writing your own XMLDocumentScanner. That's what NekoHTML is doing to have the ability to ignore ampersands not used as entity references.

Answered by Filip Stankiewicz

This is the simplest and best approach. It works with all characters and lets you parse XML for any web service call, e.g. SharePoint ASMX.

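public string XmlEscape(string unescaped)
{
    // Let XmlDocument do the escaping: assign the raw text to an element,
    // then read back its serialized (escaped) inner XML.
    XmlDocument doc = new XmlDocument();
    var node = doc.CreateElement("root");
    node.InnerText = unescaped;
    return node.InnerXml;
}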

Answered by TheAtomicOption

Filip's answer is on the right track, but you can hijack the System.Xml.XmlDocument class to do this for you without an entire new utility function.

XmlDocument doc = new XmlDocument();
string xmlEscapedString = (doc.CreateTextNode("Unescaped '&' containing string that would have broken your xml")).OuterXml;
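
Tying this back to the question: the escaping has to be applied to the raw value only, not to the whole document (which is what the SecurityElement.Escape attempt in the question did). A rough sketch, assuming you still have the raw value before it is spliced into the markup; the class and method names are mine.

```csharp
using System.Security;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
public static class EscapeValueOnlyDemo
{
    // Escapes just the attribute value, then places it into the markup.
    public static XElement BuildWithSecurityElement(string rawValue)
    {
        string xml =
            "<MyXML><SubXML><XmlEntry Element=\"test\" value=\"" +
            SecurityElement.Escape(rawValue) +
            "\" /></SubXML></MyXML>";
        return XElement.Parse(xml);
    }

    // Same idea as the text-node trick above, for element text content.
    public static string EscapeText(string rawText) =>
        new XmlDocument().CreateTextNode(rawText).OuterXml;
}
```

BuildWithSecurityElement("wow&") parses without an exception, and the value attribute reads back as wow&. EscapeText("a & b") returns a &amp; b.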