Get plain text from HTML in .NET
Original question: http://stackoverflow.com/questions/5870438/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors on Stack Overflow.
Asked by Daniel Peñalba
What is the best way to get a plain text string from an HTML string?
public string GetPlainText(string htmlString)
{
// any .NET built in utility?
}
Thanks in advance
Accepted answer by Rudi Visser
There's no built-in utility as far as I know, but depending on your requirements you could use regular expressions to strip out all of the tags:
string htmlString = @"<p>I'm HTML!</p>";
// Requires: using System.Text.RegularExpressions;
string plainText = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty); // "I'm HTML!"
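To tie this back to the question, here is a minimal sketch that drops the same regular expression into the `GetPlainText` stub. The wrapper class name `HtmlText` is an assumption made for illustration. Note that this only strips tags: entities such as `&amp;` are left untouched.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class; only the regex itself comes from the answer.
public static class HtmlText
{
    // Matches '<', then as few characters as possible (newlines included), then '>'.
    private const string TagPattern = @"<(.|\n)*?>";

    public static string GetPlainText(string htmlString)
    {
        // Replace every tag-like match with the empty string.
        return Regex.Replace(htmlString, TagPattern, string.Empty);
    }
}
```

Calling `HtmlText.GetPlainText("<p>I'm <b>HTML</b>!</p>")` should return `I'm HTML!`.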
Answer by Alex K.
You can use MSHTML, which can be pretty forgiving:
// Add a reference to Microsoft.mshtml, then: using mshtml;
HTMLDocument htmldoc = new HTMLDocument();
IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
htmldoc2.write(new object[] { "<p>Plateau <i>of<i> <b>Leng</b><hr /><b erp=\"arp\">2 sugars please</b> <xxx>what? & who?" });
string txt = htmldoc2.body.outerText;
Result: Plateau of Leng 2 sugars please what? & who?
Answer by Alex
There is no built-in solution in the framework.
If you need to parse HTML, I have had good experiences with a library called HTML Agility Pack.
It parses an HTML file and provides access to it through a DOM, similar to the XML classes.
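The answer gives no code, so the following is only a rough sketch of how plain text might be pulled out with HTML Agility Pack. It assumes the `HtmlAgilityPack` NuGet package is referenced. The class name `AgilityPackText` is hypothetical.

```csharp
using HtmlAgilityPack;

// Hypothetical helper; assumes the HtmlAgilityPack NuGet package is installed.
public static class AgilityPackText
{
    public static string GetPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse the markup into a DOM

        // InnerText concatenates the text of all descendant nodes;
        // DeEntitize then decodes entities such as &amp;.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Because the library builds a DOM rather than pattern-matching on angle brackets, this approach is likely to cope better with malformed markup than a regular expression.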
Answer by Alex
Personally, I found a combination of regex and HttpUtility to be the best and shortest solution.
' VB.NET; needs Imports System.Web and Imports System.Text.RegularExpressions
Return HttpUtility.HtmlDecode(
    Regex.Replace(HtmlString, "<(.|\n)*?>", "")
)
This removes all the tags, and then decodes any remaining entities such as &lt; or &gt;.
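For readers working in C#, the same two-step idea could look roughly like the sketch below. It swaps in `System.Net.WebUtility.HtmlDecode` instead of the answer's `HttpUtility.HtmlDecode`, purely so the example does not need a reference to System.Web. The names `HtmlToText` and `StripAndDecode` are assumptions made for illustration.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating "strip tags, then decode entities".
public static class HtmlToText
{
    public static string StripAndDecode(string html)
    {
        // Step 1: remove anything that looks like a tag.
        string withoutTags = Regex.Replace(html, @"<(.|\n)*?>", string.Empty);

        // Step 2: decode entities such as &lt;, &gt; and &amp;.
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

The order matters: decoding first would turn `&lt;` into a literal `<`, which the tag-stripping pattern could then mistake for the start of a tag.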
Answer by Erick Petrucelli
There isn't a built-in .NET method to do it. But, as @rudi_visser pointed out, it can be done with regular expressions.
If you need to do more than just remove the tags (e.g., turn &acirc; into â), you can use a more elaborate solution, like the one linked in the original answer.