Original question: http://stackoverflow.com/questions/2038104/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Parsing HTML to get content using C#
Asked by Mike B
I am writing an application that crawls a group of my web pages. Rather than store the entire source code of each page, I'd like to extract just the content and store it as plain text in a database. The content will be used by other applications and not read by users, so it doesn't need to be perfectly human-readable.
At first, I was thinking of using regular expressions, but I have no control over the validity of the web pages, and there is a good chance that no regular expression would reliably give me the content.
If I have the source code within a string, how can I turn that string of source code into just the content in C#?
Accepted answer by Marc Gravell
It isn't 100% clear what you want, but I'm assuming you want the text minus markup; so:
    using System.Net;
    using System.Text;
    using HtmlAgilityPack;

    string html;
    // obtain some arbitrary html....
    using (var client = new WebClient()) {
        html = client.DownloadString("http://stackoverflow.com/questions/2038104");
    }
    // use the html agility pack: http://www.codeplex.com/htmlagilitypack
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    StringBuilder sb = new StringBuilder();
    // SelectNodes returns null when nothing matches, so check before looping
    var textNodes = doc.DocumentNode.SelectNodes("//text()");
    if (textNodes != null) {
        foreach (HtmlTextNode node in textNodes) {
            sb.AppendLine(node.Text);
        }
    }
    string final = sb.ToString();
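Selecting every text node also picks up the contents of `<script>` and `<style>` elements and leaves entities such as `&amp;` encoded. The following is a minimal sketch of one way to filter and decode the text, assuming the Html Agility Pack NuGet package is installed; the class name `HtmlTextExtractor` and method name `ExtractText` are hypothetical and not part of the original answer.

```csharp
using System;
using System.Text;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class HtmlTextExtractor // hypothetical helper name
{
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty list) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//text()");
        if (nodes == null) return string.Empty;

        var sb = new StringBuilder();
        foreach (HtmlNode node in nodes)
        {
            // Skip text that sits directly inside <script> or <style> elements.
            string parent = node.ParentNode != null ? node.ParentNode.Name : string.Empty;
            if (parent == "script" || parent == "style") continue;

            // Decode entities such as &amp; and drop whitespace-only nodes.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0) sb.AppendLine(text);
        }
        return sb.ToString();
    }
}
```

Each remaining text node ends up on its own line, which is usually enough when the result is consumed by other applications rather than read by people.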
Answer by Eilon
Please, please do not parse HTML yourself! You cannot use just a standard regex to parse HTML - it's not possible.
There are tons of free libraries out there. One of the best free ones in the world of .NET is the HTML Agility Pack.
HTML Agility Pack supports malformed documents as well, which is something that a regex or another basic parser, such as an XML parser, will almost never handle.
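To make the regex problem concrete, here is a small sketch that uses only the .NET base class library. The class name `RegexPitfallDemo` and method name `StripTags` are hypothetical, chosen for illustration; the pattern is a typical naive tag stripper, not a recommended approach.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPitfallDemo // hypothetical name, for illustration only
{
    // Naive tag stripper: removes anything that looks like <...>.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<.*?>", string.Empty);
    }
}
```

On simple markup this looks fine: `StripTags("<p>Hello <b>world</b></p>")` returns `Hello world`. A `>` inside an attribute value is enough to break it, though: for `<a title="a > b">link</a>` the first match stops at the `>` inside the attribute, so part of the attribute leaks into the result alongside the link text.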
Answer by alin0509
The function below removes all HTML tags, scripts, and CSS styles from an HTML string and converts it to plain text.
    using System.Text.RegularExpressions;

    private string GetPlainTextFromHtml(string htmlString)
    {
        string htmlTagPattern = "<.*?>";
        // Remove <script> and <style> blocks together with their contents
        var regexCss = new Regex(@"(<script(.+?)</script>)|(<style(.+?)</style>)", RegexOptions.Singleline | RegexOptions.IgnoreCase);
        htmlString = regexCss.Replace(htmlString, string.Empty);
        // Remove all remaining tags
        htmlString = Regex.Replace(htmlString, htmlTagPattern, string.Empty);
        // Remove lines that contain only whitespace
        htmlString = Regex.Replace(htmlString, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);
        // Remove non-breaking space entities
        htmlString = htmlString.Replace("&nbsp;", string.Empty);
        return htmlString;
    }
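Text left over after stripping tags usually still contains HTML entities such as `&amp;` or `&lt;`. As a hedged sketch, the framework's `WebUtility.HtmlDecode` can decode them in one pass; the class name `PlainTextCleanup` and method name `DecodeAndNormalize` are hypothetical, and the whitespace handling is one possible choice rather than part of the original answer.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PlainTextCleanup // hypothetical helper name
{
    // Expects text whose tags have already been removed.
    public static string DecodeAndNormalize(string text)
    {
        // Decode HTML entities (&amp;, &lt;, &nbsp;, &#39;, ...) in one pass.
        string decoded = WebUtility.HtmlDecode(text);

        // &nbsp; decodes to U+00A0 (non-breaking space); turn it into a plain space.
        decoded = decoded.Replace('\u00A0', ' ');

        // Collapse runs of whitespace into a single space and trim the ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Decoding should happen after the tags are removed; otherwise an encoded `&lt;` would turn into a literal `<` and could be mistaken for markup by the tag-stripping step.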
Answer by Jonathan Wood
I wrote code to extract the raw text from markup and presented it in my article Convert HTML to Text. The code presented is pretty simple and lightweight.
I also wrote a lightweight HTML parser and have posted it on GitHub as HTML Monkey. It is a more complete solution, and converting the parsed markup to get only the text would be a simple task. I'm still working on this project and am looking for feedback on how it works.