Parsing HTML to get content using C#

Question

Asked by Mike B

I am writing an application that crawls a group of my web pages. Rather than storing the entire source code of each page, I'd like to extract just the content and store it as plain text in a database. The content will be used by other applications and not read by users, so it doesn't need to be perfectly human-readable.

At first, I was thinking of using regular expressions, but I have no control over the validity of the web pages and there is a great chance that no regular expression would give me the content.

If I have the source code within a string, how can I turn that string of source code into just the content in C#?

Answer 1

Accepted answer by Marc Gravell

It isn't 100% clear what you want, but I'm assuming you want the text minus markup; so:

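using System.Net;          // WebClient
using System.Text;         // StringBuilder
using HtmlAgilityPack;     // HtmlDocument, HtmlTextNode

string html;
// obtain some arbitrary html....
using (var client = new WebClient()) {
    html = client.DownloadString("http://stackoverflow.com/questions/2038104");
}
// use the html agility pack: http://www.codeplex.com/htmlagilitypack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
StringBuilder sb = new StringBuilder();
// SelectNodes returns null when nothing matches, so guard before iterating
var textNodes = doc.DocumentNode.SelectNodes("//text()");
if (textNodes != null) {
    foreach (HtmlTextNode node in textNodes) {
        sb.AppendLine(node.Text);
    }
}
string final = sb.ToString();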

Answer 2

Answered by Eilon

Please, please do not parse HTML yourself! You cannot use just a standard regex to parse HTML - it's not possible.

There are tons of free libraries out there. One of the best free ones in the world of .NET is the HTML Agility Pack.

HTML Agility Pack supports malformed documents as well, which is something that a regex or other basic parsing such as XML will almost never do.
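
To illustrate the point about malformed documents, here is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. The class name and the sample markup are made up for illustration. It loads deliberately broken HTML and collects the text nodes, skipping anything inside script or style elements.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class MalformedHtmlDemo   // hypothetical name, for illustration only
{
    static void Main()
    {
        // Deliberately malformed: unclosed <p> and <b> tags, no <html> or <body>.
        string html = "<p>First paragraph<p>Second <b>bold text<script>var x = 1;</script>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // the parser is tolerant of broken markup

        // Keep text nodes only, ignoring those whose parent is <script> or <style>.
        var texts = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text
                        && n.ParentNode.Name != "script"
                        && n.ParentNode.Name != "style")
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0);

        Console.WriteLine(string.Join(" ", texts));
    }
}
```

Filtering on the parent element is one simple way to keep script and style source out of the stored text; other filters (for example, HTML comments) could be added in the same Where clause.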

Answer 3

Answered by alin0509

The function below removes all HTML tags, scripts, and CSS styles from an HTML string and converts it to plain text.

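// Requires: using System.Text.RegularExpressions;
private string GetPlainTextFromHtml(string htmlString)
{
    // Matches any single tag
    string htmlTagPattern = "<.*?>";
    // Verbatim string: "\<" is not a valid escape in a regular C# string literal
    var regexCss = new Regex(@"(<script(.+?)</script>)|(<style(.+?)</style>)", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    // Remove script and style blocks first, including their contents
    htmlString = regexCss.Replace(htmlString, string.Empty);
    // Remove the remaining tags
    htmlString = Regex.Replace(htmlString, htmlTagPattern, string.Empty);
    // Remove lines that contain only whitespace
    htmlString = Regex.Replace(htmlString, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);
    htmlString = htmlString.Replace("&nbsp;", string.Empty);

    return htmlString;
}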
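
The function above only handles the &nbsp; entity. As a possible extension (not part of the original answer), the remaining character references could be decoded with WebUtility.HtmlDecode from System.Net. The sketch below uses a made-up class name and sample string.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

class EntityDecodeDemo   // hypothetical name, for illustration only
{
    static void Main()
    {
        string html = "<p>Fish &amp; chips&nbsp;cost &lt; 10</p>";

        // Strip tags with the same simple pattern used above.
        string text = Regex.Replace(html, "<.*?>", string.Empty);

        // Decode named and numeric character references such as &amp; and &lt;.
        text = WebUtility.HtmlDecode(text);

        // &nbsp; decodes to U+00A0; map it to an ordinary space.
        text = text.Replace('\u00A0', ' ');

        Console.WriteLine(text);
    }
}
```

Decoding after tag removal avoids turning an escaped &lt; back into something the tag pattern would then strip.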

Answer 4

Answered by Jonathan Wood

I wrote code to strip out the raw text from markup and present it in my article Convert HTML to Text. The code presented is pretty simple and lightweight.

I also wrote a lightweight HTML parser and have posted it on GitHub as HTML Monkey. This would be a more complete solution, and it would be a simple task to walk the parsed markup and keep only the text. I'm still working on this project and am looking for feedback on how it works.
