C#: How to extract text from reasonably sane HTML?

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/2113651/

Time: 2020-08-06 23:37:36  Source: igfitidea

How to extract text from reasonably sane HTML?

c# html d text-extraction

Asked by BCS

My question is sort of like this question, but I have more constraints:

  • I know the documents are reasonably sane
  • they are very regular (they all came from the same source)
  • I want about 99% of the visible text
  • about 99% of what is viable at all is text (they are more or less RTF converted to HTML)
  • I don't care about formatting or even paragraph breaks.

Are there any tools set up to do this or am I better off just breaking out RegexBuddy and C#?


I'm open to command line or batch processing tools as well as C/C#/D libraries.


Accepted answer by SLaks

You need to use the HTML Agility Pack.


You probably want to find an element using LINQ and the Descendants call, then get its InnerText.
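
As a rough illustration of the approach described above, here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced. The class and method names (`BodyTextExample`, `GetBodyText`) are made up for this example, and picking the `<body>` element is only one plausible choice of "an element". Note that `InnerText` also includes the contents of any `<script>` or `<style>` elements inside the selected element.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class BodyTextExample
{
    // Hypothetical helper: returns the text of the first <body> element,
    // or of the whole document when no <body> element is present.
    public static string GetBodyText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants(name) enumerates matching elements; LINQ picks the first one.
        HtmlNode body = doc.DocumentNode.Descendants("body").FirstOrDefault()
                        ?? doc.DocumentNode;

        // InnerText concatenates the text of all nested nodes. Entities such as
        // &amp; may still be encoded, so decode them afterwards.
        return HtmlEntity.DeEntitize(body.InnerText);
    }
}
```

Calling `BodyTextExample.GetBodyText(File.ReadAllText(path))` on each file would be one way to batch-process a directory of documents.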

Answer by AlishahNovin

It's relatively simple if you load the HTML into C# using mshtml.dll or the WebBrowser control in C#/WinForms. You can then treat the entire HTML document as a tree and traverse it, capturing the InnerText of each element.

Or, you could also use document.all, which takes the tree, flattens it, and then you can iterate across the tree, again capturing the InnerText.


Here's an example:


        WebBrowser webBrowser = new WebBrowser();
        webBrowser.Url = new Uri("url_of_file"); //can be remote or local
        webBrowser.DocumentCompleted += delegate
        {
            HtmlElementCollection collection = webBrowser.Document.All;
            List<string> contents = new List<string>();

            /*
             * Adds all inner-text of a tag, including inner-text of sub-tags
             * ie. <html><body><a>test</a><b>test 2</b></body></html> would do:
             * "test test 2" when collection[i] == <html>
             * "test test 2" when collection[i] == <body>
             * "test" when collection[i] == <a>
             * "test 2" when collection[i] == <b>
             */
            for (int i = 0; i < collection.Count; i++)
            {
                if (!string.IsNullOrEmpty(collection[i].InnerText))
                {
                    contents.Add(collection[i].InnerText);
                }
            }

            /*
             * <html><body><a>test</a><b>test 2</b></body></html>
             * outputs: test test 2|test test 2|test|test 2
             */
            string contentString = string.Join("|", contents.ToArray());
            MessageBox.Show(contentString);
        };

Hope that helps!


Answer by herzmeister

Here you can download a tool, with its source, that converts between HTML and XAML: XAML/HTML converter.

It contains an HTML parser (which obviously must be much more tolerant than a standard XML parser), and you can traverse the HTML much as you would XML.

Answer by Hugo

From the command line, you can use the Lynx text browser like this:

If you want to download a web page in formatted output (i.e., without HTML tags, but instead as it would appear in Lynx), then enter:


lynx -dump URL > filename

If there are any links on the page, the URLs for those links will be included at the end of the downloaded page.


You can disable the list of links with -nolist. For example:

lynx -dump -nolist http://stackoverflow.com/a/10469619/724176 > filename

Answer by Sam Saffron

This code, which I hacked up today with HTML Agility Pack, extracts unformatted, trimmed text.

public static string ExtractText(string html)
{
    if (html == null)
    {
        throw new ArgumentNullException("html");
    }

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    var chunks = new List<string>(); 

    foreach (var item in doc.DocumentNode.DescendantNodesAndSelf())
    {
        if (item.NodeType == HtmlNodeType.Text)
        {
            if (item.InnerText.Trim() != "")
            {
                chunks.Add(item.InnerText.Trim());
            }
        }
    }
    return String.Join(" ", chunks);
}

If you want to maintain some level of formatting, you can build on the sample provided with the source.

public string Convert(string path)
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load(path);

    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

public string ConvertHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

public void ConvertTo(HtmlNode node, TextWriter outText)
{
    string html;
    switch (node.NodeType)
    {
        case HtmlNodeType.Comment:
            // don't output comments
            break;

        case HtmlNodeType.Document:
            ConvertContentTo(node, outText);
            break;

        case HtmlNodeType.Text:
            // script and style must not be output
            string parentName = node.ParentNode.Name;
            if ((parentName == "script") || (parentName == "style"))
                break;

            // get text
            html = ((HtmlTextNode) node).Text;

            // is it in fact a special closing node output as text?
            if (HtmlNode.IsOverlappedClosingElement(html))
                break;

            // check the text is meaningful and not a bunch of whitespaces
            if (html.Trim().Length > 0)
            {
                outText.Write(HtmlEntity.DeEntitize(html));
            }
            break;

        case HtmlNodeType.Element:
            switch (node.Name)
            {
                case "p":
                    // treat paragraphs as crlf
                    outText.Write("\r\n");
                    break;
            }

            if (node.HasChildNodes)
            {
                ConvertContentTo(node, outText);
            }
            break;
    }
}


private void ConvertContentTo(HtmlNode node, TextWriter outText)
{
    foreach (HtmlNode subnode in node.ChildNodes)
    {
        ConvertTo(subnode, outText);
    }
}

Answer by Ashraf

Here is the best way:

public static string StripHTML(string HTMLText)
{
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    return reg.Replace(HTMLText, "");
}

Answer by Paul

Here is the code I am using:


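using System.Text.RegularExpressions;
using System.Web; // HttpUtility

public static string ExtractText(string html)
{
    // Replace every tag with a space so adjacent words do not run together.
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    string s = reg.Replace(html, " ");
    // Decode entities such as &amp; and &nbsp;.
    s = HttpUtility.HtmlDecode(s);
    return s;
}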
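
The two regex answers above strip tags but leave script and style contents, comments, and runs of whitespace in the output. The following is a minimal sketch of one way to extend the same idea using only the standard library. The class name `RegexTextExtractor` and its patterns are illustrative assumptions rather than anything from the answers, and like any regex approach it is only suitable for regular, well-formed input such as the question describes.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class RegexTextExtractor
{
    // Whole <script>...</script> and <style>...</style> blocks, contents included.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // HTML comments, which may span several lines.
    private static readonly Regex Comment = new Regex(
        @"<!--.*?-->", RegexOptions.Singleline);

    // Any remaining tag, and any run of whitespace.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ExtractText(string html)
    {
        if (html == null)
        {
            throw new ArgumentNullException(nameof(html));
        }

        string s = ScriptOrStyle.Replace(html, " ");
        s = Comment.Replace(s, " ");
        s = Tag.Replace(s, " ");      // a space keeps adjacent words apart
        s = WebUtility.HtmlDecode(s); // decode entities such as &amp;
        return Whitespace.Replace(s, " ").Trim();
    }
}
```

`WebUtility.HtmlDecode` is used here instead of `HttpUtility.HtmlDecode` only so that the sketch does not need a reference to System.Web.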

Answer by xoofx

You can use NUglify, which supports text extraction from HTML:

var result = Uglify.HtmlToText("<div>  <p>This is <em>   a text    </em></p>   </div>");
Console.WriteLine(result.Code);   // prints: This is a text

Because it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors), and it is very fast (no regexes involved, just a pure recursive descent parser).

Answer by xoofx

Here's a class I developed to accomplish the same thing. All available HTML parsing libraries were far too slow, and regex was far too slow as well. Functionality is explained in the code comments. In my benchmarks, this code is a little over 10X faster than HTML Agility Pack's equivalent code (included below) when tested on Amazon's landing page.

/// <summary>
/// The fast HTML text extractor class is designed to, as quickly and as ignorantly as possible,
/// extract text data from a given HTML character array. The class searches for and deletes
/// script and style tags in a first and second pass, with an optional third pass to do the same
/// to HTML comments, and then copies remaining non-whitespace character data to an output array.
/// All whitespace encountered is replaced with a single space to avoid multiple
/// whitespace in the output.
///
/// Note that the returned text content still may have named character and numbered character
/// references within that, when decoded, may produce multiple whitespace.
/// </summary>
public class FastHtmlTextExtractor
{

    private readonly char[] SCRIPT_OPEN_TAG = new char[7] { '<', 's', 'c', 'r', 'i', 'p', 't' };
    private readonly char[] SCRIPT_CLOSE_TAG = new char[9] { '<', '/', 's', 'c', 'r', 'i', 'p', 't', '>' };

    private readonly char[] STYLE_OPEN_TAG = new char[6] { '<', 's', 't', 'y', 'l', 'e' };
    private readonly char[] STYLE_CLOSE_TAG = new char[8] { '<', '/', 's', 't', 'y', 'l', 'e', '>' };

    private readonly char[] COMMENT_OPEN_TAG = new char[3] { '<', '!', '-' };
    private readonly char[] COMMENT_CLOSE_TAG = new char[3] { '-', '-', '>' };

    private int[] m_deletionDictionary;

    public string Extract(char[] input, bool stripComments = false)
    {
        var len = input.Length;
        int next = 0;

        m_deletionDictionary = new int[len];

        // Wipe out all text content between style and script tags.
        FindAndWipe(SCRIPT_OPEN_TAG, SCRIPT_CLOSE_TAG, input);
        FindAndWipe(STYLE_OPEN_TAG, STYLE_CLOSE_TAG, input);

        if(stripComments)
        {
            // Wipe out everything between HTML comments.
            FindAndWipe(COMMENT_OPEN_TAG, COMMENT_CLOSE_TAG, input);
        }

        // Wipe out all remaining tags now.
        while(next < len)
        {
            next = SkipUntil(next, '<', input);

            if(next < len)
            {
                var closeNext = SkipUntil(next, '>', input);

                if(closeNext < len)
                {
                    m_deletionDictionary[next] = (closeNext + 1) - next;
                    WipeRange(next, closeNext + 1, input);
                }

                next = closeNext + 1;
            }
        }

        // Collect all non-whitespace and non-null chars into a new
        // char array. All whitespace characters are skipped and replaced
        // with a single space char. Multiple whitespace is ignored.
        var lastSpace = true;
        var extractedPos = 0;
        var extracted = new char[len];

        for(next = 0; next < len; ++next)
        {
            if(m_deletionDictionary[next] > 0)
            {
                next += m_deletionDictionary[next];
                continue;
            }

            if(char.IsWhiteSpace(input[next]) || input[next] == '\0')
            {
                if(lastSpace)
                {
                    continue;
                }

                extracted[extractedPos++] = ' ';
                lastSpace = true;
            }
            else
            {
                lastSpace = false;
                extracted[extractedPos++] = input[next];
            }
        }

        return new string(extracted, 0, extractedPos);
    }

    /// <summary>
    /// Does a search in the input array for the characters in the supplied open and closing tag
    /// char arrays. Each match where both tag open and tag close are discovered causes the text
    /// in between the matches to be overwritten by Array.Clear().
    /// </summary>
    /// <param name="openingTag">
    /// The opening tag to search for.
    /// </param>
    /// <param name="closingTag">
    /// The closing tag to search for.
    /// </param>
    /// <param name="input">
    /// The input to search in.
    /// </param>
    private void FindAndWipe(char[] openingTag, char[] closingTag, char[] input)
    {
        int len = input.Length;
        int pos = 0;

        do
        {
            pos = FindNext(pos, openingTag, input);

            if(pos < len)
            {
                var closenext = FindNext(pos, closingTag, input);

                if(closenext < len)
                {
                    m_deletionDictionary[pos - openingTag.Length] = closenext - (pos - openingTag.Length);
                    WipeRange(pos - openingTag.Length, closenext, input);
                }

                if(closenext > pos)
                {
                    pos = closenext;
                }
                else
                {
                    ++pos;
                }
            }
        }
        while(pos < len);
    }

    /// <summary>
    /// Skips as many characters as possible within the input array until the given char is
    /// found. The position of the first instance of the char is returned, or if not found, a
    /// position beyond the end of the input array is returned.
    /// </summary>
    /// <param name="pos">
    /// The starting position to search from within the input array.
    /// </param>
    /// <param name="c">
    /// The character to find.
    /// </param>
    /// <param name="input">
    /// The input to search within.
    /// </param>
    /// <returns>
    /// The position of the found character, or an index beyond the end of the input array.
    /// </returns>
    private int SkipUntil(int pos, char c, char[] input)
    {
        if(pos >= input.Length)
        {
            return pos;
        }

        do
        {
            if(input[pos] == c)
            {
                return pos;
            }

            ++pos;
        }
        while(pos < input.Length);

        return pos;
    }

    /// <summary>
    /// Clears a given range in the input array.
    /// </summary>
    /// <param name="start">
    /// The start position from which the array will begin to be cleared.
    /// </param>
    /// <param name="end">
    /// The end position in the array, the position to clear up-until.
    /// </param>
    /// <param name="input">
    /// The source array wherein the supplied range will be cleared.
    /// </param>
    /// <remarks>
    /// Note that the second parameter is called end, not length. This parameter is meant to be
    /// a position in the array, not the amount of entries in the array to clear.
    /// </remarks>
    private void WipeRange(int start, int end, char[] input)
    {
        Array.Clear(input, start, end - start);
    }

    /// <summary>
    /// Finds the next occurrence of the supplied char array within the input array. This search
    /// ignores whitespace.
    /// </summary>
    /// <param name="pos">
    /// The position to start searching from.
    /// </param>
    /// <param name="what">
    /// The sequence of characters to find.
    /// </param>
    /// <param name="input">
    /// The input array to perform the search on.
    /// </param>
    /// <returns>
    /// The position of the end of the first matching occurrence. That is, the returned position
    /// points to the very end of the search criteria within the input array, not the start. If
    /// no match could be found, a position beyond the end of the input array will be returned.
    /// </returns>
    public int FindNext(int pos, char[] what, char[] input)
    {
        do
        {
            if(Next(ref pos, what, input))
            {
                return pos;
            }

            ++pos;
        }
        while(pos < input.Length);

        return pos;
    }

    /// <summary>
    /// Probes the input array at the given position to determine if the next N characters
    /// matches the supplied character sequence. This check ignores whitespace.
    /// </summary>
    /// <param name="pos">
    /// The position at which to check within the input array for a match to the supplied
    /// character sequence.
    /// </param>
    /// <param name="what">
    /// The character sequence to attempt to match. Note that whitespace between characters
    /// within the input array is acceptable.
    /// </param>
    /// <param name="input">
    /// The input array to check within.
    /// </param>
    /// <returns>
    /// True if the next N characters within the input array matches the supplied search
    /// character sequence. Returns false otherwise.
    /// </returns>
    public bool Next(ref int pos, char[] what, char[] input)
    {
        int z = 0;

        do
        {
            if(char.IsWhiteSpace(input[pos]) || input[pos] == '\0')
            {
                ++pos;
                continue;
            }

            if(input[pos] == what[z])
            {
                ++z;
                ++pos;
                continue;
            }

            return false;
        }
        while(pos < input.Length && z < what.Length);

        return z == what.Length;
    }
}

Equivalent in HtmlAgilityPack:

// Where m_whitespaceRegex is a Regex with [\s].
// Where sampleHtmlText is a raw HTML string.

var extractedSampleText = new StringBuilder();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(sampleHtmlText);

if(doc != null && doc.DocumentNode != null)
{
    foreach(var script in doc.DocumentNode.Descendants("script").ToArray())
    {
        script.Remove();
    }

    foreach(var style in doc.DocumentNode.Descendants("style").ToArray())
    {
        style.Remove();
    }

    var allTextNodes = doc.DocumentNode.SelectNodes("//text()");
    if(allTextNodes != null && allTextNodes.Count > 0)
    {
        foreach(HtmlNode node in allTextNodes)
        {
            extractedSampleText.Append(node.InnerText);
        }
    }

    var finalText = m_whitespaceRegex.Replace(extractedSampleText.ToString(), " ");
}