Convert HTML to plain text in VBA

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/5327512/

Time: 2020-08-29 07:23:44  Source: igfitidea


Tags: html, parsing, vba, html-parsing

Asked by Mark

I have an Excel sheet with cells containing html. How can I batch convert them to plaintext? At the moment there are so many useless tags and styles. I want to write it from scratch but it will be far easier if I can get the plain text out.


I can write a script to convert HTML to plain text in PHP, so if you can't think of a solution in VBA, maybe you can suggest how I might pass the cell data to a website and retrieve the result.

Answered by Tim Williams

In the VBA editor, set a reference to "Microsoft HTML Object Library" (Tools > References).

Function HtmlToText(sHTML) As String
  Dim oDoc As HTMLDocument
  Set oDoc = New HTMLDocument
  oDoc.body.innerHTML = sHTML
  HtmlToText = oDoc.body.innerText
End Function

Tim


Answered by Todd

A very simple way to extract text is to scan the HTML character by character, and accumulate characters outside of angle brackets into a new string.


Function StripTags(ByVal html As String) As String
    Dim text As String
    Dim accumulating As Boolean
    Dim n As Long   ' Long rather than Integer, so the counter cannot overflow past 32,767
    Dim c As String

    text = ""
    accumulating = True   ' True while we are outside a tag

    n = 1
    Do While n <= Len(html)

        c = Mid(html, n, 1)
        If c = "<" Then
            accumulating = False   ' entering a tag: stop copying
        ElseIf c = ">" Then
            accumulating = True    ' leaving a tag: resume copying
        Else
            If accumulating Then
                text = text & c
            End If
        End If

        n = n + 1
    Loop

    StripTags = text
End Function

This can leave a lot of extraneous whitespace, and HTML entities such as &amp; are not decoded, but it does remove the tags.
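The leftover whitespace can be cleaned up in a second pass. The following is only a minimal sketch, not part of the original answer: the function name CollapseWhitespace is hypothetical, and it assumes that line breaks and tabs may be flattened into single spaces.

```vba
' Hypothetical helper (not from the original answer): collapses runs of
' whitespace left behind after the tags have been stripped.
Function CollapseWhitespace(ByVal s As String) As String
    ' Treat carriage returns, line feeds and tabs as ordinary spaces
    s = Replace(s, vbCr, " ")
    s = Replace(s, vbLf, " ")
    s = Replace(s, vbTab, " ")

    ' Repeatedly squeeze double spaces until only single spaces remain
    Do While InStr(s, "  ") > 0
        s = Replace(s, "  ", " ")
    Loop

    ' Drop leading and trailing spaces
    CollapseWhitespace = Trim$(s)
End Function
```

It could be chained with the function above, for example CollapseWhitespace(StripTags(cellText)), where cellText stands for whatever string holds the cell's HTML.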

Answered by cbaldan

Tim's solution was great and worked like a charm.

I'd like to contribute: use this code to add the "Microsoft HTML Object Library" reference at runtime:

Set ID = ThisWorkbook.VBProject.References
ID.AddFromGuid "{3050F1C5-98B5-11CF-BB82-00AA00BDCE0B}", 2, 5

It worked on Windows XP and Windows 7.

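Two caveats apply to adding the reference from code: access to VBProject only works when "Trust access to the VBA project object model" is enabled in the Trust Center, and AddFromGuid typically raises an error if the library is already referenced. A minimal sketch of a guarded version follows, assuming the trust setting is on. The procedure name EnsureHtmlLibraryReference is hypothetical, and the version numbers 2 and 5 are simply carried over from the snippet above.

```vba
' Hypothetical wrapper (not from the original answer): adds the reference
' only if no reference with the same GUID is present yet.
Sub EnsureHtmlLibraryReference()
    Const MSHTML_GUID As String = "{3050F1C5-98B5-11CF-BB82-00AA00BDCE0B}"
    Dim ref As Object

    ' Look through the existing references for the same GUID
    For Each ref In ThisWorkbook.VBProject.References
        If StrComp(ref.GUID, MSHTML_GUID, vbTextCompare) = 0 Then
            Exit Sub   ' already referenced, nothing to do
        End If
    Next ref

    ' Same call as in the answer above
    ThisWorkbook.VBProject.References.AddFromGuid MSHTML_GUID, 2, 5
End Sub
```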

Answered by Gardoglee

Tim's answer is excellent. However, a small guard can be added to avoid one foreseeable error: a Null input.

Function HtmlToText(sHTML) As String
    Dim oDoc As HTMLDocument

    ' Return an empty string for Null input instead of raising an error
    If IsNull(sHTML) Then
        HtmlToText = ""
        Exit Function
    End If

    Set oDoc = New HTMLDocument
    oDoc.body.innerHTML = sHTML
    HtmlToText = oDoc.body.innerText
End Function

Answered by Ben

Here's a variation of Tim's and Gardoglee's solution that does not require setting a reference to "Microsoft HTML object library". This technique is known as late binding, and it also works in VBScript (after removing the As type declarations, which VBScript does not support).

Function HtmlToText(sHTML) As String

    Dim oDoc As Object ' As HTMLDocument

    If IsNull(sHTML) Then
        HtmlToText = ""
        Exit Function
    End If

    Set oDoc = CreateObject("HTMLFILE")
    oDoc.body.innerHTML = sHTML
    HtmlToText = oDoc.body.innerText

End Function

Note that if you are using VBA in Access 2007 or later, there is a built-in Application.PlainText() method that does the same thing as the code above.
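Since the question asks about converting many cells at once, here is a rough sketch of how the late-bound function might be applied to a whole selection in Excel. The macro name ConvertSelectionToPlainText is hypothetical, and the macro overwrites cell contents in place, so it is safer to try it on a copy of the sheet first.

```vba
Option Explicit

' Late-bound converter, as in the answer above (no library reference needed).
Function HtmlToText(sHTML) As String
    Dim oDoc As Object

    If IsNull(sHTML) Then
        HtmlToText = ""
        Exit Function
    End If

    Set oDoc = CreateObject("HTMLFILE")
    oDoc.body.innerHTML = sHTML
    HtmlToText = oDoc.body.innerText
End Function

' Hypothetical helper: overwrites every text cell in the current selection
' with its plain-text equivalent.
Sub ConvertSelectionToPlainText()
    Dim cell As Range

    ' Do nothing if something other than cells is selected (e.g. a chart)
    If TypeName(Selection) <> "Range" Then Exit Sub

    For Each cell In Selection.Cells
        ' Only touch plain string values; skip formulas, numbers, errors and blanks
        If Not cell.HasFormula Then
            If VarType(cell.Value) = vbString Then
                cell.Value = HtmlToText(cell.Value)
            End If
        End If
    Next cell
End Sub
```

To use it, select the cells that contain HTML and run the macro from the Macros dialog (Alt+F8).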

Answered by ofundefined

Yes! I managed to solve my problem as well. Thanks, everybody!

In my case, I had this sort of input:


<p>Lorem ipsum dolor sit amet.</p>

<p>Ut enim ad minim veniam.</p>

<p>Duis aute irure dolor in reprehenderit.</p>

And I did not want the result to be all jammed together without line breaks.

So I first split my input at every <p> tag into an array 'paragraphs', then for each element I used Tim's answer to get the text out of the HTML (very sweet answer, by the way).

In addition, I joined the cleaned paragraphs with the line-break character Chr(10) for VBA/Excel.

The final code is:


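Public Function HtmlToText(ByVal sHTML As Variant) As String
    Dim oDoc As HTMLDocument
    Dim result As String
    Dim paragraphs() As String
    Dim paragraph As Variant   ' For Each over an array needs a Variant control variable

    ' sHTML is a Variant so that this Null check can actually fire
    If IsNull(sHTML) Then
      HtmlToText = ""
      Exit Function
    End If

    result = ""
    paragraphs = Split(sHTML, "<p>")

    For Each paragraph In paragraphs
        If Len(paragraph) > 0 Then   ' skip the empty piece before the first <p>
            Set oDoc = New HTMLDocument
            oDoc.body.innerHTML = paragraph
            ' Put a blank line between paragraphs, but not before the first one
            If Len(result) > 0 Then result = result & Chr(10) & Chr(10)
            result = result & oDoc.body.innerText
        End If
    Next paragraph

    HtmlToText = result
End Function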