Note: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/30109061/

Time: 2020-08-29 11:27:35  Source: igfitidea

Golang parse HTML, extract all content with <body> </body> tags

Tags: html, go

Asked by user2737876

As stated in the title, I need to return all of the content within the body tags of an HTML document, including any nested HTML tags. I'm curious what the best way to go about this is. I had a working solution with the Gokogiri package, but I am trying to stay away from packages that depend on C libraries. Is there a way to accomplish this with the Go standard library, or with a package that is 100% Go?

Since posting my original question I have tried the packages listed below, without success. Neither seems to return child elements or nested tags from inside the body. For example, this document:

<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content 
        <p>more content</p>
    </body>
</html> 

returns only "body content", ignoring the <p> tag that follows and the text it wraps. The packages I tried:

  • pkg/encoding/xml/ (standard library xml package)
  • golang.org/x/net/html

The overall goal is to obtain a string that looks like this:

<body>
    body content 
    <p>more content</p>
</body>

Answered by Joachim Birche

This can be solved by recursively searching for the body node with the html package and then rendering the HTML starting from that node.

package main

import (
    "bytes"
    "errors"
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

// Body walks the node tree depth-first and returns the <body> element.
func Body(doc *html.Node) (*html.Node, error) {
    var body *html.Node
    var crawler func(*html.Node)
    crawler = func(node *html.Node) {
        if node.Type == html.ElementNode && node.Data == "body" {
            body = node
            return
        }
        for child := node.FirstChild; child != nil; child = child.NextSibling {
            crawler(child)
        }
    }
    crawler(doc)
    if body != nil {
        return body, nil
    }
    return nil, errors.New("missing <body> in the node tree")
}

// renderNode serializes n, including its own tag and all descendants.
func renderNode(n *html.Node) (string, error) {
    var buf bytes.Buffer
    if err := html.Render(&buf, n); err != nil {
        return "", err
    }
    return buf.String(), nil
}

func main() {
    doc, err := html.Parse(strings.NewReader(htm))
    if err != nil {
        fmt.Println("parse error:", err)
        return
    }
    bn, err := Body(doc)
    if err != nil {
        fmt.Println(err)
        return
    }
    body, err := renderNode(bn)
    if err != nil {
        fmt.Println("render error:", err)
        return
    }
    fmt.Println(body)
}

const htm = `<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>
    body content
    <p>more content</p>
</body>
</html>`

Answered by fredrik

It can be done using the standard encoding/xml package, but it's a bit cumbersome. One caveat in this example is that the output will not include the enclosing body tag, though it will contain all of its children.

package main

import (
    "bytes"
    "encoding/xml"
    "fmt"
)

type html struct {
    Body body `xml:"body"`
}
type body struct {
    Content string `xml:",innerxml"`
}

func main() {
    b := []byte(`<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content 
        <p>more content</p>
    </body>
</html>`)

    h := html{}
    err := xml.NewDecoder(bytes.NewBuffer(b)).Decode(&h)
    if err != nil {
        fmt.Println("error", err)
        return
    }

    fmt.Println(h.Body.Content)
}

Runnable example:
http://play.golang.org/p/ZH5iKyjRQp

Answered by andybalholm

Since you didn't show the source code of your attempt with the html package, I'll have to guess what you were doing, but I suspect you were using the tokenizer rather than the parser. Here is a program that uses the parser and does what you were looking for:

package main

import (
    "log"
    "os"
    "strings"

    "github.com/andybalholm/cascadia"
    "golang.org/x/net/html"
)

func main() {
    r := strings.NewReader(`<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content 
        <p>more content</p>
    </body>
</html>`)
    doc, err := html.Parse(r)
    if err != nil {
        log.Fatal(err)
    }

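    // Select the first element matching the CSS selector "body".
    body := cascadia.MustCompile("body").MatchFirst(doc)
    if err := html.Render(os.Stdout, body); err != nil {
        log.Fatal(err)
    }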
}

Answered by Caleb

You could also do this purely with strings:

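package main

import (
    "fmt"
    "io/ioutil"
    "strings"
)

// NewSkipTillReader and NewReadTillReader are not standard library functions;
// their definitions are in the playground link that follows this code.
func main() {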
    r := strings.NewReader(`
<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content
        <p>more content</p>
    </body>
</html>
`)
    str := NewSkipTillReader(r, []byte("<body>"))
    rtr := NewReadTillReader(str, []byte("</body>"))
    bs, err := ioutil.ReadAll(rtr)
    fmt.Println(string(bs), err)
}

The definitions of SkipTillReader and ReadTillReader are here: https://play.golang.org/p/6THLhRgLOa. (Basically, skip until you see the first delimiter, then read until you see the second.)

This matching is case-sensitive, so it will miss a tag written as <BODY> (though that wouldn't be hard to change).
