HTML parser on Node.js

Note: this page is taken from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me), citing the original address and author information. Original Stack Overflow question: http://stackoverflow.com/questions/7977945/

Date: 2020-08-29 11:20:36  Source: igfitidea

Tags: html, parsing, node.js, nokogiri

Asked by asci

Is there something like Ruby's nokogiri on Node.js? I mean a user-friendly HTML parser.

I've seen some parsers on the Node.js modules page, but I can't find anything pretty and fresh.

Answered by Farid Nouri Neshat

If you want to build a DOM, you can use jsdom.
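
As a minimal sketch of the jsdom approach, the snippet below builds a DOM from an HTML string and queries it with standard DOM methods. It assumes the third-party `jsdom` package has been installed with `npm install jsdom`. The `extractHeadings` helper and the selector it uses are hypothetical names chosen for illustration and are not part of the original answer.

```javascript
// Sketch only: assumes `npm install jsdom` has been run.
// extractHeadings is a hypothetical helper, not part of jsdom itself.
function extractHeadings(html) {
  // Required inside the function so this file still loads when jsdom is missing.
  const { JSDOM } = require("jsdom");

  // JSDOM parses the string and exposes a browser-like window object.
  const dom = new JSDOM(html);
  const { document } = dom.window;

  // Standard DOM API: collect the trimmed text of every <h1> and <h2>.
  return Array.from(document.querySelectorAll("h1, h2"), (el) =>
    el.textContent.trim()
  );
}

module.exports = { extractHeadings };
```

Because the result is a regular DOM, anything that works on `document` in a browser (such as `querySelector` or `getElementById`) should be usable in the same way here.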

There's also cheerio, which has the jQuery interface and is a lot faster than older versions of jsdom, although these days the two are similar in performance.
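
To illustrate the jQuery-style interface mentioned above, here is a small sketch that loads an HTML string with cheerio and pulls out its links. It assumes the third-party `cheerio` package has been installed with `npm install cheerio`. The `extractLinks` helper is a hypothetical name used only for this example.

```javascript
// Sketch only: assumes `npm install cheerio` has been run.
// extractLinks is a hypothetical helper, not part of cheerio itself.
function extractLinks(html) {
  // Required inside the function so this file still loads when cheerio is missing.
  const cheerio = require("cheerio");

  // load() returns a jQuery-like function bound to the parsed document.
  const $ = cheerio.load(html);

  // Select anchors that carry an href and map each one to a plain object.
  return $("a[href]")
    .map((i, el) => ({
      text: $(el).text().trim(),
      href: $(el).attr("href"),
    }))
    .get(); // get() converts the cheerio collection into a plain array
}

module.exports = { extractLinks };
```

For example, `extractLinks('<a href="/docs">Docs</a>')` should return `[{ text: 'Docs', href: '/docs' }]`.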

You might want to have a look at htmlparser2, a streaming parser that, according to its own benchmark, seems to be faster than the others and builds no DOM by default. It can also produce a DOM, since it is bundled with a handler that creates one. This is the parser used by cheerio.
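
The streaming, no-DOM style described above can be sketched as follows: the parser is fed chunks of HTML and fires callbacks as it encounters tags. This assumes the third-party `htmlparser2` package has been installed with `npm install htmlparser2`. The `collectTagNames` helper and the way the input is split into chunks are hypothetical choices made for illustration.

```javascript
// Sketch only: assumes `npm install htmlparser2` has been run.
// collectTagNames is a hypothetical helper, not part of htmlparser2 itself.
function collectTagNames(chunks) {
  // Required inside the function so this file still loads when the package is missing.
  const { Parser } = require("htmlparser2");
  const names = [];

  // The handler receives events as tags are seen; no DOM tree is built.
  const parser = new Parser({
    onopentag(name) {
      names.push(name);
    },
  });

  // Input can arrive in pieces, for example from a network stream.
  for (const chunk of chunks) {
    parser.write(chunk);
  }
  parser.end(); // signal that no more input is coming

  return names;
}

module.exports = { collectTagNames };
```

Since nothing is retained except what the callbacks choose to store, this style tends to suit large documents or cases where only a few pieces of information are needed.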

parse5 also looks like a good solution. It's fairly active (11 days since the last commit as of this update), WHATWG-compliant, and used in jsdom, Angular, and Polymer.

And if you want to parse HTML for web scraping, you can use YQL¹. There is a node module for it. I think YQL would be the best solution if your HTML comes from a static website, since you rely on a service rather than on your own code and processing power. Note, though, that YQL won't work if the page is disallowed by the website's robots.txt.

If the website you're trying to scrape is dynamic, then you should use a headless browser like phantomjs. Also have a look at casperjs if you're considering phantomjs. You can control casperjs from node with SpookyJS.

Besides phantomjs, there's zombiejs. Unlike phantomjs, which cannot be embedded in nodejs, zombiejs is just a node module.

There's a nettuts+ tutorial for the latter solutions.



¹ Since Aug. 2014, the YUI library, which is a requirement for YQL, is no longer actively maintained.

Answered by thejh

Try https://github.com/tmpvar/jsdom - you give it some HTML and it gives you a DOM.

Answered by png

You can also take a look at x-ray: https://github.com/lapwinglabs/x-ray
