HTML: get the XPath for all nodes

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/5643323/

Date: 2020-08-29 07:56:51  Source: igfitidea

get XPATH for all the nodes

Tags: html, parsing, xpath

Asked by user583726

Is there a library that can give me the XPATH for all the nodes in an HTML page?

Answered by Dimitre Novatchev

is there any library that can give me XPATH for all the nodes in HTML page

Yes, if this HTML page is a well-formed XML document.

Depending on what you understand by "node"...

//*

selects all the elements in the document.

/descendant-or-self::node()

selects all elements, text nodes, processing instructions, comment nodes, and the root node /.

//text()

selects all text nodes in the document.

//comment()

selects all comment nodes in the document.

//processing-instruction()

selects all processing instructions in the document.

//@* 

selects all attribute nodes in the document.

//namespace::*

selects all namespace nodes in the document.

Finally, you can combine any of the above expressions using the union (|) operator.

Thus, I believe that the following expression really selects "all the nodes" of any XML document:

/descendant-or-self::node() | //@* | //namespace::*
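
For anyone who wants to try these expressions, here is a minimal sketch, assuming Python with the lxml package installed. The sample document and the variable names are made up for illustration. Note that lxml hands back text and attribute nodes as strings rather than as node objects.

```python
from lxml import etree

# Hypothetical, well-formed sample document (not from the original question).
doc = etree.fromstring(
    '<root id="r"><!-- note --><a href="x">hello</a><b/></root>'
)

elements = doc.xpath('//*')          # all element nodes
texts = doc.xpath('//text()')        # all text nodes (as strings)
comments = doc.xpath('//comment()')  # all comment nodes
attributes = doc.xpath('//@*')       # all attribute nodes (as value strings)

print([el.tag for el in elements])
print(texts)
print([c.text for c in comments])
print(attributes)
```

Each query returns its results in document order, so the lists can be compared against the source markup directly.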

Answered by tegan

In case this is helpful for someone else: if you're using Python with lxml, you'll first need to build a tree, and then query that tree with the XPath expressions that Dimitre lists above.

To get the tree:

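from lxml import html, etree

# Deliberately malformed HTML: unclosed tags and a mismatched </h3>.
your_webpage_string = "<html><head><title>test<body><h1>page title</h3>"
# lxml's HTML parser repairs the markup while building the tree.
bad_html = html.fromstring(your_webpage_string)
# Serialize the repaired tree so it can be re-parsed as well-formed XML.
good_html = etree.tostring(bad_html, pretty_print=True).strip()
your_tree = etree.fromstring(good_html)
all_xpaths = your_tree.xpath('//*')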

On the last line, replace '//*' with whatever XPath you want. all_xpaths is now a list of the matched elements, which looks like this:

[<Element html at 0x7ff740b24b90>,
 <Element head at 0x7ff740b24d88>,
 <Element title at 0x7ff740b24dd0>,
 <Element body at 0x7ff740b24e18>,
 <Element h1 at 0x7ff740b24e60>]
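
The list above holds element objects, not path strings. If what you want is an XPath string for each element, one option is lxml's ElementTree.getpath(). The following is a rough sketch under the assumption that the input is already well-formed; the sample page and variable names are invented for the example.

```python
from lxml import etree

# Hypothetical well-formed page, used only for illustration.
page = (
    "<html><head><title>test</title></head>"
    "<body><h1>page title</h1><p>one</p><p>two</p></body></html>"
)
root = etree.fromstring(page)
tree = root.getroottree()  # getpath() is a method of the ElementTree wrapper

# Select every element, then ask the tree for a structural path to each one.
elements = root.xpath('//*')
paths = [tree.getpath(el) for el in elements]

for path in paths:
    print(path)
```

Each returned path can be passed back to tree.xpath() to locate the same element again. Elements that have same-named siblings get a positional index, such as /html/body/p[1].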