Regex select all text between tags
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/7167279/
Asked by basheps
What is the best way to select all the text between two tags? For example, the text between all the 'pre' tags on the page.
Answer by PyKing
You can use "<pre>(.*?)</pre>" (replacing pre with whatever tag name you want) and extract the first group (for more specific instructions, specify a language), but this assumes that you have very simple and valid HTML.
As other commenters have suggested, if you're doing something complex, use an HTML parser.
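As a rough illustration of "extract the first group", here is a minimal JavaScript sketch. The helper name textBetweenPre and the sample string are made up for this example, and it assumes simple, non-nested HTML as the answer warns.

```javascript
// Hypothetical helper (not from the original answer): collect the text of
// every <pre>...</pre> pair in a string of simple, non-nested HTML.
function textBetweenPre(html) {
  const pattern = /<pre>(.*?)<\/pre>/g;
  // matchAll yields one match per tag pair; index 1 is the first capture group.
  return Array.from(html.matchAll(pattern), (m) => m[1]);
}

const sample = "a <pre>one</pre> b <pre>two</pre>";
console.log(textBetweenPre(sample)); // [ 'one', 'two' ]
```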
Answer by zac
A tag can be closed on a later line. This is why \n needs to be added:
<PRE>(.|\n)*?<\/PRE>
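To see why the newline matters, here is a small JavaScript sketch with a made-up input string. It uses [\s\S] inside a capture group as an alternative way to match any character including the newline; that swap is my assumption, not part of the answer above.

```javascript
const html = "<PRE>line 1\nline 2</PRE>";

// Without newline handling, '.' stops at the line break, so nothing matches.
const singleLine = html.match(/<PRE>(.*?)<\/PRE>/);

// [\s\S] matches any character, '\n' included, so the whole body is captured.
const multiLine = html.match(/<PRE>([\s\S]*?)<\/PRE>/);

console.log(singleLine);   // null
console.log(multiLine[1]); // both lines, newline included
```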
Answer by DevWL
This is what I would use:
(?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/?°?!?{}|`~]| )+?(?=(</pre>))
Basically what it does is:
(?<=(<pre>))
The selection has to be preceded by the <pre> tag.
(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/?°?!?{}|`~]| )
This is just the regular expression I want to apply. In this case, it selects a letter, a digit, a newline character, or one of the special characters listed in the square brackets. The pipe character | simply means "OR".
+?
The plus character says to select one or more of the above; order does not matter. The question mark changes the default behavior from 'greedy' to 'ungreedy'.
(?=(</pre>))
The selection has to be followed by the </pre> tag.
Depending on your use case you might need to add some modifiers (such as i or m):
- i - case-insensitive
- m - multi-line search
Here I performed this search in Sublime Text, so I did not have to use modifiers in my regex.
Lookbehind in JavaScript
The above example should work fine with languages such as PHP, Perl, Java ...
JavaScript, however, did not support lookbehind when this answer was written (it was added in ECMAScript 2018), so on older engines we have to forget about using (?<=(<pre>)) and look for some kind of workaround. Perhaps simply strip the leading <pre> from each match, as in "Regex match text between tags".
Also look at the JavaScript regex documentation for non-capturing parentheses.
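One way the "strip the opening tag" workaround could look is sketched below. The input string and variable names are invented for illustration, and [\s\S]+? stands in for the longer character list above.

```javascript
const html = "x <pre>first</pre> y <pre>second</pre>";

// No lookbehind here: the opening tag is part of each match, the closing tag
// is only asserted by the lookahead and stays out of the match.
const withTag = html.match(/<pre>[\s\S]+?(?=<\/pre>)/g) || [];

// Drop the leading "<pre>" from every match.
const stripped = withTag.map((m) => m.slice("<pre>".length));

console.log(stripped); // [ 'first', 'second' ]
```

A plain capture group, as in the first answer, avoids the stripping step altogether.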
Answer by Shravan Ramamurthy
Use the pattern below to get the content of an element. Replace [tag] with the actual element you wish to extract the content from.
<[tag]>(.+?)</[tag]>
Sometimes tags have attributes, like an anchor tag with href; in that case use the pattern below.
<[tag][^>]*>(.+?)</[tag]>
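A possible way to plug a real tag name into that pattern is sketched below in JavaScript. The helper contentOfTag is hypothetical, and the \b after the tag name is my own addition (not in the pattern above) so that "a" does not also match "<abbr>".

```javascript
// Hypothetical helper: tagName is assumed to be a plain name such as "a" or
// "pre" (no regex metacharacters), and the HTML is assumed to be non-nested.
function contentOfTag(html, tagName) {
  // [^>]* skips any attributes inside the opening tag.
  const pattern = new RegExp(`<${tagName}\\b[^>]*>(.+?)</${tagName}>`, "g");
  return Array.from(html.matchAll(pattern), (m) => m[1]);
}

const page = '<a href="/x">Home</a> <abbr>no</abbr> <a>Two</a>';
console.log(contentOfTag(page, "a")); // [ 'Home', 'Two' ]
```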
Answer by Jean-Simon Collard
To exclude the delimiting tags:
(?<=<pre>)(.*?)(?=</pre>)
(?<=<pre>) looks for text after <pre>
(?=</pre>) looks for text before </pre>
The result will be the text inside the pre tag.
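For engines that support lookbehind, the effect can be sketched in JavaScript like this (the sample string is made up). Because the tags are only asserted, each whole match is already the inner text.

```javascript
const html = "<pre>alpha</pre><pre>beta</pre>";

// Lookbehind and lookahead check for the tags without consuming them,
// so the matches themselves contain no tag text.
const inner = html.match(/(?<=<pre>)(.*?)(?=<\/pre>)/g);

console.log(inner); // [ 'alpha', 'beta' ]
```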
Answer by sg3s
You shouldn't be trying to parse HTML with regexes; see this question and how it turned out.
In the simplest terms, HTML is not a regular language, so you can't fully parse it with regular expressions.
Having said that, you can parse subsets of HTML when there are no similar tags nested. So as long as anything between the opening and closing tag is not that tag itself, this will work:
preg_match('/<([\w]+)[^>]*>(.*?)<\/\1>/', $subject, $matches);
// $matches = array(
//   [0] => full matched string
//   [1] => tag name
//   [2] => tag content
// )
A better idea is to use a parser, like the native DOMDocument, to load your HTML, then select your tag and get its content, which might look something like this:
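$obj = new DOMDocument();
$obj->loadHTML($html);                      // loadHTML() parses a string; load() expects a file path
$nodes = $obj->getElementsByTagName('el');  // returns a DOMNodeList
$value = $nodes->item(0)->nodeValue;        // nodeValue is a property; item(0) is null if nothing matched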
And since this is a proper parser, it will be able to handle nested tags etc.
Answer by Heriberto Rivera
Try this:
(?<=\<any_tag\>)(\s*.*\s*)(?=\<\/any_tag\>)
Answer by maqduni
This seems to be the simplest regular expression of all that I found:
(?:<TAG>)([\s\S]*)(?:<\/TAG>)
- Exclude the opening tag (?:<TAG>) from the matches
- Include any whitespace or non-whitespace characters ([\s\S]*) in the matches
- Exclude the closing tag (?:<\/TAG>) from the matches
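One thing worth checking before using this pattern: [\s\S]* is greedy, so with several tag pairs in the input the group runs to the last closing tag. The JavaScript sketch below uses a made-up input; the lazy variant with *? is my suggestion, not part of the answer.

```javascript
const html = "<TAG>one</TAG> gap <TAG>two</TAG>";

// Greedy: the group extends to the LAST closing tag in the input.
const greedy = html.match(/(?:<TAG>)([\s\S]*)(?:<\/TAG>)/)[1];

// Lazy: the group stops at the FIRST closing tag.
const lazy = html.match(/(?:<TAG>)([\s\S]*?)(?:<\/TAG>)/)[1];

console.log(greedy); // one</TAG> gap <TAG>two
console.log(lazy);   // one
```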
Answer by Clarius
This answer assumes support for lookaround! This allowed me to identify all the text between pairs of opening and closing tags, that is, all the text between the '>' and the '<'. It works because lookaround doesn't consume the characters it matches.
(?<=>)([\w\s]+)(?=</)
I tested it at https://regex101.com/ using this HTML fragment.
<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>
It's a game of three parts: the look behind, the content, and the look ahead.
(?<=>) # look behind (but don't consume/capture) for a '>'
([\w\s]+) # capture/consume any combination of alpha/numeric/whitespace
(?=<\/) # look ahead (but don't consume/capture) for a '</'
I hope that serves as a starter for 10. Good luck.
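Running the pattern over the same fragment in JavaScript might look like the sketch below (it assumes an engine with lookbehind support). Note that the pattern also matches the lone newline before </table>, so this sketch filters out whitespace-only matches; that filter is my addition.

```javascript
const fragment = [
  "<table>",
  "<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>",
  "<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>",
  "</table>",
].join("\n");

// Every run of word/whitespace characters that sits between '>' and '</'.
const raw = fragment.match(/(?<=>)([\w\s]+)(?=<\/)/g) || [];

// Drop matches that are only whitespace (the newline before </table>).
const cells = raw.filter((text) => text.trim() !== "");

console.log(cells); // the six cell texts, "Cell 1" through "Cell 6"
```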
Answer by Shishir Arora
Since the accepted answer has no JavaScript code, adding that:
var str = "Lorem ipsum <pre>text 1</pre> Lorem ipsum <pre>text 2</pre>";
str.replace(/<pre>(.*?)<\/pre>/g, function(match, g1) { console.log(g1); });