Note: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/7167279/

Time: 2020-08-29 10:12:30  Source: igfitidea

Regex select all text between tags

Tags: html, regex, html-parsing

Asked by basheps

What is the best way to select all the text between two tags, for example the text between all the 'pre' tags on the page?

Answered by PyKing

You can use "<pre>(.*?)</pre>" (replacing pre with whatever tag you want) and extract the first group (for more specific instructions, specify a language), but this assumes that you have very simple and valid HTML.

As other commenters have suggested, if you're doing something complex, use an HTML parser.
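As a minimal sketch of the pattern above, here it is in Python (my choice of language, since the answer does not name one; the sample string is hypothetical). `re.findall` returns the first group of every match:

```python
import re

# Hypothetical sample input; real pages are rarely this tidy.
html = "<p>intro</p><pre>first block</pre><p>more</p><pre>second block</pre>"

# (.*?) is lazy, so each match stops at the nearest </pre>.
blocks = re.findall(r"<pre>(.*?)</pre>", html)
print(blocks)
```

This only holds while the tags carry no attributes, are not nested, and their content stays on one line.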

Answered by zac

The closing tag can be on a different line from the opening tag. This is why \n needs to be added to the pattern.

<PRE>(.|\n)*?<\/PRE>
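To see why the newline matters, here is a small sketch in Python (an assumed language; the sample text is made up). The `re.DOTALL` variant is an alternative that is not part of the answer:

```python
import re

html = "<PRE>line one\nline two</PRE>"

# By default "." does not match a newline, so this finds nothing.
no_match = re.search(r"<PRE>.*?</PRE>", html)

# The pattern from the answer: "." or "\n", repeated lazily.
with_alternation = re.search(r"<PRE>(.|\n)*?</PRE>", html)

# Alternative (not from the answer): let "." match newlines via re.DOTALL.
with_dotall = re.search(r"<PRE>(.*?)</PRE>", html, re.DOTALL)

print(no_match)
print(with_alternation.group(0))  # the whole <PRE>...</PRE> span
print(with_dotall.group(1))       # only the text between the tags
```

With the answer's pattern, the repeated group keeps only its last character, so the full text has to be taken from the overall match rather than from group 1.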

Answered by DevWL

This is what I would use.

(?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/?°?!?{}|`~]| )+?(?=(</pre>))

Basically what it does is:

(?<=(<pre>)) The selection has to be preceded by the <pre> tag.

(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/?°?!?{}|~]| ) This is just the regular expression I want to apply. In this case, it selects a letter, a digit, a newline character, or one of the special characters listed in the square brackets. The pipe character | simply means "OR".

+? The plus character says to select one or more of the above; order does not matter. The question mark changes the default behavior from 'greedy' to 'ungreedy'.

(?=(</pre>)) The selection has to be followed by the </pre> tag.

Depending on your use case, you might need to add some modifiers, such as i or m:

  • i - case-insensitive
  • m - multi-line search

Here I performed this search in Sublime Text so I did not have to use modifiers in my regex.
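Outside an editor, the modifiers usually have to be passed explicitly. As a hedged sketch in Python (an assumed language, with made-up sample text): the i modifier corresponds to `re.IGNORECASE` and m to `re.MULTILINE`. Note that m only changes where ^ and $ match; letting "." cross line breaks needs `re.DOTALL` instead.

```python
import re

html = "<PRE>Hello</PRE>\n<pre>World</pre>"

# Case-sensitive by default: only the lowercase tag pair matches.
strict = re.findall(r"<pre>(.*?)</pre>", html)

# re.IGNORECASE plays the role of the "i" modifier.
relaxed = re.findall(r"<pre>(.*?)</pre>", html, re.IGNORECASE)

# re.MULTILINE plays the role of "m": ^ and $ match at every line.
line_starts = re.findall(r"^<pre>", html, re.IGNORECASE | re.MULTILINE)

print(strict, relaxed, line_starts)
```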

Lookbehind in JavaScript

The above example should work fine with languages such as PHP, Perl, Java ... JavaScript, however, did not support lookbehind when this answer was written (it was added in ECMAScript 2018, so older engines still reject it). There we have to forget about using (?<=(<pre>)) and look for some kind of workaround, perhaps simply stripping the leading <pre> from each match, as in the question "Regex match text between tags".

Also look at the JavaScript regex documentation for non-capturing parentheses.

Answered by Shravan Ramamurthy

Use the pattern below to get the content of an element. Replace [tag] with the actual element you wish to extract the content from.

<[tag]>(.+?)</[tag]>

Sometimes tags have attributes, such as an anchor tag with an href. In that case, use the pattern below.

 <[tag][^>]*>(.+?)</[tag]>
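A quick sketch of the attribute-tolerant pattern in Python (an assumed language), with [tag] replaced by `a` and a hypothetical input string:

```python
import re

html = '<a href="https://example.com">Example</a> and <a>bare</a>'

# [^>]* skips any attributes between the tag name and the closing ">".
texts = re.findall(r"<a[^>]*>(.+?)</a>", html)
print(texts)
```

One caveat: `<a[^>]*>` also accepts a longer tag name such as `<abbr>`, because `[^>]*` swallows the extra letters. Adding `\b` after the tag name is one way to tighten it.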

Answered by Jean-Simon Collard

To exclude the delimiting tags:

(?<=<pre>)(.*?)(?=</pre>)

(?<=<pre>) looks for text after <pre>

(?=</pre>) looks for text before </pre>

The result will be the text inside the pre tag.
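Because the lookarounds do not consume anything, the overall match itself already excludes the tags. A minimal sketch in Python (an assumed language; the sample string is made up):

```python
import re

html = "<pre>first</pre> text <pre>second</pre>"

# group(0) is the whole match, which here is only the inner text.
inside = [m.group(0) for m in re.finditer(r"(?<=<pre>)(.*?)(?=</pre>)", html)]
print(inside)
```

Python's `re` module only accepts fixed-width lookbehind, which `<pre>` satisfies. A tag with variable attributes would not work inside `(?<=...)` there.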

Answered by sg3s

You shouldn't be trying to parse HTML with regexes; see this question and how it turned out.

In the simplest terms, HTML is not a regular language, so you can't fully parse it with regular expressions.

Having said that, you can parse subsets of HTML when there are no similar tags nested. So as long as nothing between an opening tag and its closing tag is that same tag, this will work:

// Single quotes keep \1 as a regex backreference to the captured tag name.
preg_match('/<([\w]+)[^>]*>(.*?)<\/\1>/', $subject, $matches);
// $matches = array ( [0] => full matched string [1] => tag name [2] => tag content )
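A sketch of the same idea in Python (an assumed language), on the assumption that the closing tag is meant to repeat the captured tag name through a backreference (`\1`). The sample input is hypothetical:

```python
import re

subject = '<div class="x">hello</div>'

# \1 must repeat whatever tag name group 1 captured.
m = re.search(r"<(\w+)[^>]*>(.*?)</\1>", subject)
tag_name, content = m.group(1), m.group(2)
print(m.group(0), tag_name, content)
```

As the answer says, this stops being reliable as soon as the same tag is nested inside itself.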

A better idea is to use a parser, like the native DOMDocument, to load your html, then select your tag and get the inner html which might look something like this:

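$obj = new DOMDocument();
$obj->loadHTML($html);                      // parse an HTML string (load() expects a file)
$nodes = $obj->getElementsByTagName('el');  // DOMNodeList of every matching element
$value = $nodes->item(0)->nodeValue;        // text of the first match; assumes one exists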

And since this is a proper parser, it will be able to handle nested tags etc.
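For comparison, here is a parser-based sketch in Python using the standard library's `html.parser`. The language is an assumption, and the `PreTextCollector` class and sample input are hypothetical names made up for illustration:

```python
from html.parser import HTMLParser


class PreTextCollector(HTMLParser):
    """Collects the text found inside <pre> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <pre> elements are currently open
        self.blocks = []  # one string per top-level <pre> element

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            if self.depth == 0:
                self.blocks.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still collected.
        if self.depth > 0:
            self.blocks[-1] += data


collector = PreTextCollector()
collector.feed('<p>skip</p><pre class="code">a <b>bold</b>\nline</pre><pre>two</pre>')
collector.close()
print(collector.blocks)
```

Attributes, nested markup and line breaks need no special handling here, which is the main argument for a parser over a regex.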

Answered by Heriberto Rivera

Try this....

(?<=\<any_tag\>)(\s*.*\s*)(?=\<\/any_tag\>)

Answered by maqduni

This seems to be the simplest regular expression of all that I found

(?:<TAG>)([\s\S]*)(?:<\/TAG>)
  1. Exclude the opening tag (?:<TAG>) from the captured group (it is still part of the overall match)
  2. Include any whitespace or non-whitespace characters ([\s\S]*) in the captured group
  3. Exclude the closing tag (?:<\/TAG>) from the captured group (it is still part of the overall match)
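One thing to watch with this pattern is that `[\s\S]*` is greedy. A short sketch in Python (an assumed language, with a made-up input) shows the difference from the lazy form `[\s\S]*?`, which is my variation and not part of the answer:

```python
import re

html = "<TAG>one</TAG> gap <TAG>two</TAG>"

# Greedy: runs to the LAST closing tag, swallowing everything in between.
greedy = re.findall(r"(?:<TAG>)([\s\S]*)(?:</TAG>)", html)

# Lazy variant (not from the answer): stops at the nearest closing tag.
lazy = re.findall(r"(?:<TAG>)([\s\S]*?)(?:</TAG>)", html)

# Non-capturing groups still take part in the overall match.
first = re.search(r"(?:<TAG>)([\s\S]*?)(?:</TAG>)", html)
print(greedy, lazy, first.group(0), first.group(1))
```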

Answered by Clarius

This answer assumes support for lookaround. It allowed me to identify all the text between pairs of opening and closing tags, that is, all the text between the '>' and the '<'. It works because lookaround doesn't consume the characters it matches.

(?<=>)([\w\s]+)(?=</)

I tested it in https://regex101.com/ using this HTML fragment.

<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>

It's a game of three parts: the look behind, the content, and the look ahead.

(?<=>)    # look behind (but don't consume/capture) for a '>'
([\w\s]+) # capture/consume any combination of alpha/numeric/whitespace
(?=<\/)   # look ahead  (but don't consume/capture) for a '</'
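The three parts can be tried against the table fragment above. This is a sketch in Python (an assumed language, not the regex101 setup used in the answer), and filtering out whitespace-only matches is my addition:

```python
import re

html = """<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>"""

raw = re.findall(r"(?<=>)([\w\s]+)(?=</)", html)

# The newline between the last </tr> and </table> also sits between
# a ">" and a "</", so whitespace-only matches are dropped here.
cells = [text for text in raw if text.strip()]
print(cells)
```

Text containing punctuation would be missed, because `[\w\s]` only covers word characters and whitespace.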

I hope that serves as a starter for ten. Good luck.

Answered by Shishir Arora

Since the accepted answer has no JavaScript code, adding that:

var str = "Lorem ipsum <pre>text 1</pre> Lorem ipsum <pre>text 2</pre>";
// replace() is used here only to visit every match; g1 is the first group.
str.replace(/<pre>(.*?)<\/pre>/g, function(match, g1) { console.log(g1); });