Removing HTML tags from a file with sed or similar

Question

Asked by user913492

I am trying to fetch the contents of a table from a webpage. I just need the contents, not the tags such as <tr></tr>. I don't even need "tr" or "td", just the content. For example:

<td> I want only this </td>
<tr> and also this </tr>
<TABLE> only texts/numbers in between tags and not the tags. </TABLE>

I would also like to put the output in a new CSV file, with each row starting with the first column, like this:

column1,info1,info2,info3
column2,info1,info2,info3

I tried using sed to delete the patterns <tr> and <td>, but when I fetch the table there are also other tags such as <color>, <span>, etc. So what I want is to delete all the tags; in short, everything between < and >.

Answer 1

Answered by Useless Code

sed 's/<[^>]\+>//g' will strip all tags out, but you might want to replace them with a space so that text from adjacent tags doesn't run together, with <td>one</td><td>two</td> becoming onetwo. So you could do sed 's/<[^>]\+>/ /g' so that it outputs one two (well, actually " one  two ", with a leading space, a double space in the middle, and a trailing space).
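To make the difference concrete, here is a small sketch of both variants. The sample row is made up for illustration, and the pattern uses [^>]* rather than \+ only so that it also runs on seds that lack the GNU \+ extension. The tr and trailing sed steps are my own addition for tidying up the extra spaces, not something the original command does.

```shell
# Hypothetical sample row (not taken from the question).
html='<td>one</td><td>two</td>'

# Deleting the tags outright glues neighbouring cells together.
joined=$(printf '%s\n' "$html" | sed 's/<[^>]*>//g')

# Replacing each tag with a space keeps the cells apart; tr -s squeezes
# runs of spaces and the last sed trims the leading and trailing space.
spaced=$(printf '%s\n' "$html" | sed 's/<[^>]*>/ /g' | tr -s ' ' | sed 's/^ //; s/ $//')

printf '%s\n' "$joined"   # onetwo
printf '%s\n' "$spaced"   # one two
```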

That said, unless you need just the raw text (and it sounds like you are trying to do some transformations on the data after stripping the tags), a scripting language like Perl might be a more fitting tool for this.
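If you do want to stay with sed, the CSV layout from the question could be approximated along these lines. This is only a rough sketch under strong assumptions: each <tr> sits on a single line, cells are lowercase <td> elements, and no cell contains a comma or a quote. The sample table is invented for illustration.

```shell
# Hypothetical input: one table row per line, cells in <td> tags.
table='<tr><td>column1</td><td>info1</td><td>info2</td></tr>
<tr><td>column2</td><td><span>info1</span></td><td>info2</td></tr>'

# First turn every cell boundary (</td> ... <td>) into a comma,
# then strip whatever tags are left (<tr>, <span>, the outer <td>, ...).
csv=$(printf '%s\n' "$table" |
  sed -e 's|</td>[^<]*<td[^>]*>|,|g' -e 's/<[^>]*>//g')

printf '%s\n' "$csv"
```

Redirecting that last printf to a file (the name is up to you) gives the CSV. Real-world tables with rows split across lines, uppercase tags, or commas inside cells will break this, which is where a proper parser earns its keep.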

As mu is too short mentioned, scraping HTML can be a bit dicey; using something that actually parses the HTML for you would be the best way to do this. PHP's DOM API is pretty good for these kinds of things.

Answer 2

Answered by Robert J

Original:

Regular expressions in the Mac Terminal behave a bit differently (macOS ships a BSD sed rather than GNU sed, so GNU extensions such as \+ may not work there). I was able to do this on my Mac using the following example:

$ curl google.com | sed 's/<[^>]*>//g'
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   219  100   219    0     0    385      0 --:--:-- --:--:-- --:--:--   385

301 Moved
301 Moved
The document has moved
here.

$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.

Edit:

Just for clarification's sake, the original looked like:

$ curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

Also, the annoying curl progress meter can be suppressed with the -s option:

$ curl -s google.com | sed 's/<[^>]*>//g' 

301 Moved
301 Moved
The document has moved
here.

$
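The stripped output above still contains blank lines where a line held nothing but tags. As a possible follow-up (my own addition, not part of the commands above), a second sed expression can drop those lines. The sketch below inlines the redirect page shown earlier so that it runs without network access.

```shell
# The redirect page from the example above, inlined as a string.
page='<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>'

# Strip the tags, then delete lines that are empty or whitespace-only.
text=$(printf '%s\n' "$page" | sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d')

printf '%s\n' "$text"
```

With a live page, the same two expressions would go after curl -s in the pipeline.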
