HTML: How to extract data from an HTML table in a shell script?

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/6854586/

Time: 2020-08-29 09:46:40  Source: igfitidea

How to extract data from html table in shell script?

Tags: html, regex, shell, sed, html-parsing

Asked by Marko

I am trying to create a BASH script that extracts the data from an HTML table. Below is an example of the table I need to extract data from:

<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.406 s</td></tr>
<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.332 s</td></tr>
<tr><td>DVK_SEND</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>DVK_RECEIVE</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>GET_USER_INFO</td><td>OK</td><td>0.143 s</td></tr>
<tr><td>NOTIFICATIONS</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>ERROR_LOG</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>SUMMARY_STATUS</td><td>OK</td><td>0.888 s</td></tr>
</table>

And I want the BASH script to output it like so:

SAVE_DOCUMENT OK 0.475 s
GET_DOCUMENT OK 0.345 s
DVK_SEND OK 0.002 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 4.465 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.002 s
SUMMARY_STATUS OK 5.294 s

How to do it?

So far I have tried using sed, but I don't know how to use it very well. I excluded the table header (Component, Status, Time / Error) with grep "<tr><td>", so only lines starting with <tr><td> are selected for the next parsing step (sed). This is what I used: sed 's@<\([^<>][^<>]*\)>\([^<>]*\)</\1>@\2@g' But then the <tr> tags still remain, and it also won't separate the strings. In other words, the result of this script is:

<tr>SAVE_DOCUMENTOK0.406 s</tr>

The full command of the script I'm working on is:

cat $FILENAME | grep "<tr><td>" | sed 's@<\([^<>][^<>]*\)>\([^<>]*\)</\1>@\2@g'

Answered by Zsolt Botykai

Go with (g)awk, it's capable :-). Here is a solution, but please note: it only works with the exact HTML table format you posted.

 awk -F "</*td>|</*tr>" '/<\/*t[rd]>.*[A-Z][A-Z]/ {print $3, $5, $7 }' FILE

Here you can see it in action: https://ideone.com/zGfLe

Some explanation:

  1. -F sets the input field separator to a regexp (any opening or closing tr or td tag).

  2. The pattern then restricts processing to lines that contain one of those tags AND, after it, at least two consecutive uppercase letters.

  3. The action then prints the needed fields.
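
To see why the needed fields sit at odd positions, here is a minimal sketch, assuming the exact one-row-per-line format from the question. The function name extract_rows is hypothetical and only wraps the command for reuse; the sample row is taken from the question's table.

```shell
# Sketch: split each row on <td>, </td>, <tr>, </tr> and print the data fields.
# extract_rows is an illustrative helper name, not part of the original answer.
extract_rows() {
    # For "<tr><td>A</td><td>B</td><td>C</td></tr>" every tag is a separator,
    # so adjacent tags leave empty fields at $1, $2, $4 and $6,
    # and the cell contents land in $3, $5 and $7.
    awk -F '</*td>|</*tr>' '/<\/*t[rd]>.*[A-Z][A-Z]/ { print $3, $5, $7 }' "$@"
}

# Reads standard input when no file name is given.
printf '%s\n' '<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.406 s</td></tr>' | extract_rows
```

Header lines such as <td><b>Component</b></td> are skipped because they never contain two uppercase letters in a row after a tr or td tag.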

HTH

Answered by Emiliano Poggi

You can use the xpath command (from the XML::XPath Perl module) in bash to accomplish that task very easily:

xpath -e '//tr[position()>1]' test_input1.xml 2> /dev/null | sed -e 's/<\/*tr>//g' -e 's/<td>//g' -e 's/<\/td>/ /g'

Answered by kenorb

You may use the html2text command and format the columns via column, e.g.:

$ html2text table.html | column -ts'|'

Component                                      Status  Time / Error
SAVE_DOCUMENT                                           OK            0.406 s     
GET_DOCUMENT                                            OK            0.332 s     
DVK_SEND                                                OK            0.001 s     
DVK_RECEIVE                                             OK            0.001 s     
GET_USER_INFO                                           OK            0.143 s     
NOTIFICATIONS                                           OK            0.001 s     
ERROR_LOG                                               OK            0.001 s     
SUMMARY_STATUS                                          OK            0.888 s     

then parse it further from there (e.g. cut, awk, ex).

If you'd like to sort it first, you can use ex; see the examples here or here.

Answered by mu is too short

There are a lot of ways of doing this but here's one:

grep '^<tr><td>' < $FILENAME \
| sed \
    -e 's:<tr>::g'  \
    -e 's:</tr>::g' \
    -e 's:</td>::g' \
    -e 's:<td>: :g' \
| cut -c2-

You could use more sed(1) (-e 's:^ ::') instead of the cut -c2- to remove the leading space, but cut(1) doesn't get as much love as it deserves. The backslashes are just there for formatting: you can remove them to get a one-liner, or leave them in and make sure that each one is immediately followed by a newline.
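
As a sketch of the sed-only variant mentioned above, the cut step can be replaced by one more expression. The function name strip_row_tags is made up for illustration, and the input is assumed to have the exact one-row-per-line layout from the question.

```shell
# Sketch: same pipeline, with -e 's:^ ::' doing the job of cut -c2-.
# strip_row_tags is an illustrative helper name.
strip_row_tags() {
    grep '^<tr><td>' "$@" |      # keep only the data rows
    sed -e 's:<tr>::g' \
        -e 's:</tr>::g' \
        -e 's:</td>::g' \
        -e 's:<td>: :g' \
        -e 's:^ ::'              # drop the single leading space left by the first <td>
}

printf '%s\n' '<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.332 s</td></tr>' | strip_row_tags
```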

The basic strategy is to slowly pull the HTML apart piece by piece rather than trying to do it all at once with a single incomprehensible pile of regex syntax.

Parsing HTML with a shell pipeline isn't the best idea ever, but you can do it if the HTML is known to come in a very specific format. If there will be variation, then you'd be better off with a real HTML parser in Perl, Ruby, Python, or even C.

Answered by mklement0

A solution based on the multi-platform web-scraping CLI xidel and XQuery:

xidel -s --xquery 'for $tr in //tr[position()>1] return join($tr/td, " ")' file

With the sample input, this yields:

SAVE_DOCUMENT OK 0.406 s
GET_DOCUMENT OK 0.332 s
DVK_SEND OK 0.001 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 0.143 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.001 s
SUMMARY_STATUS OK 0.888 s

Explanation:

  • The XQuery query for $tr in //tr[position()>1] return join($tr/td, " ") processes the tr elements starting with the 2nd one (position()>1, to skip the header row) in a loop, and joins the values of the child td elements ($tr/td) with a single space as the separator.

  • -s makes xidel silent (suppresses output of status information).



While html2text is convenient for display of the extracted data, providing machine-parseable output is non-trivial, unfortunately:

html2text file | awk -F' *\\|' 'NR>2 {gsub(/^\||.\b/, ""); $1=$1; print}'

The Awk command removes the hidden \b-based (backspace-based) sequences that html2text outputs by default, splits each line into fields on |, and then outputs the fields with a space as the separator (a space is Awk's default output field separator; to change it to a tab, for instance, use -v OFS='\t').
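
The field-rebuilding step can be tried in isolation. This is a minimal sketch on a hand-written pipe-delimited line, not on real html2text output (whose exact layout is not reproduced here); the function name pipes_to_tabs is hypothetical, and the separator is written as a bracket expression to sidestep backslash escaping.

```shell
# Sketch: split on optional spaces followed by |, then re-join with tabs.
# pipes_to_tabs is an illustrative helper name.
pipes_to_tabs() {
    # Assigning $1 to itself forces Awk to rebuild the record,
    # joining the fields with OFS instead of the original separators.
    awk -F' *[|]' -v OFS='\t' '{ $1 = $1; print }' "$@"
}

printf '%s\n' 'DVK_SEND  |OK  |0.001 s' | pipes_to_tabs
```

Without the $1 = $1 assignment, print would emit the line unchanged, because Awk only applies OFS when a field has been modified.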

Note: Using -nobs to suppress backspace sequences at the source is not an option, because you then won't be able to distinguish between the hidden-by-default _ instances used for padding and actual _ characters in the data.

Note: Given that html2text seemingly invariably uses | as the column separator, the above will only work robustly if there are no | instances in the data being extracted.

Answered by kenorb

You can parse the file using the Ex editor (part of Vim) by removing HTML tags, e.g.:

$ ex -s +'%s/<[^>]\+>/ /g' +'v/0/d' +'wq! /dev/stdout' table.html 
  SAVE_DOCUMENT  OK  0.406 s  
  GET_DOCUMENT  OK  0.332 s  
  DVK_SEND  OK  0.001 s  
  DVK_RECEIVE  OK  0.001 s  
  GET_USER_INFO  OK  0.143 s  
  NOTIFICATIONS  OK  0.001 s  
  ERROR_LOG  OK  0.001 s  
  SUMMARY_STATUS  OK  0.888 s 

Here is a shorter version that prints the whole file without HTML tags:

$ ex +'%s/<[^>]\+>/ /g|%p' -scq! table.html

Explanation:

  • %s/<[^>]\+>/ /g - Substitutes every HTML tag with a space.
  • v/0/d - Deletes all lines that do not contain 0.
  • wq! /dev/stdout - Writes the buffer to standard output and quits the editor.
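
For comparison, the same two steps (replace every tag with a space, then drop lines without a 0) can be sketched with sed instead of ex. This swaps in a different tool and is not the answer's own method; the function name tags_to_spaces is made up for illustration. It inherits the same fragility: any row whose text contains no 0 is dropped.

```shell
# Sketch: sed equivalent of the two ex commands, assuming POSIX sed.
# tags_to_spaces is an illustrative helper name.
tags_to_spaces() {
    # 1st expression: turn each <...> tag into a single space.
    # 2nd expression: delete every line that does not contain a 0.
    sed -e 's/<[^>][^>]*>/ /g' -e '/0/!d' "$@"
}

printf '%s\n' '<tr><td>DVK_SEND</td><td>OK</td><td>0.001 s</td></tr>' | tags_to_spaces
```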