Can awk patterns match multiple lines?
Original question: http://stackoverflow.com/questions/14350856/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Andres Gonzalez
I have some complex log files, and I need to write some tools to process them. I have been playing with awk, but I am not sure whether awk is the right tool for this.
My log files are printouts of OSPF protocol decodes. They contain a text log of the various protocol pkts and their contents, with each protocol field identified along with its value. I want to process these files and print only certain lines of the log that pertain to specific pkts. Each pkt's log entry can span a varying number of lines.
awk seems to be able to process a single line that matches a pattern. I can locate the desired pkt, but then I need to match patterns in the lines that follow in order to determine whether it is a pkt I want to print.
Another way to look at this: I want to isolate several lines in the log file and print those lines, which are the details of a particular pkt, based on pattern matches across several lines.
Since awk seems to be line-based, I am not sure it is the best tool to use.
If awk can do this, how is it done? If not, any suggestions on which tool to use for this?
Accepted answer by DigitalRoss
Awk can easily detect multi-line combinations of patterns, but you need to create what is called a state machine in your code to recognize the sequence.
Consider this input:
how
second half #1
now
first half
second half #2
brown
second half #3
cow
As you have seen, it's easy to recognize a single pattern. Now, we can write an awk program that recognizes second half only when it is directly preceded by a first half line. (With a more sophisticated state machine you could detect an arbitrary sequence of patterns.)
/second half/ {
  if (lastLine == "first half") {
    print
  }
}
# Remember the current line so the next record can check what preceded it.
{ lastLine = $0 }
If you run this you will see:
second half #2
Now, this example is absurdly simple and only barely a state machine. The interesting state lasts only for the duration of the if statement, and the preceding state is implicit, depending on the value of lastLine. In a more canonical state machine you would keep an explicit state variable and transition from state to state depending on both the existing state and the current input. But you may not need that much control mechanism.
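For comparison, a version with an explicit state variable might look like the sketch below. It is only an illustration and not part of the original answer: the state names (idle, armed) and the match_pairs shell wrapper are made up for the example, and it assumes a POSIX awk plus the sample input shown above.

```shell
#!/bin/sh
# Hypothetical wrapper: print every "second half" line that directly
# follows a "first half" line, tracking progress in an explicit state.
match_pairs() {
  awk '
    BEGIN { state = "idle" }                      # nothing interesting seen yet
    # Transition: a "first half" line arms the machine.
    $0 == "first half" { state = "armed"; next }
    # While armed, a "second half" line is a hit.
    state == "armed" && /second half/ { print }
    # Every other line (and every consumed hit) resets the machine.
    { state = "idle" }
  ' "$@"
}

# Feed it the sample input from the answer.
printf '%s\n' 'how' 'second half #1' 'now' 'first half' \
  'second half #2' 'brown' 'second half #3' 'cow' | match_pairs
```

Each rule is one transition, so recognizing a longer sequence would mostly mean adding more state values and more rules of the same shape.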
Answer by Cong Wang
`pcregrep -M` works pretty well for this.
From pcregrep(1):
-M, --multiline
Allow patterns to match more than one line. When this option is given, patterns may usefully contain literal newline characters and internal occurrences of ^ and $ characters. The output for a successful match may consist of more than one line, the last of which is the one in which the match ended. If the matched string ends with a newline sequence the output ends at the end of that line.
When this option is set, the PCRE library is called in “multiline” mode. There is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. However, pcregrep ensures that at least 8K characters or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for lookbehind assertions. This option does not work when input is read line by line (see --line-buffered.)
Answer by ghoti
I do this sort of thing with sendmail logs, from time to time.
Given:
Jan 15 22:34:39 mail sm-mta[36383]: r0B8xkuT048547: to=<www@web3>, delay=4+18:34:53, xdelay=00:00:00, mailer=esmtp, pri=21092363, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:39 mail sm-mta[36383]: r0B8hpoV047895: to=<www@web3>, delay=4+18:49:22, xdelay=00:00:00, mailer=esmtp, pri=21092556, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:51 mail sm-mta[36719]: r0G3Youh036719: from=<[email protected]>, size=0, class=0, nrcpts=0, proto=ESMTP, daemon=IPv4, relay=[50.71.152.178]
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: lost input channel from [190.107.98.82] to IPv4 after rcpt
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: from=<[email protected]>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=[190.107.98.82]
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<[email protected]>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)
I use a script something like this:
#!/usr/bin/awk -f
BEGIN {
  search = ARGV[1];  # Grab the first command line option
  delete ARGV[1];    # Delete it so it won't be considered a file
}
# First, store every line in an array keyed on the Queue ID.
# Obviously, this only works for smallish log segments, as it uses up memory.
{
  line[$6] = sprintf("%s\n%s", line[$6], $0);
}
# Next, keep a record of Queue IDs with substrings that match our search string.
index($0, search) {
  show[$6];
}
# Finally, once we've processed all input data, walk through our array of "found"
# Queue IDs, and print the corresponding records from the storage array.
END {
  for (qid in show) {
    print line[qid];
  }
}
to get the following output:
$ mqsearch airtel /var/log/maillog
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<[email protected]>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)
The idea here is that I'm printing all lines that match the Sendmail Queue ID of the string I want to search for. The structure of the code is of course a product of the structure of the log file, so you'll need to customize your solution for the data you're trying to analyse and extract.
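If the log is too large to hold in memory, one possible variation (not from the original answer) is to read the file twice: the first pass records only the matching Queue IDs, and the second prints every line that carries one of them. The sketch below assumes the Queue ID is field 6, as in the sample log, and that the input is a regular, non-empty file rather than a pipe. The name mqsearch2 and the shortened log lines are invented for the example.

```shell
#!/bin/sh
# Hypothetical two-pass variant. Usage: mqsearch2 STRING LOGFILE
mqsearch2() {
  # The same file is named twice; NR == FNR holds only during the first read.
  awk -v search="$1" '
    NR == FNR { if (index($0, search)) show[$6]; next }  # pass 1: collect IDs
    $6 in show                                           # pass 2: print matches
  ' "$2" "$2"
}

# Invented stand-in data with the same field layout as the log above.
log=$(mktemp)
cat > "$log" <<'EOF'
Jan 15 22:34:51 mail sm-mta[1]: AAA111: size=0, relay=[192.0.2.1]
Jan 15 22:35:04 mail sm-mta[2]: BBB222: lost input channel from [192.0.2.2]
Jan 15 22:35:04 mail sm-mta[2]: BBB222: size=0, relay=[192.0.2.2]
EOF
mqsearch2 'lost input' "$log"
rm -f "$log"
```

Unlike the in-memory script, this prints the matching lines in file order rather than grouped by Queue ID, and awk's -v assignment interprets backslash escapes in the search string.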
Answer by Vaz
Awk is really record-based. By default it thinks of a line as a record, but you can alter that with the RS (record separator) variable.
One way to approach this would be to do a first pass using sed (you could do this with awk, too, if you prefer) to separate the records with a different character, like a form-feed. Then you can write your awk script so that it treats each group of lines as a single record.
For example, if this is your data:
animal 0
name: joe
type: dog
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat
To separate the records with form-feeds:
$ cat data | sed $'s|^\\(animal.*\\)|\f\\1|'
Now we'll take that and pass it through awk. Here's an example of conditionally printing a record:
$ cat data | sed $'s|^\\(animal.*\\)|\f\\1|' | awk '
BEGIN { RS="\f" }
/type: cat/ { print }'
outputs:
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat
Edit: as a bonus, here's how to do it with awk-ward ruby (-014 means use form-feed (octal code 014) as the record separator):
$ cat data | sed $'s|^\\(animal.*\\)|\f\\1|' |
ruby -014 -ne 'print if /type: cat/'
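The sed pass is optional. As a rough sketch (a variation of my own, not part of the answer), awk can collect each record itself by starting a new buffer whenever a line begins with animal. The names cats_only, emit and rec are hypothetical, and the code assumes a POSIX awk.

```shell
#!/bin/sh
# Hypothetical awk-only version: group lines into records that start at
# an "animal" header, then print only the records that mention a cat.
cats_only() {
  awk '
    # Print the buffered record if it matches, then clear the buffer.
    function emit() { if (rec ~ /type: cat/) printf "%s", rec; rec = "" }
    /^animal/ { emit() }            # a header line closes the previous record
    { rec = rec $0 "\n" }           # append the current line to the buffer
    END { emit() }                  # handle the final record
  ' "$@"
}

printf '%s\n' 'animal 0' 'name: joe' 'type: dog' \
  'animal 1' 'name: bill' 'type: cat' \
  'animal 2' 'name: ed' 'type: cat' | cats_only
```

Because the whole record sits in one string, the same test could also combine several conditions, for example a name and a type at once.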
Answer by Clemens Tolboom
awk is able to process from a start pattern until an end pattern:
/start-pattern/,/end-pattern/ {
  print
}
I was looking for how to match
 * Implements hook_entity_info_alter().
 */
function file_test_entity_type_alter(&$entity_types) {
so created
/\* Implements hook_/,/function / {
  print
}
which gave me the content I needed. A more complex example is to skip lines and scrub off non-space parts. Note that awk is a record (line) and word (split by space) tool.
# start,end pattern match using comma
# (\S is a gawk extension)
/ \* Implements hook_(.*?)\./,/function (.\S*?)/ {
  # skip PHP multi line comment end
  if ($0 ~ / \*\//) next
  # Only print 3rd word
  if ($0 ~ /Implements/) {
    hook = $3
    # scrub off the opening parenthesis and everything after it
    sub(/\(.*$/, "", hook)
    print hook
  }
  # Only print function name without parenthesis
  if ($0 ~ /function/) {
    name = $2
    # scrub off the opening parenthesis and everything after it
    sub(/\(.*$/, "", name)
    print name
    print ""
  }
}
Hope this helps too.
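A range pattern always includes both boundary lines. When finer control is needed, a common alternative is to keep a flag by hand. The sketch below is an illustration under my own assumptions, not the answerer's code: hook_block and the flag name inside are hypothetical, and the PHP sample is a shortened stand-in for a real file.

```shell
#!/bin/sh
# Hypothetical flag-based equivalent of the start,end range pattern.
hook_block() {
  awk '
    /\* Implements hook_/ { inside = 1 }   # start marker switches the flag on
    inside                { print }        # print while the flag is set
    /^function /          { inside = 0 }   # end marker switches it off again
  ' "$@"
}

hook_block <<'EOF'
<?php
/**
 * Implements hook_entity_info_alter().
 */
function file_test_entity_type_alter(&$entity_types) {
}
EOF
```

Moving the print rule above the first rule would drop the start line from the output, and moving it below the last rule would drop the end line, which a plain range pattern cannot do without extra tests in the action.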
See also ftp://ftp.gnu.org/old-gnu/Manuals/gawk-3.0.3/html_chapter/gawk_toc.html