Source: http://stackoverflow.com/questions/13983365/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
grep from tar.gz without extracting [faster one]
Asked by Pixel
I'm trying to grep for a pattern in a dozen .tar.gz files, but it's very slow.
I'm using:
tar -ztf file.tar.gz | while read FILENAME
do
    if tar -zxf file.tar.gz "$FILENAME" -O | grep "string" > /dev/null
    then
        echo "$FILENAME contains string"
    fi
done
Answer by nemo
For starters, you could start more than one process:
tar -ztf file.tar.gz | while read FILENAME
do
    (if tar -zxf file.tar.gz "$FILENAME" -O | grep -l "string"
    then
        echo "$FILENAME contains string"
    fi) &
done
The ( ... ) & creates a new detached process (read: the parent shell does not wait for the child).
After that, you should optimize the extracting of your archive. The read is no problem, as the OS should have cached the file access already. However, tar needs to unpack the archive every time the loop runs, which can be slow. Unpacking the archive once and iterating over the result may help here:
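# mktemp -d creates the temporary directory itself (tempfile would create a
# plain file, and the mkdir that followed would then fail)
tempPath=$(mktemp -d) && tar -zxf file.tar.gz -C "$tempPath" &&
find "$tempPath" -type f | {
    while IFS= read -r FILENAME
    do
        (if grep -l "string" "$FILENAME"
        then
            echo "$FILENAME contains string"
        fi) &
    done
    wait    # let the background greps finish before the directory is removed
}
rm -r "$tempPath"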
find is used here to get a list of the files in tar's target directory, which we iterate over, searching each file for the string.
Edit: Use grep -l to speed things up, as Jim pointed out. From man grep:
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would
normally have been printed. The scanning will stop on the first match. (-l is specified
by POSIX.)
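A possible variation on the extract-once idea, sketched here rather than taken from the answer: let find and xargs run a bounded number of grep processes instead of forking one per file. The function name grep_extracted and the limit of four parallel jobs are arbitrary choices, and -print0, -0 and -P assume GNU (or BSD) find and xargs.

```shell
# grep_extracted ARCHIVE PATTERN
# Extract ARCHIVE once into a temporary directory, then list the files
# that contain PATTERN. Hypothetical helper, not part of the answer above.
grep_extracted() {
    archive=$1
    pattern=$2
    workdir=$(mktemp -d) || return 1
    tar -xzf "$archive" -C "$workdir" || { rm -rf "$workdir"; return 1; }
    # -print0 / -0 keep names with spaces intact;
    # -P 4 runs at most four grep processes at the same time.
    (cd "$workdir" && find . -type f -print0 |
        xargs -0 -P 4 grep -l -e "$pattern" --)
    rm -rf "$workdir"
}
```

Calling grep_extracted file.tar.gz "string" prints the matching paths relative to the archive root. Because the greps run in parallel, the order of the lines is not guaranteed.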
Answer by Jim Stewart
If this is really slow, I suspect you're dealing with a large archive file. It's going to uncompress it once to extract the file list, and then uncompress it N times--where N is the number of files in the archive--for the grep. In addition to all the uncompressing, it's going to have to scan a fair bit into the archive each time to extract each file. One of tar's biggest drawbacks is that there is no table of contents at the beginning. There's no efficient way to get information about all the files in the archive and only read that portion of the file. It essentially has to read all of the file up to the thing you're extracting every time; it can't just jump to a filename's location right away.
The easiest thing you can do to speed this up would be to uncompress the file first (gunzip file.tar.gz) and then work on the .tar file. That might help enough by itself. It's still going to loop through the entire archive N times, though.
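As a rough sketch of that suggestion (my own wording of it, not code from the answer), the function below decompresses the archive once into a temporary .tar file and then runs the asker's per-member loop against it. The name grep_members_from_tar is made up for illustration, and gzip -dc is used instead of gunzip so that the original .tar.gz is left in place.

```shell
# grep_members_from_tar ARCHIVE.tar.gz PATTERN
# Decompress once, then extract each member from the plain tar file.
grep_members_from_tar() {
    gz_archive=$1
    pattern=$2
    plain_tar=$(mktemp) || return 1
    # One gzip pass instead of one per archive member.
    gzip -dc "$gz_archive" > "$plain_tar" || { rm -f "$plain_tar"; return 1; }
    tar -tf "$plain_tar" | while IFS= read -r member; do
        case $member in */) continue ;; esac    # skip directory entries
        # tar still scans the archive for every member, but no longer
        # has to decompress it each time.
        if tar -xOf "$plain_tar" "$member" | grep -e "$pattern" > /dev/null; then
            printf '%s contains %s\n' "$member" "$pattern"
        fi
    done
    rm -f "$plain_tar"
}
```

Usage would be grep_members_from_tar file.tar.gz "string". This needs enough free space for the uncompressed tar file.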
If you really want this to be efficient, your only option is to completely extract everything in the archive before processing it. Since your problem is speed, I suspect this is a giant file that you don't want to extract first, but if you can, this will speed things up a lot:
tar zxf file.tar.gz
for f in hopefullySomeSubdir/*; do
    grep -l "string" "$f"    # quoted so file names with spaces still work
done
Note that grep -l prints the name of any matching file, quits after the first match, and is silent if there's no match. That alone will speed up the grepping portion of your command, so even if you don't have the space to extract the entire archive, grep -l will help. If the files are huge, it will help a lot.
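To make the grep -l behaviour concrete, here is a tiny illustration with two throwaway files (the file names hit.txt and miss.txt are invented for the demo):

```shell
# Two scratch files: one with two matching lines, one with none.
demo_dir=$(mktemp -d)
printf 'needle\nneedle again\n' > "$demo_dir/hit.txt"
printf 'nothing here\n' > "$demo_dir/miss.txt"

# grep -l prints hit.txt's path once (not once per matching line)
# and prints nothing at all for miss.txt.
matches=$(grep -l "needle" "$demo_dir/hit.txt" "$demo_dir/miss.txt")
printf '%s\n' "$matches"
```

Only the path of hit.txt is printed, even though that file contains two matching lines.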
Answer by Jester
You can use the --to-command option to pipe files to an arbitrary script. Using this you can process the archive in a single pass (and without a temporary file). See also this question, and the manual.
Armed with the above information, you could try something like:
$ tar xf file.tar.gz --to-command "awk '/bar/ { print ENVIRON[\"TAR_FILENAME\"]; exit }'"
bfe2/.bferc
bfe2/CHANGELOG
bfe2/README.bferc
Answer by lanes
If you have zgrep, you can use:
zgrep -a string file.tar.gz
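One thing to keep in mind with this approach: zgrep knows about gzip but not about tar, so it searches the decompressed tar stream as one long input. It shows the matching lines, but not which archive member they came from. A small demonstration, assuming zgrep is installed and using an invented notes.txt:

```shell
# Build a tiny archive to search.
demo=$(mktemp -d)
printf 'first line\nneedle here\n' > "$demo/notes.txt"
tar -czf "$demo/notes.tar.gz" -C "$demo" notes.txt

# -a makes grep treat the (binary) tar stream as text. The matching line
# is printed, but nothing in the output says it came from notes.txt.
found=$(zgrep -a "needle" "$demo/notes.tar.gz")
printf '%s\n' "$found"
```

A match on the very first line of a member can come out prefixed with bytes from the tar header, since the header does not end with a newline.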
Answer by Katie
I know this question is 4 years old, but I have a couple of different options:
Option 1: Using tar --to-command grep
The following line will look in example.tgz for PATTERN. This is similar to @Jester's example, but I couldn't get his pattern matching to work.
tar xzf example.tgz --to-command 'grep --label="$TAR_FILENAME" -H PATTERN ; true'
Option 2: Using tar -tzf
The second option is to use tar -tzf to list the files, then go through them with grep. You can create a function to use it over and over:
targrep () {
    # $1 is the archive, $2 is the pattern (see the usage line)
    for i in $(tar -tzf "$1"); do
        results=$(tar -Oxzf "$1" "$i" | grep --label="$i" -H "$2")
        echo "$results"
    done
}
Usage:
targrep example.tar.gz "pattern"
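A caveat with looping over $(tar -tzf ...) is that the shell splits the listing on whitespace, so a member called "my file.txt" is handed to tar as two separate names. A hedged variant that reads the listing line by line could look like this (targrep_lines is a made-up name, and --label assumes GNU grep):

```shell
# targrep_lines ARCHIVE PATTERN
# Like a list-then-grep loop, but reads member names one line at a time.
targrep_lines() {
    archive=$1
    pattern=$2
    tar -tzf "$archive" | while IFS= read -r member; do
        # Directory entries end in "/"; extracting them would pull in
        # every file below them a second time.
        case $member in */) continue ;; esac
        tar -Oxzf "$archive" "$member" |
            grep --label="$member" -H -e "$pattern"
    done
}
```

It is called the same way, for example targrep_lines example.tar.gz "pattern", and it still decompresses the archive once per member, so it is no faster than the loop it replaces.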
Answer by John T.
All of the code above was really helpful, but none of it quite answered my own need: grep all *.tar.gz files in the current directory for a pattern that is passed as an argument to a reusable script, and output:
- The name of both the archive file and the extracted file
- The line number where the pattern was found
- The contents of the matching line
It's what I was really hoping that zgrep could do for me, and it just can't.
Here's my solution:
pattern=$1    # the pattern is the script's first argument
for f in *.tar.gz; do
    echo "$f:"
    tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true";
done
You can also replace the tar line with the following if you'd like to test that all variables are expanding properly with a basic echo statement:
tar -xzf "$f" --to-command 'echo "f:`basename $TAR_FILENAME` s:'"$pattern\""
Let me explain what's going on. Hopefully, the for loop and the echo of the archive filename in question are obvious.
tar -xzf: x extract, z filter through gzip, f based on the following archive file...
"$f": The archive file provided by the for loop (such as what you'd get by doing an ls), in double quotes to allow the variable to expand and to ensure that the script is not broken by any file names with spaces, etc.
--to-command: Pass the output of the tar command to another command rather than actually extracting files to the filesystem. Everything after this specifies what the command is (grep) and what arguments we're passing to that command.
Let's break that part down by itself, since it's the "secret sauce" here.
'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
First, we use a single quote to start this chunk so that the executed sub-command (basename $TAR_FILENAME) is not immediately expanded/resolved. More on that in a moment.
grep: The command to be run on the (not actually) extracted files
--label=: The label to prepend to the results, the value of which is enclosed in double quotes because we do want to have the grep command resolve the $TAR_FILENAME environment variable passed in by the tar command.
basename $TAR_FILENAME: Runs as a command (surrounded by backticks), removes the directory path, and outputs only the name of the file
-Hin: H Display filename (provided by the label), i Case-insensitive search, n Display line number of match
Then we "end" the first part of the command string with a single quote and start the next part with a double quote so that $pattern, passed in as the first argument, can be resolved.
Working out which quotes I needed to use where was the part that tripped me up the longest. Hopefully, this all makes sense to you and helps someone else out. Also, I hope I can find this in a year when I need it again (and have already forgotten about the script I made for it!)
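If the quoting is still hard to picture, it can help to build the string in a variable and print it before giving it to tar. The snippet below is only an illustration: the value "ford" and the variable name to_command are made up.

```shell
pattern="ford"

# Single-quoted part: kept literally, so the backticks and $TAR_FILENAME
# survive until tar starts the command for each archive member.
# Double-quoted part: $pattern is expanded right now, by the calling shell.
to_command='grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"

printf '%s\n' "$to_command"
# prints: grep --label="`basename $TAR_FILENAME`" -Hin ford ; true
```

The printed line is exactly what tar receives as the --to-command argument. It also shows a limitation: a pattern containing spaces or quotes is pasted in unprotected and would be split or misread when the command runs.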
It's been a couple of weeks since I wrote the above and it's still super useful... but it wasn't quite good enough, as files have piled up and searching for things has gotten messier. I needed a way to limit what I looked at by the date of the file (only looking at more recent files). So here's that code. Hopefully it's fairly self-explanatory.
if [ -z "$1" ]; then
    echo "Look within all tar.gz files for a string pattern, optionally only in recent files"
    echo "Usage: targrep <string to search for> [start date]"
fi
pattern=$1        # first argument: string to search for
startdatein=$2    # second argument (optional): start date
startdate=$(date -d "$startdatein" +%s)
for f in *.tar.gz; do
    filedate=$(date -r "$f" +%s)
    if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
        echo "$f:"
        tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
    fi
done
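The date filter boils down to comparing two epoch timestamps. Pulled out into a small helper (is_on_or_after is a name I made up, and both date -d and date -r FILE assume GNU date), the check might be sketched like this:

```shell
# is_on_or_after FILE DATE
# Succeeds when FILE was last modified on or after DATE.
is_on_or_after() {
    file_epoch=$(date -r "$1" +%s) || return 2     # mtime of FILE in seconds
    start_epoch=$(date -d "$2" +%s) || return 2    # DATE parsed into seconds
    [ "$file_epoch" -ge "$start_epoch" ]
}
```

For example, is_on_or_after backup.tar.gz "1/1/2019" && echo "recent enough" would print the message only for archives modified in 2019 or later (backup.tar.gz is a placeholder name).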
And I can't stop tweaking this thing. I added an argument to filter by the name of the output files in the tar file. Wildcards work, too.
Usage:
targrep.sh [-d <start date>] [-f <filename to include>] <string to search for>
Example:
targrep.sh -d "1/1/2019" -f "*vehicle_models.csv" ford
while getopts "d:f:" opt; do
    case $opt in
        d) startdatein=$OPTARG;;
        f) targetfile=$OPTARG;;
    esac
done
shift "$((OPTIND-1))" # Discard options and bring forward remaining arguments
pattern=$1            # first remaining argument: string to search for
echo "Searching for: $pattern"
if [[ -n $targetfile ]]; then
    echo "in filenames: $targetfile"
fi
startdate=$(date -d "$startdatein" +%s)
for f in *.tar.gz; do
    filedate=$(date -r "$f" +%s)
    if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
        echo "$f:"
        if [[ -z "$targetfile" ]]; then
            tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
        else
            tar -xzf "$f" --no-anchored "$targetfile" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
        fi
    fi
done
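Because the pattern is pasted into the command string, a pattern with spaces or quotes can break the grep call. One way around that, offered as a sketch rather than as part of the script above, is to hand the pattern over in an environment variable, which the command started by tar inherits. The names targrep_env and TARGREP_PATTERN are invented here, and GNU tar and GNU grep are assumed.

```shell
# targrep_env ARCHIVE PATTERN
# Single pass over ARCHIVE; PATTERN travels in the environment instead
# of being spliced into the --to-command string.
targrep_env() {
    archive=$1
    pattern=$2
    # The whole command is single-quoted, so $TAR_FILENAME and
    # $TARGREP_PATTERN are expanded only when tar runs it per member.
    TARGREP_PATTERN=$pattern tar -xzf "$archive" \
        --to-command 'grep --label="$TAR_FILENAME" -Hin -e "$TARGREP_PATTERN"; true'
}
```

With this, targrep_env cars.tar.gz "focus rs" searches for the two-word phrase as one pattern (cars.tar.gz is a placeholder). The label here is the full member path; wrapping it in basename as in the script above would shorten it.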
Answer by Nutan
Both of the options below work well.
$ zgrep -ai 'CDF_FEED' FeedService.log.1.05-31-2019-150003.tar.gz | more
2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html
$ zcat FeedService.log.1.05-31-2019-150003.tar.gz | grep -ai 'CDF_FEED'
2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html
Answer by Dr. Alex RE
Am trying to grep pattern from dozen files .tar.gz but its very slow
tar -ztf file.tar.gz | while read FILENAME
do
    if tar -zxf file.tar.gz "$FILENAME" -O | grep "string" > /dev/null
    then
        echo "$FILENAME contains string"
    fi
done
That's actually very easy with ugrep option -z:
-z, --decompress
Decompress files to search, when compressed. Archives (.cpio,
.pax, .tar, and .zip) and compressed archives (e.g. .taz, .tgz,
.tpz, .tbz, .tbz2, .tb2, .tz2, .tlz, and .txz) are searched and
matching pathnames of files in archives are output in braces. If
-g, -O, -M, or -t is specified, searches files within archives
whose name matches globs, matches file name extensions, matches
file signature magic bytes, or matches file types, respectively.
Supported compression formats: gzip (.gz), compress (.Z), zip,
bzip2 (requires suffix .bz, .bz2, .bzip2, .tbz, .tbz2, .tb2, .tz2),
lzma and xz (requires suffix .lzma, .tlz, .xz, .txz).
This requires just one command to search file.tar.gz, as follows:
ugrep -z "string" file.tar.gz
This greps each of the archived files to display matches. Archived filenames are shown in braces to distinguish them from ordinary filenames. For example:
$ ugrep -z "Hello" archive.tgz
{Hello.bat}:echo "Hello World!"
Binary file archive.tgz{Hello.class} matches
{Hello.java}:public class Hello // prints a Hello World! greeting
{Hello.java}: { System.out.println("Hello World!");
{Hello.pdf}:(Hello)
{Hello.sh}:echo "Hello World!"
{Hello.txt}:Hello
If you just want the file names, use option -l (--files-with-matches) and customize the filename output with option --format="%z%~" to get rid of the braces:
$ ugrep -z Hello -l --format="%z%~" archive.tgz
Hello.bat
Hello.class
Hello.java
Hello.pdf
Hello.sh
Hello.txt