Linux: grep from tar.gz without extracting [faster one]

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/13983365/

Date: 2020-08-06 18:05:28  Source: igfitidea

Tags: linux, bash, grep

Asked by Pixel

I am trying to grep for a pattern in a dozen .tar.gz files, but it is very slow.

I am using:

tar -ztf file.tar.gz | while read FILENAME
do
        if tar -zxf file.tar.gz "$FILENAME" -O | grep "string" > /dev/null
        then
                echo "$FILENAME contains string"
        fi
done

Answered by nemo

For starters, you could start more than one process:

tar -ztf file.tar.gz | while IFS= read -r FILENAME
do
        # grep -l reads stdin here, so discard its "(standard input)" output
        (if tar -zxf file.tar.gz "$FILENAME" -O | grep -l "string" > /dev/null
        then
                echo "$FILENAME contains string"
        fi) &
done

The ( ... ) & creates a new detached (read: the parent shell does not wait for the child) process.
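
One thing to watch with ( ... ) &: because the parent does not wait, a script can reach its end (or a cleanup step) while searches are still running. Here is a minimal sketch of collecting the background jobs with wait. The function name search_parallel and its two arguments are made up for illustration, and it assumes the files have already been extracted into a directory.

```shell
# Hypothetical helper: search every file in a directory in parallel,
# then wait for all background subshells before returning.
search_parallel() {
    dir=$1
    pattern=$2
    for f in "$dir"/*; do
        # Each subshell runs in the background; grep -q stops at the first match.
        ( grep -q -- "$pattern" "$f" && echo "$f contains $pattern" ) &
    done
    wait    # block until every background subshell has exited
}
```

Called as search_parallel /some/dir string, it prints one line per matching file. The lines can come out in any order, since the jobs finish independently.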

After that, you should optimize the extracting of your archive. The read is no problem, as the OS should have cached the file access already. However, tar needs to unpack the archive every time the loop runs, which can be slow. Unpacking the archive once and iterating over the result may help here:

tempPath=$(mktemp -d) &&
tar -zxf file.tar.gz -C "$tempPath" &&
find "$tempPath" -type f | {
        while IFS= read -r FILENAME
        do
                (if grep -l "string" "$FILENAME" > /dev/null
                then
                        echo "$FILENAME contains string"
                fi) &
        done
        wait    # let every background grep finish before cleaning up
}
rm -r "$tempPath"

find is used here to get a list of the files in tar's target directory, which we then iterate over, searching each file for the string.

Edit: Use grep -l to speed things up, as Jim pointed out. From man grep:

   -l, --files-with-matches
          Suppress normal output; instead print the name of each input file from which output would
          normally have been printed.  The scanning will stop on the first match.  (-l is specified
          by POSIX.)

Answered by Jim Stewart

If this is really slow, I suspect you're dealing with a large archive file. It's going to uncompress it once to extract the file list, and then uncompress it N times--where N is the number of files in the archive--for the grep. In addition to all the uncompressing, it's going to have to scan a fair bit into the archive each time to extract each file. One of tar's biggest drawbacks is that there is no table of contents at the beginning. There's no efficient way to get information about all the files in the archive and only read that portion of the file. It essentially has to read all of the file up to the thing you're extracting every time; it can't just jump to a filename's location right away.

The easiest thing you can do to speed this up would be to uncompress the file first (gunzip file.tar.gz) and then work on the .tar file. That might help enough by itself. It's still going to loop through the entire archive N times, though.
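
A rough sketch of that "gunzip once, then loop over the .tar" idea. The function name grep_after_gunzip is invented for this example, and it writes the uncompressed copy to a temporary file instead of replacing file.tar.gz, which is my own choice and not part of the suggestion above.

```shell
# Hypothetical helper: decompress the archive once, then search each member
# of the plain .tar, so gzip only runs a single time.
grep_after_gunzip() {
    archive=$1
    pattern=$2
    plain=$(mktemp) || return 1
    gunzip -c "$archive" > "$plain" || { rm -f "$plain"; return 1; }
    tar -tf "$plain" | while IFS= read -r name; do
        case $name in */) continue ;; esac    # skip directory entries
        # tar still scans the .tar for each member, but no longer gunzips it
        if tar -xOf "$plain" "$name" | grep -q -- "$pattern"; then
            echo "$name contains $pattern"
        fi
    done
    rm -f "$plain"
}
```

Usage would look like grep_after_gunzip file.tar.gz string. The temporary copy needs as much disk space as the uncompressed archive.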

If you really want this to be efficient, your only option is to completely extract everything in the archive before processing it. Since your problem is speed, I suspect this is a giant file that you don't want to extract first, but if you can, this will speed things up a lot:

tar zxf file.tar.gz
for f in hopefullySomeSubdir/*; do
  grep -l "string" "$f"
done

Note that grep -l prints the name of any matching file, quits after the first match, and is silent if there's no match. That alone will speed up the grepping portion of your command, so even if you don't have the space to extract the entire archive, grep -l will help. If the files are huge, it will help a lot.
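
If you do have the space, the extract-then-grep approach can be wrapped up roughly like this. It's only a sketch: extract_and_grep is a made-up name, and it uses grep -r (available in GNU and BSD grep) instead of a for loop so that nested directories are covered too.

```shell
# Hypothetical helper: extract once into a scratch directory, list the
# files that match, then remove the scratch directory again.
extract_and_grep() {
    archive=$1
    pattern=$2
    tmp=$(mktemp -d) || return 1
    # -r recurses into subdirectories; -l prints only names of matching files
    tar -xzf "$archive" -C "$tmp" && ( cd "$tmp" && grep -rl -- "$pattern" . )
    rm -rf "$tmp"
}
```

Called as extract_and_grep file.tar.gz string, it prints paths relative to the archive root, prefixed with ./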

Answered by Jester

You can use the --to-command option to pipe files to an arbitrary script. Using this you can process the archive in a single pass (and without a temporary file). See also this question, and the manual. Armed with the above information, you could try something like:

$ tar xf file.tar.gz --to-command "awk '/bar/ { print ENVIRON[\"TAR_FILENAME\"]; exit }'"
bfe2/.bferc
bfe2/CHANGELOG
bfe2/README.bferc

Answered by lanes

If you have zgrep, you can use:

zgrep -a string file.tar.gz
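
Keep in mind that this searches the decompressed tar stream as one blob, so it can tell you that the string occurs somewhere in the archive but not in which member (and member names in the tar headers can match too). A small sketch of the same idea as a yes/no check, with archive_mentions being a name I made up:

```shell
# Hypothetical helper: succeed if the pattern occurs anywhere in the
# decompressed archive stream (member contents or tar headers).
archive_mentions() {
    # -a treats the binary tar stream as text; -q only sets the exit status
    gzip -dc -- "$1" | grep -aq -- "$2"
}
```

For example: archive_mentions file.tar.gz string && echo "found somewhere in the archive"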

Answered by Katie

I know this question is 4 years old, but I have a couple of different options:

Option 1: Using tar --to-command grep

The following line will look in example.tgz for PATTERN. This is similar to @Jester's example, but I couldn't get his pattern matching to work.

tar xzf example.tgz --to-command 'grep --label="$TAR_FILENAME" -H PATTERN ; true'

Option 2: Using tar -tzf

The second option is using tar -tzf to list the files, then going through them with grep. You can create a function to use it over and over:

targrep () {
    # $1 = archive, $2 = pattern
    tar -tzf "$1" | while IFS= read -r i; do
        tar -Oxzf "$1" "$i" | grep --label="$i" -H "$2"
    done
}

Usage:

targrep example.tar.gz "pattern"

Answered by John T.

All of the code above was really helpful, but none of it quite answered my own need: grep all *.tar.gz files in the current directory for a pattern that is passed as an argument to a reusable script, and output:

  • The name of both the archive file and the extracted file
  • The line number where the pattern was found
  • The contents of the matching line

It's what I was really hoping that zgrep could do for me, and it just can't.

Here's my solution:

pattern=$1
for f in *.tar.gz; do
     echo "$f:"
     tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true";
done

You can also replace the tar line with the following if you'd like to test that all variables are expanding properly with a basic echo statement:

tar -xzf "$f" --to-command 'echo "f:`basename $TAR_FILENAME` s:'"$pattern\""

Let me explain what's going on. Hopefully, the for loop and the echo of the archive filename in question are obvious.

tar -xzf: x extract, z filter through gzip, f based on the following archive file...

"$f": The archive file provided by the for loop (such as what you'd get by doing an ls), in double quotes to allow the variable to expand and to ensure that the script is not broken by file names with spaces, etc.

--to-command: Pass the output of the tar command to another command rather than actually extracting files to the filesystem. Everything after this specifies what the command is (grep) and what arguments we're passing to that command.

Let's break that part down by itself, since it's the "secret sauce" here.

'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"

First, we use a single quote to start this chunk so that the executed sub-command (basename $TAR_FILENAME) is not immediately expanded/resolved. More on that in a moment.

grep: The command to be run on the (not actually) extracted files

--label=: The label to prepend to the results, the value of which is enclosed in double quotes because we do want to have the grep command resolve the $TAR_FILENAME environment variable passed in by the tar command.

basename $TAR_FILENAME: Runs as a command (surrounded by backticks), removes the directory path, and outputs only the name of the file

-Hin: H display the filename (provided by the label), i case-insensitive search, n display the line number of the match

Then we "end" the first part of the command string with a single quote and start up the next part with a double quote so that $pattern, passed in as the first argument, can be resolved.

Realizing which quotes I needed to use where was the part that tripped me up the longest. Hopefully, this all makes sense to you and helps someone else out. Also, I hope I can find this in a year when I need it again (and I've forgotten about the script I made for it already!)
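
One possible way around the quote juggling, offered as a sketch rather than a replacement for the script above: hand the pattern to the sub-command through an environment variable, so the whole --to-command string can stay in single quotes. The names targrep_env and TARGREP_PATTERN are invented here, and this assumes GNU tar, which passes its environment on to the command it runs.

```shell
# Hypothetical variant: the pattern travels in an environment variable,
# so it is never spliced into the command string.
targrep_env() {
    archive=$1
    TARGREP_PATTERN=$2 tar -xzf "$archive" --to-command \
        'grep --label="$TAR_FILENAME" -Hn -- "$TARGREP_PATTERN"; true'
}
```

With this, patterns containing spaces or quotes need no extra escaping, e.g. targrep_env example.tar.gz 'two words'.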

And it's been a couple of weeks since I wrote the above and it's still super useful... but it wasn't quite good enough, as files have piled up and searching for things has gotten messier. I needed a way to limit what I looked at by the date of the file (only looking at more recent files). So here's that code. Hopefully it's fairly self-explanatory.

if [ -z "$1" ]; then
    echo "Look within all tar.gz files for a string pattern, optionally only in recent files"
    echo "Usage: targrep <string to search for> [start date]"
    exit 1
fi
pattern=$1
startdatein=$2
startdate=$(date -d "$startdatein" +%s)
for f in *.tar.gz; do
    filedate=$(date -r "$f" +%s)
    if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
        echo "$f:"
        tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
    fi
done
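
As an aside, the date filter could probably also be handed to find instead of comparing epoch seconds by hand. This is a different technique from the script above, shown only as a sketch: recent_archives is a made-up name, and it relies on the -newermt test found in GNU and BSD find, which selects files modified strictly after the given date.

```shell
# Hypothetical helper: list the *.tar.gz files in the current directory
# that were modified after the given date.
recent_archives() {
    find . -maxdepth 1 -name '*.tar.gz' -newermt "$1" -print
}
```

For example, recent_archives '2019-01-01' prints names like ./backup.tar.gz, which could then feed the same tar --to-command loop.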


And I can't stop tweaking this thing. I added an argument to filter by the name of the output files in the tar file. Wildcards work, too.

Usage:

targrep.sh [-d <start date>] [-f <filename to include>] <string to search for>

Example:

targrep.sh -d "1/1/2019" -f "*vehicle_models.csv" ford

while getopts "d:f:" opt; do
    case $opt in
            d) startdatein=$OPTARG;;
            f) targetfile=$OPTARG;;
    esac
done
shift "$((OPTIND-1))" # Discard options and bring forward remaining arguments
pattern=$1

echo "Searching for: $pattern"
if [[ -n $targetfile ]]; then
    echo "in filenames:  $targetfile"
fi

startdate=$(date -d "$startdatein" +%s)
for f in *.tar.gz; do
    filedate=$(date -r "$f" +%s)
    if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
            echo "$f:"
            if [[ -z "$targetfile" ]]; then
                    tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
            else
                    tar -xzf "$f" --no-anchored "$targetfile" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
            fi
    fi
done

Answered by Nutan

Both of the options below work well.

$ zgrep -ai 'CDF_FEED' FeedService.log.1.05-31-2019-150003.tar.gz | more
2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService  : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html

$ zcat FeedService.log.1.05-31-2019-150003.tar.gz | grep -ai 'CDF_FEED'
2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService  : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html

Answered by Dr. Alex RE

Am trying to grep pattern from dozen files .tar.gz but its very slow

tar -ztf file.tar.gz | while read FILENAME
do
        if tar -zxf file.tar.gz "$FILENAME" -O | grep "string" > /dev/null
        then
                echo "$FILENAME contains string"
        fi
done

That's actually very easy with ugrep option -z:

-z, --decompress
        Decompress files to search, when compressed.  Archives (.cpio,
        .pax, .tar, and .zip) and compressed archives (e.g. .taz, .tgz,
        .tpz, .tbz, .tbz2, .tb2, .tz2, .tlz, and .txz) are searched and
        matching pathnames of files in archives are output in braces.  If
        -g, -O, -M, or -t is specified, searches files within archives
        whose name matches globs, matches file name extensions, matches
        file signature magic bytes, or matches file types, respectively.
        Supported compression formats: gzip (.gz), compress (.Z), zip,
        bzip2 (requires suffix .bz, .bz2, .bzip2, .tbz, .tbz2, .tb2, .tz2),
        lzma and xz (requires suffix .lzma, .tlz, .xz, .txz).

Which requires just one command to search file.tar.gz, as follows:

ugrep -z "string" file.tar.gz

This greps each of the archived files to display matches. Archived filenames are shown in braces to distinguish them from ordinary filenames. For example:

$ ugrep -z "Hello" archive.tgz
{Hello.bat}:echo "Hello World!"
Binary file archive.tgz{Hello.class} matches
{Hello.java}:public class Hello // prints a Hello World! greeting
{Hello.java}:  { System.out.println("Hello World!");
{Hello.pdf}:(Hello)
{Hello.sh}:echo "Hello World!"
{Hello.txt}:Hello

If you just want the file names, use option -l (--files-with-matches) and customize the filename output with option --format="%z%~" to get rid of the braces:

$ ugrep -z Hello -l --format="%z%~" archive.tgz
Hello.bat
Hello.class
Hello.java
Hello.pdf
Hello.sh
Hello.txt