Fastest way to tell if two files have the same contents in Unix/Linux?

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/12900538/

Date: 2020-08-06 14:36:24  Source: igfitidea

Tags: linux, file, unix, diff

Asked by JDS

I have a shell script in which I need to check whether two files contain the same data or not. I do this for a lot of files, and in my script the diff command seems to be the performance bottleneck.

Here's the line:

diff -q $dst $new > /dev/null

if ($status) then ...

Could there be a faster way to compare the files, maybe a custom algorithm instead of the default diff?

Accepted answer by Alex Howansky

I believe cmp will stop at the first differing byte:

cmp --silent "$old" "$new" || echo "files are different"

Answer by jabaldonedo

Why not compute a hash of each file's contents?

Try this script. Call it, for example, script.sh, and then run it as follows: script.sh file1.txt file2.txt

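#!/bin/bash

# $1 and $2 are the two file names passed on the command line.
# Hash via stdin so the file name is not part of the output being compared.
# md5sum is the GNU coreutils tool; on BSD/macOS, md5 -q "$1" prints only the hash.
file1=$(md5sum < "$1")
file2=$(md5sum < "$2")

if [ "$file1" = "$file2" ]
then
    echo "Files have the same content"
else
    echo "Files do NOT have the same content"
fi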

Answer by jim mcnamara

For files that are identical, any method requires having read both files entirely, even if the read happened in the past.

There is no alternative. So creating hashes or checksums at some point in time requires reading the whole file. Big files take time.

File metadata retrieval is much faster than reading a large file.

So, is there any file metadata you can use to establish that the files are different? File size? Or even the output of the file command, which reads only a small portion of the file?

File size example code fragment:

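  # $1 and $2 are the two files; field 5 of the ls -l output is the size in bytes
  ls -l "$1" "$2" |
  awk 'NR==1{a=$5} NR==2{b=$5}
       END{val=(a==b)?0 :1; exit( val) }'

# 'same' here only means the sizes match
[ $? -eq 0 ] && echo 'same' || echo 'different'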

If the files are the same size then you are stuck with full file reads.
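
The two steps above (cheap size check first, full read only when the sizes match) can be combined into one helper. This is a minimal sketch, not code from the answer: the function name same_content is hypothetical, and it assumes wc -c is cheap for regular files, which depends on the implementation (stat -c %s on GNU or stat -f %z on BSD are alternatives).

```shell
#!/bin/sh
# same_content FILE1 FILE2 (hypothetical helper name, not from the answer)
# Returns 0 if the contents match, 1 if they differ, 2 on error.
same_content() {
    # Both arguments must be readable regular files.
    [ -f "$1" ] && [ -r "$1" ] && [ -f "$2" ] && [ -r "$2" ] || return 2

    # Cheap check first: different sizes mean different contents.
    # The arithmetic expansion strips any padding some wc versions print.
    size1=$(($(wc -c < "$1")))
    size2=$(($(wc -c < "$2")))
    [ "$size1" -eq "$size2" ] || return 1

    # Same size: a byte-by-byte read is unavoidable.
    cmp -s "$1" "$2"
}
```

It would be called as: if same_content "$dst" "$new"; then echo same; else echo different; fi. Some cmp implementations may already skip the read when the sizes differ and -s is given, so it is worth timing both variants before adopting this.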

Answer by pn1 dude

Like @Alex Howansky, I have used 'cmp --silent' for this. But I need both a positive and a negative response, so I use:

cmp --silent file1 file2 && echo '### SUCCESS: Files Are Identical! ###' || echo '### WARNING: Files Are Different! ###'

I can then run this in the terminal or over ssh to check files against a constant file.

Answer by Nono Taps

Also try the cksum command:

# file1 and file2 stand for the two file names; field 1 of the cksum output is the checksum
chk1=$(cksum < file1 | awk '{print $1}')
chk2=$(cksum < file2 | awk '{print $1}')

if [ "$chk1" -eq "$chk2" ]
then
  echo "File is identical"
else
  echo "File is not identical"
fi

The cksum command outputs a CRC checksum followed by the byte count of a file; the checksum is the first field. See 'man cksum'.

Answer by Hyman Simth

Doing some testing with a Raspberry Pi 3B+ (I'm using an overlay file system, and need to sync periodically), I ran a comparison of my own for diff -q and cmp -s; note that this is a log from inside /dev/shm, so disk access speeds are a non-issue:

[root@mypi shm]# dd if=/dev/urandom of=test.file bs=1M count=100 ; time diff -q test.file test.copy && echo diff true || echo diff false ; time cmp -s test.file test.copy && echo cmp true || echo cmp false ; cp -a test.file test.copy ; time diff -q test.file test.copy && echo diff true || echo diff false; time cmp -s test.file test.copy && echo cmp true || echo cmp false
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 6.2564 s, 16.8 MB/s
Files test.file and test.copy differ

real    0m0.008s
user    0m0.008s
sys     0m0.000s
diff false

real    0m0.009s
user    0m0.007s
sys     0m0.001s
cmp false
cp: overwrite 'test.copy'? y

real    0m0.966s
user    0m0.447s
sys     0m0.518s
diff true

real    0m0.785s
user    0m0.211s
sys     0m0.573s
cmp true
[root@mypi shm]# pico /root/rwbscripts/utils/squish.sh

I ran it a couple of times. cmp -s consistently had slightly shorter times on the test box I was using. So if you want to use cmp -s to do things between two files....

identical (){
  echo "$1" and "$2" are the same.
  echo This is a function, you can put whatever you want in here.
}
different () {
  echo "$1" and "$2" are different.
  echo This is a function, you can put whatever you want in here, too.
}
cmp -s "$FILEA" "$FILEB" && identical "$FILEA" "$FILEB" || different "$FILEA" "$FILEB"
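
One caveat with the "a && b || c" form: c also runs when a succeeds but b returns a non-zero status, so both branches can fire. If the functions might fail, an explicit if/else picks exactly one branch. A minimal sketch, where report_files is a hypothetical name used only for illustration:

```shell
#!/bin/sh
# report_files FILE1 FILE2 (hypothetical name): runs exactly one branch.
report_files() {
    if cmp -s "$1" "$2"; then
        echo "$1 and $2 are the same."
    else
        # Reached for any non-zero cmp status, including errors.
        echo "$1 and $2 are different."
    fi
}
```

Usage would be report_files "$FILEA" "$FILEB", with the echo lines swapped for whatever each case should do.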

Answer by Gregory Martin

Because I suck and don't have enough reputation points I can't add this tidbit in as a comment.

But if you are going to use the cmp command (and don't need or want to be verbose), you can just grab the exit status. Per the cmp man page:

If a FILE is '-' or missing, read standard input. Exit status is 0 if inputs are the same, 1 if different, 2 if trouble.

So, you could do something like:

STATUS="$(cmp --silent "$FILE1" "$FILE2"; echo $?)"  # "$?" gives exit status for each comparison

if [[ $STATUS -ne 0 ]]; then  # if status isn't equal to 0, then execute code
    echo "do a command on $FILE1"  # placeholder for the real command
else
    echo "do something else"       # placeholder for the real command
fi
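
Since the man page quoted above distinguishes three exit statuses, a case statement can treat "different" and "trouble" separately instead of lumping them into "not 0". A minimal sketch, where compare_status is a hypothetical helper name and the printed words are placeholders:

```shell
#!/bin/sh
# compare_status FILE1 FILE2 (hypothetical helper): prints one word per outcome.
compare_status() {
    status=0
    cmp -s "$1" "$2" || status=$?   # capture the exit status without aborting
    case $status in
        0) echo "identical" ;;
        1) echo "different" ;;
        *) echo "error" ;;          # 2 (or anything else): missing or unreadable file
    esac
}
```

For example, compare_status "$FILE1" "$FILE2" prints "error" rather than "different" when one of the files does not exist.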