Fastest way to tell if two files have the same contents in Unix/Linux?
Original URL: http://stackoverflow.com/questions/12900538/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by JDS
I have a shell script in which I need to check whether two files contain the same data. I do this for a lot of files, and in my script the diff command seems to be the performance bottleneck.
Here's the line:
diff -q $dst $new > /dev/null
if ($status) then ...
Could there be a faster way to compare the files, maybe a custom algorithm instead of the default diff?
Accepted answer by Alex Howansky
I believe cmp will stop at the first byte difference:
cmp --silent $old $new || echo "files are different"
Answer by jabaldonedo
Why don't you get the hash of both files' contents?
Try this script: save it as, for example, script.sh and then run it as follows: script.sh file1.txt file2.txt
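#!/bin/bash
# Hash each file from stdin so the output holds only the digest, not the file name.
# md5 is the BSD/macOS tool; on Linux the equivalent is md5sum.
file1=`md5 < "$1"`
file2=`md5 < "$2"`
if [ "$file1" = "$file2" ]
then
    echo "Files have the same content"
else
    echo "Files do NOT have the same content"
fi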
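When one reference file has to be checked against many others, the hash idea above can be sketched so that the reference is read only once. This is a minimal sketch, not the answer author's script: the function names `file_sum` and `matches_reference` are hypothetical, and it assumes the POSIX `cksum` utility instead of `md5`. A matching CRC makes identical content very likely but does not prove it.

```shell
#!/bin/sh
# Hypothetical helper: print only the CRC field of cksum's output.
# Reading from stdin keeps the file name out of the output.
file_sum() {
    cksum < "$1" | awk '{print $1}'
}

# Hypothetical helper: compare one reference file against many candidates.
# The reference checksum is computed once and reused for every candidate.
matches_reference() {
    ref=$1
    shift
    ref_sum=$(file_sum "$ref")
    for f in "$@"; do
        if [ "$(file_sum "$f")" = "$ref_sum" ]; then
            echo "$f: same checksum"
        else
            echo "$f: different"
        fi
    done
}
```

It could be called as `matches_reference reference.txt copy1.txt copy2.txt`. For a single pair of files this is no faster than cmp, because both files are still read in full.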
Answer by jim mcnamara
For files that are identical, any method requires both files to have been read entirely, even if that read happened in the past.
There is no alternative. So creating hashes or checksums at some point in time requires reading the whole file. Big files take time.
File metadata retrieval is much faster than reading a large file.
So, is there any file metadata you can use to establish that the files are different? File size? Or even the results of the file command, which reads only a small portion of the file?
File size example code fragment:
# Field 5 of ls -l is the file size in bytes; $1 and $2 are the two file paths.
ls -l "$1" "$2" |
awk 'NR==1{a=$5} NR==2{b=$5}
END{val=(a==b)?0 :1; exit( val) }'
[ $? -eq 0 ] && echo 'same' || echo 'different'
If the files are the same size then you are stuck with full file reads.
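The two steps described in this answer (cheap size check first, full read only when the sizes match) could be combined roughly as follows. This is a sketch under assumptions: the function name `same_contents` is hypothetical, and it falls back to `cmp -s` from the accepted answer for the full comparison. Return codes follow cmp's convention (0 same, 1 different, 2 trouble).

```shell
#!/bin/sh
# Hypothetical helper: compare sizes from metadata, then contents if needed.
same_contents() {
    # Both arguments must be existing regular files.
    if [ ! -f "$1" ] || [ ! -f "$2" ]; then
        return 2
    fi
    # Field 5 of `ls -ln` is the size in bytes; no file data is read here.
    size_a=$(ls -ln "$1" | awk '{print $5}')
    size_b=$(ls -ln "$2" | awk '{print $5}')
    # Different sizes mean different contents, so skip the full read.
    if [ "$size_a" -ne "$size_b" ]; then
        return 1
    fi
    # Same size: a byte-by-byte comparison is unavoidable.
    cmp -s "$1" "$2"
}
```

Usage might look like `same_contents "$dst" "$new" && echo same || echo different`. Some cmp implementations may already perform a similar size check internally, so the gain should be measured rather than assumed.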
Answer by pn1 dude
Like @Alex Howansky, I have used 'cmp --silent' for this. But I need both a positive and a negative response, so I use:
cmp --silent file1 file2 && echo '### SUCCESS: Files Are Identical! ###' || echo '### WARNING: Files Are Different! ###'
I can then run this in the terminal or over ssh to check files against a constant file.
Answer by Nono Taps
Also try the cksum command:
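# Replace <file1> and <file2> with the paths to compare.
# Field 1 of cksum's output is the CRC checksum.
chk1=`cksum <file1> | awk -F" " '{print $1}'`
chk2=`cksum <file2> | awk -F" " '{print $1}'`
if [ "$chk1" -eq "$chk2" ]
then
    echo "File is identical"
else
    echo "File is not identical"
fi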
The cksum command prints a CRC checksum followed by the byte count of the file. See 'man cksum'.
Answer by Hyman Simth
Doing some testing with a Raspberry Pi 3B+ (I'm using an overlay file system, and need to sync periodically), I ran a comparison of my own for diff -q and cmp -s; note that this is a log from inside /dev/shm, so disk access speeds are a non-issue:
[root@mypi shm]# dd if=/dev/urandom of=test.file bs=1M count=100 ; time diff -q test.file test.copy && echo diff true || echo diff false ; time cmp -s test.file test.copy && echo cmp true || echo cmp false ; cp -a test.file test.copy ; time diff -q test.file test.copy && echo diff true || echo diff false; time cmp -s test.file test.copy && echo cmp true || echo cmp false
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 6.2564 s, 16.8 MB/s
Files test.file and test.copy differ
real 0m0.008s
user 0m0.008s
sys 0m0.000s
diff false
real 0m0.009s
user 0m0.007s
sys 0m0.001s
cmp false
cp: overwrite 'test.copy'? y
real 0m0.966s
user 0m0.447s
sys 0m0.518s
diff true
real 0m0.785s
user 0m0.211s
sys 0m0.573s
cmp true
[root@mypi shm]# pico /root/rwbscripts/utils/squish.sh
I ran it a couple of times. cmp -s consistently had slightly shorter times on the test box I was using. So if you want to use cmp -s to do things between two files....
identical () {
    # $1 and $2 are the two file names passed to the function.
    echo "$1" and "$2" are the same.
    echo This is a function, you can put whatever you want in here.
}
different () {
    echo "$1" and "$2" are different.
    echo This is a function, you can put whatever you want in here, too.
}
cmp -s "$FILEA" "$FILEB" && identical "$FILEA" "$FILEB" || different "$FILEA" "$FILEB"
Answer by Gregory Martin
Because I suck and don't have enough reputation points I can't add this tidbit in as a comment.
But, if you are going to use the cmp command (and don't need/want to be verbose) you can just grab the exit status. Per the cmp man page:
If a FILE is '-' or missing, read standard input. Exit status is 0 if inputs are the same, 1 if different, 2 if trouble.
So, you could do something like:
STATUS="$(cmp --silent "$FILE1" "$FILE2"; echo $?)"  # capture cmp's exit status
if [[ $STATUS -ne 0 ]]; then  # non-zero: files differ (1) or cmp had trouble (2)
    :  # DO A COMMAND ON $FILE1
else
    :  # DO SOMETHING ELSE
fi
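Because the man page distinguishes three exit statuses, a `-ne 0` test treats a missing or unreadable file the same as a real difference. A possible way to keep the three cases apart is sketched below; the function name `report_cmp` and the messages are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map cmp's exit status to a message.
report_cmp() {
    rc=0
    cmp -s "$1" "$2" || rc=$?
    case $rc in
        0) echo "identical" ;;
        1) echo "different" ;;
        *) echo "error comparing files" >&2
           return 2 ;;
    esac
}
```

For example, `report_cmp "$FILE1" "$FILE2"` prints one of the first two messages on standard output, or reports the error on standard error and returns 2.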