C# 如何使用 .NET 快速比较 2 个文件?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1358510/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
How to compare 2 files fast using .NET?
提问 by TheFlash
Typical approaches recommend reading the binary via FileStream and comparing it byte-by-byte.
典型的方法建议通过 FileStream 读取二进制文件并逐字节比较。
- Would a checksum comparison such as CRC be faster?
- Are there any .NET libraries that can generate a checksum for a file?
- CRC 等校验和比较会更快吗?
- 是否有任何 .NET 库可以为文件生成校验和?
采纳答案 by Reed Copsey
A checksum comparison will most likely be slower than a byte-by-byte comparison.
校验和比较很可能比逐字节比较慢。
In order to generate a checksum, you'll need to load each byte of the file, and perform processing on it. You'll then have to do this on the second file. The processing will almost definitely be slower than the comparison check.
为了生成校验和,您需要加载文件的每个字节,并对其进行处理。然后,您必须对第二个文件执行此操作。处理几乎肯定会比比较检查慢。
As for generating a checksum: You can do this easily with the cryptography classes. Here's a short example of generating an MD5 checksum with C#.
至于生成校验和:您可以使用密码学类轻松完成此操作。这是使用 C#生成 MD5 校验和的简短示例。
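The short example itself is not included above, so here is a minimal illustrative sketch of such a helper (the method name and the hex formatting are assumptions, not from the original answer):
上文提到的示例并未收录在本页,下面给出一个仅作说明用途的最小示意(方法名和十六进制格式化都是假设,并非出自原答案):
using System;
using System.IO;
using System.Security.Cryptography;

// Illustrative helper: hash a file's contents with MD5 and return the hash as a hex string.
static string GetMd5Checksum(string path)
{
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(path))
    {
        byte[] hash = md5.ComputeHash(stream);
        return BitConverter.ToString(hash).Replace("-", string.Empty);
    }
}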
However, a checksum may be faster and make more sense if you can pre-compute the checksum of the "test" or "base" case. If you have an existing file, and you're checking to see if a new file is the same as the existing one, pre-computing the checksum on your "existing" file would mean only needing to do the DiskIO one time, on the new file. This would likely be faster than a byte-by-byte comparison.
但是,如果您可以预先计算“测试”或“基准”文件的校验和,那么校验和可能会更快,也更有意义。如果您已经有一个现有文件,并且要检查新文件是否与它相同,那么预先计算好“现有”文件的校验和,就意味着只需要对新文件做一次磁盘 I/O。这很可能比逐字节比较更快。
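A hedged sketch of that scenario, reusing the illustrative GetMd5Checksum helper above (knownChecksum and newFilePath are placeholders for the previously stored hash and the incoming file):
下面是该场景的一个示意(沿用上面假设的 GetMd5Checksum 辅助方法,knownChecksum 和 newFilePath 分别代表事先存好的哈希值和新文件,仅作说明):
// The baseline hash was computed and stored earlier; only the new file is read from disk now.
bool matchesBaseline = GetMd5Checksum(newFilePath) == knownChecksum;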
回答 by Sam Harwell
Edit: This method would not work for comparing binary files!
编辑:此方法不适用于比较二进制文件!
In .NET 4.0, the File class has the following two new methods:
在 .NET 4.0 中,File 类具有以下两个新方法:
public static IEnumerable<string> ReadLines(string path)
public static IEnumerable<string> ReadLines(string path, Encoding encoding)
Which means you could use:
这意味着您可以使用:
bool same = File.ReadLines(path1).SequenceEqual(File.ReadLines(path2));
回答 by Cecil Has a Name
If the files are not too big, you can use:
如果文件不是太大,您可以使用:
public static byte[] ComputeFileHash(string fileName)
{
    using (var stream = File.OpenRead(fileName))
        return System.Security.Cryptography.MD5.Create().ComputeHash(stream);
}
Comparing hashes is only really worthwhile if the hashes are useful to store, so that they can be reused.
只有当哈希值值得保存下来以便重复使用时,比较哈希值才真正划算。
(Edited the code to something much cleaner.)
(将代码编辑为更清晰的内容。)
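If you just need a one-off comparison of two files, a possible usage sketch (assuming a using directive for System.Linq; path1 and path2 are placeholders) would be:
如果只是想一次性比较两个文件,一个可能的用法示意如下(假设已引用 System.Linq,path1 和 path2 为占位符):
bool same = ComputeFileHash(path1).SequenceEqual(ComputeFileHash(path2));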
回答 by dtb
In addition to Reed Copsey's answer:
作为对 Reed Copsey 回答的补充:
The worst case is where the two files are identical. In this case it's best to compare the files byte-by-byte.
If the two files are not identical, you can speed things up a bit by detecting sooner that they're not identical.
最坏的情况是两个文件是相同的。在这种情况下,最好逐字节比较文件。
如果这两个文件不相同,您可以通过更快地检测它们不相同来加快速度。
For example, if the two files are of different length then you know they cannot be identical, and you don't even have to compare their actual content.
例如,如果两个文件的长度不同,那么您就知道它们不可能相同,您甚至不必比较它们的实际内容。
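A minimal sketch of that early exit (path1 and path2 are placeholders):
下面是这种提前退出的最小示意(path1 和 path2 为占位符):
// If the lengths differ, the contents cannot possibly be identical.
if (new FileInfo(path1).Length != new FileInfo(path2).Length)
    return false;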
回答 by Guffa
The only thing that might make a checksum comparison slightly faster than a byte-by-byte comparison is the fact that you are reading one file at a time, somewhat reducing the seek time for the disk head. That slight gain may however very well be eaten up by the added time of calculating the hash.
唯一可能让校验和比较比逐字节比较稍快一点的因素是:您一次只读取一个文件,从而在一定程度上减少了磁盘磁头的寻道时间。然而,这一点微小的收益很可能会被计算哈希所花费的额外时间抵消。
Also, a checksum comparison of course only has any chance of being faster if the files are identical. If they are not, a byte-by-byte comparison would end at the first difference, making it a lot faster.
此外,当然只有在两个文件完全相同时,校验和比较才有可能更快。如果两个文件不同,逐字节比较会在遇到第一个差异时就结束,因此要快得多。
You should also consider that a hash code comparison only tells you that it's very likely that the files are identical. To be 100% certain you need to do a byte-by-byte comparison.
您还应该考虑到哈希码比较只会告诉您这些文件很可能是相同的。要 100% 确定,您需要进行逐字节比较。
If the hash code for example is 32 bits, you are about 99.99999998% certain that the files are identical if the hash codes match. That is close to 100%, but if you truly need 100% certainty, that's not it.
例如,如果哈希码是 32 位,那么当两个哈希码匹配时,您大约有 99.99999998% 的把握认为这两个文件是相同的。这已经接近 100%,但如果您真的需要 100% 的确定性,那还不够。
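For reference, the arithmetic behind that figure: a 32-bit hash has 2^32 possible values, so two different files collide by pure chance roughly once in 4.3 billion.
作为参考,这个数字的由来是:32 位哈希共有 2^32 种可能取值,因此两个不同的文件纯粹因巧合而碰撞的概率大约是 43 亿分之一。
double collisionChance = 1.0 / 4294967296.0;               // 1 / 2^32 ≈ 2.33e-10
double confidencePercent = (1.0 - collisionChance) * 100;  // ≈ 99.99999998 %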
回答 by chsh
The slowest possible method is to compare two files byte by byte. The fastest I've been able to come up with is a similar comparison, but instead of one byte at a time, you would use an array of bytes sized to Int64, and then compare the resulting numbers.
最慢的方法是逐字节比较两个文件。我能想到的最快的是类似的比较,但不是一次一个字节,而是使用大小为 Int64 的字节数组,然后比较结果数字。
Here's what I came up with:
这是我想出的:
const int BYTES_TO_READ = sizeof(Int64);
static bool FilesAreEqual(FileInfo first, FileInfo second)
{
    if (first.Length != second.Length)
        return false;
    if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
        return true;
    int iterations = (int)Math.Ceiling((double)first.Length / BYTES_TO_READ);
    using (FileStream fs1 = first.OpenRead())
    using (FileStream fs2 = second.OpenRead())
    {
        byte[] one = new byte[BYTES_TO_READ];
        byte[] two = new byte[BYTES_TO_READ];
        for (int i = 0; i < iterations; i++)
        {
            fs1.Read(one, 0, BYTES_TO_READ);
            fs2.Read(two, 0, BYTES_TO_READ);
            if (BitConverter.ToInt64(one, 0) != BitConverter.ToInt64(two, 0))
                return false;
        }
    }
    return true;
}
In my testing, I was able to see this outperform a straightforward ReadByte() scenario by almost 3:1. Averaged over 1000 runs, I got this method at 1063ms, and the method below (straightforward byte by byte comparison) at 3031ms. Hashing always came back sub-second at around an average of 865ms. This testing was with an ~100MB video file.
在我的测试中,这个方法的性能比简单的 ReadByte() 方案高出近 3:1。对 1000 次运行取平均,这个方法耗时 1063 毫秒,而下面的方法(直接逐字节比较)耗时 3031 毫秒。哈希计算始终在 1 秒以内完成,平均约 865 毫秒。此测试使用的是一个约 100MB 的视频文件。
Here are the ReadByte and hashing methods I used, for comparison purposes:
下面是我用于对比的 ReadByte 方法和哈希方法:
static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
{
    if (first.Length != second.Length)
        return false;
    if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
        return true;
    using (FileStream fs1 = first.OpenRead())
    using (FileStream fs2 = second.OpenRead())
    {
        for (int i = 0; i < first.Length; i++)
        {
            if (fs1.ReadByte() != fs2.ReadByte())
                return false;
        }
    }
    return true;
}
static bool FilesAreEqual_Hash(FileInfo first, FileInfo second)
{
    byte[] firstHash = MD5.Create().ComputeHash(first.OpenRead());
    byte[] secondHash = MD5.Create().ComputeHash(second.OpenRead());
    for (int i = 0; i < firstHash.Length; i++)
    {
        if (firstHash[i] != secondHash[i])
            return false;
    }
    return true;
}
回答 by Lars
It gets even faster if you don't read in small 8-byte chunks but instead put a loop around it and read a larger chunk. I reduced the average comparison time to 1/4.
如果您不以 8 字节的小块读取,而是在外面加一层循环、一次读取更大的块,速度会更快。我把平均比较时间缩短到了原来的 1/4。
public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
{
    bool result;
    if (fileInfo1.Length != fileInfo2.Length)
    {
        result = false;
    }
    else
    {
        using (var file1 = fileInfo1.OpenRead())
        {
            using (var file2 = fileInfo2.OpenRead())
            {
                result = StreamsContentsAreEqual(file1, file2);
            }
        }
    }
    return result;
}
private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
{
    const int bufferSize = 1024 * sizeof(Int64);
    var buffer1 = new byte[bufferSize];
    var buffer2 = new byte[bufferSize];
    while (true)
    {
        int count1 = stream1.Read(buffer1, 0, bufferSize);
        int count2 = stream2.Read(buffer2, 0, bufferSize);
        if (count1 != count2)
        {
            return false;
        }
        if (count1 == 0)
        {
            return true;
        }
        int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
        for (int i = 0; i < iterations; i++)
        {
            if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
            {
                return false;
            }
        }
    }
}
回答 by Thomas Kj?rnes
Another improvement for large files with identical length might be to not read the files sequentially, but rather to compare more or less random blocks.
对于长度相同的大文件,另一个改进可能是不按顺序读取文件,而是比较一些大致随机选取的块。
You can use multiple threads, starting on different positions in the file and comparing either forward or backwards.
您可以使用多个线程,从文件中的不同位置开始并向前或向后比较。
This way you can detect changes at the middle/end of the file, faster than you would get there using a sequential approach.
通过这种方式,您可以检测文件中间/末尾的更改,比使用顺序方法更快。
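The answer itself contains no code; below is a rough, hedged sketch of one way the idea could look in C# (the worker count, block size and all names are assumptions, and each worker opens its own pair of streams):
该回答本身没有给出代码;下面是这一思路在 C# 中一种可能写法的粗略示意(线程数、块大小和所有命名都是假设,每个工作线程各自打开一对流):
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hedged sketch: split two same-length files into one range per worker and compare
// the ranges in parallel, stopping all workers as soon as any range differs.
static bool FilesAreEqualParallel(string path1, string path2, int workers = 4)
{
    long length = new FileInfo(path1).Length;
    if (length != new FileInfo(path2).Length)
        return false;                                   // different sizes, cannot be identical

    long rangeSize = (length + workers - 1) / workers;  // ceiling division
    bool mismatch = false;

    Parallel.For(0, workers, (w, state) =>
    {
        long start = w * rangeSize;
        long end = Math.Min(start + rangeSize, length);
        var buffer1 = new byte[64 * 1024];
        var buffer2 = new byte[64 * 1024];

        using (var fs1 = File.OpenRead(path1))
        using (var fs2 = File.OpenRead(path2))
        {
            fs1.Seek(start, SeekOrigin.Begin);
            fs2.Seek(start, SeekOrigin.Begin);
            long pos = start;
            while (pos < end && !state.IsStopped)
            {
                int toRead = (int)Math.Min(buffer1.Length, end - pos);
                int read1 = fs1.Read(buffer1, 0, toRead);
                int read2 = fs2.Read(buffer2, 0, toRead);
                if (read1 == 0 || read1 != read2 ||
                    !buffer1.Take(read1).SequenceEqual(buffer2.Take(read2)))
                {
                    mismatch = true;                    // difference (or short read) found, stop all workers
                    state.Stop();
                    return;
                }
                pos += read1;
            }
        }
    });

    return !mismatch;
}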
回答 by romeok
My experiments show that it definitely helps to call Stream.ReadByte() fewer times, but using BitConverter to package bytes does not make much difference against comparing bytes in a byte array.
我的实验表明,减少调用 Stream.ReadByte() 的次数肯定有帮助,但使用 BitConverter 打包字节与比较字节数组中的字节没有太大区别。
So it is possible to replace that "Math.Ceiling and iterations" loop in the comment above with the simplest one:
因此,可以用最简单的循环替换上面注释中的“Math.Ceiling 和迭代”循环:
for (int i = 0; i < count1; i++)
{
    if (buffer1[i] != buffer2[i])
        return false;
}
I guess it has to do with the fact that BitConverter.ToInt64 needs to do a bit of work (check arguments and then perform the bit shifting) before you compare, and that ends up being about the same amount of work as comparing 8 bytes in two arrays.
我猜这是因为 BitConverter.ToInt64 在比较之前需要做一些额外工作(检查参数,然后执行位移),这些工作量最终与直接比较两个数组中的 8 个字节相当。
回答 by CAFxX
If you only need to compare two files, I guess the fastest way would be (in C, I don't know if it's applicable to .NET)
如果您只需要比较两个文件,我想最快的方法是(在 C 中,我不知道它是否适用于 .NET)
- open both files f1, f2
- get the respective file length l1, l2
- if l1 != l2 the files are different; stop
- mmap() both files
- use memcmp() on the mmap()ed files
- 打开两个文件 f1, f2
- 获取各自的文件长度 l1, l2
- 如果 l1 != l2 文件不同;停止
- mmap() 两个文件
- 在 mmap()ed 文件上使用 memcmp()
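For what it's worth, .NET does expose memory-mapped files via System.IO.MemoryMappedFiles (since .NET 4.0); a hedged C# sketch of the same idea, not taken from the answer, might look like this (it reads the mapped views in chunks rather than calling memcmp directly):
顺带一提,.NET 通过 System.IO.MemoryMappedFiles(自 .NET 4.0 起)提供了内存映射文件;下面是同一思路在 C# 中的一个示意(并非出自原答案,它按块读取映射视图,而不是直接调用 memcmp):
using System.IO;
using System.IO.MemoryMappedFiles;

// Hedged sketch: map both files and compare the mapped views chunk by chunk.
static bool FilesAreEqualMemoryMapped(string path1, string path2)
{
    long length = new FileInfo(path1).Length;
    if (length != new FileInfo(path2).Length)
        return false;                                   // l1 != l2 => different
    if (length == 0)
        return true;                                    // two empty files are equal

    using (var map1 = MemoryMappedFile.CreateFromFile(path1, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    using (var map2 = MemoryMappedFile.CreateFromFile(path2, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    using (var view1 = map1.CreateViewStream(0, length, MemoryMappedFileAccess.Read))
    using (var view2 = map2.CreateViewStream(0, length, MemoryMappedFileAccess.Read))
    {
        var buffer1 = new byte[64 * 1024];
        var buffer2 = new byte[64 * 1024];
        int read1;
        while ((read1 = view1.Read(buffer1, 0, buffer1.Length)) > 0)
        {
            int read2 = view2.Read(buffer2, 0, buffer2.Length);
            if (read1 != read2)
                return false;
            for (int i = 0; i < read1; i++)
                if (buffer1[i] != buffer2[i])
                    return false;
        }
    }
    return true;
}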
OTOH, if you need to find if there are duplicate files in a set of N files, then the fastest way is undoubtedly using a hash to avoid N-way bit-by-bit comparisons.
OTOH,如果你需要在一组N个文件中查找是否有重复文件,那么最快的方法无疑是使用哈希来避免N路逐位比较。
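As a hedged sketch of that N-file case (the helper and grouping strategy are illustrative, not from the answer): group the files by length first, and only hash the groups that still contain more than one candidate.
下面是 N 个文件去重场景的一个示意(辅助方法和分组策略均为示例,并非出自原答案):先按文件长度分组,只对仍有多个候选的分组计算哈希。
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Illustrative sketch: returns groups of paths whose length and MD5 hash both match.
// Files in the same group are very likely duplicates (a final byte-by-byte check can confirm).
static IEnumerable<IGrouping<string, string>> FindLikelyDuplicates(IEnumerable<string> paths)
{
    return paths
        .GroupBy(p => new FileInfo(p).Length)           // cheap pre-filter: same length only
        .Where(g => g.Count() > 1)
        .SelectMany(g => g)
        .GroupBy(p =>
        {
            using (var md5 = MD5.Create())
            using (var stream = File.OpenRead(p))
                return BitConverter.ToString(md5.ComputeHash(stream));
        })
        .Where(g => g.Count() > 1);
}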