What is the fastest way to create a checksum for large files in C#

Note: this page is a rendering of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/1177607/
Asked by crono
I have to sync large files across several machines. The files can be up to 6 GB in size. The sync will be done manually every few weeks. I can't take the filenames into consideration because they can change at any time.

My plan is to create checksums on the destination PC and on the source PC, and then copy every file whose checksum is not already present at the destination. My first attempt was something like this:
using System;
using System.IO;
using System.Security.Cryptography;

private static string GetChecksum(string file)
{
    using (FileStream stream = File.OpenRead(file))
    using (var sha = new SHA256Managed())
    {
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}
The problem was the runtime:

- SHA256 with a 1.6 GB file -> 20 minutes
- MD5 with a 1.6 GB file -> 6.15 minutes
Is there a better (faster) way to get the checksum, maybe with a better hash function?
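As context for the answers below, here is a minimal sketch of the sync step the plan describes: copy every source file whose checksum is not already present at the destination. The method name SyncByChecksum is hypothetical, and it assumes both directories are reachable as paths (e.g. UNC shares); it reuses the GetChecksum method above.

using System;
using System.Collections.Generic;
using System.IO;

static void SyncByChecksum(string sourceDir, string destDir)
{
    // Checksums already present at the destination.
    var destChecksums = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    foreach (var f in Directory.GetFiles(destDir))
        destChecksums.Add(GetChecksum(f));

    // Copy only files the destination does not have yet.
    foreach (var f in Directory.GetFiles(sourceDir))
    {
        if (!destChecksums.Contains(GetChecksum(f)))
            File.Copy(f, Path.Combine(destDir, Path.GetFileName(f)));
    }
}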
Accepted answer by Anton Gogolev
The problem here is that SHA256Managed reads 4096 bytes at a time (inherit from FileStream and override Read(byte[], int, int) to see how much it reads from the filestream), which is too small a buffer for disk IO.
To speed things up (2 minutes for hashing a 2 GB file on my machine with SHA256, 1 minute for MD5), wrap FileStream in BufferedStream and set a reasonably-sized buffer (I tried with a ~1 MB buffer):
// Not sure if BufferedStream should be wrapped in a using block
using (var stream = new BufferedStream(File.OpenRead(filePath), 1200000))
{
    // The rest remains the same
}
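Putting it together, a complete buffered version of the question's method might look like this sketch (the method name GetChecksumBuffered and the exact 1 MB buffer size are illustrative choices):

using System;
using System.IO;
using System.Security.Cryptography;

private static string GetChecksumBuffered(string file)
{
    // The BufferedStream turns SHA256Managed's many 4096-byte reads
    // into far fewer large reads against the disk.
    using (var stream = new BufferedStream(File.OpenRead(file), 1024 * 1024))
    using (var sha = new SHA256Managed())
    {
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}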
Answer by Binary Worrier
Don't checksum the entire file; create checksums every 100 MB or so, so that each file has a collection of checksums.

Then, when comparing checksums, you can stop at the first checksum that differs, getting out early and saving you from processing the entire file.

It'll still take the full time for identical files.
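A minimal sketch of this idea follows. The 100 MB chunk size, the choice of MD5, and the names ReadChunkChecksums/FilesDiffer are illustrative assumptions, not part of the original answer; FilesDiffer also assumes both files are reachable as paths from one machine.

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

private const int ChunkSize = 100 * 1024 * 1024; // ~100 MB per chunk

// Yields one checksum per chunk, reading the file lazily so a caller
// can stop at the first mismatch without touching the rest of the file.
private static IEnumerable<string> ReadChunkChecksums(string file)
{
    var buffer = new byte[ChunkSize];
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(file))
    {
        int filled = 0, read;
        // Fill each chunk completely before hashing, so chunk boundaries
        // stay deterministic even when Read returns short counts.
        while ((read = stream.Read(buffer, filled, buffer.Length - filled)) > 0)
        {
            filled += read;
            if (filled == buffer.Length)
            {
                yield return BitConverter.ToString(md5.ComputeHash(buffer, 0, filled)).Replace("-", "");
                filled = 0;
            }
        }
        if (filled > 0) // last, partial chunk
            yield return BitConverter.ToString(md5.ComputeHash(buffer, 0, filled)).Replace("-", "");
    }
}

private static bool FilesDiffer(string fileA, string fileB)
{
    using (var a = ReadChunkChecksums(fileA).GetEnumerator())
    using (var b = ReadChunkChecksums(fileB).GetEnumerator())
    {
        while (true)
        {
            bool moreA = a.MoveNext(), moreB = b.MoveNext();
            if (moreA != moreB) return true;          // different chunk counts
            if (!moreA) return false;                 // both done: files match
            if (a.Current != b.Current) return true;  // first mismatch: stop early
        }
    }
}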
Answer by Christian Birkl
Invoke the Windows port of md5sum.exe. It's about twice as fast as the .NET implementation (at least on my machine, using a 1.2 GB file):
using System.Diagnostics;

public static string Md5SumByProcess(string file)
{
    var p = new Process();
    p.StartInfo.FileName = "md5sum.exe";
    p.StartInfo.Arguments = file;
    p.StartInfo.UseShellExecute = false;
    p.StartInfo.RedirectStandardOutput = true;
    p.Start();
    // Read the output before waiting, so a full stdout buffer
    // cannot deadlock the child process.
    string output = p.StandardOutput.ReadToEnd();
    p.WaitForExit();
    // md5sum prints "<hash> <filename>"; this port appears to prefix the
    // line with '\' when the path contains backslashes, hence Substring(1).
    return output.Split(' ')[0].Substring(1).ToUpper();
}
Answer by Pasi Savolainen
You're doing something wrong (probably using too small a read buffer). On a machine of indecent age (an Athlon 2x1800MP from 2002) with disk DMA that is probably out of whack (6.6 MB/s is damn slow when doing sequential reads):

Create a 1 GB file with "random" data:
# dd if=/dev/sdb of=temp.dat bs=1M count=1024
1073741824 bytes (1.1 GB) copied, 161.698 s, 6.6 MB/s

# time sha1sum -b temp.dat
abb88a0081f5db999d0701de2117d2cb21d192a2 *temp.dat
1m5.299s

# time md5sum -b temp.dat
9995e1c1a704f9c1eb6ca11e7ecb7276 *temp.dat
1m58.832s
This is also weird: md5 is consistently slower than sha1 for me (re-ran several times).
Answer by crono
OK, thanks to all of you, let me wrap this up:

- using a "native" exe to do the hashing brought the time down from 6 minutes to 10 seconds, which is huge.
- increasing the buffer was even faster: the 1.6 GB file took 5.2 seconds using MD5 in .NET, so I will go with this solution. Thanks again!
Answer by Anders
I did tests with different buffer sizes, running this code:
using (var stream = new BufferedStream(File.OpenRead(file), bufferSize))
{
    SHA256Managed sha = new SHA256Managed();
    byte[] checksum = sha.ComputeHash(stream);
    return BitConverter.ToString(checksum).Replace("-", String.Empty).ToLower();
}
I tested with a file of roughly 29 GB in size, and the results were (buffer size in bytes):

- 10,000: 369.24 s
- 100,000: 362.55 s
- 1,000,000: 361.53 s
- 10,000,000: 434.15 s
- 100,000,000: 435.15 s
- 1,000,000,000: 434.31 s
- and 376.22 s when using the original, unbuffered code.
I am running an i5 2500K CPU, 12 GB of RAM, and an OCZ Vertex 4 256 GB SSD.

So I thought, what about a standard 2 TB hard drive? The results were:
- 10,000: 368.52 s
- 100,000: 364.15 s
- 1,000,000: 363.06 s
- 10,000,000: 678.96 s
- 100,000,000: 617.89 s
- 1,000,000,000: 626.86 s
- and 368.24 s unbuffered.
So I would recommend either no buffering, or a buffer of at most 1 million bytes.
Answer by Tal Aloni
As Anton Gogolev noted, FileStream reads 4096 bytes at a time by default, but you can specify any other value using the FileStream constructor:
new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 16 * 1024 * 1024)
Note that Brad Abrams from Microsoft wrote in 2004:
there is zero benefit from wrapping a BufferedStream around a FileStream. We copied BufferedStream's buffering logic into FileStream about 4 years ago to encourage better default performance
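Applied to the question's method, that might look like the following sketch (the 16 MB buffer simply mirrors the constructor call above):

using System;
using System.IO;
using System.Security.Cryptography;

private static string GetChecksum(string file)
{
    // Let FileStream itself do the large buffered reads; per the quote
    // above, wrapping it in a BufferedStream would add nothing.
    using (var stream = new FileStream(file, FileMode.Open, FileAccess.Read,
                                       FileShare.ReadWrite, 16 * 1024 * 1024))
    using (var sha = new SHA256Managed())
    {
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}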
Answer by Romil Kumar Jain
I know that I am late to the party, but I performed a test before actually implementing the solution.

I tested the inbuilt MD5 class against md5sum.exe. In my case the inbuilt class took 13 seconds, where md5sum.exe took around 16-18 seconds in every run.
using System;
using System.IO;
using System.Security.Cryptography;

DateTime current = DateTime.Now;
string file = @"C:\text.iso"; // it's a 2.5 GB file
string output;
using (var md5 = MD5.Create())
{
    using (var stream = File.OpenRead(file))
    {
        byte[] checksum = md5.ComputeHash(stream);
        output = BitConverter.ToString(checksum).Replace("-", String.Empty).ToLower();
        Console.WriteLine("Total seconds : " + (DateTime.Now - current).TotalSeconds.ToString() + " " + output);
    }
}
Answer by Fabske
You can have a look at xxHash.NET (https://github.com/wilhelmliao/xxHash.NET).
The xxHash algorithm seems to be faster than all the others.
Some benchmarks are on the xxHash site: https://github.com/Cyan4973/xxHash
PS: I've not yet used it.
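For what it's worth, a minimal streaming sketch using Microsoft's System.IO.Hashing NuGet package (an assumption on my part: it is a different library from the one linked above, and Convert.ToHexString requires .NET 5+):

using System;
using System.IO;
using System.IO.Hashing; // NuGet package "System.IO.Hashing"

static string XxHash64Checksum(string file)
{
    var hasher = new XxHash64();
    using (var stream = File.OpenRead(file))
    {
        // Append streams the whole file through the non-cryptographic hash.
        hasher.Append(stream);
    }
    return Convert.ToHexString(hasher.GetCurrentHash());
}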