What is the fastest way to create a checksum for large files in C#

Note: this page is a rendering of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/1177607/
Asked by crono
I have to sync large files across several machines. The files can be up to 6 GB in size. The sync will be done manually every few weeks. I can't take the filenames into consideration because they can change at any time.

My plan is to create checksums on the destination PC and on the source PC, and then copy every file whose checksum is not already present at the destination. My first attempt was something like this:
using System;
using System.IO;
using System.Security.Cryptography;

private static string GetChecksum(string file)
{
    using (FileStream stream = File.OpenRead(file))
    using (var sha = new SHA256Managed())
    {
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}
The problem was the runtime:

- SHA256 with a 1.6 GB file -> 20 minutes
- MD5 with a 1.6 GB file -> 6.15 minutes
Is there a better (faster) way to get the checksum, maybe with a better hash function?
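As context for the answers below, here is a minimal sketch of the sync step the plan describes: copy every source file whose checksum is not already present at the destination. The method name SyncByChecksum is hypothetical, and it assumes both directories are reachable as paths (e.g. UNC shares); it reuses the GetChecksum method above.

using System;
using System.Collections.Generic;
using System.IO;

static void SyncByChecksum(string sourceDir, string destDir)
{
    // Checksums already present at the destination.
    var destChecksums = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    foreach (var f in Directory.GetFiles(destDir))
        destChecksums.Add(GetChecksum(f));

    // Copy only files the destination does not have yet.
    foreach (var f in Directory.GetFiles(sourceDir))
    {
        if (!destChecksums.Contains(GetChecksum(f)))
            File.Copy(f, Path.Combine(destDir, Path.GetFileName(f)));
    }
}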
Accepted answer by Anton Gogolev
The problem here is that SHA256Managed reads 4096 bytes at a time (inherit from FileStream and override Read(byte[], int, int) to see how much it reads from the filestream), which is too small a buffer for disk IO.
To speed things up (2 minutes for hashing a 2 GB file on my machine with SHA256, 1 minute for MD5), wrap FileStream in BufferedStream and set a reasonably-sized buffer (I tried with a ~1 MB buffer):
// Not sure if BufferedStream should be wrapped in a using block
using (var stream = new BufferedStream(File.OpenRead(filePath), 1200000))
{
    // The rest remains the same
}
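Putting it together, a complete buffered version of the question's method might look like this sketch (the method name GetChecksumBuffered and the exact 1 MB buffer size are illustrative choices):

using System;
using System.IO;
using System.Security.Cryptography;

private static string GetChecksumBuffered(string file)
{
    // The BufferedStream turns SHA256Managed's many 4096-byte reads
    // into far fewer large reads against the disk.
    using (var stream = new BufferedStream(File.OpenRead(file), 1024 * 1024))
    using (var sha = new SHA256Managed())
    {
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}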
Answer by Binary Worrier
Don't checksum the entire file; create checksums every 100 MB or so, so that each file has a collection of checksums.

Then, when comparing checksums, you can stop at the first checksum that differs, getting out early and saving you from processing the entire file.

It'll still take the full time for identical files.
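A minimal sketch of this idea follows. The 100 MB chunk size, the choice of MD5, and the names ReadChunkChecksums/FilesDiffer are illustrative assumptions, not part of the original answer; FilesDiffer also assumes both files are reachable as paths from one machine.

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

private const int ChunkSize = 100 * 1024 * 1024; // ~100 MB per chunk

// Yields one checksum per chunk, reading the file lazily so a caller
// can stop at the first mismatch without touching the rest of the file.
private static IEnumerable<string> ReadChunkChecksums(string file)
{
    var buffer = new byte[ChunkSize];
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(file))
    {
        int filled = 0, read;
        // Fill each chunk completely before hashing, so chunk boundaries
        // stay deterministic even when Read returns short counts.
        while ((read = stream.Read(buffer, filled, buffer.Length - filled)) > 0)
        {
            filled += read;
            if (filled == buffer.Length)
            {
                yield return BitConverter.ToString(md5.ComputeHash(buffer, 0, filled)).Replace("-", "");
                filled = 0;
            }
        }
        if (filled > 0) // last, partial chunk
            yield return BitConverter.ToString(md5.ComputeHash(buffer, 0, filled)).Replace("-", "");
    }
}

private static bool FilesDiffer(string fileA, string fileB)
{
    using (var a = ReadChunkChecksums(fileA).GetEnumerator())
    using (var b = ReadChunkChecksums(fileB).GetEnumerator())
    {
        while (true)
        {
            bool moreA = a.MoveNext(), moreB = b.MoveNext();
            if (moreA != moreB) return true;          // different chunk counts
            if (!moreA) return false;                 // both done: files match
            if (a.Current != b.Current) return true;  // first mismatch: stop early
        }
    }
}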
Answer by Christian Birkl
Invoke the Windows port of md5sum.exe. It's about twice as fast as the .NET implementation (at least on my machine, using a 1.2 GB file):
using System.Diagnostics;

public static string Md5SumByProcess(string file)
{
    var p = new Process();
    p.StartInfo.FileName = "md5sum.exe";
    p.StartInfo.Arguments = file;
    p.StartInfo.UseShellExecute = false;
    p.StartInfo.RedirectStandardOutput = true;
    p.Start();
    // Read the output before waiting, so a full stdout buffer
    // cannot deadlock the child process.
    string output = p.StandardOutput.ReadToEnd();
    p.WaitForExit();
    // md5sum prints "<hash> <filename>"; this port appears to prefix the
    // line with '\' when the path contains backslashes, hence Substring(1).
    return output.Split(' ')[0].Substring(1).ToUpper();
}
Answer by Pasi Savolainen
You're doing something wrong (probably using too small a read buffer). On a machine of indecent age (an Athlon 2x1800MP from 2002) with disk DMA that is probably out of whack (6.6 MB/s is damn slow when doing sequential reads):

Create a 1 GB file with "random" data:
# dd if=/dev/sdb of=temp.dat bs=1M count=1024
1073741824 bytes (1.1 GB) copied, 161.698 s, 6.6 MB/s

# time sha1sum -b temp.dat
abb88a0081f5db999d0701de2117d2cb21d192a2 *temp.dat
1m5.299s

# time md5sum -b temp.dat
9995e1c1a704f9c1eb6ca11e7ecb7276 *temp.dat
1m58.832s
This is also weird: md5 is consistently slower than sha1 for me (re-ran several times).
Answer by crono
OK, thanks to all of you, let me wrap this up:

- using a "native" exe to do the hashing brought the time down from 6 minutes to 10 seconds, which is huge.
- increasing the buffer was even faster: the 1.6 GB file took 5.2 seconds using MD5 in .NET, so I will go with this solution. Thanks again!
Answer by Anders
I did tests with different buffer sizes, running this code:
using (var stream = new BufferedStream(File.OpenRead(file), bufferSize))
{
    SHA256Managed sha = new SHA256Managed();
    byte[] checksum = sha.ComputeHash(stream);
    return BitConverter.ToString(checksum).Replace("-", String.Empty).ToLower();
}
I tested with a file of roughly 29 GB in size, and the results were (buffer size in bytes):

- 10,000: 369.24 s
- 100,000: 362.55 s
- 1,000,000: 361.53 s
- 10,000,000: 434.15 s
- 100,000,000: 435.15 s
- 1,000,000,000: 434.31 s
- and 376.22 s when using the original, unbuffered code.
I am running an i5 2500K CPU, 12 GB of RAM, and an OCZ Vertex 4 256 GB SSD.

So I thought, what about a standard 2 TB hard drive? The results were:
- 10,000: 368.52 s
- 100,000: 364.15 s
- 1,000,000: 363.06 s
- 10,000,000: 678.96 s
- 100,000,000: 617.89 s
- 1,000,000,000: 626.86 s
- and 368.24 s unbuffered.
So I would recommend either no buffering, or a buffer of at most 1 million bytes.
Answer by Tal Aloni
As Anton Gogolev noted, FileStream reads 4096 bytes at a time by default, but you can specify any other value using the FileStream constructor:
new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 16 * 1024 * 1024)
Note that Brad Abrams from Microsoft wrote in 2004:
there is zero benefit from wrapping a BufferedStream around a FileStream. We copied BufferedStream's buffering logic into FileStream about 4 years ago to encourage better default performance
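Applied to the question's method, that might look like the following sketch (the 16 MB buffer simply mirrors the constructor call above):

using System;
using System.IO;
using System.Security.Cryptography;

private static string GetChecksum(string file)
{
    // Let FileStream itself do the large buffered reads; per the quote
    // above, wrapping it in a BufferedStream would add nothing.
    using (var stream = new FileStream(file, FileMode.Open, FileAccess.Read,
                                       FileShare.ReadWrite, 16 * 1024 * 1024))
    using (var sha = new SHA256Managed())
    {
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}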
Answer by Romil Kumar Jain
I know that I am late to the party, but I performed a test before actually implementing the solution.

I tested the inbuilt MD5 class against md5sum.exe. In my case the inbuilt class took 13 seconds, where md5sum.exe took around 16-18 seconds in every run.
using System;
using System.IO;
using System.Security.Cryptography;

DateTime current = DateTime.Now;
string file = @"C:\text.iso"; // it's a 2.5 GB file
string output;
using (var md5 = MD5.Create())
{
    using (var stream = File.OpenRead(file))
    {
        byte[] checksum = md5.ComputeHash(stream);
        output = BitConverter.ToString(checksum).Replace("-", String.Empty).ToLower();
        Console.WriteLine("Total seconds : " + (DateTime.Now - current).TotalSeconds.ToString() + " " + output);
    }
}
Answer by Fabske
You can have a look at xxHash.NET (https://github.com/wilhelmliao/xxHash.NET).
The xxHash algorithm seems to be faster than all the others.
Some benchmarks are on the xxHash site: https://github.com/Cyan4973/xxHash
PS: I've not yet used it.
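For what it's worth, a minimal streaming sketch using Microsoft's System.IO.Hashing NuGet package (an assumption on my part: it is a different library from the one linked above, and Convert.ToHexString requires .NET 5+):

using System;
using System.IO;
using System.IO.Hashing; // NuGet package "System.IO.Hashing"

static string XxHash64Checksum(string file)
{
    var hasher = new XxHash64();
    using (var stream = File.OpenRead(file))
    {
        // Append streams the whole file through the non-cryptographic hash.
        hasher.Append(stream);
    }
    return Convert.ToHexString(hasher.GetCurrentHash());
}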