Reading large text files with streams in C#

Note: this page is a mirror of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, note the original URL, and attribute it to the original authors (not me) on Stack Overflow.
Original question: http://stackoverflow.com/questions/2161895/
Asked by Nicole Lee
I've got the lovely task of working out how to handle large files being loaded into our application's script editor (it's like VBA for our internal product, used for quick macros). Most files are about 300-400 KB, which load fine. But when they go beyond 100 MB the process has a hard time (as you'd expect).
What happens is that the file is read and shoved into a RichTextBox which is then navigated - don't worry too much about this part.
The developer who wrote the initial code is simply using a StreamReader and doing
[Reader].ReadToEnd()
which could take quite a while to complete.
My task is to break this bit of code up, read it in chunks into a buffer and show a progressbar with an option to cancel it.
Some assumptions:
- Most files will be 30-40 MB
- The contents of the file are text (not binary); some are Unix format, some are DOS.
- Once the contents are retrieved we work out what line terminator is used.
- No-one is concerned about how long it takes to render in the RichTextBox once it's loaded. It's just the initial load of the text.
Now for the questions:
- Can I simply use StreamReader, then check the Length property (for ProgressMax), issue a Read for a set buffer size, and iterate through in a while loop whilst inside a background worker, so it doesn't block the main UI thread? Then return the StringBuilder to the main thread once it's completed.
- The contents will be going to a StringBuilder. Can I initialise the StringBuilder with the size of the stream if the length is available?
Are these (in your professional opinions) good ideas? I've had a few issues in the past with reading content from Streams, because it will always miss the last few bytes or something, but I'll ask another question if this is the case.
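The chunked, cancellable read sketched in the questions above might look roughly like this. The names are illustrative and a CancellationToken stands in for whichever background-worker mechanism is used; this is a sketch of the idea, not code from the original post:

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;

static class ChunkedReader
{
    // Reads a file in fixed-size chunks on whatever thread calls it (e.g. a
    // background worker), reporting progress (0-100) after each chunk and
    // honouring cancellation. Returns null if the read was cancelled.
    public static string ReadAllText(string path, Action<int> reportProgress,
                                     CancellationToken cancel, int bufferSize = 8192)
    {
        using (var reader = new StreamReader(path))
        {
            long length = reader.BaseStream.Length;
            if (length == 0) return string.Empty;

            // Question 2: pre-sizing the StringBuilder from the stream length
            // is only a hint (bytes != chars), but it avoids most regrowth.
            var sb = new StringBuilder((int)Math.Min(length, (long)int.MaxValue));
            var buffer = new char[bufferSize];
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                if (cancel.IsCancellationRequested)
                    return null;
                // Append only the chars actually read, never the whole buffer.
                sb.Append(buffer, 0, read);
                reportProgress((int)(100 * reader.BaseStream.Position / length));
            }
            return sb.ToString();
        }
    }
}
```

Note that BaseStream.Position moves in jumps of the StreamReader's internal buffer, so the reported percentage is approximate between chunks.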
Answered by Tufo
Use a background worker and read only a limited number of lines. Read more only when the user scrolls.
And try to never use ReadToEnd(). It's one of those functions that makes you think "why did they make it?"; it's a script kiddies' helper that works fine for small things, but as you see, it sucks for large files...
Those guys telling you to use StringBuilder need to read the MSDN more often:
Performance Considerations
The Concat and AppendFormat methods both concatenate new data to an existing String or StringBuilder object. A String object concatenation operation always creates a new object from the existing string and the new data. A StringBuilder object maintains a buffer to accommodate the concatenation of new data. New data is appended to the end of the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, then the new data is appended to the new buffer.
The performance of a concatenation operation for a String or StringBuilder object depends on how often a memory allocation occurs.
A String concatenation operation always allocates memory, whereas a StringBuilder concatenation operation only allocates memory if the StringBuilder object buffer is too small to accommodate the new data. Consequently, the String class is preferable for a concatenation operation if a fixed number of String objects are concatenated. In that case, the individual concatenation operations might even be combined into a single operation by the compiler. A StringBuilder object is preferable for a concatenation operation if an arbitrary number of strings are concatenated; for example, if a loop concatenates a random number of strings of user input.
That means huge allocations of memory, which leads to heavy use of the swap file: the system simulates sections of your hard disk drive acting like RAM, but a hard disk drive is very slow.
The StringBuilder option looks fine for a single-user system, but when you have two or more users reading large files at the same time, you have a problem.
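The scroll-on-demand idea above can be sketched like this: keep the reader open and pull only the next batch of lines when the user nears the end of what is loaded. The class and method names are illustrative:

```csharp
using System.Collections.Generic;
using System.IO;

static class LazyLineLoader
{
    // Pulls only the next 'count' lines from an open reader; call it again
    // from the scroll handler when the user nears the end of the loaded text.
    public static List<string> ReadNextLines(TextReader reader, int count)
    {
        var lines = new List<string>(count);
        string line;
        while (lines.Count < count && (line = reader.ReadLine()) != null)
            lines.Add(line);
        return lines;
    }
}
```

The trade-off is that the reader must stay open for the lifetime of the editor view, and jumping to an arbitrary position still requires reading up to that point.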
Answered by t0mm13b
You might be better off using memory-mapped file handling here. Memory-mapped file support will be in .NET 4 (I think... I heard that through someone else talking about it); hence this wrapper, which uses p/invoke to do the same job.
Edit: See here on the MSDN for how it works, and here's the blog entry indicating how it is done in the upcoming .NET 4 when it comes out as a release. The link I gave earlier is a wrapper around the p/invoke to achieve this. You can map the entire file into memory and view it like a sliding window when scrolling through the file.
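Once the built-in support arrived (System.IO.MemoryMappedFiles in .NET 4), the sliding-window idea might look roughly like this; the method name and window handling are illustrative assumptions, not the wrapper the answer links to:

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

static class FileWindow
{
    // Maps the file and copies out just one window of it - say, the region
    // the user has scrolled to - without reading the rest of the file.
    public static byte[] ReadWindow(string path, long offset, int size)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewStream(offset, size, MemoryMappedFileAccess.Read))
        {
            var bytes = new byte[size];
            int total = 0, read;
            while (total < size && (read = view.Read(bytes, total, size - total)) > 0)
                total += read;
            return bytes;
        }
    }
}
```

In a real editor you would keep the MemoryMappedFile open and create a new view per scroll position rather than re-mapping on every read.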
Answered by James
Have a look at the following code snippet. You mentioned "Most files will be 30-40 MB". This claims to read 180 MB in 1.4 seconds on an Intel Quad Core:
private int _bufferSize = 16384;

private void ReadFile(string filename)
{
    StringBuilder stringBuilder = new StringBuilder();
    FileStream fileStream = new FileStream(filename, FileMode.Open, FileAccess.Read);
    using (StreamReader streamReader = new StreamReader(fileStream))
    {
        char[] fileContents = new char[_bufferSize];
        int charsRead = streamReader.Read(fileContents, 0, _bufferSize);

        // Can't do much with 0 bytes
        if (charsRead == 0)
            throw new Exception("File is 0 bytes");

        while (charsRead > 0)
        {
            // Append only the chars actually read; appending the whole
            // buffer would add stale characters on the final, partial read.
            stringBuilder.Append(fileContents, 0, charsRead);
            charsRead = streamReader.Read(fileContents, 0, _bufferSize);
        }
    }
}
Answered by ChaosPandion
This should be enough to get you started.
class Program
{
    static void Main(String[] args)
    {
        const int bufferSize = 1024;
        var sb = new StringBuilder();
        var buffer = new Char[bufferSize];
        var length = 0L;
        var totalRead = 0L;
        var count = bufferSize;

        using (var sr = new StreamReader(@"C:\Temp\file.txt"))
        {
            length = sr.BaseStream.Length;   // available for progress reporting
            while (count > 0)
            {
                count = sr.Read(buffer, 0, bufferSize);
                sb.Append(buffer, 0, count);
                totalRead += count;
            }
        }

        Console.ReadKey();
    }
}
Answered by Christian Hayter
You say you have been asked to show a progress bar while a large file is loading. Is that because the users genuinely want to see the exact % of file loading, or just because they want visual feedback that something is happening?
If the latter is true, then the solution becomes much simpler. Just do reader.ReadToEnd() on a background thread, and display a marquee-type progress bar instead of a proper one.
I raise this point because in my experience this is often the case. When you are writing a data processing program, then users will definitely be interested in a % complete figure, but for simple-but-slow UI updates, they are more likely to just want to know that the computer hasn't crashed. :-)
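A rough, console-flavoured sketch of that idea follows; in a real app the dots would be a marquee-style ProgressBar animating on the UI thread, and the names here are illustrative:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class BackgroundLoad
{
    // ReadToEnd runs on a worker thread; the calling (UI) thread stays free
    // to animate an indeterminate progress indicator until it finishes.
    public static string Load(string path)
    {
        var task = Task.Run(() => File.ReadAllText(path));
        while (!task.Wait(50))          // wait a little, then "repaint"
        {
            Console.Write(".");         // stand-in for a marquee progress bar
        }
        return task.Result;
    }
}
```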
Answered by Extremeswank
An iterator might be perfect for this type of work:
public static IEnumerable<int> LoadFileWithProgress(string filename, StringBuilder stringData)
{
    const int charBufferSize = 4096;
    using (FileStream fs = File.OpenRead(filename))
    using (BinaryReader br = new BinaryReader(fs))
    {
        long length = fs.Length;
        int numberOfChunks = Convert.ToInt32(length / charBufferSize) + 1;
        double iter = 100 / Convert.ToDouble(numberOfChunks);
        double currentIter = 0;
        yield return Convert.ToInt32(currentIter);

        while (true)
        {
            char[] buffer = br.ReadChars(charBufferSize);
            if (buffer.Length == 0) break;
            stringData.Append(buffer);
            currentIter += iter;
            yield return Convert.ToInt32(currentIter);
        }
    }
}
You can call it using the following:
string filename = @"C:\myfile.txt";   // verbatim string, so the backslash isn't an escape
StringBuilder sb = new StringBuilder();
foreach (int progress in LoadFileWithProgress(filename, sb))
{
    // Update your progress counter here!
}
string fileData = sb.ToString();
As the file is loaded, the iterator will return the progress number from 0 to 100, which you can use to update your progress bar. Once the loop has finished, the StringBuilder will contain the contents of the text file.
Also, because you want text, we can just use BinaryReader to read in characters, which will ensure that your buffers line up correctly when reading any multi-byte characters (UTF-8, UTF-16, etc.).
This is all done without using background tasks, threads, or complex custom state machines.
Answered by Eric J.
You can improve read speed by using a BufferedStream, like this:
using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // process each line here
    }
}
March 2013 UPDATE
I recently wrote code for reading and processing (searching for text in) 1 GB-ish text files (much larger than the files involved here) and achieved a significant performance gain by using a producer/consumer pattern. The producer task read in lines of text using the BufferedStream and handed them off to a separate consumer task that did the searching.
I used this as an opportunity to learn TPL Dataflow, which is very well suited for quickly coding this pattern.
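The same producer/consumer split can be sketched with nothing more than a BlockingCollection (TPL Dataflow's BufferBlock/ActionBlock give the same shape with less plumbing). This is an illustrative sketch, not the answer's actual code, and the search logic is an assumption:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static class ProducerConsumerSearch
{
    // The producer reads lines while the consumer searches already-read
    // lines, so I/O and CPU work overlap instead of running back-to-back.
    public static List<string> FindLines(string path, string needle)
    {
        var queue = new BlockingCollection<string>(boundedCapacity: 1024);
        var hits = new List<string>();

        var consumer = Task.Run(() =>
        {
            foreach (var line in queue.GetConsumingEnumerable())
                if (line.Contains(needle))
                    hits.Add(line);
        });

        foreach (var line in File.ReadLines(path))   // producer
            queue.Add(line);                          // blocks if queue is full

        queue.CompleteAdding();
        consumer.Wait();
        return hits;
    }
}
```

The bounded capacity keeps the producer from racing ahead of the consumer and holding the whole file in the queue.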
Why BufferedStream is faster
A buffer is a block of bytes in memory used to cache data, thereby reducing the number of calls to the operating system. Buffers improve read and write performance. A buffer can be used for either reading or writing, but never both simultaneously. The Read and Write methods of BufferedStream automatically maintain the buffer.
December 2014 UPDATE: Your Mileage May Vary
Based on the comments, FileStream should be using a BufferedStream internally. At the time this answer was first provided, I measured a significant performance boost by adding a BufferedStream. At the time I was targeting .NET 3.x on a 32-bit platform. Today, targeting .NET 4.5 on a 64-bit platform, I do not see any improvement.
Related
I came across a case where streaming a large, generated CSV file to the Response stream from an ASP.Net MVC action was very slow. Adding a BufferedStream improved performance by 100x in this instance. For more see Unbuffered Output Very Slow
Answered by Eric J.
If you read the performance and benchmark stats on this website, you'll see that the fastest way to read a text file (because reading, writing, and processing are all different) is the following snippet of code:
using (StreamReader sr = File.OpenText(fileName))
{
    string s = String.Empty;
    while ((s = sr.ReadLine()) != null)
    {
        // do your stuff here
    }
}
All up, about 9 different methods were benchmarked, but that one seems to come out ahead the majority of the time, even outperforming the buffered reader, as other readers have mentioned.
Answered by StainlessBeer
For binary files, the fastest way of reading them I have found is this.
using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(file))
using (MemoryMappedViewStream mms = mmf.CreateViewStream())
using (BinaryReader b = new BinaryReader(mms))
{
    // read from b here; the using blocks dispose the map and view
}
In my tests it's hundreds of times faster.
Answered by Rusty Nail
All excellent answers! However, for someone looking for an answer, these appear to be somewhat incomplete.
As a standard String can only be of a limited size (2 GB to 4 GB depending on your configuration), these answers do not really fulfil the OP's question. One method is to work with a List of Strings:
List<string> Words = new List<string>();

using (StreamReader sr = new StreamReader(@"C:\Temp\file.txt"))
{
    string line = string.Empty;
    while ((line = sr.ReadLine()) != null)
    {
        Words.Add(line);
    }
}
Some may want to tokenise and split each line when processing. The string list can now contain very large volumes of text.
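Tokenising while reading might look like this sketch; the class name and separator handling are illustrative assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LineTokeniser
{
    // Splits each line into tokens as it is read, so the tokens - not one
    // giant string - are what accumulate in memory.
    public static List<string[]> ReadTokens(TextReader reader, params char[] separators)
    {
        var rows = new List<string[]>();
        string line;
        while ((line = reader.ReadLine()) != null)
            rows.Add(line.Split(separators, StringSplitOptions.RemoveEmptyEntries));
        return rows;
    }
}
```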