How do you search a large text file for a string without going line by line in C#?

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/2095437/

Time: 2020-08-06 23:27:50  Source: igfitidea

c# search text

Asked by

I have a large text file that I need to search for a specific string. Is there a fast way to do this without reading line by line?

Reading line by line is extremely slow because of the size of the files (more than 100 MB).

Answered by Wayne Cornish

Given the size of the files, would you really want to read them entirely into memory beforehand? Line by line is likely to be the best approach here.
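
For reference, a minimal sketch of the line-by-line approach this answer recommends. The class and method names (LineSearch, FindFirstLine) are made up for illustration; File.ReadLines is the standard lazy line enumerator, so only one line is held in memory at a time.

```csharp
using System;
using System.IO;

public static class LineSearch
{
    // Returns the 1-based number of the first line containing needle, or -1 if no line does.
    public static long FindFirstLine(string path, string needle)
    {
        long lineNumber = 0;
        // File.ReadLines streams the file lazily instead of loading it all at once.
        foreach (string line in File.ReadLines(path))
        {
            lineNumber++;
            if (line.IndexOf(needle, StringComparison.Ordinal) >= 0)
                return lineNumber;
        }
        return -1;
    }
}
```

Note that this cannot find a string that spans a line break, since each line is checked on its own.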

Answered by Pavel Radzivilovsky

In all cases, you will have to go over the whole file.

Look up Rabin-Karp string search or something similar.
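
A rough sketch of Rabin-Karp, to show what the suggestion amounts to. It works on an in-memory string for brevity; for a large file it would have to be applied to the text as it is read. The class name, the base and the modulus are arbitrary choices for this example, not something the answer specifies.

```csharp
using System;

public static class RabinKarp
{
    private const long Base = 256;
    private const long Modulus = 1000000007;

    // Returns the index of the first occurrence of pattern in text, or -1.
    public static int IndexOf(string text, string pattern)
    {
        int m = pattern.Length, n = text.Length;
        if (m == 0) return 0;
        if (m > n) return -1;

        long high = 1; // Base^(m-1) mod Modulus, the weight of the window's first character
        for (int i = 1; i < m; i++) high = high * Base % Modulus;

        long patternHash = 0, windowHash = 0;
        for (int i = 0; i < m; i++)
        {
            patternHash = (patternHash * Base + pattern[i]) % Modulus;
            windowHash = (windowHash * Base + text[i]) % Modulus;
        }

        for (int start = 0; ; start++)
        {
            // Hashes can collide, so a candidate is confirmed with a real comparison.
            if (windowHash == patternHash &&
                string.CompareOrdinal(text, start, pattern, 0, m) == 0)
                return start;

            if (start + m >= n) return -1;

            // Slide the window: drop text[start], append text[start + m].
            windowHash = (windowHash - text[start] * high % Modulus + Modulus) % Modulus;
            windowHash = (windowHash * Base + text[start + m]) % Modulus;
        }
    }
}
```

The rolling hash lets each window be checked in constant time, but every character of the input is still visited once, which is the point the answer makes.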

Answered by Matthias

If you want to speed up the line-by-line reading, you can create a queue-based application:
One thread reads the lines and enqueues them into a thread-safe queue. A second thread can then process the strings.
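
One way the queue idea could look, assuming BlockingCollection<string> as the thread-safe queue (the answer does not name a specific type). The class name, method name and the capacity of 10000 are placeholders. Whether this is actually faster depends on whether reading or searching is the bottleneck.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class QueuedSearch
{
    // Counts the lines that contain needle, reading and searching on separate threads.
    public static long CountMatchingLines(string path, string needle)
    {
        // Bounded so the reader cannot run far ahead of the searcher and use up memory.
        using (var queue = new BlockingCollection<string>(10000))
        {
            Task producer = Task.Run(() =>
            {
                try
                {
                    foreach (string line in File.ReadLines(path))
                        queue.Add(line);
                }
                finally
                {
                    queue.CompleteAdding(); // lets the consuming loop below finish
                }
            });

            long matches = 0;
            foreach (string line in queue.GetConsumingEnumerable())
            {
                if (line.IndexOf(needle, StringComparison.Ordinal) >= 0)
                    matches++;
            }

            producer.Wait(); // rethrows any read error from the producer thread
            return matches;
        }
    }
}
```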

Answered by John Feminella

I have a large text file that I need to search for a specific string. Is there a fast way to do this without reading line by line?

The only way to avoid searching across the entire file is to sort or organize the input beforehand. For example, if this is an XML file and you need to do many of these searches, it would make sense to parse the XML file into a DOM tree. Or if this is a list of words and you're looking for all the words that start with the letters "aero", it might make sense to sort the entire input first if you do a lot of that kind of searching on the same file.
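
To illustrate the word-list case: once the words are sorted, a binary search finds the first candidate and all matches follow it contiguously. This is only a sketch; PrefixSearch and WordsStartingWith are hypothetical names, and it assumes the array was sorted with ordinal comparison.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixSearch
{
    // sortedWords must already be sorted with StringComparer.Ordinal.
    public static List<string> WordsStartingWith(string[] sortedWords, string prefix)
    {
        // Binary search for the first word that is >= prefix (lower bound).
        int lo = 0, hi = sortedWords.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(sortedWords[mid], prefix) < 0) lo = mid + 1;
            else hi = mid;
        }

        // Every word with the prefix sits in one contiguous run starting there.
        var result = new List<string>();
        for (int i = lo; i < sortedWords.Length &&
             sortedWords[i].StartsWith(prefix, StringComparison.Ordinal); i++)
            result.Add(sortedWords[i]);
        return result;
    }
}
```

The sort (for example Array.Sort(words, StringComparer.Ordinal)) is paid once; each later lookup then costs a logarithmic search plus the number of matches.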

Answered by Sheff

The speed issue here could well be the time taken to load the file into memory before performing the search. Try profiling your application to see where the bottleneck is. If it is loading the file, you could try "chunking" the file load so that the file is streamed in small chunks and each chunk has the search performed on it.

Obviously, if the string to be found is at the end of the file, there will be no performance gain.
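
A sketch of chunked searching under a few assumptions: the file is UTF-8 text, the chunk size of 64 K characters is arbitrary, and ChunkedSearch is a made-up name. The detail that is easy to miss is the overlap: the last needle.Length - 1 characters of each chunk are carried into the next one so that a match straddling a chunk boundary is not lost.

```csharp
using System;
using System.IO;
using System.Text;

public static class ChunkedSearch
{
    // Returns the character offset (not byte offset) of the first match, or -1.
    public static long IndexOf(string path, string needle, int chunkSize = 64 * 1024)
    {
        if (needle.Length == 0) return 0;
        int overlap = needle.Length - 1;
        char[] buffer = new char[overlap + chunkSize];
        int carried = 0;       // characters kept from the previous chunk
        long bufferStart = 0;  // offset within the file's text of buffer[0]

        using (var reader = new StreamReader(path, Encoding.UTF8))
        {
            int read;
            while ((read = reader.Read(buffer, carried, chunkSize)) > 0)
            {
                int length = carried + read;
                int hit = new string(buffer, 0, length).IndexOf(needle, StringComparison.Ordinal);
                if (hit >= 0) return bufferStart + hit;

                // Keep the tail so a match split across two chunks is still found.
                int keep = Math.Min(overlap, length);
                Array.Copy(buffer, length - keep, buffer, 0, keep);
                bufferStart += length - keep;
                carried = keep;
            }
        }
        return -1;
    }
}
```

Because the method returns at the first hit, a match near the start of the file avoids reading the rest, which is where the gain over loading the whole file comes from.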

Answered by Chris Kannon

You could buffer a large amount of data from the file into memory at one time, up to whatever limit you wish, and then search it for the string.

This would reduce the number of reads on the file and would likely be faster, but it would use more memory if you set the buffer size too high.

Answered by Brian Hasden

You should be able to read the file character by character, matching each character against the search string until you reach the end of the search string, in which case you have a match. If at any point the character you've read doesn't match the character you're looking for, reset the matched count to 0 and start again. For example (pseudocode, not tested):

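// Requires: using System.IO;
// fileName is assumed to hold the path of the file to search.
byte[] lookingFor = System.Text.Encoding.UTF8.GetBytes("hello world");
int index = 0;        // number of bytes of lookingFor matched so far
long position = -1;   // byte offset of the match, or -1 if none was found
bool matchFound = false;

using (FileStream fileStream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
  int current;
  while ((current = fileStream.ReadByte()) != -1)  // -1 signals end of file
  {
    if (current == lookingFor[index])
    {
      index++;

      if (index == lookingFor.Length)
      {
        matchFound = true;
        position = fileStream.Position - lookingFor.Length;
        break;
      }
    }
    else if (index > 0)
    {
      // Partial match failed: rewind to the byte after where it began, then start again.
      fileStream.Seek(-index, SeekOrigin.Current);
      index = 0;
    }
  }
}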

That is one of many algorithms you could use (although it may be off by one with the length check). It will only find the first match, so you probably want to wrap the while loop in another loop to find multiple matches.

Also, one thing to note about reading the file line by line is that if the string you want to match spans lines, you're not going to find it. If that's fine, then you can search line by line, but if you need search strings to span lines, you'll want to use an algorithm like the one I detailed above.

Finally, if you're looking for the best speed, which it sounds like you are, you'll want to migrate the code above to use a StreamReader or some other buffered reader.
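
A possible shape for the buffered-reader version. A TextReader only moves forward, so this sketch swaps in the Knuth-Morris-Pratt (KMP) algorithm, which this answer does not mention: instead of re-reading input after a failed partial match, it uses a precomputed table to decide how much of the match can be kept. StreamingSearch is a hypothetical name, and the offset returned is in characters, not bytes.

```csharp
using System;
using System.IO;

public static class StreamingSearch
{
    // Returns the character offset of the first occurrence of needle in reader, or -1.
    public static long IndexOf(TextReader reader, string needle)
    {
        if (needle.Length == 0) return 0;

        // failure[i] = length of the longest proper prefix of needle[0..i] that is also its suffix.
        int[] failure = new int[needle.Length];
        for (int i = 1, k = 0; i < needle.Length; i++)
        {
            while (k > 0 && needle[i] != needle[k]) k = failure[k - 1];
            if (needle[i] == needle[k]) k++;
            failure[i] = k;
        }

        int matched = 0;  // characters of needle matched so far
        long offset = 0;  // characters consumed from the reader so far
        int next;
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            offset++;
            // On a mismatch, fall back to the longest shorter partial match that still fits.
            while (matched > 0 && c != needle[matched]) matched = failure[matched - 1];
            if (c == needle[matched]) matched++;
            if (matched == needle.Length) return offset - needle.Length;
        }
        return -1;
    }
}
```

It could be called as StreamingSearch.IndexOf(new StreamReader(path), "hello world"), with the reader wrapped in a using block. Line breaks are treated like any other character, so a search string that spans lines is found as well.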

Answered by Daniel Earwicker

Does your project need to search different files for the same or different strings every time, or the same file for different strings every time?

If it's the latter, you could build an index of the file. But there's no point doing this if the file changes frequently, because building the index will be expensive.

To index a file for full-text searching, you could use the Lucene.NET library.

http://incubator.apache.org/lucene.net/
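
To make the indexing idea concrete without depending on Lucene.NET, here is a toy inverted index built with the standard library only. It is not how Lucene.NET works internally and the names (WordIndex, Build) are invented for this example; it just shows why repeated searches on an unchanged file get cheap once the index exists.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class WordIndex
{
    // Maps each word (case-insensitive) to the 1-based numbers of the lines containing it.
    public static Dictionary<string, List<int>> Build(string path)
    {
        var index = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
        char[] separators = { ' ', '\t', ',', '.', ';', ':', '!', '?', '(', ')', '"' };
        int lineNumber = 0;
        foreach (string line in File.ReadLines(path))
        {
            lineNumber++;
            foreach (string word in line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
            {
                List<int> lines;
                if (!index.TryGetValue(word, out lines))
                    index[word] = lines = new List<int>();
                // Record each line once per word, even if the word repeats on that line.
                if (lines.Count == 0 || lines[lines.Count - 1] != lineNumber)
                    lines.Add(lineNumber);
            }
        }
        return index;
    }
}
```

Building the index reads the whole file once; after that, looking up a word is a single dictionary access. It only answers whole-word queries, which is one of the reasons a real full-text library is the better choice beyond a demonstration.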

Answered by Dathan

If you're only looking for a specific string, I'd say line-by-line is the best and most efficient mechanism. On the other hand, if you're going to be looking for multiple strings, particularly at several different points in the application, you might want to look into Lucene.Net to create an index and then query the index. If this is a one-off run (i.e., you won't need to query the same file again later), you can create the index in a temporary file that will be cleaned up automatically by the system (usually at boot time, or you can delete it yourself when your program exits). If you need to search the same file again later, you can save the index in a known location and get much better performance the second time around.

Answered by Ed Power

Stick it into SQL Server 2005/2008 and use its full-text search capability.