
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/2106877/ (Source: igfitidea, 2020-08-06)

Is there a faster way than this to find all the files in a directory and all sub directories?

c# .net file-io directory

Asked by Eric Anastas

I'm writing a program that needs to search a directory and all its sub directories for files that have a certain extension. This is going to be used both on a local, and a network drive, so performance is a bit of an issue.

Here's the recursive method I'm using now:

private void GetFileList(string fileSearchPattern, string rootFolderPath, List<FileInfo> files)
{
    DirectoryInfo di = new DirectoryInfo(rootFolderPath);

    FileInfo[] fiArr = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    files.AddRange(fiArr);

    DirectoryInfo[] diArr = di.GetDirectories();

    foreach (DirectoryInfo info in diArr)
    {
        GetFileList(fileSearchPattern, info.FullName, files);
    }
}

I could set the SearchOption to AllDirectories and not use a recursive method, but in the future I'll want to insert some code to notify the user what folder is currently being scanned.

While I'm creating a list of FileInfo objects now, all I really care about is the paths to the files. I'll have an existing list of files, which I want to compare to the new list of files to see which files were added or deleted. Is there any faster way to generate this list of file paths? Is there anything I can do to optimize this file search around querying for the files on a shared network drive?
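Since the end goal is diffing an old snapshot of paths against a new one, a plain set difference over the path strings is usually enough. A minimal sketch (the SnapshotDiff helper and the sample paths are hypothetical; it assumes case-insensitive Windows paths, so swap the comparer for other file systems):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SnapshotDiff
{
    // Paths present in the new snapshot but not the old one.
    public static List<string> Added(IEnumerable<string> oldPaths, IEnumerable<string> newPaths)
    {
        return newPaths.Except(oldPaths, StringComparer.OrdinalIgnoreCase).ToList();
    }

    // Paths present in the old snapshot but not the new one.
    public static List<string> Deleted(IEnumerable<string> oldPaths, IEnumerable<string> newPaths)
    {
        return oldPaths.Except(newPaths, StringComparer.OrdinalIgnoreCase).ToList();
    }

    static void Main()
    {
        string[] before = { @"C:\data\a.txt", @"C:\data\b.txt" };
        string[] after  = { @"C:\data\B.TXT", @"C:\data\c.txt" };

        Console.WriteLine(string.Join(", ", Added(before, after)));   // C:\data\c.txt
        Console.WriteLine(string.Join(", ", Deleted(before, after))); // C:\data\a.txt
    }
}
```

Because the comparison only needs strings, collecting plain paths (rather than FileInfo objects) is sufficient for this part of the job.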



Update 1

I tried creating a non-recursive method that does the same thing by first finding all the sub directories and then iteratively scanning each directory for files. Here's the method:

public static List<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    DirectoryInfo rootDir = new DirectoryInfo(rootFolderPath);

    List<DirectoryInfo> dirList = new List<DirectoryInfo>(rootDir.GetDirectories("*", SearchOption.AllDirectories));
    dirList.Add(rootDir);

    List<FileInfo> fileList = new List<FileInfo>();

    foreach (DirectoryInfo dir in dirList)
    {
        fileList.AddRange(dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly));
    }

    return fileList;
}


Update 2

Alright, so I've run some tests on a local and a remote folder, both of which have a lot of files (~1200). Here are the methods I've run the tests on; the results are below.

  • GetFileListA(): Non-recursive solution in the update above. I think it's equivalent to Jay's solution.
  • GetFileListB(): Recursive method from the original question
  • GetFileListC(): Gets all the directories with the static Directory.GetDirectories() method. Then gets all the file paths with the static Directory.GetFiles() method. Populates and returns a List
  • GetFileListD(): Marc Gravell's solution using a queue and returning IEnumerable. I populated a List with the resulting IEnumerable
  • DirectoryInfo.GetFiles: No additional method created. Instantiated a DirectoryInfo from the root folder path. Called GetFiles using SearchOption.AllDirectories
  • Directory.GetFiles: No additional method created. Called the static GetFiles method of the Directory class using SearchOption.AllDirectories
Method                       Local Folder       Remote Folder
GetFileListA()               00:00.0781235      05:22.9000502
GetFileListB()               00:00.0624988      03:43.5425829
GetFileListC()               00:00.0624988      05:19.7282361
GetFileListD()               00:00.0468741      03:38.1208120
DirectoryInfo.GetFiles       00:00.0468741      03:45.4644210
Directory.GetFiles           00:00.0312494      03:48.0737459

...so it looks like Marc's is the fastest.

Accepted answer by Marc Gravell

Try this iterator block version that avoids recursion and the Info objects:

public static IEnumerable<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    Queue<string> pending = new Queue<string>();
    pending.Enqueue(rootFolderPath);
    string[] tmp;
    while (pending.Count > 0)
    {
        rootFolderPath = pending.Dequeue();
        try
        {
            tmp = Directory.GetFiles(rootFolderPath, fileSearchPattern);
        }
        catch (UnauthorizedAccessException)
        {
            continue;
        }
        for (int i = 0; i < tmp.Length; i++)
        {
            yield return tmp[i];
        }
        tmp = Directory.GetDirectories(rootFolderPath);
        for (int i = 0; i < tmp.Length; i++)
        {
            pending.Enqueue(tmp[i]);
        }
    }
}

Note also that .NET 4.0 has built-in iterator block versions (EnumerateFiles, EnumerateFileSystemEntries) that may be faster (more direct access to the file system; fewer arrays).
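A sketch of what that 4.0-style streaming enumeration looks like (the throw-away temp tree is only there to make the example self-contained):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class EnumerateDemo
{
    // EnumerateFiles returns a lazy IEnumerable<string>: paths stream out as
    // the tree is walked instead of being buffered into one big array first.
    public static IEnumerable<string> Scan(string root, string pattern)
    {
        return Directory.EnumerateFiles(root, pattern, SearchOption.AllDirectories);
    }

    static void Main()
    {
        // Build a small throw-away tree so the example is self-contained.
        string root = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(Path.Combine(root, "sub"));
        File.WriteAllText(Path.Combine(root, "a.txt"), "");
        File.WriteAllText(Path.Combine(root, "sub", "b.txt"), "");
        File.WriteAllText(Path.Combine(root, "c.log"), "");

        // Only the two .txt files match; each path is printed as it is found.
        foreach (string path in Scan(root, "*.txt"))
        {
            Console.WriteLine(path);
        }

        Directory.Delete(root, true);
    }
}
```

Because the results stream out lazily, this is also a natural place to hang the "currently scanning folder X" progress notification the question asks about.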

Answered by Jay

I'd be inclined to return an IEnumerable<> in this case -- depending on how you're consuming the results it could be an improvement, plus you reduce your parameter footprint by 1/3 and avoid passing that List around incessantly.

private IEnumerable<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    DirectoryInfo di = new DirectoryInfo(rootFolderPath);

    var fiArr = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    foreach (FileInfo fi in fiArr)
    {
        yield return fi;
    }

    var diArr = di.GetDirectories();

    // The loop variable must not shadow the outer 'di'.
    foreach (DirectoryInfo dir in diArr)
    {
        var nextRound = GetFileList(fileSearchPattern, dir.FullName);
        foreach (FileInfo fi in nextRound)
        {
            yield return fi;
        }
    }
}

Another idea would be to spin off BackgroundWorker objects to troll through directories. You wouldn't want a new thread for every directory, but you might create them at the top level (first pass through GetFileList()), so if you execute on your C:\ drive, with 12 directories, each of those directories will be searched by a different thread, which will then recurse through subdirectories. You'll have one thread going through C:\Windows while another goes through C:\Program Files. There are a lot of variables as to how this will affect performance -- you'd have to test it to see.

Answered by ata

You can use parallel foreach (.NET 4.0), or you can try the Poor Man's Parallel.ForEach Iterator for .NET 3.5. That can speed up your search.
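One possible shape for that on .NET 4.0, using PLINQ to fan out over the top-level subdirectories (the ParallelScan class is a hypothetical sketch; as noted elsewhere in this thread, whether parallelism helps at all depends heavily on the drive, so benchmark it first):

```csharp
using System;
using System.IO;
using System.Linq;

static class ParallelScan
{
    // Fan out over the top-level subdirectories with PLINQ; each worker then
    // does an ordinary recursive GetFiles below its own subtree.
    public static string[] Scan(string root, string pattern)
    {
        var topLevel = Directory.GetFiles(root, pattern, SearchOption.TopDirectoryOnly);

        var nested = Directory.GetDirectories(root)
            .AsParallel()
            .SelectMany(dir => Directory.GetFiles(dir, pattern, SearchOption.AllDirectories))
            .ToArray();

        return topLevel.Concat(nested).ToArray();
    }

    static void Main()
    {
        // Demo against a throw-away tree.
        string root = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(Path.Combine(root, "sub"));
        File.WriteAllText(Path.Combine(root, "a.txt"), "");
        File.WriteAllText(Path.Combine(root, "sub", "b.txt"), "");

        Console.WriteLine(Scan(root, "*.txt").Length); // 2

        Directory.Delete(root, true);
    }
}
```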

Answered by Jay

Consider splitting the updated method into two iterators:

private static IEnumerable<DirectoryInfo> GetDirs(string rootFolderPath)
{
    DirectoryInfo rootDir = new DirectoryInfo(rootFolderPath);
    yield return rootDir;

    // No stray semicolon after the foreach header, or the loop body never runs.
    foreach (DirectoryInfo di in rootDir.GetDirectories("*", SearchOption.AllDirectories))
    {
        yield return di;
    }
}

public static IEnumerable<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    var allDirs = GetDirs(rootFolderPath);
    foreach (DirectoryInfo di in allDirs)
    {
        var files = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
        foreach (FileInfo fi in files)
        {
            yield return fi;
        }
    }
}

Also, further to the network-specific scenario, if you were able to install a small service on that server that you could call into from a client machine, you'd get much closer to your "local folder" results, because the search could execute on the server and just return the results to you. This would be your biggest speed boost in the network folder scenario, but may not be available in your situation. I've been using a file synchronization program that includes this option -- once I installed the service on my server, the program became WAY faster at identifying the files that were new, deleted, and out-of-sync.

Answered by Brad Cunningham

Cool question.

I played around a little, and by leveraging iterator blocks and LINQ I appear to have improved your revised implementation by about 40%.

I would be interested to have you test it out using your timing methods and on your network to see what the difference looks like.

Here is the meat of it:

private static IEnumerable<FileInfo> GetFileList(string searchPattern, string rootFolderPath)
{
    var rootDir = new DirectoryInfo(rootFolderPath);
    var dirList = rootDir.GetDirectories("*", SearchOption.AllDirectories);

    return ReturnFiles(dirList, searchPattern).SelectMany(files => files);
}

private static IEnumerable<FileInfo[]> ReturnFiles(DirectoryInfo[] dirList, string fileSearchPattern)
{
    foreach (DirectoryInfo dir in dirList)
    {
        yield return dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    }
}

Answered by Paul Rohde

The short answer of how to improve the performance of that code is: you can't.

The real performance hit you're experiencing is the actual latency of the disk or network, so no matter which way you flip it, you have to check and iterate through each file item and retrieve directory and file listings. (That is of course excluding hardware or driver modifications to reduce or improve disk latency, but a lot of people are already paid a lot of money to solve those problems, so we'll ignore that side of it for now.)

Given the original constraints, there are several solutions already posted that more or less elegantly wrap the iteration process. (However, since I assume that I'm reading from a single hard drive, parallelism will NOT help to traverse a directory tree more quickly, and may even increase that time, since you now have two or more threads fighting for data on different parts of the drive as it attempts to seek back and forth.) We can also reduce the number of objects created, and so on. However, if we evaluate how the function will be consumed by the end developer, there are some optimizations and generalizations that we can come up with.

First, we can delay execution by returning an IEnumerable; yield return accomplishes this by compiling a state-machine enumerator inside an anonymous class that implements IEnumerable, which gets returned when the method executes. Most methods in LINQ are written to delay execution until the iteration is performed, so the code in a Select or SelectMany will not run until the IEnumerable is iterated through. The benefit of delayed execution is only felt if you need to take a subset of the data at a later time: for instance, if you only need the first 10 results, a delayed query that would return several thousand results won't iterate through all of them until you need more than ten.
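That deferred-execution payoff can be sketched concretely. Combined with Take, a lazy query stops walking the directory tree as soon as it has enough matches (this sketch uses .NET 4.0's Directory.EnumerateFiles rather than the answer's own BetterFileList; the temp tree just keeps the example self-contained):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LazyScan
{
    // Nothing runs until enumeration, and Take(n) stops the directory walk
    // after the first n matches instead of materializing every result.
    public static List<string> FirstN(string root, string pattern, int n)
    {
        return Directory.EnumerateFiles(root, pattern, SearchOption.AllDirectories)
                        .Take(n)
                        .ToList();
    }

    static void Main()
    {
        string root = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(root);
        for (int i = 0; i < 100; i++)
        {
            File.WriteAllText(Path.Combine(root, "f" + i + ".txt"), "");
        }

        // Only the first ten matches are ever pulled from the enumerator.
        Console.WriteLine(FirstN(root, "*.txt", 10).Count); // 10

        Directory.Delete(root, true);
    }
}
```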

Now, given that you want to do a subfolder search, I can also infer that it may be useful if you can specify that depth; doing so generalizes my problem, but also necessitates a recursive solution. Then, later, when someone decides that it now needs to search two directories deep because we increased the number of files and decided to add another layer of categorization, you can simply make a slight modification instead of re-writing the function.

In light of all that, here is the solution I came up with; it is more general than some of the others above:

public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, string rootFolderPath)
{
    return BetterFileList(fileSearchPattern, new DirectoryInfo(rootFolderPath), 1);
}

public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, DirectoryInfo directory, int depth)
{
    return depth == 0
        ? directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly)
        : directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly).Concat(
            directory.GetDirectories().SelectMany(x => BetterFileList(fileSearchPattern, x, depth - 1)));
}

On a side note, something else that no one has mentioned so far is file permissions and security. Currently there's no checking, handling, or permission requests, and the code will throw file-permission exceptions if it encounters a directory it doesn't have access to iterate through.
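One way to handle that, in the same spirit as the try/catch in Marc's accepted answer, is a small wrapper that skips unreadable (or vanished) directories instead of letting one bad folder abort the whole scan (the SafeScan class is a hypothetical sketch):

```csharp
using System;
using System.IO;

static class SafeScan
{
    // Swallow the exceptions a protected or vanished directory can throw,
    // returning an empty result so the caller's loop simply moves on.
    public static FileInfo[] SafeGetFiles(DirectoryInfo dir, string pattern)
    {
        try
        {
            return dir.GetFiles(pattern, SearchOption.TopDirectoryOnly);
        }
        catch (UnauthorizedAccessException)
        {
            return new FileInfo[0];
        }
        catch (DirectoryNotFoundException)
        {
            return new FileInfo[0];
        }
    }

    static void Main()
    {
        // A directory that doesn't exist yields an empty result instead of throwing.
        var missing = new DirectoryInfo(Path.Combine(Path.GetTempPath(), Path.GetRandomFileName()));
        Console.WriteLine(SafeGetFiles(missing, "*.*").Length); // 0
    }
}
```

In a real scanner you would probably also log the skipped path, so missing results are explainable rather than silent.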

Answered by Jaider

Try Parallel programming:

private string _fileSearchPattern;
private List<string> _files;
private object lockThis = new object();

public List<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    _fileSearchPattern = fileSearchPattern;
    _files = new List<string>(); // must be initialized before AddFileList runs
    AddFileList(rootFolderPath);
    return _files;
}

private void AddFileList(string rootFolderPath)
{
    var files = Directory.GetFiles(rootFolderPath, _fileSearchPattern);
    lock (lockThis)
    {
        _files.AddRange(files);
    }

    var directories = Directory.GetDirectories(rootFolderPath);

    Parallel.ForEach(directories, AddFileList); // same as Parallel.ForEach(directories, directory => AddFileList(directory));
}

Answered by user2385360

DirectoryInfo seems to give much more information than you need; try piping a dir command and parsing the info from that.

Answered by Anonymous Coward

The BCL methods are portable, so to speak. If staying 100% managed, I believe the best you can do is calling GetDirectories/Folders while checking access rights (or possibly not checking the rights and having another thread ready to go when the first one takes a little too long -- a sign that it's about to throw an UnauthorizedAccess exception; this might be avoidable with exception filters using VB or, as of today, unreleased C#).

If you want to go faster than GetDirectories, you have to call Win32 (findsomethingEx etc.), which provides specific flags that allow you to skip possibly unnecessary IO while traversing the MFT structures. Also, if the drive is a network share, a similar approach can give a great speedup, this time by also avoiding excessive network roundtrips.

Now, if you have admin rights, use NTFS, and are in a real hurry with millions of files to go through, the absolute fastest way to go through them (assuming spinning rust, where disk latency kills) is to use both the MFT and journaling in combination, essentially replacing the indexing service with one that's targeted for your specific need. If you only need to find filenames and not sizes (or sizes too, but then you must cache them and use the journal to notice changes), this approach could allow for practically instant search of tens of millions of files and folders if implemented ideally. There may be one or two paywares that have bothered with this. There are samples of both MFT reading (DiscUtils) and journal reading (Google) in C# around. I only have about 5 million files, and just using NTFSSearch is good enough for that amount, as it takes about 10-20 seconds to search them. With journal reading added, it would go down to <3 seconds for that amount.

Answered by Bob

It is horrible, and the reason file search is horrible on Windows platforms is that MS made a mistake that they seem unwilling to put right. You should be able to use SearchOption.AllDirectories, and we would all get the speed we want. But you can't do that, because GetDirectories needs a callback so that you can decide what to do about the directories you don't have access to. MS forgot, or didn't think, to test the class on their own computers.

So, we are all left with the nonsense recursive loops.

Within C#/Managed C++ you have very few options; these are also the options that MS takes, because their coders haven't worked out how to get around it either.

The main thing is, with display items such as TreeViews and FileViews, only search and show what users can see. There are plenty of helpers on the controls, including triggers, that tell you when you need to fill in some data.

In trees, starting from collapsed mode, search each directory as and when the user opens it in the tree; that is much faster than waiting for a whole tree to be filled. The same goes for FileViews: I tend towards a 10% rule -- however many items fit in the display area, have another 10% ready in case the user scrolls, and it stays nicely responsive.

MS does the pre-search and directory watch: a little database of directories and files. This means that on open your trees etc. have a good, fast starting point; it falls down a bit on the refresh.

But mix the two ideas: take your directories and files from the database, but do a refresh search as a tree node is expanded (just that tree node) and as a different directory is selected in the tree.

But the better solution is to add your file search system as a service. MS already has this, but as far as I know we do not get access to it; I suspect that is because it is immune to "failed access to directory" errors. Just as with the MS one, if you have a service running at admin level, you need to be careful that you are not giving away your security just for the sake of a little extra speed.