C#: Remove Duplicate Lines From Text File?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1245500/

Tags: c#, duplicates

Asked by Goober

Given an input file of text lines, I want duplicate lines to be identified and removed. Please show a simple snippet of C# that accomplishes this.

Accepted answer by Jon Skeet

This should do (and will cope with large files).

Note that it only removes duplicate consecutive lines, i.e.

a
b
b
c
b
d

will end up as

a
b
c
b
d

If you want no duplicates anywhere, you'll need to keep a set of lines you've already seen.

using System;
using System.IO;

class DeDuper
{
    static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine("Usage: DeDuper <input file> <output file>");
            return;
        }
        using (TextReader reader = File.OpenText(args[0]))
        using (TextWriter writer = File.CreateText(args[1]))
        {
            string currentLine;
            string lastLine = null;

            while ((currentLine = reader.ReadLine()) != null)
            {
                if (currentLine != lastLine)
                {
                    writer.WriteLine(currentLine);
                    lastLine = currentLine;
                }
            }
        }
    }
}

Note that this assumes Encoding.UTF8, and that you want to use files. It's easy to generalize as a method though:

static void CopyLinesRemovingConsecutiveDupes
    (TextReader reader, TextWriter writer)
{
    string currentLine;
    string lastLine = null;

    while ((currentLine = reader.ReadLine()) != null)
    {
        if (currentLine != lastLine)
        {
            writer.WriteLine(currentLine);
            lastLine = currentLine;
        }
    }
}

(Note that that doesn't close anything - the caller should do that.)

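For example, the caller might own the reader and writer like this (a minimal sketch; the file names are placeholders, not from the answer):

using (TextReader reader = File.OpenText("input.txt"))
using (TextWriter writer = File.CreateText("output.txt"))
{
    // The using blocks dispose of (and therefore close) both streams once the copy finishes.
    CopyLinesRemovingConsecutiveDupes(reader, writer);
}
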
Here's a version that will remove all duplicates, rather than just consecutive ones:

static void CopyLinesRemovingAllDupes(TextReader reader, TextWriter writer)
{
    string currentLine;
    HashSet<string> previousLines = new HashSet<string>();

    while ((currentLine = reader.ReadLine()) != null)
    {
        // Add returns true if it was actually added,
        // false if it was already there
        if (previousLines.Add(currentLine))
        {
            writer.WriteLine(currentLine);
        }
    }
}

Answered by Darin Dimitrov

For small files:

// Note: Distinct() requires "using System.Linq;" (File also needs "using System.IO;").
string[] lines = File.ReadAllLines("filename.txt");
File.WriteAllLines("filename.txt", lines.Distinct().ToArray());

Answered by Kelly Gendron

For a long file (and non-consecutive duplicates) I'd copy the file line by line, building a hash/position lookup table as I went.

As each line is copied, check the hashed value; if there is a collision, double-check that the line really is the same and move on to the next. (Only worth it for fairly large files, though.)

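Below is a rough sketch of that idea (my own adaptation, not code from the answer; the class and member names are made up). It assumes a UTF-8 input file with \n or \r\n line endings, keeps only line hashes and byte offsets in memory, and re-reads the earlier line from disk whenever a later line shares its hash code, so a genuine hash collision between different lines never causes a line to be dropped. The byte-by-byte reading keeps the sketch short rather than fast.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class HashOffsetDeDuper
{
    // Reads one line starting at 'offset' and reports where the next line begins.
    static string ReadLineAt(FileStream fs, long offset, out long nextOffset)
    {
        fs.Position = offset;
        var bytes = new List<byte>();
        int b;
        while ((b = fs.ReadByte()) != -1 && b != '\n')
            bytes.Add((byte)b);
        nextOffset = fs.Position;
        if (bytes.Count == 0 && b == -1)
            return null;                                        // end of file
        if (bytes.Count > 0 && bytes[bytes.Count - 1] == '\r')  // trim the CR of a CRLF ending
            bytes.RemoveAt(bytes.Count - 1);
        return Encoding.UTF8.GetString(bytes.ToArray());
    }

    static void Main(string[] args)
    {
        using (var input = new FileStream(args[0], FileMode.Open, FileAccess.Read))
        using (var lookup = new FileStream(args[0], FileMode.Open, FileAccess.Read))
        using (var writer = File.CreateText(args[1]))
        {
            // line hash -> byte offsets (in the input file) of the kept lines with that hash
            var table = new Dictionary<int, List<long>>();

            long offset = 0;
            long next;
            string line;
            while ((line = ReadLineAt(input, offset, out next)) != null)
            {
                long lineStart = offset;
                offset = next;

                int hash = line.GetHashCode();
                bool duplicate = false;

                if (table.TryGetValue(hash, out List<long> positions))
                {
                    // Same hash seen before: re-read each candidate line and compare for real.
                    foreach (long pos in positions)
                    {
                        if (ReadLineAt(lookup, pos, out _) == line)
                        {
                            duplicate = true;
                            break;
                        }
                    }
                }
                else
                {
                    positions = new List<long>();
                    table[hash] = positions;
                }

                if (!duplicate)
                {
                    positions.Add(lineStart);   // remember where this line starts in the input
                    writer.WriteLine(line);
                }
            }
        }
    }
}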

Answered by Steve

Here's a streaming approach that should incur less overhead than reading all unique strings into memory.

    // Requires "using System.IO;" and "using System.Collections.Generic;".
    // Storing only GetHashCode() values keeps memory low, but two different lines can
    // share a hash code, in which case the later one is silently dropped.
    // File.Create truncates an existing output file (File.OpenWrite would leave stale bytes behind).
    using (var sr = new StreamReader(File.OpenRead(@"C:\Temp\in.txt")))
    using (var sw = new StreamWriter(File.Create(@"C:\Temp\out.txt")))
    {
        var lines = new HashSet<int>();
        while (!sr.EndOfStream)
        {
            string line = sr.ReadLine();
            int hc = line.GetHashCode();
            if (lines.Contains(hc))
                continue;

            lines.Add(hc);
            sw.WriteLine(line);
        }
    }

Answered by DeepakTheGeek

I am new to .NET and have written something simpler; it may not be very efficient. Please feel free to share your thoughts.

using System;
using System.Collections.Generic;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        // Note: the path needs a verbatim (@) string; otherwise "\E" is an invalid escape sequence.
        string[] emp_names = File.ReadAllLines(@"D:\Employee Names.txt");
        List<string> newemp1 = new List<string>();

        for (int i = 0; i < emp_names.Length; i++)
        {
            newemp1.Add(emp_names[i]);  // passing data to newemp1 from emp_names
        }

        for (int i = 0; i < emp_names.Length; i++)
        {
            List<string> temp = new List<string>();
            int duplicate_count = 0;

            for (int j = newemp1.Count - 1; j >= 0; j--)
            {
                if (emp_names[i] != newemp1[j])  // checking for duplicate records
                    temp.Add(newemp1[j]);
                else
                {
                    duplicate_count++;
                    if (duplicate_count == 1)    // keep exactly one copy of each duplicated line
                        temp.Add(emp_names[i]);
                }
            }
            newemp1 = temp;
        }
        string[] newemp = newemp1.ToArray();  // assigning into a string array
        Array.Sort(newemp);
        File.WriteAllLines(@"D:\Employee Names.txt", newemp); // now writing the data back to the text file
        Console.ReadLine();
    }
}