Compare two files in the Linux terminal

Question

Asked by Ali Imran

There are two files called "a.txt" and "b.txt"; both contain a list of words. I want to check which words are extra in "a.txt", that is, present in "a.txt" but not in "b.txt".

I need an efficient algorithm, as I need to compare two dictionaries.

Answer 1

Accepted answer by Ali Imran

Here is my solution:

# Create the working directories under $HOME, where the commands below expect them
mkdir -p ~/temp ~/results
cp /usr/share/dict/american-english ~/temp/american-english-dictionary
cp /usr/share/dict/british-english ~/temp/british-english-dictionary
# Count the words in each dictionary
wc -l < ~/temp/american-english-dictionary > ~/results/count-american-english-dictionary
wc -l < ~/temp/british-english-dictionary > ~/results/count-british-english-dictionary
# Words present in both dictionaries (-F fixed strings, -x whole lines, -f patterns from file)
grep -Fxf ~/temp/american-english-dictionary ~/temp/british-english-dictionary > ~/results/common-english
# Words unique to each dictionary (-v inverts the match)
grep -Fxvf ~/results/common-english ~/temp/american-english-dictionary > ~/results/unique-american-english
grep -Fxvf ~/results/common-english ~/temp/british-english-dictionary > ~/results/unique-british-english
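
The same grep flags can also be applied directly to the two files from the question, without building the intermediate list of common words. The following is only a minimal sketch: the sample words and the temporary directory are assumptions made for illustration, not part of the original answer.

```shell
# Hypothetical sample word lists standing in for a.txt and b.txt
workdir=$(mktemp -d)
printf '%s\n' apple banana cherry date > "$workdir/a.txt"
printf '%s\n' banana date elderberry > "$workdir/b.txt"

# -F: treat patterns as fixed strings, -x: match whole lines only,
# -v: invert the match, -f FILE: read the patterns from FILE.
# Prints every line of a.txt that does not appear in b.txt.
extra_in_a=$(grep -Fxvf "$workdir/b.txt" "$workdir/a.txt")
echo "$extra_in_a"
```

With this sample data the command prints apple and cherry, the two words that exist only in a.txt.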

Answer 2

Answered by Anders Johansson

Sort them and use comm:

comm -23 <(sort a.txt) <(sort b.txt)

comm compares (sorted) input files and by default outputs three columns: lines that are unique to a, lines that are unique to b, and lines that are present in both. By specifying -1, -2 and/or -3 you can suppress the corresponding output. Therefore comm -23 a b lists only the entries that are unique to a. I use the <(...) syntax to sort the files on the fly; if they are already sorted you don't need this.
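
To make the three columns concrete, here is a small sketch that selects each one in turn. The sample words, the temporary directory, and the a.sorted/b.sorted file names are assumptions for illustration. The files are sorted ahead of time, so no <(...) process substitution is needed, and LC_ALL=C is set so that sort and comm agree on the ordering.

```shell
# Hypothetical sample data, sorted into temporary files
workdir=$(mktemp -d)
printf '%s\n' pear apple fig | LC_ALL=C sort > "$workdir/a.sorted"
printf '%s\n' fig kiwi apple | LC_ALL=C sort > "$workdir/b.sorted"

# -23 hides columns 2 and 3: only lines unique to the first file remain
only_in_a=$(LC_ALL=C comm -23 "$workdir/a.sorted" "$workdir/b.sorted")
# -13 hides columns 1 and 3: only lines unique to the second file remain
only_in_b=$(LC_ALL=C comm -13 "$workdir/a.sorted" "$workdir/b.sorted")
# -12 hides columns 1 and 2: only lines common to both files remain
in_both=$(LC_ALL=C comm -12 "$workdir/a.sorted" "$workdir/b.sorted")

echo "only in a: $only_in_a"
echo "only in b: $only_in_b"
echo "in both:"
echo "$in_both"
```

Here pear is unique to the first list, kiwi is unique to the second, and apple and fig are shared.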

Answer 3

Answered by Manjula

You can use the diff tool in Linux to compare two files. The --changed-group-format and --unchanged-group-format options let you filter the output down to the data you need.

Each option accepts one of the following three values to select the relevant group:

'%<' get lines from FILE1
'%>' get lines from FILE2
'' (empty string) for removing lines from both files.

E.g.: diff --changed-group-format="%<" --unchanged-group-format="" file1.txt file2.txt

[root@vmoracle11 tmp]# cat file1.txt 
test one
test two
test three
test four
test eight
[root@vmoracle11 tmp]# cat file2.txt 
test one
test three
test nine
[root@vmoracle11 tmp]# diff --changed-group-format='%<' --unchanged-group-format='' file1.txt file2.txt 
test two
test four
test eight
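
The session above selects '%<' and so prints the lines that come from file1.txt. As a rough sketch of the opposite direction, the snippet below recreates the same two files in a temporary directory (the directory and the variable names are assumptions) and selects '%>' instead. It relies on the group-format options of GNU diff.

```shell
# Recreate the two sample files from the session above
workdir=$(mktemp -d)
printf '%s\n' 'test one' 'test two' 'test three' 'test four' 'test eight' > "$workdir/file1.txt"
printf '%s\n' 'test one' 'test three' 'test nine' > "$workdir/file2.txt"

# '%>' keeps only the lines that come from the second file.
# diff exits with status 1 when the files differ, so "|| true"
# keeps the sketch usable under "set -e".
only_in_file2=$(diff --changed-group-format='%>' --unchanged-group-format='' \
    "$workdir/file1.txt" "$workdir/file2.txt" || true)
echo "$only_in_file2"
```

For these files the only line that exists in file2.txt but not in file1.txt is "test nine".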

Answer 4

Answered by Chris Seymour

Use comm -13 (requires sorted files):

$ cat file1
one
two
three

$ cat file2
one
two
three
four

$ comm -13 <(sort file1) <(sort file2)
four

Answer 5

Answered by Fengya Li

If you have vim installed, try this:

vimdiff file1 file2

or

vim -d file1 file2

You will find it fantastic.

Answer 6

Answered by mudrii

Try sdiff (see man sdiff):

sdiff -s file1 file2

Answer 7

Answered by FindlinuxOne

You can also use colordiff, which displays the output of diff in color.

About vimdiff: it allows you to compare files via SSH, for example:

vimdiff /var/log/secure scp://192.168.1.25/var/log/secure

Extracted from: http://www.sysadmit.com/2016/05/linux-diferencias-entre-dos-archivos.html

Answer 8

Answered by joelostblom

If you prefer the diff output style of git diff, you can use it with the --no-index flag to compare files that are not in a git repository:

git diff --no-index a.txt b.txt

Using a couple of files with around 200k file name strings in each, I benchmarked (with the built-in time command) this approach against some of the other answers here:

git diff --no-index a.txt b.txt
# ~1.2s

comm -23 <(sort a.txt) <(sort b.txt)
# ~0.2s

diff a.txt b.txt
# ~2.6s

sdiff a.txt b.txt
# ~2.7s

vimdiff a.txt b.txt
# ~3.2s

comm seems to be the fastest by far, while git diff --no-index appears to be the fastest approach for diff-style output.

Update 2018-03-25: You can actually omit the --no-index flag unless you are inside a git repository and want to compare untracked files within that repository. From the man pages:

This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.

Answer 9

Answered by Iurii Golskyi

Also, do not forget about mcdiff, the internal diff viewer of GNU Midnight Commander.

For example:

mcdiff file1 file2

Enjoy!

Answer 10

Answered by James Brown

Using awk for it. Test files:

$ cat a.txt
one
two
three
four
four
$ cat b.txt
three
two
one

The awk:

$ awk '
NR==FNR {                    # process b.txt  or the first file
    seen[$0]                 # hash words to hash seen
    next                     # next word in b.txt
}                            # process a.txt  or all files after the first
!($0 in seen)' b.txt a.txt   # if word is not hashed to seen, output it

Duplicates are output:

four
four

To avoid duplicates, add each newly met word in a.txt to the seen hash:

$ awk '
NR==FNR {
    seen[$0]
    next
}
!($0 in seen) {              # if word is not hashed to seen
    seen[$0]                 # hash unseen a.txt words to seen to avoid duplicates
    print                    # and output it
}' b.txt a.txt

Output:

four

If the word lists are comma-separated, like:

$ cat a.txt
four,four,three,three,two,one
five,six
$ cat b.txt
one,two,three

you have to do a couple of extra laps (for loops):

awk -F, '                    # comma-separated input
NR==FNR {
    for(i=1;i<=NF;i++)       # loop all comma-separated fields
        seen[$i]
    next
}
{
    for(i=1;i<=NF;i++)
        if(!($i in seen)) {
             seen[$i]        # this time we buffer output (below):
             buffer=buffer (buffer==""?"":",") $i
        }
    if(buffer!="") {         # output non-empty buffers after each record in a.txt
        print buffer
        buffer=""
    }
}' b.txt a.txt

Output this time:

four
five,six
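
As a rough cross-check, the de-duplicating awk filter can be compared with a sort-based comm pipeline on the one-word-per-line test files. This is only a sketch: the temporary directory, the a.sorted/b.sorted file names, and the variable names are assumptions. The awk version keeps the order of a.txt and needs no sorting, while comm requires sorted input.

```shell
# Recreate the one-word-per-line test files in a temporary directory
workdir=$(mktemp -d)
printf '%s\n' one two three four four > "$workdir/a.txt"
printf '%s\n' three two one > "$workdir/b.txt"

# awk: remember every line of b.txt, then print each line of a.txt
# that has not been seen before (also marking it as seen)
awk_result=$(awk 'NR==FNR { seen[$0]; next } !($0 in seen) { seen[$0]; print }' \
    "$workdir/b.txt" "$workdir/a.txt")

# comm: needs sorted input; sort -u also drops the duplicate "four"
LC_ALL=C sort -u "$workdir/a.txt" > "$workdir/a.sorted"
LC_ALL=C sort -u "$workdir/b.txt" > "$workdir/b.sorted"
comm_result=$(LC_ALL=C comm -23 "$workdir/a.sorted" "$workdir/b.sorted")

echo "awk:  $awk_result"
echo "comm: $comm_result"
```

Both approaches report four as the only word that appears in a.txt but not in b.txt.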
