Counting lines in large files on Linux

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/12716570/


Count lines in large files

linux, mapreduce

Asked by Dnaiel

I commonly work with text files of ~20 GB in size, and I find myself counting the number of lines in a given file very often.

The way I do it now is just cat fname | wc -l, and it takes very long. Is there any solution that would be much faster?

I work on a high-performance cluster with Hadoop installed. I was wondering if a MapReduce approach could help.

I'd like the solution to be as simple as a one-line command, like the wc -l solution, but I'm not sure how feasible that is.

Any ideas?


Accepted answer by P.P

Try: sed -n '$=' filename

Also, cat is unnecessary: wc -l filename is enough in your present approach.

Answered by lvella

Your limiting factor is the I/O speed of your storage device, so switching between simple newline/pattern-counting programs won't help, because the execution-speed difference between those programs is likely to be drowned out by the much slower disk/storage/whatever you have.

But if you have the same file copied across disks/devices, or the file is distributed among those disks, you can certainly perform the operation in parallel. I don't know the specifics of Hadoop, but assuming you can read a 10 GB file from 4 different locations, you can run 4 different line-counting processes, each on one part of the file, and sum their results up:

$ dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file | wc -l &
$ dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file | wc -l &
$ dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file | wc -l &
$ dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file | wc -l &

Notice the & at the end of each command line, so all will run in parallel; dd works like cat here, but allows us to specify how many bytes to read (count * bs bytes) and how many to skip at the beginning of the input (skip * bs bytes). It works in blocks, hence the need to specify bs as the block size. In this example, I've partitioned the 10 GB file into 4 equal chunks of 4 KB * 655360 = 2684354560 bytes = 2.5 GB, one given to each job; you may want to set up a script that does this for you based on the size of the file and the number of parallel jobs you will run. You also need to sum the results of the executions, which I haven't done for my lack of shell-script ability.
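A rough sketch of such a driver script might look like the following (bash, assuming GNU stat; FILE, JOBS and BS are placeholders, and the chunks only truly run in parallel when they sit on separate disks or a parallel filesystem; for per-disk copies you would vary the if= path per job, as in the dd lines above):

#!/bin/bash
# Sketch: split one line count across JOBS parallel dd | wc -l pipelines and sum the parts.
FILE=/path/to/file          # placeholder path
JOBS=4
BS=4096                     # dd block size in bytes
SIZE=$(stat -c %s "$FILE")                  # file size in bytes (GNU stat)
BLOCKS=$(( (SIZE + BS - 1) / BS ))          # number of bs-sized blocks in the file
PER_JOB=$(( (BLOCKS + JOBS - 1) / JOBS ))   # blocks handled by each job

tmpdir=$(mktemp -d)
for i in $(seq 0 $((JOBS - 1))); do
    # each newline lands in exactly one chunk, so the partial counts sum to the true total
    dd bs=$BS skip=$((i * PER_JOB)) count=$PER_JOB if="$FILE" 2>/dev/null \
        | wc -l > "$tmpdir/count.$i" &
done
wait                                        # let all jobs finish
cat "$tmpdir"/count.* | paste -sd+ - | bc   # sum the partial counts
rm -r "$tmpdir"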

If your filesystem is smart enough to split a big file among many devices, like a RAID or a distributed filesystem, and to automatically parallelize I/O requests that can be parallelized, you can do such a split, running many parallel jobs but using the same file path, and you may still see some speed gain.

EDIT: Another idea that occurred to me: if the lines inside the file all have the same size, you can get the exact number of lines by dividing the size of the file by the size of a line, both in bytes. You can do it almost instantaneously in a single job. If you only know the mean line size and don't need the exact line count but want an estimate, you can do this same operation and get a satisfactory result much faster than the exact operation.
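A quick sketch of that estimate, assuming GNU stat and a known (or assumed) average line length; the 80 bytes here is only a placeholder:

FILE=/path/to/file
AVG_LINE_BYTES=80                 # assumed average line length, newline included
SIZE=$(stat -c %s "$FILE")        # file size in bytes (GNU stat)
echo $(( SIZE / AVG_LINE_BYTES )) # exact if every line has that size, otherwise an estimate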

Answered by Chris White

Hadoop is essentially providing a mechanism to perform something similar to what @Ivella is suggesting.


Hadoop's HDFS (distributed file system) will take your 20 GB file and save it across the cluster in blocks of a fixed size. Let's say you configure the block size to be 128 MB; the file would then be split into 160 blocks of 128 MB (20 GB / 128 MB = 160).

You would then run a map reduce program over this data, essentially counting the lines for each block (in the map stage) and then reducing these block line counts into a final line count for the entire file.

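A minimal Hadoop streaming sketch of that idea, where each mapper counts the lines of its input split and a single reducer adds the per-split counts (the HDFS paths are placeholders; the jar location follows the same convention as the streaming example in a later answer):

# mapper: count lines of each split; reducer: add the per-split counts
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -input /user/me/bigfile.txt \
    -output /user/me/bigfile_linecount \
    -mapper 'wc -l' \
    -reducer "awk '{s += \$1} END {print s}'"
$HADOOP_HOME/bin/hadoop fs -cat /user/me/bigfile_linecount/part-*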

As for performance, in general the bigger your cluster, the better the performance (more copies of wc running in parallel, over more independent disks), but there is some overhead in job orchestration, which means that running the job on smaller files will not actually yield quicker throughput than running a local wc.

Answered by ZenOfPython

If your computer has python, you can try this from the shell:


python -c "print len(open('test.txt').read().split('\n'))"

This uses python -c to pass in a command, which basically reads the file and splits it on the newline character, to get the count of newlines, i.e. the overall length of the file in lines.

@BlueMoon's:


bash-3.2$ sed -n '$=' test.txt
519

Using the above:


bash-3.2$ python -c "print len(open('test.txt').read().split('\n'))"
519
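Note that read().split('\n') pulls the entire file into memory, which is painful for a ~20 GB input, and it always reports one more than wc -l does, because splitting on '\n' yields newline-count + 1 pieces (you can see the one-line difference in a later answer). If memory is a concern, a streaming variant along these lines should behave better; it assumes a python (2 or 3) on the PATH:

python -c "import sys; print(sum(1 for _ in sys.stdin))" < test.txt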

Answered by Pirooz

If your data resides on HDFS, perhaps the fastest approach is to use Hadoop streaming. Apache Pig's COUNT UDF operates on a bag, and therefore uses a single reducer to compute the number of rows. Instead, you can manually set the number of reducers in a simple Hadoop streaming script as follows:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.reduce.tasks=100 -input <input_path> -output <output_path> -mapper /bin/cat -reducer "wc -l"

Note that I manually set the number of reducers to 100, but you can tune this parameter. Once the map-reduce job is done, the result from each reducer is stored in a separate file. The final row count is the sum of the numbers returned by all the reducers; you can get it as follows:

$HADOOP_HOME/bin/hadoop fs -cat <output_path>/* | paste -sd+ | bc

Answered by eugene

I'm not sure that python is quicker:


[root@myserver scripts]# time python -c "print len(open('mybigfile.txt').read().split('\n'))"

644306


real    0m0.310s
user    0m0.176s
sys     0m0.132s

[root@myserver scripts]# time  cat mybigfile.txt  | wc -l

644305


real    0m0.048s
user    0m0.017s
sys     0m0.074s

Answered by Ceaser Ashton-Bradley Junior

find -type f -name "filepattern_2015_07_*.txt" -exec ls -1 {} \; | cat | awk '//{ print $0, system("cat " $0 "|" "wc -l")}'

Output:


Answered by Nicholas Sushkin

On a multi-core server, use GNU parallel to count file lines in parallel. After each file's line count is printed, bc sums all the line counts.

find . -name '*.txt' | parallel 'wc -l < {}' 2>/dev/null | paste -sd+ - | bc

To save space, you can even keep all files compressed. The following line uncompresses each file and counts its lines in parallel, then sums all counts.


find . -name '*.xz' | parallel 'xzcat {} | wc -l' 2>/dev/null | paste -sd+ - | bc

Answered by Pramod Tiwari

As per my test, I can verify that the Spark shell (based on Scala) is way faster than the other tools (grep, sed, awk, perl, wc). Here is the result of the test that I ran on a file with 23782409 lines:

time grep -c $ my_file.txt;

real 0m44.96s user 0m41.59s sys 0m3.09s


time wc -l my_file.txt;

real 0m37.57s user 0m33.48s sys 0m3.97s


time sed -n '$=' my_file.txt;

real 0m38.22s user 0m28.05s sys 0m10.14s


time perl -ne 'END { $_=$.;if(!/^[0-9]+$/){$_=0;};print "$_" }' my_file.txt;


real 0m23.38s user 0m20.19s sys 0m3.11s


time awk 'END { print NR }' my_file.txt;

real 0m19.90s user 0m16.76s sys 0m3.12s


spark-shell
import org.joda.time._
val t_start = DateTime.now()
sc.textFile("file://my_file.txt").count()
val t_end = DateTime.now()
new Period(t_start, t_end).toStandardSeconds()

res1: org.joda.time.Seconds = PT15S


Answered by sudo

If your bottleneck is the disk, it matters how you read from it. dd if=filename bs=128M | wc -l is a lot faster than wc -l filename or cat filename | wc -l on my machine, which has an HDD and a fast CPU and RAM. You can play around with the block size and see what dd reports as the throughput. I cranked it up to 1 GiB.
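A quick way to experiment is to time a raw read at a few block sizes and watch the throughput figure dd prints on stderr (filename is a placeholder; beware that repeated runs may be served from the page cache, so use a file larger than RAM or drop caches for honest numbers):

for bs in 128K 1M 16M 128M 1G; do
    echo "block size: $bs"
    dd if=filename of=/dev/null bs=$bs 2>&1 | tail -n 1   # last stderr line includes the MB/s figure
done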

Note: There is some debate about whether cat or dd is faster. All I claim is that dd can be faster, depending on the system, and that it is for me. Try it for yourself.