A small experiment: grep vs. sort & comm

Date: 2015-09-25 04:11:24

I saw a Weibo post by the expert 沈沉舟 and took the opportunity to learn a few more Linux commands and options :D

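#Share# Given two text files, 1.txt and 2.txt, find the lines that exist only in 2.txt. If they are large files with tens of thousands of lines, be sure to sort them first and then use comm, rather than using grep -vxf to find the lines that exist only in 2.txt. I have measured it: the former is dramatically more efficient than the latter. "sort + comm" takes far less time than grep, whose running time is unimaginably long.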

September 24, 17:51, from Weibo (weibo.com)
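
To make the task concrete, here is a minimal Python sketch of what is being computed: the lines of 2.txt that do not appear in 1.txt. The function name `only_in_second` is hypothetical, and the set-based lookup is only an illustration of the result, not a description of how grep or comm work internally.

```python
def only_in_second(first_lines, second_lines):
    """Return the lines of second_lines that never occur in first_lines.

    Order and duplicates of second_lines are preserved, which mirrors
    the whole-line, inverted matching of `grep -Fvxf 1.txt 2.txt`.
    """
    seen = set(first_lines)  # hashing gives average O(1) membership tests
    return [line for line in second_lines if line not in seen]


if __name__ == "__main__":
    print(only_in_second(["1", "3", "5"], ["5", "2", "3", "4"]))  # ['2', '4']
```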

Here is my small test program:

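#!/usr/bin/env python
import random
import os
import time

# Generate two files of random integers, one number per line.
fp1 = open('1.txt', 'w')
fp2 = open('2.txt', 'w')

for i in range(1, 5000):  # writes 4999 lines to each file
    str1 = str(random.randint(0, 1000000)) + '\n'
    fp1.write(str1)
    str2 = str(random.randint(0, 1000000)) + '\n'
    fp2.write(str2)

fp1.close()
fp2.close()

# 1) grep, with the lines of 1.txt interpreted as regular expressions
t0 = time.time()
os.system('grep -vxf 1.txt 2.txt > comm1.txt')
print(time.time() - t0)

# 2) grep -F, with the lines of 1.txt interpreted as fixed strings
t0 = time.time()
os.system('grep -Fvxf 1.txt 2.txt > comm2.txt')
print(time.time() - t0)

# 3) sort both files in place, then comm -13 keeps lines unique to 2.txt
t0 = time.time()
os.system('sort 1.txt -o 1.txt && sort 2.txt -o 2.txt && comm -13 1.txt 2.txt > comm3.txt')
print(time.time() - t0)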

Output:

ez@ubuntu:~/workdir/tiny$ python _random.py 
4.72525906563
0.00785803794861
0.100791931152
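
Timing aside, it is worth checking that the three commands produced the same answer. The sketch below compares the output files as sets of lines; the helper names `read_line_set` and `outputs_agree` are my own. Comparing as sets is a deliberate simplification: grep keeps the order of 2.txt, while comm emits sorted lines. As far as I understand, a mismatch can still legitimately occur when a line is repeated more often in 2.txt than in 1.txt, because comm pairs lines one-to-one while grep drops every matching line.

```python
import os


def read_line_set(path):
    """Read a text file and return its lines as a set, without newlines."""
    with open(path) as fp:
        return set(line.rstrip("\n") for line in fp)


def outputs_agree(paths):
    """Return True if all files in paths contain the same set of lines."""
    line_sets = [read_line_set(p) for p in paths]
    return all(s == line_sets[0] for s in line_sets[1:])


if __name__ == "__main__":
    # File names taken from the experiment script above.
    results = ["comm1.txt", "comm2.txt", "comm3.txt"]
    if all(os.path.exists(p) for p in results):
        print(outputs_agree(results))
    else:
        print("run the experiment first to produce the output files")
```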

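Judging from the results, grep with the -F option is actually the fastest of the three, and it also preserves the original order of the data. Here is what the Linux manual says about this option: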

-F, --fixed-strings PATTERN is a set of newline-separated fixed strings

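In other words, the patterns are treated as fixed strings, whereas by default grep interprets them as regular expressions, which is why -F is more efficient. grep with -F is equivalent to fgrep.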
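
The difference is not only about speed: with -F each pattern line is compared literally, whereas without it characters such as `.` act as regex metacharacters. Here is a small Python 3 sketch of the distinction, using the `re` module as a stand-in for grep's matcher. That is an assumption: Python's regex dialect is not identical to grep's basic regular expressions, but `.` means "any character" in both. The function names are hypothetical.

```python
import re


def matches_whole_line_regex(pattern, line):
    """Rough analogue of `grep -x`: the pattern is a regex over the whole line."""
    return re.fullmatch(pattern, line) is not None


def matches_whole_line_fixed(pattern, line):
    """Rough analogue of `grep -Fx`: the pattern is a literal string."""
    return pattern == line


if __name__ == "__main__":
    print(matches_whole_line_regex("1.5", "125"))  # True: '.' matches '2'
    print(matches_whole_line_fixed("1.5", "125"))  # False: literal comparison
    print(matches_whole_line_fixed("1.5", "1.5"))  # True
```

In the experiment the files contain only digits, so both modes should select the same lines; -F simply spares grep the regex handling.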


Original: http://my.oschina.net/cve2015/blog/510928
