2012-08-10 17 views
3

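How do I parallelize grep? I usually run grep -rIn pattern_str big_source_code_dir to find things. But grep is not parallel; how can I make it parallel? My system has 4 cores, and if grep could use all of them it would be faster.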

+1

There is a new open-source project, http://international-characters.com/icgrep, that is a "parallel bit stream implementation". I haven't tried the software yet, but it may be fast. – 2014-07-20 09:07:50

Answers

7

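If the directory you are searching is stored on an HDD, you will not see any speedup. A hard disk drive is essentially a single-threaded access device.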

But if you really want a parallel grep, this website gives two hints on how to do it with find and xargs. For example:

find . -type f -print0 | xargs -0 -P 4 -n 40 grep -i foobar 
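As a minimal sketch of the same find and xargs pipeline, the snippet below runs it against a throwaway directory. The directory layout, the `demo_dir` and `jobs` variables, and the use of `nproc` to pick the job count are assumptions added for illustration; they are not part of the original answer.

```shell
# Build a small throwaway tree so the sketch runs anywhere.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/src/sub dir"
printf 'int foobar(void);\n' > "$demo_dir/src/a.h"
printf 'nothing here\n' > "$demo_dir/src/b.c"
printf 'call FooBar()\n' > "$demo_dir/src/sub dir/c.c"

# One grep process per core; fall back to 4 if nproc is unavailable.
jobs=$(nproc 2>/dev/null || echo 4)

# -print0 / -0 keep file names with spaces intact.
# -H forces the file name prefix even if a batch contains a single file.
matches=$(find "$demo_dir/src" -type f -print0 |
  xargs -0 -P "$jobs" -n 40 grep -H -i foobar | sort)

printf '%s\n' "$matches"
rm -rf "$demo_dir"
```

Note that xargs exits with a non-zero status when any grep batch finds no match, so the exit code alone does not tell you whether the search as a whole found something.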
+0

-bash: parallel: command not found – Satish 2012-08-10 19:02:28

+0

I copied the wrong example from the source website, sorry. I will fix the answer. – Ilya 2012-08-12 17:08:00

+0

Note that with 'xargs' you risk getting mixed output. To see this in action, see: http://www.gnu.org/software/parallel/man.html#differences_between_xargs_and_gnu_parallel – 2012-08-20 08:40:20
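One possible workaround, if you want to stay with xargs, is to have each batch write to its own temporary file and concatenate the files afterwards, so lines from different grep processes cannot interleave. This is only a sketch: the `OUT_DIR` variable, the `part.XXXXXX` file naming and the demo files are assumptions made for illustration.

```shell
# Throwaway input files and an output directory for per-batch results.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/src" "$demo_dir/out"
for i in 1 2 3 4 5 6; do
  printf 'line one\nfoobar in file %s\n' "$i" > "$demo_dir/src/f$i.txt"
done

OUT_DIR="$demo_dir/out"
export OUT_DIR

# Each batch of 2 files is grepped by its own sh process, which writes
# to a unique file. "|| [ $? -eq 1 ]" treats "no match" as success but
# still lets real grep errors (status 2) propagate to xargs.
find "$demo_dir/src" -type f -print0 |
  xargs -0 -P 4 -n 2 sh -c '
    grep -H -n foobar "$@" > "$(mktemp "$OUT_DIR/part.XXXXXX")" || [ $? -eq 1 ]
  ' sh

# Merge the per-batch files once all workers have finished.
merged=$(cat "$OUT_DIR"/part.* | sort)
printf '%s\n' "$merged"
rm -rf "$demo_dir"
```

The cost is one extra file per batch, which is negligible when each batch holds many files.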

0

The GNU parallel command is very useful for this.

sudo apt-get install parallel # if not available on debian based systems 

Then, the parallel man page provides an example:

EXAMPLE: Parallel grep 
     grep -r greps recursively through directories. 
     On multicore CPUs GNU parallel can often speed this up. 

     find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {} 

     This will run 1.5 job per core, and give 1000 arguments to grep. 

In your case it could be:

find big_source_code_dir -type f | parallel -k -j150% -n 1000 -m grep -H -n pattern_str {} 
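The command above can be wrapped in a small helper that also copes with file names containing spaces. The sketch below is an assumption-laden illustration: `par_grep` is a hypothetical function name, the NUL-separated input (`-print0` with `parallel -0`) is an addition to the man page example, and the xargs fallback is only there so the script still runs where GNU parallel is not installed. The pattern is passed through unescaped, so it should not contain shell metacharacters.

```shell
# par_grep PATTERN DIR (hypothetical helper, not part of any tool)
par_grep() {
  pattern=$1
  dir=$2
  if command -v parallel >/dev/null 2>&1 &&
     parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # -k keeps output in input order, -j150% runs 1.5 jobs per core,
    # -m packs many file names into each grep invocation.
    find "$dir" -type f -print0 |
      parallel -0 -k -j150% -n 1000 -m grep -H -n -e "$pattern" {}
  else
    # Fallback when GNU parallel is missing: fixed job count of 4.
    find "$dir" -type f -print0 |
      xargs -0 -P 4 -n 1000 grep -H -n -e "$pattern"
  fi
}

# Demo on a throwaway tree.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/lib"
printf 'int foobar;\n' > "$demo_dir/my file.c"
printf 'x\nfoobar()\n' > "$demo_dir/lib/util.c"
printf 'no match here\n' > "$demo_dir/README"

matches=$(par_grep foobar "$demo_dir" | sort)
printf '%s\n' "$matches"
rm -rf "$demo_dir"
```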

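Finally, the GNU parallel man page also has a section describing the differences between the xargs and parallel commands, which should help explain why parallel seems the better fit in your case: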

DIFFERENCES BETWEEN xargs AND GNU Parallel 
     xargs offer some of the same possibilities as GNU parallel. 

     xargs deals badly with special characters (such as space, ' and "). To see the problem try this: 

     touch important_file 
     touch 'not important_file' 
     ls not* | xargs rm 
     mkdir -p "My brother's 12\" records" 
     ls | xargs rmdir 

     You can specify -0 or -d "\n", but many input generators are not optimized for using NUL as separator but are optimized for newline as separator. E.g head, tail, awk, ls, echo, sed, tar -v, perl (-0 and \0 instead of \n), 
     locate (requires using -0), find (requires using -print0), grep (requires user to use -z or -Z), sort (requires using -z). 

     So GNU parallel's newline separation can be emulated with: 

     cat | xargs -d "\n" -n1 command 

     xargs can run a given number of jobs in parallel, but has no support for running number-of-cpu-cores jobs in parallel. 

     xargs has no support for grouping the output, therefore output may run together, e.g. the first half of a line is from one process and the last half of the line is from another process. The example Parallel grep cannot be 
     done reliably with xargs because of this. 
     ... 
+0

I disagree. When grepping, your limiting factor is IO throughput, not CPU time. Throwing more cores at the problem won't make your disk spin faster. – Sobrique 2016-01-06 12:59:04

+1

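I disagree with you: # time grep -E 'Invalid user (\S+) from ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) port ([0-9]+)' /var/log/auth.log shows 10 seconds on my i7. Then testing the drive speed: # dd if=/var/log/auth.log of=/dev/null bs=1M gives 4 seconds for 600MB at 130MB/sec. But the grep above took 3 times as long, reading data at close to 40MB/sec. So here the regex processing time is the most expensive part. Running in parallel: parallel --pipe --block 16M grep -E 'Invalid user (\S+) from ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) port ([0-9]+)' – MordicusEtCubitus 2016-01-06 13:15:57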

1

Note that you need to escape special characters in your search term when running grep through parallel, for example:

parallel --pipe --block 10M --ungroup LC_ALL=C grep -F 'PostTypeId=\"1\"' < ~/Downloads/Posts.xml > questions.xml

With standalone grep, grep -F 'PostTypeId="1"' works without escaping the double quotes. It took me a while to figure this out!
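The likely reason is that the command line is parsed by a shell a second time before grep runs, which consumes one layer of quotes. The sketch below imitates that with `sh -c` as a stand-in for the extra shell; it is an analogy under that assumption, not a trace of what GNU parallel does internally, and the sample line is made up.

```shell
line='<row Id="1" PostTypeId="1" />'

# 1) Standalone grep: quotes are parsed once, grep sees PostTypeId="1".
direct=$(printf '%s\n' "$line" | grep -c -F 'PostTypeId="1"')

# 2) The same words joined into one string and parsed again by a second
#    shell: the double quotes are consumed, grep sees PostTypeId=1.
cmd='grep -c -F PostTypeId="1"'
reparsed=$(printf '%s\n' "$line" | sh -c "$cmd" || true)

# 3) Backslash-escaped quotes survive the second parse.
cmd_escaped='grep -c -F PostTypeId=\"1\"'
escaped=$(printf '%s\n' "$line" | sh -c "$cmd_escaped")

echo "direct=$direct reparsed=$reparsed escaped=$escaped"
# prints: direct=1 reparsed=0 escaped=1
```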

Also note the use of LC_ALL=C and the -F flag (if you are only searching for fixed strings) for an additional speedup.
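Besides speed, -F also changes what matches, because the pattern is no longer treated as a regular expression. A small sketch with made-up sample rows (the `Score` attribute is an assumption added to show the difference):

```shell
sample=$(mktemp)
printf '%s\n' \
  '<row Id="1" PostTypeId="1" Score="3.5" />' \
  '<row Id="2" PostTypeId="2" Score="315" />' \
  '<row Id="3" PostTypeId="1" Score="7" />' > "$sample"

# Fixed-string search in the C locale: bytes are compared literally.
questions=$(LC_ALL=C grep -c -F 'PostTypeId="1"' "$sample")

# Without -F the dot is a regex wildcard, so "3.5" also matches "315".
regex_hits=$(LC_ALL=C grep -c 'Score="3.5"' "$sample")
fixed_hits=$(LC_ALL=C grep -c -F 'Score="3.5"' "$sample")

echo "questions=$questions regex_hits=$regex_hits fixed_hits=$fixed_hits"
# prints: questions=2 regex_hits=2 fixed_hits=1
rm -f "$sample"
```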