How do I run grep in parallel?

I usually use grep -rIn pattern_str big_source_code_dir to find things. But grep is not parallel. How can I make it parallel? My system has 4 cores, so if grep could use all of them the search would be faster.

2012-08-10 Lai Jiangshan

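There is a new open-source project, http://international-characters.com/icgrep, which is a "parallel bit stream implementation". I haven't tried the software, but it could be very fast. – 2014-07-20 09:07:50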

If the directory you are searching is stored on an HDD, there will be no speedup. Hard drives are pretty much single-threaded access units.

But if you really want a parallel grep, this website gives two hints on how to do it with find and xargs. For example:

find . -type f -print0 | xargs -0 -P 4 -n 40 grep -i foobar
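A slightly expanded sketch of the same find/xargs approach, wrapped in a shell function so the pattern and directory can vary. The function name par_grep and its defaults are hypothetical, and nproc and xargs -P assume GNU coreutils and findutils:

```shell
# par_grep PATTERN [DIR]
# Hypothetical helper: case-insensitive recursive grep, one batch of files per process.
par_grep() {
  pattern=$1
  dir=${2:-.}
  # One grep process per core; fall back to 4 (the asker's core count) if nproc is missing.
  njobs=$(nproc 2>/dev/null || echo 4)

  # -print0 / -0 : NUL-separated names survive spaces and quotes
  # -P "$njobs"  : run up to $njobs grep processes at once
  # -n 40        : give each grep at most 40 files
  # -I -H -n     : skip binary files, always print file name and line number
  find "$dir" -type f -print0 |
    xargs -0 -P "$njobs" -n 40 grep -I -H -n -i -e "$pattern"
}
```

Example: par_grep foobar big_source_code_dir. Lines from different grep processes can still interleave in the output, and xargs exits with status 123 whenever some batch finds no match, so a nonzero exit status does not by itself mean the search failed.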


2012-08-10 09:39:32 Ilya

-bash: parallel: command not found – Satish 2012-08-10 19:02:28

I copied the wrong example from the source website, sorry. I will fix the answer. – Ilya 2012-08-12 17:08:00

Note that with 'xargs' you risk getting mixed output. To see this in action, see: http://www.gnu.org/software/parallel/man.html#differences_between_xargs_and_gnu_parallel – 2012-08-20 08:40:20

The GNU parallel command is very useful for this.

sudo apt-get install parallel # on Debian-based systems, if it is not already installed

The parallel man page then provides an example:

EXAMPLE: Parallel grep 
     grep -r greps recursively through directories. 
     On multicore CPUs GNU parallel can often speed this up. 

     find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {} 

     This will run 1.5 job per core, and give 1000 arguments to grep.

In your case it could be:

find big_source_code_dir -type f | parallel -k -j150% -n 1000 -m grep -H -n pattern_str {}
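The command above splits file names on newlines. A variant that passes NUL-separated names instead might look like the sketch below. The function name par_grep_gnu is hypothetical, it assumes GNU parallel (not the moreutils tool of the same name), and the pattern must not contain characters the shell would interpret, because parallel runs the command through a shell:

```shell
# par_grep_gnu PATTERN [DIR]
# Hypothetical helper: recursive grep driven by GNU parallel.
par_grep_gnu() {
  pattern=$1
  dir=${2:-.}

  # -print0 / -0 : NUL-separated file names
  # -k           : keep output in the same order as the input
  # -j150%       : 1.5 jobs per core, as in the man page example
  # -n 1000 -m   : pass many file names to each grep invocation
  find "$dir" -type f -print0 |
    parallel -0 -k -j150% -n 1000 -m grep -H -n -e "$pattern" {}
}
```

Example: par_grep_gnu pattern_str big_source_code_dir. Unlike xargs, parallel groups each job's output, so lines from different grep processes do not run together.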

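Finally, the GNU parallel man page also has a section describing the differences between the xargs and parallel commands, which should help explain why parallel seems the better fit in your case: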

DIFFERENCES BETWEEN xargs AND GNU Parallel 
     xargs offer some of the same possibilities as GNU parallel. 

     xargs deals badly with special characters (such as space, ' and "). To see the problem try this: 

     touch important_file 
     touch 'not important_file' 
     ls not* | xargs rm 
     mkdir -p "My brother's 12\" records" 
     ls | xargs rmdir 

     You can specify -0 or -d "\n", but many input generators are not optimized for using NUL as separator but are optimized for newline as separator. E.g head, tail, awk, ls, echo, sed, tar -v, perl (-0 and \0 instead of \n), 
     locate (requires using -0), find (requires using -print0), grep (requires user to use -z or -Z), sort (requires using -z). 

     So GNU parallel's newline separation can be emulated with: 

     cat | xargs -d "\n" -n1 command 

     xargs can run a given number of jobs in parallel, but has no support for running number-of-cpu-cores jobs in parallel. 

     xargs has no support for grouping the output, therefore output may run together, e.g. the first half of a line is from one process and the last half of the line is from another process. The example Parallel grep cannot be 
     done reliably with xargs because of this. 
     ...


2016-01-06 12:57:23 MordicusEtCubitus

I disagree. When grepping, your limiting factor is IO throughput, not CPU time. Throwing more cores at the problem won't make your disk spin any faster. – Sobrique 2016-01-06 12:59:04

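I don't agree with you. Running # time grep -E 'Invalid user (\S+) from ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) port ([0-9]+)' /var/log/auth.log takes 10 seconds on my i7. Testing the drive speed with # dd if=/var/log/auth.log of=/dev/null bs=1M gives 4 seconds for 600MB, at 130MB/sec. But the grep above took 3 times longer, reading data at close to 40MB/sec. So the regex processing time is the most expensive part here. Running in parallel: parallel --pipe --block 16M grep -E 'Invalid user (\S+) from ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) port ([0-9]+)' – MordicusEtCubitus 2016-01-06 13:15:57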

Note that you need to escape special characters in the search term when running grep under parallel, for example:

parallel --pipe --block 10M --ungroup LC_ALL=C grep -F 'PostTypeId=\"1\"' < ~/Downloads/Posts.xml > questions.xml

With standalone grep, grep -F 'PostTypeId="1"' works without escaping the double quotes. It took me a while to figure this out!

Also note the use of LC_ALL=C and the -F flag (if you are only searching for fixed strings) for an additional speedup.
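If GNU parallel's -q (--quote) option is used, the search string should reach grep unchanged, so the manual backslash escaping may not be needed. A minimal sketch under that assumption; the function name grep_fixed_pipe is hypothetical, and LC_ALL=C is set on parallel itself so the grep children inherit it, since -q expects a simple command without variable assignments:

```shell
# grep_fixed_pipe STRING FILE
# Hypothetical helper: fixed-string search of one large file, split into blocks.
grep_fixed_pipe() {
  # -q           : quote the command so STRING is not re-interpreted by the shell
  # -k           : keep output blocks in input order
  # --pipe       : split stdin into blocks and feed each block to its own grep
  # --block 10M  : approximate block size, as in the command above
  # grep -F      : treat STRING as a fixed string, not a regular expression
  LC_ALL=C parallel -q -k --pipe --block 10M grep -F -e "$1" < "$2"
}
```

Example: grep_fixed_pipe 'PostTypeId="1"' ~/Downloads/Posts.xml > questions.xml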


2017-11-03 12:09:17 Gaurav
