re.match与re.search性能差异

我试图使用timeit模块比较re.match和re.search，我发现匹配比搜索更好，当我想要找到的字符串位于字符串的开头。re.match与re.search性能差异

>>> s1 = ''' 
... import re 
... re.search(r'hello','helloab'*100000) 
... ''' 
>>> timeit.timeit(stmt=s1,number=10000) 
32.12064480781555 


>>> s = ''' 
... import re 
... re.match(r'hello','helloab'*100000) 
... ''' 
>>> timeit.timeit(stmt=s,number=10000) 
30.9136700630188

现在，我知道比赛会在该字符串的开头模式，如果找到返回一个对象，但我想知道是如何搜索工作。

在开始找到字符串后，搜索是否会执行任何额外的匹配，从而减慢搜索速度？

更新

使用@大卫罗宾逊代码后我得到的结果simlar他。

>>> print timeit.timeit(stmt="r.match('hello')", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('hello')", 
...    number = 10000000) 
49.9567620754 
>>> print timeit.timeit(stmt="r.search('hello')", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('hello')", 
...    number = 10000000) 
35.6694438457

因此，更新后的问题是，为什么现在search在业绩match？

来源

2012-10-09 RanRag

你真的应该在'setup'参数中为'timeit'构建测试字符串和正则表达式对象，这样你只能测量匹配/搜索本身。并且做10次以上的迭代。 – interjay

@interjay：增加到10000。试过'号码= 100000'，但在我的上网本上花费了太多时间。 – RanRag

...但你没有做我说的其他事情，所以你没有测量任何重要的问题（大部分时间都花在构建字符串上）。 – interjay

在我的机器上（Python 2.7.3在Mac OS 10.7.3,1.7 GHz Intel Core i5上），在完成字符串构建，导入re和正则表达式编译后，执行10000000次迭代而不是10次，我觉得正好相反：

import timeit 

print timeit.timeit(stmt="r.match(s)", 
      setup="import re; s = 'helloab'*100000; r = re.compile('hello')", 
      number = 10000000) 
# 6.43165612221 
print timeit.timeit(stmt="r.search(s)", 
      setup="import re; s = 'helloab'*100000; r = re.compile('hello')", 
      number = 10000000) 
# 3.85176897049

来源

2012-10-09 15:55:35

现在，这很有趣。 – RanRag

我确认了这个结果。（OS X 10.5.8，python 2.7.3）我发现“搜索”速度快了大约25％。 – mgilson

正如编辑中指出的那样，如果在设置语句中完成正则表达式编译，我发现差距会增大。 –

“因此，更新后的问题是，为什么现在搜索出表现是否匹配？”

在一个字符串使用，而不是一个正则表达式这种特殊的情况下，确实re.search比re.match默认CPython的实现（我还没有在Python的其他化身测试这一点）稍快。

>>> print timeit.timeit(stmt="r.match(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('hello')", 
...    number = 10000000) 
3.29107403755 
>>> print timeit.timeit(stmt="r.search(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('hello')", 
...    number = 10000000) 
2.39184308052

展望C code behind those modules，它出现在搜索代码有一个内置的优化to quickly match patterns prefixed with a string lateral。在上面的示例中，整个模式是一个没有正则表达式模式的文字字符串，所以这个优化的例程用于匹配整个模式。

通知一旦我们介绍的正则表达式的符号和，作为文字串前缀变短的性能如何降低：

>>> print timeit.timeit(stmt="r.search(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('hell.')", 
...    number = 10000000) 

3.20765399933 
>>> 
>>> print timeit.timeit(stmt="r.search(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('hel.o')", 
...    number = 10000000) 
3.31512498856 
>>> print timeit.timeit(stmt="r.search(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('he.lo')", 
...    number = 10000000) 
3.31983995438 
>>> print timeit.timeit(stmt="r.search(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('h.llo')", 
...    number = 10000000) 
3.39261603355

对于包含正则表达式模式的图案的部分，SRE_MATCH被用于确定匹配。这与re.match背后的代码基本相同。

如果模式以正则表达式模式而非文字字符串开头，请注意结果如何接近（re.match稍快）。

>>> print timeit.timeit(stmt="r.match(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('.ello')", 
...    number = 10000000) 
3.22782492638 
>>> print timeit.timeit(stmt="r.search(s)", 
...    setup="import re; s = 'helloab'*100000; r = re.compile('.ello')", 
...    number = 10000000) 
3.31773591042

换句话说，忽略这一事实search和match具有不同的目的，re.search比re.match更快仅当模式是文字串。

当然，如果您使用的是文字字符串，您可能更愿意使用字符串操作。

>>> # Detecting exact matches 
>>> print timeit.timeit(stmt="s == r", 
...    setup="s = 'helloab'*100000; r = 'hello'", 
...    number = 10000000) 
0.339027881622 

>>> # Determine if string contains another string 
>>> print timeit.timeit(stmt="s in r", 
...    setup="s = 'helloab'*100000; r = 'hello'", 
...    number = 10000000) 
0.479326963425 


>>> # detecting prefix 
>>> print timeit.timeit(stmt="s.startswith(r)", 
...    setup="s = 'helloab'*100000; r = 'hello'", 
...    number = 10000000) 
1.49393510818 
>>> print timeit.timeit(stmt="s[:len(r)] == r", 
...    setup="s = 'helloab'*100000; r = 'hello'", 
...    number = 10000000) 
1.21005606651

来源

2012-10-10 09:09:01

神奇的研究和实验。特别感谢您提供源代码的链接。 – pswaminathan

re.match与re.search性能差异

回答

相关问题