Python regular expressions: findall() and search()

>>> import re
>>> p = re.compile(r"(\b\w+)\s+\1")

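\b : word boundary
\w+ : one or more word characters (alphanumeric or underscore)
\s+ : one or more whitespace characters (can be a space, \t, \n, ...)
\1 : backreference to group 1 (= the part between (..))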
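
To see the pieces in context, here is a minimal sketch of the same pattern written in verbose mode, so each part can carry its own comment. The name `p_verbose` is introduced here only for illustration; it is not part of the original question.

```python
import re

# Same pattern as above, written with re.VERBOSE so whitespace inside the
# pattern is ignored and each part can be commented.
p_verbose = re.compile(r"""
    (\b\w+)   # group 1: a word, starting at a word boundary
    \s+       # one or more whitespace characters
    \1        # backreference: the same text that group 1 captured
""", re.VERBOSE)

m = p_verbose.search("I am in the the car.")
whole = m.group(0)   # the entire match
first = m.group(1)   # only what group 1 captured
print(whole)
print(first)
```

Group 0 always refers to the whole match, while group 1 is only the part between the parentheses.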

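This regex should find all double occurrences of a word, provided the two occurrences are next to each other with some whitespace in between.
The regex seems to work fine when using the search function: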

>>> p.search("I am in the the car.") 

<_sre.SRE_Match object; span=(8, 15), match='the the'>

The match found is the the, as I expected. The weird behaviour is in the findall function:

>>> p.findall("I am in the the car.") 

['the']

The match found is now only the. Why the difference?

2017-04-17 K.Mulier

Because 'findall' only returns the capturing groups (or otherwise the full match).

https://docs.python.org/3/library/re.html#re.findall "If one or more groups are present in the pattern, return a list of groups" – melpomene

Oh, now I see. Thank you. So I have to use a non-capturing group to solve this? I'll try it right now.

When a regular expression contains groups, findall() returns only the groups; from the documentation:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
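
The documented rule can be checked directly. The following is a small sketch, not part of the original answer, that runs findall() with zero, one, and two groups on the sentence from the question; the variable names are illustrative only.

```python
import re

text = "I am in the the car."

# No groups: findall returns the full matches.
no_group = re.findall(r"\bthe\s+the\b", text)
print(no_group)    # ['the the']

# One group: findall returns only what that group captured.
one_group = re.findall(r"(\b\w+)\s+\1", text)
print(one_group)   # ['the']

# Two groups: findall returns one tuple per match, one item per group.
two_groups = re.findall(r"((\b\w+)\s+\2)", text)
print(two_groups)  # [('the the', 'the')]

# search() returns a match object instead, which exposes both views.
m = re.search(r"(\b\w+)\s+\1", text)
print(m.group(0), "|", m.group(1))  # the the | the
```

This is why search() appeared to "work": the match object's default representation shows group 0, the whole match, whereas findall() reports the captured groups whenever any exist.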

You can't avoid using a group when you use a backreference, but you can put a new group around the whole pattern:

>>> p = re.compile(r"((\b\w+)\s+\2)") 
>>> p.findall("I am in the the car.") 
[('the the', 'the')]

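The outer group is group 1, so the backreference must point to group 2. You now have two groups, so each entry contains two results. Using a named group might make this more readable: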

>>> p = re.compile(r"((?P<word>\b\w+)\s+(?P=word))")

You can filter the results back down to just the outer group:

>>> [m[0] for m in p.findall("I am in the the car.")] 
['the the']
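
As an alternative that the answer above does not mention, re.finditer() yields match objects rather than group lists, so the original single-group pattern can be kept unchanged. This is only a sketch under that assumption; `full_matches` is a name chosen here for illustration.

```python
import re

# The original pattern from the question, with a single capturing group.
p = re.compile(r"(\b\w+)\s+\1")

# finditer yields match objects, so group(0) is always the whole match,
# regardless of how many capturing groups the pattern contains.
full_matches = [m.group(0) for m in p.finditer("I am in the the car.")]
print(full_matches)  # ['the the']
```

The trade-off is a slightly longer expression at the call site in exchange for not having to renumber the backreference.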

2017-04-17 14:31:15

Great answer! Thanks Martijn :-)
