Python: extract sentences containing a word

I want to extract all sentences from a text that contain a specified word.

import re

txt = "I like to eat apple. Me too. Let's go buy some apples."
txt = "." + txt 
re.findall(r"\."+".+"+"apple"+".+"+"\.", txt)

But it returns:

[".I like to eat apple. Me too. Let's go buy some apples."]

instead of:

[".I like to eat apple.", "Let's go buy some apples."]

Any help?

Source

2013-04-16 user2187202

In [3]: re.findall(r"([^.]*?apple[^.]*\.)",txt)                                
Out[4]: ['I like to eat apple.', " Let's go buy some apples."]

Source

2013-04-16 09:09:20 Kent

You can use str.split:

>>> txt="I like to eat apple. Me too. Let's go buy some apples." 
>>> txt.split('. ') 
['I like to eat apple', 'Me too', "Let's go buy some apples."] 

>>> [ t for t in txt.split('. ') if 'apple' in t] 
['I like to eat apple', "Let's go buy some apples."]

Source

2013-04-16 09:06:27

In [7]: import re 

In [8]: txt=".I like to eat apple. Me too. Let's go buy some apples." 

In [9]: re.findall(r'([^.]*apple[^.]*)', txt) 
Out[9]: ['I like to eat apple', " Let's go buy some apples"]

Note, however, that @jamylak's split-based solution is faster:

In [10]: %timeit re.findall(r'([^.]*apple[^.]*)', txt) 
1000000 loops, best of 3: 1.96 us per loop 

In [11]: %timeit [s+ '.' for s in txt.split('.') if 'apple' in s] 
1000000 loops, best of 3: 819 ns per loop

For larger strings the speed difference is smaller, but still significant:

In [24]: txt = txt*10000 

In [25]: %timeit re.findall(r'([^.]*apple[^.]*)', txt) 
100 loops, best of 3: 8.49 ms per loop 

In [26]: %timeit [s+'.' for s in txt.split('.') if 'apple' in s] 
100 loops, best of 3: 6.35 ms per loop
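Outside IPython, the same comparison can be sketched with the standard timeit module. This is only a rough sketch, not the benchmark run above: the function names with_regex and with_split and the repeat count are assumptions chosen for illustration, and absolute timings depend on the machine.

```python
import re
import timeit

# Same enlarged input as in the benchmark above.
txt = "I like to eat apple. Me too. Let's go buy some apples." * 10000

def with_regex():
    # Sentences containing "apple", without the trailing dot.
    return re.findall(r"[^.]*apple[^.]*", txt)

def with_split():
    # Split on dots, keep pieces containing "apple", re-append the dot.
    return [s + "." for s in txt.split(".") if "apple" in s]

# number=10 keeps the run short; increase it for steadier numbers.
regex_seconds = timeit.timeit(with_regex, number=10)
split_seconds = timeit.timeit(with_split, number=10)
print(f"regex: {regex_seconds:.4f}s, split: {split_seconds:.4f}s")
```

Both functions find the same number of sentences; only the trailing dot differs between their results.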

Source

2013-04-16 09:07:00 unutbu

+1 nice answer! If you create a 'txt = txt * 10000', the '%timeit' results will be clearer – Kent

Thanks Kent. I added '%timeit' benchmarks for a larger string. – unutbu

No need for regex:

>>> txt = "I like to eat apple. Me too. Let's go buy some apples." 
>>> [sentence + '.' for sentence in txt.split('.') if 'apple' in sentence] 
['I like to eat apple.', " Let's go buy some apples."]

Source

2013-04-16 09:07:14 jamylak

Thanks jamylak – user2187202

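@user2187202 You can accept my answer if you want, or accept the regex solution if that is actually what you need. You did tag this as a regex question, and I'm not sure whether regex is a requirement or not – jamylak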

r"\."+".+"+"apple"+".+"+"\."

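This line is a bit odd; why concatenate so many separate strings? You could just use r'\..+apple.+\.'.

Anyway, the problem with your regular expression is its greediness. By default, x+ will match x as often as it possibly can. So your .+ will match as many characters (any characters) as possible, including dots and apples.

What you want to use instead is a non-greedy expression; you can usually get one by adding a ? at the end: .+?.

This will get you the following result: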

['.I like to eat apple. Me too.']

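As you can see, you no longer get both apple sentences, but you still get the Me too. sentence. That is because you still match the . after apple, which makes it impossible not to capture the following sentence as well.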
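The difference between the two quantifiers can be reproduced with a short sketch. It uses the sample text from the question with the dot prepended; the variable names greedy and lazy are only illustrative.

```python
import re

txt = ".I like to eat apple. Me too. Let's go buy some apples."

# Greedy: each .+ grabs as much as it can, so the match runs to the last dot.
greedy = re.findall(r"\..+apple.+\.", txt)

# Non-greedy: each .+? grabs as little as it can, but the .+? after "apple"
# must still consume at least one character, and that character is the
# dot that ends the first sentence.
lazy = re.findall(r"\..+?apple.+?\.", txt)

print(greedy)  # the whole text as a single match
print(lazy)    # ['.I like to eat apple. Me too.']
```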

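A working regular expression would be this: r'\.[^.]*?apple[^.]*?\.'

Here you no longer match just any character, but only characters that are not dots themselves. We also allow matching no characters at all (because after apple in the first sentence there are no non-dot characters). Using that expression results in this: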

['.I like to eat apple.', ". Let's go buy some apples."]
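The same idea can be wrapped in a small helper. This is a sketch under assumptions, not part of the answer above: the function name sentences_containing is hypothetical, the leading \. is dropped so the text needs no dot prepended, and the search word is escaped with re.escape in case it contains regex metacharacters.

```python
import re

def sentences_containing(text, word):
    """Return the dot-terminated sentences of `text` that contain `word` as a substring."""
    # [^.]* never crosses a sentence boundary, so each match is one sentence.
    pattern = r"[^.]*" + re.escape(word) + r"[^.]*\."
    # Strip the space left over from the gap between sentences.
    return [m.strip() for m in re.findall(pattern, text)]

txt = "I like to eat apple. Me too. Let's go buy some apples."
print(sentences_containing(txt, "apple"))
# ['I like to eat apple.', "Let's go buy some apples."]
```

Like the patterns above, this treats every dot as a sentence boundary, so abbreviations and decimal numbers would split sentences incorrectly.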

Source

2013-04-16 09:11:56 poke

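Evidently, the sample in the question is a case of "extract sentence containing substring" rather than "extract sentence containing word". Here is how to solve the "extract sentence containing word" problem in Python.

A word can appear at the beginning, in the middle, or at the end of a sentence. Not limiting myself to the example in the question, I will provide a general function for searching for a word in a sentence: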

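import re

def searchWordinSentence(word, sentence):
    # Match the word between spaces, at the start, or at the end of the sentence.
    w = re.escape(word)
    pattern = re.compile(' ' + w + ' |^' + w + ' | ' + w + '$')
    return re.search(pattern, sentence) is not None

Limited to the example in the question, we can solve it like this:

txt="I like to eat apple. Me too. Let's go buy some apples." 
word = "apple" 
print([t for t in txt.split('. ') if searchWordinSentence(word, t)])

The corresponding output is: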

['I like to eat apple']
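A whole-word test can also be sketched with the regex word-boundary anchor \b, which handles a word followed by punctuation or standing alone as the entire sentence. This swaps in a different technique from the space-based pattern above; the helper name contains_word is hypothetical.

```python
import re

def contains_word(word, sentence):
    """True if `word` occurs in `sentence` as a whole word (hypothetical helper)."""
    # \b matches between a word character and a non-word character (or string edge),
    # so "apple" matches in "apple," but not inside "apples".
    return re.search(r"\b" + re.escape(word) + r"\b", sentence) is not None

txt = "I like to eat apple. Me too. Let's go buy some apples."
print([t for t in txt.split(". ") if contains_word("apple", t)])
# ['I like to eat apple']
```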

Source

2017-12-13 09:00:22
