2016-06-30 21 views
0

I want to match items from one list (single words) against items from a second list (full sentences). Here is my code: matching the shortest substring in a list using a for loop

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        if word in line:
            print(word, line)

The problem is that my code matches substrings, so when looking for occurrences of 'Python' I also get the sentence containing 'Pythons'; likewise, I get 'Funny' when I only want sentences containing the word 'Fun'.

I have tried adding spaces around the words in the list, but that is not an ideal solution, because the sentences may contain punctuation and then the code returns no match.
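A minimal illustration of that failure mode (sentences taken from the question): padding a token with spaces misses it at the start of a sentence and before punctuation.

```python
# Padding with spaces only matches tokens surrounded by spaces on both sides,
# so 'Time' at the start of the sentence and 'Fun' before '!' are both missed.
print(' Time ' in 'Time is High')   # False: 'Time' starts the sentence
print(' Fun ' in "That's Fun!")     # False: 'Fun' is followed by '!'
```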

Desired output:
- Time, Time is High
- Fun, That's Fun!
- Python, Python is Nice

+0

'Fun' and 'Fun!' are obviously not the same –

Answers

0

Since you want exact matches, it would be better to use == rather than in:

import string

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        for wrd in line.split():
            # strip removes punctuation from both ends of wrd
            if wrd.strip(string.punctuation) == word:
                print(word, line)
0

It is not that easy (it takes more lines of code) to retrieve "Fun!" for 'Fun' while not retrieving 'Pythons' for 'Python'. It can certainly be done, but at that point your rules are not very clear to me. Have a look at this, though:

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

print([(word, phrase) for phrase in sentences for word in tokens if word in phrase.split()])
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]

Below is the same thing, only this time using a good old for loop instead of a list comprehension. I thought it might help you understand the code above more easily.

a = []
for phrase in sentences:
    words_in_phrase = phrase.split()
    for words in tokens:
        if words in words_in_phrase:
            a.append((words, phrase))
print(a)
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]

What happens here is that the code returns the string it found and the phrase it was found in. It does this by splitting the phrases in the sentences list on whitespace. So 'Pythons' and 'Python' are not the same, as you want, but neither are "Fun!" and 'Fun'. It is also case-sensitive.
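If you also need to ignore punctuation and case, a sketch combining both ideas (stripping punctuation from each word and lower-casing before comparing) could look like this:

```python
import string

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

matches = []
for phrase in sentences:
    # Strip punctuation from both ends of each word, then compare case-insensitively
    words = [w.strip(string.punctuation).lower() for w in phrase.split()]
    for token in tokens:
        if token.lower() in words:
            matches.append((token, phrase))

print(matches)
# [('Time', 'Time is High'), ('Python', 'Python is Nice'), ('Fun', "That's Fun!")]
```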

0

You may want to use a dynamically generated regular expression, i.e. for 'Python' the regex looks like `\bPython\b`, where `\b` is a word boundary.

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

import re
for word in tokens:
    # Use a raw string: '\b' in a plain string is a backspace character,
    # not a word boundary. re.escape guards against regex metacharacters.
    regexp = re.compile(r'\b' + re.escape(word) + r'\b')
    for line in sentences:
        # search scans the whole line; match would only check its start
        if regexp.search(line):
            print(word, line)
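A quick check of the word-boundary behaviour with the raw-string pattern: `\bFun\b` matches 'Fun' followed by punctuation but not 'Fun' inside 'Funny'.

```python
import re

pattern = re.compile(r'\bFun\b')
print(bool(pattern.search("That's Fun!")))   # True: '!' counts as a word boundary
print(bool(pattern.search("Who's Funny")))   # False: 'Fun' sits inside 'Funny'
```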
0

It is better to tokenize the sentence than to split it on whitespace, because tokenization separates out the punctuation.

For example:

>>> 'test' in 'this is a test.'.split(' ')
False
>>> nltk.word_tokenize('this is a test.')
['this', 'is', 'a', 'test', '.']

Code:

import nltk

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for sentence in sentences:
    for token in tokens:
        if token in nltk.word_tokenize(sentence):
            print(token, sentence)
+0

Why does your code work!? Consider adding some context to the answer. – ppperry