2016-06-30 21 views
0

I want to match items from one list (single words) against items from a second list (full sentences). Here is my code: matching the shortest substring in a list using a for loop

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        if word in line:
            print(word, line)

The problem is that my code matches substrings, so when looking for occurrences of 'Python' I also get the sentence containing 'Pythons'; likewise, I get 'Funny' when I only want sentences containing the word 'Fun'.

I have tried adding spaces around the words in the list, but that is not an ideal solution, because the sentences may contain punctuation and then the code returns no match.
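A minimal illustration of that failure mode (sentences taken from the question): padding a token with spaces misses it at the start of a sentence and before punctuation.

```python
# Padding with spaces only matches tokens surrounded by spaces on both sides,
# so 'Time' at the start of the sentence and 'Fun' before '!' are both missed.
print(' Time ' in 'Time is High')   # False: 'Time' starts the sentence
print(' Fun ' in "That's Fun!")     # False: 'Fun' is followed by '!'
```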

Desired output:
- Time, Time is High
- Fun, That's Fun!
- Python, Python is Nice

+0

'Fun' and 'Fun!' are obviously not the same –

Answers

0

Since you want exact matches, it would be better to use == rather than in:

import string

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        for wrd in line.split():
            # strip removes punctuation from both ends of wrd
            if wrd.strip(string.punctuation) == word:
                print(word, line)
0

It is not that easy (it takes more lines of code) to retrieve "Fun!" for 'Fun' while not retrieving 'Pythons' for 'Python'. It can certainly be done, but at that point your rules are not very clear to me. Have a look at this, though:

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

print([(word, phrase) for phrase in sentences for word in tokens if word in phrase.split()])
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]

Below is the same thing, only this time using a good old for loop instead of a list comprehension. I thought it might help you understand the code above more easily.

a = []
for phrase in sentences:
    words_in_phrase = phrase.split()
    for words in tokens:
        if words in words_in_phrase:
            a.append((words, phrase))
print(a)
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]

What happens here is that the code returns the string it found and the phrase it was found in. It does this by splitting the phrases in the sentences list on whitespace. So 'Pythons' and 'Python' are not the same, as you want, but neither are "Fun!" and 'Fun'. It is also case-sensitive.
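If you also need to ignore punctuation and case, a sketch combining both ideas (stripping punctuation from each word and lower-casing before comparing) could look like this:

```python
import string

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

matches = []
for phrase in sentences:
    # Strip punctuation from both ends of each word, then compare case-insensitively
    words = [w.strip(string.punctuation).lower() for w in phrase.split()]
    for token in tokens:
        if token.lower() in words:
            matches.append((token, phrase))

print(matches)
# [('Time', 'Time is High'), ('Python', 'Python is Nice'), ('Fun', "That's Fun!")]
```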

0

You may want to use a dynamically generated regular expression, i.e. for 'Python' the regex looks like `\bPython\b`, where `\b` is a word boundary.

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

import re
for word in tokens:
    # Use a raw string: '\b' in a plain string is a backspace character,
    # not a word boundary. re.escape guards against regex metacharacters.
    regexp = re.compile(r'\b' + re.escape(word) + r'\b')
    for line in sentences:
        # search scans the whole line; match would only check its start
        if regexp.search(line):
            print(word, line)
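A quick check of the word-boundary behaviour with the raw-string pattern: `\bFun\b` matches 'Fun' followed by punctuation but not 'Fun' inside 'Funny'.

```python
import re

pattern = re.compile(r'\bFun\b')
print(bool(pattern.search("That's Fun!")))   # True: '!' counts as a word boundary
print(bool(pattern.search("Who's Funny")))   # False: 'Fun' sits inside 'Funny'
```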
0

It is better to tokenize the sentence than to split it on whitespace, because tokenization separates out the punctuation.

For example:

>>> 'test' in 'this is a test.'.split(' ')
False
>>> nltk.word_tokenize('this is a test.')
['this', 'is', 'a', 'test', '.']

Code:

import nltk

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for sentence in sentences:
    for token in tokens:
        if token in nltk.word_tokenize(sentence):
            print(token, sentence)
+0

Why does your code work!? Consider adding some context to the answer. – ppperry