2017-01-24 32 views

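Lemmatize a string according to POS (NLP)

I want to lemmatize a string according to part of speech, but I get an error at the final stage. My code: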

import nltk 
from nltk.stem import * 
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.corpus import wordnet 
wordnet_lemmatizer = WordNetLemmatizer() 
text = word_tokenize('People who help the blinging lights are the way of the future and are heading properly to their goals') 
tagged = nltk.pos_tag(text) 

def get_wordnet_pos(treebank_tag): 
    if treebank_tag.startswith('J'): 
        return wordnet.ADJ 
    elif treebank_tag.startswith('V'): 
        return wordnet.VERB 
    elif treebank_tag.startswith('N'): 
        return wordnet.NOUN 
    elif treebank_tag.startswith('R'): 
        return wordnet.ADV 
    else: 
        return '' 

for word in tagged: print(wordnet_lemmatizer.lemmatize(word,pos='v'), end=" ") 
--------------------------------------------------------------------------- 
AttributeError       Traceback (most recent call last) 
<ipython-input-40-afb22c78f770> in <module>() 
----> 1 for word in tagged: print(wordnet_lemmatizer.lemmatize(word,pos='v'), end=" ") 

E:\Miniconda3\envs\uol1\lib\site-packages\nltk\stem\wordnet.py in lemmatize(self, word, pos) 
    38 
    39  def lemmatize(self, word, pos=NOUN): 
---> 40   lemmas = wordnet._morphy(word, pos) 
    41   return min(lemmas, key=len) if lemmas else word 
    42 

E:\Miniconda3\envs\uol1\lib\site-packages\nltk\corpus\reader\wordnet.py in _morphy(self, form, pos) 
    1710 
    1711   # 1. Apply rules once to the input to get y1, y2, y3, etc. 
-> 1712   forms = apply_rules([form]) 
    1713 
    1714   # 2. Return all that are in the database (and check the original too) 

E:\Miniconda3\envs\uol1\lib\site-packages\nltk\corpus\reader\wordnet.py in apply_rules(forms) 
    1690   def apply_rules(forms): 
    1691    return [form[:-len(old)] + new 
-> 1692      for form in forms 
    1693      for old, new in substitutions 
    1694      if form.endswith(old)] 

E:\Miniconda3\envs\uol1\lib\site-packages\nltk\corpus\reader\wordnet.py in <listcomp>(.0) 
    1692      for form in forms 
    1693      for old, new in substitutions 
-> 1694      if form.endswith(old)] 
    1695 
    1696   def filter_forms(forms): 

I would like to lemmatize the whole string in one pass, using each word's part of speech. Please help.


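I don't fully understand your approach: you want to lemmatize the words after checking their POS, to make sure you get the correct lemma, right? If so, could you give the expected input and output? Also, what is the point of 'get_wordnet_pos()'? I don't see it used anywhere. – patrick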


Take a look at https://gist.github.com/alvations/07758d02412d928414bb – alvas

Answer


First, try not to mix top-level, absolute, and relative imports like this:

import nltk 
from nltk.stem import * 
from nltk import pos_tag, word_tokenize 

This would be better:

from nltk import sent_tokenize, word_tokenize 
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet as wn 

(See "Absolute vs. explicit relative import of Python module".)

The error you are getting is most likely because you are passing the output of pos_tag directly as the input to WordNetLemmatizer.lemmatize(), i.e.:

>>> from nltk import pos_tag 
>>> from nltk.stem import WordNetLemmatizer 

>>> wnl = WordNetLemmatizer() 
>>> sent = 'People who help the blinging lights are the way of the future and are heading properly to their goals'.split() 

>>> pos_tag(sent) 
[('People', 'NNS'), ('who', 'WP'), ('help', 'VBP'), ('the', 'DT'), ('blinging', 'NN'), ('lights', 'NNS'), ('are', 'VBP'), ('the', 'DT'), ('way', 'NN'), ('of', 'IN'), ('the', 'DT'), ('future', 'NN'), ('and', 'CC'), ('are', 'VBP'), ('heading', 'VBG'), ('properly', 'RB'), ('to', 'TO'), ('their', 'PRP$'), ('goals', 'NNS')] 
>>> pos_tag(sent)[0] 
('People', 'NNS') 

>>> first_word = pos_tag(sent)[0] 
>>> wnl.lemmatize(first_word) 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/usr/local/lib/python2.7/dist-packages/nltk/stem/wordnet.py", line 40, in lemmatize 
    lemmas = wordnet._morphy(word, pos) 
    File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1712, in _morphy 
    forms = apply_rules([form]) 
    File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1694, in apply_rules 
    if form.endswith(old)] 
AttributeError: 'tuple' object has no attribute 'endswith' 
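The traceback above can be reproduced without NLTK at all, which may make the cause easier to see. The following is only a minimal sketch: `tagged_word` is a hand-written stand-in for one element of the pos_tag output, not something NLTK produced here.

```python
# A hand-written stand-in for one (word, tag) element of pos_tag's output.
tagged_word = ("People", "NNS")

# The lemmatizer calls string methods such as endswith() on its argument.
# A tuple has no such method, which is what the traceback reports.
try:
    tagged_word.endswith("s")
    failed = False
except AttributeError as err:
    failed = True
    message = str(err)

# Unpacking the tuple yields the plain string that lemmatize() expects.
word, tag = tagged_word
```

After unpacking, `word` is an ordinary str and `tag` can be used to choose the part of speech.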

The input to WordNetLemmatizer.lemmatize() should be a str, not a tuple, so if you do this:

>>> from nltk.corpus import wordnet as wn 
>>> tagged_sent = pos_tag(sent) 

>>> def penn2morphy(penntag, returnNone=False): 
...     # Map the first two characters of a Penn Treebank tag to a WordNet POS. 
...     morphy_tag = {'NN': wn.NOUN, 'JJ': wn.ADJ, 
...                   'VB': wn.VERB, 'RB': wn.ADV} 
...     try: 
...         return morphy_tag[penntag[:2]] 
...     except KeyError: 
...         return None if returnNone else '' 
... 

>>> for word, tag in tagged_sent: 
...     wntag = penn2morphy(tag) 
...     if wntag: 
...         print(wnl.lemmatize(word, pos=wntag)) 
...     else: 
...         print(word) 
... 
People 
who 
help 
the 
blinging 
light 
be 
the 
way 
of 
the 
future 
and 
be 
head 
properly 
to 
their 
goal 
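Since the question asks for the whole string to be lemmatized in one pass, the loop above could be wrapped up roughly as follows. This is a sketch under stated assumptions, not the answer's own code: the names `penn_to_wordnet`, `lemmatize_tagged` and `lemmatize_text` are hypothetical, the one-letter strings stand in for wordnet.NOUN, wordnet.VERB, wordnet.ADJ and wordnet.ADV, and `lemmatize_text` needs NLTK with its tokenizer, tagger and WordNet data installed.

```python
# WordNet's POS constants are plain one-letter strings
# (wordnet.NOUN == 'n', VERB == 'v', ADJ == 'a', ADV == 'r').
PENN_PREFIX_TO_WORDNET = {"NN": "n", "VB": "v", "JJ": "a", "RB": "r"}


def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS, or None if none applies."""
    return PENN_PREFIX_TO_WORDNET.get(penn_tag[:2])


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs and join the results into one string.

    `lemmatize` is any callable taking (word, pos), for example
    WordNetLemmatizer().lemmatize.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        pos = penn_to_wordnet(tag)
        # Words with no WordNet POS (determiners, pronouns, ...) pass through.
        lemmas.append(lemmatize(word, pos) if pos else word)
    return " ".join(lemmas)


def lemmatize_text(text):
    """Tokenize, tag and lemmatize `text` with NLTK (imported lazily)."""
    from nltk import pos_tag, word_tokenize
    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    return lemmatize_tagged(pos_tag(word_tokenize(text)), wnl.lemmatize)
```

Calling `lemmatize_text(...)` on the sentence from the question would return a single space-separated string. Passing the lemmatizer in as a callable keeps the tag handling separate from NLTK, so that part can be checked without downloading any corpora.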

If you prefer a simpler approach:

pip install pywsd 

Then:

>>> from pywsd.utils import lemmatize, lemmatize_sentence 
>>> sent = 'People who help the blinging lights are the way of the future and are heading properly to their goals' 
>>> lemmatize_sentence(sent) 
['people', 'who', 'help', 'the', u'bling', u'light', u'be', 'the', 'way', 'of', 'the', 'future', 'and', u'be', u'head', 'properly', 'to', 'their', u'goal']