WordNetLemmatizer not returning the right lemma unless POS is explicit - Python NLTK

I'm lemmatizing the TED dataset transcripts, and I've noticed something strange: not all words are being lemmatized. For example,

selected -> select

which is correct.

However, involved !-> involve and horsing !-> horse unless I explicitly pass the 'v' (verb) argument.

In the Python terminal I get the right output, but not in my code:

>>> from nltk.stem import WordNetLemmatizer 
>>> from nltk.corpus import wordnet 
>>> lem = WordNetLemmatizer() 
>>> lem.lemmatize('involved','v') 
u'involve' 
>>> lem.lemmatize('horsing','v') 
u'horse'

The relevant part of the code is this:

for l in LDA_Row[0].split('+'): 
    w=str(l.split('*')[1]) 
    word=lmtzr.lemmatize(w) 
    wordv=lmtzr.lemmatize(w,'v') 
    print wordv, word 
    # if word != wordv: 
    #     print word, wordv

The whole code is here.

What is the problem?

Source

2015-10-05 FlyingAura

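The code doesn't run without installing things... could you extract the input, e.g. what does LDA_Row look like? – rebeling

That's because your POS tag is wrong. P/S: Next time, please try not to post the full code, but only the snippets that explain the problem. Otherwise Stack Overflow users might try to close the question as "unclear what you're asking" or as a "my code is not working" question =) – alvas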

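The lemmatizer needs the correct POS tag to be accurate. If you use the default settings of WordNetLemmatizer.lemmatize(), the default tag is noun, see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39

To resolve the problem, always POS-tag your data before lemmatizing, e.g.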

>>> from nltk.stem import WordNetLemmatizer 
>>> from nltk import pos_tag, word_tokenize 
>>> wnl = WordNetLemmatizer() 
>>> sent = 'This is a foo bar sentence' 
>>> pos_tag(word_tokenize(sent)) 
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')] 
>>> for word, tag in pos_tag(word_tokenize(sent)): 
...  wntag = tag[0].lower() 
...  wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None 
...  if not wntag: 
...    lemma = word 
...  else: 
...    lemma = wnl.lemmatize(word, wntag) 
...  print lemma 
... 
This 
be 
a 
foo 
bar 
sentence

Note that 'is -> be', i.e.

>>> wnl.lemmatize('is') 
'is' 
>>> wnl.lemmatize('is', 'v') 
u'be'

To answer the question with the words from your example:

>>> sent = 'These sentences involves some horsing around' 
>>> for word, tag in pos_tag(word_tokenize(sent)): 
...  wntag = tag[0].lower() 
...  wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None 
...  lemma = wnl.lemmatize(word, wntag) if wntag else word 
...  print lemma 
... 
These 
sentence 
involve 
some 
horse 
around
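One detail in the snippets above: `tag[0].lower()` turns the Penn Treebank adjective tag `JJ` into `'j'`, which is not in `['a', 'r', 'n', 'v']`, so adjectives are passed through unchanged. A slightly more explicit mapping could look like the sketch below. The helper names `penn_to_wordnet` and `lemmatize_tagged` are made up for illustration and are not part of NLTK, and the lemmatizer is passed in as a plain callable so the logic stays independent of any installed corpus.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is none.

    Hypothetical helper (not an NLTK function).
    """
    if tag.startswith('JJ'):
        return 'a'  # adjectives: JJ, JJR, JJS
    if tag.startswith('VB'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('NN'):
        return 'n'  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith('RB'):
        return 'r'  # adverbs: RB, RBR, RBS (but not RP, the particle tag)
    return None


def lemmatize_tagged(tagged_words, lemmatize):
    """Lemmatize (word, penn_tag) pairs with a lemmatize(word, pos) callable.

    Words whose tag has no WordNet equivalent are kept unchanged.
    """
    lemmas = []
    for word, tag in tagged_words:
        wntag = penn_to_wordnet(tag)
        lemmas.append(lemmatize(word, wntag) if wntag else word)
    return lemmas
```

With NLTK installed, this would be called as `lemmatize_tagged(pos_tag(word_tokenize(sent)), wnl.lemmatize)`, reusing the `wnl`, `pos_tag` and `word_tokenize` names from the session above.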

Note that there are some quirks with WordNetLemmatizer:

Also, NLTK's default POS tagger is undergoing some major changes to improve accuracy:

And for an off-the-shelf lemmatizer solution, you can take a look at https://github.com/alvations/pywsd and at how I've added some try-excepts to catch words that are not in WordNet, see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66
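For isolated words with no sentence to tag (topic terms, for instance), one possible fallback is to try each WordNet POS in turn and keep the first lemma that differs from the input. This is only a sketch under that assumption, not what pywsd does. `lemmatize_any_pos` is a hypothetical name, and the lemmatizer is again passed in as a callable.

```python
def lemmatize_any_pos(word, lemmatize, pos_order=('n', 'v', 'a', 'r')):
    """Try each POS in pos_order and return the first lemma that differs from word.

    Hypothetical helper: `lemmatize` is any callable taking (word, pos),
    for example WordNetLemmatizer().lemmatize.
    """
    for pos in pos_order:
        lemma = lemmatize(word, pos)
        if lemma != word:
            return lemma
    # No POS changed the word, so return it as-is.
    return word
```

The order in `pos_order` matters for ambiguous forms, so this is a heuristic rather than a replacement for proper POS tagging.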

Source

2015-10-06 00:22:29 alvas

Very helpful, thank you! – FlyingAura
