2015-10-05 15 views
1

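WordNetLemmatizer not returning the right lemma unless POS is explicit - Python NLTK

I'm lemmatizing the TED dataset transcripts, and I noticed something strange: not all words are being lemmatized. For example: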

selected -> select 

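This is correct.

However, involved !-> involve and horsing !-> horse unless I explicitly pass the 'v' (verb) attribute.

In the Python terminal I get the correct output, but not in my code: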

>>> from nltk.stem import WordNetLemmatizer 
>>> from nltk.corpus import wordnet 
>>> lem = WordNetLemmatizer() 
>>> lem.lemmatize('involved','v') 
u'involve' 
>>> lem.lemmatize('horsing','v') 
u'horse' 

The relevant part of the code is this:

for l in LDA_Row[0].split('+'): 
    w=str(l.split('*')[1]) 
    word=lmtzr.lemmatize(w) 
    wordv=lmtzr.lemmatize(w,'v') 
    print wordv, word 
    # if word is not wordv: 
    # print word, wordv 

The full code is here.

What is the problem?

+0

The code doesn't run without installing things... Can you extract the input, e.g. what does LDA_Row look like? – rebeling

+0

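That's because your POS tag is wrong. P/S: next time, please try not to post the full code; post a snippet that illustrates the problem instead. Otherwise, Stack Overflow users may try to close the question as "unclear what you're asking" or as a "my code doesn't work" question =) – alvas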

Answer

2

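The lemmatizer needs the correct POS tag to be accurate. If you call WordNetLemmatizer.lemmatize() with its default settings, the tag defaults to noun; see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39

To resolve this, always POS-tag your data before lemmatizing, e.g.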

>>> from nltk.stem import WordNetLemmatizer 
>>> from nltk import pos_tag, word_tokenize 
>>> wnl = WordNetLemmatizer() 
>>> sent = 'This is a foo bar sentence' 
>>> pos_tag(word_tokenize(sent)) 
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')] 
>>> for word, tag in pos_tag(word_tokenize(sent)): 
...  wntag = tag[0].lower() 
...  wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None 
...  if not wntag: 
...    lemma = word 
...  else: 
...    lemma = wnl.lemmatize(word, wntag) 
...  print lemma 
... 
This 
be 
a 
foo 
bar 
sentence 

Note that 'is' -> 'be' only when the verb tag is given, i.e.

>>> wnl.lemmatize('is') 
'is' 
>>> wnl.lemmatize('is', 'v') 
u'be' 
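One caveat about the `tag[0].lower()` shortcut used in these loops: Penn Treebank adjective tags start with `J` (`JJ`, `JJR`, `JJS`), so they never map to WordNet's `'a'` and adjectives are passed through unlemmatized. A minimal sketch of an explicit mapping follows. The helper name `penn_to_wordnet` is hypothetical and not part of NLTK, and it assumes the tags come from the default Penn Treebank tagset that `pos_tag` returns.

```python
# Hypothetical helper (not part of NLTK): map a Penn Treebank tag, as returned
# by nltk.pos_tag, to the one-letter POS codes WordNetLemmatizer accepts.
def penn_to_wordnet(tag):
    """Return 'a', 'r', 'n' or 'v' for a Penn Treebank tag, else None."""
    if tag.startswith('JJ'):   # JJ, JJR, JJS -> adjective
        return 'a'
    if tag.startswith('RB'):   # RB, RBR, RBS -> adverb
        return 'r'
    if tag.startswith('NN'):   # NN, NNS, NNP, NNPS -> noun
        return 'n'
    if tag.startswith('VB'):   # VB, VBD, VBG, VBN, VBP, VBZ -> verb
        return 'v'
    return None                # DT, IN, RP, ... have no WordNet equivalent
```

Unlike the first-letter shortcut, this also leaves particles (`RP`) untouched instead of treating them as adverbs; whether that is preferable depends on your data.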

To answer the question using the words from your example:

>>> sent = 'These sentences involves some horsing around' 
>>> for word, tag in pos_tag(word_tokenize(sent)): 
...  wntag = tag[0].lower() 
...  wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None 
...  lemma = wnl.lemmatize(word, wntag) if wntag else word 
...  print lemma 
... 
These 
sentence 
involve 
some 
horse 
around 
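For reuse outside the interpreter, the same loop can be wrapped in a function that also runs on Python 3. This is only a sketch: `lemmatize_tagged` and `PENN_PREFIX_TO_WN` are names made up for illustration, and the lemmatizer is passed in as a callable so the function does not depend on how NLTK is set up.

```python
# Penn Treebank two-letter prefix -> WordNet POS code (an assumption of this
# sketch; tags outside this table are treated as having no WordNet POS).
PENN_PREFIX_TO_WN = {'JJ': 'a', 'RB': 'r', 'NN': 'n', 'VB': 'v'}


def lemmatize_tagged(tagged_words, lemmatize):
    """Lemmatize an iterable of (word, penn_tag) pairs.

    `lemmatize` is any callable taking (word, pos), for example
    WordNetLemmatizer().lemmatize. Words whose tag has no WordNet
    equivalent are returned unchanged.
    """
    lemmas = []
    for word, tag in tagged_words:
        wn_pos = PENN_PREFIX_TO_WN.get(tag[:2])
        lemmas.append(lemmatize(word, wn_pos) if wn_pos else word)
    return lemmas
```

With NLTK and its WordNet and tagger data installed, `lemmatize_tagged(pos_tag(word_tokenize(sent)), wnl.lemmatize)` should behave like the loop above, except that adjectives are now lemmatized with the `'a'` tag.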

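Note that there are some quirks with WordNetLemmatizer:

Also, NLTK's default POS tagger is undergoing some major changes to improve accuracy:

For an off-the-shelf lemmatizer solution, you can take a look at https://github.com/alvations/pywsd, where I've added some try-excepts to catch words that are not in WordNet; see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66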

+0

Very helpful, thank you! – FlyingAura