
Words missing from the NLTK vocabulary - Python

I am testing the vocabulary of the NLTK package. I used the code below and expected every check to print True.

import nltk 

english_vocab = set(w.lower() for w in nltk.corpus.words.words()) 

print ('answered' in english_vocab) 
print ('unanswered' in english_vocab) 
print ('altered' in english_vocab) 
print ('alter' in english_vocab) 
print ('looks' in english_vocab) 
print ('look' in english_vocab) 

But my results are shown below. Are so many words really missing, or is it just certain forms of the words that are absent? Am I missing something?

False 
True 
False 
True 
False 
True 

Answers


Indeed, the corpus is not an exhaustive list of all English words; it is a collection of texts. A more appropriate way to check whether a word is a valid English word is to use WordNet:

from nltk.corpus import wordnet as wn 

print(wn.synsets('answered')) 
# [Synset('answer.v.01'), Synset('answer.v.02'), Synset('answer.v.03'), Synset('answer.v.04'), Synset('answer.v.05'), Synset('answer.v.06'), Synset('suffice.v.01'), Synset('answer.v.08'), Synset('answer.v.09'), Synset('answer.v.10')] 

print(wn.synsets('unanswered')) 
# [Synset('unanswered.s.01')] 

print(wn.synsets('notaword')) 
# [] 
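
To apply this to the words from the question, a small helper can wrap wn.synsets. This is only a sketch; the name is_english_word is illustrative, not an NLTK API:

from nltk.corpus import wordnet as wn 

def is_english_word(word): 
    # Treat a word as English if WordNet knows at least one synset for it. 
    # WordNet applies morphological analysis, so inflected forms such as 
    # 'answered' or 'looks' resolve to their base lemmas. 
    return len(wn.synsets(word)) > 0 

for w in ['answered', 'unanswered', 'altered', 'alter', 'looks', 'look']: 
    print(w, is_english_word(w)) 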

NLTK corpora do not actually store every single word; they are defined as "large bodies of text".

For example, you are using the words corpus, and we can check how it is defined by calling its readme() method:

>>> print(nltk.corpus.words.readme()) 
Wordlists 

en: English, http://en.wikipedia.org/wiki/Words_(Unix) 
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932) 
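
The readme mentions two word lists; assuming the standard corpus-reader API, you can list them and check the size of the full English list like this:

>>> print(nltk.corpus.words.fileids())        # should show the 'en' and 'en-basic' lists 
>>> print(len(nltk.corpus.words.words('en'))) # number of entries in the full English word list 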

The Unix word list is not exhaustive, so it may indeed be missing some words. Corpora are by nature incomplete (hence the emphasis on natural language).

That said, you may want to try a corpus derived from a dictionary, such as brown:

>>> print(nltk.corpus.brown.readme()) 
BROWN CORPUS 

A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. 

by W. N. Francis and H. Kucera (1964) 
Department of Linguistics, Brown University 
Providence, Rhode Island, USA 

Revised 1971, Revised and Amplified 1979 

http://www.hit.uib.no/icame/brown/bcm.html 

Distributed with the permission of the copyright holder, redistribution permitted.
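
As a rough sketch (not part of the original answer), the membership test from the question can be repeated against brown; because brown is tokenized running text, it contains inflected forms that actually occur in its documents, though it is still not exhaustive either:

import nltk 
# nltk.download('brown')  # uncomment if the Brown corpus has not been downloaded yet 

# Build a lowercase vocabulary from the Brown corpus tokens. 
brown_vocab = set(w.lower() for w in nltk.corpus.brown.words()) 

for w in ['answered', 'unanswered', 'altered', 'alter', 'looks', 'look']: 
    print(w, w in brown_vocab) 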