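It depends on how the input is given to the POS tagger. For example: "A woman needs a man like a fish needs a bicycle"
If you tokenize it with NLTK's default word tokenizer and with a regex tokenizer, the resulting tags will differ.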
import nltk
from nltk.tokenize import RegexpTokenizer

# Raw string so the backslashes reach the regex engine unchanged.
TOKENIZER = RegexpTokenizer(r'(?u)\W+|\$[\d\.]+|\S+')
s = "a woman needs a man like a fish needs a bicycle"

# Tokenize the same sentence in two ways.
regex_tokenize = TOKENIZER.tokenize(s)
default_tokenize = nltk.word_tokenize(s)

# Tag each token list with the default NLTK POS tagger.
regex_tag = nltk.pos_tag(regex_tokenize)
default_tag = nltk.pos_tag(default_tokenize)

print(regex_tag)
print("\n")
print(default_tag)
The output is as follows:
Regex Tokenizer:
[('a', 'DT'), (' ', 'NN'), ('woman', 'NN'), (' ', ':'), ('needs', 'NNS'), (' ', 'VBP'), ('a', 'DT'), (' ', 'NN'), ('man', 'NN'), (' ', ':'), ('like', 'IN'), (' ', 'NN'), ('a', 'DT'), (' ', 'NN'), ('fish', 'NN'), (' ', ':'), ('needs', 'VBZ'), (' ', ':'), ('a', 'DT'), (' ', 'NN'), ('bicycle', 'NN')]
Default Tokenizer:
[('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man', 'NN'), ('like', 'IN'), ('a', 'DT'), ('fish', 'JJ'), ('needs', 'NNS'), ('a', 'DT'), ('bicycle', 'NN')]
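With the regex tokenizer, "fish" is tagged as a noun (NN), whereas with the default tokenizer it is tagged as an adjective (JJ). Because the tags depend on the tokenizer used, the resulting parse trees differ in structure as well.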
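To see where the difference comes from, it can help to look at the tokens themselves before they reach the tagger. The following is only a rough sketch: it uses the standard re module instead of NLTK, and it assumes that RegexpTokenizer (with its default gaps=False) returns the same matches as re.findall for this pattern. The names tokens, whitespace_tokens and word_tokens are made up for the illustration.

```python
import re

# Same pattern as above, without NLTK (assumption: RegexpTokenizer with
# gaps=False yields the same matches as re.findall for this pattern).
PATTERN = r'(?u)\W+|\$[\d\.]+|\S+'
s = "a woman needs a man like a fish needs a bicycle"

tokens = re.findall(PATTERN, s)

# The \W+ alternative also matches the spaces between words,
# so every space becomes a token of its own.
whitespace_tokens = [t for t in tokens if t.isspace()]
word_tokens = [t for t in tokens if not t.isspace()]

print(tokens)
print(len(tokens), len(whitespace_tokens))  # 21 tokens, 10 of them whitespace
print(word_tokens == s.split())             # True for this sentence
```

If that assumption holds, the tagger in the regex case sees a 21-token sequence with a space after every word rather than the 11 words, which would explain why the spaces carry tags of their own and why the context around "fish" and "needs" is different. Dropping the whitespace tokens before calling nltk.pos_tag would be one way to make the two inputs comparable.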
See http://stackoverflow.com/questions/30821188/python-ntlk-pos-tag-not-returnig-the-correct-pos – alvas