How to provide (or generate) tags for nltk lemmatizers

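I have a set of documents that I would like to transform into a form that lets me compute tf-idf for the words in those documents (so that each document is represented by a vector of tf-idf values).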

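I thought it would be enough to call WordNetLemmatizer.lemmatize(word) and then PorterStemmer, but 'have', 'has', 'had', etc. are not all being reduced to 'have' by the lemmatizer, and the same goes for other words. Then I read that I am supposed to give the lemmatizer a hint: a tag representing the type of word, whether it is a noun, verb, adjective, etc.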

My question is: how do I get these tags? What am I supposed to run on those documents to get them?

I am using python3.4, and I am lemmatizing + stemming a single word at a time. I tried WordNetLemmatizer and EnglishStemmer from nltk, and also stem() from stemming.porter2.


2016-11-12 Zbyszek M.

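OK, I searched some more and found out how to get these tags. First, some preprocessing is needed to make sure the file can be tokenized (in my case it was about removing some leftovers from the pdf-to-txt conversion).

Then the file has to be tokenized into sentences, each sentence has to be split into a list of words, and that list can then be tagged by the nltk tagger. With the tags, lemmatization can be done, and stemming can then be applied on top of it.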

from nltk.tokenize import sent_tokenize, word_tokenize
# use sent_tokenize to split text into sentences, and word_tokenize
# to split sentences into words
from nltk.tag import pos_tag
# use this to generate a list of (word, tag) tuples;
# each tag can then be translated into a wordnet tag as in
# [this response][1].
from nltk.corpus import wordnet  # provides the ADJ/VERB/NOUN/ADV constants
from nltk.stem.wordnet import WordNetLemmatizer
from stemming.porter2 import stem

# code from response mentioned above
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''


# myInput is the path to the preprocessed text file
with open(myInput, 'r') as f:
    data = f.read()

sentences = sent_tokenize(data)
ignoreTypes = ['TO', 'CD', '.', 'LS', '']  # my choice
lmtzr = WordNetLemmatizer()
for sent in sentences:
    words = word_tokenize(sent)
    tags = pos_tag(words)
    for (word, treebank_tag) in tags:
        if treebank_tag in ignoreTypes:
            continue
        tag = get_wordnet_pos(treebank_tag)
        if tag == '':
            # no wordnet equivalent for this tag, skip the word
            continue
        lemma = lmtzr.lemmatize(word, tag)
        stemW = stem(lemma)

At this point I have the stemmed word stemW, which I can then write to a file, and use these stems to compute TF-IDF vectors for each document.
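
For that last step, here is a minimal sketch of how the TF-IDF vectors could be computed once every document has been reduced to a list of stems. The helper name tfidf_vectors, the toy documents, and the plain tf * log(N / df) weighting are my own assumptions for illustration; libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Hypothetical helper: docs is a list of documents, each a list of stems.

    Returns (vocab, vectors), where vectors[i][j] is the weight of
    vocab[j] in document i, computed as (count / doc length) * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each stem occurs in.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vocab = sorted(df)

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(tf[term] / total) * math.log(n_docs / df[term]) for term in vocab]
        )
    return vocab, vectors


# Toy input standing in for the stems collected per document.
docs = [["cat", "sit", "mat"], ["dog", "sit", "log"], ["cat", "chase", "dog"]]
vocab, vectors = tfidf_vectors(docs)
```

With this weighting, a stem that occurs in every document gets weight 0, since log(N / N) = 0.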


2016-11-13 22:05:00
