How do I add more stop words to the NLTK stop word list?

I have the following code. I need to add more words to the NLTK stop word list, but after I run this, the words from my list are not added.

from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer 
import string 
stop = set(stopwords.words('english'))  
new_words = open("stopwords_en.txt", "r") 
new_stopwords = stop.union(new_word) 
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer() 
def clean(doc): 
    stop_free = " ".join([i for i in doc.lower().split() if i not in new_stopwords])  
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude) 
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split()) 
    return normalized 
doc_clean = [clean(doc).split() for doc in emails_body_text]


2017-09-21 Vrushab Jain

Please fix the indentation of your code; it makes no sense the way you have it. – alexis

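Surely 'new_stopwords = stop.union(new_word)' should read 'new_stopwords = stop.union(new_words)'? Also, 'new_words = open("stopwords_en.txt", "r")' returns a file object, so you are adding the file object to the stop word list rather than its contents. Presumably you want something like 'new_words = open("stopwords_en.txt", "r").readlines()'? –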

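Don't do things blindly. Read in your new stop word list, check that it is correct, and only then add it to the other stop word list. Start with the code suggested by @greg_data, but you will need to strip the newlines, and maybe other things too; who knows what your stop word file looks like?

This might do it, for example: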

new_words = open("stopwords_en.txt", "r").read().split() 
new_stopwords = stop.union(new_words)
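
If you want to look at the new words before merging them, something along these lines may help. It is only a sketch: `load_extra_stopwords` and `extend_stopwords` are made-up helper names, the small `stop` set is a stand-in for `set(stopwords.words('english'))` so the snippet runs without the NLTK data, and lowercasing the file's entries is my assumption (your `clean` lowercases the document, so it seemed sensible).

```python
def load_extra_stopwords(path):
    """Read whitespace-separated stop words from a text file (hypothetical helper)."""
    words = set()
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            # split() drops the trailing newline and any surrounding whitespace
            for word in line.split():
                # Assumption: stop words are matched against lowercased text
                words.add(word.lower())
    return words


def extend_stopwords(base, path):
    """Return a new set: the base stop words plus those listed in the file."""
    return set(base) | load_extra_stopwords(path)


# Stand-in for set(stopwords.words('english')) so the sketch runs without NLTK data.
stop = {"the", "a", "and"}
```

You would then call `new_stopwords = extend_stopwords(stop, "stopwords_en.txt")`, and printing `sorted(load_extra_stopwords("stopwords_en.txt"))` first is a quick way to check that the file was read the way you expect.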

PS. Don't keep splitting and joining your document; tokenize once and work with the list of tokens.
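
To illustrate the PS, here is a rough version of `clean` that splits the document once and then works on the token list. Treat it as a sketch under assumptions: `clean_tokens` and the `normalize` parameter are names I made up, and `normalize` defaults to doing nothing so the snippet runs without WordNet; in your code you would pass `lemma.lemmatize`.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tokens(doc, stopwords, normalize=lambda word: word):
    """Tokenize once, then filter and normalize the resulting token list."""
    cleaned = []
    for token in doc.lower().split():
        # Same order as the question's code: drop stop words first
        if token in stopwords:
            continue
        # Then strip punctuation from the token
        token = token.translate(PUNCT_TABLE)
        # Skip tokens that consisted of punctuation only
        if token:
            cleaned.append(normalize(token))
    return cleaned
```

Since this already returns a list, the last line of your script would become something like `doc_clean = [clean_tokens(doc, new_stopwords, lemma.lemmatize) for doc in emails_body_text]`, with no extra `.split()`.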


2017-09-21 13:29:38 alexis
