2017-03-01 50 views

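Getting 'pos' when testing a negative review

OK, so I trained a NaiveBayes movie review classifier... but when I run it against a negative review (copied from a website and pasted into a txt file), I get 'pos'... What am I doing wrong? Here is the code: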

import random

import nltk
from nltk.corpus import movie_reviews

# Pair each review's token list with its category ('pos' or 'neg').
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Take the first 2000 keys of the frequency distribution as features.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    # One boolean feature per vocabulary word: is it present in the document?
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)
# Output:
# 0.67
# Most Informative Features
#     contains(thematic) = True    pos : neg =  8.9 : 1.0
#       contains(annual) = True    pos : neg =  8.9 : 1.0
#      contains(miscast) = True    neg : pos =  8.7 : 1.0
#     contains(supports) = True    pos : neg =  6.9 : 1.0
#   contains(unbearable) = True    neg : pos =  6.7 : 1.0

# Classify a single review read from a text file.
with open('negative_review.txt') as f:
    fraw = f.read()
review_tokens = nltk.word_tokenize(fraw)
docfts = document_features(review_tokens)

print(classifier.classify(docfts))
# Output:
# pos

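UPDATE: After re-running the program a few times, it now correctly classifies my negative review as negative... Can someone help me understand why? Or is this simply magic?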

Answers

1

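No classifier is 100% accurate. A better test is to see how the classifier handles multiple movie reviews rather than a single one. I see your classifier's accuracy is 67%, which means roughly 1 in 3 reviews will be misclassified. You could try to improve the model with a different classifier or different features (try n-grams and word2vec).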
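One feature change that stays within Naive Bayes: in NLTK 3, `FreqDist` subclasses `collections.Counter`, so `list(all_words)[:2000]` takes the first 2000 keys rather than the 2000 most frequent words, while `most_common(2000)` selects by frequency. Below is a minimal sketch of that idea, using a plain `Counter` and toy tokens as stand-ins. The helper name `top_words`, the extra `word_features` parameter, and the toy data are assumptions for illustration, not part of the question. It also lowercases the review's tokens so they can match the lowercased vocabulary.

```python
from collections import Counter

def top_words(tokens, n):
    """Return the n most frequent lowercased tokens (hypothetical helper)."""
    counts = Counter(t.lower() for t in tokens)
    return [word for word, _ in counts.most_common(n)]

def document_features(document, word_features):
    # Lowercase the document's tokens so "BAD" matches the feature "bad".
    document_words = {t.lower() for t in document}
    return {'contains({})'.format(w): (w in document_words)
            for w in word_features}

# Toy corpus standing in for movie_reviews.words() (assumption).
toy_tokens = "the film was bad the plot was bad the acting was fine".split()
word_features = top_words(toy_tokens, 3)

# Toy review with mixed casing, as text pasted from a website might have.
features = document_features(["The", "acting", "was", "BAD"], word_features)
print(features)
```

With the question's code, the equivalent change would likely be `word_features = [w for w, _ in all_words.most_common(2000)]`; whether it fixes this particular review is not guaranteed.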


The assignment requires using only the NaiveBayes classifier :/ –


There's nothing wrong with your code; you just need to improve the features. Is there a certain accuracy threshold you have to hit? – megadarkfriend


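Nah... actually, something weird happened after re-running it a few times... it now classifies my negative review as negative! That's so weird... I'll take a screenshot of this run and post it with my assignment! The accuracy also went up to 0.7 on its own! Is this witchcraft? –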
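A likely contributor to the run-to-run change, rather than witchcraft: `random.shuffle(documents)` is unseeded, so each run trains and tests on a different split, and with only 100 test reviews the accuracy estimate can easily move by a few points. A minimal sketch of a reproducible split follows. The function name `split_documents`, the seed value, and the toy documents are assumptions for illustration.

```python
import random

def split_documents(documents, n_test, seed):
    """Shuffle a copy of documents with a fixed seed, then split it.

    Returns (train, test); the same seed always yields the same split.
    """
    rng = random.Random(seed)   # private generator, leaves global state alone
    shuffled = list(documents)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

# Toy (text, label) pairs standing in for the real documents (assumption).
docs = [("review {}".format(i), "pos" if i % 2 else "neg") for i in range(10)]

train_a, test_a = split_documents(docs, 3, seed=42)
train_b, test_b = split_documents(docs, 3, seed=42)
print(test_a == test_b)
```

Fixing the seed makes results repeatable, which helps when comparing feature changes; it does not make the classifier itself any more accurate.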