2015-04-24 60 views

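I am trying to build my own corpus for sentiment analysis of tweets, classifying each one as positive or negative.

I first tried the approach on the existing NLTK movie review corpus. However, if I use this code: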

import string 
from itertools import chain 

from nltk.corpus import movie_reviews as mr 
from nltk.corpus import stopwords 
from nltk.probability import FreqDist 
from nltk.classify import NaiveBayesClassifier as nbc 
import nltk 

stop = stopwords.words('english') 
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()] 

word_features = FreqDist(chain(*[i for i,j in documents])) 
word_features = word_features.keys()[:100] 

numtrain = int(len(documents) * 90/100) 
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]] 
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]] 

classifier = nbc.train(train_set) 
print nltk.classify.accuracy(classifier, test_set) 
classifier.show_most_informative_features(5) 

I get this output:

0.31 
Most Informative Features 
       uplifting = True    pos : neg =  5.9 : 1.0 
       wednesday = True    pos : neg =  3.7 : 1.0 
      controversy = True    pos : neg =  3.4 : 1.0 
        shocks = True    pos : neg =  3.0 : 1.0 
        catchy = True    pos : neg =  2.6 : 1.0 

instead of the expected output (see Classification using movie review corpus in NLTK/Python):

0.655 
Most Informative Features 
        bad = True    neg : pos =  2.0 : 1.0 
        script = True    neg : pos =  1.5 : 1.0 
        world = True    pos : neg =  1.5 : 1.0 
       nothing = True    neg : pos =  1.5 : 1.0 
        bad = False    pos : neg =  1.5 : 1.0 

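I am using exactly the same code as on the other StackOverflow page, my NLTK (and theirs) is up to date, and I also have the latest movie review corpus. Does anyone have an idea what is going wrong?

Thanks!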

Answer

My guess is that the following line is making the difference:

word_features = word_features.keys()[:100] 

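word_features is a dictionary object (a Counter, to be more precise), and its keys() method returns the keys in arbitrary order, so the first 100 keys are not necessarily the 100 most frequent words. As a result, the feature list used for your training set differs from the feature list in the original post.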

https://docs.python.org/2/library/stdtypes.html#dict.items
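To illustrate the point, here is a minimal sketch. It assumes the intent of the original line was to keep the most frequent words, and it uses collections.Counter with a made-up token list as a stand-in for the FreqDist built from the corpus (in NLTK 3, FreqDist is a subclass of Counter, so most_common() should be available there as well). On Python 3.7+ a dict iterates in insertion order rather than an arbitrary one, but slicing the keys still ignores the counts.

```python
from collections import Counter

# Toy token stream standing in for the movie review words (hypothetical data).
tokens = ["bad", "plot", "bad", "script", "good", "bad", "script", "world"]

# Counter is the base class of NLTK 3's FreqDist, so the same calls apply there.
counts = Counter(tokens)

# Slicing the keys picks whichever words come first in the dict's iteration
# order, which says nothing about how frequent they are.
first_keys = list(counts.keys())[:2]

# most_common(n) returns (word, count) pairs sorted by descending count.
top_words = [word for word, _ in counts.most_common(2)]

print(first_keys)
print(top_words)  # ['bad', 'script']
```

If this is indeed the cause, replacing the slice over keys() in the question with something like [w for w, _ in word_features.most_common(100)] would select the features by frequency; whether that reproduces the 0.655 accuracy has not been checked here.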

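I don't think that is the problem, because every time I run this code on a different computer I get the same result (accuracy 0.31 and the same most informative features) – mvh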

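The order of keys() is arbitrary but not random; it is implementation-specific. I can get different results if I run the code on a Linux machine than when I run the same code on a Windows box. – valentin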