
In the code below, I know my Naive Bayes classifier works correctly because it behaves as expected on trainset1, but why does it not work on trainset2? I have even tried two classifiers, one from TextBlob and another from nltk.

from textblob.classifiers import NaiveBayesClassifier 
from textblob import TextBlob 
from nltk.tokenize import word_tokenize 
import nltk 

trainset1 = [('I love this sandwich.', 'pos'), 
('This is an amazing place!', 'pos'), 
('I feel very good about these beers.', 'pos'), 
('This is my best work.', 'pos'), 
("What an awesome view", 'pos'), 
('I do not like this restaurant', 'neg'), 
('I am tired of this stuff.', 'neg'), 
("I can't deal with this", 'neg'), 
('He is my sworn enemy!', 'neg'), 
('My boss is horrible.', 'neg')] 

trainset2 = [('hide all brazil and everything plan limps to anniversary inflation plan initiallyis limping its first anniversary amid soaring prices', 'class1'), 
     ('hello i was there and no one came', 'class2'), 
     ('all negative terms like sad angry etc', 'class2')] 

def nltk_naivebayes(trainset, test_sentence): 
    all_words = set(word.lower() for passage in trainset for word in word_tokenize(passage[0])) 
    t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in trainset] 
    classifier = nltk.NaiveBayesClassifier.train(t) 
    test_sent_features = {word.lower(): (word in word_tokenize(test_sentence.lower())) for word in all_words} 
    return classifier.classify(test_sent_features) 

def textblob_naivebayes(trainset, test_sentence): 
    cl = NaiveBayesClassifier(trainset) 
    blob = TextBlob(test_sentence,classifier=cl) 
    return blob.classify() 

test_sentence1 = "he is my horrible enemy" 
test_sentence2 = "inflation soaring limps to anniversary" 

print(nltk_naivebayes(trainset1, test_sentence1))
print(nltk_naivebayes(trainset2, test_sentence2))
print(textblob_naivebayes(trainset1, test_sentence1))
print(textblob_naivebayes(trainset2, test_sentence2))

Output:

neg 
class2 
neg 
class2 

Even though test_sentence2 clearly belongs to class1.

Answer


I will assume you understand that you cannot expect a classifier to learn a good model from only 3 examples, and that your question is more about understanding why it fails in this particular case.

The likely reason is that the Naive Bayes classifier uses prior class probabilities, i.e., the probability of class2 vs. class1 regardless of the text. In your case, 2/3 of the examples belong to class2, so the prior is 66% for class2 and 33% for class1. The words in your single class1 example that favor it are 'anniversary' and 'soaring', which are unlikely to be enough to compensate for this prior.
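The prior effect can be computed directly from the label counts of trainset2 (a minimal sketch, independent of any classifier library):

```python
from collections import Counter

# Label distribution of trainset2: one class1 example, two class2 examples
labels = ['class1', 'class2', 'class2']

counts = Counter(labels)
priors = {label: n / len(labels) for label, n in counts.items()}

print(priors['class1'])  # 1/3, about 0.333
print(priors['class2'])  # 2/3, about 0.667
```

Before any word of the test sentence is even looked at, class2 starts with twice the probability mass of class1.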

In particular, note that the computation of word probabilities involves various "smoothing" functions (for example, using log10(term frequency + 1) instead of log10(term frequency)) to prevent low-frequency words from zeroing out the classification result. As a consequence, the probabilities of 'anniversary' and 'soaring' are not 0.0 for class2 and 1.0 for class1, as you might have expected.
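A minimal sketch of add-one (Laplace) smoothing, one common form of such smoothing (the exact formula nltk uses internally may differ), showing that an unseen word still gets a nonzero probability for class2:

```python
from collections import Counter

def smoothed_prob(word, class_tokens, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(word | class)."""
    counts = Counter(class_tokens)
    return (counts[word] + 1) / (len(class_tokens) + vocab_size)

# All tokens seen in the two class2 training sentences
class2_tokens = ('hello i was there and no one came '
                 'all negative terms like sad angry etc').split()
vocab_size = 30  # rough size of the combined trainset2 vocabulary

# 'anniversary' never occurs in class2, yet its smoothed probability
# is greater than 0, so it cannot single-handedly rule class2 out.
p_unseen = smoothed_prob('anniversary', class2_tokens, vocab_size)
p_seen = smoothed_prob('hello', class2_tokens, vocab_size)
print(p_unseen > 0)        # True
print(p_seen > p_unseen)   # True: seen words still score higher
```

So instead of a hard zero, the unseen word merely makes class2 somewhat less likely, and the 2:1 prior from the training labels is enough to keep class2 ahead.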