Python sklearn: n-gram accuracy decreases as n-gram length increases
I have a hate-speech dataset containing some 10K labelled tweets; it looks like this:
Tweet | class
Hello everyone | no offense
You ugly muppet | offensive but not hate speech
You **** jew | hate speech
Now I'm trying to use the MultinomialNB classifier from the sklearn library in Python; here is my code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
data = pd.read_excel('myfile')  # note: recent pandas versions no longer accept an encoding argument here
data = data.sample(frac=1)  # shuffle the rows

# Positional train/test split; slice ends are exclusive, so no +1 is needed
training_base = 0
training_bounds = 10000
test_base = training_bounds
test_bounds = 12000

tweets_train = data['tweet'][training_base:training_bounds]
tweets_test = data['tweet'][test_base:test_bounds]
class_train = data['class'][training_base:training_bounds]
class_test = data['class'][test_base:test_bounds]

# Bag-of-words features; ngram_range=(1,1) means unigrams only
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,1))
train_counts = vectorizer.fit_transform(tweets_train.values)

classifier = MultinomialNB()
train_targets = class_train.values
classifier.fit(train_counts, train_targets)

example_counts = vectorizer.transform(tweets_test.values)
predictions = classifier.predict(example_counts)
accuracy = np.mean(predictions == class_test.values)
print(accuracy)
The accuracy is about 75% when using ngram_range=(1,1), but as I go from (2,2) up to (8,8) it drops: 75, 72, 67, ... 55%. Why is this? What am I missing?
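A likely culprit is feature sparsity: with ngram_range=(n,n) only exact n-grams of length n count as features, and the longer the n-gram, the less likely a test tweet shares any feature with the training vocabulary. A minimal sketch on an invented toy corpus (not your data) showing how vocabulary coverage collapses as n grows:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, just for illustration
train = ["hello everyone", "you ugly muppet", "have a nice day"]
test = ["hello you ugly people"]

matched = []
for n in (1, 2, 3):
    vec = CountVectorizer(analyzer='word', ngram_range=(n, n))
    vec.fit(train)
    # Total test n-grams that also appear in the training vocabulary
    matched.append(int(vec.transform(test).sum()))

print(matched)  # fewer and fewer test n-grams are recognised as n grows
```

If this is what is happening on the real data, then by (8,8) almost every test tweet maps to an all-zero vector and the classifier can only fall back on its class priors, which would explain the steadily falling accuracy. A cumulative range such as ngram_range=(1,2) keeps the unigram features while adding bigrams.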
Great answer, I'll try it out and come back with the results! – samson