2015-11-23 42 views

I'm trying to run some analysis with a DecisionTreeClassifier in scikit-learn, but it fails with the following input-size mismatch error:

ValueError: Number of features of the model must match the input. Model n_features is 1 and input n_features is 4

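I used the same training and test sets with an SVC and a GaussianNB classifier, and both worked fine. My code is below. I know the test and training sets have the same layout: before vectorization, each is a list of strings. I can't work out where the mismatch comes from.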

#classify with just scikit 

from sklearn.feature_extraction.text import TfidfVectorizer 
from tools.striper import stripe, cleanupfiles 
from tools.tweetprocessor import clean, wordclean 

from sklearn import svm 
from sklearn.naive_bayes import GaussianNB, MultinomialNB 
from sklearn.metrics import classification_report 
from sklearn import tree 

stripe(0.1) 

training = [] 
traininglabel = [] 
test = [] 
testlabel = [] 

with open('tempdata/goodtraining.txt','r') as f:
    for line in f:
        tweet = [wordclean(x) for x in clean(line.rstrip('\n')).split()]
        tweet = [x for x in tweet if len(x) >= 3]
        training.append(' '.join(tweet))
        traininglabel.append('good')
with open('tempdata/badtraining.txt','r') as f:
    for line in f:
        tweet = [wordclean(x) for x in clean(line.rstrip('\n')).split()]
        tweet = [x for x in tweet if len(x) >= 3]
        training.append(' '.join(tweet))
        traininglabel.append('bad')
with open('tempdata/goodtest.txt','r') as f:
    for line in f:
        tweet = [wordclean(x) for x in clean(line.rstrip('\n')).split()]
        test.append(' '.join(tweet))
        testlabel.append('good')
with open('tempdata/badtest.txt','r') as f:
    for line in f:
        tweet = [wordclean(x) for x in clean(line.rstrip('\n')).split()]
        test.append(' '.join(tweet))
        testlabel.append('bad')

vectorizer = TfidfVectorizer(min_df=0.1,max_df=0.9) 
train_vect = vectorizer.fit_transform(training) 
test_vect = vectorizer.fit_transform(test) 

print (train_vect) 
print (test_vect) 

classifier = tree.DecisionTreeClassifier() 
classifier.fit(train_vect.toarray(), traininglabel) 
predictions = classifier.predict(test_vect.toarray()) 

print (classification_report(testlabel, predictions)) 

cleanupfiles() 

Answer


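You need to change

test_vect = vectorizer.fit_transform(test)

to

test_vect = vectorizer.transform(test)

The vectorizer's fit() method should only be called on the training data. Calling fit_transform() on the test set makes the vectorizer learn a new vocabulary from the test documents, so the test matrix ends up with a different number of columns than the matrix the classifier was trained on.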
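To see the effect of the fix in isolation, here is a minimal sketch. The toy documents and the names `train_docs`, `test_docs`, and `refit_vect` are made up for illustration and stand in for the cleaned tweets from the question; the default `TfidfVectorizer()` settings are used instead of the question's `min_df`/`max_df` values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-ins for the cleaned tweets in the question.
train_docs = ["good happy day", "happy good news", "bad sad day", "sad bad news"]
train_labels = ["good", "good", "bad", "bad"]
test_docs = ["happy day", "sad news today"]

vectorizer = TfidfVectorizer()

# fit_transform() learns the vocabulary from the training data only.
train_vect = vectorizer.fit_transform(train_docs)

# transform() reuses that vocabulary, so the column count matches.
test_vect = vectorizer.transform(test_docs)

# The mistake from the question: refitting on the test set learns a
# different vocabulary and therefore yields a different column count.
refit_vect = TfidfVectorizer().fit_transform(test_docs)

print(train_vect.shape)  # (4, 6)
print(test_vect.shape)   # (2, 6)
print(refit_vect.shape)  # (2, 5)

# With matching feature counts, predict() no longer raises ValueError.
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(train_vect.toarray(), train_labels)
predictions = classifier.predict(test_vect.toarray())
print(predictions)
```

Words that appear only in the test set ("today" here) are simply ignored by transform(), which is what keeps the feature count equal to the one the model was trained with.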


That did it. Thanks.