2014-10-19 25 views

How to build training vectors for word n-grams with TF-IDF

My task is text classification with an SVM, using word n-grams as features. My code is:

word_dic = ngram.wordNgrams(text, n) 
freq_term_vector = [word_dic[gram] if gram in word_dic else 0 for gram in global_vector] 
X.append(freq_term_vector) 

This works well. However, when I try TF-IDF with the following code:

freq_term_vector = [word_dic[gram] if gram in word_dic else 0 for gram in global_vector] 
tfidf = TfidfTransformer(norm="l2") 
tfidf.fit(freq_term_vector) 
X.append(tfidf.transform(freq_term_vector).toarray()) 

training runs without complaint, but when the program reaches the prediction step it fails with:

clf.predict(X_test) 
    File "/usr/lib/python2.7/dist-packages/sklearn/linear_model/base.py", line 223, in predict 
    scores = self.decision_function(X) 
    File "/usr/lib/python2.7/dist-packages/sklearn/linear_model/base.py", line 207, in decision_function 
    dense_output=True) + self.intercept_ 
    File "/usr/lib/python2.7/dist-packages/sklearn/utils/extmath.py", line 83, in safe_sparse_dot 
    return np.dot(a, b) 
ValueError: shapes (1100,1,38) and (1,11) not aligned: 38 (dim 2) != 1 (dim 0) 

The training data and the test data are built the same way. How can I solve this alignment problem? Can anyone check my code or give me some ideas?

Answer


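I think the problem is the append. tfidf.transform(...).toarray() returns a 2-D array, so appending it to a list for every document yields a 3-D structure (the (1100, 1, 38) in the error message), while the classifier expects a 2-D matrix. Try the following instead: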

... 
X = tfidf.transform(freq_term_vector) 
... 
X_test = tfidf.transform(freq_term_vector_test) 
clf.predict(X_test) 
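
The elided parts of the answer can be filled in along these lines. This is only a sketch under assumptions: the count matrices, the labels, and the choice of LinearSVC are made up for illustration and are not taken from the question. It fits one TfidfTransformer on the whole training count matrix (IDF weights computed from a single document carry no information) and reuses the same fitted object on the test counts, so both matrices stay 2-D with the same number of columns.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC

# Hypothetical n-gram counts: one row per document, one column per n-gram
# in the shared vocabulary (the role global_vector plays in the question).
train_counts = np.array([
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 0, 2, 0],
    [0, 1, 0, 2],
])
y_train = np.array([0, 1, 0, 1])  # made-up class labels

test_counts = np.array([
    [1, 0, 1, 0],
    [0, 2, 0, 1],
])

# Fit the transformer once on all training documents, then reuse it.
tfidf = TfidfTransformer(norm="l2")
X_train = tfidf.fit_transform(train_counts)  # sparse, shape (4, 4)
X_test = tfidf.transform(test_counts)        # sparse, shape (2, 4)

# LinearSVC is an assumption; the question does not say which classifier clf is.
clf = LinearSVC()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```

Because the test rows use the same column order as the training rows, the number of features seen by predict matches the number seen by fit.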

I see, the real problem was the append method. I tried the extend method instead and it works well. Thanks! – allenwang 2014-10-21 03:23:05
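
The difference between append and extend that the comment describes can be seen with a small sketch. The two arrays below are invented stand-ins for the per-document output of tfidf.transform(...).toarray(), each of shape (1, n_features).

```python
import numpy as np

# Stand-ins for per-document transform output: 2-D arrays of shape (1, 3).
rows = [np.array([[0.1, 0.2, 0.3]]), np.array([[0.4, 0.5, 0.6]])]

X_append = []
X_extend = []
for row in rows:
    X_append.append(row)  # adds the whole (1, 3) array as a single element
    X_extend.extend(row)  # adds each inner row, here one vector of shape (3,)

append_shape = np.array(X_append).shape
extend_shape = np.array(X_extend).shape
print(append_shape)  # (2, 1, 3): 3-D, the kind of shape the error reports
print(extend_shape)  # (2, 3): 2-D, one row per document
```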