2017-06-19

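Multi-label text classification on a single-label dataset

I have a dataset in which each document has a single label, as in the example below.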

    label             text
    pay               "i will pay now"
    finance           "are you the finance guy?"
    law               "lawyers and law"
    court             "was at the court today"
    finance report    "bank reported annual share.."

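A document can carry more than one label, so how can I do multi-label classification on this dataset? I have read a lot of the sklearn documentation, but I can't seem to find the right way to do multi-label classification on a single-label dataset. Thanks in advance for your help.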

This is what I have so far:

import numpy as np 
import pandas as pd 
from sklearn.pipeline import Pipeline 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.svm import LinearSVC 
from sklearn.linear_model import SGDClassifier 
from sklearn.naive_bayes import MultinomialNB 
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.multiclass import OneVsRestClassifier 
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.preprocessing import MultiLabelBinarizer 
from sklearn import preprocessing 

loc = r'C:\Users\..\Downloads\excel.xlsx' 

df = pd.read_excel(loc) 
X = np.array(df.docs) 
z = np.array(df.title) 
y = np.array(df.raw) 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, 
random_state=42) 

mlb = preprocessing.MultiLabelBinarizer() 
Y = mlb.fit_transform(y_train) 
Y_test = mlb.fit_transform(y_test) 

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

doc_new = np.array(['X has announced that it will sell $587 million']) 

print("Accuracy Score: ", accuracy_score(Y_test, predicted)) 
print(mlb.inverse_transform(classifier.predict(doc_new))) 

But I keep getting a dimension error:

    .format(len(self.classes_), yt.shape[1]))
ValueError: Expected indicator for 44 classes, but got 46

Answer

I found the solution. I used pandas groupby

df = pd.DataFrame(df.groupby(["id", "doc"]).label.apply(list)).reset_index() 

to combine each text with all of its classes, and it works.
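A minimal sketch of how that grouping step might feed into `MultiLabelBinarizer`. The toy rows and the column names `id`, `doc`, and `label` are assumptions for illustration only (the real spreadsheet is not shown), as is the reading of "finance report" as two separate labels on the same document.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: one row per (document, label) pair.
# Document 3 appears twice because it carries two labels.
df = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "doc": [
        "i will pay now",
        "are you the finance guy?",
        "bank reported annual share",
        "bank reported annual share",
    ],
    "label": ["pay", "finance", "finance", "report"],
})

# Collect all labels that belong to the same document into one list.
grouped = df.groupby(["id", "doc"])["label"].apply(list).reset_index()

# Convert the label lists into a binary indicator matrix, one column per class.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(grouped["label"])

print(grouped)
print(mlb.classes_)
print(Y)
```

After grouping, every document occupies a single row whose `label` cell is a list, which is the iterable-of-iterables shape that `MultiLabelBinarizer` expects.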

The dimension error has also been resolved.
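One likely cause of the `Expected indicator for 44 classes, but got 46` error in the question is the second `mlb.fit_transform(y_test)` call. It refits the binarizer on the test labels, so `mlb` ends up knowing a different number of classes than the classifier was trained on, and `mlb.inverse_transform` then rejects the prediction matrix. The sketch below, which uses hypothetical label lists, fits the binarizer once on the training labels and only transforms the test labels.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists; the test split happens to lack the "law" class.
y_train = [["pay"], ["finance", "report"], ["law"]]
y_test = [["finance"], ["pay", "report"]]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)  # learns the class set from the training labels
Y_test = mlb.transform(y_test)        # reuses that class set, so the column counts match

print(Y_train.shape, Y_test.shape)

# For comparison: refitting on the test labels learns a smaller class set,
# which is the kind of mismatch the error message reports.
refit = MultiLabelBinarizer().fit_transform(y_test)
print(refit.shape)

# With a single fitted binarizer, the indicator matrix maps back to labels.
print(mlb.inverse_transform(Y_test))
```

Labels that occur only in the test split are ignored by `transform` with a warning, so if that matters it may be preferable to fit the binarizer on the full label set before splitting.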