2017-01-29 1288 views
13

LogisticRegression: "Unknown label type: 'continuous'" using sklearn in Python

I have the following code to test some of the most popular ML algorithms from Python's sklearn library:

import numpy as np
from sklearn import metrics, svm
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

trainingData = np.array([ [2.3, 4.3, 2.5], [1.3, 5.2, 5.2], [3.3, 2.9, 0.8], [3.1, 4.3, 4.0] ]) 
trainingScores = np.array([3.4, 7.5, 4.5, 1.6]) 
predictionData = np.array([ [2.5, 2.4, 2.7], [2.7, 3.2, 1.2] ]) 

clf = LinearRegression() 
clf.fit(trainingData, trainingScores) 
print("LinearRegression") 
print(clf.predict(predictionData)) 

clf = svm.SVR() 
clf.fit(trainingData, trainingScores) 
print("SVR") 
print(clf.predict(predictionData)) 

clf = LogisticRegression() 
clf.fit(trainingData, trainingScores) 
print("LogisticRegression") 
print(clf.predict(predictionData)) 

clf = DecisionTreeClassifier() 
clf.fit(trainingData, trainingScores) 
print("DecisionTreeClassifier") 
print(clf.predict(predictionData)) 

clf = KNeighborsClassifier() 
clf.fit(trainingData, trainingScores) 
print("KNeighborsClassifier") 
print(clf.predict(predictionData)) 

clf = LinearDiscriminantAnalysis() 
clf.fit(trainingData, trainingScores) 
print("LinearDiscriminantAnalysis") 
print(clf.predict(predictionData)) 

clf = GaussianNB() 
clf.fit(trainingData, trainingScores) 
print("GaussianNB") 
print(clf.predict(predictionData)) 

clf = SVC() 
clf.fit(trainingData, trainingScores) 
print("SVC") 
print(clf.predict(predictionData)) 

The first two work fine, but on the LogisticRegression call I get the following error:

[email protected]:/home/ouhma# python stack.py 
LinearRegression 
[ 15.72023529 6.46666667] 
SVR 
[ 3.95570063 4.23426243] 
Traceback (most recent call last): 
    File "stack.py", line 28, in <module> 
    clf.fit(trainingData, trainingScores) 
    File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/logistic.py", line 1174, in fit 
    check_classification_targets(y) 
    File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/multiclass.py", line 172, in check_classification_targets 
    raise ValueError("Unknown label type: %r" % y_type) 
ValueError: Unknown label type: 'continuous' 

The input data is exactly the same as in the previous calls, so what is going on here?

And by the way, why is there such a huge difference between the first predictions of the LinearRegression() and SVR() algorithms (15.72 vs 3.95)?

Answers

19

You are passing floats to a classifier, which expects categorical values as the target vector. If you convert the targets to int they will be accepted as input (although it is questionable whether that is the right way to do it).
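For instance, the failing LogisticRegression call goes through once the targets are cast to int (a minimal sketch, reusing the arrays from the question):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

trainingData = np.array([[2.3, 4.3, 2.5], [1.3, 5.2, 5.2],
                         [3.3, 2.9, 0.8], [3.1, 4.3, 4.0]])
trainingScores = np.array([3.4, 7.5, 4.5, 1.6])
predictionData = np.array([[2.5, 2.4, 2.7], [2.7, 3.2, 1.2]])

# Casting the float targets to int makes them pass sklearn's
# check_classification_targets() validation: the model now treats
# 3, 7, 4 and 1 as four distinct class labels.
clf = LogisticRegression()
clf.fit(trainingData, trainingScores.astype(int))
print(clf.predict(predictionData))
```

Note that the cast truncates 3.4 and 7.5 down to 3 and 7, so information is lost; this only silences the error.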

It would be better to convert your training scores using scikit-learn's LabelEncoder.

The same applies to your DecisionTree and KNeighbors classifiers.

from sklearn import preprocessing 
from sklearn import utils 

lab_enc = preprocessing.LabelEncoder() 
encoded = lab_enc.fit_transform(trainingScores) 
print(encoded) 
# [1 3 2 0] 

print(utils.multiclass.type_of_target(trainingScores)) 
# continuous 

print(utils.multiclass.type_of_target(trainingScores.astype('int'))) 
# multiclass 

print(utils.multiclass.type_of_target(encoded)) 
# multiclass 
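Putting the encoder into the original pipeline, a minimal sketch for one of the failing models (the same pattern applies to DecisionTreeClassifier, KNeighborsClassifier, and so on):

```python
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

trainingData = np.array([[2.3, 4.3, 2.5], [1.3, 5.2, 5.2],
                         [3.3, 2.9, 0.8], [3.1, 4.3, 4.0]])
trainingScores = np.array([3.4, 7.5, 4.5, 1.6])
predictionData = np.array([[2.5, 2.4, 2.7], [2.7, 3.2, 1.2]])

# Encode each distinct float score as a class index (0..3).
lab_enc = preprocessing.LabelEncoder()
encoded = lab_enc.fit_transform(trainingScores)

clf = LogisticRegression()
clf.fit(trainingData, encoded)

# inverse_transform maps the predicted class indices back to the
# original float scores.
print(lab_enc.inverse_transform(clf.predict(predictionData)))
```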
+1

Thanks! So I have to convert '2.3' to '23' and so on, don't I? Is there an elegant way to do the conversion with numpy or pandas? – harrison4

+1

But the example at http://machinelearningmastery.com/compare-machine-learning-algorithms-python-scikit-learn/ passes floating-point input data to the LogisticRegression function ... and it works fine. Why? – harrison4

+0

The input can be floats, but the output needs to be categorical, i.e. int. In that example, column 8 contains only 0 or 1. More commonly you have categorical labels such as ['red', 'big', 'sick'], and you need to convert them to numeric values. Try http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features or http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html –
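A short sketch of that conversion for string labels (the ['red', 'big', 'sick'] values here are just the hypothetical labels from the comment above):

```python
from sklearn.preprocessing import LabelEncoder

labels = ['red', 'big', 'sick', 'red', 'sick']

enc = LabelEncoder()
numeric = enc.fit_transform(labels)

print(numeric)             # [1 0 2 1 2] -- indices into the sorted classes
print(list(enc.classes_))  # ['big', 'red', 'sick']
print(list(enc.inverse_transform(numeric)))  # back to the original strings
```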

3

I ran into the same problem when feeding float data to classifiers. I wanted to keep the precision of the floats rather than rounding them to integers, so try regression algorithms instead. For example:

import numpy as np
from sklearn import linear_model
from sklearn import svm

classifiers = [ 
    svm.SVR(), 
    linear_model.SGDRegressor(), 
    linear_model.BayesianRidge(), 
    linear_model.LassoLars(), 
    linear_model.ARDRegression(), 
    linear_model.PassiveAggressiveRegressor(), 
    linear_model.TheilSenRegressor(), 
    linear_model.LinearRegression()] 

trainingData = np.array([ [2.3, 4.3, 2.5], [1.3, 5.2, 5.2], [3.3, 2.9, 0.8], [3.1, 4.3, 4.0] ]) 
trainingScores = np.array([3.4, 7.5, 4.5, 1.6]) 
predictionData = np.array([ [2.5, 2.4, 2.7], [2.7, 3.2, 1.2] ]) 

for item in classifiers: 
    print(item) 
    clf = item 
    clf.fit(trainingData, trainingScores) 
    print(clf.predict(predictionData),'\n')
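As for the question's last point: the large gap between LinearRegression() and SVR() (15.72 vs 3.95) comes from SVR's defaults, not from the data. Out of the box SVR uses an RBF kernel with C=1.0, and with only four training points that much regularization keeps its predictions near the mean of the training targets. A rough sketch of this (the C and epsilon values below are just illustrative, not recommendations):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

trainingData = np.array([[2.3, 4.3, 2.5], [1.3, 5.2, 5.2],
                         [3.3, 2.9, 0.8], [3.1, 4.3, 4.0]])
trainingScores = np.array([3.4, 7.5, 4.5, 1.6])
predictionData = np.array([[2.5, 2.4, 2.7], [2.7, 3.2, 1.2]])

lr = LinearRegression().fit(trainingData, trainingScores)

# Default SVR: RBF kernel, C=1.0 -- heavily regularized on 4 samples,
# so its predictions stay close to the mean of the training targets (~4.25).
svr_rbf = SVR().fit(trainingData, trainingScores)

# A linear kernel with a large C and a small epsilon behaves much more
# like ordinary least squares, so its predictions land far closer to
# LinearRegression's.
svr_lin = SVR(kernel='linear', C=1e4, epsilon=0.01).fit(trainingData, trainingScores)

print(lr.predict(predictionData))
print(svr_rbf.predict(predictionData))
print(svr_lin.predict(predictionData))
```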