2017-03-07 138 views
0

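scikit-learn LinearRegression: predicting string values

After working through some courses and examples from tutorials, I am trying to create my first machine learning model. I got the training data from here: https://raw.github.com/pydata/pandas/master/pandas/tests/data/iris.csv, and I am using pandas to load this CSV data.

The main problem is that the column to predict contains strings, while the algorithms all work with floats.

Of course, I could map all the strings to numbers (0, 1, 2) by hand and use a modified file, but I am trying to find a way to replace the string values automatically using pandas or scikit-learn, and to keep the mapping in a separate array.

My code is: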

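import pandas as pd
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = pd.read_csv("https://raw.github.com/pydata/pandas/master/pandas/tests/data/iris.csv")

data.head()

features_cols = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
X = data[features_cols]  # numeric feature columns
y = data.Name            # target column, holds strings such as 'Iris-setosa'

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
linreg = LinearRegression()
linreg.fit(X_train, y_train)  # fails here because y_train contains strings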

The error I see is:

ValueError: could not convert string to float: 'Iris-setosa'

How can I use pandas to replace all the values in the "Name" column with integers?
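For reference, one pandas-only way to get what the question describes (integer codes plus the mapping kept in a separate array) is `pd.factorize`. This is a minimal sketch, not the asker's code: the short `names` Series is a made-up stand-in for the `Name` column of iris.csv so the example runs without downloading anything.

```python
import pandas as pd

# Hypothetical stand-in for data.Name from iris.csv.
names = pd.Series(
    ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa'],
    name='Name',
)

# codes:   one integer per row
# classes: the unique strings; position in this Index is the integer code
# sort=True orders the classes alphabetically so the codes are reproducible.
codes, classes = pd.factorize(names, sort=True)

print(codes)          # integer labels usable as y
print(list(classes))  # the mapping, kept separately

# Indexing the classes with the codes recovers the original strings.
restored = classes[codes]
```

With the real data, `data.Name` would take the place of `names`.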

Answers

1

You can use scikit-learn's LabelEncoder:

>>> import pandas as pd
>>> from sklearn import preprocessing 
>>> df = pd.DataFrame({'Name':['Iris-setosa','Iris-setosa','Iris-versicolor','Iris-virginica','Iris-setosa','Iris-versicolor'], 'a': [1,2,3,4,1,1]}) 
>>> y = df.Name 
>>> le = preprocessing.LabelEncoder() 
>>> le.fit(y) # fit your y array 
LabelEncoder() 
>>> le.classes_ # check your unique classes 
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object) 
>>> y_transformed = le.transform(y) # transform your y with numeric encodings 
>>> y_transformed 
array([0, 0, 1, 2, 0, 1], dtype=int64) 
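Since the question also asks to keep the mapping, here is a short sketch of how the fitted encoder can serve as that mapping. It reuses the labels from the session above; the names `mapping` and `restored` are only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

y = ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
     'Iris-virginica', 'Iris-setosa', 'Iris-versicolor']

le = LabelEncoder()
y_transformed = le.fit_transform(y)  # fit and transform in one step

# le.classes_ is the separate array holding the mapping:
# the index of each class is its integer code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original strings.
restored = le.inverse_transform(y_transformed)
print(list(restored))
```

Keeping the `le` object around is usually enough, because `inverse_transform` can decode predictions later.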
-1

I suggest importing the iris dataset directly from scikit-learn, like this:

from sklearn import datasets 

iris = datasets.load_iris() 
X = iris.data 
y = iris.target 

Demo:

In [9]: from sklearn.model_selection import train_test_split

In [10]: from sklearn.linear_model import LinearRegression 

In [11]: X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1) 

In [12]: linreg = LinearRegression() 

In [13]: linreg.fit(X_train, y_train) 
Out[13]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False) 

In [14]: linreg.score(X_test, y_test) 
Out[14]: 0.89946565707178838 

In [15]: y 
Out[15]: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
     0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
     1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]) 
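The integer targets above can be mapped back to species names through `iris.target_names`, which ships with the bundled dataset. A small sketch follows; note that the bundled names are 'setosa', 'versicolor' and 'virginica', without the 'Iris-' prefix used in the CSV file.

```python
from sklearn import datasets

iris = datasets.load_iris()
y = iris.target

# target_names[i] is the species name for integer code i.
print(list(iris.target_names))

# Indexing target_names with the integer targets gives one name per sample.
names = iris.target_names[y]
print(names[:3], names[-3:])
```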
+0

The anonymous downvoter strikes again... – Tonechas