2017-02-10 46 views

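Sklearn LogisticRegressionCV: array-like as input

I read the data from a .csv file, but here I build the dataframe from lists so that the problem can be reproduced. The goal is to train a logistic regression model with cross-validation using LogisticRegressionCV.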

indeps = ['M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'F'] 
dep = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 

import pandas as pd

data = [indeps, dep]
cols = ['state', 'cat_bins']

# Pair each column name with its list of values.
data_dict = dict(zip(cols, data))

df = pd.DataFrame.from_dict(data_dict)
df.tail()

    cat_bins state 
45 0.0   F 
46 0.0   M 
47 0.0   M 
48 0.0   F 
49 0.0   F 


'''Use pandas to one-hot encode the independent variables. Note that
we are returning a sparse dataframe.'''

def heat_it2(dataframe, lst_of_columns):
    dataframe_hot = pd.get_dummies(dataframe,
                                   prefix=lst_of_columns,
                                   columns=lst_of_columns,
                                   sparse=True)
    return dataframe_hot

train_set_hot = heat_it2(df, ['state']) 
train_set_hot.head(2) 

    cat_bins state_F  state_M 
0  1.0   0   1 
1  1.0   1   0 

'''Use the dataframe to set up the prospective inputs to the model as numpy arrays''' 

indeps_hot = ['state_F', 'state_M'] 

X = train_set_hot[indeps_hot].values 
y = train_set_hot['cat_bins'].values 

print 'X-type:', X.shape, type(X) 
print 'y-type:', y.shape, type(y) 
print 'X has shape, is an array and has length:\n', hasattr(X, 'shape'), hasattr(X, '__array__'), hasattr(X, '__len__') 
print 'y has shape, is an array and has length:\n', hasattr(y, 'shape'), hasattr(y, '__array__'), hasattr(y, '__len__')
print 'X does have attribute fit:\n',hasattr(X, 'fit')
print 'y does have attribute fit:\n',hasattr(y, 'fit')

X-type: (50, 2) <type 'numpy.ndarray'>
y-type: (50,) <type 'numpy.ndarray'>
X has shape, is an array and has length:
True True True
y has shape, is an array and has length:
True True True 
X does have attribute fit: 
False 
y does have attribute fit: 
False 

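So the inputs to the regression appear to have the attributes the .fit method needs. They are numpy arrays of the right shape: X is an array with dimensions [n_samples, n_features], and y is a vector with shape [n_samples,]. Here is the documentation: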

fit(X, y, sample_weight=None)

Fit the model according to the given training data.

Parameters:

X : {array-like, sparse matrix}, shape (n_samples, n_features)
    Training vector, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape (n_samples,)
    Target vector relative to X.

....

Now we try to fit the regression:

from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

logmodel = LogisticRegressionCV(Cs=1, dual=False, scoring=accuracy_score, penalty='l2')
logmodel.fit(X, y)

... 

    TypeError: Expected sequence or array-like, got estimator LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, 
    intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, 
    penalty='l2', random_state=None, solver='liblinear', tol=0.0001, 
    verbose=0, warm_start=False) 

The source of the error message appears to be scikit-learn's validation.py module.

The only part of the code that raises this error message is the following function (excerpt):

def _num_samples(x):
    """Return number of samples in array-like x."""
    if hasattr(x, 'fit'):
        # Don't get num_samples from an ensembles length!
        raise TypeError('Expected sequence or array-like, got '
                        'estimator %s' % x)
    # etc.

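Q: Since the arguments we fit the model with (X and y) do not have the attribute 'fit', why is this error message raised?

Using Canopy 1.7.4.3348 (64-bit) with Python 2.7, scikit-learn 18.01-3 and pandas 0.19.2-2.

Thanks for your help :)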

Answer


The problem appears to be in the scoring argument. You passed accuracy_score, whose signature is accuracy_score(y_true, y_pred[, ...]). But in the logistic.py module:

if isinstance(scoring, six.string_types):
    scoring = SCORERS[scoring]
for w in coefs:
    # Other code
    if scoring is None:
        scores.append(log_reg.score(X_test, y_test))
    else:
        scores.append(scoring(log_reg, X_test, y_test))

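Since you passed the function accuracy_score rather than a string, the first line above does not apply, and scores.append(scoring(log_reg, X_test, y_test)) is used to evaluate the estimator. But as I said above, these arguments do not match what accuracy_score expects: the estimator log_reg ends up in the y_true position, and the input validation rejects it because it has a fit attribute. Hence the error.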
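To make the mechanism concrete, here is a minimal sketch built from simplified stand-ins, not scikit-learn's actual code. ToyEstimator, num_samples and toy_accuracy are hypothetical names; toy_accuracy only imitates the (y_true, y_pred, normalize=True) shape of accuracy_score's signature, and num_samples imitates the check quoted in the question.

```python
# Simplified stand-ins for illustration only; NOT scikit-learn's code.

class ToyEstimator(object):
    """Hypothetical estimator that always predicts the majority label."""

    def fit(self, X, y):
        self.majority_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


def num_samples(x):
    """Imitates the quoted check: anything with a `fit` attribute is rejected."""
    if hasattr(x, 'fit'):
        raise TypeError('Expected sequence or array-like, got estimator %r' % x)
    return len(x)


def toy_accuracy(y_true, y_pred, normalize=True):
    """A metric: expects labels and predictions, not an estimator."""
    if num_samples(y_true) != num_samples(y_pred):
        raise ValueError('inconsistent lengths')
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / float(len(y_true)) if normalize else hits


X = [[0, 1], [1, 0], [0, 1], [1, 0]]
y = [1.0, 1.0, 1.0, 0.0]
est = ToyEstimator().fit(X, y)

# Called as a metric, the function works as intended.
metric_result = toy_accuracy(y, est.predict(X))

# Called the way the CV loop calls `scoring`, the estimator lands in the
# y_true slot, X in the y_pred slot, and y is silently bound to `normalize`.
try:
    toy_accuracy(est, X, y)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

In this sketch the call with (estimator, X, y) fails inside the metric's own input check, which is why the message mentions an estimator even though X and y themselves are plain arrays.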

Solution: pass make_scorer(accuracy_score) as the scoring argument of LogisticRegressionCV, or simply pass the string 'accuracy'.

from sklearn.metrics import make_scorer

logmodel = LogisticRegressionCV(Cs=1, dual=False,
                                scoring=make_scorer(accuracy_score),
                                penalty='l2')

         OR

logmodel = LogisticRegressionCV(Cs=1, dual=False,
                                scoring='accuracy',
                                penalty='l2')

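Note: this may be a bug in that part of the logistic.py module, or else the LogisticRegressionCV documentation should clarify the signature required of the scoring function.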

You can submit an issue on GitHub and see how it goes (done).

Thank you, both of your suggestions avoid the error. Could you tell me which part of the source code the error message comes from? – user2738815

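The source of the error is the same one you pointed out in the question. It is raised because the scoring function is given the wrong arguments. Where those arguments are supplied is shown in the first code snippet of my answer. –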

I appreciate you taking the time. Thanks. – user2738815