
scikit-learn decision tree model evaluation: below are the relevant code and documentation. When cross_val_score is called without an explicit scoring argument, does the output array represent accuracy, AUC, or some other metric?

Using Python 2.7 with a miniconda interpreter.

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

>>> from sklearn.datasets import load_iris 
>>> from sklearn.cross_validation import cross_val_score 
>>> from sklearn.tree import DecisionTreeClassifier 
>>> clf = DecisionTreeClassifier(random_state=0) 
>>> iris = load_iris() 
>>> cross_val_score(clf, iris.data, iris.target, cv=10) 
array([ 1.  , 0.93..., 0.86..., 0.93..., 0.93...,
        0.93..., 0.93..., 1.  , 0.93..., 1.  ])

Regards, Lin

Answers


From the user guide:

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:
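For instance, the scoring parameter can be set explicitly (a sketch; note that in newer scikit-learn versions cross_val_score lives in sklearn.model_selection rather than the deprecated sklearn.cross_validation used in the question):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)

# Requesting accuracy explicitly gives the same numbers as the default,
# because a classifier's .score method is accuracy.
acc = cross_val_score(clf, iris.data, iris.target, cv=10, scoring='accuracy')

# Any other scorer name switches the metric, e.g. macro-averaged F1.
f1 = cross_val_score(clf, iris.data, iris.target, cv=10, scoring='f1_macro')

print(acc.mean(), f1.mean())
```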

From the DecisionTreeClassifier documentation:

Returns the mean accuracy on the given test data and labels. In multi-label classification, this is the subset accuracy, which is a harsh metric since you require that each label set be correctly predicted for each sample.

Don't be misled by "mean accuracy"; it is just the usual way of computing accuracy. Follow the link to the source:

from .metrics import accuracy_score 
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight) 
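That delegation is easy to check directly (a sketch using the iris data from the question):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# .score is exactly accuracy_score applied to the model's predictions.
assert clf.score(iris.data, iris.target) == accuracy_score(
    iris.target, clf.predict(iris.data))
```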

Now the source of metrics.accuracy_score:

def accuracy_score(y_true, y_pred, normalize=True, sample_weight=None):
    ...
    # Compute accuracy for each possible representation
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    if y_type.startswith('multilabel'):
        differing_labels = count_nonzero(y_true - y_pred, axis=1)
        score = differing_labels == 0
    else:
        score = y_true == y_pred

    return _weighted_sum(score, sample_weight, normalize)
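The two branches can be seen on toy data (a sketch; the multilabel case uses a binary indicator matrix):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Single-label case: element-wise comparison, then the mean.
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
print(accuracy_score(y_true, y_pred))        # 3 of 4 correct -> 0.75

# Multilabel case: a sample counts only if *every* label matches.
y_true_ml = np.array([[1, 0], [1, 1]])
y_pred_ml = np.array([[1, 0], [1, 0]])
print(accuracy_score(y_true_ml, y_pred_ml))  # 1 of 2 rows fully correct -> 0.5
```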

If you still aren't convinced:

def _weighted_sum(sample_score, sample_weight, normalize=False):
    if normalize:
        return np.average(sample_score, weights=sample_weight)
    elif sample_weight is not None:
        return np.dot(sample_score, sample_weight)
    else:
        return sample_score.sum()

Note: the normalize parameter of accuracy_score defaults to True, so it simply returns np.average over the boolean numpy array, i.e. the mean fraction of correct predictions.
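A minimal illustration of that averaging step (booleans are treated as 0/1 by np.average):

```python
import numpy as np

# A boolean array of per-sample hits averages to the fraction correct.
score = np.array([True, True, False, True])
print(np.average(score))  # 3/4 -> 0.75

# With sample weights it becomes a weighted accuracy instead.
print(np.average(score, weights=[1, 1, 2, 1]))  # 3/5 -> 0.6
```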


Thanks juanpa.arrivillaga. If it is a two-class classification problem, each prediction is either correct or wrong, so what does the mean accuracy mean here? –


@LinMa see my edit - it is just the accuracy. –


Thanks juanpa for the patient answer; I have marked your reply as the answer. –


If the scoring argument is not given, cross_val_score will default to the .score method of the estimator you are using. For DecisionTreeClassifier, that means accuracy (as stated in the docstring below):

In [11]: DecisionTreeClassifier.score? 
Signature: DecisionTreeClassifier.score(self, X, y, sample_weight=None) 
Docstring: 
Returns the mean accuracy on the given test data and labels. 

In multi-label classification, this is the subset accuracy 
which is a harsh metric since you require for each sample that 
each label set be correctly predicted. 

Parameters 
---------- 
X : array-like, shape = (n_samples, n_features) 
    Test samples. 

y : array-like, shape = (n_samples) or (n_samples, n_outputs) 
    True labels for X. 

sample_weight : array-like, shape = [n_samples], optional 
    Sample weights. 

Returns 
------- 
score : float 
    Mean accuracy of self.predict(X) wrt. y. 
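The default scores can also be reproduced fold by fold (a sketch using sklearn.model_selection, which replaced the deprecated sklearn.cross_validation module; an integer cv on a classifier means unshuffled stratified folds):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)

default_scores = cross_val_score(clf, iris.data, iris.target, cv=10)

# Recompute each fold's score manually via clf.score (i.e. accuracy).
manual = []
for train, test in StratifiedKFold(n_splits=10).split(iris.data, iris.target):
    clf.fit(iris.data[train], iris.target[train])
    manual.append(clf.score(iris.data[test], iris.target[test]))

assert np.allclose(default_scores, manual)
```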

Thanks Randy. If it is a two-class classification problem, each prediction is either correct or wrong, so what does the mean accuracy mean here? –


Assuming you are predicting 0/1, it is just the percentage of correct classifications for a two-class problem. If you are predicting probabilities, it would be the mean of (1 - prediction error) over the predictions. –


Thanks Randy. :) –