
scikit-learn decision tree model evaluation: below are the relevant code and documentation. When cross_val_score is called without an explicit scoring argument, does the output array represent accuracy, AUC, or some other metric?

Using Python 2.7 with a miniconda interpreter.

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

>>> from sklearn.datasets import load_iris 
>>> from sklearn.cross_validation import cross_val_score 
>>> from sklearn.tree import DecisionTreeClassifier 
>>> clf = DecisionTreeClassifier(random_state=0) 
>>> iris = load_iris() 
>>> cross_val_score(clf, iris.data, iris.target, cv=10) 
array([ 1.  , 0.93..., 0.86..., 0.93..., 0.93...,
        0.93..., 0.93..., 1.  , 0.93..., 1.  ])

Regards, Lin

Answers


From the user guide:

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:
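For instance, the scoring parameter can be set explicitly (a sketch; note that in newer scikit-learn versions cross_val_score lives in sklearn.model_selection rather than the deprecated sklearn.cross_validation used in the question):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)

# Requesting accuracy explicitly gives the same numbers as the default,
# because a classifier's .score method is accuracy.
acc = cross_val_score(clf, iris.data, iris.target, cv=10, scoring='accuracy')

# Any other scorer name switches the metric, e.g. macro-averaged F1.
f1 = cross_val_score(clf, iris.data, iris.target, cv=10, scoring='f1_macro')

print(acc.mean(), f1.mean())
```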

From the DecisionTreeClassifier documentation:

Returns the mean accuracy on the given test data and labels. In multi-label classification, this is the subset accuracy, which is a harsh metric since you require that each label set be correctly predicted for each sample.

Don't be misled by "mean accuracy"; it is just the usual way of computing accuracy. Follow the link to the source:

from .metrics import accuracy_score 
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight) 
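That delegation is easy to check directly (a sketch using the iris data from the question):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# .score is exactly accuracy_score applied to the model's predictions.
assert clf.score(iris.data, iris.target) == accuracy_score(
    iris.target, clf.predict(iris.data))
```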

Now the source of metrics.accuracy_score:

def accuracy_score(y_true, y_pred, normalize=True, sample_weight=None):
    ...
    # Compute accuracy for each possible representation
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    if y_type.startswith('multilabel'):
        differing_labels = count_nonzero(y_true - y_pred, axis=1)
        score = differing_labels == 0
    else:
        score = y_true == y_pred

    return _weighted_sum(score, sample_weight, normalize)
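The two branches can be seen on toy data (a sketch; the multilabel case uses a binary indicator matrix):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Single-label case: element-wise comparison, then the mean.
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
print(accuracy_score(y_true, y_pred))        # 3 of 4 correct -> 0.75

# Multilabel case: a sample counts only if *every* label matches.
y_true_ml = np.array([[1, 0], [1, 1]])
y_pred_ml = np.array([[1, 0], [1, 0]])
print(accuracy_score(y_true_ml, y_pred_ml))  # 1 of 2 rows fully correct -> 0.5
```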

If you still aren't convinced:

def _weighted_sum(sample_score, sample_weight, normalize=False):
    if normalize:
        return np.average(sample_score, weights=sample_weight)
    elif sample_weight is not None:
        return np.dot(sample_score, sample_weight)
    else:
        return sample_score.sum()

Note: the normalize parameter of accuracy_score defaults to True, so it simply returns np.average over the boolean numpy array, i.e. the mean fraction of correct predictions.
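A minimal illustration of that averaging step (booleans are treated as 0/1 by np.average):

```python
import numpy as np

# A boolean array of per-sample hits averages to the fraction correct.
score = np.array([True, True, False, True])
print(np.average(score))  # 3/4 -> 0.75

# With sample weights it becomes a weighted accuracy instead.
print(np.average(score, weights=[1, 1, 2, 1]))  # 3/5 -> 0.6
```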


Thanks juanpa.arrivillaga. If it is a two-class classification problem, each prediction is either correct or wrong, so what does the mean accuracy mean here? –


@LinMa see my edit - it is just the accuracy. –


Thanks juanpa for the patient answer; I have marked your reply as the answer. –


If the scoring argument is not given, cross_val_score will default to the .score method of the estimator you are using. For DecisionTreeClassifier, that means accuracy (as stated in the docstring below):

In [11]: DecisionTreeClassifier.score? 
Signature: DecisionTreeClassifier.score(self, X, y, sample_weight=None) 
Docstring: 
Returns the mean accuracy on the given test data and labels. 

In multi-label classification, this is the subset accuracy 
which is a harsh metric since you require for each sample that 
each label set be correctly predicted. 

Parameters 
---------- 
X : array-like, shape = (n_samples, n_features) 
    Test samples. 

y : array-like, shape = (n_samples) or (n_samples, n_outputs) 
    True labels for X. 

sample_weight : array-like, shape = [n_samples], optional 
    Sample weights. 

Returns 
------- 
score : float 
    Mean accuracy of self.predict(X) wrt. y. 
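The default scores can also be reproduced fold by fold (a sketch using sklearn.model_selection, which replaced the deprecated sklearn.cross_validation module; an integer cv on a classifier means unshuffled stratified folds):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)

default_scores = cross_val_score(clf, iris.data, iris.target, cv=10)

# Recompute each fold's score manually via clf.score (i.e. accuracy).
manual = []
for train, test in StratifiedKFold(n_splits=10).split(iris.data, iris.target):
    clf.fit(iris.data[train], iris.target[train])
    manual.append(clf.score(iris.data[test], iris.target[test]))

assert np.allclose(default_scores, manual)
```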

Thanks Randy. If it is a two-class classification problem, each prediction is either correct or wrong, so what does the mean accuracy mean here? –


Assuming you are predicting 0/1, it is just the percentage of correct classifications for a two-class problem. If you are predicting probabilities, it would be the mean of (1 - prediction error) over the predictions. –


Thanks Randy. :) –