Training a sklearn LogisticRegression classifier without all possible labels

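I want to use scikit-learn 0.12.1 to:

train a LogisticRegression classifier
evaluate it on held-out validation data
feed new data to this classifier and retrieve the 5 most probable labels for each observation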

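Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels, and some of them do not occur in the available training data.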

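This leads to two problems:

1. The label vectorizer doesn't recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler to the set of possible labels, but it exacerbates problem 2.
2. The output of the LogisticRegression classifier's predict_proba method is an [n_samples, n_classes] array, where n_classes covers only the classes seen in the training data. This means that running argsort on the predict_proba array no longer yields values that map directly to the label vectorizer's vocabulary.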

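My question is: what is the best way to force the classifier to recognize the full set of possible classes, even when some of them are not present in the training data? Obviously it can't learn anything about labels it has never seen data for, but zeros are perfectly usable in my situation.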
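
To make problem 2 concrete, here is a minimal sketch with made-up numbers. The arrays `all_labels`, `seen_labels` and `proba` are hypothetical stand-ins for the label vocabulary, the classes the classifier was fit on, and a predict_proba result; no classifier is actually trained.

```python
import numpy as np

# Hypothetical vocabulary of every possible label (illustrative values).
all_labels = np.array([10, 20, 30, 40])
# Hypothetical subset of labels that actually occurred in the training data.
seen_labels = np.array([10, 30, 40])

# Stand-in for predict_proba output: one column per *seen* label only.
proba = np.array([[0.2, 0.7, 0.1],
                  [0.5, 0.1, 0.4]])

# argsort yields column positions; reverse them so the most probable comes first.
top2_cols = np.argsort(proba, axis=1)[:, ::-1][:, :2]

# Misleading: indexing the full vocabulary with those positions.
wrong = all_labels[top2_cols]
# Correct: indexing the labels the columns really correspond to.
right = seen_labels[top2_cols]
```

Here the column positions only make sense relative to `seen_labels`, so `wrong` and `right` disagree as soon as a label is missing from the training data.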

2013-02-22 Alexander Measure

Here's a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier,

from itertools import repeat 

# determine the classes that were not present in the training set; 
# the ones that were are listed in clf.classes_. 
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes) 

# the order of classes in predict_proba's output matches that in clf.classes_. 
prob = clf.predict_proba(test_samples) 
for row in prob:
    # pair each class with its probability in this row (list() is needed on Python 3)
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))

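produces a list of (cls, prob) pairs.

2013-02-23 11:21:31

More elegant than the workaround I was using. Is the classes_ attribute present in all sklearn classifiers? In 0.12.1 LogisticRegression only has label_, though it looks like that changes in later versions. – 2013-02-23 16:09:15

@AlexanderMeasure: Yes, 'classes_' should be present on all classifiers, but currently isn't; this is a known bug that is being fixed class by class. 0.13 has 'classes_' on LR, I forgot that 0.12.1 doesn't yet. – 2013-02-23 17:19:04

Oops, this doesn't quite work. clf.predict_proba returns an array of shape [n_samples, n_clf_classes]. Iterating over the array goes row by row, so zipping the classes with the whole prob array pairs each class with an entire n_clf_classes-length row of probabilities for one test sample, which isn't particularly useful. It does work, however, if we zip the classes with each row. – 2013-02-25 18:23:29

Building off larsman's excellent answer, I ended up with this: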

from itertools import repeat 
import numpy as np 

# determine the classes that were not present in the training set; 
# the ones that were are listed in clf.classes_. 
classes_not_trained = set(clf.classes_).symmetric_difference(all_classes) 

# the order of classes in predict_proba's output matches that in clf.classes_. 
prob = clf.predict_proba(test_samples) 
new_prob = [] 
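for row in prob:
    # pair each class with its probability in this row (list() is needed on Python 3)
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
    # put the probabilities in class order
    prob_per_class = sorted(prob_per_class)
    # append a list, not a generator, so np.asarray builds a 2-D float array
    new_prob.append([p for _, p in prob_per_class])
new_prob = np.asarray(new_prob)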

new_prob is an [n_samples, n_classes] array just like the output of predict_proba, except that it now includes 0 probabilities for the never-before-seen classes.
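
As a quick sanity check, the same padding can be run on made-up numbers. The sketch below wraps the loop in a hypothetical helper `pad_probabilities` (not part of scikit-learn) and replaces clf.classes_ and the predict_proba output with hand-written values, assuming all_classes is a superset of the trained classes.

```python
from itertools import repeat

import numpy as np


def pad_probabilities(prob, trained_classes, all_classes):
    """Hypothetical helper: add zero columns for classes missing from training."""
    classes_not_trained = set(all_classes) - set(trained_classes)
    new_prob = []
    for row in prob:
        pairs = (list(zip(trained_classes, row))
                 + list(zip(classes_not_trained, repeat(0.))))
        # Sort by class so the columns follow sorted(all_classes).
        new_prob.append([p for _, p in sorted(pairs)])
    return np.asarray(new_prob)


# Made-up inputs: classes 1 and 3 were seen in training, 2 and 4 were not.
prob = np.array([[0.9, 0.1],
                 [0.3, 0.7]])
padded = pad_probabilities(prob, [1, 3], [1, 2, 3, 4])
# padded has one column per class 1, 2, 3, 4; the unseen classes get 0.
```

Each row still sums to 1, because only zero-valued columns are added.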

2013-02-25 19:24:06

If what you want is an array like the one returned by predict_proba, but with columns corresponding to the sorted all_classes, how about:

import numpy

all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for learnt classes 
prob = clf.predict_proba(test_samples) 
# Create the result matrix, where all values are initially zero 
new_prob = numpy.zeros((prob.shape[0], all_classes.size)) 
# Set the columns corresponding to clf.classes_ 
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
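
Once the columns line up with the sorted all_classes, the five most probable labels asked for in the question can be read off with argsort. The sketch below uses made-up values for `all_classes` and `new_prob` rather than a fitted classifier, and the helper name `top_k_labels` is hypothetical.

```python
import numpy as np


def top_k_labels(new_prob, all_classes, k=5):
    """Hypothetical helper: the k most probable labels per row, best first."""
    # Negate so that argsort's ascending order puts the largest probability first;
    # a stable sort keeps tied columns in class order.
    order = np.argsort(-new_prob, axis=1, kind="stable")[:, :k]
    return np.asarray(all_classes)[order]


# Made-up example: six possible classes (already sorted), two samples.
all_classes = np.array(["a", "b", "c", "d", "e", "f"])
new_prob = np.array([[0.05, 0.40, 0.00, 0.30, 0.15, 0.10],
                     [0.00, 0.00, 0.60, 0.10, 0.25, 0.05]])
labels = top_k_labels(new_prob, all_classes)
```

A zero-probability class can still show up in the top five when fewer than five classes have a nonzero probability, as happens for the second sample here.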

2013-03-02 13:56:10 joeln
