2014-02-17

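How to plot a learning curve for Logistic Regression?

I am running Logistic Regression and would like to plot learning curves to get a feel for the data. How can I do this? Here is my code so far: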

    import numpy as np
    import pandas as p
    import sklearn.linear_model as lm
    from sklearn import metrics, preprocessing
    from sklearn.feature_extraction.text import TfidfVectorizer
    # cross_val_score used to live in sklearn.cross_validation
    from sklearn.model_selection import cross_val_score

    loadData = lambda f: np.genfromtxt(open(f, 'r'), delimiter=' ')

    print("loading data..")
    # Column 2 is fed to the vectorizer as text; the last column is the target.
    traindata = list(np.array(p.read_table('train.tsv'))[:, 2])
    testdata = list(np.array(p.read_table('test.tsv'))[:, 2])
    y = np.array(p.read_table('train.tsv'))[:, -1]

    tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                          analyzer='word', token_pattern=r'\w{1,}',
                          ngram_range=(1, 2), use_idf=True, smooth_idf=True,
                          sublinear_tf=True)

    # dual=True is only supported by the liblinear solver.
    rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
                               C=1, fit_intercept=True, intercept_scaling=1.0,
                               class_weight=None, random_state=None,
                               solver='liblinear')

    X_all = traindata + testdata
    lentrain = len(traindata)

    print("fitting pipeline")
    tfv.fit(X_all)
    print("transforming data")
    X_all = tfv.transform(X_all)

    X = X_all[:lentrain]
    X_test = X_all[lentrain:]

    print("20 Fold CV Score: %s" % np.mean(cross_val_score(rd, X, y, cv=20, scoring='roc_auc')))

    print("training on full data")
    rd.fit(X, y)
    pred = rd.predict_proba(X_test)[:, 1]
    testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
    pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
    pred_df.to_csv('benchmark.csv')
    print("submission file created..")

What I am trying to produce is something like this, so that I can get a better understanding of what is going on:

Image of expected output

Can anyone help me with this, please?

Answer


It's not as general as it should be, but it will do the job with a little fiddling on your end.

from matplotlib import pyplot as plt 
from sklearn import metrics 
import numpy as np 

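def data_size_response(model, trX, teX, trY, teY, score_func, prob=True, n_subsets=20):
    """Fit model on growing prefixes of the training set and score each fit."""
    train_errs, test_errs = [], []
    # Exponentially spaced subset sizes, from about e**3 (20) up to the full training set.
    subset_sizes = np.exp(np.linspace(3, np.log(trX.shape[0]), n_subsets)).astype(int)

    for m in subset_sizes:
        model.fit(trX[:m], trY[:m])
        if prob:
            # Note: predict_proba returns one column per class. Binary scorers
            # such as roc_auc_score expect a 1-D score, so you may need to pass
            # predict_proba(...)[:, 1] instead.
            train_err = score_func(trY[:m], model.predict_proba(trX[:m]))
            test_err = score_func(teY, model.predict_proba(teX))
        else:
            train_err = score_func(trY[:m], model.predict(trX[:m]))
            test_err = score_func(teY, model.predict(teX))
        print("training error: %.3f test error: %.3f subset size: %d" % (train_err, test_err, m))
        train_errs.append(train_err)
        test_errs.append(test_err)

    return subset_sizes, train_errs, test_errs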

def plot_response(subset_sizes,train_errs,test_errs): 

    plt.plot(subset_sizes,train_errs,lw=2) 
    plt.plot(subset_sizes,test_errs,lw=2) 
    plt.legend(['Training Error','Test Error']) 
    plt.xscale('log') 
    plt.xlabel('Dataset size') 
    plt.ylabel('Error') 
    plt.title('Model response to dataset size') 
    plt.show() 

model = None       # placeholder: put your model here
score_func = None  # placeholder: put your scoring function here
response = data_size_response(model,trX,teX,trY,teY,score_func,prob=True) 
plot_response(*response) 

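The data_size_response function takes a model (in your case, an instantiated LR model), a pre-split dataset (train/test X and Y arrays; you can use the train_test_split function in sklearn to generate these), and a scoring function as input. It then trains the model on n exponentially spaced subsets of the training data and returns the "learning curve". There is also a plotting function for visualizing this response.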
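As a concrete illustration of the inputs described above, here is a minimal sketch that builds the pre-split arrays with train_test_split, computes the exponentially spaced subset sizes, and runs the same kind of loop. The synthetic dataset from make_classification, the 30% test fraction, n_subsets=10, and the `[:, 1]` column selection for roc_auc_score are assumptions made for this example, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for your own X and y (assumption).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Pre-split arrays, in the order data_size_response expects them.
trX, teX, trY, teY = train_test_split(X, y, test_size=0.3, random_state=0)

# Same spacing rule as in data_size_response: sizes grow exponentially
# from about e**3 (roughly 20 samples) up to the full training set.
n_subsets = 10
subset_sizes = np.exp(np.linspace(3, np.log(trX.shape[0]), n_subsets)).astype(int)

model = LogisticRegression(max_iter=1000)
train_scores, test_scores = [], []
for m in subset_sizes:
    model.fit(trX[:m], trY[:m])
    # predict_proba returns one column per class; for a binary problem
    # roc_auc_score wants only the positive-class column, hence [:, 1].
    train_scores.append(roc_auc_score(trY[:m], model.predict_proba(trX[:m])[:, 1]))
    test_scores.append(roc_auc_score(teY, model.predict_proba(teX)[:, 1]))

for m, tr, te in zip(subset_sizes, train_scores, test_scores):
    print("subset size: %5d  train AUC: %.3f  test AUC: %.3f" % (m, tr, te))
```

The three lists produced here have the same shape as the return value of data_size_response, so they could be handed to plot_response in the same way. Keep in mind that AUC is a score (higher is better), so the axis label "Error" would be misleading for this choice of score_func.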

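I would have liked to use cross_val_score as in your example, but that would require modifying the sklearn source to get back training scores in addition to the test scores it already provides. The prob argument is there because some model/scoring function combinations need the model's predict_proba method rather than predict, e.g. roc_auc_score.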
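Current scikit-learn releases ship a learning_curve function in sklearn.model_selection that returns cross-validated training and test scores for a range of training-set sizes, which covers the wish expressed above without touching the library source. A minimal sketch, assuming synthetic data, 5 folds, 8 training sizes, and ROC AUC scoring (all illustrative choices, not taken from the original answer):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic binary data standing in for your own X and y (assumption).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# train_sizes given as fractions of the largest training fold.
# The "roc_auc" scorer takes care of calling the right prediction
# method on the estimator, so no prob flag is needed here.
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
    scoring="roc_auc",
)

# Each row holds one training size, each column one CV fold.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)

for n, tr, te in zip(train_sizes, train_mean, test_mean):
    print("train size: %4d  train AUC: %.3f  CV AUC: %.3f" % (n, tr, te))
```

Averaging over folds gives one training curve and one validation curve; plotting train_mean and test_mean against train_sizes yields the usual learning-curve picture.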

Example plot on a subset of the MNIST dataset: [image]

Let me know if you have any questions!