
Combining dataframes with different indexes in pandas

I've generated a dataframe of probabilities from a scikit-learn classifier like this:

def preprocess_category_series(series, key):
    if series.dtype != 'category':
        return series
    if series.cat.ordered:
        # Ordered categorical: use the integer codes, replacing
        # missing values (code -1) with the most common code.
        s = pd.Series(series.cat.codes, name=key)
        mode = s.mode()[0]
        s[s < 0] = mode
        return s
    else:
        # Unordered categorical: one-hot encode, dropping the first level.
        return pd.get_dummies(series, drop_first=True, prefix=key)

data = df[df.year == 2012] 
factors = pd.concat([preprocess_category_series(data[k], k) for k in factor_keys], axis=1) 
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(factors)]) 
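For reference, here is a quick sketch of what the two branches of preprocess_category_series return; the series and column names below are invented for illustration:

import pandas as pd

# Ordered categorical: integer codes, with missing values (code -1)
# replaced by the most common code.
educ = pd.Series(pd.Categorical(['low', 'high', 'low', None],
                                categories=['low', 'mid', 'high'],
                                ordered=True))
print(preprocess_category_series(educ, 'educ'))
# 0    0
# 1    2
# 2    0
# 3    0    <- NaN had code -1, replaced by the mode
# Name: educ, dtype: int8

# Unordered categorical: one-hot encoded with the first level dropped,
# leaving a single 0/1 column named region_West in this case.
region = pd.Series(pd.Categorical(['South', 'West', 'South']))
print(preprocess_category_series(region, 'region'))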

I now want to append these probabilities back onto my original dataframe. However, while the predictions dataframe generated above preserves the order of the rows in data, it has lost data's index. I assumed I'd be able to do:

pd.concat([data, predictions], axis=1, ignore_index=True) 

But this raises an error:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects 

I've seen this come up when column names are duplicated, but none are here. What is this error? And what's the best way to stitch these dataframes together?

data.head()

                year  serial  hwtfinl                       region statefip  \
cpsid
20121000000100  2012       1  3796.85  East South Central Division  Alabama
20121000000100  2012       1  3796.85  East South Central Division  Alabama
20121000000100  2012       1  3796.85  East South Central Division  Alabama
20120800000500  2012       6  2814.24  East South Central Division  Alabama
20120800000600  2012       7  2828.42  East South Central Division  Alabama

                county  month  pernum          cpsidp     wtsupp ...  \
cpsid                                                            ...
20121000000100       0     11       1  20121000000101  3208.1213 ...
20121000000100       0     11       2  20121000000102  3796.8506 ...
20121000000100       0     11       3  20121000000103  3386.4305 ...
20120800000500       0     11       1  20120800000501  2814.2417 ...
20120800000600    1097     11       1  20120800000601  2828.4193 ...

                 race        hispan  educ           votereg  \
cpsid
20121000000100  White  Not Hispanic   111             Voted
20121000000100  White  Not Hispanic   111  Did not register
20121000000100  White  Not Hispanic   111             Voted
20120800000500  White  Not Hispanic    92             Voted
20120800000600  White  Not Hispanic    73  Did not register

                                         educ_parsed      age4         educ4  \
cpsid
20121000000100                     Bachelor's degree       65+  College grad
20121000000100                     Bachelor's degree       65+  College grad
20121000000100                     Bachelor's degree  Under 30  College grad
20120800000500  Associate's degree, academic program     45-64  College grad
20120800000600     High school diploma or equivalent       65+    HS or less

                race4 region4  gender
cpsid
20121000000100  White   South    Male
20121000000100  White   South  Female
20121000000100  White   South  Female
20120800000500  White   South  Female
20120800000600  White   South  Female

predictions.head()

          a         b         c         d         e         f
0  0.119534  0.336761  0.188023  0.136651  0.095342  0.123689
1  0.148409  0.346429  0.134852  0.169661  0.087556  0.113093
2  0.389586  0.195802  0.101738  0.085705  0.114612  0.112557
3  0.277783  0.262079  0.180037  0.102030  0.071171  0.106900
4  0.158404  0.396487  0.088064  0.079058  0.171540  0.106447

Just for fun, I tried this with only the head rows:

pd.concat([data.iloc[0:5], predictions.iloc[0:5]], axis=1, ignore_index=True)

The same error appears.

This works fine for me. What version of pandas are you on? – Ali

I'm on version 0.18.0 – futuraprime

Could you please print predictions.head() and data.head()? – Shovalt

Answers


I'm on 0.18.0 too. Here's what I tried, and it works. Is this what you're doing?

import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Fit a toy classifier on six points in two classes.
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
clf = GaussianNB()
clf.fit(X, Y)

# Build a frame from the training data and append the class probabilities.
data = pd.DataFrame(X)
data['y'] = Y
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(X)])
pd.concat([data, predictions], axis=1, ignore_index=True)
   0  1  2             3             4
0 -1 -1  1  1.000000e+00  1.522998e-08
1 -2 -1  1  1.000000e+00  3.775135e-11
2 -3 -2  1  1.000000e+00  5.749523e-19
3  1  1  2  1.522998e-08  1.000000e+00
4  2  1  2  3.775135e-11  1.000000e+00
5  3  2  2  5.749523e-19  1.000000e+00
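One difference worth noting: in this toy example, data is built straight from X, so it carries a default, unique RangeIndex and pd.concat has nothing to re-align. A quick check worth running against the real frame:

print(data.index.is_unique)  # True for the toy frame above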
This is almost identical to what I'm doing; the only significant difference is that the classifier was trained on a different dataset. – futuraprime

That shouldn't make any difference. You can train your classifier on whatever data you like. Can you add more of your code? – Ali

Added more code; I think that's basically the whole thing. – futuraprime


It turns out there's a relatively simple solution:

predictions.index = data.index 
pd.concat([data, predictions], axis=1) 

Now it works perfectly. I'm still not sure why it didn't work the way I originally tried it.
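For what it's worth, the failure is consistent with data having a non-unique index: the cpsid values in data.head() above repeat, and pd.concat along axis=1 aligns rows on the index (ignore_index=True on axis=1 resets the column labels, not the rows). A minimal sketch of the symptom and the fix, using made-up values:

import pandas as pd

data = pd.DataFrame({'x': [1, 2, 3]}, index=[100, 100, 200])  # duplicated labels, like cpsid
predictions = pd.DataFrame({'a': [0.1, 0.2, 0.3]})            # fresh RangeIndex 0..2

# Aligning a duplicated index against a different one fails:
# pd.concat([data, predictions], axis=1)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects

# Giving predictions the same index (the rows are already in the same
# order) means concat sees identical indexes and skips re-alignment:
predictions.index = data.index
print(pd.concat([data, predictions], axis=1))
#      x    a
# 100  1  0.1
# 100  2  0.2
# 200  3  0.3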