不同的指数组合dataframes我已经生成概率的数据帧从一个scikit学习分类是这样的:与熊猫
def preprocess_category_series(series, key):
if series.dtype != 'category':
return series
if series.cat.ordered:
s = pd.Series(series.cat.codes, name=key)
mode = s.mode()[0]
s[s<0] = mode
return s
else:
return pd.get_dummies(series, drop_first=True, prefix=key)
data = df[df.year == 2012]
factors = pd.concat([preprocess_category_series(data[k], k) for k in factor_keys], axis=1)
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(factors)])
我现在想这些概率追加回到我原来的数据帧。但是,上面生成的predictions
数据帧在保留data
中的项目顺序的同时,已经丢失了data
的索引。我认为我能够做到
pd.concat([data, predictions], axis=1, ignore_index=True)
但是这会产生错误:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
我已经看到了这有时会出现,如果列名是重复的,但在这种情况下,没有一个是。那是什么错误?将这些数据框拼接在一起的最佳方式是什么?
:
year serial hwtfinl region statefip \
cpsid
20121000000100 2012 1 3796.85 East South Central Division Alabama
20121000000100 2012 1 3796.85 East South Central Division Alabama
20121000000100 2012 1 3796.85 East South Central Division Alabama
20120800000500 2012 6 2814.24 East South Central Division Alabama
20120800000600 2012 7 2828.42 East South Central Division Alabama
county month pernum cpsidp wtsupp ... \
cpsid ...
20121000000100 0 11 1 20121000000101 3208.1213 ...
20121000000100 0 11 2 20121000000102 3796.8506 ...
20121000000100 0 11 3 20121000000103 3386.4305 ...
20120800000500 0 11 1 20120800000501 2814.2417 ...
20120800000600 1097 11 1 20120800000601 2828.4193 ...
race hispan educ votereg \
cpsid
20121000000100 White Not Hispanic 111 Voted
20121000000100 White Not Hispanic 111 Did not register
20121000000100 White Not Hispanic 111 Voted
20120800000500 White Not Hispanic 92 Voted
20120800000600 White Not Hispanic 73 Did not register
educ_parsed age4 educ4 \
cpsid
20121000000100 Bachelor's degree 65+ College grad
20121000000100 Bachelor's degree 65+ College grad
20121000000100 Bachelor's degree Under 30 College grad
20120800000500 Associate's degree, academic program 45-64 College grad
20120800000600 High school diploma or equivalent 65+ HS or less
race4 region4 gender
cpsid
20121000000100 White South Male
20121000000100 White South Female
20121000000100 White South Female
20120800000500 White South Female
20120800000600 White South Female
predictions.head()
:
a b c d e f
0 0.119534 0.336761 0.188023 0.136651 0.095342 0.123689
1 0.148409 0.346429 0.134852 0.169661 0.087556 0.113093
2 0.389586 0.195802 0.101738 0.085705 0.114612 0.112557
3 0.277783 0.262079 0.180037 0.102030 0.071171 0.106900
4 0.158404 0.396487 0.088064 0.079058 0.171540 0.106447
只是为了好玩,我专门只用头列试过这样:
pd.concat([data_2012.iloc[0:5], predictions.iloc[0:5]], axis=1, ignore_index=True)
同样的错误出现。
这对我来说非常合适。你的熊猫的版本是什么? – Ali
我在版本0.18.0 – futuraprime
可以请打印predictions.head()和data.head()? – Shovalt