2017-08-29

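I have a large DataFrame, from which I extract the data I need with groupby. I need to turn the index of the new DataFrame into several separate columns, that is, split the index into separate columns in pandas.

Part of the original DataFrame looks like this: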

 code   place  vl year week 
0 111.0002.0056 region1  1 2017 29 
1 112.6500.2285 region2  1 2017 31 
2 112.5600.6325 region2  1 2017 30 
3 112.5600.6325 region2  1 2017 30 
4 112.5600.8159 region2  1 2017 30 
5 111.0002.0056 region2  1 2017 29 
6 111.0002.0056 region2  1 2017 30 
7 111.0002.0056 region2  1 2017 28 
8 112.5600.8159 region3  1 2017 31 
9 112.5600.8159 region3  1 2017 28 
10 111.0002.0114 region3  1 2017 31 
.... 

After applying groupby, it looks like this (code: df_test1 = df_test.groupby(['code', 'year', 'week', 'place'])['vl'].sum().unstack(fill_value=0)):

     place region1 region2 region3 region4 index1 
code   year week      
111.0002.0006 2017  29  0  3  0  0  (111.0002.0006, 2017, 29) 
         30  0  7  0  0  (111.0002.0006, 2017, 30) 
111.0002.0018 2017  29  0  0  0  0  (111.0002.0018, 2017, 29) 
111.0002.0029 2017  30  0  0  0  0  (111.0002.0029, 2017, 30) 
111.0002.0055 2017  28  0  33  0  8  (111.0002.0055, 2017, 28) 
         29  1  155  2  41 (111.0002.0055, 2017, 29) 
         30  0  142  1  39 (111.0002.0055, 2017, 30) 
         31  0  31  0  13 (111.0002.0055, 2017, 31) 
111.0002.0056 2017  28  9  36  0  4  (111.0002.0056, 2017, 28) 
         29  20  75  2  37 (111.0002.0056, 2017, 29) 
         30  17  81  2  33 (111.0002.0056, 2017, 30) 
.... 

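I saved the index in a separate column, index1 (code: df_test1['index1'] = df_test1.index). From the column index1 I need to get three separate columns: code, year and week.

The result should look like this: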

region1 region2 region3 region4  code  year week     
    0  3  0  0 111.0002.0006 2017 29 
    0  7  0  0 111.0002.0006 2017 30 
    0  0  0  0 111.0002.0018 2017 29 
    0  0  0  0 111.0002.0029 2017 30 
    0  33  0  8 111.0002.0055 2017 28 
    1  155  2  41 111.0002.0055 2017 29 
    0  142  1  39 111.0002.0055 2017 30 
    0  31  0  13 111.0002.0055 2017 31 
.... 

I would be grateful for any advice!

Answers


Use reset_index instead of df_test1['index1'] = df_test1.index, and add rename_axis to clean up the DataFrame; it removes the columns name place:

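df_test1 = (
    df_test.groupby(['code', 'year', 'week', 'place'])['vl'].sum()
    .unstack(fill_value=0)      # one column per place, missing combinations become 0
    .reset_index()              # code, year and week become ordinary columns
    .rename_axis(None, axis=1)  # drop the columns name 'place'
)
print(df_test1)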

      code year week region1 region2 region3 
0 111.0002.0056 2017 28  0  1  0 
1 111.0002.0056 2017 29  1  1  0 
2 111.0002.0056 2017 30  0  1  0 
3 111.0002.0114 2017 31  0  0  1 
4 112.5600.6325 2017 30  0  2  0 
5 112.5600.8159 2017 28  0  0  1 
6 112.5600.8159 2017 30  0  1  0 
7 112.5600.8159 2017 31  0  0  1 
8 112.6500.2285 2017 31  0  1  0 
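
For readers who want to run the chain above end to end, here is a minimal self-contained sketch. It assumes the eleven sample rows shown in the question are the whole input (the real DataFrame is larger), so region4 does not appear.

```python
import pandas as pd

# Sample rows copied from the question (rows 0-10 of the original DataFrame).
df_test = pd.DataFrame({
    'code': ['111.0002.0056', '112.6500.2285', '112.5600.6325', '112.5600.6325',
             '112.5600.8159', '111.0002.0056', '111.0002.0056', '111.0002.0056',
             '112.5600.8159', '112.5600.8159', '111.0002.0114'],
    'place': ['region1', 'region2', 'region2', 'region2', 'region2', 'region2',
              'region2', 'region2', 'region3', 'region3', 'region3'],
    'vl': [1] * 11,
    'year': [2017] * 11,
    'week': [29, 31, 30, 30, 30, 29, 30, 28, 31, 28, 31],
})

df_test1 = (
    df_test.groupby(['code', 'year', 'week', 'place'])['vl'].sum()
    .unstack(fill_value=0)      # pivot place into columns
    .reset_index()              # move code, year, week out of the index
    .rename_axis(None, axis=1)  # remove the leftover columns name 'place'
)
print(df_test1)
```

Because reset_index moves every index level into its own column, there is no need to store the index as tuples first.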

Finally, if necessary, change the order of the columns:

# all names in cols must be columns of df_test1
cols = ['code' , 'year', 'week'] 
df_test1 = df_test1[[x for x in df_test1.columns if x not in cols] + cols] 
print (df_test1) 
    region1 region2 region3   code year week 
0  0  1  0 111.0002.0056 2017 28 
1  1  1  0 111.0002.0056 2017 29 
2  0  1  0 111.0002.0056 2017 30 
3  0  0  1 111.0002.0114 2017 31 
4  0  2  0 112.5600.6325 2017 30 
5  0  0  1 112.5600.8159 2017 28 
6  0  1  0 112.5600.8159 2017 30 
7  0  0  1 112.5600.8159 2017 31 
8  0  1  0 112.6500.2285 2017 31 
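
If the index1 column of tuples already exists, as in the question, it can also be split directly. The following is only a sketch under that assumption: the two-row frame is a hypothetical miniature of the grouped data, and the names parts and result are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the grouped frame from the question, with the
# MultiIndex copied into a column of tuples named index1.
idx = pd.MultiIndex.from_tuples(
    [('111.0002.0006', 2017, 29), ('111.0002.0006', 2017, 30)],
    names=['code', 'year', 'week'],
)
df_test1 = pd.DataFrame({'region1': [0, 0], 'region2': [3, 7]}, index=idx)
df_test1['index1'] = df_test1.index

# Expand each tuple into three columns, keeping the same index for alignment.
parts = pd.DataFrame(df_test1['index1'].tolist(),
                     index=df_test1.index,
                     columns=['code', 'year', 'week'])

# Replace index1 with the three new columns and drop the old MultiIndex.
result = (
    pd.concat([df_test1.drop(columns='index1'), parts], axis=1)
    .reset_index(drop=True)
)
print(result)
```

This puts code, year and week after the region columns, matching the layout requested in the question.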

Or you can try pd.crosstab:

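df = df.set_index(['code', 'year', 'week', 'vl'])
# count rows per index tuple and place; the tuples end up in column row_0
df = pd.crosstab(df.index, df.place).reset_index()
# expand the tuples in row_0 into four columns, then drop row_0
df[['code', 'year', 'week', 'vl']] = df['row_0'].apply(pd.Series)
df = df.drop('row_0', axis=1)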

Out[32]: 
place region1 region2 region3   code year week vl 
0   0  1  0 111.0002.0056 2017 28 1 
1   1  1  0 111.0002.0056 2017 29 1 
2   0  1  0 111.0002.0056 2017 30 1 
3   0  0  1 111.0002.0114 2017 31 1 
4   0  2  0 112.5600.6325 2017 30 1 
5   0  0  1 112.5600.8159 2017 28 1 
6   0  1  0 112.5600.8159 2017 30 1 
7   0  0  1 112.5600.8159 2017 31 1 
8   0  1  0 112.6500.2285 2017 31 1
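
A variant of the crosstab idea that avoids the tuple column is to pass the three key columns as a list. This is a sketch, not the answerer's code: the four-row frame is made-up sample data in the question's layout, and out is a hypothetical name. Plain crosstab counts rows, while passing values and aggfunc sums vl as the groupby version does; the two agree here only because vl is always 1.

```python
import pandas as pd

# Made-up sample in the same layout as the question's DataFrame.
df = pd.DataFrame({
    'code': ['111.0002.0056', '112.5600.6325', '112.5600.6325', '111.0002.0056'],
    'place': ['region1', 'region2', 'region2', 'region2'],
    'vl': [1, 1, 1, 1],
    'year': [2017] * 4,
    'week': [29, 30, 30, 29],
})

out = (
    pd.crosstab([df['code'], df['year'], df['week']], df['place'],
                values=df['vl'], aggfunc='sum')  # sum vl per key and place
    .fillna(0).astype(int)      # combinations that never occur become 0
    .reset_index()              # code, year, week become ordinary columns
    .rename_axis(None, axis=1)  # drop the columns name 'place'
)
print(out)
```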