2017-01-04 108 views
2

How do I add a row for each sub-index in a pandas MultiIndex DataFrame?

Suppose I have the following DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
        'office_id': list(range(1, 7)) * 2,
        # pd.np was removed in pandas 2.0, so call NumPy directly
        'sales': [np.random.randint(100000, 999999) for _ in range(12)],
    }
)

This gives:

    office_id   sales state
0           1  903325    CA
1           2  364594    WA
2           3  737728    CO
3           4  239378    AZ
4           5  833003    CA
5           6  501536    WA
6           1  920821    CO
7           2  879602    AZ
8           3  661818    CA
9           4  548888    WA
10          5  842459    CO
11          6  906791    AZ

Now I run a groupby on office_id and state:

df.groupby(["office_id", "state"]).aggregate({"sales": "sum"}) 

This results in:

    sales 
office_id state 
1   CA  903325 
      CO  920821 
2   AZ  879602 
      WA  364594 
3   CA  661818 
      CO  737728 
4   AZ  239378 
      WA  548888 
5   CA  833003 
      CO  842459 
6   AZ  906791 
      WA  501536 

Is it possible to add a row for each office_id with a new index label total (that is, the sum of the sales column over that office's states)?

I can compute it by grouping on "office_id" and calling sum, but that gives me a new DataFrame, and I have not managed to merge it back in.

Answers

2

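You can reshape with Series.unstack, add a new column total, and then reshape back with DataFrame.stack. If you need a DataFrame with a MultiIndex rather than a Series, finish with Series.to_frame: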

df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack() 
df1['total'] = df1.sum(axis=1) 
df1 = df1.stack().to_frame('sales') 
print (df1) 
        sales 
office_id state   
1   CA  505047.0 
      CO  724412.0 
      total 1229459.0 
2   AZ  402775.0 
      WA  339803.0 
      total 742578.0 
3   CA  343655.0 
      CO  833474.0 
      total 1177129.0 
4   AZ  574130.0 
      WA  656577.0 
      total 1230707.0 
5   CA  122260.0 
      CO  207717.0 
      total 329977.0 
6   AZ  262568.0 
      WA  504491.0 
      total 767059.0 

df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack() 
df1['total'] = df1.sum(axis=1) 
df1 = df1.stack().to_frame('sales') 
#cast if sales are always integers 
df1.sales = df1.sales.astype(int) 
print (df1) 
        sales 
office_id state   
1   CA  323107 
      CO  658336 
      total 981443 
2   AZ  273728 
      WA  942249 
      total 1215977 
3   CA  773390 
      CO  692275 
      total 1465665 
4   AZ  669435 
      WA  735141 
      total 1404576 
5   CA  530182 
      CO  232104 
      total 762286 
6   AZ  532248 
      WA  951481 
      total 1483729 
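The reshape-and-cast steps above can be checked on reproducible data. The following is a minimal sketch rather than the answerer's code: the fixed seed, the names `wide`, `long` and `per_office`, and the defensive `dropna()` call are all assumptions added for illustration. The `dropna()` is there so that the result does not depend on whether `stack` itself discards the missing (office, state) pairs that `unstack` introduces.

```python
import numpy as np
import pandas as pd

# Fixed seed (an assumption) so the run is repeatable.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "state": ["CA", "WA", "CO", "AZ"] * 3,
    "office_id": list(range(1, 7)) * 2,
    "sales": rng.integers(100000, 999999, size=12),
})

# One row per office, one column per state; missing pairs become NaN.
wide = df.groupby(["office_id", "state"])["sales"].sum().unstack()
wide["total"] = wide.sum(axis=1)  # row-wise sum, NaN values are skipped

# Back to long form; drop the NaN pairs explicitly, then cast to int.
long = wide.stack().dropna().astype(int).to_frame("sales")

# Sanity check: every 'total' row equals the plain per-office sum.
per_office = df.groupby("office_id")["sales"].sum()
totals = long.xs("total", level="state")["sales"]
assert (totals == per_office).all()

print(long)
```

Each office has two states in this data, so the result has three rows per office: two state rows followed by the total.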

Timings

def jez(df): 
    df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack() 
    df1['total'] = df1.sum(axis=1) 
    df1 = df1.stack().to_frame('sales') 
    return df1 

print(jez(df)) 

In [339]: %timeit (df.pivot_table(index='office_id', columns='state', margins=True, margins_name='total', aggfunc='sum').stack()) 
100 loops, best of 3: 14.6 ms per loop 

In [340]: %timeit (jez(df)) 
100 loops, best of 3: 2.78 ms per loop 

2

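pandas has built-in functionality for this: pivot_table with the margins parameter set to True. Note that the result only sorts correctly here because 'total' is lowercase, and uppercase letters sort before lowercase ones.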

df.pivot_table(index='office_id', columns='state', margins=True, 
       margins_name='total', aggfunc='sum').stack() 

        sales 
office_id state   
1   CA  415727.0 
      CO  240142.0 
      total 655869.0 
2   AZ  126350.0 
      WA  385698.0 
      total 512048.0 
3   CA  387320.0 
      CO  487075.0 
      total 874395.0 
4   AZ  978018.0 
      WA  878368.0 
      total 1856386.0 
5   CA  105057.0 
      CO  852025.0 
      total 957082.0 
6   AZ  130853.0 
      WA  435940.0 
      total 566793.0 
total  AZ  1235221.0 
      CA  908104.0 
      CO  1579242.0 
      WA  1700006.0 
      total 5422573.0 
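The sorting remark in this answer comes down to plain string ordering, where uppercase ASCII letters compare lower than lowercase ones. A minimal sketch in pure Python, using a hypothetical capitalized label 'Total' for comparison (that label and the variable names are assumptions, not part of the answer):

```python
states = ["CA", "CO", "WA", "AZ"]

# Lowercase 'total' sorts after every uppercase state code.
lower_sorted = sorted(states + ["total"])
print(lower_sorted)  # ['AZ', 'CA', 'CO', 'WA', 'total']

# A capitalized 'Total' would land between 'CO' and 'WA' instead.
upper_sorted = sorted(states + ["Total"])
print(upper_sorted)  # ['AZ', 'CA', 'CO', 'Total', 'WA']
```

So if the margin label were capitalized and the index were sorted afterwards, the total row would no longer sit last within each office.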
0

You can also use concat to append the aggregated totals.

This returns:

    sales 
office_id state   
1   CA  914776 
      CO  902173 
2   AZ  605783 
      WA  865189 
3   CA  280203 
      CO  556867 
4   AZ  958747 
      WA  643333 
5   CA  703606 
      CO  644399 
6   AZ  768268 
      WA  834051 
Total  AZ  2332798 
      CA  1898585 
      CO  2103439 
      WA  2342573 

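Here, the DataFrame is aggregated at two levels: office-and-state, and state only. The two results are joined with concat. The DataFrame aggregated to the state level must be given an extra index level before concatenation. This is done via […]. In addition, the index levels must be swapped to match the office-and-state-level DataFrame.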
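The code for this answer did not survive on this page, so the following is only a sketch of the steps described above, not the answerer's original. The fixed seed, the variable names, and the choice of `assign` plus `set_index(..., append=True)` to add the extra index level are all assumptions. `swaplevel` and `pd.concat` are the standard pandas calls for the swap and the join.

```python
import numpy as np
import pandas as pd

# Fixed seed (an assumption) so the run is repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "state": ["CA", "WA", "CO", "AZ"] * 3,
    "office_id": list(range(1, 7)) * 2,
    "sales": rng.integers(100000, 999999, size=12),
})

# Level 1: sales per (office_id, state).
by_office_state = df.groupby(["office_id", "state"])[["sales"]].sum()

# Level 2: sales per state, indexed by state only.
by_state = df.groupby("state")[["sales"]].sum()

# Give the state-level frame an extra index level labelled 'Total',
# then swap the levels so the index reads (office_id, state).
by_state = (
    by_state.assign(office_id="Total")
    .set_index("office_id", append=True)  # index is now (state, office_id)
    .swaplevel(0, 1)                      # index is now (office_id, state)
)

result = pd.concat([by_office_state, by_state])
print(result)
```

Unlike the other answers, this appends per-state totals across all offices as one block at the end, which matches the output shown above rather than the per-office total rows the question asked for.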
