2017-10-19

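Aggregating on a multi-level index

I have a DataFrame with two levels of column index. I need to apply different aggregation functions to two keys (columns), but my code raises an error. How do I aggregate multiple columns in a DataFrame with multi-level columns?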

import numpy as np 
import pandas as pd 
Period = pd.Period 
nan = np.nan 

dic1 = {('count', 'N.A.'): {Period('1993-01', 'M'): 0, 
    Period('1993-02', 'M'): 0, 
    Period('1993-03', 'M'): 0}, 
('count', 'No'): {Period('1993-01', 'M'): 1, 
    Period('1993-02', 'M'): 1, 
    Period('1993-03', 'M'): 1}, 
('count', 'Yes'): {Period('1993-01', 'M'): 0, 
    Period('1993-02', 'M'): 0, 
    Period('1993-03', 'M'): 0}, 
('sum', 'N.A.'): {Period('1993-01', 'M'): nan, 
    Period('1993-02', 'M'): nan, 
    Period('1993-03', 'M'): nan}, 
('sum', 'No'): {Period('1993-01', 'M'): 6.5820000000000007, 
    Period('1993-02', 'M'): 131.1865, 
    Period('1993-03', 'M'): 133.31049999999999}, 
('sum', 'Yes'): {Period('1993-01', 'M'): nan, 
    Period('1993-02', 'M'): nan, 
    Period('1993-03', 'M'): nan}} 

df1 = pd.DataFrame(dic1) 

df1.to_timestamp(how='end').groupby(pd.TimeGrouper('A')).agg(
{'count':['max', 'min', 'median', 'last'] , 
'sum':['mean', 'max' , 'last']}) 

error: KeyError: 'sum' 

Answers

1

You can flatten the column MultiIndex before grouping:

df1 = pd.DataFrame(dic1) 
df2 = df1.to_timestamp(how='end') 
df2 = df2.rename_axis(['operation', 'YN'], axis=1) 
df3 = df2.stack(level='YN').reset_index('YN') 
# operation  YN count  sum 
# 1993-01-31 N.A.  0  NaN 
# 1993-01-31 No  1 6.5820 
# 1993-01-31 Yes  0  NaN 
# 1993-02-28 N.A.  0  NaN 
# 1993-02-28 No  1 131.1865 
# 1993-02-28 Yes  0  NaN 
# 1993-03-31 N.A.  0  NaN 
# 1993-03-31 No  1 133.3105 
# 1993-03-31 Yes  0  NaN 
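
To see in isolation what stack followed by reset_index does, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration; only the calls to `stack` and `reset_index` and the level names come from the answer.

```python
import pandas as pd

# Hypothetical toy data: two column levels, named as in the answer.
columns = pd.MultiIndex.from_product(
    [["count", "sum"], ["No", "Yes"]], names=["operation", "YN"]
)
toy = pd.DataFrame(
    [[1, 0, 6.5, 2.0], [1, 0, 131.0, 3.0]],
    index=pd.to_datetime(["1993-01-31", "1993-02-28"]),
    columns=columns,
)

# stack moves the 'YN' column level into the row index ...
stacked = toy.stack(level="YN")
# ... and reset_index moves that index level into an ordinary column.
flat = stacked.reset_index("YN")

print(flat)
```

Each date now appears once per 'YN' value, and 'count' and 'sum' are plain single-level columns that a dict-based agg can address by name.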

Once you have moved the YN column index level into a column (by calling stack/reset_index), you can solve the problem in the usual way:


import numpy as np 
import pandas as pd 
Period = pd.Period 
nan = np.nan 

dic1 = {('count', 'N.A.'): {Period('1993-01', 'M'): 0, Period('1993-02', 'M'): 0, Period('1993-03', 'M'): 0}, ('count', 'No'): {Period('1993-01', 'M'): 1, Period('1993-02', 'M'): 1, Period('1993-03', 'M'): 1}, ('count', 'Yes'): {Period('1993-01', 'M'): 0, Period('1993-02', 'M'): 0, Period('1993-03', 'M'): 0}, ('sum', 'N.A.'): {Period('1993-01', 'M'): nan, Period('1993-02', 'M'): nan, Period('1993-03', 'M'): nan}, ('sum', 'No'): {Period('1993-01', 'M'): 6.5820000000000007, Period('1993-02', 'M'): 131.1865, Period('1993-03', 'M'): 133.31049999999999}, ('sum', 'Yes'): {Period('1993-01', 'M'): nan, Period('1993-02', 'M'): nan, Period('1993-03', 'M'): nan}} 

df1 = pd.DataFrame(dic1) 
df2 = df1.to_timestamp(how='end') 
df2 = df2.rename_axis(['operation', 'YN'], axis=1) 
df3 = df2.stack(level='YN').reset_index('YN') 

# pd.Grouper(freq='A') replaces pd.TimeGrouper('A'), which was deprecated in pandas 0.21 
grouped = df3.groupby([pd.Grouper(freq='A'), 'YN']) 
result = grouped.agg(
    {'count':['max', 'min', 'median', 'last'], 'sum':['mean', 'max' , 'last']}) 
result = result.unstack('YN') 
print(result) 

which yields

             sum                                                            count  \
            mean                   max                 last                   max
YN          N.A.         No  Yes  N.A.        No  Yes  N.A.        No  Yes   N.A.
1993-12-31   NaN  90.359667  NaN   NaN  133.3105  NaN   NaN  133.3105  NaN      0

           ...
           ...        min           median           last
YN         ...  Yes  N.A.  No  Yes    N.A.  No  Yes  N.A.  No  Yes
1993-12-31 ...    0     0   1    0       0   1    0     0   1    0
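
For readers who want a variant that does not depend on a time-grouper object or an annual frequency alias, here is a minimal sketch of the same approach. Grouping on the calendar year of the index is a substitution made for this sketch, not part of the answer, so the row label comes out as 1993 rather than 1993-12-31. The data is rebuilt compactly with the same values as dic1.

```python
import numpy as np
import pandas as pd

# Rebuild the question's data compactly (same values as dic1).
nan = np.nan
periods = pd.period_range("1993-01", "1993-03", freq="M")
columns = pd.MultiIndex.from_product(
    [["count", "sum"], ["N.A.", "No", "Yes"]], names=["operation", "YN"]
)
df = pd.DataFrame(
    [
        [0, 1, 0, nan, 6.582, nan],
        [0, 1, 0, nan, 131.1865, nan],
        [0, 1, 0, nan, 133.3105, nan],
    ],
    index=periods,
    columns=columns,
)

# Same reshaping as in the answer: periods -> timestamps, 'YN' level -> column.
flat = df.to_timestamp(how="end").stack(level="YN").reset_index("YN")

# Assumption of this sketch: group on the calendar year of the index
# instead of an annual time grouper.
grouped = flat.groupby([flat.index.year, "YN"])
result = grouped.agg(
    {"count": ["max", "min", "median", "last"], "sum": ["mean", "max", "last"]}
)
result = result.unstack("YN")
print(result)
```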

Thanks! Does 'reset_index(-1)' make pandas repeat the dates, or does it do something else? – Roo

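[reset_index](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html) moves the index (or, in the case of a MultiIndex, one or more of its levels) into columns of the DataFrame. 'reset_index(-1)' moves the last level of the MultiIndex into a column. In this case, it moves the 'YN' index level into a new column of the same name. 'reset_index(-1)' is useful when the last level has no name. Here, I should have used 'reset_index('YN')' since it is more descriptive. – unutbu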

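[stack](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.stack.html) moves the column index (or a level of a column MultiIndex) into the row index. Together, 'stack' followed by 'reset_index' moves a level of the column index into a new DataFrame column. – unutbu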

2

A somewhat hacky way to do this is to pull out the count and sum columns separately:

In [11]: agg_dict = {col: ['mean', 'max' , 'median', 'last'] for col in df1.columns[df1.columns.get_level_values(0) == "count"]} 

In [12]: agg_dict.update({col: ['mean', 'max' , 'last'] for col in df1.columns[df1.columns.get_level_values(0) == "sum"]}) 

In [13]: g = df1.to_timestamp(how='end').groupby(pd.Grouper(freq='A'))  # pd.TimeGrouper('A') in older pandas 

In [14]: g.agg(agg_dict) 
Out[14]: 
             sum                                                            count
            N.A.                    No                       Yes             N.A.                       No                      Yes
            mean  max  last       mean       max      last  mean  max  last  mean  max  median  last  mean  max  median  last  mean  max  median  last
1993-12-31   NaN  NaN   NaN  90.359667  133.3105  133.3105   NaN  NaN   NaN     0    0       0     0     1    1       1     1     0    0       0     0
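
A related way to handle the count and sum blocks separately, sketched here under stated assumptions rather than taken from the answer: select each top-level block, aggregate it on its own, then glue the pieces back together with `pd.concat`. Grouping on the calendar year of the index and the compact rebuild of the data (same values as dic1) are choices made for this sketch.

```python
import numpy as np
import pandas as pd

# Rebuild the question's data compactly (same values as dic1).
nan = np.nan
periods = pd.period_range("1993-01", "1993-03", freq="M")
columns = pd.MultiIndex.from_product(
    [["count", "sum"], ["N.A.", "No", "Yes"]], names=["operation", "YN"]
)
df = pd.DataFrame(
    [
        [0, 1, 0, nan, 6.582, nan],
        [0, 1, 0, nan, 131.1865, nan],
        [0, 1, 0, nan, 133.3105, nan],
    ],
    index=periods,
    columns=columns,
)

ts = df.to_timestamp(how="end")
year = ts.index.year  # assumption: group by calendar year

# ts["count"] selects the whole top-level block and drops that level,
# leaving ordinary single-level columns (N.A., No, Yes).
count_part = ts["count"].groupby(year).agg(["mean", "max", "median", "last"])
sum_part = ts["sum"].groupby(year).agg(["mean", "max", "last"])

# A dict passed to concat puts its keys back as the outer column level.
result = pd.concat({"count": count_part, "sum": sum_part}, axis=1)
print(result)
```

The resulting columns have three levels (operation, YN, function), the same layout as the output of the agg_dict approach.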

Nice solution, thanks @Andy Hayden – Roo