How do I implement an aggregation function for a pandas groupby object? Here is the setup for the question:
import numpy as np
import pandas as pd
import collections as co
data = [['a', 1],
        ['a', 2],
        ['a', 3],
        ['a', 4],
        ['b', 5],
        ['b', 6],
        ['b', 7]]
varnames = tuple('PQ')
df = pd.DataFrame(co.OrderedDict([(varnames[i], [row[i] for row in data])
                                  for i in range(len(varnames))]))
# .iloc replaces the deprecated .ix indexer; it selects the first column by position
gdf = df.groupby(df.iloc[:, 0])
After evaluating the above, df looks like this:
>>> df
   P  Q
0  a  1
1  a  2
2  a  3
3  a  4
4  b  5
5  b  6
6  b  7
gdf is the DataFrameGroupBy object associated with df, where the groups are determined by the values in the first column of df.
Now, watch this:
>>> gdf.aggregate(sum)
    Q
P
a  10
b  18
...but repeating the same thing after replacing sum with a pass-through wrapper around it bombs:
>>> mysum = lambda *a, **k: sum(*a, **k)
>>> mysum(range(10)) == sum(range(10))
True
>>> gdf.aggregate(mysum)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1699, in aggregate
    result = self._aggregate_generic(arg, *args, **kwargs)
  File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1757, in _aggregate_generic
    return self._aggregate_item_by_item(func, *args, **kwargs)
  File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1782, in _aggregate_item_by_item
    result[item] = colg.aggregate(func, *args, **kwargs)
  File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1426, in aggregate
    result = self._aggregate_named(func_or_funcs, *args, **kwargs)
  File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1508, in _aggregate_named
    output = func(group, *args, **kwargs)
  File "<stdin>", line 1, in <lambda>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
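The last frames of the traceback suggest that the wrapper is called once per column, including the string column P. As a minimal sketch (the variable p_values is made up here to stand in for the P values of group 'a'; this is not the actual pandas code path), the same message appears when the built-in sum is applied to a Series of strings, because sum starts from the integer 0:

```python
import pandas as pd

# Hypothetical stand-in for the values of column P within group 'a'.
p_values = pd.Series(["a", "a", "a", "a"])

try:
    # The built-in sum starts at 0, so the first step is 0 + 'a'.
    sum(p_values)
except TypeError as exc:
    message = str(exc)
    print(message)
```

If this reading is right, the failure comes from mysum being applied to P at all, not from anything the wrapper itself does differently from sum.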
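Here is a subtler (though probably related) issue. Recall that the result of gdf.aggregate(sum) is a DataFrame with a single column, Q. Now, note that the result below contains two columns, P and Q: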
>>> import random as rn
>>> gdf.aggregate(lambda *a, **k: rn.random())
          P         Q
P
a  0.344457  0.344457
b  0.990507  0.990507
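I have not been able to find anything in the documentation that would explain the following:
Why should gdf.aggregate(mysum) fail? (IOW, does this failure conform to documented behavior, or is it a bug in pandas?)
Why should gdf.aggregate(lambda *a, **k: rn.random()) produce a two-column output while gdf.aggregate(sum) produces a one-column output?
What signature (inputs and outputs) should an aggregation function foo have so that gdf.aggregate(foo) returns a table with only column Q (like the result of gdf.aggregate(sum))?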
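One way to get a result with only column Q, independent of how aggregate treats the grouping column, is to select Q before aggregating. The sketch below is an illustration under assumptions, not a statement of the documented contract: it assumes that foo receives one group's values of a single column as a Series and returns a scalar, and it groups by the column name "P" instead of by the Series used above. Behavior may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"P": list("aaaabbb"), "Q": [1, 2, 3, 4, 5, 6, 7]})

def foo(values):
    # Assumed input: a Series with one group's values of one column.
    # Assumed output: a single scalar for that group.
    return sum(values)

# Selecting [["Q"]] first means foo never sees the string column P.
result = df.groupby("P")[["Q"]].aggregate(foo)
print(result)
```

Restricting the columns up front sidesteps the question of which columns aggregate passes to foo, at the cost of naming the value columns explicitly.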
I actually noticed that 'gdf.aggregate(mysum)' works the same as using 'sum', with pandas 0.14.0. – Ajean 2014-09-04 20:28:38