2013-10-09 15 views
8

pandas - pivot_table with non-numeric values? (DataError: No numeric types to aggregate)

I want to pivot a table whose result values are strings.

import pandas as pd

df1 = pd.DataFrame({'index': range(8),
                    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
                    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
                    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
                    'result': ["on", "off", "off", "on", "on", "off", "off", "on"]})

# rows= and cols= were the keyword names when this was asked;
# pandas 0.14 and later call them index= and columns=.
df1.pivot_table(values='result', rows='index', cols=['variable1', 'variable2', 'variable3'])

but I get: DataError: No numeric types to aggregate

It works as expected when I change the result values to numbers:

df2 = pd.DataFrame({'index': range(8),
                    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
                    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
                    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
                    'result': [1, 0, 0, 1, 1, 0, 0, 1]})

df2.pivot_table(values='result', rows='index', cols=['variable1', 'variable2', 'variable3'])

And I get what I need:

variable1    A                   B
variable2    a         b         a    b
variable3    x    y    x    y    x    y
index
0            1  NaN  NaN  NaN  NaN  NaN
1          NaN  NaN    0  NaN  NaN  NaN
2          NaN  NaN  NaN  NaN    0  NaN
3          NaN  NaN  NaN  NaN  NaN    1
4          NaN    1  NaN  NaN  NaN  NaN
5          NaN  NaN  NaN  NaN  NaN    0
6          NaN  NaN  NaN  NaN    0  NaN
7          NaN  NaN  NaN    1  NaN  NaN

I know I could map the strings to numeric values and then reverse the operation afterwards, but maybe there is a more elegant solution?
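For reference, the map-and-reverse workaround mentioned above could look like the following. This is only a sketch: it assumes a recent pandas (keywords `index=`/`columns=`), and the names `to_num`, `to_str`, `numeric`, `pivoted` and `restored` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'index': range(8),
                    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
                    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
                    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
                    'result': ["on", "off", "off", "on", "on", "off", "off", "on"]})
cols = ['variable1', 'variable2', 'variable3']

# Hypothetical lookup tables; the pivot returns floats, hence float keys on the way back.
to_num = {'on': 1, 'off': 0}
to_str = {1.0: 'on', 0.0: 'off'}

# 1) strings -> numbers, 2) pivot (default aggfunc is mean), 3) numbers -> strings.
numeric = df1.assign(result=df1['result'].map(to_num))
pivoted = numeric.pivot_table(values='result', index='index', columns=cols)
restored = pivoted.apply(lambda col: col.map(to_str))  # NaN cells stay NaN

print(restored)
```

It works because every cell holds at most one value, so the mean is the value itself; with duplicates the mean would no longer map back to a string.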

Answers

23

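My original answer was based on pandas 0.14.1, and a lot has changed in the pivot_table function since then (rows -> index, cols -> columns, ...).

Also, it appears that the original lambda trick I posted no longer works on pandas 0.18. You have to supply a reducing function (even if it is just min, max or mean). But even that seemed wrong, because we are not reducing the data set, only reshaping it. So I took a harder look at unstack: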
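To illustrate the reducing-function route described above, here is a minimal sketch assuming a recent pandas with the `index=`/`columns=` keywords. The `'first'` reducer is a substitution chosen for this example because it accepts strings; it is not the lambda from the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'index': range(8),
                    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
                    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
                    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
                    'result': ["on", "off", "off", "on", "on", "off", "off", "on"]})

# 'first' keeps the first value of each group, so it also works on strings.
# Each (index, variable1, variable2, variable3) group here has exactly one row.
wide_first = df1.pivot_table(values='result',
                             index='index',
                             columns=['variable1', 'variable2', 'variable3'],
                             aggfunc='first')

print(wide_first)
```

If a group held several rows, `'first'` would silently drop all but one of them, which is part of why a pure reshape such as unstack fits this data better.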

import pandas as pd

df1 = pd.DataFrame({'index': range(8),
                    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
                    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
                    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
                    'result': ["on", "off", "off", "on", "on", "off", "off", "on"]})

# these are the columns to end up in the multi-index columns.
unstack_cols = ['variable1', 'variable2', 'variable3']

First, set the index on the data to the 'index' column plus the columns you want to unstack, then call unstack with the level argument.

df1.set_index(['index'] + unstack_cols).unstack(level=unstack_cols) 

The resulting dataframe is as follows.

[image: the resulting dataframe]
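A runnable version of the set_index/unstack step, as a sketch: the variable names `wide` and `flat` are illustrative. Unstacking a DataFrame keeps the remaining column name (`'result'`) as the outermost column level, so selecting it gives the three-level layout from the question.

```python
import pandas as pd

df1 = pd.DataFrame({'index': range(8),
                    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
                    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
                    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
                    'result': ["on", "off", "off", "on", "on", "off", "off", "on"]})
unstack_cols = ['variable1', 'variable2', 'variable3']

# Move the listed index levels into the columns; no aggregation takes place.
wide = df1.set_index(['index'] + unstack_cols).unstack(level=unstack_cols)

# Drop the outer 'result' level to get columns keyed by (variable1, variable2, variable3).
flat = wide['result']

print(flat)
```

Because nothing is aggregated, unstack needs each combination of index values to be unique; with duplicate combinations it raises a ValueError instead of picking a value.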

+0

The last solution works as a replacement for pivot() after the changes in pandas 0.17.1 – camdenl

+0

@RandallGoodwin, I realize this question is two years old, but I get the error "ValueError: Function does not reduce" when using your lambda. Would you know why? – RustyShackleford

+1

Another idea: if you might end up with multiple values per cell, you can concatenate the strings by setting `aggfunc=lambda x: ' '.join([str(y) for y in x])` – dllahr
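A small sketch of the string-joining idea from the comment above. The frame `dup` and its column `key` are invented for this example so that one cell receives two values; they are not part of the original question.

```python
import pandas as pd

# Hypothetical data: key 0 / variable1 'A' occurs twice.
dup = pd.DataFrame({'key': [0, 0, 1],
                    'variable1': ["A", "A", "B"],
                    'result': ["on", "off", "on"]})

# The lambda reduces each group to a single string, so pivot_table accepts it.
joined = dup.pivot_table(values='result',
                         index='key',
                         columns='variable1',
                         aggfunc=lambda x: ' '.join(str(y) for y in x))

print(joined)
```

Cells with a single value come through unchanged, and cells with no rows stay NaN.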

2

I think the best compromise is to replace on/off with True/False, which lets pandas "understand" the data better and behave in an intelligent, expected way.

df2 = df1.replace({'on': True, 'off': False}) 

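You basically conceded this in your question. My answer is that I don't think there is a better way, and you should replace 'on'/'off' in any case.

As Andy Hayden pointed out in the comments, you will get better performance if you replace on/off with 1/0.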
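A sketch of the 1/0 variant, assuming a recent pandas (`index=`/`columns=` keywords); `df_num` and `wide_num` are illustrative names. The point is the dtype: after the pivot, the missing cells are NaN and every column is float64 rather than a column of Python objects.

```python
import pandas as pd

df1 = pd.DataFrame({'index': range(8),
                    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
                    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
                    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
                    'result': ["on", "off", "off", "on", "on", "off", "off", "on"]})

# Encode the strings as integers before pivoting.
df_num = df1.assign(result=df1['result'].map({'on': 1, 'off': 0}))

wide_num = df_num.pivot_table(values='result',
                              index='index',
                              columns=['variable1', 'variable2', 'variable3'])

print(wide_num.dtypes)
```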

+1

+1, although it may be better to use 1 and 0, so that the dataframe has float rather than object dtype :) –

+0

I never thought of that. Good point. –

+0

OK, that seems clear enough :) –
