2017-04-20 91 views
1

Pandas sumif with duplicate column names

What is the best way to sum the columns of df2 by the columns of df3 below?

df = pd.DataFrame(np.random.rand(25).reshape((5,5)),index = ['A','B','C','D','E']) 
df1 = pd.DataFrame(np.random.rand(15).reshape((5,3)),index = ['A','B','C','D','E']) 
df2 = pd.concat([df,df1],axis=1) 
df3 = pd.DataFrame(np.random.rand(25).reshape((5,5)),columns = np.arange(5),index = ['A','B','C','D','E']) 

The result should have the shape of df3.

Edit, for clarity:

df = pd.DataFrame(np.ones(25).reshape((5,5)),index = ['A','B','C','D','E']) 
df1 = pd.DataFrame(np.ones(15).reshape((5,3))*2,index = ['A','B','C','D','E'],columns = [1,3,4]) 
df2 = pd.concat([df,df1],axis=1) 
df3 = pd.DataFrame(np.empty((5,5)),columns = np.arange(5),index = ['A','B','C','D','E']) 
print(df2) 
     0    1    2    3    4    1    3    4
A  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0
B  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0
C  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0
D  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0
E  1.0  1.0  1.0  1.0  1.0  2.0  2.0  2.0

The desired result would be:

     0    1    2    3    4
A  1.0  3.0  1.0  3.0  3.0
B  1.0  3.0  1.0  3.0  3.0
C  1.0  3.0  1.0  3.0  3.0
D  1.0  3.0  1.0  3.0  3.0
E  1.0  3.0  1.0  3.0  3.0
+3

What exactly do you mean by "sum the columns of df2 by the columns of df3"? – splinter

Answers

6

You can group the DataFrame by its column labels:

In [57]: df2.groupby(axis=1, by=df2.columns).sum() 
Out[57]: 
     0    1    2    3    4
A  1.0  3.0  1.0  3.0  3.0
B  1.0  3.0  1.0  3.0  3.0
C  1.0  3.0  1.0  3.0  3.0
D  1.0  3.0  1.0  3.0  3.0
E  1.0  3.0  1.0  3.0  3.0

You can also specify the axis by name:

In [58]: df2.groupby(axis='columns', by=df2.columns).sum() 
Out[58]: 
     0    1    2    3    4
A  1.0  3.0  1.0  3.0  3.0
B  1.0  3.0  1.0  3.0  3.0
C  1.0  3.0  1.0  3.0  3.0
D  1.0  3.0  1.0  3.0  3.0
E  1.0  3.0  1.0  3.0  3.0

A shorter version from @piRSquared:

df2.groupby(df2.columns, 1).sum() 
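
A note for readers on recent pandas: `axis=1` in `DataFrame.groupby` was deprecated in pandas 2.1, so the calls above may warn or fail on newer releases. As a version-independent illustration of what grouping by column label computes, here is a minimal sketch that sums every set of columns sharing a label. The names `idx` and `result` are mine, not part of the answer.

```python
import numpy as np
import pandas as pd

# Data from the question's edit: ones in columns 0..4, twos in a second
# set of columns labelled 1, 3 and 4.
idx = list("ABCDE")
df = pd.DataFrame(np.ones((5, 5)), index=idx)
df1 = pd.DataFrame(np.ones((5, 3)) * 2, index=idx, columns=[1, 3, 4])
df2 = pd.concat([df, df1], axis=1)

# For each distinct label, select all columns carrying that label with a
# boolean mask and sum them row by row.
result = pd.DataFrame(
    {c: df2.loc[:, df2.columns == c].sum(axis=1) for c in df2.columns.unique()}
)
print(result)
```

This is slower than a real groupby for wide frames, but it makes the "sumif" semantics explicit.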
+1

If you want to win at code golf, you can skip the parameter names :-) 'df2.groupby(df2.columns, 1).sum()' – piRSquared

+0

@piRSquared, thanks! Added to the answer ;-) – MaxU

0

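Is this what you mean?

new_df = pd.DataFrame(index=df2.index) 
for c in df3.columns: 
    try: 
        # duplicated label: df2[c] is a DataFrame, so sum each row
        new_df[c] = [sum(x) for x in df2[c].values] 
    except TypeError: 
        # unique label: df2[c] is a Series of scalars, copy it unchanged
        new_df[c] = df2[c].values 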
2

Let's use T (transpose), groupby and sum:

df2.T.groupby(level=0).sum().T 

Original df2:

          0         1         2         3         4         0         1  \
A  0.627278  0.008150  0.285077  0.931831  0.683035  0.691318  0.873139
B  0.246861  0.108021  0.903743  0.030373  0.870753  0.143835  0.251623
C  0.367309  0.551530  0.193623  0.704314  0.136061  0.102401  0.287334
D  0.580771  0.592600  0.949666  0.806875  0.288331  0.794173  0.034380
E  0.088984  0.838401  0.988919  0.636134  0.353484  0.584571  0.090235

          2
A  0.763687
B  0.735570
C  0.405304
D  0.446789
E  0.542930

new_df2 = df2.T.groupby(level=0).sum().T 
print(new_df2) 

Output (new df2):

          0         1         2         3         4
A  1.318595  0.881289  1.048764  0.931831  0.683035
B  0.390697  0.359644  1.639314  0.030373  0.870753
C  0.469710  0.838864  0.598927  0.704314  0.136061
D  1.374944  0.626980  1.396455  0.806875  0.288331
E  0.673555  0.928636  1.531849  0.636134  0.353484
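
To get a result with exactly the columns of df3, as the question asks, the transposed groupby can be followed by a reindex. This is a minimal sketch on the ones/twos data from the question's edit; `fill_value=0` is my assumption for labels in df3 that never occur in df2, and the names `summed` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

idx = list("ABCDE")
df = pd.DataFrame(np.ones((5, 5)), index=idx)
df1 = pd.DataFrame(np.ones((5, 3)) * 2, index=idx, columns=[1, 3, 4])
df2 = pd.concat([df, df1], axis=1)
df3 = pd.DataFrame(np.empty((5, 5)), columns=np.arange(5), index=idx)

# Transpose so the duplicated labels become the row index, sum the rows
# that share a label, then transpose back.
summed = df2.T.groupby(level=0).sum().T

# Align to df3's columns; a label missing from df2 would be filled with 0.
result = summed.reindex(columns=df3.columns, fill_value=0)
print(result)
```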
+1

I think I gave this same answer many moons ago. – piRSquared

1

Solution 1
numpy.dot + pandas.get_dummies

cols = df2.columns.values 
pd.DataFrame(
    df2.values.dot(pd.get_dummies(cols).values), 
    df2.index, pd.unique(df2.columns.values) 
) 

    0 1 2 3 4 
A 1 3 1 3 3 
B 1 3 1 3 3 
C 1 3 1 3 3 
D 1 3 1 3 3 
E 1 3 1 3 3 
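
Why this works: `pd.get_dummies` on the column labels yields an 8 x 5 indicator matrix with a 1 wherever a column of df2 carries a given label, so multiplying df2's values by it adds up the columns that share a label. The sketch below spells that out. It takes the output labels from `dummies.columns` rather than `pd.unique`, because as far as I know `get_dummies` orders its columns by sorted label while `pd.unique` keeps first-appearance order; the two coincide for this data but need not in general. The variable names are mine.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    np.array([[1, 1, 1, 1, 1, 2, 2, 2]] * 5),
    index=list("ABCDE"),
    columns=[0, 1, 2, 3, 4, 1, 3, 4],
)

# One row per column of df2, one indicator column per distinct label.
dummies = pd.get_dummies(df2.columns.values, dtype=int)

# (5 x 8) @ (8 x 5): each output column is the sum of the input columns
# that carry the corresponding label.
result = pd.DataFrame(
    df2.to_numpy() @ dummies.to_numpy(),
    index=df2.index,
    columns=dummies.columns,
)
print(result)
```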

Solution 2
numpy.einsum + pandas.get_dummies

cols = df2.columns.values 
pd.DataFrame(
    np.einsum('ij,jk->ik', df2.values, pd.get_dummies(cols).values), 
    df2.index, pd.unique(df2.columns.values) 
) 

    0 1 2 3 4 
A 1 3 1 3 3 
B 1 3 1 3 3 
C 1 3 1 3 3 
D 1 3 1 3 3 
E 1 3 1 3 3 
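
The subscript string 'ij,jk->ik' is the Einstein-summation spelling of an ordinary matrix product, which is why this solution returns the same values as the dot-based one. A small check on two made-up arrays (`a` and `b` are illustrative, not from the answer):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)

# 'ij,jk->ik' multiplies a[i, j] by b[j, k] and sums over the shared index j.
via_einsum = np.einsum("ij,jk->ik", a, b)
via_dot = a.dot(b)

print(np.array_equal(via_einsum, via_dot))  # prints True
```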

Naive timing

[image: timing plot]

Setup

df2 = pd.DataFrame(
    [[1, 1, 1, 1, 1, 2, 2, 2]] * 5, 
    list('ABCDE'), 
    [0, 1, 2, 3, 4, 1, 3, 4] 
)
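
For anyone who wants to rerun a rough comparison on this setup, here is a sketch using `timeit`. The function names, the repetition count of 200, and the choice of the transposed groupby (to stay compatible with recent pandas) are my assumptions; no particular timings are claimed, and results will depend on the machine, the pandas version, and the frame size.

```python
import timeit

import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    [[1, 1, 1, 1, 1, 2, 2, 2]] * 5,
    list("ABCDE"),
    [0, 1, 2, 3, 4, 1, 3, 4],
)


def via_groupby(d):
    # Transposed groupby on the column labels.
    return d.T.groupby(level=0).sum().T


def via_dot(d):
    # Indicator matrix times values.
    dummies = pd.get_dummies(d.columns.values, dtype=int)
    return pd.DataFrame(d.values.dot(dummies.values), d.index, dummies.columns)


def via_einsum(d):
    # Same product, written with einsum.
    dummies = pd.get_dummies(d.columns.values, dtype=int)
    out = np.einsum("ij,jk->ik", d.values, dummies.values)
    return pd.DataFrame(out, d.index, dummies.columns)


timings = {
    f.__name__: timeit.timeit(lambda f=f: f(df2), number=200)
    for f in (via_groupby, via_dot, via_einsum)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 200 runs")
```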