2017-08-11 46 views
2
df=pd.DataFrame(np.random.randn(4,4), columns=['a','b','c','d']) 

    a  b  c  d 
0 1.23 -1.25 0.15 1.41 
1 0.64 -0.16 0.46 -1.06 
2 -0.98 0.54 -0.81 0.33 
3 1.71 0.95 1.01 -0.81 

Creating column pairs in pandas: how can I create a DataFrame that contains all possible pairs of columns?

p1 p2 c1  c2 
a b 1.23 -1.25 
a b 0.64 -0.16 
a b -0.98 0.54 
a b 1.71 0.95 
a c 1.23 0.15 
a c 0.64 0.46 
a c -0.98 -0.81 
a c 1.71 1.01 
... ... ... ... 
c d 0.15 1.41 
c d 0.46 -1.06 
c d -0.81 0.33 
c d 1.01 -0.81 

Answers

2

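Use:

  • Get all combinations of the columns with itertools.combinations and loop over them in a list comprehension
  • Select each pair of columns and rename them (df[[x[0], x[1]]] is df[['a', 'b']] for the first pair)
  • concat with the keys parameter to build a MultiIndex
  • Finally, some data cleaning: remove the third index level and set the new column names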

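from itertools import combinations

# all unordered pairs of column names: ('a','b'), ('a','c'), ...
cc = list(combinations(df.columns,2))

# one two-column frame per pair, with the columns renamed to c1/c2
dfs = [df[[x[0], x[1]]].rename(columns={x[0]:'c1', x[1]:'c2'}) for x in cc]
# keys=cc adds each pair as the first two index levels
df1 = pd.concat(dfs, keys=cc)
# drop the original row labels (level 2) and name the pair levels p1/p2
df1 = df1.reset_index(level=2, drop=True).rename_axis(('p1','p2')).reset_index()
print(df1)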
    p1 p2 c1 c2 
0 a b 1.23 -1.25 
1 a b 0.64 -0.16 
2 a b -0.98 0.54 
3 a b 1.71 0.95 
4 a c 1.23 0.15 
5 a c 0.64 0.46 
6 a c -0.98 -0.81 
7 a c 1.71 1.01 
8 a d 1.23 1.41 
9 a d 0.64 -1.06 
10 a d -0.98 0.33 
11 a d 1.71 -0.81 
12 b c -1.25 0.15 
13 b c -0.16 0.46 
14 b c 0.54 -0.81 
15 b c 0.95 1.01 
16 b d -1.25 1.41 
17 b d -0.16 -1.06 
18 b d 0.54 0.33 
19 b d 0.95 -0.81 
20 c d 0.15 1.41 
21 c d 0.46 -1.06 
22 c d -0.81 0.33 
23 c d 1.01 -0.81 
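
To see what the keys argument contributes in the steps above, here is a minimal sketch on a small fixed frame. The data and the names small, pairs, pieces, stacked and result are illustrative assumptions, not part of the original answer. Each tuple passed in keys becomes the first two index levels and the original row labels become the third level, which is why level 2 is dropped afterwards.

```python
from itertools import combinations

import pandas as pd

# Hypothetical fixed data so the result is reproducible
small = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

pairs = list(combinations(small.columns, 2))
pieces = [small[[x, y]].rename(columns={x: 'c1', y: 'c2'}) for x, y in pairs]

# keys=pairs labels every piece with its (p1, p2) tuple
stacked = pd.concat(pieces, keys=pairs)
print(stacked.index.nlevels)  # levels: p1, p2, original row label

# drop the row-label level, name the pair levels, move them into columns
result = (stacked.reset_index(level=2, drop=True)
                 .rename_axis(['p1', 'p2'])
                 .reset_index())
print(result)
```

Printing stacked before the cleanup step is a quick way to check which index level holds the original row labels.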
+0

Thanks @jezrael, it works perfectly! – HappyPy

+0

Glad I could help! Really interesting question, thanks ;) – jezrael

0

As long as your DataFrame isn't too large, a nested loop works well enough:

import pandas as pd 
import numpy as np 

df = pd.DataFrame(np.random.randn(4,4), columns=['a','b','c','d']) 

print(df) 
      a   b   c   d 
0 0.004477 -0.367254 -0.251733 -0.957313 
1 0.996096 0.879603 1.499766 0.386398 
2 -0.459716 0.186510 0.738449 -0.219747 
3 0.606211 0.077233 1.583994 0.824706 

master = pd.DataFrame() 
known_pairs = [] 
n = df.shape[0] 

for p1 in df:
    for p2 in df:
        pair = sorted((p1, p2))
        # skip self-pairs and pairs already seen in the other order
        if p1 != p2 and pair not in known_pairs:
            known_pairs.append(pair)
            # n rows that repeat the pair labels, then the two value columns
            tmp = pd.DataFrame(np.array([p1, p2] * n).reshape(n, 2), columns=['p1', 'p2'])
            tmp['c1'] = df[p1]
            tmp['c2'] = df[p2]
            master = pd.concat([master, tmp])

print(master) 
    p1 p2  c1  c2 
0 a b 0.004477 -0.367254 
1 a b 0.996096 0.879603 
2 a b -0.459716 0.186510 
3 a b 0.606211 0.077233 
0 a c 0.004477 -0.251733 
1 a c 0.996096 1.499766 
2 a c -0.459716 0.738449 
3 a c 0.606211 1.583994 
0 a d 0.004477 -0.957313 
1 a d 0.996096 0.386398 
2 a d -0.459716 -0.219747 
3 a d 0.606211 0.824706 
0 b c -0.367254 -0.251733 
1 b c 0.879603 1.499766 
2 b c 0.186510 0.738449 
3 b c 0.077233 1.583994 
0 b d -0.367254 -0.957313 
1 b d 0.879603 0.386398 
2 b d 0.186510 -0.219747 
3 b d 0.077233 0.824706 
0 c d -0.251733 -0.957313 
1 c d 1.499766 0.386398 
2 c d 0.738449 -0.219747 
3 c d 1.583994 0.824706 
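
Note that the index in the output above restarts at 0 for every pair, and that master is concatenated again on every iteration. One possible variation is sketched below. It swaps the nested loop and the known_pairs bookkeeping for itertools.combinations, collects the pieces in a list, and concatenates once with ignore_index=True to get a continuous index. The fixed data and the name frames are assumptions made for illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical fixed data so the result is reproducible
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

frames = []
for p1, p2 in combinations(df.columns, 2):
    # one small frame per pair; the scalars p1/p2 are broadcast to every row
    frames.append(pd.DataFrame({'p1': p1, 'p2': p2,
                                'c1': df[p1].to_numpy(),
                                'c2': df[p2].to_numpy()}))

# a single concat at the end; ignore_index=True renumbers the rows from 0
master = pd.concat(frames, ignore_index=True)
print(master)
```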
2

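Since you want combinations of the columns taken two at a time, you can use itertools.combinations on the columns and then cumcount() to get the row index. Hope it helps.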

import itertools

import numpy as np
import pandas as pd

df=pd.DataFrame(np.random.randn(4,4), columns=['a','b','c','d']) 

r = list(itertools.combinations(df.columns.tolist(), 2)) 

new = pd.DataFrame(list(r*df.shape[0]),columns=['p1','p2']).sort_values(['p1','p2']).reset_index(drop=True) 

new['count'] = new.groupby(['p1','p2']).cumcount() 

new['c1'] = new.apply(lambda x: df.loc[x['count'],x['p1']],axis=1) 
new['c2'] = new.apply(lambda x: df.loc[x['count'],x['p2']],axis=1) 

new = new.drop('count',axis=1) 

Output:

 
    p1 p2  c1  c2 
0 a b -0.157408 -0.293641 
1 a b -0.205898 -0.527494 
2 a b -0.740385 1.058200 
3 a b 2.163202 0.584529 
4 a c -0.157408 0.824047 
5 a c -0.205898 0.016703 
. 
. 
. 
22 c d 0.260635 -0.958339 
23 c d -0.641043 -1.199849
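
The two apply calls above look up one value per row, which can get slow on larger frames. As a rough sketch of an alternative, the same lookups can be done in one step with NumPy positional indexing. This assumes df has the default 0..n-1 row labels, as in the question, so that the cumcount() values double as row positions. The fixed data and the names rows and values are illustrative.

```python
import itertools

import pandas as pd

# Hypothetical fixed data so the result is reproducible
df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0], 'c': [5.0, 6.0]})

r = list(itertools.combinations(df.columns.tolist(), 2))
new = (pd.DataFrame(r * df.shape[0], columns=['p1', 'p2'])
         .sort_values(['p1', 'p2'])
         .reset_index(drop=True))

# position of each row within its (p1, p2) group: 0, 1, ... per pair
rows = new.groupby(['p1', 'p2']).cumcount().to_numpy()

# pick values[row, column] for every output row in a single indexing step
values = df.to_numpy()
new['c1'] = values[rows, df.columns.get_indexer(new['p1'])]
new['c2'] = values[rows, df.columns.get_indexer(new['p2'])]
print(new)
```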