2016-04-21 64 views
1

Merging MultiIndex DataFrames

Consider the following two DataFrames:

import numpy as np
import pandas as pd

arrays1 = [['foo', 'bar', 'bar', 'bar'],
           ['A', 'D', 'E', 'F']]
tuples1 = list(zip(*arrays1))
columnValues1 = pd.MultiIndex.from_tuples(tuples1)
df1 = pd.DataFrame(np.random.rand(4, 4), columns=columnValues1)
print(df1)
        foo       bar
          A         D         E         F
0  0.833444  0.354676  0.468294  0.173005
1  0.409730  0.275342  0.595433  0.322785
2  0.515161  0.340063  0.117509  0.491957
3  0.285594  0.970524  0.322902  0.628351

arrays2 = [['foo', 'foo', 'bar', 'bar'],
           ['B', 'C', 'G', 'H']]
tuples2 = list(zip(*arrays2))
columnValues2 = pd.MultiIndex.from_tuples(tuples2)
df2 = pd.DataFrame(np.random.rand(4, 4), columns=columnValues2)
print(df2)
        foo                 bar
          B         C         G         H
0  0.208822  0.762884  0.424412  0.583324
1  0.767560  0.884583  0.716843  0.329719
2  0.147991  0.424748  0.560599  0.828155
3  0.376050  0.436354  0.704379  0.406324

Say I want to merge these to get this:

        foo                           bar
          A         B         C         D         E         F         G         H
0  0.833444  0.208822  0.762884  0.354676  0.468294  0.173005  0.424412  0.583324
1  0.409730  0.767560  0.884583  0.275342  0.595433  0.322785  0.716843  0.329719
2  0.515161  0.147991  0.424748  0.340063  0.117509  0.491957  0.560599  0.828155
3  0.285594  0.376050  0.436354  0.970524  0.322902  0.628351  0.704379  0.406324

I tried to do this with merge:

pd.merge(df1.reset_index(), df2.reset_index(), on=df1.columns.levels[0], 
how='inner').set_index(df1.columns.levels[0]) 

Unfortunately, I get the following error message:

ValueError: The truth value of an array with more than one element is ambiguous. 
Use a.any() or a.all() 

How can I merge two MultiIndex DataFrames?

Answers

1

UPDATE: selecting the columns dynamically:

In [57]: join = df1.join(df2) 

In [58]: cols = join.columns.get_level_values(0).unique() 

In [59]: cols 
Out[59]: array(['foo', 'bar'], dtype=object) 

In [60]: join = join[cols] 

In [61]: join 
Out[61]: 
        foo                           bar                                \
          A         B         C         D         E         F         G
0  0.176934  0.694937  0.947164  0.510407  0.085626  0.162183  0.382840
1  0.973283  0.743907  0.886495  0.028961  0.740759  0.330742  0.961932
2  0.898224  0.966278  0.131551  0.517563  0.026104  0.624047  0.848640
3  0.713660  0.704461  0.419997  0.718130  0.252294  0.336838  0.016916


          H
0  0.929695
1  0.444762
2  0.338168
3  0.635817

joined = df1.join(df2)[['foo','bar']] 

Explanation:

You can first join your DataFrames:

In [47]: join = df1.join(df2) 

In [48]: join 
Out[48]: 
        foo       bar                           foo                 bar  \
          A         D         E         F         B         C         G
0  0.176934  0.510407  0.085626  0.162183  0.694937  0.947164  0.382840
1  0.973283  0.028961  0.740759  0.330742  0.743907  0.886495  0.961932
2  0.898224  0.517563  0.026104  0.624047  0.966278  0.131551  0.848640
3  0.713660  0.718130  0.252294  0.336838  0.704461  0.419997  0.016916


          H
0  0.929695
1  0.444762
2  0.338168
3  0.635817

and then select the columns (level 0) in the desired order:

In [49]: join = join[['foo','bar']] 

In [50]: join 
Out[50]: 
        foo                           bar                                \
          A         B         C         D         E         F         G
0  0.176934  0.694937  0.947164  0.510407  0.085626  0.162183  0.382840
1  0.973283  0.743907  0.886495  0.028961  0.740759  0.330742  0.961932
2  0.898224  0.966278  0.131551  0.517563  0.026104  0.624047  0.848640
3  0.713660  0.704461  0.419997  0.718130  0.252294  0.336838  0.016916


          H
0  0.929695
1  0.444762
2  0.338168
3  0.635817
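
For reference, the join-then-select steps in this answer can be collected into one self-contained script. This is only a minimal sketch, assuming a reasonably recent pandas; the fixed seed and the name `level0_order` are illustrative choices and are not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # illustrative seed, only for reproducibility

cols1 = pd.MultiIndex.from_arrays([['foo', 'bar', 'bar', 'bar'],
                                   ['A', 'D', 'E', 'F']])
cols2 = pd.MultiIndex.from_arrays([['foo', 'foo', 'bar', 'bar'],
                                   ['B', 'C', 'G', 'H']])
df1 = pd.DataFrame(rng.random((4, 4)), columns=cols1)
df2 = pd.DataFrame(rng.random((4, 4)), columns=cols2)

# Join on the row index; the column tuples do not overlap,
# so no suffixes are needed.
joined = df1.join(df2)

# Level-0 labels in order of first appearance in df1 ('foo', then 'bar').
level0_order = df1.columns.get_level_values(0).unique()

# Selecting by level-0 labels groups each label's columns together.
joined = joined[list(level0_order)]
print(joined.columns.tolist())
```

Taking the label order from `df1.columns` rather than hard-coding `['foo', 'bar']` is what lets this carry over to frames with more level-0 labels.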
+0

Actually, level 0 has more labels than just 'foo' and 'bar'. Is there a way to pass in the column order of df1? – BdB

+0

@BdB, sure, see the "UPDATE" in my answer – MaxU

+0

Great, thanks for the update! – BdB

1

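This is not really a "merge", because you are not matching values between the DataFrames; you are just adding some columns side by side. So pd.concat does what you need: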

combined = pd.concat([df1, df2], axis=1)
combined.sort_index(axis=1, inplace=True)

combined 
Out[13]: 
        bar                                               foo            \
          D         E         F         G         H         A         B
0  0.915879  0.712345  0.460795  0.529782  0.161578  0.803505  0.133896
1  0.234319  0.317113  0.477687  0.525108  0.495104  0.107596  0.374732
2  0.149397  0.244950  0.866735  0.501562  0.758321  0.508689  0.635703
3  0.330018  0.204695  0.598899  0.522993  0.306496  0.936768  0.638874


          C
0  0.614592
1  0.824297
2  0.482161
3  0.792035
+0

This almost works. Is there a way to keep the order of 'foo' and 'bar' as it is? – BdB
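
One possible way to keep the original level-0 order with pd.concat is to replace the alphabetical `sort_index` with a stable sort on each label's first-appearance position. The following is a sketch under that assumption, not something taken from the answers above; the seed and the names `rank` and `order` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # illustrative seed, only for reproducibility

df1 = pd.DataFrame(
    rng.random((4, 4)),
    columns=pd.MultiIndex.from_arrays([['foo', 'bar', 'bar', 'bar'],
                                       ['A', 'D', 'E', 'F']]))
df2 = pd.DataFrame(
    rng.random((4, 4)),
    columns=pd.MultiIndex.from_arrays([['foo', 'foo', 'bar', 'bar'],
                                       ['B', 'C', 'G', 'H']]))

combined = pd.concat([df1, df2], axis=1)

# Rank each level-0 label by its first appearance ('foo' -> 0, 'bar' -> 1).
level0 = combined.columns.get_level_values(0)
rank = {label: i for i, label in enumerate(level0.unique())}

# A stable sort on that rank groups columns by level-0 label while
# keeping their relative order inside each group.
order = np.argsort([rank[label] for label in level0], kind='stable')
combined = combined.iloc[:, order]
print(combined.columns.tolist())
```

The second level ends up alphabetical here only because df1's columns already precede df2's within each group; the stable sort itself does not sort level 1.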