2017-03-09 72 views
4

如何组合三个数据帧,如下所示?根据条件合并3个不同的数据帧

前两个的主要关系必须基于ID1,因为它是两个数据框之间的匹配关系。

第三个数据帧,地址2必须以哈希添加

DF1匹配:

Name1 Name2 Name3 Address ID1  ID2 Own 
Matt John1 Jill  878 home 1  0  Deal 
Matt John2 Jack  879 home 2  1  Dael 

DF2:

Name1 ID1 Address Name4  Address2 
Matt 1  878 home face1  face\123 
Matt 1  878 home face2  face\345 
Matt 1  878 home face3  face\678  
Matt 2  879 home head1  head\123 
Matt 2  879 home head2  head\345 
Matt 2  879 home head3  head\678 

DF3:

Address2  Hash 
face\123  abc123 
face\345  cde321 
face\678  efg123 
head\123  123efg 
head\345  efg321 
head\678  acd321 

我正试图结合3个dataframes成一个象下面这样:

Name1 Name2 ID1 Address  Own Name3 ID2 Name4 Address2 Hash 
Matt John1 1 878 home Deal Jill 0 face1 face\123 abc123 
Matt John1 1 878 home Deal Jill 0 face2 face\345 cde321 
Matt John1 1 878 home Deal Jill 0 face3 face\678 efg123 
Matt John2 2 879 home Dael Jack 1 head1 head\123 123efg 
Matt John2 2 879 home Dael Jack 1 head2 head\345 efg321 
Matt John2 2 879 home Dael Jack 1 head3 head\678 acd321 

DF1之间和DF2关键是到ID1 DF2之间和DF3关键是地址2

非常感谢您的帮助。

+1

你不就是在这里合并列交叉吗? 'df1.merge(DF2).merge(DF3)'? – miradulo

回答

1

看看merge函数,可以找到一些例子here。针对您的特定问题,请尝试以下操作:

combined_df = df1.merge(df2, on="Id1", how="inner").merge(df3, on="Adress2", how="inner") 
0

我认为这会起作用。 Ther合并函数几乎适合您想要加入的列。

import numpy as np 
import pandas as pd 

data = np.array([['Name1','Name2','Name3','Address','ID1','ID2','Own'], 
       ['Matt','John1','Jill','878 home','1','0','Deal'], 
       ['Matt', 'John2', 'Jack', '879 home', '2', '1', 'Dael']]) 

data2 = np.array([['Name1','ID1','Address','Name4','Address2'], 
       ['Matt', '1','878 home','face1',"face.123"], 
       ['Matt', '1','878 home', 'face2','face.345'], 
        ['Matt', '1','878 home', 'face3', 'face.678'], 
        ['Matt', '2', '879 home', 'head1', 'head.123'], 
        ['Matt', '2', '879 home', 'head2', 'head.345'], 
        ['Matt', '2', '879 home', 'head3', 'head.678']]) 
#print(data) 
data3 = np.array([['Address2','Hash'], 
       ['face.123', 'abc123'], 
       ['face.345','cde321'], 
       ['face.678', 'efg123'], 
       ['head.123', '123efg'], 
       ['head.345', 'efg321'], 
       ['head.678', 'acd321']]) 

df1 = pd.DataFrame(data=data[1:,:], columns=data[0,:]) 
df2 = pd.DataFrame(data=data2[1:,:], columns=data2[0,:]) 
df3 = pd.DataFrame(data=data3[1:,:], columns=data3[0,:]) 


Cdf= pd.merge(df1,df2, on='ID1', how='inner') 
Ddf = pd.merge(Cdf,df3, on = 'Address2', how='inner') 
print(Ddf) 
0

从你期望的输出,你似乎并不需要任何规范超出列交叉融合是自动进行的。

>>> df1.merge(df2).merge(df3) 

    Name1 Name2 Name3 Address ID1 ID2 Own Name4 Address2 Hash 
0 Matt John1 Jill 878 home 1 0 Deal face1 face\123 abc123 
1 Matt John1 Jill 878 home 1 0 Deal face2 face\345 cde321 
2 Matt John1 Jill 878 home 1 0 Deal face3 face\678 efg123 
3 Matt John2 Jack 879 home 2 1 Dael head1 head\123 123efg 
4 Matt John2 Jack 879 home 2 1 Dael head2 head\345 efg321 
5 Matt John2 Jack 879 home 2 1 Dael head3 head\678 acd321 

指定单数列作为接受的答案进行合并确实会导致问题,因为您将有后缀列。

>>> df1.merge(df2, on="ID1", how="inner").merge(df3, on="Address2", how="inner") 

    Name1_x Name2 Name3 Address_x ID1 ID2 Own Name1_y Address_y Name4 \ 
0 Matt John1 Jill 878home 1 0 Deal Matt 878home face1 
1 Matt John1 Jill 878home 1 0 Deal Matt 878home face2 
2 Matt John1 Jill 878home 1 0 Deal Matt 878home face3 
3 Matt John2 Jack 879home 2 1 Dael Matt 879home head1 
4 Matt John2 Jack 879home 2 1 Dael Matt 879home head2 
5 Matt John2 Jack 879home 2 1 Dael Matt 879home head3 

    Address2 Hash 
0 face\123 abc123 
1 face\345 cde321 
2 face\678 efg123 
3 head\123 123efg 
4 head\345 efg321 
5 head\678 acd321