2016-01-14 47 views

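Left outer join of pandas DataFrames using a "contains" condition

I have two pandas DataFrames, df1 and df2. I want to left-outer-join df1 to df2, but with a "contains" condition instead of key equality: a pair of rows matches when df1.Full_Key contains df2.Partial_key.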

Select df1.data_id1, df1.Full_Key, df1.text_field 
, df2.data_id2, df2.text_field 
from df1 
LEFT OUTER JOIN df2 on "df1.Full_Key contains df2.Partial_key" 

Is there a way to do this without a for loop? Given:

import numpy as np
import pandas as pd

# DataFrame.from_items was removed in pandas 1.0; a dict keeps the column order
df1 = pd.DataFrame({
    'data_id1': ['bzx_0001', 'bzx_0002', 'bzx_0003', 'bzx_0004'],
    'Full_Key_1': ['AAAA-BBBB-20150101-NS237890', 'BBBB-CCCC-21050101-MS18546', 'CCCC-CCCC-20150101-MS34567', 'CCCC-CCCC-20150101-MS34568'],
    'text_field': ['aaaaa', 'bbbbb', 'cccccc', 'ddddd'],
})

df2 = pd.DataFrame({
    'data_id2': ['dm_0001', 'dm_0002', 'dm_0003', 'dm_0004'],
    'Partial_key': ['AAAA-BBBB-20150101-', 'AAAA-BBBB-20150101-', 'BBBB-CCCC-21050101-', 'XXXX-XXXX-20150101-'],
})

Expected DataFrame after the join:

# Key values match the df1/df2 inputs above (date part 20150101)
df_exp_res = pd.DataFrame({
    'data_id1': ['bzx_0001', 'bzx_0001', 'bzx_0002', 'bzx_0003', 'bzx_0004'],
    'Full_Key_1': ['AAAA-BBBB-20150101-NS237890', 'AAAA-BBBB-20150101-NS237890', 'BBBB-CCCC-21050101-MS18546', 'CCCC-CCCC-20150101-MS34567', 'CCCC-CCCC-20150101-MS34568'],
    'text_field': ['aaaaa', 'aaaaa', 'bbbbb', 'cccccc', 'ddddd'],
    'data_id2': ['dm_0001', 'dm_0002', 'dm_0003', np.nan, np.nan],
    'Partial_key': ['AAAA-BBBB-20150101-', 'AAAA-BBBB-20150101-', 'BBBB-CCCC-21050101-', np.nan, np.nan],
})

My solution, using loops:

# Header row first, then one row per (df1 row, matching df2 row) pair
s = [['data_id1', 'Full_Key_1', 'text_field', 'Partial_key', 'data_id2']]
for indx1, row1 in df1.iterrows():
    fnd = False
    for indx2, row2 in df2.iterrows():
        if row2['Partial_key'].strip() in row1['Full_Key_1'].strip():
            s.append([row1['data_id1'], row1['Full_Key_1'],
                      row1['text_field'], row2['Partial_key'],
                      row2['data_id2']])
            fnd = True
    # No match in df2: keep the df1 row and pad the df2 columns with NaN
    if not fnd:
        s.append([row1['data_id1'], row1['Full_Key_1'],
                  row1['text_field'], np.nan, np.nan])

pd_result_calc = pd.DataFrame(s[1:], columns=s[0])
print(df1)
print(df2)
print(pd_result_calc)
Is 'Partial_key' always a truncation of 'Full_Key'? Are they always 19 characters? – unutbu

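Partial_Keys and Full_Keys have no fixed length. Any Full_Key that contains the whole Partial_Key counts as a match. But yes, a Partial_Key will always be a truncation of a Full_Key, of the form Partial_key = Full_Key[0:some_n] where 0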

Answer

Based on a cross join; see "cartesian product in pandas".

import pandas as pd

df1 = pd.DataFrame({
    'data_id1': ['bzx_0001', 'bzx_0002', 'bzx_0003', 'bzx_0004'],
    'Full_Key_1': ['AAAA-BBBB-20150101-NS237890', 'BBBB-CCCC-21050101-MS18546', 'CCCC-CCCC-20150101-MS34567', 'CCCC-CCCC-20150101-MS34568'],
    'text_field': ['aaaaa', 'bbbbb', 'cccccc', 'ddddd'],
})

df2 = pd.DataFrame({
    'data_id2': ['dm_0001', 'dm_0002', 'dm_0003', 'dm_0004'],
    'Partial_key': ['AAAA-BBBB-20150101-', 'AAAA-BBBB-20150101-', 'BBBB-CCCC-21050101-', 'XXXX-XXXX-20150101-'],
})

# A constant helper column: merging on it pairs every df1 row with every df2 row
df1['key'] = 1
df2['key'] = 1

merged_cross_join = pd.merge(df1, df2, on='key')

# we don't need this helper column 'key' any longer
merged_cross_join.drop('key', axis=1, inplace=True)
df1.drop('key', axis=1, inplace=True)
df2.drop('key', axis=1, inplace=True)

# True for the row pairs where Partial_key is a substring of Full_Key_1
contains_criteria = merged_cross_join[['Full_Key_1', 'Partial_key']].apply(lambda x: x['Partial_key'] in x['Full_Key_1'], axis=1)
print(merged_cross_join[contains_criteria])

This produces:

   data_id1                   Full_Key_1 text_field data_id2          Partial_key
0  bzx_0001  AAAA-BBBB-20150101-NS237890      aaaaa  dm_0001  AAAA-BBBB-20150101-
1  bzx_0001  AAAA-BBBB-20150101-NS237890      aaaaa  dm_0002  AAAA-BBBB-20150101-
6  bzx_0002   BBBB-CCCC-21050101-MS18546      bbbbb  dm_0003  BBBB-CCCC-21050101-
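On pandas 1.2 or later the helper column is not needed, because merge accepts how='cross'. The following is only a sketch of the same filtering step under that version assumption; the names pairs, is_match and matched_pairs are made up for this illustration and are not part of the code above.

```python
import pandas as pd

df1 = pd.DataFrame({
    'data_id1': ['bzx_0001', 'bzx_0002', 'bzx_0003', 'bzx_0004'],
    'Full_Key_1': ['AAAA-BBBB-20150101-NS237890', 'BBBB-CCCC-21050101-MS18546',
                   'CCCC-CCCC-20150101-MS34567', 'CCCC-CCCC-20150101-MS34568'],
    'text_field': ['aaaaa', 'bbbbb', 'cccccc', 'ddddd'],
})
df2 = pd.DataFrame({
    'data_id2': ['dm_0001', 'dm_0002', 'dm_0003', 'dm_0004'],
    'Partial_key': ['AAAA-BBBB-20150101-', 'AAAA-BBBB-20150101-',
                    'BBBB-CCCC-21050101-', 'XXXX-XXXX-20150101-'],
})

# Cartesian product without a helper column (assumes pandas >= 1.2).
pairs = pd.merge(df1, df2, how='cross')

# Substring test for every row pair, done as a plain list comprehension
# instead of DataFrame.apply(axis=1).
is_match = [partial in full
            for partial, full in zip(pairs['Partial_key'], pairs['Full_Key_1'])]

# Keep only the pairs where Partial_key occurs inside Full_Key_1.
matched_pairs = pairs.loc[is_match]
print(matched_pairs)
```

The cross join still materialises len(df1) * len(df2) rows, so this stays practical only while that product fits in memory.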

Then, because you want the behavior of a left outer join, we don't want to lose any rows from df1:

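# Work from the filtered pairs; the unfiltered cross join contains every data_id1
matched = merged_cross_join[contains_criteria]
not_matched_in_df1 = set(df1['data_id1']) - set(matched['data_id1'])
final = pd.concat([matched, df1[df1['data_id1'].isin(not_matched_in_df1)]], axis=0, ignore_index=True)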

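Or, alternatively (note that combine_first aligns rows on the index rather than on data_id1, so it only recovers the unmatched df1 rows when the index labels happen to line up, as they do for this sample data):

merged_cross_join[contains_criteria].combine_first(df1)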

This produces:

   data_id1                   Full_Key_1 text_field data_id2          Partial_key
0  bzx_0001  AAAA-BBBB-20150101-NS237890      aaaaa  dm_0001  AAAA-BBBB-20150101-
1  bzx_0001  AAAA-BBBB-20150101-NS237890      aaaaa  dm_0002  AAAA-BBBB-20150101-
2  bzx_0002   BBBB-CCCC-21050101-MS18546      bbbbb  dm_0003  BBBB-CCCC-21050101-
3  bzx_0003   CCCC-CCCC-20150101-MS34567     cccccc      NaN                  NaN
4  bzx_0004   CCCC-CCCC-20150101-MS34568      ddddd      NaN                  NaN
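The whole procedure can also be wrapped into one helper. This is a minimal sketch, not a pandas feature: left_join_contains and the temporary _left_pos column are hypothetical names, and it assumes pandas 1.2 or later (for how='cross'), string keys without missing values, and no column names shared between the two frames.

```python
import pandas as pd


def left_join_contains(left, right, full_col, partial_col):
    """Left outer join where a pair matches if right[partial_col] is a
    substring of left[full_col] (hypothetical helper for illustration)."""
    # Tag every left row with its position so unmatched rows can be restored.
    left_pos = left.reset_index(drop=True).rename_axis('_left_pos').reset_index()

    # Pair every left key with every right row.
    pairs = left_pos[['_left_pos', full_col]].merge(right, how='cross')

    # Keep the pairs that satisfy the "contains" condition.
    mask = [p in f for p, f in zip(pairs[partial_col], pairs[full_col])]
    matched = pairs.loc[mask].drop(columns=full_col)

    # A regular left merge on the position brings back unmatched left rows,
    # with NaN in the columns that come from the right frame.
    out = left_pos.merge(matched, on='_left_pos', how='left')
    return out.drop(columns='_left_pos')


df1 = pd.DataFrame({
    'data_id1': ['bzx_0001', 'bzx_0002', 'bzx_0003', 'bzx_0004'],
    'Full_Key_1': ['AAAA-BBBB-20150101-NS237890', 'BBBB-CCCC-21050101-MS18546',
                   'CCCC-CCCC-20150101-MS34567', 'CCCC-CCCC-20150101-MS34568'],
    'text_field': ['aaaaa', 'bbbbb', 'cccccc', 'ddddd'],
})
df2 = pd.DataFrame({
    'data_id2': ['dm_0001', 'dm_0002', 'dm_0003', 'dm_0004'],
    'Partial_key': ['AAAA-BBBB-20150101-', 'AAAA-BBBB-20150101-',
                    'BBBB-CCCC-21050101-', 'XXXX-XXXX-20150101-'],
})

result = left_join_contains(df1, df2, 'Full_Key_1', 'Partial_key')
print(result)
```

Merging back on the row position instead of on data_id1 means the helper does not rely on data_id1 being unique in df1.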
This drops the left outer join part: the df1 rows whose Full_Key found no match.

See the update now. – Dickster