2016-03-29 97 views
7

Pandas: merge dataframes on a datetime index

I have set the date as a DatetimeIndex with df.set_index(pd.to_datetime(df['date']), inplace=True), and I want to merge or join the following two dataframes on the date:

df.head(5) 
     catcode_amt type feccandid_amt amount 
date     
1915-12-31 A5000 24K  H6TX08100 1000 
1916-12-31 T6100 24K  H8CA52052 500 
1954-12-31 H3100 24K  S8AK00090 1000 
1985-12-31 J7120 24E  H8OH18088 36 
1997-12-31 z9600 24K  S6ND00058 2000 


d.head(5) 
     catcode_disp disposition feccandid_disp bills 
date     
2007-12-31 A0000 support  S4HI00011    1 
2007-12-31 A1000 oppose  S4IA00020', 'P20000741 1 
2007-12-31 A1000 support  S8MT00010    1 
2007-12-31 A1500 support  S6WI00061    2 
2007-12-31 A1600 support  S4IA00020', 'P20000741 3 

I have tried the following two approaches, but both raise a MemoryError:

df.join(d, how='right') 

I used the following code on the dataframes without the date set as the index:

merge=pd.merge(df,d, how='inner', on='date') 
+2

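Your problem here is that you have duplicate dates, so the combination of the duplicated rows introduces a lot of extra rows, because there is no 1-1 mapping – EdChum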
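
To make the point about duplicate dates concrete, here is a minimal sketch that predicts the size of an inner join on the index before running it. The frames `left` and `right` are hypothetical miniatures, not the asker's data; the idea is only that each shared date contributes (rows on the left) x (rows on the right) to the result.

```python
import pandas as pd

# Hypothetical miniature frames; only the repeated dates matter here.
left = pd.DataFrame(
    {"amount": [1000, 36, 2000]},
    index=pd.to_datetime(["1954-12-31", "2007-12-31", "2007-12-31"]),
)
right = pd.DataFrame(
    {"bills": [1, 1, 2, 3]},
    index=pd.to_datetime(["2007-12-31"] * 4),
)

# Count how often each date occurs on each side.
left_counts = left.index.value_counts()
right_counts = right.index.value_counts()

# A date missing on one side counts as 0, so it contributes no rows.
predicted_rows = int(left_counts.mul(right_counts, fill_value=0).sum())

merged = left.merge(right, left_index=True, right_index=True, how="inner")
print(predicted_rows, len(merged))
```

Running the same count on the real frames before merging gives a rough idea of whether the result can fit in memory.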

+0

Thanks, EdChum! I had not noticed the root of the problem. I decided to drop the date on 'd' and merge on 'catcode'. It works fine! –
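
The exact code for that fix was not posted. A minimal sketch of a merge on the category code could look like the following, assuming the key columns are `catcode_amt` and `catcode_disp` as in the frames shown in the question; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the column names follow the question.
amounts = pd.DataFrame({
    "catcode_amt": ["A5000", "A1000", "A1500"],
    "amount": [1000, 500, 36],
})
dispositions = pd.DataFrame({
    "catcode_disp": ["A1000", "A1500", "A1600"],
    "disposition": ["oppose", "support", "support"],
})

# The key columns have different names, so use left_on / right_on.
by_catcode = pd.merge(
    amounts,
    dispositions,
    how="inner",
    left_on="catcode_amt",
    right_on="catcode_disp",
)
print(by_catcode)
```

The same caveat applies to this key: if a category code repeats on both sides, the matching rows are multiplied just as the dates were.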

Answers

6

You can add the parameters left_index=True and right_index=True if you need to merge on the index with the merge function:

merge=pd.merge(df,d, how='inner', left_index=True, right_index=True) 

Sample (the first value in the index of d is changed so that it matches):

print(df)
      catcode_amt type feccandid_amt amount 
date            
1915-12-31  A5000 24K  H6TX08100 1000 
1916-12-31  T6100 24K  H8CA52052  500 
1954-12-31  H3100 24K  S8AK00090 1000 
1985-12-31  J7120 24E  H8OH18088  36 
1997-12-31  z9600 24K  S6ND00058 2000 

print(d)
      catcode_disp disposition   feccandid_disp bills 
date                 
1997-12-31  A0000  support     S4HI00011 1.0 
2007-12-31  A1000  oppose S4IA00020', 'P20000741 1 NaN 
2007-12-31  A1000  support     S8MT00010 1.0 
2007-12-31  A1500  support     S6WI00061 2.0 
2007-12-31  A1600  support S4IA00020', 'P20000741 3 NaN 

merge=pd.merge(df,d, how='inner', left_index=True, right_index=True) 
print(merge)
      catcode_amt type feccandid_amt amount catcode_disp disposition \ 
date                   
1997-12-31  z9600 24K  S6ND00058 2000  A0000  support 

      feccandid_disp bills 
date        
1997-12-31  S4HI00011 1.0 
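
For anyone who wants to reproduce the sample, here is a self-contained sketch. It rebuilds only two columns of each frame from the printed output above (the remaining columns are left out for brevity), and it also shows `df.join`, which joins on the index by default.

```python
import pandas as pd

# Two columns of each sample frame, rebuilt from the printed output above.
df = pd.DataFrame(
    {
        "catcode_amt": ["A5000", "T6100", "H3100", "J7120", "z9600"],
        "amount": [1000, 500, 1000, 36, 2000],
    },
    index=pd.to_datetime(
        ["1915-12-31", "1916-12-31", "1954-12-31", "1985-12-31", "1997-12-31"]
    ),
)
df.index.name = "date"

d = pd.DataFrame(
    {
        "catcode_disp": ["A0000", "A1000", "A1000", "A1500", "A1600"],
        "disposition": ["support", "oppose", "support", "support", "support"],
    },
    index=pd.to_datetime(["1997-12-31"] + ["2007-12-31"] * 4),
)
d.index.name = "date"

# Inner merge on the index: only dates present in both frames survive.
merged = pd.merge(df, d, how="inner", left_index=True, right_index=True)

# DataFrame.join uses the index of both frames by default.
joined = df.join(d, how="inner")

print(merged)
print(joined)
```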

Or you can use concat:

print(pd.concat([df,d], join='inner', axis=1))
      catcode_amt type feccandid_amt amount catcode_disp disposition \ 
date                   
1997-12-31  z9600 24K  S6ND00058 2000  A0000  support 

      feccandid_disp bills 
date        
1997-12-31  S4HI00011 1.0 
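
A minimal runnable sketch of the concat variant follows. The frames `left` and `right` are hypothetical, and both indexes are deliberately kept unique, so this sketch does not cover how concat treats repeated dates.

```python
import pandas as pd

# Hypothetical frames with unique dates on both sides.
left = pd.DataFrame(
    {"amount": [1000, 2000]},
    index=pd.to_datetime(["1985-12-31", "1997-12-31"]),
)
right = pd.DataFrame(
    {"bills": [1.0, 2.0]},
    index=pd.to_datetime(["1997-12-31", "2007-12-31"]),
)

# axis=1 places the frames side by side; join='inner' keeps shared dates only.
combined = pd.concat([left, right], join="inner", axis=1)
print(combined)
```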

EDIT: EdChum is right:

I added duplicates to the DataFrame df (the last 2 index values):

print(df)
      catcode_amt type feccandid_amt amount 
date            
1915-12-31  A5000 24K  H6TX08100 1000 
1916-12-31  T6100 24K  H8CA52052  500 
1954-12-31  H3100 24K  S8AK00090 1000 
2007-12-31  J7120 24E  H8OH18088  36 
2007-12-31  z9600 24K  S6ND00058 2000 

print(d)
      catcode_disp disposition   feccandid_disp bills 
date                 
1997-12-31  A0000  support     S4HI00011 1.0 
2007-12-31  A1000  oppose S4IA00020', 'P20000741 1 NaN 
2007-12-31  A1000  support     S8MT00010 1.0 
2007-12-31  A1500  support     S6WI00061 2.0 
2007-12-31  A1600  support S4IA00020', 'P20000741 3 NaN 

merge=pd.merge(df,d, how='inner', left_index=True, right_index=True) 
print(merge)
      catcode_amt type feccandid_amt amount catcode_disp disposition \ 
date                   
2007-12-31  J7120 24E  H8OH18088  36  A1000  oppose 
2007-12-31  J7120 24E  H8OH18088  36  A1000  support 
2007-12-31  J7120 24E  H8OH18088  36  A1500  support 
2007-12-31  J7120 24E  H8OH18088  36  A1600  support 
2007-12-31  z9600 24K  S6ND00058 2000  A1000  oppose 
2007-12-31  z9600 24K  S6ND00058 2000  A1000  support 
2007-12-31  z9600 24K  S6ND00058 2000  A1500  support 
2007-12-31  z9600 24K  S6ND00058 2000  A1600  support 

         feccandid_disp bills 
date           
2007-12-31 S4IA00020', 'P20000741 1 NaN 
2007-12-31     S8MT00010 1.0 
2007-12-31     S6WI00061 2.0 
2007-12-31 S4IA00020', 'P20000741 3 NaN 
2007-12-31 S4IA00020', 'P20000741 1 NaN 
2007-12-31     S8MT00010 1.0 
2007-12-31     S6WI00061 2.0 
2007-12-31 S4IA00020', 'P20000741 3 NaN 
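
The output above has 2 x 4 = 8 rows for the single shared date. One way to catch this before it exhausts memory is the `validate` argument of `pd.merge`, which is available in newer pandas releases than the one used in this answer. The frames below are hypothetical and only illustrate the check.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical frames in which the same date repeats on both sides.
left = pd.DataFrame(
    {"amount": [36, 2000]},
    index=pd.to_datetime(["2007-12-31", "2007-12-31"]),
)
right = pd.DataFrame(
    {"bills": [1, 2]},
    index=pd.to_datetime(["2007-12-31", "2007-12-31"]),
)

try:
    # validate="one_to_one" asks pandas to check that keys are unique on both sides.
    pd.merge(
        left,
        right,
        how="inner",
        left_index=True,
        right_index=True,
        validate="one_to_one",
    )
    duplicates_detected = False
except MergeError as err:
    duplicates_detected = True
    print("merge refused:", err)
```
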
+0

@jezrael: I just tried the code you recommended and I still get a MemoryError. Do you have any other suggestions? –

+1

How much RAM do you have? What are the shapes of your DataFrames? 'print(df.shape)' and 'print(d.shape)'? – jezrael

+0

My 'df.shape' is '(389194, 4)' and my 'd.shape' is '(2910, 4)' –

2

It looks like your dates are your index, in which case you want to merge on the index, not on a column. If you have two dataframes, df_1 and df_2:

df_1.merge(df_2, left_index=True, right_index=True, how='inner')

+0

Thanks for the suggestion. I just tried it and I still get a MemoryError. Do you have any other suggestions? –

+0

Try it with two dataframes that are a small subset of the data - say, the last 100 rows of each dataset. – dmb
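
A minimal sketch of that suggestion, using generated stand-in data rather than the asker's frames (the names `small_df` and `small_d` are made up here):

```python
import pandas as pd

# Stand-in data: daily dates, with d covering only the later part of the range.
dates = pd.date_range("2000-12-31", periods=500, freq="D")
df = pd.DataFrame({"amount": range(500)}, index=dates)
d = pd.DataFrame({"bills": range(300)}, index=dates[200:])

# Take the last 100 rows of each frame and merge only those.
small_df = df.tail(100)
small_d = d.tail(100)
small = small_df.merge(small_d, left_index=True, right_index=True, how="inner")
print(small.shape)
```

With real data the last 100 rows of each frame may share no dates at all, in which case the result is simply empty; the point of the test is to see how the row count grows when dates repeat.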