2015-10-22 51 views
3

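Pandas: merging DataFrames on the nearest time

I have two DataFrames (logs and failures) that I would like to merge, so that logs gains a column containing the closest date value found in failures.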

The code to generate logs and failures, along with the desired output, is below:

import pandas as pd

# Timestamps are day-first (dd/mm/yyyy), so parse them with dayfirst=True.
logs = pd.DataFrame({
    'date-time': ['23/10/2015 10:20:54', '22/10/2015 09:51:32', '21/10/2015 06:51:32',
                  '28/10/2015 16:59:32', '25/10/2015 04:41:32', '24/10/2015 11:50:11'],
    'var1': [0, 1, 3, 1, 2, 4]})
logs['date-time'] = pd.to_datetime(logs['date-time'], dayfirst=True)

failures = pd.DataFrame({
    'date': ['23/10/2015 00:00:00', '22/10/2015 00:00:00', '21/10/2015 00:00:00'],
    'failure': [1, 1, 1]})
failures['date'] = pd.to_datetime(failures['date'], dayfirst=True)

# Desired result: logs plus the closest failure date for each row.
output = logs.copy()
output['closest_failure'] = pd.to_datetime(
    ['23/10/2015', '22/10/2015', '21/10/2015', '23/10/2015', '23/10/2015', '23/10/2015'],
    dayfirst=True)

Any ideas? The real dataset is very large, so efficiency also matters.

Answers

3

You can use reindex with method="nearest". There may be a neater way, but building a Series that has the failure dates as both its index and its values works:

In [11]: failures_dt = pd.Series(failures["date"].values, failures["date"]) 

In [12]: failures_dt.reindex(logs["date-time"], method="nearest") 
Out[12]: 
date-time 
2015-10-23 10:20:54 2015-10-23 
2015-10-22 09:51:32 2015-10-22 
2015-10-21 06:51:32 2015-10-21 
2015-10-28 16:59:32 2015-10-23 
2015-10-25 04:41:32 2015-10-23 
2015-10-24 11:50:11 2015-10-23 
dtype: datetime64[ns] 

In [13]: logs["nearest"] = failures_dt.reindex(logs["date-time"], method="nearest").values 

In [14]: logs 
Out[14]: 
            date-time  var1    nearest
0 2015-10-23 10:20:54     0 2015-10-23
1 2015-10-22 09:51:32     1 2015-10-22
2 2015-10-21 06:51:32     3 2015-10-21
3 2015-10-28 16:59:32     1 2015-10-23
4 2015-10-25 04:41:32     2 2015-10-23
5 2015-10-24 11:50:11     4 2015-10-23
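
For anyone who wants to run the reindex approach outside an IPython session, here is a minimal self-contained sketch. The helper name add_nearest_failure and the column name closest_failure are made up for illustration, and the sketch assumes that dropping duplicate failure dates and sorting them is acceptable, since reindex with a fill method needs a unique, monotonic index.

```python
import pandas as pd

logs = pd.DataFrame({
    "date-time": pd.to_datetime(
        ["23/10/2015 10:20:54", "22/10/2015 09:51:32", "21/10/2015 06:51:32",
         "28/10/2015 16:59:32", "25/10/2015 04:41:32", "24/10/2015 11:50:11"],
        dayfirst=True),
    "var1": [0, 1, 3, 1, 2, 4],
})
failures = pd.DataFrame({
    "date": pd.to_datetime(
        ["23/10/2015 00:00:00", "22/10/2015 00:00:00", "21/10/2015 00:00:00"],
        dayfirst=True),
    "failure": [1, 1, 1],
})


def add_nearest_failure(logs, failures):
    """Hypothetical helper: return a copy of logs with a closest_failure column."""
    # reindex(method="nearest") needs a unique, monotonic index (assumption:
    # duplicate failure dates carry no extra information here).
    dates = failures["date"].drop_duplicates().sort_values()
    lookup = pd.Series(dates.to_numpy(), index=pd.DatetimeIndex(dates))

    out = logs.copy()
    # Look up every log timestamp; each one maps to the nearest failure date.
    nearest = lookup.reindex(pd.DatetimeIndex(out["date-time"]), method="nearest")
    out["closest_failure"] = nearest.to_numpy()
    return out


result = add_nearest_failure(logs, failures)
print(result)
```

The input logs frame is left untouched and keeps its original row order, because the lookup is driven by the log timestamps rather than by a sort.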
1

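With pandas >= 0.19.0 you can now use pandas.merge_asof, which gets close to what you want. In 0.19 you can only get the most recent failure at or before each log timestamp. With 0.20, however, you can get the nearest one in either direction.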

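From the documentation: Perform an asof merge. This is similar to a left join, except that we match on the nearest key rather than on equal keys.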

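For each row in the left DataFrame, we select the last row in the right DataFrame whose 'on' key is less than or equal to the left's key. Both DataFrames must be sorted by the key.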
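
The sorting requirement is easy to trip over, so here is a small sketch of what happens when it is not met. The two tiny frames are made-up sample data, not the question's full dataset; the point is only that merge_asof raises a ValueError for an unsorted left key and works once the frame is sorted.

```python
import pandas as pd

# Made-up sample data: the log rows are deliberately out of order.
logs = pd.DataFrame({
    "date-time": pd.to_datetime(["2015-10-23 10:20:54", "2015-10-21 06:51:32"]),
    "var1": [0, 3],
})
failures = pd.DataFrame({
    "date": pd.to_datetime(["2015-10-21 00:00:00", "2015-10-23 00:00:00"]),
    "failure": [1, 1],
})

try:
    pd.merge_asof(logs, failures, left_on="date-time", right_on="date")
    raised = False
except ValueError as exc:
    # pandas refuses to merge because the left key is not sorted.
    raised = True
    print("merge_asof failed:", exc)

# Sorting the left frame by its key first makes the same call succeed.
merged = pd.merge_asof(
    logs.sort_values("date-time"), failures,
    left_on="date-time", right_on="date",
)
print(merged)
```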

In [3]: failures.sort_values("date", inplace=True) 

In [6]: logs2 = pd.DataFrame({
   ...:     'date-time': pd.Series(['23/10/2015 10:20:54', '22/10/2015 09:51:32', '21/10/2015 06:51:32',
   ...:                             '28/10/2015 16:59:32', '25/10/2015 04:41:32', '24/10/2015 11:50:11',
   ...:                             '20/10/2015 01:02:03']),
   ...:     'var1': pd.Series([0, 1, 3, 1, 2, 4, 99])})

In [7]: logs2['date-time'] = pd.to_datetime(logs2['date-time'], dayfirst=True)

In [8]: logs2.sort_values("date-time", inplace=True) 

In [9]: logs2 
Out[9]: 
            date-time  var1
6 2015-10-20 01:02:03    99
2 2015-10-21 06:51:32     3
1 2015-10-22 09:51:32     1
0 2015-10-23 10:20:54     0
5 2015-10-24 11:50:11     4
4 2015-10-25 04:41:32     2
3 2015-10-28 16:59:32     1

In [10]: pd.merge_asof(logs2, failures, left_on="date-time", right_on="date") 
Out[10]: 
            date-time  var1       date  failure
0 2015-10-20 01:02:03    99        NaT      NaN
1 2015-10-21 06:51:32     3 2015-10-21      1.0
2 2015-10-22 09:51:32     1 2015-10-22      1.0
3 2015-10-23 10:20:54     0 2015-10-23      1.0
4 2015-10-24 11:50:11     4 2015-10-23      1.0
5 2015-10-25 04:41:32     2 2015-10-23      1.0
6 2015-10-28 16:59:32     1 2015-10-23      1.0

In [11]: pd.merge_asof(logs2, failures, left_on="date-time", right_on="date", direction="nearest") 
Out[11]: 
            date-time  var1       date  failure
0 2015-10-20 01:02:03    99 2015-10-21        1
1 2015-10-21 06:51:32     3 2015-10-21        1
2 2015-10-22 09:51:32     1 2015-10-22        1
3 2015-10-23 10:20:54     0 2015-10-23        1
4 2015-10-24 11:50:11     4 2015-10-23        1
5 2015-10-25 04:41:32     2 2015-10-23        1
6 2015-10-28 16:59:32     1 2015-10-23        1
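
One side effect of merge_asof is that the result comes back in sorted key order with a fresh index. If the original row order of logs matters, one possible way to keep it is sketched below. The helper name merge_nearest_keep_order and the temporary column _row are made up for illustration, and direction="nearest" assumes pandas 0.20 or later.

```python
import pandas as pd

logs = pd.DataFrame({
    "date-time": pd.to_datetime(
        ["23/10/2015 10:20:54", "22/10/2015 09:51:32", "21/10/2015 06:51:32",
         "28/10/2015 16:59:32", "25/10/2015 04:41:32", "24/10/2015 11:50:11"],
        dayfirst=True),
    "var1": [0, 1, 3, 1, 2, 4],
})
failures = pd.DataFrame({
    "date": pd.to_datetime(
        ["23/10/2015 00:00:00", "22/10/2015 00:00:00", "21/10/2015 00:00:00"],
        dayfirst=True),
    "failure": [1, 1, 1],
})


def merge_nearest_keep_order(logs, failures):
    """Hypothetical helper: nearest-time merge that preserves the row order of logs."""
    # Remember each row's original position in a temporary column (assumes
    # logs has no column called _row), then sort both sides by their keys.
    left = logs.assign(_row=list(range(len(logs)))).sort_values("date-time")
    right = failures.sort_values("date")

    merged = pd.merge_asof(
        left, right,
        left_on="date-time", right_on="date",
        direction="nearest",  # available from pandas 0.20
    )

    # Put the rows back in their original order and restore the original index.
    merged = merged.sort_values("_row").drop(columns="_row")
    merged.index = logs.index
    return merged


result = merge_nearest_keep_order(logs, failures)
print(result)
```

Since merge_asof works on already sorted keys, most of the extra cost here is the two sorts, which is usually acceptable even for large frames.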