2016-02-29 66 views
2

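Find the time difference between two columns in a DataFrame

I am trying to find the time difference between two columns of the following DataFrame:

Test Date | Test Type | First Use Date


I defined the following function to compute the difference: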

from datetime import datetime

def days_between(d1, d2):
    # d1 and d2 are date strings such as "2014-04-08"
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)

It works fine, but it does not accept a Series as input, so I had to build a for loop that iterates over the index:

age_veh = [] 
for i in range(0, len(data_manufacturer)-1): 
    age_veh[i].append(days_between(data_manufacturer.iloc[i,0], data_manufacturer.iloc[i,4])) 

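However, it raises an error: IndexError: list index out of range

I don't know whether this is the right way to do it. Any pointers on what I'm doing wrong, or alternative solutions, would be greatly appreciated. Please keep in mind that I have about 2 million rows.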
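For reference, the IndexError comes from `age_veh[i].append(...)`: `age_veh` starts out empty, so `age_veh[0]` does not exist yet. Separately, `range(0, len(data_manufacturer)-1)` stops one row early. Below is a minimal sketch of the corrected loop, assuming the dates are `%Y-%m-%d` strings in columns 0 and 4 as in the question. The small `data_manufacturer` frame and its filler columns are made up for illustration.

```python
from datetime import datetime

import pandas as pd


def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)


# Hypothetical stand-in for the real data: the dates sit in columns 0 and 4,
# and the columns in between are filler.
data_manufacturer = pd.DataFrame({
    "Test Date": ["2011-02-05", "2012-02-05"],
    "Test Type": ["A", "A"],
    "filler_1": [0, 0],
    "filler_2": [0, 0],
    "First Use Date": ["2010-01-05", "2010-03-05"],
})

age_veh = []
# range(len(...)) covers every row; append to the list itself, not to age_veh[i]
for i in range(len(data_manufacturer)):
    age_veh.append(days_between(data_manufacturer.iloc[i, 0],
                                data_manufacturer.iloc[i, 4]))

print(age_veh)
```

This still runs a Python-level loop once per row, so it is likely to be slow on 2 million rows.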

+2

Why don't you just convert the columns to datetime and then subtract them? `df['Test Date'] = pd.to_datetime(df['Test Date'])` and so on, then `df['Test Date'] - df['First Use Date']` will return a timedelta – EdChum

+0

That should do it, thanks! –

Answers

0

IIUC you can first convert the columns with to_datetime, take the difference, apply abs, and then convert the timedelta to days:

print(df)
  id  value       date1       date2  sum
0  A    150  2014-04-08  2014-03-08  NaN
1  B    100  2014-05-08  2014-02-08  NaN
2  B    200  2014-01-08  2014-07-08  100
3  A    200  2014-04-08  2014-03-08  NaN
4  A    300  2014-06-08  2014-04-08  350

df['date1'] = pd.to_datetime(df['date1']) 
df['date2'] = pd.to_datetime(df['date2']) 

df['diff'] = (df['date1'] - df['date2']).abs()/np.timedelta64(1, 'D') 
print(df)
  id  value       date1       date2  sum  diff
0  A    150  2014-04-08  2014-03-08  NaN    31
1  B    100  2014-05-08  2014-02-08  NaN    89
2  B    200  2014-01-08  2014-07-08  100   181
3  A    200  2014-04-08  2014-03-08  NaN    31
4  A    300  2014-06-08  2014-04-08  350    61

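Edit

I think that on larger DataFrames it is better to convert to days by dividing by np.timedelta64(1, 'D') rather than using dt.days, because it is faster:

I used EdChum's sample, only with len(df) = 4k: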

import io 
import pandas as pd 
import numpy as np 

t=u"""Test Date,Test Type,First Use Date 
2011-02-05,A,2010-01-05 
2012-02-05,A,2010-03-05 
2013-02-05,A,2010-06-05 
2014-02-05,A,2010-08-05""" 

df = pd.read_csv(io.StringIO(t)) 

df = pd.concat([df]*1000).reset_index(drop=True) 

df['Test Date'] = pd.to_datetime(df['Test Date']) 
df['First Use Date'] = pd.to_datetime(df['First Use Date']) 

print((df['Test Date'] - df['First Use Date']).abs().dt.days)

print((df['Test Date'] - df['First Use Date']).abs() / np.timedelta64(1, 'D'))

Timings

In [174]: %timeit (df['Test Date'] - df['First Use Date']).abs().dt.days 
10 loops, best of 3: 38.8 ms per loop 

In [175]: %timeit (df['Test Date'] - df['First Use Date']).abs()/np.timedelta64(1, 'D') 
1000 loops, best of 3: 1.62 ms per loop 
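
One thing to keep in mind when choosing between the two: they only give the same numbers when the values are pure dates. `.dt.days` returns the whole-day component as integers, while dividing by `np.timedelta64(1, 'D')` returns floats and keeps any fractional day. A small sketch with made-up timestamps that include a time of day:

```python
import numpy as np
import pandas as pd

# Made-up timestamps with a time-of-day part, for illustration only
start = pd.Series(pd.to_datetime(["2014-03-08 00:00", "2014-02-08 06:00"]))
end = pd.Series(pd.to_datetime(["2014-04-08 12:00", "2014-05-08 00:00"]))

delta = (end - start).abs()

whole_days = delta.dt.days                   # integer day component only
float_days = delta / np.timedelta64(1, "D")  # float, keeps fractional days

print(whole_days.tolist())
print(float_days.tolist())
```

If the columns only ever contain dates, as in the question, the two approaches should agree up to the int versus float dtype.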
2

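Convert the columns with to_datetime; you can then subtract them to produce a timedelta column, call abs on it, and then call dt.days to get the total number of days, for example: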

In [119]: 
import io 
import pandas as pd 
t="""Test Date,Test Type,First Use Date 
2011-02-05,A,2010-01-05 
2012-02-05,A,2010-03-05 
2013-02-05,A,2010-06-05 
2014-02-05,A,2010-08-05""" 
df = pd.read_csv(io.StringIO(t)) 
df 
Out[119]: 
    Test Date Test Type First Use Date
0  2011-02-05         A     2010-01-05
1  2012-02-05         A     2010-03-05
2  2013-02-05         A     2010-06-05
3  2014-02-05         A     2010-08-05

In [121]:  
df['Test Date'] = pd.to_datetime(df['Test Date']) 
df['First Use Date'] = pd.to_datetime(df['First Use Date']) 
df.info() 

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 4 entries, 0 to 3 
Data columns (total 3 columns): 
Test Date   4 non-null datetime64[ns] 
Test Type   4 non-null object 
First Use Date 4 non-null datetime64[ns] 
dtypes: datetime64[ns](2), object(1) 
memory usage: 128.0+ bytes 

In [122]: 
df['days'] = (df['Test Date'] - df['First Use Date']).abs().dt.days 
df 

Out[122]: 
    Test Date Test Type First Use Date  days
0  2011-02-05         A     2010-01-05   396
1  2012-02-05         A     2010-03-05   702
2  2013-02-05         A     2010-06-05   976
3  2014-02-05         A     2010-08-05  1280
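
With around 2 million rows, it may also help to pass an explicit `format` to `to_datetime` so pandas does not have to infer it. Here is a rough sketch that wraps the steps above into one helper. The name `abs_days_between` is my own, not a pandas function, and it assumes both columns hold `%Y-%m-%d` strings.

```python
import io

import pandas as pd


def abs_days_between(df, col_a, col_b, fmt="%Y-%m-%d"):
    """Return the absolute difference in whole days between two date columns."""
    a = pd.to_datetime(df[col_a], format=fmt)
    b = pd.to_datetime(df[col_b], format=fmt)
    # Vectorized subtraction gives a timedelta Series; .dt.days extracts days
    return (a - b).abs().dt.days


t = """Test Date,Test Type,First Use Date
2011-02-05,A,2010-01-05
2012-02-05,A,2010-03-05
2013-02-05,A,2010-06-05
2014-02-05,A,2010-08-05"""
df = pd.read_csv(io.StringIO(t))

df["days"] = abs_days_between(df, "Test Date", "First Use Date")
print(df["days"].tolist())
```

The helper leaves the original string columns untouched and only adds the result, which is handy if the raw values are needed later.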