2016-02-02 60 views
3

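Manipulating date fields in pandas

What is the fastest way to manipulate a date field in a pandas DataFrame, for example replacing the day of each date with the last day of its month? Currently I do the following, but it takes a long time to run.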

import calendar 
consumption_data_monthly.DATE = consumption_data_monthly.DATE.apply(lambda x: x.replace(day=calendar.monthrange(x.year,x.month)[1])) 
+1

IIUC, you can use 'consumption_data_monthly.DATE.dt.days_in_month', which should return the number of days in the month in question. –
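
A minimal sketch of how the suggestion in this comment could be turned into a vectorized update, assuming the column already has a datetime64 dtype. The sample values and the names `dates`, `days_left` and `month_end` are illustrative and do not come from the question.

```python
import pandas as pd

# Illustrative sample; in the question this would be consumption_data_monthly.DATE.
dates = pd.Series(pd.to_datetime(['2012-01-05', '2012-02-08', '2012-04-30']))

# Number of days between each date and the end of its month.
days_left = dates.dt.days_in_month - dates.dt.day

# Shift every date forward by that many days in a single vectorized operation.
month_end = dates + pd.to_timedelta(days_left, unit='D')

print(month_end)
```

Unlike an element-wise `apply`, this works on the whole column at once, and a date that is already the last day of its month is left unchanged.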

Answers

2

I think calendar.monthrange is quite efficient and fast, but a vectorized approach is faster still.

You can try converting the DATE column to a numpy array with month precision via values and astype, then adding one month and subtracting one day:

# Parentheses let the expression span several lines.
df['DATE'] = (df['DATE'].values.astype('datetime64[M]')
              + np.array([1], dtype='timedelta64[M]')
              - np.array([1], dtype='timedelta64[D]'))
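
To make the three steps of that one-liner easier to follow, here is a small sketch that keeps each intermediate result. The input dates and the variable names are illustrative assumptions, not part of the answer.

```python
import numpy as np

# Illustrative dates stored at day precision.
dates = np.array(['2012-01-05', '2012-02-08', '2012-12-20'], dtype='datetime64[D]')

# Step 1: casting to month precision truncates each date to its month.
months = dates.astype('datetime64[M]')

# Step 2: adding one month moves to the following month.
next_months = months + np.timedelta64(1, 'M')

# Step 3: subtracting one day switches back to day precision and
# lands on the last day of the original month.
month_ends = next_months - np.timedelta64(1, 'D')

print(months)
print(month_ends)
```

Because the work happens on whole arrays, no Python-level function is called per row, which is where the speed-up over `apply` comes from.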

Timings for len(df)=70000:

In [468]: %timeit one(df) 
1 loops, best of 3: 881 ms per loop 

In [469]: %timeit two(df1) 
1 loops, best of 3: 733 ms per loop 

In [470]: %timeit three(df2) 
1 loops, best of 3: 1.24 s per loop 

In [471]: %timeit four(df3) 
100 loops, best of 3: 6.61 ms per loop 

In [472]: %timeit five(df4) 
100 loops, best of 3: 8.76 ms per loop 

Code:

import pandas as pd 
import numpy as np 
import calendar 
import datetime 
from pandas.tseries.offsets import * 

d = {'DATE': {0: pd.Timestamp('2012-01-05 00:00:00'), 1: pd.Timestamp('2012-02-08 00:00:00'), 2: pd.Timestamp('2012-03-11 00:00:00'), 3: pd.Timestamp('2012-04-06 00:00:00'), 4: pd.Timestamp('2012-05-04 00:00:00'), 5: pd.Timestamp('2012-06-20 00:00:00'), 6: pd.Timestamp('2012-07-09 00:00:00')}} 
df = pd.DataFrame(d) 
print(df)

df = pd.concat([df]*10000).reset_index(drop=True) 
df1 = df.copy() 
df2 = df.copy() 
df3 = df.copy() 
df4 = df.copy() 

def one(df): 
    df.DATE = df.DATE.apply(lambda x: x.replace(day=calendar.monthrange(x.year,x.month)[1])) 
    return df 

def two(df):  
    df['DATE'] = df['DATE'].map(lambda x: datetime.datetime(x.year, x.month, calendar.monthrange(x.year,x.month)[1])) 
    return df 

def three(df):  
    df['DATE'] = df['DATE'].map(lambda x: datetime.datetime(x.year, x.month, x.days_in_month)) 
    return df 

def four(df): 
    df['DATE'] = df['DATE'].values.astype('datetime64[M]') + np.array([1], dtype='timedelta64[M]') - np.array([1], dtype='timedelta64[D]') 
    return df 

def five(df):  
    df['DATE'] = df['DATE'] + MonthEnd() 
    return df 

# Each function gets its own copy, matching the timing calls.
print(one(df).head())
print(two(df1).head())
print(three(df2).head())
print(four(df3).head())
print(five(df4).head())

Timings for len(df)=7:

In [475]: %timeit one(df) 
The slowest run took 11.16 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 379 µs per loop 

In [476]: %timeit two(df1) 
The slowest run took 11.93 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 336 µs per loop 

In [477]: %timeit three(df2) 
1000 loops, best of 3: 398 µs per loop 

In [478]: %timeit four(df3) 
The slowest run took 19.07 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 159 µs per loop 

In [479]: %timeit five(df4) 
The slowest run took 4.89 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 1.26 ms per loop 
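
The functions above overwrite the DATE column in place, so every loop after the first works on dates that have already been converted. As a hedged alternative, the sketch below times the two fastest variants on fresh copies with the standard `timeit` module. The function names `month_end_numpy` and `month_end_offset`, the three sample dates and the repeat counts are assumptions made for illustration; the copy adds a small constant cost to both measurements.

```python
import timeit

import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd


def month_end_numpy(df):
    # Same arithmetic as four(), but on a copy so the input stays untouched.
    out = df.copy()
    out['DATE'] = (out['DATE'].values.astype('datetime64[M]')
                   + np.timedelta64(1, 'M')
                   - np.timedelta64(1, 'D'))
    return out


def month_end_offset(df):
    # Same operation as five(), again on a copy.
    out = df.copy()
    out['DATE'] = out['DATE'] + MonthEnd()
    return out


base = pd.DataFrame({'DATE': pd.to_datetime(['2012-01-05', '2012-02-08', '2012-03-11'])})
big = pd.concat([base] * 10000).reset_index(drop=True)

for func in (month_end_numpy, month_end_offset):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(big), number=10, repeat=3)) / 10
    print('%s: %.2f ms per call' % (func.__name__, best * 1e3))
```

Absolute numbers depend on the machine and on the pandas and numpy versions, so only the relative ordering is worth comparing with the figures above.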
1

Use a DateOffset to add a month end to your dates:

In [25]: 
from pandas.tseries.offsets import * 
df['DATE'] + MonthEnd() 

Out[25]: 
0 2012-01-31 
1 2012-02-29 
2 2012-03-31 
3 2012-04-30 
4 2012-05-31 
5 2012-06-30 
6 2012-07-31 
Name: DATE, dtype: datetime64[ns] 
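
One detail worth checking before relying on this: with the default `MonthEnd()` (n=1), a date that already falls on a month end is moved to the end of the following month, whereas `MonthEnd(0)` leaves such a date where it is. The sketch below illustrates the difference; the sample dates and variable names are assumptions, not taken from the question's data.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Illustrative dates: one mid-month, one already on a month end.
dates = pd.Series(pd.to_datetime(['2012-01-05', '2012-01-31']))

# Default n=1: the month-end date rolls forward to the next month end.
rolled_n1 = dates + MonthEnd()

# n=0: dates already on a month end are left unchanged.
rolled_n0 = dates + MonthEnd(0)

print(rolled_n1)
print(rolled_n0)
```

If the column can contain dates that are already the last day of a month, `MonthEnd(0)` is probably the variant that matches the behaviour of the `calendar.monthrange` approach in the question.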

Timings:

In [26]: 
def four(df): 
    df['DATE'] = df['DATE'].values.astype('datetime64[M]') + np.array([1], dtype='timedelta64[M]') - np.array([1], dtype='timedelta64[D]') 
    return df 
%timeit four(df) 
%timeit df['DATE'] = MonthEnd() 
1000 loops, best of 3: 206 µs per loop 
The slowest run took 272.78 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 139 µs per loop 

You can see that using the offset is faster than the suggested solution.

On a 70K-row df the timings are:

100 loops, best of 3: 5.69 ms per loop 
100 loops, best of 3: 8 ms per loop 

So the other solution is faster for larger dfs, while the syntax here is cleaner.

+1

I think it is faster on small DataFrames, but not on larger ones. – jezrael

+0

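@jezrael I have added new timings. Yes, your solution is faster; in general pure np will always beat pandas. The trade-off is somewhat better 'NaN' handling and cleaner syntax. – EdChum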
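
One possible reading of the 'NaN' handling point, shown as a small sketch: when the column contains a missing date, the offset simply leaves it as NaT. The sample data and names are illustrative assumptions, and `MonthEnd(0)` is used here so that dates already on a month end stay put.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Illustrative column with one missing date.
dates = pd.Series([pd.Timestamp('2012-02-08'), pd.NaT])

# The missing value stays NaT; the valid date moves to its month end.
result = dates + MonthEnd(0)

print(result)
```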