2014-01-26 65 views
4

我想知道如何加快两个数据帧的合并。其中一个数据帧具有时间戳数据点(value列)。基于不规则时间间隔合并熊猫DataFrames

import pandas as pd 
import numpy as np 

data = pd.DataFrame({'time':np.sort(np.random.uniform(0,100,size=50)), 
        'value':np.random.uniform(-1,1,size=50)}) 

另一个具有时间间隔的信息(start_timeend_time,和相关联的interval_id)。

intervals = pd.DataFrame({'interval_id':np.arange(9), 
          'start_time':np.random.uniform(0,5,size=9) + np.arange(0,90,10),  
          'end_time':np.random.uniform(5,10,size=9) + np.arange(0,90,10)}) 

我想这两个dataframes更有效地合并超过以下for循环:

data['interval_id'] = np.nan 
for index, ser in intervals.iterrows(): 
    in_interval = (data['time'] >= ser['start_time']) & \ 
        (data['time'] <= ser['end_time']) 
    data['interval_id'][in_interval] = ser['interval_id'] 

result = data.merge(intervals, how='outer').sort('time').reset_index(drop=True) 

我一直想象我就可以用熊猫time series functionality,如日期范围或TimeGrouper ,但我还没有找出比上述更多的pythonic(pandas-y?)。

结果举例:

 time  value  interval_id start_time end_time 
0 0.575976 0.022727   NaN   NaN  NaN 
1 4.607545 0.222568   0 3.618715 8.294847 
2 5.179350 0.438052   0 3.618715 8.294847 
3 11.069956 0.641269   1 10.301728 19.870283 
4 12.387854 0.344192   1 10.301728 19.870283 
5 18.889691 0.582946   1 10.301728 19.870283 
6 20.850469 -0.027436   NaN   NaN  NaN 
7 23.199618 0.731316   2 21.488868 28.968338 
8 26.631284 0.570647   2 21.488868 28.968338 
9 26.996397 0.597035   2 21.488868 28.968338 
10 28.601867 -0.131712   2 21.488868 28.968338 
11 28.660986 0.710856   2 21.488868 28.968338 
12 28.875395 -0.355208   2 21.488868 28.968338 
13 28.959320 -0.430759   2 21.488868 28.968338 
14 29.702800 -0.554742   NaN   NaN  NaN 

从时间序列,精明的人有什么建议在那里将不胜感激。


更新,杰夫的回答后:

的主要问题是,interval_id没有就任何固定的时间间隔(例如,间隔并不总是约10秒)。一个间隔可能是10秒,接下来可能是2秒,接下来可能是100秒,所以我不能像杰夫提出的那样使用任何常规舍入方案。不幸的是,我上面的最小例子没有说清楚。

+0

是间隔总是脱节? – unutbu

+0

在这种情况下,是的,间隔总是不相交的。找到重叠间隔的解决方案也很有趣,现在你问了! – pedmiston

回答

6

你可以使用np.searchsorted找到表示其中data['time']每个值将intervals['start_time']之间符合指数。然后,您可以再次拨打np.searchsorted以查找代表data['time']中每个值在intervals['end_time']之间的哪个值适合的索引。请注意,使用np.searchsorted依赖于interval['start_time']interval['end_time']排序。

对于阵列中的每个对应位置,在这两个指数相等的情况下,data['time']适合于interval['start_time']interval['end_time']之间。请注意,这依赖于不相交的间隔。

以这种方式使用searchsorted比使用for-loop快约5倍:

import pandas as pd 
import numpy as np 

np.random.seed(1) 
data = pd.DataFrame({'time':np.sort(np.random.uniform(0,100,size=50)), 
        'value':np.random.uniform(-1,1,size=50)}) 

intervals = pd.DataFrame(
    {'interval_id':np.arange(9), 
    'start_time':np.random.uniform(0,5,size=9) + np.arange(0,90,10),  
    'end_time':np.random.uniform(5,10,size=9) + np.arange(0,90,10)}) 

def using_loop(): 
    data['interval_id'] = np.nan 
    for index, ser in intervals.iterrows(): 
     in_interval = (data['time'] >= ser['start_time']) & \ 
         (data['time'] <= ser['end_time']) 
     data['interval_id'][in_interval] = ser['interval_id'] 

    result = data.merge(intervals, how='outer').sort('time').reset_index(drop=True) 
    return result 

def using_searchsorted(): 
    start_idx = np.searchsorted(intervals['start_time'].values, data['time'].values)-1 
    end_idx = np.searchsorted(intervals['end_time'].values, data['time'].values) 
    mask = (start_idx == end_idx) 
    result = data.copy() 
    result['interval_id'] = result['start_time'] = result['end_time'] = np.nan 
    result['interval_id'][mask] = start_idx 
    result.ix[mask, 'start_time'] = intervals['start_time'][start_idx[mask]].values 
    result.ix[mask, 'end_time'] = intervals['end_time'][end_idx[mask]].values 
    return result 

In [254]: %timeit using_loop() 
100 loops, best of 3: 7.74 ms per loop 

In [255]: %timeit using_searchsorted() 
1000 loops, best of 3: 1.56 ms per loop 

In [256]: 7.74/1.56 
Out[256]: 4.961538461538462 
1

您可能希望指定的时间间隔略有不同,但应该给你一个开始。

In [34]: data['on'] = np.round(data['time']/10) 

In [35]: data.merge(intervals,left_on=['on'],right_on=['interval_id'],how='outer') 
Out[35]: 
     time  value on end_time interval_id start_time 
0 1.301658 -0.462594 0 7.630243   0 0.220746 
1 2.202654 0.054903 0 7.630243   0 0.220746 
2 10.253593 0.329947 1 17.715596   1 10.299464 
3 13.803064 -0.601021 1 17.715596   1 10.299464 
4 17.086290 0.484119 2 27.175455   2 24.710704 
5 21.797655 0.988212 2 27.175455   2 24.710704 
6 26.265165 0.491410 3 37.702968   3 30.670753 
7 27.777182 -0.121691 3 37.702968   3 30.670753 
8 34.066473 0.659260 3 37.702968   3 30.670753 
9 34.786337 -0.230026 3 37.702968   3 30.670753 
10 35.343021 0.364505 4 49.489028   4 42.948486 
11 35.506895 0.953562 4 49.489028   4 42.948486 
12 36.129951 -0.703457 4 49.489028   4 42.948486 
13 38.794690 -0.510535 4 49.489028   4 42.948486 
14 40.508702 -0.763417 4 49.489028   4 42.948486 
15 43.974516 -0.149487 4 49.489028   4 42.948486 
16 46.219554 0.893025 5 57.086065   5 53.124795 
17 50.206860 0.729106 5 57.086065   5 53.124795 
18 50.395082 -0.807557 5 57.086065   5 53.124795 
19 50.410783 0.996247 5 57.086065   5 53.124795 
20 51.602892 0.144483 5 57.086065   5 53.124795 
21 52.006921 -0.979778 5 57.086065   5 53.124795 
22 52.682896 -0.593500 5 57.086065   5 53.124795 
23 52.836037 0.448370 5 57.086065   5 53.124795 
24 53.052130 -0.227245 5 57.086065   5 53.124795 
25 57.169775 0.659673 6 65.927106   6 61.590948 
26 59.336176 -0.893004 6 65.927106   6 61.590948 
27 60.297771 0.897418 6 65.927106   6 61.590948 
28 61.151664 0.176229 6 65.927106   6 61.590948 
29 61.769023 0.894644 6 65.927106   6 61.590948 
30 64.221220 0.893012 6 65.927106   6 61.590948 
31 67.907417 -0.859734 7 78.192671   7 72.463468 
32 71.460483 -0.271364 7 78.192671   7 72.463468 
33 74.514028 0.621174 7 78.192671   7 72.463468 
34 75.822643 -0.351684 8 88.820139   8 83.183825 
35 84.252778 -0.685043 8 88.820139   8 83.183825 
36 84.838361 0.354365 8 88.820139   8 83.183825 
37 85.770611 -0.089678 9  NaN   NaN   NaN 
38 85.957559 0.649995 9  NaN   NaN   NaN 
39 86.498339 0.569793 9  NaN   NaN   NaN 
40 91.006735 0.731006 9  NaN   NaN   NaN 
41 91.941862 0.964376 9  NaN   NaN   NaN 
42 94.617522 0.626889 9  NaN   NaN   NaN 
43 95.318288 -0.088918 10  NaN   NaN   NaN 
44 95.595243 0.539685 10  NaN   NaN   NaN 
45 95.818267 -0.989647 10  NaN   NaN   NaN 
46 98.240444 0.931445 10  NaN   NaN   NaN 
47 98.722869 0.442502 10  NaN   NaN   NaN 
48 99.349198 0.585264 10  NaN   NaN   NaN 
49 99.829372 -0.743697 10  NaN   NaN   NaN 

[50 rows x 6 columns] 
+0

谢谢杰夫!不幸的是,我认为这并不适用于所有情况。例如,如果我有一个时间= 0.19的事件,但第一个时间间隔直到3.21才开始,则向下舍入将该事件放入第一个时间间隔。有一些解决方法 - 我猜你的意思是指定时间略有不同 - 但我认为有更深的问题,间隔时间不规则。在解决方案后,我改进了我的问题,所以谢谢(对不起更清楚的抱歉)。 – pedmiston