Timestamp aggregation in pandas

I want to find the maximum bid-ask spread per second. Suppose I have this quotes file:

In [1]: !head quotes.txt 
exchtime|bid|ask 
1389178814.587758|520.0000|541.0000 
1389178830.462050|540.4300|540.8700 
1389178830.462050|540.4300|540.8700 
1389178830.468602|540.4300|540.8600 
1389178830.468602|540.4300|540.8600 
1389178847.67500|540.4300|540.8500 
1389178847.67500|540.4300|540.8500 
1389178847.73541|540.4300|540.8400 
1389178847.73541|540.4300|540.8400 

The timestamps are just seconds since the UTC epoch. With a little trickery on the first column, I can read the file like this:

import pandas as pd 
import numpy as np 
from datetime import datetime 

# parse epoch seconds into a datetime64 (fromtimestamp returns local time)
def convert(x): return np.datetime64(datetime.fromtimestamp(float(x)).isoformat())

df = pd.read_csv('quotes.txt', sep='|', parse_dates=True, converters={0:convert}) 

And this produces what I want:

In [10]: df.head() 
Out[10]: 
        exchtime  bid  ask 
0 2014-01-08 11:00:14.587758 520.00 541.00 
1 2014-01-08 11:00:30.462050 540.43 540.87 
2 2014-01-08 11:00:30.462050 540.43 540.87 
3 2014-01-08 11:00:30.468602 540.43 540.86 
4 2014-01-08 11:00:30.468602 540.43 540.86 

I'm stumped on the aggregation. In q/KDB+, I would just do:

select spread:max ask-bid by exchtime.second from df 

What I've come up with in pandas is:

df['spread'] = df.ask - df.bid 
df['exchtime_sec'] = [e.replace(microsecond=0) for e in df.exchtime] 
df.groupby('exchtime_sec')['spread'].agg(np.max) 

This seems to work, but the exchtime_sec line takes roughly three orders of magnitude longer to run than I expected! Is there a faster (and more concise) way to express this aggregation?
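As an aside, on newer pandas versions the per-row replace() loop above can be avoided with the vectorized .dt.floor accessor; a minimal sketch, assuming pandas 0.18 or later and a datetime64 exchtime column as in the output above:

df['spread'] = df.ask - df.bid
# truncate each timestamp to the whole second without a Python-level loop
per_second = df.groupby(df['exchtime'].dt.floor('s'))['spread'].max()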

Answer


Read it in like this, without the converter, and convert the times afterwards:

In [11]: df = read_csv(StringIO(data),sep='|') 

This is much faster:

In [12]: df['exchtime'] = pd.to_datetime(df['exchtime'],unit='s') 

In [13]: df 
Out[13]: 
        exchtime  bid  ask 
0 2014-01-08 11:00:14.587758 520.00 541.00 
1 2014-01-08 11:00:30.462050 540.43 540.87 
2 2014-01-08 11:00:30.462050 540.43 540.87 
3 2014-01-08 11:00:30.468602 540.43 540.86 
4 2014-01-08 11:00:30.468602 540.43 540.86 
5 2014-01-08 11:00:47.675000 540.43 540.85 
6 2014-01-08 11:00:47.675000 540.43 540.85 
7 2014-01-08 11:00:47.735410 540.43 540.84 
8 2014-01-08 11:00:47.735410 540.43 540.84 

[9 rows x 3 columns] 
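Putting those two steps together against the original quotes.txt (the StringIO above just holds the sample data pasted inline), a minimal sketch:

import pandas as pd

# read the pipe-delimited file, then convert epoch seconds in one vectorized call
df = pd.read_csv('quotes.txt', sep='|')
df['exchtime'] = pd.to_datetime(df['exchtime'], unit='s')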

Create the spread column:

In [15]: df['spread'] = df.ask-df.bid 

Set the index to exchtime, resample at 1-second intervals, and take the max as the aggregator:

In [16]: df.set_index('exchtime').resample('1s',how=np.max) 
Out[16]: 
         bid  ask spread 
exchtime         
2014-01-08 11:00:14 520.00 541.00 21.00 
2014-01-08 11:00:15  NaN  NaN  NaN 
2014-01-08 11:00:16  NaN  NaN  NaN 
2014-01-08 11:00:17  NaN  NaN  NaN 
2014-01-08 11:00:18  NaN  NaN  NaN 
2014-01-08 11:00:19  NaN  NaN  NaN 
2014-01-08 11:00:20  NaN  NaN  NaN 
2014-01-08 11:00:21  NaN  NaN  NaN 
2014-01-08 11:00:22  NaN  NaN  NaN 
2014-01-08 11:00:23  NaN  NaN  NaN 
2014-01-08 11:00:24  NaN  NaN  NaN 
2014-01-08 11:00:25  NaN  NaN  NaN 
2014-01-08 11:00:26  NaN  NaN  NaN 
2014-01-08 11:00:27  NaN  NaN  NaN 
2014-01-08 11:00:28  NaN  NaN  NaN 
2014-01-08 11:00:29  NaN  NaN  NaN 
2014-01-08 11:00:30 540.43 540.87 0.44 
2014-01-08 11:00:31  NaN  NaN  NaN 
2014-01-08 11:00:32  NaN  NaN  NaN 
2014-01-08 11:00:33  NaN  NaN  NaN 
2014-01-08 11:00:34  NaN  NaN  NaN 
2014-01-08 11:00:35  NaN  NaN  NaN 
2014-01-08 11:00:36  NaN  NaN  NaN 
2014-01-08 11:00:37  NaN  NaN  NaN 
2014-01-08 11:00:38  NaN  NaN  NaN 
2014-01-08 11:00:39  NaN  NaN  NaN 
2014-01-08 11:00:40  NaN  NaN  NaN 
2014-01-08 11:00:41  NaN  NaN  NaN 
2014-01-08 11:00:42  NaN  NaN  NaN 
2014-01-08 11:00:43  NaN  NaN  NaN 
2014-01-08 11:00:44  NaN  NaN  NaN 
2014-01-08 11:00:45  NaN  NaN  NaN 
2014-01-08 11:00:46  NaN  NaN  NaN 
2014-01-08 11:00:47 540.43 540.85 0.42 

[34 rows x 3 columns] 
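The how= argument to resample was later deprecated and removed; on recent pandas versions the equivalent is roughly:

# modern spelling of the same 1-second resample with a max aggregation
per_second = df.set_index('exchtime').resample('1s').max()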

Performance comparison:

In [10]: df = DataFrame(np.random.randn(100000,2),index=date_range('20130101',periods=100000,freq='50U')) 

In [7]: def f1(df): 
    ...:  df = df.copy() 
    ...:  df['seconds'] = [ e.replace(microsecond=0) for e in df.index ] 
    ...:  df.groupby('seconds')[0].agg(np.max) 
    ...:  

In [11]: def f2(df): 
    ....:  df = df.copy() 
    ....:  df.resample('1s',how=np.max) 
    ....:  

In [8]: %timeit f1(df) 
1 loops, best of 3: 692 ms per loop 

In [12]: %timeit f2(df) 
100 loops, best of 3: 2.36 ms per loop 

Here is another way to do it, which is faster at a lower frequency. (high/low are equivalent to max/min, with open being the first value and close the last.)

In [9]: df = DataFrame(np.random.randn(100000,2),index=date_range('20130101',periods=100000,freq='50L')) 

In [10]: df.groupby(pd.TimeGrouper('1s'))[0].ohlc() 
Out[10]: 
In [11]: %timeit df.groupby(pd.TimeGrouper('1s'))[0].ohlc() 
1000 loops, best of 3: 1.2 ms per loop 
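pd.TimeGrouper has since been removed from pandas; pd.Grouper(freq=...) is the current spelling. Applied to the quotes DataFrame from the question (called quotes below, with the spread column already added), the 'high' column of the ohlc output is the per-second maximum spread; a sketch:

# pd.Grouper replaces the removed pd.TimeGrouper on recent pandas versions
ohlc = quotes.set_index('exchtime').groupby(pd.Grouper(freq='1s'))['spread'].ohlc()
max_spread = ohlc['high']   # 'high' is the per-second maximum of the spread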

Thanks! I'm still running into trouble, though. If I use 100000 random samples at a 50U frequency, I get times of 3.62 ms, as in your case. But if I go with a 50L frequency and the same number of samples, then I'm at 612 ms! The higher frequency is more in line with my tick samples, so my real-world performance is still aggravating. Any thoughts on the higher frequencies? My hunch is that pandas is iterating over every new index value. – chrisaycock


So the two distributions are quite different; there are a lot more values per second with U than with L (which is how it's constructed, of course). If you use a cythonized function (e.g. np.mean/np.sum/np.prod/np.var/np.median) you'll get comparable performance. max/min are not cythonized, so the operation falls back to a slower path (an enhancement is needed to fix this)... I'll open an issue. You can also use first/last in a slightly different way. You could also try ohlc, which should work as well. I'll edit the answer to show it. – Jeff


Issue opened for the perf of max/min (though you should use ohlc in any event): https://github.com/pydata/pandas/issues/5927. My earlier comment was slightly off: np.max/min are cythonized, but for some reason they take the slower path. – Jeff
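For readers who want to reproduce the comment-thread timings on their own pandas release (the exact numbers will depend on the version), a minimal sketch using the standard timeit module:

import timeit
import numpy as np
import pandas as pd

# same shape as the comment thread: 100000 rows spaced 50 ms apart ('50L' above)
df = pd.DataFrame(np.random.randn(100000, 2),
                  index=pd.date_range('20130101', periods=100000, freq='50ms'))

# compare a cythonized reduction (mean) against max over 1-second bins
for agg in ('mean', 'max'):
    secs = timeit.timeit(lambda: getattr(df.resample('1s'), agg)(), number=10)
    print(agg, secs / 10)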