Python pandas time series resampling gives unexpected results

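The data here is for a bank account with a running balance. I want to resample the data to keep only the end-of-day balance, that is, the last value given for each day. A single day can have multiple data points, representing multiple transactions.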

In [1]: from StringIO import StringIO 

In [2]: import pandas as pd 

In [3]: import numpy as np 

In [4]: print "Pandas version", pd.__version__ 
Pandas version 0.12.0 

In [5]: print "Numpy version", np.__version__ 
Numpy version 1.7.1 

In [6]: data_string = StringIO(""""Date","Balance" 
    ...: "08/09/2013","1000" 
    ...: "08/09/2013","950" 
    ...: "08/09/2013","930" 
    ...: "08/06/2013","910" 
    ...: "08/02/2013","900" 
    ...: "08/01/2013","88" 
    ...: "08/01/2013","87" 
    ...: """) 

In [7]: ts = pd.read_csv(data_string, parse_dates=[0], index_col=0) 

In [8]: print ts 
      Balance 
Date    
2013-08-09  1000 
2013-08-09  950 
2013-08-09  930 
2013-08-06  910 
2013-08-02  900 
2013-08-01  88 
2013-08-01  87

I would expect "2013-08-09" to be 1000, but definitely not the 'middle' number 950.

In [10]: ts.Balance.resample('D', how='last') 
Out[10]: 
Date 
2013-08-01  88 
2013-08-02 900 
2013-08-03 NaN 
2013-08-04 NaN 
2013-08-05 NaN 
2013-08-06 910 
2013-08-07 NaN 
2013-08-08 NaN 
2013-08-09 950 
Freq: D, dtype: float64

I would expect "2013-08-09" to be 930, or "2013-08-01" to be 88.

In [12]: ts.Balance.resample('D', how='first') 
Out[12]: 
Date 
2013-08-01  87 
2013-08-02  900 
2013-08-03  NaN 
2013-08-04  NaN 
2013-08-05  NaN 
2013-08-06  910 
2013-08-07  NaN 
2013-08-08  NaN 
2013-08-09 1000 
Freq: D, dtype: float64

Am I missing something here? Does resampling with 'first' and 'last' not work the way I expect?


2013-08-23 Grant

To be able to resample your data, pandas first has to sort it. So if you load your data and sort it by the index, you get the following:

>>> pd.read_csv(data_string, parse_dates=[0], index_col=0).sort_index() 
      Balance 
Date    
2013-08-01  87 
2013-08-01  88 
2013-08-02  900 
2013-08-06  910 
2013-08-09  1000 
2013-08-09  930 
2013-08-09  950

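That explains the results you got. @Jeff explained why the order is "random" and, based on your comment, the solution is to sort the data with the mergesort algorithm before the operation: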

>>> df = pd.read_csv(data_string, parse_dates=[0],
...                  index_col=0).sort_index(kind='mergesort')
>>> df.Balance.resample('D',how='last') 
2013-08-01  88 
2013-08-02  900 
2013-08-03  NaN 
2013-08-04  NaN 
2013-08-05  NaN 
2013-08-06  910 
2013-08-07  NaN 
2013-08-08  NaN 
2013-08-09 1000 
>>> df.Balance.resample('D', how='first') 
2013-08-01  87 
2013-08-02 900 
2013-08-03 NaN 
2013-08-04 NaN 
2013-08-05 NaN 
2013-08-06 910 
2013-08-07 NaN 
2013-08-08 NaN 
2013-08-09 930
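The `how=` keyword used in these transcripts was removed from later pandas releases. A minimal sketch of the same idea against a recent pandas (assuming Python 3 and pandas 1.x or later; the variable names are only illustrative) sorts with a stable algorithm and then calls the aggregation as a method. A stable sort keeps rows that share a date in their file order, so which value counts as "first" or "last" depends on how the file itself is ordered.

```python
from io import StringIO

import pandas as pd

CSV = '''"Date","Balance"
"08/09/2013","1000"
"08/09/2013","950"
"08/09/2013","930"
"08/06/2013","910"
"08/02/2013","900"
"08/01/2013","88"
"08/01/2013","87"
'''

# Parse the dates with an explicit format so month/day order is unambiguous.
ts = pd.read_csv(StringIO(CSV))
ts["Date"] = pd.to_datetime(ts["Date"], format="%m/%d/%Y")
ts = ts.set_index("Date")

# A stable sort keeps rows with the same date in their original file order.
stable = ts.sort_index(kind="mergesort")

# resample(...).last() / .first() replace resample(..., how='last' / 'first').
last_in_file_order = stable["Balance"].resample("D").last()
first_in_file_order = stable["Balance"].resample("D").first()

print(last_in_file_order)
print(first_in_file_order)
```

Here "last" means the last row of each date as it appears in the file, which is only the end-of-day balance if the file lists the oldest transaction first.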


2013-08-23 20:42:53

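Sorting among duplicates is arbitrary (e.g. no guarantees from mergesort or quicksort), IIRC – Jeff

@Jeff I thought so. But it would be a really nice feature if pandas could recognize (on read) that the data is already sorted (as in this example) and use that sort order. :) Yes, I know... it's an "I want a pony" request :) –

Please open a request on GitHub; I don't know how tricky this is (it's computed in the group indexer) – Jeff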

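The problem is that your dates contain duplicates, which can effectively end up in an arbitrary order; ordering among duplicates is not guaranteed.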

In [24]: ts.Balance.resample('D',how='last') 
Out[24]: 
Date 
2013-08-01  87 
2013-08-02 900 
2013-08-03 NaN 
2013-08-04 NaN 
2013-08-05 NaN 
2013-08-06 910 
2013-08-07 NaN 
2013-08-08 NaN 
2013-08-09 930 
Freq: D, dtype: float64 

In [25]: ts.Balance.order().resample('D',how='last') 
Out[25]: 
Date 
2013-08-01  88 
2013-08-02  900 
2013-08-03  NaN 
2013-08-04  NaN 
2013-08-05  NaN 
2013-08-06  910 
2013-08-07  NaN 
2013-08-08  NaN 
2013-08-09 1000 
Freq: D, dtype: float64

The simplest way is to sort the data, but it is not clear what the order actually is (i.e. you need an exogenous parameter here to decide).
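One way to supply such an exogenous parameter is to record each row's position in the file and use it as a tie-breaker. The sketch below assumes a recent pandas and, importantly, assumes that the export lists the newest transaction first; the `file_pos` column is a hypothetical helper, not something pandas provides.

```python
from io import StringIO

import pandas as pd

CSV = '''"Date","Balance"
"08/09/2013","1000"
"08/09/2013","950"
"08/09/2013","930"
"08/06/2013","910"
"08/02/2013","900"
"08/01/2013","88"
"08/01/2013","87"
'''

df = pd.read_csv(StringIO(CSV))
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Hypothetical tie-breaker: the row's position in the file.
# ASSUMPTION: the file is newest-first, so within one day a smaller
# position means a later transaction.
df["file_pos"] = range(len(df))

# Oldest first: ascending date, then descending file position within a date.
chronological = df.sort_values(["Date", "file_pos"], ascending=[True, False])

# With an unambiguous order, 'last' is the end-of-day balance.
end_of_day = chronological.set_index("Date")["Balance"].resample("D").last()
print(end_of_day)
```

If the file were oldest-first instead, sorting with `ascending=[True, True]` would be the matching choice.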

Pass sort=False to groupby (you can't do this with resample, though):

In [29]: ts.groupby(ts.index,sort=False).last().reindex(pd.date_range(ts.index.min(),ts.index.max())) 
Out[29]: 
      Balance 
2013-08-01  87 
2013-08-02  900 
2013-08-03  NaN 
2013-08-04  NaN 
2013-08-05  NaN 
2013-08-06  910 
2013-08-07  NaN 
2013-08-08  NaN 
2013-08-09  930
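For reference, a self-contained sketch of this groupby-then-reindex approach as it might look on a recent pandas (an assumption; the session above ran on 0.12). Rows keep their original order inside each group, so `last()` picks the last row of each date as it appears in the file, and `reindex` fills in the missing days.

```python
from io import StringIO

import pandas as pd

CSV = '''"Date","Balance"
"08/09/2013","1000"
"08/09/2013","950"
"08/09/2013","930"
"08/06/2013","910"
"08/02/2013","900"
"08/01/2013","88"
"08/01/2013","87"
'''

ts = pd.read_csv(StringIO(CSV))
ts["Date"] = pd.to_datetime(ts["Date"], format="%m/%d/%Y")
ts = ts.set_index("Date")

# Group on the index without sorting the group keys; within each group
# the rows stay in file order, so last() is the last row seen per date.
last_per_date = ts.groupby(level=0, sort=False).last()

# Build the full daily range and reindex so that missing days become NaN.
full_range = pd.date_range(ts.index.min(), ts.index.max(), freq="D")
result = last_per_date.reindex(full_range)
print(result)
```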

You can do it this way to get what you are after:

In [52]: df = pd.DataFrame(ts.values,index=ts.index,columns=['values']).reset_index() 

In [53]: df 
Out[53]: 
       Date values 
0 2013-08-09 00:00:00 1000 
1 2013-08-09 00:00:00  950 
2 2013-08-09 00:00:00  930 
3 2013-08-06 00:00:00  910 
4 2013-08-02 00:00:00  900 
5 2013-08-01 00:00:00  88 
6 2013-08-01 00:00:00  87 

In [54]: df.groupby('Date').apply(lambda x: x.iloc[-1]['values']).reindex(pd.date_range(ts.index.min(),ts.index.max())) 

Out[54]: 
2013-08-01  87 
2013-08-02 900 
2013-08-03 NaN 
2013-08-04 NaN 
2013-08-05 NaN 
2013-08-06 910 
2013-08-07 NaN 
2013-08-08 NaN 
2013-08-09 930 
Freq: D, dtype: float64


2013-08-23 20:39:14 Jeff
