2013-03-28 252 views
3

我有一个熊猫数据框,其中一列表示另一列中的位置值是否在其下面的行中发生了变化。作为一个例子,遍历熊猫数据框

2013-02-05 19:45:00 (39.94, -86.159)  True 
2013-02-05 19:50:00 (39.94, -86.159)  True 
2013-02-05 19:55:00 (39.94, -86.159) False 
2013-02-05 20:00:00 (39.777, -85.995) False 
2013-02-05 20:05:00 (39.775, -85.978)  True 
2013-02-05 20:10:00 (39.775, -85.978)  True 
2013-02-05 20:15:00 (39.775, -85.978) False 
2013-02-05 20:20:00 (39.94, -86.159)  True 
2013-02-05 20:30:00 (39.94, -86.159) False 

所以,我想要做的是去逐行通过这个数据帧,并与False检查行。然后(可能会添加另一列),在那个地方总共花费了“连续”时间。像上面的例子一样,可以再次访问同一个地方。在这种情况下,它被认为是一个单独的条件。所以,上面的例子中,是这样的:

2013-02-05 19:45:00 (39.94, -86.159)  True 0 
2013-02-05 19:50:00 (39.94, -86.159)  True 0 
2013-02-05 19:55:00 (39.94, -86.159) False 15 
2013-02-05 20:00:00 (39.777, -85.995) False 5 
2013-02-05 20:05:00 (39.775, -85.978)  True 0 
2013-02-05 20:10:00 (39.775, -85.978)  True 0 
2013-02-05 20:15:00 (39.775, -85.978) False 15 
2013-02-05 20:20:00 (39.94, -86.159)  True 0 
2013-02-05 20:25:00 (39.94, -86.159) False 10 

我会然后绘制的使用每天HIST()函数花这些“持续”时间的直方图。如何通过遍历数据框从第一个数据帧中获取第二个数据帧?我是python和熊猫的新手,真正的数据文件非常庞大,所以我需要一些相当高效的东西。

回答

7

下面是另一个需要

df['group'] = (df.condition == False).astype('int').cumsum().shift(1).fillna(0) 

df 
      date long  lat condition group 
2/5/2013 19:45:00 39.940 -86.159  True  0 
2/5/2013 19:50:00 39.940 -86.159  True  0 
2/5/2013 19:55:00 39.940 -86.159  False  0 
2/5/2013 20:00:00 39.777 -85.995  False  1 
2/5/2013 20:05:00 39.775 -85.978  True  2 
2/5/2013 20:10:00 39.775 -85.978  True  2 
2/5/2013 20:15:00 39.775 -85.978  False  2 
2/5/2013 20:20:00 39.940 -86.159  True  3 
2/5/2013 20:25:00 39.940 -86.159  False  3 

df['result'] = df.groupby(['group']).date.transform(lambda sdf: 5 *len(sdf)) 

df 
      date long  lat condition group result 
2/5/2013 19:45:00 39.940 -86.159  True  0  15 
2/5/2013 19:50:00 39.940 -86.159  True  0  15 
2/5/2013 19:55:00 39.940 -86.159  False  0  15 
2/5/2013 20:00:00 39.777 -85.995  False  1  5 
2/5/2013 20:05:00 39.775 -85.978  True  2  15 
2/5/2013 20:10:00 39.775 -85.978  True  2  15 
2/5/2013 20:15:00 39.775 -85.978  False  2  15 
2/5/2013 20:20:00 39.940 -86.159  True  3  10 
2/5/2013 20:25:00 39.940 -86.159  False  3  10 
+0

非常好! – John

4

您将需要0.11-dev。我认为这会给你你正在寻找的东西。请参阅本节:http://pandas.pydata.org/pandas-docs/dev/timeseries.html#time-deltas更多的信息作为timedeltas是,大熊猫是支持

继承人您的数据更新数据(我隔长/ LAT只是为了方便,关键的是, 条件列是一个布尔)

In [137]: df = pd.read_csv(StringIO.StringIO(data),index_col=0,parse_dates=True) 

In [138]: df 
Out[138]: 
       date long  lat condition 
2013-02-05 19:45:00 39.940 -86.159  True 
2013-02-05 19:50:00 39.940 -86.159  True 
2013-02-05 19:55:00 39.940 -86.159  False 
2013-02-05 20:00:00 39.777 -85.995  False 
2013-02-05 20:05:00 39.775 -85.978  True 
2013-02-05 20:10:00 39.775 -85.978  True 
2013-02-05 20:15:00 39.775 -85.978  False 
2013-02-05 20:20:00 39.940 -86.159  True 
2013-02-05 20:25:00 39.940 -86.159  False 

In [139]: df.dtypes 
Out[139]: 
date   float64 
long lat  float64 
condition  bool 
dtype: object 

创建的索引一些日期列(这些都是datetime64 [NS] D型)

In [140]: df['date'] = df.index 
In [141]: df['rdate'] = df.index 

设置RDATE列是假到NAT(np.nan的转化到NAT)

In [142]: df.loc[~df['condition'],'rdate'] = np.nan 

正向从先前值

In [143]: df['rdate'] = df['rdate'].ffill() 

减去从日期的RDATE填充NAT的,这将产生一个timedelta64 [NS]键入的时间差 柱

In [144]: df['diff'] = df['date']-df['rdate'] 

In [151]: df 
Out[151]: 
            date long lat condition    rdate \ 
2013-02-05 19:45:00 2013-02-05 19:45:00 -86.159  True 2013-02-05 19:45:00 
2013-02-05 19:50:00 2013-02-05 19:50:00 -86.159  True 2013-02-05 19:50:00 
2013-02-05 19:55:00 2013-02-05 19:55:00 -86.159  False 2013-02-05 19:50:00 
2013-02-05 20:00:00 2013-02-05 20:00:00 -85.995  False 2013-02-05 19:50:00 
2013-02-05 20:05:00 2013-02-05 20:05:00 -85.978  True 2013-02-05 20:05:00 
2013-02-05 20:10:00 2013-02-05 20:10:00 -85.978  True 2013-02-05 20:10:00 
2013-02-05 20:15:00 2013-02-05 20:15:00 -85.978  False 2013-02-05 20:10:00 
2013-02-05 20:20:00 2013-02-05 20:20:00 -86.159  True 2013-02-05 20:20:00 
2013-02-05 20:25:00 2013-02-05 20:25:00 -86.159  False 2013-02-05 20:20:00 

         diff 
2013-02-05 19:45:00 00:00:00 
2013-02-05 19:50:00 00:00:00 
2013-02-05 19:55:00 00:05:00 
2013-02-05 20:00:00 00:10:00 
2013-02-05 20:05:00 00:00:00 
2013-02-05 20:10:00 00:00:00 
2013-02-05 20:15:00 00:05:00 
2013-02-05 20:20:00 00:00:00 
2013-02-05 20:25:00 00:05:00 

的diff列现在是timedelta64 [ns],所以你想在几分钟内整数 (FYI这是有点笨重,因为大熊猫没有标量类型 Ti medelta类似于日期的时间戳)

(另外,你可能必须在你完成之前对这个rdate系列做一个shift(),我认为我在某个地方被关闭了)......但这是主意

In [175]: df['diff'].map(lambda x: x.item().seconds/60) 
Out[175]: 
2013-02-05 19:45:00  0 
2013-02-05 19:50:00  0 
2013-02-05 19:55:00  5 
2013-02-05 20:00:00 10 
2013-02-05 20:05:00  0 
2013-02-05 20:10:00  0 
2013-02-05 20:15:00  5 
2013-02-05 20:20:00  0 
2013-02-05 20:25:00  5 
+0

你也可以做'ffill(就地= TRUE)',以免造成临时数组复制。 –