Selecting a slice of a DataFrame

I load a DataFrame like this:

 minData = pd.read_csv(
       currentSymbol["fullpath"], 
       header = None, 
       names = ['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume', 'Split Factor', 'Earnings', 'Dividends'], 
       parse_dates = [["Date", "Time"]], 
       date_parser = lambda x : datetime.datetime.strptime(x, '%Y%m%d %H%M'), 
       index_col = "Date_Time", 
       sep=' ')
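
For readers on a newer pandas, here is a minimal sketch of the same load without `date_parser` or the nested `parse_dates` list, both of which, as far as I know, are deprecated in recent releases. The two sample rows are copied from the output shown in this question; the in-memory file and the names `sample`, `raw` and `min_data` are assumptions for illustration only.

```python
import io
import pandas as pd

# Two rows in the assumed space-separated layout (date as YYYYMMDD, time as HHMM).
sample = io.StringIO(
    "19980102 0930 8.70630 8.70630 8.70630 8.70630 420.73 4 0 0\n"
    "19980102 0935 8.82514 8.82514 8.82514 8.82514 420.73 4 0 0\n"
)

names = ['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume',
         'Split Factor', 'Earnings', 'Dividends']

# Read Date and Time as strings so the leading zero in "0930" survives.
raw = pd.read_csv(sample, header=None, names=names, sep=' ',
                  dtype={'Date': str, 'Time': str})

# Parse both columns in one vectorized call with an explicit format.
stamps = pd.to_datetime(raw['Date'] + ' ' + raw['Time'], format='%Y%m%d %H%M')
raw.index = pd.DatetimeIndex(stamps, name='Date_Time')

# Drop the raw text columns now that they live in the index.
min_data = raw.drop(columns=['Date', 'Time'])
```

Parsing after the read keeps the format string in one place and avoids calling `strptime` once per row.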

The resulting DataFrame looks like this:

>>> minData.index 
<class 'pandas.tseries.index.DatetimeIndex'> 
[1998-01-02 09:30:00, ..., 2013-12-09 16:00:00] 
Length: 1373036, Freq: None, Timezone: None 
>>> 

>>> minData.head(5) 
                        Open     High      Low    Close   Volume \
Date_Time
1998-01-02 09:30:00  8.70630  8.70630  8.70630  8.70630   420.73
1998-01-02 09:35:00  8.82514  8.82514  8.82514  8.82514   420.73
1998-01-02 09:42:00  8.79424  8.79424  8.79424  8.79424   420.73
1998-01-02 09:43:00  8.76572  8.76572  8.76572  8.76572  1262.19
1998-01-02 09:44:00  8.76572  8.76572  8.76572  8.76572   420.73

                     Split Factor  Earnings  Dividends  Active
Date_Time
1998-01-02 09:30:00             4         0          0     NaN
1998-01-02 09:35:00             4         0          0     NaN
1998-01-02 09:42:00             4         0          0     NaN
1998-01-02 09:43:00             4         0          0     NaN
1998-01-02 09:44:00             4         0          0     NaN

[5 rows x 9 columns]

I can select a single day like this:

>>> minData["2004-12-20"] 
                        Open     High      Low    Close     Volume \
Date_Time
2004-12-20 09:30:00  35.8574  35.9373  35.8025  35.9273  154112.00
2004-12-20 09:31:00  35.8924  35.9174  35.8824  35.8874   17021.50
2004-12-20 09:32:00  35.8874  35.8924  35.8824  35.8824   17079.50
2004-12-20 09:33:00  35.8874  35.9423  35.8724  35.9373   32491.50
2004-12-20 09:34:00  35.9373  36.0023  35.9174  36.0023   40096.40
2004-12-20 09:35:00  35.9923  36.2071  35.9923  36.1471   67088.90
...

That returns the rows of my DataFrame for that day.

I have a date that looks like this (read from a different file):

>>> ts 
Timestamp('2004-12-20 00:00:00', tz=None) 
>>>

I want to set the "Active" column to True for every minute of this day.

I can do that with this:

minData.loc['2004-12-20',"Active"] = True

and I can do the same thing with my Timestamp date using this crazy piece of code:

minData.loc[str(ts.year) + "-" + str(ts.month) + "-" + str(ts.day),"Active"] = True

Yes, that is building a string from a Timestamp object!

I know there must be a better way to do this.

Source

2014-03-27 JasonEdinburgh

I would actually do it like this:

In [20]: df = DataFrame(np.random.randn(10,1),index=date_range('20130101 23:55:00',periods=10,freq='T')) 

In [21]: df['Active'] = False 

In [22]: df 
Out[22]: 
                             0  Active
2013-01-01 23:55:00   0.273194   False
2013-01-01 23:56:00   2.869795   False
2013-01-01 23:57:00   0.980566   False
2013-01-01 23:58:00   0.176711   False
2013-01-01 23:59:00  -0.354976   False
2013-01-02 00:00:00   0.258194   False
2013-01-02 00:01:00  -1.765781   False
2013-01-02 00:02:00   0.106163   False
2013-01-02 00:03:00  -1.169214   False
2013-01-02 00:04:00   0.224484   False

[10 rows x 2 columns] 


In [28]: df['Active'] = False

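As @Andy Hayden points out, normalize sets the time of every entry to 0 (midnight), so you can compare the index directly against a Timestamp whose time is also 0.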

In [34]: df.loc[df.index.normalize() == Timestamp('20130102'),'Active'] = True 

In [35]: df 
Out[35]: 
                             0  Active
2013-01-01 23:55:00   0.273194   False
2013-01-01 23:56:00   2.869795   False
2013-01-01 23:57:00   0.980566   False
2013-01-01 23:58:00   0.176711   False
2013-01-01 23:59:00  -0.354976   False
2013-01-02 00:00:00   0.258194    True
2013-01-02 00:01:00  -1.765781    True
2013-01-02 00:02:00   0.106163    True
2013-01-02 00:03:00  -1.169214    True
2013-01-02 00:04:00   0.224484    True

[10 rows x 2 columns]
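
Applied to the question's setup, the normalize comparison might look like the sketch below. The small `min_data` frame and its `Close` column are hypothetical stand-ins for the real minute data, and `ts` mirrors the Timestamp shown in the question. The second option is not from the answer; it just formats the Timestamp once with `strftime` instead of concatenating year, month and day by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the minute data: ten minutes straddling midnight.
idx = pd.date_range('2004-12-19 23:55:00', periods=10, freq='min')
min_data = pd.DataFrame({'Close': np.arange(10.0)}, index=idx)
min_data['Active'] = False

ts = pd.Timestamp('2004-12-20')  # a midnight Timestamp, as in the question

# Option 1: compare the midnight-normalized index with the Timestamp.
mask = min_data.index.normalize() == ts.normalize()
min_data.loc[mask, 'Active'] = True

# Option 2: format the Timestamp once and reuse partial-string indexing.
day = ts.strftime('%Y-%m-%d')
same_rows = min_data.loc[day]
```

Calling `ts.normalize()` as well costs nothing and guards against a Timestamp that happens to carry a time of day.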

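For really fine-grained control, do this (you can use indexer_at_time if you only want specific times as the indexer). You can always add and clauses (the & operator on boolean masks) to do more complicated indexing.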

In [29]: df.loc[df.index.indexer_between_time('20130101 23:59:00','20130102 00:03:00'),'Active'] = True 

In [30]: df 
Out[30]: 
                             0  Active
2013-01-01 23:55:00   0.273194   False
2013-01-01 23:56:00   2.869795   False
2013-01-01 23:57:00   0.980566   False
2013-01-01 23:58:00   0.176711   False
2013-01-01 23:59:00  -0.354976    True
2013-01-02 00:00:00   0.258194    True
2013-01-02 00:01:00  -1.765781    True
2013-01-02 00:02:00   0.106163    True
2013-01-02 00:03:00  -1.169214    True
2013-01-02 00:04:00   0.224484   False

[10 rows x 2 columns]
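
Two details are worth knowing when reusing this: `indexer_between_time` looks only at the time of day, and it returns integer positions rather than labels. The sketch below assumes a recent pandas, where those positions may not be accepted by `.loc` on a DatetimeIndex, so it converts them to labels first. The `value` column and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2013-01-01 23:55:00', periods=10, freq='min')
df = pd.DataFrame({'value': np.arange(10.0)}, index=idx)
df['Active'] = False

# Integer positions of rows whose time of day falls in [23:59, 00:03].
# The start is later than the end, so the window wraps past midnight.
positions = df.index.indexer_between_time('23:59', '00:03')

# Turn positions into index labels before handing them to .loc.
df.loc[df.index[positions], 'Active'] = True

# An "and" clause: also require the calendar date to be 2013-01-02.
on_second = df.index.normalize() == pd.Timestamp('2013-01-02')
in_window = np.zeros(len(df), dtype=bool)
in_window[positions] = True
both = df[in_window & on_second]
```

On a multi-day frame the time-of-day window would match every day, which is where the extra date condition earns its keep.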

Source

2014-03-27 21:25:41 Jeff

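Yep, forgot about that! Updated. – Jeff

Awesome, thanks @Jeff! I had been reading about normalize but couldn't see how to use it in this case. I hadn't read anything about the indexer_between_time method, so I'll do some research. Thanks again! – JasonEdinburgh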
