2016-02-03 73 views

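Delete certain rows from a CSV file using Python

Goal: find the average elapsed time between the GREEN and YELLOW statuses. First I need to remove all the unnecessary rows. To find the elapsed time, I need the first instance of GREEN, followed by the first instance of YELLOW, repeated over and over. Below is an excerpt from a file of 100,000+ rows.

In the example below, I want to keep rows 1, 2, 5, 6, 9, 13, 14, 15, 16 and 21.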

Row #  Serial Number  Time Stamp      Status
1      1400004        3/10/14 11:52   GREEN
2      1400004        3/15/14 11:45   YELLOW
3      1400004        3/29/14 7:59    YELLOW
4      1400004        4/16/14 15:59   YELLOW
5      1400004        5/10/14 8:18    GREEN
6      1400004        5/11/14 15:28   YELLOW
7      1400004        5/23/14 14:10   YELLOW
8      1400004        5/24/14 7:56    YELLOW
9      1400004        5/26/14 7:59    GREEN
10     1400004        5/28/14 8:26    GREEN
11     1400004        5/30/14 7:28    GREEN
12     1400004        6/1/14 16:56    GREEN
13     1400004        6/13/14 17:29   YELLOW
14     1400004        6/15/14 15:12   GREEN
15     1400004        6/17/14 8:57    YELLOW
16     1400007        1/3/14 11:55    GREEN
17     1400007        1/4/14 15:31    GREEN
18     1400007        1/15/14 14:44   GREEN
19     1400007        1/17/14 5:37    GREEN
20     1400007        1/18/14 5:35    GREEN
21     1400007        1/18/14 18:32   YELLOW
22     1400007        1/19/14 21:50   YELLOW
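What is your problem? Show your code and the full error message. SO is not a place to get your program written for you. – furas

What is the expected output for the data you have provided? –

No time calculation is needed. I can handle that separately. I only need to remove the unnecessary rows. –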

Answer

The following can be used to get just the rows you are looking for:

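import csv
from itertools import groupby

# Python 3: open CSV files in text mode with newline=''
with open('input.csv', newline='') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)  # skip the header row

    # Group consecutive rows by (serial number, status), so that a change of
    # serial number always starts a new group. x[4] is the status because the
    # date and time are assumed to be in two separate columns.
    for k, g in groupby(csv_input, lambda x: (x[1], x[4])):
        first_in_group = next(g)
        print(first_in_group[0])  # show first column entry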

This will display:

1 
2 
5 
6 
9 
13 
14 
15 
16 
21 
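
Since the question asks for the unwanted rows to be removed rather than just listed, the same idea can be sketched as a filter that writes the kept rows to a new CSV. This is only an illustration: the helper names `first_of_each_run` and `filter_csv`, the header text and the in-memory sample are assumptions, and the column positions follow the two-column date/time layout assumed in this answer.

```python
import csv
import io
from itertools import groupby


def first_of_each_run(rows, serial_col=1, status_col=4):
    """Yield the first row of every run of consecutive rows that share
    the same serial number and status."""
    for _, run in groupby(rows, key=lambda r: (r[serial_col], r[status_col])):
        yield next(run)


def filter_csv(f_input, f_output):
    """Copy the header plus the first row of each run to f_output."""
    reader = csv.reader(f_input)
    writer = csv.writer(f_output)
    writer.writerow(next(reader))  # header row is copied unchanged
    writer.writerows(first_of_each_run(reader))


# In-memory demonstration using rows 1-6 of the excerpt.
# The header names here are hypothetical.
sample = io.StringIO(
    "Row,Serial,Date,Time,Status\n"
    "1,1400004,3/10/14,11:52,GREEN\n"
    "2,1400004,3/15/14,11:45,YELLOW\n"
    "3,1400004,3/29/14,7:59,YELLOW\n"
    "4,1400004,4/16/14,15:59,YELLOW\n"
    "5,1400004,5/10/14,8:18,GREEN\n"
    "6,1400004,5/11/14,15:28,YELLOW\n"
)
result = io.StringIO()
filter_csv(sample, result)

# Row numbers that survived the filter (header line skipped).
kept_rows = [line.split(",")[0] for line in result.getvalue().splitlines()[1:]]
print(kept_rows)  # ['1', '2', '5', '6']
```

With real files, the two `StringIO` objects would be replaced by files opened with `open(..., newline='')`, one for reading and one for writing.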

To extend this, I would suggest the following:

import csv
from itertools import groupby
from datetime import datetime, timedelta

# Python 3: open CSV files in text mode with newline=''
with open('input.csv', newline='') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)

    for k1, g1 in groupby(csv_input, lambda x: x[1]):  # group by serial number
        last = None
        entries = []
        for k, g in groupby(g1, lambda x: x[4]):  # group by status
            first = next(g)
            start = datetime.strptime('{} {}'.format(first[2], first[3]), '%m/%d/%y %H:%M')

            if last:
                entries.append((first[0], k, start - last))
                # str() is needed: timedelta does not accept a format spec in Python 3
                print('{:4} {:7} {:>20}'.format(first[0], k, str(start - last)))

            last = start

        if entries:  # a serial number with only one status run has no intervals
            average_seconds = sum((t[2] for t in entries), timedelta()).total_seconds() / len(entries)
            print("Entries: {} Average mins: {}".format(len(entries), round(average_seconds / 60, 7)))
        print()

This displays the following output for the data you have given:

2    YELLOW      4 days, 23:53:00
5    GREEN      55 days, 20:33:00
6    YELLOW        1 day, 7:10:00
9    GREEN      14 days, 16:31:00
13   YELLOW      18 days, 9:30:00
14   GREEN        1 day, 21:43:00
15   YELLOW       1 day, 17:45:00
Entries: 7 Average mins: 20340.7142857

21   YELLOW      15 days, 6:37:00
Entries: 1 Average mins: 21997.0
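
Note that the averages above include every change of status, both GREEN to YELLOW and YELLOW to GREEN. If only the GREEN to YELLOW intervals are wanted, as the question's goal suggests, a minimal sketch could look like the following. The function name `green_to_yellow_gaps` is hypothetical, and the code assumes the rows are already ordered by serial number and then by time, with date and time in separate columns.

```python
from datetime import datetime, timedelta
from itertools import groupby


def green_to_yellow_gaps(rows):
    """Return the GREEN -> YELLOW gaps as a list of timedeltas.

    rows: iterable of [row, serial, date, time, status] lists, assumed to be
    ordered by serial number and then by time.
    """
    gaps = []
    for _, serial_rows in groupby(rows, key=lambda r: r[1]):
        green_start = None  # reset for every serial number
        for status, run in groupby(serial_rows, key=lambda r: r[4]):
            first = next(run)
            stamp = datetime.strptime(first[2] + " " + first[3], "%m/%d/%y %H:%M")
            if status == "GREEN":
                green_start = stamp
            elif status == "YELLOW" and green_start is not None:
                gaps.append(stamp - green_start)
                green_start = None
    return gaps


# Rows 1-6 of the excerpt.
rows = [
    ["1", "1400004", "3/10/14", "11:52", "GREEN"],
    ["2", "1400004", "3/15/14", "11:45", "YELLOW"],
    ["3", "1400004", "3/29/14", "7:59", "YELLOW"],
    ["4", "1400004", "4/16/14", "15:59", "YELLOW"],
    ["5", "1400004", "5/10/14", "8:18", "GREEN"],
    ["6", "1400004", "5/11/14", "15:28", "YELLOW"],
]
gaps = green_to_yellow_gaps(rows)
average = sum(gaps, timedelta()) / len(gaps)
print(average)  # 3 days, 3:31:30
```

The two gaps here are 4 days, 23:53:00 and 1 day, 7:10:00, which average to the value shown. A serial number that starts with YELLOW contributes nothing until its first GREEN is seen.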

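One issue is that your timestamps reset with each new serial number, so if you calculated a difference across that boundary you would get a large negative time. Also, it is not clear whether your date and time are in one column or two. The script assumes two columns, e.g.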

Row,#,Serial,Number,Time,Stamp,Status 
1,1400004,3/10/14,11:52,GREEN
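
If the date and time turn out to be in a single column, the column indexes used above would shift by one. One possible way to cope with both layouts is to normalise each row before grouping. This is a sketch only: the helper name `normalize` and the 4-column layout are assumptions, since the real file layout is not known.

```python
from datetime import datetime

STAMP_FORMAT = "%m/%d/%y %H:%M"


def normalize(row):
    """Return (row_id, serial, timestamp, status) for either layout.

    5 columns: row, serial, date, time, status   (layout assumed by the script)
    4 columns: row, serial, "date time", status  (hypothetical alternative)
    """
    if len(row) == 5:
        row_id, serial, date, time, status = row
        stamp_text = date + " " + time
    elif len(row) == 4:
        row_id, serial, stamp_text, status = row
    else:
        raise ValueError("unexpected number of columns: {}".format(len(row)))
    return row_id, serial, datetime.strptime(stamp_text, STAMP_FORMAT), status


two_columns = normalize(["1", "1400004", "3/10/14", "11:52", "GREEN"])
one_column = normalize(["1", "1400004", "3/10/14 11:52", "GREEN"])
print(two_columns == one_column)  # True
```

After normalising, the serial number is always at index 1 and the status at index 3, so the `groupby` keys would use those positions instead of `x[1]` and `x[4]`.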