2016-12-21 69 views

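Counting occurrences of a row while looping over a dataframe

I have a dataframe in which every row is repeated 3 times. While looping over it, how can I tell whether I have already seen a row and then act on that, i.e. print something at its second occurrence in the loop?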

print df 
     user  date 
0  User001 2014-11-01 
40  User001 2014-11-01 
80  User001 2014-11-01 
120 User001 2014-11-08 
200 User001 2014-11-08 
160 User001 2014-11-08 
280 User001 2014-11-15 
240 User001 2014-11-15 
320 User001 2014-11-15 
400 User001 2014-11-22 
440 User001 2014-11-22 
360 User001 2014-11-22 
... ...... .......... 
... ...... .......... 
1300 User008 2014-11-22 
1341 User008 2014-11-22 
1360 User008 2014-11-22 

for line in df.itertuples(): 
    user = line[1] 
    date = line[2] 

    print user, date 
    #do something after second occurrence of tuple i.e. print "second occurrence" 

('User001', '2014-11-01') 
('User001', '2014-11-01') 
second occurrence 
('User001', '2014-11-01') 
('User001', '2014-11-08') 
('User001', '2014-11-08') 
second occurrence 
('User001', '2014-11-08') 
('User001', '2014-11-15') 
('User001', '2014-11-15') 
second occurrence 
('User001', '2014-11-15') 
('User001', '2014-11-22') 
('User001', '2014-11-22') 
second occurrence 
('User001', '2014-11-22') 
('User008', '2014-11-22') 
('User008', '2014-11-22') 
second occurrence 
('User008', '2014-11-22') 

Answers

2

You can use cumcount to find the index labels of all second occurrences:

mask = df.groupby(['user', 'date']).cumcount() == 1
idx = mask[mask].index
print(idx)
Int64Index([40, 200, 240, 440], dtype='int64')
for line in df.itertuples():
    print(line.user)
    print(line.date)
    if line.Index in idx:
        print('second occurrence')

User001 
2014-11-01 
User001 
2014-11-01 
second occurrence 
User001 
2014-11-01 
User001 
2014-11-08 
User001 
2014-11-08 
second occurrence 
User001 
2014-11-08 
User001 
2014-11-15 
User001 
2014-11-15 
second occurrence 
User001 
2014-11-15 
User001 
2014-11-22 
User001 
2014-11-22 
second occurrence 
User001 
2014-11-22 
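
To make the cumcount step easier to follow, here is a minimal sketch that stores the running count as a column. The sample frame and the column name `occurrence` are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data: labels and values are invented for this sketch.
df = pd.DataFrame(
    {
        "user": ["User001"] * 6,
        "date": ["2014-11-01"] * 3 + ["2014-11-08"] * 3,
    },
    index=[0, 40, 80, 120, 200, 160],
)

# cumcount() numbers the rows inside each (user, date) group as 0, 1, 2, ...
# Adding 1 turns that into a 1-based occurrence number.
df["occurrence"] = df.groupby(["user", "date"]).cumcount() + 1

# Index labels of the rows that are the second occurrence of their group.
second = df.index[df["occurrence"] == 2].tolist()

print(df)
print(second)  # [40, 200]
```

The numbering follows row order, not index label order, which is why label 200 (not 160) is reported for the second group.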

Another solution for finding the index:

idx = df[df.duplicated(['user', 'date']) &
         df.duplicated(['user', 'date'], keep='last')].index
print(idx)
Int64Index([40, 200, 240, 440], dtype='int64')
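
One caveat worth checking: the two `duplicated` masks select every row that is neither the first nor the last of its group, which equals the second occurrence only when each row appears exactly 3 times. The sketch below uses a hypothetical single-column frame (the name `key` is an assumption) to compare it with cumcount when one group has 4 rows.

```python
import pandas as pd

# Hypothetical data: "a" appears 4 times, "b" appears 3 times.
df = pd.DataFrame({"key": ["a", "a", "a", "a", "b", "b", "b"]})

# True for every row except the first of its group.
not_first = df.duplicated(["key"])
# True for every row except the last of its group.
not_last = df.duplicated(["key"], keep="last")

# Rows that are neither first nor last: all "middle" rows.
middle = df.index[not_first & not_last].tolist()

# Rows that are exactly the second occurrence.
second = df.index[df.groupby("key").cumcount() == 1].tolist()

print(middle)  # [1, 2, 5]
print(second)  # [1, 5]
```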
1

I would suggest using the DataFrame.duplicated() method to get a boolean index that identifies duplicated rows.

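Depending on how you want to show the duplicates, you can use it in different ways, but if you want to iterate over the rows and print a notice for each one that is a duplicate, something like this might work: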

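# duplicated() is True for every repeat after the first (second and third here)
duplicate_index = df.duplicated()
# iterating a DataFrame directly yields column labels, so use itertuples() for rows
for row, dupl in zip(df.itertuples(index=False), duplicate_index):
    print(row[0], row[1])
    if dupl:
        print('second occurrence')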
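
As a quick check of what this approach works with, the following sketch (on a hypothetical three-row frame) shows that iterating a DataFrame directly yields column labels rather than rows, and that `duplicated()` with its default `keep='first'` flags both the second and the third copy.

```python
import pandas as pd

# Hypothetical sample: one row repeated three times.
df = pd.DataFrame(
    {"user": ["User001"] * 3, "date": ["2014-11-01"] * 3},
    index=[0, 40, 80],
)

# Iterating the frame itself gives the column labels.
columns = list(df)
print(columns)  # ['user', 'date']

# Default keep='first': only the first copy is left unflagged.
flags = df.duplicated().tolist()
print(flags)  # [False, True, True]

# Pair each row (as a plain tuple) with its flag.
rows = list(zip(df.itertuples(index=False, name=None), flags))
print(rows[1])  # (('User001', '2014-11-01'), True)
```

If only the second copy should trigger the notice, the flag alone is not enough and a per-group count is needed.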
1

Track occurrences with a Counter

from collections import Counter

seen = Counter()
for i, row in df.iterrows():
    tup = tuple(row.values.tolist())
    # seen[tup] is the number of earlier rows equal to this one
    if seen[tup] == 1:
        print(tup, 'second occurrence')
    else:
        print(tup)
    seen.update([tup])

('User001', '2014-11-01')
('User001', '2014-11-01') second occurrence
('User001', '2014-11-01')
('User001', '2014-11-08')
('User001', '2014-11-08') second occurrence
('User001', '2014-11-08')
('User001', '2014-11-15')
('User001', '2014-11-15') second occurrence
('User001', '2014-11-15')
('User001', '2014-11-22')
('User001', '2014-11-22') second occurrence
('User001', '2014-11-22')
('User008', '2014-11-22')
('User008', '2014-11-22') second occurrence
('User008', '2014-11-22')
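
If the frame has more columns than the ones that define a repeat, the Counter key can be built from selected fields only. This is a minimal sketch under that assumption; the extra `value` column and the list `second_rows` are hypothetical and not from the original answer.

```python
from collections import Counter

import pandas as pd

# Hypothetical frame: "value" differs per row and must not be part of the key.
df = pd.DataFrame(
    {
        "user": ["User001", "User001", "User001", "User008"],
        "date": ["2014-11-01", "2014-11-01", "2014-11-01", "2014-11-22"],
        "value": [1, 2, 3, 4],
    }
)

seen = Counter()
second_rows = []
for row in df.itertuples():
    # Only user and date decide whether a row counts as a repeat.
    key = (row.user, row.date)
    seen[key] += 1
    if seen[key] == 2:
        second_rows.append(row.Index)

print(second_rows)  # [1]
```

Using `itertuples()` here also avoids building a Series for every row, which `iterrows()` does.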