2016-12-16 48 views
3

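Get pandas duplicate row counts with the original index

I need to find duplicate rows in a pandas DataFrame and then add an extra column with the count. Say we have this DataFrame: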

>>> print(df)

+----+-----+-----+-----+-----+-----+-----+-----+-----+ 
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 
|----+-----+-----+-----+-----+-----+-----+-----+-----| 
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 | 
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 | 
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 9 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 | 
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 12 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 14 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 | 
| 18 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 
+----+-----+-----+-----+-----+-----+-----+-----+-----+ 

With the extra count column, the frame above should then become the following. Note that the original index values are kept.

+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ 
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 
|----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 | 1 | 
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 2 | 
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 | 1 | 
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 | 1 | 
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 | 1 | 
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+ 

I have seen other solutions, such as:

df.groupby(list(df.columns.values)).size() 

However, that returns a result with gaps and without the original index.
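To illustrate what goes wrong, here is a minimal sketch on a hypothetical three-row frame (not the data above). It assumes the "gaps" in the question are the repeated outer index levels that pandas leaves blank when printing a MultiIndex; the point it shows for certain is that the result is labelled by the grouped values rather than by the original row labels.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are identical.
df = pd.DataFrame({'2': [0, 2, 0], '3': [0, 4, 0]})

# Same call as in the question.
sizes = df.groupby(list(df.columns.values)).size()
print(sizes)

# The index is built from the grouped column values, so the
# original row labels 0, 1, 2 are no longer available.
print(sizes.index.tolist())  # [(0, 0), (2, 4)]
print(sizes.tolist())        # [2, 1]
```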

Answer

4

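You can use reset_index first to convert the index into a column, then aggregate with first and size (the number of rows in each group).

Also, to group by all columns you need to exclude the index column, which difference does: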

print (df.columns.difference(['index'])) 
Index(['2', '3', '4', '5', '6', '7', '8', '9'], dtype='object') 
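As a side note on difference, the sketch below uses a hypothetical two-column frame to show two things: the returned labels come back sorted, and dropping a label that is not present is harmless. The second point matters because the expression is evaluated on df before reset_index, where no 'index' column exists yet.

```python
import pandas as pd

# Hypothetical frame with columns deliberately out of order.
df = pd.DataFrame({'3': [4, 4], '2': [2, 2]})

# 'index' is not a column here, so nothing is removed;
# the labels are simply returned in sorted order.
before = df.columns.difference(['index']).tolist()

# After reset_index the old index lives in a column named 'index',
# and difference drops it from the grouping keys.
after = df.reset_index().columns.difference(['index']).tolist()

print(before)  # ['2', '3']
print(after)   # ['2', '3']
```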

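print(df.reset_index()            # move the index into an 'index' column
        .groupby(df.columns.difference(['index']).tolist())['index']
        .agg(['first', 'size'])   # first original index and row count per group
        .reset_index()            # turn the group keys back into columns
        .set_index(['first'])     # use the first original index as the index again
        .sort_index()
        .rename_axis(None))       # drop the 'first' index name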

    2  3  4  5  6  7  8  9  size
0   0  0  0  0  0  0  0  0     2
1   2  0  0  0  0  0  0  0     2
2   2  4  3  4  1  1  4  4     1
3   4  3  4  0  0  0  0  0     2
4   2  3  4  3  4  0  0  0     1
5   5  0  0  0  0  0  0  0     3
6   4  5  0  0  0  0  0  0     1
7   1  1  4  0  0  0  0  0     1
10  3  3  4  3  5  5  5  0     1
11  5  4  0  0  0  0  0  0     1
13  0  4  0  0  0  0  0  0     1
15  1  3  5  0  0  0  0  0     1
16  4  0  0  0  0  0  0  0     1
17  3  3  4  4  0  0  0  0     1

If the count needs to go into the next column, 10, use rename:

# convert to str if the column labels are strings
last_col = str(df.columns.astype(int).max() + 1) 
print (last_col) 
10 

print (df.reset_index() 
     .groupby(df.columns.difference(['index']).tolist())['index'] 
     .agg(['first', 'size']) 
     .reset_index() 
     .set_index(['first']) 
     .sort_index() 
     .rename_axis(None) 
     .rename(columns={'size':last_col})) 

    2  3  4  5  6  7  8  9  10
0   0  0  0  0  0  0  0  0   2
1   2  0  0  0  0  0  0  0   2
2   2  4  3  4  1  1  4  4   1
3   4  3  4  0  0  0  0  0   2
4   2  3  4  3  4  0  0  0   1
5   5  0  0  0  0  0  0  0   3
6   4  5  0  0  0  0  0  0   1
7   1  1  4  0  0  0  0  0   1
10  3  3  4  3  5  5  5  0   1
11  5  4  0  0  0  0  0  0   1
13  0  4  0  0  0  0  0  0   1
15  1  3  5  0  0  0  0  0   1
16  4  0  0  0  0  0  0  0   1
17  3  3  4  4  0  0  0  0   1
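For comparison, here is a sketch of a different route to the same table. It is not the method used above: it attaches the group size to every row with transform('size') and then keeps only the first occurrence of each row with duplicated, so the original index never has to be moved into a column. It assumes the column labels are the strings '2' to '9'; the names rows, cols, counts and result are only illustrative.

```python
import pandas as pd

# The frame from the question, rebuilt row by row.
rows = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [2, 0, 0, 0, 0, 0, 0, 0],
    [2, 4, 3, 4, 1, 1, 4, 4],
    [4, 3, 4, 0, 0, 0, 0, 0],
    [2, 3, 4, 3, 4, 0, 0, 0],
    [5, 0, 0, 0, 0, 0, 0, 0],
    [4, 5, 0, 0, 0, 0, 0, 0],
    [1, 1, 4, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [4, 3, 4, 0, 0, 0, 0, 0],
    [3, 3, 4, 3, 5, 5, 5, 0],
    [5, 4, 0, 0, 0, 0, 0, 0],
    [5, 0, 0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0, 0, 0, 0],
    [2, 0, 0, 0, 0, 0, 0, 0],
    [1, 3, 5, 0, 0, 0, 0, 0],
    [4, 0, 0, 0, 0, 0, 0, 0],
    [3, 3, 4, 4, 0, 0, 0, 0],
    [5, 0, 0, 0, 0, 0, 0, 0],
]
cols = [str(c) for c in range(2, 10)]  # assumed string labels '2'..'9'
df = pd.DataFrame(rows, columns=cols)

# Size of each group of identical rows, aligned to the original index.
counts = df.groupby(cols)[cols[0]].transform('size')

# Add the count as column '10', then keep the first row of each group.
result = df.assign(**{'10': counts})[~df.duplicated()]
print(result)
```

The surviving index labels are those of the first occurrence of each distinct row, which matches the desired output in the question.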
+0

Thank you, that worked great. – kPow989

+0

Glad I could help! – jezrael