2016-02-13

Find out which value is the duplicate in a Python pandas data structure

I have a CSV file. It looks like this:

name,id, 
AAA,1111, 
BBB,2222, 
CCC,3333, 
DDD,2222, 

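I want to know whether the id column contains any duplicates and, if it does, which values are duplicated. In this case, the answer is 2222.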

I already have code that tells me whether a duplicate exists. Here it is:

import pandas as pd 
csv_file = 'C:/test.csv' 
df = pd.read_csv(csv_file) 
df['id'].duplicated().any() 
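
For anyone who wants to reproduce the check without a file on disk, here is a minimal sketch. It assumes the CSV content can be fed from an in-memory string (the `csv_text` variable is a hypothetical stand-in for C:/test.csv, with the trailing commas left out), and it uses `print()` with a single argument so it runs on Python 3 as well as 2.7.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for C:/test.csv (trailing commas omitted).
csv_text = u"name,id\nAAA,1111\nBBB,2222\nCCC,3333\nDDD,2222\n"
df = pd.read_csv(io.StringIO(csv_text))

# duplicated() marks every id that repeats an earlier one;
# any() collapses that boolean Series into a single yes/no answer.
has_duplicates = bool(df['id'].duplicated().any())
print(has_duplicates)
```

This only answers whether a duplicate exists, not which value it is.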

The question is: how do I find out which values are duplicated?

I am using Python 2.7 and pandas.


Possible duplicate of [Check for duplicates in a python panda data structure](http://stackoverflow.com/questions/35376308/check-for-duplicates-in-a-python-panda-data-structure)

Answers


I think you can use duplicated (keep is omitted here because keep='first' is the default). Or, if you need the values as a plain list, add tolist:

print df['id'][df.duplicated(subset=['id'])] 
3 2222 
Name: id, dtype: int64 

print df['id'][df.duplicated(subset=['id'])].tolist() 
[2222] 
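
One caveat worth sketching: with the default keep='first', an id that occurs three or more times is reported once per repeat. The example below assumes a hypothetical extra row (EEE, 2222) that is not in the original file, and shows how `unique()` can collapse the result to one entry per duplicated value.

```python
import pandas as pd

# Sample data from the question plus one hypothetical extra row (EEE).
df = pd.DataFrame({
    'name': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE'],
    'id': [1111, 2222, 3333, 2222, 2222],
})

# keep='first' flags every repeat after the first occurrence,
# so 2222 appears twice in this list.
repeats = df.loc[df['id'].duplicated(), 'id'].tolist()

# unique() keeps one entry per duplicated value.
duplicated_ids = df.loc[df['id'].duplicated(), 'id'].unique().tolist()

print(repeats)
print(duplicated_ids)
```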

You can inspect what duplicated returns for each keep option:

print df.duplicated(subset=['id'], keep='first') 
0 False 
1 False 
2 False 
3  True 
dtype: bool 

print df.duplicated(subset=['id'], keep='last') 
0 False 
1  True 
2 False 
3 False 
dtype: bool 

print df.duplicated(subset=['id'], keep=False) 
0 False 
1  True 
2 False 
3  True 
dtype: bool 
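
To compare the three keep options side by side, the masks can be collected into one table. This is only a sketch: the column names `keep_first`, `keep_last` and `keep_false` are made up for illustration, and the DataFrame is built in memory rather than read from the CSV.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['AAA', 'BBB', 'CCC', 'DDD'],
    'id': [1111, 2222, 3333, 2222],
})

# One boolean column per keep option (column names are illustrative only).
flags = pd.DataFrame({
    'id': df['id'],
    'keep_first': df.duplicated(subset=['id'], keep='first'),
    'keep_last': df.duplicated(subset=['id'], keep='last'),
    'keep_false': df.duplicated(subset=['id'], keep=False),
})
print(flags)
```

keep='first' leaves the first occurrence unflagged, keep='last' leaves the last one unflagged, and keep=False flags every member of a duplicated group.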

If you need the duplicated rows themselves, use the boolean mask to subset the DataFrame:

print df[df.duplicated(subset=['id'], keep='first')] 
    name id 
3 DDD 2222 

print df[df.duplicated(subset=['id'], keep='last')] 
    name id 
1 BBB 2222 

print df[df.duplicated(subset=['id'], keep=False)] 
    name id 
1 BBB 2222 
3 DDD 2222 
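
If you also want to know how many times each duplicated id occurs, a different technique, `value_counts`, may be more convenient than `duplicated`. A minimal sketch, assuming the same sample data built in memory:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['AAA', 'BBB', 'CCC', 'DDD'],
    'id': [1111, 2222, 3333, 2222],
})

# Count the occurrences of each id, then keep only ids seen more than once.
counts = df['id'].value_counts()
dup_counts = counts[counts > 1].to_dict()
print(dup_counts)
```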

Use drop_duplicates to drop the duplicated rows:

print df.drop_duplicates(subset=['id'], keep='first') 
    name id 
0 AAA 1111 
1 BBB 2222 
2 CCC 3333 

print df.drop_duplicates(subset=['id'], keep='last') 
    name id 
0 AAA 1111 
2 CCC 3333 
3 DDD 2222 

print df.drop_duplicates(subset=['id'], keep=False) 
    name id 
0 AAA 1111 
2 CCC 3333 
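
As a quick sanity check on how the two methods relate, the sketch below (again with the sample data built in memory) compares `drop_duplicates` with selecting the rows that `duplicated` does not flag. With the same subset and keep arguments, the two selections should match.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['AAA', 'BBB', 'CCC', 'DDD'],
    'id': [1111, 2222, 3333, 2222],
})

# Rows kept by drop_duplicates ...
kept = df.drop_duplicates(subset=['id'], keep='first')

# ... versus rows whose duplicated() flag is False (mask inverted with ~).
masked = df[~df.duplicated(subset=['id'], keep='first')]

same = kept.equals(masked)
print(same)
```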

Why the downvote? Is something wrong? – jezrael