2017-07-09 107 views

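Removing values that occur more than once in a Pandas DataFrame

How can I get, in Python, the ImgFileNames whose HashCodes value occurs more than once? Note: keep only the first occurrence and drop the rest, wherever the value appears (in the middle, at the end, or anywhere else).

I have a DataFrame like this: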

ImgFileName          HashCodes
Img_0001 - Copy.tif  162a47470f021a60
Img_0001.tif         162a47470f021a60
Img_0002.tif         1b5b5b1aa638dac8
Img_0003.tif         adadadadadadadad
Img_0004.tif         adadadadadadadad
Img_0005 - Copy.tif  a5b8648c8c666670
Img_0005.tif         a5b8648c8c666670
Img_0006.tif         71b392da6a699392
Img_0007.tif         71b392da6a699392
Img_0008.tif         b1b1f2fa6bf97292
Img_0009.tif         86e82ae4c8b6c9c9
Img_0010 - Copy.tif  86e8aae4c8b6c9c9
Img_0010.tif         86e8aae4c8b6c9c9

And the output I want is:

ImgFileName          HashCodes
Img_0001 - Copy.tif  162a47470f021a60
Img_0003.tif         adadadadadadadad
Img_0005 - Copy.tif  a5b8648c8c666670
Img_0006.tif         71b392da6a699392
Img_0009.tif         86e82ae4c8b6c9c9

See [pandas.DataFrame.drop_duplicates](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) – tarashypka
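For context on that suggestion, here is a minimal sketch of what `drop_duplicates` does with the question's data. The frame is rebuilt by hand from the table above; the variable name `deduped` is only illustrative. Note that `drop_duplicates` on its own also keeps hashes that occur just once, which is not quite what the question asks for.

```python
import pandas as pd

# Data copied from the question's table.
df = pd.DataFrame({
    "ImgFileName": [
        "Img_0001 - Copy.tif", "Img_0001.tif", "Img_0002.tif",
        "Img_0003.tif", "Img_0004.tif", "Img_0005 - Copy.tif",
        "Img_0005.tif", "Img_0006.tif", "Img_0007.tif", "Img_0008.tif",
        "Img_0009.tif", "Img_0010 - Copy.tif", "Img_0010.tif",
    ],
    "HashCodes": [
        "162a47470f021a60", "162a47470f021a60", "1b5b5b1aa638dac8",
        "adadadadadadadad", "adadadadadadadad", "a5b8648c8c666670",
        "a5b8648c8c666670", "71b392da6a699392", "71b392da6a699392",
        "b1b1f2fa6bf97292", "86e82ae4c8b6c9c9", "86e8aae4c8b6c9c9",
        "86e8aae4c8b6c9c9",
    ],
})

# drop_duplicates keeps the first row for every distinct hash,
# including hashes that occur only once (e.g. Img_0002.tif).
deduped = df.drop_duplicates(subset="HashCodes", keep="first")
print(deduped)
```

Because unique hashes survive here, an extra filter on `duplicated(..., keep=False)` is still needed to restrict the result to hashes that repeat.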

Answers

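You need boolean indexing with duplicated. The first mask (keep=False) selects every row whose hash occurs more than once; the second mask decides which of those rows to return. With the default keep='first', the second mask leaves out the first occurrence of each hash and returns the later ones: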

df = df[df.duplicated('HashCodes', keep=False) & df.duplicated('HashCodes')]
print(df)
     ImgFileName         HashCodes
1   Img_0001.tif  162a47470f021a60
4   Img_0004.tif  adadadadadadadad
6   Img_0005.tif  a5b8648c8c666670
8   Img_0007.tif  71b392da6a699392
12  Img_0010.tif  86e8aae4c8b6c9c9
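To see what each `keep` option flags, here is a small sketch on made-up data (the three-row frame and the `masks` name are hypothetical, not from the question). It prints the three boolean masks side by side.

```python
import pandas as pd

# Hypothetical three-row frame: hash "x" occurs twice, "y" once.
df = pd.DataFrame({
    "ImgFileName": ["a.tif", "b.tif", "c.tif"],
    "HashCodes": ["x", "y", "x"],
})

# One column per keep option; True means "flagged as a duplicate".
masks = pd.DataFrame({
    "keep=False": df.duplicated("HashCodes", keep=False),
    "keep='first'": df.duplicated("HashCodes"),
    "keep='last'": df.duplicated("HashCodes", keep="last"),
})
print(masks)
```

Every row flagged with `keep='first'` or `keep='last'` is also flagged with `keep=False`, so combining them with `&` gives the same rows as the second mask alone.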

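Or, with keep='last' in the second mask, which leaves out the last occurrence of each hash and returns the earlier ones: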

df = df[df.duplicated('HashCodes', keep=False) & df.duplicated('HashCodes', keep='last')]
print(df)
            ImgFileName         HashCodes
0   Img_0001 - Copy.tif  162a47470f021a60
3          Img_0003.tif  adadadadadadadad
5   Img_0005 - Copy.tif  a5b8648c8c666670
7          Img_0006.tif  71b392da6a699392
11  Img_0010 - Copy.tif  86e8aae4c8b6c9c9
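In the question's data every repeated hash occurs exactly twice, so keep='last' returns exactly the first occurrences. If a hash could occur three or more times, keep='last' would return all but the last row. A possible variant for that case, sketched on hypothetical data (the file names, hashes, and variable names below are made up), negates the default mask instead:

```python
import pandas as pd

# Hypothetical data: "aaaa" occurs three times, "bbbb" once, "cccc" twice.
df = pd.DataFrame({
    "ImgFileName": ["a1.tif", "a2.tif", "b1.tif", "a3.tif", "c1.tif", "c2.tif"],
    "HashCodes": ["aaaa", "aaaa", "bbbb", "aaaa", "cccc", "cccc"],
})

# True for every row whose hash occurs more than once.
in_dup_group = df.duplicated("HashCodes", keep=False)
# True for the first row of each distinct hash.
is_first = ~df.duplicated("HashCodes")

# Only the first occurrence of each repeated hash.
first_of_dupes = df[in_dup_group & is_first]
print(first_of_dupes)

# For comparison: keep='last' returns every occurrence except the last.
all_but_last = df[in_dup_group & df.duplicated("HashCodes", keep="last")]
print(all_but_last)
```

With two copies per hash both selections agree; they differ only once a hash appears three or more times.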

Thank you very much, jezrael. –


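Glad I could help! If my answer was helpful, don't forget to [accept](http://meta.stackexchange.com/a/5235/295067) it: click the check mark ('✓') next to the answer to toggle it from greyed out to filled in. Thanks. – jezrael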
