How to remove specific duplicate rows in a pandas DataFrame?

In this pandas DataFrame:

df = 

pos index data 
21  36 a,b,c 
21  36 a,b,c 
23  36 c,d,e 
25  36 f,g,h 
27  36 g,h,k 
29  39 a,b,c 
29  39 a,b,c 
31  39 . 
35  39 c,k 
36  41 g,h 
38  41 k,l 
39  41 j,k 
39  41 j,k

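I want to remove duplicate rows only within the same index group, and only when they occur in the head region of the sub-frame.

So, what I did: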

df_grouped = df.groupby(['index'], as_index=True)

Now,

for i, sub_frame in df_grouped:
    # remove one duplicate line in the head region if pos value is a repeat
    sub_frame.apply(lambda g: ...)

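I want to use this approach because some pos values repeat in the tail region, where they should not be removed.

Any suggestions?

Expected output: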

pos index data 
removed 
21  36 a,b,c 
23  36 c,d,e 
25  36 f,g,h 
27  36 g,h,k 
removed 
29  39 a,b,c 
31  39 . 
35  39 c,k 
36  41 g,h 
38  41 k,l 
39  41 j,k 
39  41 j,k

Source

2017-03-20 everestial007

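What about 'df.drop_duplicates()', as in http://stackoverflow.com/questions/23667369/drop-all-duplicate-row-in-python-pandas? – Craig

A simple drop function would work, but I only want to drop a duplicate when it is in the head region of the 'subframe' (grouped by index value). That is the main problem. – everestial007

@Craig: I just looked at that example, and it doesn't work here. After the groupby, I would have to address the rows within each 'subframe' (though there may be another way). Also, only one copy of the duplicate should be dropped, and only in the head region (top two rows) of the subframe. – everestial007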
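To make the objection in these comments concrete, here is a minimal sketch of what a plain `drop_duplicates()` does to the sample data. It assumes the grouping column is named `idx` (the name used in the answer, rather than `index` as in the question's table) and that the frame has the default integer row labels.

```python
import pandas as pd

# Sample data from the question; the grouping column is called 'idx' here.
df = pd.DataFrame({
    'pos': [21, 21, 23, 25, 27, 29, 29, 31, 35, 36, 38, 39, 39],
    'idx': [36, 36, 36, 36, 36, 39, 39, 39, 39, 41, 41, 41, 41],
    'data': ['a,b,c', 'a,b,c', 'c,d,e', 'f,g,h', 'g,h,k', 'a,b,c', 'a,b,c',
             '.', 'c,k', 'g,h', 'k,l', 'j,k', 'j,k'],
})

# A plain drop_duplicates() ignores where in the group the repeat occurs.
plain = df.drop_duplicates()

# Row labels that were removed.
dropped = df.index.difference(plain.index)
print(list(dropped))
```

Row 12 (pos 39, idx 41, `j,k`) sits in the tail of its group and is removed along with the two head-region duplicates, which is the behaviour the question wants to avoid.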

If it does not have to be done in a single apply statement, then this code removes only the duplicates in the head region:

import pandas as pd

data = {'pos': [21, 21, 23, 25, 27, 29, 29, 31, 35, 36, 38, 39, 39],
        'idx': [36, 36, 36, 36, 36, 39, 39, 39, 39, 41, 41, 41, 41],
        'data': ['a,b,c', 'a,b,c', 'c,d,e', 'f,g,h', 'g,h,k', 'a,b,c', 'a,b,c', '.', 'c,k', 'g,h', 'k,l', 'j,k', 'j,k']
}

df = pd.DataFrame(data)

accum = []
for i, sub_frame in df.groupby('idx'):
    # De-duplicate only the first two rows of each group; keep the rest untouched.
    accum.append(pd.concat([sub_frame.iloc[:2].drop_duplicates(), sub_frame.iloc[2:]]))

# Stitch the per-group pieces back into one frame.
df2 = pd.concat(accum)

print(df2)
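The loop above hard-codes the head size as 2. As a rough sketch, it could be wrapped in a helper so that the head size and grouping column become arguments. The function name `drop_head_duplicates` and its parameters are hypothetical, introduced only for illustration. Passing `sort=False` to `groupby` is an addition here so that groups come out in order of first appearance instead of sorted by key.

```python
import pandas as pd


def drop_head_duplicates(frame, key, head_size=2):
    """Hypothetical helper: de-duplicate only the first `head_size` rows of each group."""
    pieces = []
    # sort=False keeps groups in order of first appearance.
    for _, sub in frame.groupby(key, sort=False):
        head = sub.iloc[:head_size].drop_duplicates()
        tail = sub.iloc[head_size:]  # the tail is left untouched
        pieces.append(pd.concat([head, tail]))
    return pd.concat(pieces)


df = pd.DataFrame({
    'pos': [21, 21, 23, 25, 27, 29, 29, 31, 35, 36, 38, 39, 39],
    'idx': [36, 36, 36, 36, 36, 39, 39, 39, 39, 41, 41, 41, 41],
    'data': ['a,b,c', 'a,b,c', 'c,d,e', 'f,g,h', 'g,h,k', 'a,b,c', 'a,b,c',
             '.', 'c,k', 'g,h', 'k,l', 'j,k', 'j,k'],
})

result = drop_head_duplicates(df, 'idx')
print(result)
```

With the sample data, the repeated rows at the top of groups 36 and 39 are removed, while the repeated `j,k` rows at the end of group 41 both survive.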

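EDIT2: The first version of the chained command I posted was wrong and only worked for the sample data. This version is a more general solution that removes the duplicate rows as the OP requested: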

df.drop(df.groupby('idx')             # group by the index column
          .head(2)                    # select the first two rows of each group
          .duplicated()               # create a Series with True for duplicate rows
          .to_frame(name='duped')     # make the Series a dataframe
          .query('duped')             # select only the duplicate rows
          .index)                     # provide index of duplicated rows to drop
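Note that `df.drop(...)` returns a new DataFrame and leaves `df` unchanged, so the result needs to be assigned. An equivalent way to express the same idea, offered only as a sketch and not as part of the original answer, is a boolean mask built from `groupby(...).cumcount()`. It assumes the same `idx` column name and a head region of two rows.

```python
import pandas as pd

df = pd.DataFrame({
    'pos': [21, 21, 23, 25, 27, 29, 29, 31, 35, 36, 38, 39, 39],
    'idx': [36, 36, 36, 36, 36, 39, 39, 39, 39, 41, 41, 41, 41],
    'data': ['a,b,c', 'a,b,c', 'c,d,e', 'f,g,h', 'g,h,k', 'a,b,c', 'a,b,c',
             '.', 'c,k', 'g,h', 'k,l', 'j,k', 'j,k'],
})

# True for rows that are among the first two of their group.
in_head = df.groupby('idx').cumcount() < 2

# True for rows that repeat an earlier row in every column (including 'idx',
# so a match can only come from the same group).
is_dup = df.duplicated()

# Drop a row only when both conditions hold.
result = df[~(in_head & is_dup)]
print(result)
```

Because the mask is aligned with `df` row by row, this variant does not depend on the row labels being unique, which `df.drop` with a list of labels does.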

Source

2017-03-20 02:15:30 Craig
