Find duplicates across multiple columns and drop rows - pandas

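If a name appears in any subsequent row, I want to drop that row. Mainly, I don't know how to get the index where the duplicate was found and then use that index to drop the row from the df.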

import pandas as pd
data = {'interviewer': ['Jason', 'Molly', 'Jermaine', 'Jake', 'Amy'],
        'candidate': ['Bob', 'Jermaine', 'Ahmed', 'Karl', 'Molly'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}

df = pd.DataFrame(data)
#names = pd.unique(df[['interviewer', 'candidate']].values.ravel()).tolist()

mt = []

for i, c in zip(df.interviewer, df.candidate):
    print(i, c)
    if i not in mt:
        if c not in mt:
            # mt collects DataFrame slices here, not names or index labels
            mt.append(df.loc[(df.interviewer == i) & (df.candidate == c)])
    else:
        continue

My idea was to use mt as a list to pass to df.drop and drop the rows by index. The result I want is for Molly and Jermaine not to appear again at index 2 or 4, i.e. df.drop([2,4], inplace=True).

EDITED

I figured out a way to build the list of indices that I want to pass to drop:

import pandas as pd
data = {'interviewer': ['Jason', 'Molly', 'Jermaine', 'Jake', 'Amy'],
        'candidate': ['Bob', 'Jermaine', 'Ahmed', 'Karl', 'Molly'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}

df = pd.DataFrame(data)
#print(df)
counter = -1    # position of the current row
bad_rows = []   # positions of rows containing a name seen earlier
names = []      # every name seen so far
for i, c in zip(df.interviewer, df.candidate):
    print(i, c)

    counter += 1
    print(counter)
    if i not in names:
        names.append(i)
    else:
        bad_rows.append(counter)
    if c not in names:
        names.append(c)
    else:
        bad_rows.append(counter)

#print(df.drop(bad_rows))

But there must be a smarter way to do this, perhaps something with itertools along the lines of @Ami_Tavory's answer?

Source

2016-07-24 noblerthanoedipus

You might want to take a look at this: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html – albert

I tried 'df.drop_duplicates(['candidate', 'interviewer'])', but that only works when both match. I am looking for a 'when a name is found, drop the row' function. – noblerthanoedipus
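To illustrate the point made in the comment above, here is a minimal sketch using the sample data from the question. The variable name deduped is only illustrative. With a subset of two columns, drop_duplicates compares the pair of values in each row, so a name that moves from one column to the other is not treated as a duplicate.

```python
import pandas as pd

data = {'interviewer': ['Jason', 'Molly', 'Jermaine', 'Jake', 'Amy'],
        'candidate': ['Bob', 'Jermaine', 'Ahmed', 'Karl', 'Molly'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)

# Rows count as duplicates only if BOTH columns match an earlier row.
deduped = df.drop_duplicates(subset=['candidate', 'interviewer'])

# Every (candidate, interviewer) pair is unique here, so nothing is dropped.
print(len(df), len(deduped))
```

Both lengths come out the same for this data, which is why a cross-column check is needed instead.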

(At the time this answer was written, there were some discrepancies between the verbal description and the code example.)

You can use isin to check whether an item appears in a different column, like so:

In [5]: df.candidate.isin(df.interviewer) 
Out[5]: 
0 False 
1  True 
2 False 
3 False 
4  True 
Name: candidate, dtype: bool

So, you can do

df[~df.candidate.isin(df.interviewer)]
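As a self-contained sketch of the isin approach, the snippet below rebuilds the sample data from the question and applies the mask. The names mask and filtered are illustrative and not part of the original answer.

```python
import pandas as pd

data = {'interviewer': ['Jason', 'Molly', 'Jermaine', 'Jake', 'Amy'],
        'candidate': ['Bob', 'Jermaine', 'Ahmed', 'Karl', 'Molly'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)

# True where the candidate's name appears anywhere in the interviewer column.
mask = df['candidate'].isin(df['interviewer'])

# Keep only the rows where the candidate is not also an interviewer.
filtered = df[~mask]
print(filtered.index.tolist())  # [0, 2, 3]
```

Note that this check ignores row order: it removes rows 1 and 4 whether the matching interviewer row comes before or after them.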

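Note that this matches your original code, not your specification of subsequent rows. If you want to drop a row only when the match is in a subsequent row, I would go with itertools, something like this: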

In [17]: import itertools

In [18]: bads = [i for ((i, cn), (j, iv)) in itertools.product(enumerate(df.candidate), enumerate(df.interviewer)) if j >= i and cn == iv]

In [19]: df[~df.index.isin(bads)] 
Out[19]: 
  candidate interviewer  reports  year
0       Bob       Jason        4  2012
2     Ahmed    Jermaine       31  2013
3      Karl        Jake        2  2014
4     Molly         Amy        3  2014

Alternatively, if you want to drop the subsequent row instead, just change this to

In [18]: bads = [j for ((i, cn), (j, iv)) in itertools.product(enumerate(df.candidate), enumerate(df.interviewer)) if j >=i and cn == iv]
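The two variants above can be sketched as one runnable script. This is only an illustration under two assumptions: the DataFrame has the default 0..n-1 index (enumerate yields positions, not index labels), and the names matches, earlier_rows and later_rows are invented here for readability.

```python
import itertools

import pandas as pd

data = {'interviewer': ['Jason', 'Molly', 'Jermaine', 'Jake', 'Amy'],
        'candidate': ['Bob', 'Jermaine', 'Ahmed', 'Karl', 'Molly'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)

# Pair every (position, candidate) with every (position, interviewer).
pairs = itertools.product(enumerate(df['candidate']),
                          enumerate(df['interviewer']))

# Keep pairs where a candidate reappears as interviewer in the same or a later row.
matches = [(i, j) for (i, cn), (j, iv) in pairs if j >= i and cn == iv]

earlier_rows = [i for i, _ in matches]  # rows holding the candidate
later_rows = [j for _, j in matches]    # rows holding the later interviewer

print(df[~df.index.isin(earlier_rows)].index.tolist())  # [0, 2, 3, 4]
print(df[~df.index.isin(later_rows)].index.tolist())    # [0, 1, 3, 4]
```

With this data the only match is Jermaine (candidate in row 1, interviewer in row 2). Molly in row 4 is not caught, because the condition only looks for a candidate who later shows up as an interviewer, not the other way round.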

Source

2016-07-24 19:36:27

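Thanks, but that's not what I was looking for; the code in the original question was bad. I have now come up with an answer, so please advise if there is a shortcut via itertools. – noblerthanoedipus

@noblerthanoedipus Please see the update. –

That drops the first instance of 'Molly' at index = 1, whereas it should drop index = 4, the second instance of 'Molly', i.e. the subsequent occurrence. I am going for the same idea as 'pd.drop_duplicates([subset], keep='first')'. – noblerthanoedipus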

I made a function for what I wanted to do. Using df.index makes it safe to use with any numeric index.

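def drop_dup_rows(df):
    names = set()   # names seen in the rows kept so far
    bad_rows = []   # index labels of the rows to drop
    for i, c, ind in zip(df.interviewer, df.candidate, df.index):
        if i in names or c in names:
            bad_rows.append(ind)
        else:
            names.update([i, c])
    # Drop once after the loop instead of mutating df while iterating over it
    df.drop(bad_rows, inplace=True)
    return df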
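The function above modifies the DataFrame that is passed in. As a hedged alternative, here is a sketch of a variant that leaves the input untouched and accepts any list of columns. The name drop_rows_with_seen_names and the columns parameter are hypothetical, not from the original post.

```python
import pandas as pd


def drop_rows_with_seen_names(df, columns):
    """Return a copy of df without rows that repeat a value already seen
    in any of `columns` in an earlier kept row (hypothetical helper)."""
    seen = set()
    keep = []
    # Plain tuples of the selected columns, one per row, in row order.
    for values in df[columns].itertuples(index=False, name=None):
        if seen.intersection(values):
            keep.append(False)      # at least one value was seen before
        else:
            keep.append(True)
            seen.update(values)     # only kept rows contribute names
    return df.loc[keep]


data = {'interviewer': ['Jason', 'Molly', 'Jermaine', 'Jake', 'Amy'],
        'candidate': ['Bob', 'Jermaine', 'Ahmed', 'Karl', 'Molly'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)

result = drop_rows_with_seen_names(df, ['interviewer', 'candidate'])
print(result.index.tolist())  # [0, 1, 3]
```

Using a set keeps the membership test cheap, and building a boolean mask by position means the sketch does not depend on the index labels being unique.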

Source

2016-07-25 03:39:28 noblerthanoedipus
