2014-07-10 24 views
2

Pandas: drop duplicates if the reverse is present between two columns

I have a dataset with 2 columns, similar to the one below:

InteractorA InteractorB 
AGAP028204 AGAP005846 
AGAP028204 AGAP003428 
AGAP028200 AGAP011124 
AGAP028200 AGAP004335 
AGAP028200 AGAP011356 
AGAP028194 AGAP008414 

I am using pandas, and I want to remove rows that appear twice but with the two values reversed. That is, to go from this:

InteractorA InteractorB 
AGAP002741 AGAP008026 
AGAP008026 AGAP002741 

to this:

InteractorA InteractorB 
AGAP002741 AGAP008026 

since for all intents and purposes the two rows are the same.

Is there a built-in way to handle this?

Answers

3

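I ended up writing a hacky script that iterates over the rows, builds a key from the two values, checks whether that key or its reverse has already been seen, and records the row index for dropping if so.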

import pandas as pd

checklist = []
indexes_to_drop = []

interactions = pd.read_csv('original_interactions.txt', delimiter='\t')

for index, row in interactions.iterrows():
    check_string = row['InteractorA'] + row['InteractorB']
    check_string_rev = row['InteractorB'] + row['InteractorA']
    # Test each orientation on its own: `(a or b) in checklist` only tests `a`.
    if check_string in checklist or check_string_rev in checklist:
        indexes_to_drop.append(index)
    # Remember both orientations so later rows match either way round.
    checklist.append(check_string)
    checklist.append(check_string_rev)

# iterrows() yields index labels, so drop by label directly.
no_dups = interactions.drop(indexes_to_drop)

print(no_dups.shape)

no_dups.to_csv('no_duplicates.txt', sep='\t', index=False)

2017 edit: a few years on and with a bit more experience, here is a more elegant solution for anyone looking for something similar:

In [8]: df 
Out[8]: 
    InteractorA InteractorB 
0 AGAP028204 AGAP005846 
1 AGAP028204 AGAP003428 
2 AGAP028200 AGAP011124 
3 AGAP028200 AGAP004335 
4 AGAP028200 AGAP011356 
5 AGAP028194 AGAP008414 
6 AGAP002741 AGAP008026 
7 AGAP008026 AGAP002741 

In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1) 

In [19]: df 
Out[19]: 
    InteractorA InteractorB   check_string 
0 AGAP028204 AGAP005846 AGAP005846AGAP028204 
1 AGAP028204 AGAP003428 AGAP003428AGAP028204 
2 AGAP028200 AGAP011124 AGAP011124AGAP028200 
3 AGAP028200 AGAP004335 AGAP004335AGAP028200 
4 AGAP028200 AGAP011356 AGAP011356AGAP028200 
5 AGAP028194 AGAP008414 AGAP008414AGAP028194 
6 AGAP002741 AGAP008026 AGAP002741AGAP008026 
7 AGAP008026 AGAP002741 AGAP002741AGAP008026 

In [20]: df.drop_duplicates('check_string') 
Out[20]: 
    InteractorA InteractorB   check_string 
0 AGAP028204 AGAP005846 AGAP005846AGAP028204 
1 AGAP028204 AGAP003428 AGAP003428AGAP028204 
2 AGAP028200 AGAP011124 AGAP011124AGAP028200 
3 AGAP028200 AGAP004335 AGAP004335AGAP028200 
4 AGAP028200 AGAP011356 AGAP011356AGAP028200 
5 AGAP028194 AGAP008414 AGAP008414AGAP028194 
6 AGAP002741 AGAP008026 AGAP002741AGAP008026 
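For larger tables, the same order-independent key can be built without a row-wise `apply` and without adding a helper column. The following is only a minimal sketch, assuming the column names from the question; the helper name `drop_reversed_pairs` is hypothetical, and sorting each row with NumPy is a swapped-in technique, not part of the answer above.

```python
import numpy as np
import pandas as pd


def drop_reversed_pairs(df, cols=("InteractorA", "InteractorB")):
    """Keep the first occurrence of each unordered pair (hypothetical helper)."""
    # Sort the two values within each row so (A, B) and (B, A) look identical.
    ordered = np.sort(df[list(cols)].to_numpy(dtype=str), axis=1)
    # Flag rows whose sorted pair has already appeared earlier in the frame.
    is_repeat = pd.DataFrame(ordered, index=df.index).duplicated()
    return df[~is_repeat.to_numpy()]


df = pd.DataFrame({
    "InteractorA": ["AGAP028204", "AGAP002741", "AGAP008026"],
    "InteractorB": ["AGAP005846", "AGAP008026", "AGAP002741"],
})
result = drop_reversed_pairs(df)
print(result)
```

Because the key is kept as two separate columns rather than one concatenated string, two different pairs cannot collide even if the identifiers have varying lengths.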
0

I think the following will work:

In [37]: 
import pandas as pd 
import io 
temp = """InteractorA InteractorB 
AGAP028204 AGAP005846 
AGAP028204 AGAP003428 
AGAP028200 AGAP011124 
AGAP028200 AGAP004335 
AGAP028200 AGAP011356 
AGAP028194 AGAP008414 
AGAP002741 AGAP008026 
AGAP008026 AGAP002741""" 
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
df 
Out[37]: 
    InteractorA InteractorB 
0 AGAP028204 AGAP005846 
1 AGAP028204 AGAP003428 
2 AGAP028200 AGAP011124 
3 AGAP028200 AGAP004335 
4 AGAP028200 AGAP011356 
5 AGAP028194 AGAP008414 
6 AGAP002741 AGAP008026 
7 AGAP008026 AGAP002741 

So, I downloaded your data and realised I had misunderstood what you wanted; the following should work now:

# first get the values that are unique 
In [72]: 
df1 = df[~df.InteractorA.isin(df.InteractorB)] 
df1.shape 
Out[72]: 
(2386, 2) 

Now we want to get the duplicated rows, but take only the first value of each:

In [74]: 

df2 = df[df.InteractorA.isin(df.InteractorB)] 
df2 = df2.groupby('InteractorA').first().reset_index() 
df2.shape 
Out[74]: 
(3074, 2) 

Now concatenate the 2 dataframes:

In [75]: 

merged = pd.concat([df1, df2], ignore_index=True) 
merged.shape 
Out[75]: 
(5460, 2) 

I think this is correct now.

+0

This seems to get rid of some of them, but not all. For example, I still have 'AGAP007031 \t AGAP010265' and 'AGAP010265 \t AGAP007031' in my dataset. – BML91

+0

It still works for me. Could you post more data so I can see where this fails? – EdChum

+0

OK, the dataset is here - https://dl.dropboxusercontent.com/u/6037105/interactions_unique.txt – BML91
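For anyone debugging a case like the one in these comments: `isin` tests whether a value occurs anywhere in the other column, not whether the partner in the same row matches, which may be why some mirrored pairs survive. A rough sketch of a pairwise check is a self-merge against a copy with the column names swapped. The tiny frame below is made up for illustration, and the names `swapped` and `mirrored` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "InteractorA": ["AGAP007031", "AGAP010265", "AGAP028204"],
    "InteractorB": ["AGAP010265", "AGAP007031", "AGAP005846"],
})

# Swap the column names so that a row (A, B) lines up with a row (B, A).
swapped = df.rename(columns={"InteractorA": "InteractorB",
                             "InteractorB": "InteractorA"})

# An inner merge on both columns keeps only rows whose mirror image exists.
mirrored = df.merge(swapped, on=["InteractorA", "InteractorB"])
print(mirrored)
```

This only lists the rows that have a reversed partner; it does not remove anything. Note that a row whose two values are equal would match itself under this check.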

0

This is the cleanest solution I have managed to get working for my own purposes.

Create a column that holds the two values of each row as a sorted list:

df['sorted_row'] = [sorted([a,b]) for a,b in zip(df.InteractorA, df.InteractorB)] 

Duplicates cannot be dropped on lists, so the column has to be converted to strings:

df['sorted_row'] = df['sorted_row'].astype(str) 

Drop the duplicates:

df.drop_duplicates(subset=['sorted_row'], inplace=True)
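The three steps above can also be sketched without the string conversion and without leaving a helper column in the frame, by using sorted tuples as keys. This is a variation under the assumption that the column names match the question; `keys` and `deduped` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "InteractorA": ["AGAP002741", "AGAP008026", "AGAP028194"],
    "InteractorB": ["AGAP008026", "AGAP002741", "AGAP008414"],
})

# Tuples are hashable, so no conversion to str is needed before comparing.
keys = pd.Series(
    [tuple(sorted(pair)) for pair in zip(df.InteractorA, df.InteractorB)],
    index=df.index,
)

# duplicated() marks every repeat after the first occurrence of a key.
deduped = df[~keys.duplicated()]
print(deduped)
```

Unlike `drop_duplicates(..., inplace=True)`, this leaves `df` untouched and returns the filtered rows as a new frame.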