How to identify rows and columns based on the top-K values in a pandas DataFrame

I created a DataFrame like this:

import pandas as pd 
d = {'gene' : ['foo', 'qux', 'bar', 'bin'], 
    'one' : [1., 2., 3., 1.], 
    'two' : [4., 3., 2., 1.], 
    'three' : [1., 2., 20., 1.], 
    } 

df = pd.DataFrame(d) 

# # List top 5 values 
# ndf = df[['one','two','three']] 
# top = ndf.values.flatten().tolist() 
# top.sort(reverse=True) 
# top[0:5] 
# [20.0, 4.0, 3.0, 3.0, 2.0]

It looks like this:

In [58]: df
Out[58]:
  gene  one  three  two
0  foo    1      1    4
1  qux    2      2    3
2  bar    3     20    2
3  bin    1      1    1

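What I want to do is pool all values from the second column onwards, take the top 5 values, and identify the row/column each selected value belongs to.

The desired dictionary would then look like this: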

{'foo':['two'], 
'qux':['one','two','three'], 
'bar':['one','two','three']}

How can I do this?

Source

2016-03-01 neversaint

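You can stack the DataFrame, then take the 5 largest values (I used rank because it seems you want to include all ties), and finally group by gene to get the dictionary.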

In [2]: df_stack = df.set_index('gene').stack() 

In [3]: df_top = df_stack.loc[df_stack.rank(method='min', ascending=False) <= 5]

In [4]: print(df_top.reset_index(0).groupby('gene').groups)
{'qux': ['one', 'three', 'two'], 'foo': ['two'], 'bar': ['one', 'three', 'two']}
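A minimal self-contained sketch of the same stack-and-rank idea, assuming Python 3 and a recent pandas. The function name `top_k_columns` is hypothetical and only used for illustration. It builds plain lists with a loop instead of relying on `.groups`, whose values are index objects rather than lists.

```python
import pandas as pd

df = pd.DataFrame({
    'gene': ['foo', 'qux', 'bar', 'bin'],
    'one': [1., 2., 3., 1.],
    'two': [4., 3., 2., 1.],
    'three': [1., 2., 20., 1.],
})


def top_k_columns(frame, k=5):
    """Map each gene to the columns holding one of the k largest values (ties kept)."""
    # Long format: one value per (gene, column) pair.
    stacked = frame.set_index('gene').stack()
    # method='min' gives tied values the same (lowest) rank,
    # so every value tied at the cut-off is kept.
    top = stacked[stacked.rank(method='min', ascending=False) <= k]
    # Collect, per gene, the column labels that survived the filter.
    result = {}
    for gene, column in top.index:
        result.setdefault(gene, []).append(column)
    return result


result = top_k_columns(df)
print(result)
```

Because ties at the cut-off are kept, the dictionary can reference more than `k` cells; here the three cells equal to 2.0 all qualify.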

Source

2016-03-01 12:05:35 Colin

Here is a working, though not very clean, pandas solution.

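# `top` is the sorted list built in the commented-out snippet in the question
top5 = top[0:5]
dt = df.set_index('gene').T
d = {}
for col in dt.columns:
    # labels (original column names) whose value is among the top 5
    idx_list = dt[col][dt[col].isin(top5)].index.tolist()
    if idx_list:
        d[col] = idx_list
d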

which returns

{'bar': ['one', 'three', 'two'], 
'foo': ['two'], 
'qux': ['one', 'three', 'two']}

Source

2016-03-01 09:27:35 tworec

# Get the n-th largest value in the DataFrame (duplicates are counted).
n = 5
threshold = pd.Series([col for row in df.iloc[:, 1:].values
                       for col in row]).nlargest(n).iat[-1]

d = {}
for g, row in df.iloc[:, 1:].iterrows():
    # Keep the column names whose value reaches the threshold.
    vals = row[row.ge(threshold)].index.tolist()
    if vals:
        d[df.gene.iat[g]] = vals

>>> d 
{'bar': ['one', 'three', 'two'], 
'foo': ['two'], 
'qux': ['one', 'three', 'two']}
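As a small illustration of why a threshold taken from `nlargest` can select more than `n` cells, here is a sketch on the flattened values of the example frame. The variable names are mine, not part of the answer.

```python
import pandas as pd

# Numeric cells of the example frame, flattened row by row.
values = pd.Series([1., 4., 1., 2., 3., 2., 3., 2., 20., 1., 1., 1.])

# nlargest counts duplicates: the two 3.0 entries occupy two of the five slots.
top5 = values.nlargest(5).tolist()
threshold = top5[-1]

# Every cell at or above the threshold is selected, including ties.
n_selected = int((values >= threshold).sum())
print(top5, threshold, n_selected)
```

The threshold equals the fifth-largest value, but three cells share that value, so the comparison keeps seven cells rather than five.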

Source

2016-03-01 09:27:46 Alexander

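Before starting, I set the gene column as the index. This makes it easier to isolate the numeric columns (as you did with ndf) and also easier to return a dictionary later: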

df.set_index('gene', inplace=True)

I then proceed in two steps.

First, get the fifth-largest value with numpy, in the spirit of this answer:

import numpy as np 
a = df.to_numpy().flatten()  # as_matrix() was removed in pandas 1.0
# kth=4 puts the fifth-smallest element of -a at index 4
n_max = -np.partition(-a, 4)[4]

Using partition avoids sorting the whole array (as you do with top), which can be expensive when the array is large.
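To make the selection step concrete, here is a short sketch comparing a full sort with `np.partition` on the example values. It assumes only numpy; the variable names are illustrative.

```python
import numpy as np

# Numeric cells of the example frame, flattened.
a = np.array([1., 4., 1., 2., 3., 2., 3., 2., 20., 1., 1., 1.])
k = 5

# Full sort: the k-th largest value sits at position k-1 of the descending order.
kth_by_sort = np.sort(a)[::-1][k - 1]

# Partial selection: np.partition only guarantees that the element at the
# requested index is in its sorted position, so we request index k-1 and
# read that same index.
kth_by_partition = -np.partition(-a, k - 1)[k - 1]

print(kth_by_sort, kth_by_partition)
```

Elements on either side of the requested index are not ordered among themselves, which is why the index passed to `np.partition` and the index read back should be the same.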

Second, apply a lambda function to retrieve the column names:

df.apply(lambda row: row.index[row >= n_max].tolist(), axis=1).to_dict()

Note that each row is a Series, so the index of a row is the set of DataFrame columns. Result:

{'bar': ['one', 'three', 'two'], 
'bin': [], 
'foo': ['two'], 
'qux': ['one', 'three', 'two']}
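The point about row indexes can be checked directly. The sketch below also drops genes with an empty list, which is an optional extra step (not part of the answer) to match the dictionary requested in the question. The value of `n_max` is hard-coded to the fifth-largest value of the example data.

```python
import pandas as pd

df = pd.DataFrame({
    'gene': ['foo', 'qux', 'bar', 'bin'],
    'one': [1., 2., 3., 1.],
    'two': [4., 3., 2., 1.],
    'three': [1., 2., 20., 1.],
}).set_index('gene')

n_max = 2.0  # fifth-largest value of the example data

# A row taken from a DataFrame is a Series indexed by the column labels.
first_row = df.iloc[0]
same_labels = list(first_row.index) == list(df.columns)

# For each row, keep the column labels whose value reaches n_max.
selected = df.apply(lambda row: row.index[row >= n_max].tolist(), axis=1).to_dict()

# Optional: drop genes with no qualifying column.
non_empty = {gene: cols for gene, cols in selected.items() if cols}
print(non_empty)
```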

Source

2016-03-01 09:38:57 IanS
