2016-10-25 11 views

Answers

5

How about using Index.intersection()?

In [69]: df[df.columns.intersection(col_list)] 
Out[69]: 
    b c 
0 1.0 1.0 
1 1.0 1.0 

In [70]: df.columns 
Out[70]: Index(['a', 'b', 'c'], dtype='object') # <---------- Index 
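For anyone who wants to reproduce this without the question's frame at hand, here is a minimal self-contained sketch. The construction of `df` is an assumption (two rows of 1.0 in columns a, b, c, chosen to match the output shown above); only the `Index.intersection()` call comes from the answer itself.

```python
import pandas as pd

# Assumed reconstruction of the question's frame: 2 rows, columns a, b, c.
df = pd.DataFrame(1.0, index=range(2), columns=list("abc"))

# 'd' is not a column of df, so plain df[col_list] would raise a KeyError.
col_list = list("bcd")

# Keep only the labels that actually exist in df.columns.
existing = df.columns.intersection(col_list)
result = df[existing]
print(result)
```

Because the result of `intersection()` is itself an `Index`, it can be passed straight back into `df[...]`.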

Timing:

In [21]: df_ = pd.concat([df] * 10**5, ignore_index=True) 

In [22]: df_.shape 
Out[22]: (200000, 3) 

In [23]: df.columns 
Out[23]: Index(['a', 'b', 'c'], dtype='object') 

In [24]: col_list = list('bcd') 

In [28]: %timeit df_[df_.columns.intersection(col_list)] 
100 loops, best of 3: 6.24 ms per loop 

In [29]: %timeit df_[[col for col in col_list if col in df_.columns]] 
100 loops, best of 3: 5.69 ms per loop 

Let's test it on the transposed DF (3 rows, 200K columns):

In [30]: t = df_.T 

In [31]: t.shape 
Out[31]: (3, 200000) 

In [32]: t 
Out[32]: 
    0  1  2  3  4  ... 199995 199996 199997 199998 199999 
a  1.0  1.0  1.0  1.0  1.0 ...  1.0  1.0  1.0  1.0  1.0 
b  1.0  1.0  1.0  1.0  1.0 ...  1.0  1.0  1.0  1.0  1.0 
c  1.0  1.0  1.0  1.0  1.0 ...  1.0  1.0  1.0  1.0  1.0 

[3 rows x 200000 columns] 

In [33]: col_list=[-10, -20, 10, 20, 100] 

In [34]: %timeit t[t.columns.intersection(col_list)] 
10 loops, best of 3: 52.8 ms per loop 

In [35]: %timeit t[[col for col in col_list if col in t.columns]] 
10 loops, best of 3: 103 ms per loop 

Conclusion: the list comprehension almost always wins for small lists, while pandas/NumPy wins for larger datasets...
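If you want to check this on your own machine outside IPython, something along these lines should do. It is only a rough sketch: `time_selectors` is a hypothetical helper name, the frames are smaller than the ones timed above to keep the run short, and the numbers will vary with hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def time_selectors(frame, labels, number=5):
    """Return average seconds per call for each selection strategy (hypothetical helper)."""
    candidates = {
        "intersection": lambda: frame[frame.columns.intersection(labels)],
        "list_comp": lambda: frame[[c for c in labels if c in frame.columns]],
    }
    return {
        name: timeit.timeit(fn, number=number) / number
        for name, fn in candidates.items()
    }


# Tall frame: many rows, three columns (assumed shape, smaller than in the answer).
tall = pd.DataFrame(np.ones((20_000, 3)), columns=list("abc"))

# Wide frame: three rows, many integer-labelled columns.
wide = tall.T

print(time_selectors(tall, list("bcd")))
print(time_selectors(wide, [-10, -20, 10, 20, 100]))
```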

+1

I forgot to test the wide case... – piRSquared

5

How about:

df[[col for col in list('bcd') if col in df.columns]] 

This yields:

 b c 
0 1.0 1.0 
1 1.0 1.0 
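The same idea can be wrapped in a small function if you need it in more than one place. This is just a sketch: `select_existing` is a made-up name, and `df` is an assumed stand-in for the question's frame. One property worth noting is that the comprehension iterates over the requested labels, so the result keeps the order you asked for.

```python
import pandas as pd


def select_existing(frame, wanted):
    """Return the columns of `frame` named in `wanted`, skipping missing labels (hypothetical helper)."""
    present = [col for col in wanted if col in frame.columns]
    return frame[present]


# Assumed stand-in for the question's frame.
df = pd.DataFrame(1.0, index=range(2), columns=list("abc"))

# 'd' is silently skipped; 'c' comes before 'b' because that is the requested order.
subset = select_existing(df, list("cbd"))
print(subset)
```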

1

Index objects support isin

In [4]:  
col_list = list('bcd') 
df.loc[:, df.columns.isin(col_list)]  # .ix was removed in pandas 1.0; .loc takes the same boolean mask

Out[4]: 
    b c 
0 1 1 
1 1 1 

So this produces a boolean mask of the existing columns against the passed-in list
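To make the mask visible, here is a short sketch. `df` is again an assumed reconstruction of the question's frame, and the selection goes through `.loc`, since `.ix` is no longer available from pandas 1.0 onward.

```python
import pandas as pd

# Assumed stand-in for the question's frame.
df = pd.DataFrame(1.0, index=range(2), columns=list("abc"))
col_list = list("bcd")

# One boolean per existing column: True where the label appears in col_list.
mask = df.columns.isin(col_list)
print(mask)

# Select all rows and only the columns where the mask is True.
result = df.loc[:, mask]
print(result)
```

Since the mask is built from `df.columns`, labels that are missing from the frame (here 'd') simply never show up, and the selected columns stay in the frame's own order.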

Timings

In [5]: 
df_ = pd.concat([df] * 10**5, ignore_index=True) 
%timeit df_[df_.columns.intersection(col_list)] 
%timeit df_[[col for col in col_list if col in df_.columns]] 
%timeit df_.ix[:,df_.columns.isin(col_list)] 

100 loops, best of 3: 12.8 ms per loop 
100 loops, best of 3: 18.6 ms per loop 
10 loops, best of 3: 26.6 ms per loop 

This is the slowest method, but it is fewer characters and perhaps easier to understand

+0

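I asked this question because it was one of those annoying things that got me when I started using pandas. I think this answer is very useful, and I suspect many people will opt for it. – piRSquared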
