Get a slice of a DataFrame from a list of column names when not all of them are in the DataFrame

Consider the DataFrame df:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
df

col_list = list('bcd') 

df[col_list]

raises an error:

KeyError: "['d'] not in index"

How do I get as many of the requested columns as actually exist?

Source

2016-10-25 piRSquared

What about using Index.intersection()?

In [69]: df[df.columns.intersection(col_list)] 
Out[69]: 
     b    c
0  1.0  1.0
1  1.0  1.0

In [70]: df.columns 
Out[70]: Index(['a', 'b', 'c'], dtype='object') # <---------- Index
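For anyone who wants to try this outside an IPython session, here is a minimal self-contained sketch of the same idea. The names `present` and `subset` are illustrative and not from the answer, and the sketch makes no promise about the order of the resulting columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list("abc"))
col_list = list("bcd")  # 'd' is not a column of df

# Index.intersection keeps only the labels found in both df.columns and col_list
present = df.columns.intersection(col_list)
subset = df[present]

print(subset)
```

Because the missing label is dropped before indexing, no KeyError is raised.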

Timing:

In [21]: df_ = pd.concat([df] * 10**5, ignore_index=True) 

In [22]: df_.shape 
Out[22]: (200000, 3) 

In [23]: df.columns 
Out[23]: Index(['a', 'b', 'c'], dtype='object') 

In [24]: col_list = list('bcd') 

In [28]: %timeit df_[df_.columns.intersection(col_list)] 
100 loops, best of 3: 6.24 ms per loop 

In [29]: %timeit df_[[col for col in col_list if col in df_.columns]] 
100 loops, best of 3: 5.69 ms per loop

Let's test it on the transposed DataFrame (3 rows, 200K columns):

In [30]: t = df_.T 

In [31]: t.shape 
Out[31]: (3, 200000) 

In [32]: t 
Out[32]: 
     0    1    2    3    4  ...  199995  199996  199997  199998  199999
a  1.0  1.0  1.0  1.0  1.0  ...     1.0     1.0     1.0     1.0     1.0
b  1.0  1.0  1.0  1.0  1.0  ...     1.0     1.0     1.0     1.0     1.0
c  1.0  1.0  1.0  1.0  1.0  ...     1.0     1.0     1.0     1.0     1.0

[3 rows x 200000 columns] 

In [33]: col_list=[-10, -20, 10, 20, 100] 

In [34]: %timeit t[t.columns.intersection(col_list)] 
10 loops, best of 3: 52.8 ms per loop 

In [35]: %timeit t[[col for col in col_list if col in t.columns]] 
10 loops, best of 3: 103 ms per loop

Conclusion: list comprehensions almost always win for small lists, while pandas/NumPy methods win on larger data sets...
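To repeat the comparison without IPython's %timeit, something like the following sketch could be used. The helper names `by_intersection` and `by_comprehension`, the much smaller frame sizes, and the repeat count of 20 are all assumptions made here to keep the run short. Absolute numbers will differ by machine and pandas version, so no expected timings are shown.

```python
import timeit

import numpy as np
import pandas as pd


def by_intersection(frame, cols):
    # Select via Index.intersection
    return frame[frame.columns.intersection(cols)]


def by_comprehension(frame, cols):
    # Select via a list comprehension over the requested labels
    return frame[[c for c in cols if c in frame.columns]]


base = pd.DataFrame(np.ones((2, 3)), columns=list("abc"))
tall = pd.concat([base] * 1000, ignore_index=True)  # 2000 rows, 3 columns
wide = tall.T                                       # 3 rows, 2000 integer-labelled columns

cases = {
    "tall": (tall, list("bcd")),
    "wide": (wide, [-10, -20, 10, 20, 100]),
}

repeats = 20
for name, (frame, cols) in cases.items():
    for fn in (by_intersection, by_comprehension):
        seconds = timeit.timeit(lambda: fn(frame, cols), number=repeats)
        print(f"{name:>4} {fn.__name__:<17} {seconds / repeats * 1e3:.3f} ms per call")
```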

Source

2016-10-25 17:30:42 MaxU

I forgot to test the wide case... – piRSquared

How about:

df[[col for col in list('bcd') if col in df.columns]]

This yields:

     b    c
0  1.0  1.0
1  1.0  1.0
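One property of the comprehension worth spelling out is that the result follows the order of the requested list rather than the order of the DataFrame's columns. The sketch below wraps the idea in a small helper; `existing_columns` is a hypothetical name introduced here, not part of pandas or of the answer.

```python
import numpy as np
import pandas as pd


def existing_columns(frame, wanted):
    """Return the labels in `wanted` that are columns of `frame`, in `wanted` order."""
    return [col for col in wanted if col in frame.columns]


df = pd.DataFrame(np.ones((2, 3)), columns=list("abc"))

# Request the columns in a different order than they appear in df
print(df[existing_columns(df, list("cbd"))])  # columns come out as c, b
```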

Source

2016-10-25 17:40:51

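Index objects support isin (the answer originally used the .ix indexer, which was removed in pandas 1.0; .loc does the same job here):

In [4]:
col_list = list('bcd')
df.loc[:, df.columns.isin(col_list)]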

Out[4]: 
   b  c
0  1  1
1  1  1

So this builds a boolean mask over the existing columns, marking which of them appear in the list that was passed in.
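To make the mask visible, here is a small sketch that prints it before using it. It uses `.loc` for the boolean selection, and the names `mask` and `subset` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list("abc"))
col_list = list("bcd")

# One boolean per column of df: True where the column label is in col_list
mask = df.columns.isin(col_list)
print(mask)  # [False  True  True]

# .loc accepts a boolean array on the column axis
subset = df.loc[:, mask]
print(list(subset.columns))  # ['b', 'c']
```

Since the mask is aligned with df.columns, the selected columns keep the DataFrame's own order, and labels in col_list that are missing from df (here 'd') simply never match.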

Timings

In [5]: 
df_ = pd.concat([df] * 10**5, ignore_index=True) 
%timeit df_[df_.columns.intersection(col_list)] 
%timeit df_[[col for col in col_list if col in df_.columns]] 
%timeit df_.ix[:,df_.columns.isin(col_list)]  # timed with .ix; use .loc on pandas >= 1.0

100 loops, best of 3: 12.8 ms per loop 
100 loops, best of 3: 18.6 ms per loop 
10 loops, best of 3: 26.6 ms per loop

This is the slowest method, but it takes fewer characters and is perhaps easier to understand.

Source

2016-10-25 22:48:27 EdChum

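I asked this question because it was one of those annoying things that tripped me up when I started using pandas. I think this answer is very useful, and I suspect many people will choose it. – piRSquared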
