Implementation
Here's a vectorized approach using NumPy boolean indexing:
import numpy as np
import pandas as pd

# Extract values into an array
arr = df.values
# Determine the min, max limits along each column
minl = (arr > 6.5).argmax(0)
maxl = (arr > 9).argmax(0)
# Set up the corresponding boolean mask and set those elements in the array to NaN
R = np.arange(arr.shape[0])[:, None]
mask = (R < minl) | (R >= maxl)
arr[mask] = np.nan
# Finally, convert back to a dataframe
df = pd.DataFrame(arr, columns=df.columns)
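Please note that, alternatively, one could apply the mask directly to the input dataframe instead of re-creating it, but the interesting finding here is that boolean indexing into a NumPy array is faster than into a pandas dataframe. Since we are filtering the entire dataframe anyway, we can simply re-create it.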
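For reference, here is a minimal sketch of that alternative, masking the dataframe itself rather than rebuilding it from the array. The two-column sample frame and the name `masked_df` are made up for illustration, and `DataFrame.mask` is just one way to do it; I haven't timed this variant.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not from the original question
df = pd.DataFrame({
    "a": [6.3, 6.6, 7.2, 9.4, 8.8],
    "b": [6.1, 6.2, 6.9, 8.9, 9.3],
})

# Build the mask exactly as in the approach above
arr = df.to_numpy()
minl = (arr > 6.5).argmax(0)
maxl = (arr > 9).argmax(0)
R = np.arange(arr.shape[0])[:, None]
mask = (R < minl) | (R >= maxl)

# DataFrame.mask returns a new frame with NaN wherever the mask is True,
# leaving the original frame untouched
masked_df = df.mask(mask)
print(masked_df)
```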
Closer look
Now, let's take a closer look at the mask creation part, which is the crux of this solution.
1) Input array:
In [148]: arr
Out[148]:
array([[ 6.305, 6.191, 5.918],
[ 6.507, 6.991, 6.203],
[ 6.407, 6.901, 6.908],
[ 6.963, 7.127, 7.116],
[ 7.227, 7.33 , 7.363],
[ 7.445, 7.632, 7.575],
[ 7.71 , 7.837, 7.663],
[ 8.904, 8.971, 8.895],
[ 9.394, 9.194, 8.994],
[ 8.803, 8.113, 9.333],
[ 8.783, 8.783, 8.783]])
2) Min, max limits:
In [149]: # Determine the min,max limits along each column
...: minl = (arr > 6.5).argmax(0)
...: maxl = (arr>9).argmax(0)
...:
In [150]: minl
Out[150]: array([1, 1, 2])
In [151]: maxl
Out[151]: array([8, 8, 9])
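One thing to keep in mind with `argmax` on a boolean array: if a column has no `True` at all, it returns 0. So a column that never goes above 9 would get `maxl = 0` and be masked entirely. The sketch below shows one possible guard; the small array and the name `maxl_safe` are made up for illustration, and whether such a column should keep its tail is an assumption about the desired behaviour.

```python
import numpy as np

# Hypothetical data: the second column never exceeds 9
arr = np.array([[6.1, 6.7],
                [6.8, 7.5],
                [9.2, 8.0],
                [8.5, 8.9]])

above_max = arr > 9
maxl = above_max.argmax(0)          # second column has no True, so this gives 0

# Assumed fix: where the threshold is never crossed, use the row count
# so that (R >= maxl) masks nothing in that column
n_rows = arr.shape[0]
maxl_safe = np.where(above_max.any(0), maxl, n_rows)

print(maxl, maxl_safe)
```

The same kind of check could be applied to `minl` if some columns might never exceed 6.5.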
3) Using broadcasting to create a mask that spans the entire dataframe/array and selects the elements to be set to NaN:
In [152]: R = np.arange(arr.shape[0])[:,None]
In [153]: R
Out[153]:
array([[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 9],
[10]])
In [154]: (R < minl) | (R >= maxl)
Out[154]:
array([[ True, True, True],
[False, False, True],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[ True, True, False],
[ True, True, True],
[ True, True, True]], dtype=bool)
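To make the broadcasting step easier to follow, here is a smaller sketch with made-up limits: a column vector of row indices with shape (4, 1) is compared against per-column limits with shape (2,), and NumPy stretches both to shape (4, 2). The values are purely illustrative.

```python
import numpy as np

R = np.arange(4)[:, None]     # row indices as a column vector, shape (4, 1)
minl = np.array([1, 2])       # hypothetical per-column start limits, shape (2,)
maxl = np.array([3, 4])       # hypothetical per-column stop limits, shape (2,)

lower = R < minl              # rows before each column's start, shape (4, 2)
upper = R >= maxl             # rows at or after each column's stop, shape (4, 2)
mask = lower | upper          # True marks the elements that would become NaN

print(mask.shape)
print(mask)
```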
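Runtime test
Let's time the approaches listed so far for solving the problem. Since it was mentioned that there would be a lot of columns, let's use a large number of columns.
Approaches listed as functions: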
def cumsum_app(df):  # Listed in other solution by @Merlin
    df2 = df > 6.5
    df = df[df2.cumsum() > 0]
    df2 = df > 9
    df = df[~(df2.cumsum() > 0)]

def boolean_indexing_app(df):  # Approach listed in this post
    arr = df.values
    minl = (arr > 6.5).argmax(0)
    maxl = (arr > 9).argmax(0)
    R = np.arange(arr.shape[0])[:, None]
    mask = (R < minl) | (R >= maxl)
    arr[mask] = np.nan
    df = pd.DataFrame(arr, columns=df.columns)
Timings:
In [163]: # Create a random array with floating pt numbers between 6 and 10
...: df = pd.DataFrame((np.random.rand(11,10000)*4)+6)
...:
...: # Create copies for testing approaches
...: df1 = df.copy()
...: df2 = df.copy()
In [164]: %timeit cumsum_app(df1)
100 loops, best of 3: 16.4 ms per loop
In [165]: %timeit boolean_indexing_app(df2)
100 loops, best of 3: 2.09 ms per loop
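As a sanity check alongside the timings, here is a sketch that compares the output of both approaches on the 11x3 sample array shown earlier. The function names `cumsum_result` and `boolean_indexing_result` are mine; they are variants of the two functions that return their result and work on a copy. Note that the two can disagree on columns that never cross one of the thresholds, since `argmax` returns 0 there, so this only confirms agreement on data like the sample.

```python
import numpy as np
import pandas as pd

def cumsum_result(df):
    # Keep rows from the first value > 6.5 onwards, per column
    df = df[(df > 6.5).cumsum() > 0]
    # Then drop everything from the first value > 9 onwards
    return df[~((df > 9).cumsum() > 0)]

def boolean_indexing_result(df):
    arr = df.to_numpy(dtype=float, copy=True)   # copy so the input is not modified
    minl = (arr > 6.5).argmax(0)
    maxl = (arr > 9).argmax(0)
    R = np.arange(arr.shape[0])[:, None]
    arr[(R < minl) | (R >= maxl)] = np.nan
    return pd.DataFrame(arr, columns=df.columns, index=df.index)

# Sample array from the "Closer look" section
sample = pd.DataFrame([
    [6.305, 6.191, 5.918], [6.507, 6.991, 6.203], [6.407, 6.901, 6.908],
    [6.963, 7.127, 7.116], [7.227, 7.33, 7.363], [7.445, 7.632, 7.575],
    [7.71, 7.837, 7.663], [8.904, 8.971, 8.895], [9.394, 9.194, 8.994],
    [8.803, 8.113, 9.333], [8.783, 8.783, 8.783],
])

a = cumsum_result(sample).to_numpy()
b = boolean_indexing_result(sample).to_numpy()
same = np.allclose(a, b, equal_nan=True)
print(same)
```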
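Thanks, Merlin! I'm still trying to wrap my head around the simplicity, and I will. I'm new to all of this, especially the ~ trick. Still trying to train my mind to think in a vectorized way. If you could explain this elegance, I'm sure I wouldn't be the only one to benefit from it. Thanks again. – RageQuilt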