2016-07-10 148 views
3

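Remove values once a threshold (min/max) is reached, with Pandas

I would like to make a filter for the entire dataframe, which includes many columns beyond column C. I'd like this filter to return the values in each column once a minimum threshold has been met, and to stop once a maximum threshold has been met. I'd like the minimum threshold to be 6.5 and the maximum to be 9.0. It's not as simple as it sounds, so bear with me...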

DataFrame:

Time A1 A2 A3 
1 6.305 6.191 5.918 
2 6.507 6.991 6.203 
3 6.407 6.901 6.908 
4 6.963 7.127 7.116 
5 7.227 7.330 7.363 
6 7.445 7.632 7.575 
7 7.710 7.837 7.663 
8 8.904 8.971 8.895 
9 9.394 9.194 8.994 
10 8.803 8.113 9.333 
11 8.783 8.783 8.783 

Desired result:

Time A1 A2 A3 
1 NaN  NaN  NaN 
2 6.507 6.991 NaN 
3 6.407 6.901 6.908 
4 6.963 7.127 7.116 
5 7.227 7.330 7.363 
6 7.445 7.632 7.575 
7 7.710 7.837 7.663 
8 8.904 8.971 8.895 
9 NaN  NaN  8.994 
10 NaN  NaN  NaN 
11 NaN  NaN  NaN 

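To drive the point home: in column A1, for example, the value at Time 3 is 6.407, which is below the 6.5 threshold. But since the threshold was already met at Time 2, I'd like to keep the data from that point on. As for the upper threshold, in column A1 the value at Time 9 is above 9.0, so I'd like to drop that value and everything after it, even though the remaining values are below 9.0. I want to be able to apply this across many more columns.

Thanks!!!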

Answers

2

Implementation

Here's a vectorized approach using NumPy boolean indexing:

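import numpy as np 
import pandas as pd 

# Extract values into a float array (astype copies, so df itself is not modified) 
arr = df.values.astype(float) 

# Per column, find the row index of the first value above each limit 
minl = (arr > 6.5).argmax(0) 
maxl = (arr > 9).argmax(0) 

# Set up the corresponding boolean mask and set those entries in the array to NaN 
R = np.arange(arr.shape[0])[:,None] 
mask = (R < minl) | (R >= maxl) 
arr[mask] = np.nan 

# Finally convert back to a dataframe, keeping the original index 
df = pd.DataFrame(arr, index=df.index, columns=df.columns) 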
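One caveat with the argmax trick: on a column where no value exceeds a limit, argmax returns 0, which here would blank out the whole column when the upper limit is never crossed. Below is a minimal sketch of a variant that guards against that. The helper name `threshold_window` and its `lo`/`hi` parameters are made up for illustration, and it assumes the frame is entirely numeric.

```python
import numpy as np
import pandas as pd

def threshold_window(df, lo=6.5, hi=9.0):
    # Hypothetical helper (not from the answer above): NaN out values before
    # the first value > lo and from the first value > hi onward, per column.
    arr = df.to_numpy(dtype=float, copy=True)  # copy, so df is left untouched
    n = arr.shape[0]

    above_lo = arr > lo
    above_hi = arr > hi

    # argmax gives 0 for a column with no True, so patch those columns:
    #   never above lo -> start = n (everything is masked)
    #   never above hi -> stop  = n (nothing is masked at the end)
    start = np.where(above_lo.any(axis=0), above_lo.argmax(axis=0), n)
    stop = np.where(above_hi.any(axis=0), above_hi.argmax(axis=0), n)

    rows = np.arange(n)[:, None]
    arr[(rows < start) | (rows >= stop)] = np.nan
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

df = pd.DataFrame(
    {"A1": [6.3, 6.6, 6.4, 8.0],   # crosses 6.5, never exceeds 9.0
     "A2": [6.0, 6.1, 6.2, 6.3]},  # never crosses 6.5
    index=pd.Index([1, 2, 3, 4], name="Time"),
)
out = threshold_window(df)
print(out)
```

With this guard, A1 keeps everything from Time 2 onward and A2 comes back entirely NaN, which is how I read the rule in the question.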

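Please note that, alternatively, one could mask directly into the input dataframe instead of re-creating it, but the interesting finding here is that boolean indexing into a NumPy array is faster than into a pandas dataframe. Since we are filtering the entire dataframe anyway, we can simply re-create the dataframe.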
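For completeness, masking directly on the dataframe might look like the sketch below. It uses `DataFrame.mask`, which is my choice for illustration rather than something the answer spells out, and the five-row sample frame is made up from values in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A1": [6.305, 6.507, 6.407, 9.394, 8.803],
     "A2": [5.918, 6.203, 6.908, 8.994, 9.333]},
    index=pd.Index([1, 2, 3, 4, 5], name="Time"),
)

arr = df.to_numpy()
minl = (arr > 6.5).argmax(0)
maxl = (arr > 9).argmax(0)
R = np.arange(arr.shape[0])[:, None]
mask = (R < minl) | (R >= maxl)

# DataFrame.mask replaces entries where the condition is True with NaN and
# returns a new frame, so the index and column labels carry over unchanged.
filtered = df.mask(mask)
print(filtered)
```

This keeps the Time index without any extra bookkeeping, at the cost of going through pandas for the assignment.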

A closer look

Now, let's take a closer look at the mask creation part, which is the crux of this solution.

1) Input array:

In [148]: arr 
Out[148]: 
array([[ 6.305, 6.191, 5.918], 
     [ 6.507, 6.991, 6.203], 
     [ 6.407, 6.901, 6.908], 
     [ 6.963, 7.127, 7.116], 
     [ 7.227, 7.33 , 7.363], 
     [ 7.445, 7.632, 7.575], 
     [ 7.71 , 7.837, 7.663], 
     [ 8.904, 8.971, 8.895], 
     [ 9.394, 9.194, 8.994], 
     [ 8.803, 8.113, 9.333], 
     [ 8.783, 8.783, 8.783]]) 

2) Min and max limits:

In [149]: # Determine the min,max limits along each column 
    ...: minl = (arr > 6.5).argmax(0) 
    ...: maxl = (arr>9).argmax(0) 
    ...: 

In [150]: minl 
Out[150]: array([1, 1, 2]) 

In [151]: maxl 
Out[151]: array([8, 8, 9]) 
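Why argmax works here, as a small sketch on a single made-up column: on a boolean array the maximum is True, and argmax returns the position of its first occurrence.

```python
import numpy as np

col = np.array([6.305, 6.507, 6.407, 9.394, 8.803])

# True counts as 1 and False as 0, and ties resolve to the first occurrence,
# so argmax on a boolean array is the index of the first True.
first_above_min = (col > 6.5).argmax()   # position of 6.507
first_above_max = (col > 9).argmax()     # position of 9.394

# Caveat: when nothing is True, argmax still returns 0.
none_above = (col > 100).argmax()

print(first_above_min, first_above_max, none_above)  # 1 3 0
```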

3) Use broadcasting to create a mask that spans the entire dataframe/array and selects the elements to be set to NaN:

In [152]: R = np.arange(arr.shape[0])[:,None] 

In [153]: R 
Out[153]: 
array([[ 0], 
     [ 1], 
     [ 2], 
     [ 3], 
     [ 4], 
     [ 5], 
     [ 6], 
     [ 7], 
     [ 8], 
     [ 9], 
     [10]]) 

In [154]: (R < minl) | (R >= maxl) 
Out[154]: 
array([[ True, True, True], 
     [False, False, True], 
     [False, False, False], 
     [False, False, False], 
     [False, False, False], 
     [False, False, False], 
     [False, False, False], 
     [False, False, False], 
     [ True, True, False], 
     [ True, True, True], 
     [ True, True, True]], dtype=bool) 
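The broadcasting step is easier to follow on a tiny case. In this sketch the limits are made-up numbers, not taken from the data above: a column vector of row indices with shape (4, 1) is compared against per-column limits with shape (2,), and NumPy stretches both to (4, 2).

```python
import numpy as np

R = np.arange(4)[:, None]   # shape (4, 1): one row index per row
minl = np.array([1, 2])     # shape (2,): first kept row, per column
maxl = np.array([3, 4])     # shape (2,): first dropped row, per column

# (4, 1) compared with (2,) broadcasts to (4, 2): every row index is
# tested against every column's limits in a single expression.
mask = (R < minl) | (R >= maxl)

print(mask.shape)  # (4, 2)
print(mask)
# [[ True  True]
#  [False  True]
#  [False False]
#  [ True False]]
```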

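Runtime test

Let's time the approaches listed so far. Since the question mentions that there will be many columns, let's use a large number of columns.

Approaches, listed as functions: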

def cumsum_app(df): # Listed in other solution by @Merlin 
    df2 = df > 6.5 
    df = df[df2.cumsum()>0] 
    df2 = df > 9 
    df = df[~(df2.cumsum()>0)] 
    return df 

def boolean_indexing_app(df): # Approach listed in this post 
    arr = df.values # may be a view: writing to arr can modify the caller's frame 
    minl = (arr > 6.5).argmax(0) 
    maxl = (arr > 9).argmax(0) 
    R = np.arange(arr.shape[0])[:,None] 
    mask = (R < minl) | (R >= maxl) 
    arr[mask] = np.nan 
    df = pd.DataFrame(arr,columns=df.columns) 
    return df 

Timings:

In [163]: # Create a random array with floating pt numbers between 6 and 10 
    ...: df = pd.DataFrame((np.random.rand(11,10000)*4)+6) 
    ...: 
    ...: # Create copies for testing approaches 
    ...: df1 = df.copy() 
    ...: df2 = df.copy() 


In [164]: %timeit cumsum_app(df1) 
100 loops, best of 3: 16.4 ms per loop 

In [165]: %timeit boolean_indexing_app(df2) 
100 loops, best of 3: 2.09 ms per loop 
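Timings only mean something if both approaches give the same answer, so here is a sketch of a quick sanity check on the sample data from the question. The names `cumsum_filter` and `argmax_filter` are illustrative variants that return their results and work on a copy; they are not the exact functions timed above.

```python
import numpy as np
import pandas as pd

def cumsum_filter(df, lo=6.5, hi=9.0):
    # Keep values once the running count of "above lo" becomes positive...
    out = df[(df > lo).cumsum() > 0]
    # ...then drop everything from the first "above hi" onward.
    return out[~((out > hi).cumsum() > 0)]

def argmax_filter(df, lo=6.5, hi=9.0):
    arr = df.to_numpy(dtype=float, copy=True)
    minl = (arr > lo).argmax(0)
    maxl = (arr > hi).argmax(0)
    R = np.arange(arr.shape[0])[:, None]
    arr[(R < minl) | (R >= maxl)] = np.nan
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

df = pd.DataFrame(
    {"A1": [6.305, 6.507, 6.407, 6.963, 7.227, 7.445, 7.710, 8.904, 9.394, 8.803, 8.783],
     "A2": [6.191, 6.991, 6.901, 7.127, 7.330, 7.632, 7.837, 8.971, 9.194, 8.113, 8.783],
     "A3": [5.918, 6.203, 6.908, 7.116, 7.363, 7.575, 7.663, 8.895, 8.994, 9.333, 8.783]},
    index=pd.Index(range(1, 12), name="Time"),
)

a = cumsum_filter(df)
b = argmax_filter(df)

# Raises AssertionError if the two results differ (NaNs in the same place count as equal).
pd.testing.assert_frame_equal(a, b)
```

Every column in the sample crosses both thresholds, which is the case where the two approaches are expected to agree.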
2

Try this:

df 
     A1  A2  A3 
Time      
1  6.305 6.191 5.918 
2  6.507 6.991 6.203 
3  6.407 6.901 6.908 
4  6.963 7.127 7.116 
5  7.227 7.330 7.363 
6  7.445 7.632 7.575 
7  7.710 7.837 7.663 
8  8.904 8.971 8.895 
9  9.394 9.194 8.994 
10 8.803 8.113 9.333 
11 8.783 8.783 8.783 

df2 = df > 6.5 
df = df[df2.cumsum()>0] 
df2 = df > 9 
df = df[~(df2.cumsum()>0)] 

df 
     A1  A2  A3 
Time      
1  NaN NaN NaN 
2  6.507 6.991 NaN 
3  6.407 6.901 6.908 
4  6.963 7.127 7.116 
5  7.227 7.330 7.363 
6  7.445 7.632 7.575 
7  7.710 7.837 7.663 
8  8.904 8.971 8.895 
9  NaN NaN 8.994 
10  NaN NaN NaN 
11  NaN NaN NaN 
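To unpack what the four lines are doing, here is a step-by-step sketch on a single made-up column. The intermediate variable names are mine, added only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A1": [6.305, 6.507, 6.407, 9.394, 8.803]})

above_min = df > 6.5                # False, True, False, True, True
# cumsum counts the Trues seen so far; "> 0" stays True from the first one on.
seen_min = above_min.cumsum() > 0   # False, True, True, True, True
step1 = df[seen_min]                # NaN, 6.507, 6.407, 9.394, 8.803

above_max = step1 > 9               # False, False, False, True, False
seen_max = above_max.cumsum() > 0   # False, False, False, True, True
# "~" flips the boolean frame, so we keep the rows *before* the max was seen.
result = step1[~seen_max]           # NaN, 6.507, 6.407, NaN, NaN

print(result)
```

Indexing a dataframe with a boolean dataframe of the same shape keeps the values where the mask is True and puts NaN everywhere else, which is why no explicit assignment is needed.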
+0

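Thanks, Merlin! I'm still trying to wrap my head around the simplicity of this, and I will. I'm new to all of this, especially the ~ trick. Still trying to get my mind to think in a vectorized way. If you could explain the elegance here, I doubt I'd be the only one to benefit from it. Thanks again. – RageQuilt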