2017-09-19 39 views
2

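pandas - count of consecutive values above/below the current row

I'm looking for a way to take a pandas Series and return new Series that give, for each row, how many consecutive preceding values that row is higher or lower than: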

a = pd.Series([30, 10, 20, 25, 35, 15]) 

...should output:

Value Higher than streak Lower than streak 
30  0     0 
10  0     1 
20  1     0 
25  2     0 
35  4     0 
15  0     3 

This would let someone identify how significant each "local max/min" value is within a time series.

Thanks in advance.

Answers

2

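Since you're looking backward over previous values to check for consecutive runs, you will have to interact with the index somehow. This solution first compares every value before the current index against the current value, then treats any match that is followed by a False as False, so only the unbroken run right before the current row is counted. It also avoids creating an iterator over the DataFrame, which may speed things up on large datasets.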

import pandas as pd
from operator import gt, lt

a = pd.Series([30, 10, 20, 25, 35, 15])

def consecutive_run(op, ser, i):
    """
    Count the uninterrupted run of values immediately before position i
    for which op(previous value, current value) is True.
    Assumes ser has the default 0..n-1 integer index.
    """
    thresh_all = op(ser.iloc[:i], ser.iloc[i])
    # find any data where the operator was not passing
    non_passing = thresh_all[~thresh_all]
    start_idx = 0
    if not non_passing.empty:
        # if there was a failure, there was a break in the consecutive truth values,
        # so get the final False position. That position is itself False, so it adds
        # nothing to the sum; it is either the last element of the selection
        # (sum is zero) or is followed only by True values
        start_idx = non_passing.index[-1]
    # count the consecutive run by summing from the start index onwards
    return thresh_all.iloc[start_idx:].sum()


# "Higher than streak": the previous values are lower than the current one (lt)
# "Lower than streak": the previous values are higher than the current one (gt)
res = pd.concat([a, a.index.to_series().map(lambda i: consecutive_run(lt, a, i)),
                 a.index.to_series().map(lambda i: consecutive_run(gt, a, i))],
                axis=1)
res.columns = ['Value', 'Higher than streak', 'Lower than streak']
print(res)

Result:

   Value  Higher than streak  Lower than streak
0     30                   0                  0
1     10                   0                  1
2     20                   1                  0
3     25                   2                  0
4     35                   4                  0
5     15                   0                  3
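The step described in this answer, where every comparison result that is followed by a False is itself treated as False, can also be sketched with a reversed cumulative product. This is only an illustration under assumptions, not the answer's own code: the helper name `trailing_true_count` is hypothetical, and the column meaning follows the question (a row's "higher" streak counts the preceding values it exceeds).

```python
import numpy as np

def trailing_true_count(mask):
    """Count the unbroken run of True values at the end of a boolean array."""
    mask = np.asarray(mask, dtype=bool)
    # Reverse so the most recent comparison comes first; after the first
    # False (0) the cumulative product stays 0, so only the run survives.
    return int(np.cumprod(mask[::-1].astype(int)).sum())

values = np.array([30, 10, 20, 25, 35, 15])
higher = [trailing_true_count(values[:i] < values[i]) for i in range(len(values))]
lower = [trailing_true_count(values[:i] > values[i]) for i in range(len(values))]
print(higher)  # [0, 0, 1, 2, 4, 0]
print(lower)   # [0, 1, 0, 0, 0, 3]
```

Each row still compares against all earlier rows, so the total work stays quadratic in the series length.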
+1

Thanks, I didn't think we would find a solution that avoids loops. –

+0

Updated to use a slightly more efficient summing approach that grabs only the nearest values and then sums them. – benjwadams

0
import pandas as pd

value = pd.Series([30, 10, 20, 25, 35, 15])

# for each row, count all earlier values that are higher / lower than it
Lower = [(value.iloc[x] < value.iloc[:x]).sum() for x in range(len(value))]
Higher = [(value.iloc[x] > value.iloc[:x]).sum() for x in range(len(value))]

df = pd.DataFrame({"value": value, "Higher": Higher, "Lower": Lower},
                  columns=["Lower", "Higher", "value"])

print(df)

   Lower  Higher  value
0      0       0     30
1      1       0     10
2      1       1     20
3      1       2     25
4      0       4     35
5      4       1     15
+0

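Thanks for your answer. Unfortunately, this solution doesn't produce the result I expected, because each row should only be evaluated against the rows before it. For example, at the second observation, 10 is lower than 30, so Lower column = 1 and Upper column = 0. –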

+0

I've edited my answer. – 2Obe

+0

You may have to swap the names Higher and Lower to match the logic you have in mind. – 2Obe

0

Edit: updated to actually count consecutive values. I couldn't come up with a workable pure-pandas solution, so we're back to looping.

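import numpy as np
import pandas as pd

df = pd.Series(np.random.rand(10000))

def count_bigger_consecutives(values):
    length = len(values)
    result = np.zeros(length)
    for i in range(length):
        # j runs forward from the start of the series and stops at the
        # first value that values[i] does not exceed
        for j in range(i):
            if values[i] > values[j]:
                result[i] += 1
            else:
                break
    return result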

%timeit count_bigger_consecutives(df.values) 
1 loop, best of 3: 365 ms per loop 

If performance is what you care about, you can get a speedup with numba, a just-in-time compiler for Python code. This example is one where numba really shines:

import numpy as np
from numba import jit

@jit(nopython=True)
def numba_count_bigger_consecutives(values):
    length = len(values)
    result = np.zeros(length)
    for i in range(length):
        # same forward scan as above, compiled by numba
        for j in range(i):
            if values[i] > values[j]:
                result[i] += 1
            else:
                break
    return result

%timeit numba_count_bigger_consecutives(df.values) 
The slowest run took 543.09 times longer than the fastest. This could mean that an intermediate result is being cached. 
10000 loops, best of 3: 161 µs per loop 
+0

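Thanks. Very interesting, I wasn't familiar with expand(). However, this isn't quite the expected behavior. I need to know the largest number of consecutive past observations in my time series over which the current row would still be the max() or min(). –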

+0

@BrunoVieira I've updated my solution. –

+0

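Wow. That is much faster. Thanks for sharing this solution. Unfortunately, the result comes out as array([0., 0., 0., 0., 4., 0.]), whereas I expected 0, 0, 1, 2, 4, 0. Since it looks like a solution still needs a loop, your suggestion to use numba is still very useful. –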
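The gap between array([0., 0., 0., 0., 4., 0.]) and the expected 0, 0, 1, 2, 4, 0 comes from the inner loop starting at the beginning of the series. A minimal sketch of a variant that walks backward from the row just before i is shown here. The function name `count_streaks_backward` and its `higher` flag are hypothetical, and it is written in plain NumPy; it has not been run under numba here.

```python
import numpy as np

def count_streaks_backward(values, higher=True):
    """For each position, count consecutive preceding values that the
    current value is strictly higher (or lower) than, scanning backward."""
    values = np.asarray(values)
    result = np.zeros(len(values), dtype=np.int64)
    for i in range(len(values)):
        # start at the row just before i and move toward the start
        for j in range(i - 1, -1, -1):
            if higher:
                beats = values[i] > values[j]
            else:
                beats = values[i] < values[j]
            if not beats:
                break
            result[i] += 1
    return result

sample = [30, 10, 20, 25, 35, 15]
print(count_streaks_backward(sample))                # [0 0 1 2 4 0]
print(count_streaks_backward(sample, higher=False))  # [0 1 0 0 0 3]
```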

0

Here is a solution a colleague came up with (probably not the most efficient, but it does the trick):

Input data

a = pd.Series([30, 10, 20, 25, 35, 15]) 

Create the 'Higher' column

b = [] 

for idx, value in enumerate(a):
    count = 0
    # walk backward from the row just before idx
    for i in range(idx, 0, -1):
        if value < a.loc[i-1]:
            break
        count += 1
    b.append([value, count])

higher = pd.DataFrame(b, columns=['Value', 'Higher']) 

Create the 'Lower' column

c = [] 

for idx, value in enumerate(a):
    count = 0
    # walk backward from the row just before idx
    for i in range(idx, 0, -1):
        if value > a.loc[i-1]:
            break
        count += 1
    c.append([value, count])

lower = pd.DataFrame(c, columns=['Value', 'Lower']) 

Combine the two new frames

# join on the row index; merging on 'Value' would duplicate rows when values repeat
print(higher.join(lower['Lower']))

    Value Higher Lower 
0  30  0  0 
1  10  0  1 
2  20  1  0 
3  25  2  0 
4  35  4  0 
5  15  0  3 
1

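Here is my solution. It has a loop, but the number of iterations is only the length of the longest streak. It keeps a per-row flag for whether that row's streak has been fully counted, and stops once every row is done. It uses shift to test whether the previous row is higher/lower, and keeps increasing the shift until all streaks are found.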

import numpy as np
import pandas as pd

a = pd.Series([30, 10, 20, 25, 35, 15, 15])

# per-row flags: True while that row's streak is still being extended
a_not_done_greater = pd.Series(np.ones(len(a))).astype(bool)
a_not_done_less = pd.Series(np.ones(len(a))).astype(bool)

a_streak_greater = pd.Series(np.zeros(len(a))).astype(int)
a_streak_less = pd.Series(np.zeros(len(a))).astype(int)

s = 1
not_done_greater = True
not_done_less = True

while not_done_greater or not_done_less:
    if not_done_greater:
        # compare each row with the row s positions earlier
        a_greater_than_shift = (a > a.shift(s))
        a_streak_greater = a_streak_greater + (a_not_done_greater.astype(int) * a_greater_than_shift)
        a_not_done_greater = a_not_done_greater & a_greater_than_shift
        not_done_greater = a_not_done_greater.any()

    if not_done_less:
        a_less_than_shift = (a < a.shift(s))
        a_streak_less = a_streak_less + (a_not_done_less.astype(int) * a_less_than_shift)
        a_not_done_less = a_not_done_less & a_less_than_shift
        not_done_less = a_not_done_less.any()

    s = s + 1


res = pd.concat([a, a_streak_greater, a_streak_less], axis=1) 
res.columns = ['value', 'greater_than_streak', 'less_than_streak'] 
print(res) 

This gives the DataFrame:

value greater_than_streak less_than_streak 
0  30     0     0 
1  10     0     1 
2  20     1     0 
3  25     2     0 
4  35     4     0 
5  15     0     3 
6  15     0     0 
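For long series, the same streak columns can also be computed in a single pass with a monotonic stack, the technique usually used for the "stock span" problem. This is a different algorithm from the shift-based answer, sketched here under assumptions: the helper name `streak_lengths` is hypothetical, comparisons are strict (ties end a streak, as in the table above), and the "less than" column is obtained by negating the values, which assumes numeric data.

```python
import pandas as pd

def streak_lengths(values):
    """For each position, count the consecutive preceding values
    that the current value strictly exceeds."""
    stack = []  # indices of earlier values, non-increasing from bottom to top
    out = []
    for i, v in enumerate(values):
        # drop every earlier value that the current one strictly exceeds
        while stack and values[stack[-1]] < v:
            stack.pop()
        # the nearest remaining index marks where the streak is broken
        prev = stack[-1] if stack else -1
        out.append(i - prev - 1)
        stack.append(i)
    return out

a = pd.Series([30, 10, 20, 25, 35, 15, 15])
vals = a.tolist()
res = pd.DataFrame({
    "value": a,
    "greater_than_streak": streak_lengths(vals),
    "less_than_streak": streak_lengths([-v for v in vals]),
})
print(res)
```

Each index is pushed and popped at most once, so the work grows linearly with the length of the series rather than with the longest streak.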