2017-09-13 24 views
2

Incrementing consecutive positive groups in an array

Suppose I have a Pandas Series of boolean values like this.

import pandas as pd

vals = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1]).astype(bool)

>>> vals 
0  False 
1  False 
2  False 
3  True 
4  True 
5  True 
6  True 
7  False 
8  False 
9  True 
10  True 
11 False 
12  True 
13  True 
14  True 
dtype: bool 

I would like to turn this boolean Series into a Series in which each group of 1s is enumerated in order, like this:

0  0 
1  0 
2  0 
3  1 
4  1 
5  1 
6  1 
7  0 
8  0 
9  2 
10 2 
11 0 
12 3 
13 3 
14 3 

How can I do this efficiently?


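I have been able to do this manually, looping over the Series at the Python level and incrementing a counter, but that is clearly slow. I am looking for a vectorized solution. I saw this answer from unutbu about splitting on increasing groups in NumPy and tried to get it to work with some kind of cumsum, but have been unsuccessful so far.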
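For reference, the Python-level loop described above might look roughly like the sketch below. The function name `enumerate_groups_loop` is hypothetical and not from the original post; it is only meant as a slow but easy-to-check baseline.

```python
import pandas as pd


def enumerate_groups_loop(vals):
    """Label each run of True values with 1, 2, 3, ...; False stays 0.

    Hypothetical baseline: a plain Python loop, slow on large inputs.
    """
    labels = []
    group = 0      # number of True-runs seen so far
    prev = False   # value of the previous element
    for v in vals:
        if v and not prev:
            group += 1  # a new run of True values starts here
        labels.append(group if v else 0)
        prev = v
    return pd.Series(labels, index=vals.index)


vals = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1]).astype(bool)
baseline = enumerate_groups_loop(vals)
```

Any vectorized solution can be compared against `baseline` to confirm it produces the same labels.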

Answers

3

You can try this:

vals.astype(int).diff().fillna(vals.iloc[0]).eq(1).cumsum().where(vals, 0) 

#0  0 
#1  0 
#2  0 
#3  1 
#4  1 
#5  1 
#6  1 
#7  0 
#8  0 
#9  2 
#10 2 
#11 0 
#12 3 
#13 3 
#14 3 
#dtype: int64 
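To see why the chained expression above works, it may help to split it into named steps. This is only an illustrative breakdown; the intermediate names are made up here, and `int(vals.iloc[0])` is used for the fill value (an adaptation, assumed equivalent) so that the filled Series stays numeric.

```python
import pandas as pd

vals = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1]).astype(bool)

as_int = vals.astype(int)

# +1 where False -> True, -1 where True -> False, 0 otherwise, NaN at position 0
step = as_int.diff()

# The first element starts a group only if it is True
step = step.fillna(int(vals.iloc[0]))

# True exactly at the first element of each run of True values
starts = step.eq(1)

# Running count of group starts: every position gets the id of the latest group
group_id = starts.cumsum()

# Keep the id inside groups, reset to 0 elsewhere
result = group_id.where(vals, 0)
```

The key idea is that `cumsum` over the "group start" flags numbers the groups, and the final `where` blanks out the positions between them.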
1
m = (vals.diff().ne(0) & vals.ne(0)).cumsum()
m[vals.eq(0)] = 0
m 
Out[235]: 
0  0 
1  0 
2  0 
3  1 
4  1 
5  1 
6  1 
7  0 
8  0 
9  2 
10 2 
11 0 
12 3 
13 3 
14 3 
dtype: int32 

Data input

vals = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1]) 
3

Here is a NumPy approach -

import numpy as np
import pandas as pd

def island_same_label(vals):

    # Get array for faster processing with NumPy tools, ufuncs 
    a = vals.values 

    # Initialize output array 
    out = np.zeros(a.size, dtype=int) 

    # Get start indices for each island of 1s. Set those as 1s 
    out[np.flatnonzero(a[1:] > a[:-1])+1] = 1 

    # In case 1st element was True, we would have missed it earlier, so add that 
    out[0] = a[0] 

    # Finally cumsum and mask out non-island regions 
    np.cumsum(out, out=out) 
    return pd.Series(np.where(a, out, 0)) 
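The central step of the function above is locating where each island of 1s begins. A small sketch of just that step on the sample input, with variable names chosen here for illustration:

```python
import numpy as np

a = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1], dtype=bool)

# a[1:] > a[:-1] is True where an element is True and its predecessor is False.
# flatnonzero gives those positions in the shifted array, so add 1 to get
# indices into the original array.
starts = np.flatnonzero(a[1:] > a[:-1]) + 1
print(starts)  # [ 3  9 12]
```

Because the shifted comparison can never flag index 0, the function handles a leading True separately with `out[0] = a[0]`.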

Timings, using the sample tiled many times as input -

In [15]: vals=pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1]).astype(bool) 

In [16]: vals = pd.Series(np.tile(vals,10000)) 

In [17]: %timeit Psidom_app(vals) # @Psidom's soln 
    ...: %timeit Wen_app(vals) # @Wen's soln 
    ...: %timeit island_same_label(vals) # Proposed in this post 
    ...: 
100 loops, best of 3: 9.53 ms per loop 
100 loops, best of 3: 13.2 ms per loop 
1000 loops, best of 3: 959 µs per loop
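`Psidom_app` and `Wen_app` are not defined in the thread. The sketch below assumes they are thin wrappers around the two pandas answers; the wrapper bodies, the `int(...)` fill value, and the cast to `int` in `Wen_app` are assumptions made here so the comparison can be rerun with the standard `timeit` module outside IPython.

```python
import timeit

import numpy as np
import pandas as pd


def Psidom_app(vals):
    # Assumed wrapper around the diff/fillna/eq/cumsum/where chain
    return (vals.astype(int).diff().fillna(int(vals.iloc[0]))
                .eq(1).cumsum().where(vals, 0))


def Wen_app(vals):
    # Assumed wrapper around the diff/ne/cumsum answer; cast to int first,
    # since that answer used integer input
    v = vals.astype(int)
    m = (v.diff().ne(0) & v.ne(0)).cumsum()
    m[v.eq(0)] = 0
    return m


sample = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1]).astype(bool)
big = pd.Series(np.tile(sample.to_numpy(), 10000))

for func in (Psidom_app, Wen_app):
    seconds = timeit.timeit(lambda: func(big), number=10) / 10
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```

`island_same_label` can be added to the loop once it is defined. Absolute numbers will differ by machine and by pandas/NumPy version, so only the relative ordering is meaningful.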