Running total of consecutive identical values

How can I get a running total of consecutive 1s in a pandas Series? For example, given s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4]), I would like to get pd.Series([0, 1, 0, 1, 2, 0, 0, 1, 2, 3, 0]).

(pandas 0.18.0)


2016-03-26 max

You can try groupby with cumcount, grouping by the cumsum of s1 != 1:

print(s1.groupby((s1 != 1).cumsum()).cumcount())
0  0 
1  1 
2  0 
3  1 
4  2 
5  0 
6  0 
7  1 
8  2 
9  3 
10 0 
dtype: int64

A more detailed explanation (the sample s1 here has 12 elements, as the orig column shows):

import pandas as pd

# sample data, taken from the 'orig' column of the output below
s1 = pd.Series([5, 1, 3, 4, 1, 1, 2, 3, 1, 1, 1, 4])

df = pd.DataFrame(s1, columns=['orig'])
df['not1'] = s1 != 1
df['cumsum'] = (s1 != 1).cumsum()
df['cumcount'] = s1.groupby((s1 != 1).cumsum()).cumcount()
# s1.groupby((s1 != 1).cumsum()).cumcount() is the same as:
df['cumcount1'] = df.groupby('cumsum')['orig'].cumcount()
print(df)
    orig   not1  cumsum  cumcount  cumcount1
0      5   True       1         0          0
1      1  False       1         1          1
2      3   True       2         0          0
3      4   True       3         0          0
4      1  False       3         1          1
5      1  False       3         2          2
6      2   True       4         0          0
7      3   True       5         0          0
8      1  False       5         1          1
9      1  False       5         2          2
10     1  False       5         3          3
11     4   True       6         0          0
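
The grouping idea above can be tried end to end with the short sketch below. It uses the Series from the question and Python 3 print() syntax; the names groups and counts are illustrative and not from the original answer. One assumption worth stating: the Series does not begin with 1. A leading run of 1s would land in group 0 with no non-1 value in front of it, so its count would start at 0 instead of 1.

```python
import pandas as pd

s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])

# Every value that is not 1 starts a new group; the 1s that follow it
# share its group label.
groups = (s != 1).cumsum()

# Position within each group: the non-1 value gets 0, the 1s after it
# get 1, 2, 3, ...
counts = s.groupby(groups).cumcount()

print(counts.tolist())
```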

Or:

print((s1 == 1) * (s1.groupby((s1 != s1.shift()).cumsum()).cumcount() + 1))
0  0 
1  1 
2  0 
3  1 
4  2 
5  0 
6  0 
7  1 
8  2 
9  3 
10 0 
dtype: int64

Explanation:

df = pd.DataFrame(s1, columns=['orig'])
df['compare_shift'] = s1 != s1.shift()
df['cumsum'] = (s1 != s1.shift()).cumsum()
df['cumcount'] = s1.groupby((s1 != s1.shift()).cumsum()).cumcount() + 1
df['cumcount1'] = df.groupby('cumsum')['orig'].cumcount() + 1
df['is1'] = (s1 == 1)
# True is converted to 1, False to 0
df['fin'] = (s1 == 1) * (s1.groupby((s1 != s1.shift()).cumsum()).cumcount() + 1)
print(df)
    orig  compare_shift  cumsum  cumcount  cumcount1    is1  fin
0      5           True       1         1          1  False    0
1      1           True       2         1          1   True    1
2      3           True       3         1          1  False    0
3      4           True       4         1          1  False    0
4      1           True       5         1          1   True    1
5      1          False       5         2          2   True    2
6      2           True       6         1          1  False    0
7      3           True       7         1          1  False    0
8      1           True       8         1          1   True    1
9      1          False       8         2          2   True    2
10     1          False       8         3          3   True    3
11     4           True       9         1          1  False    0
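
As a rough sketch, the shift-based steps can be wrapped in a small function that takes the value to count as a parameter. The name run_length_of and its value argument are hypothetical, not part of the original answer. Because runs are delimited by comparing each element with its predecessor, this variant should also count a run that sits at the very start of the Series.

```python
import pandas as pd


def run_length_of(s, value=1):
    """Running length of consecutive `value` entries, 0 elsewhere.

    Hypothetical helper wrapping the shift/cumsum/cumcount steps.
    """
    # A new run starts wherever the value differs from the previous one.
    run_id = (s != s.shift()).cumsum()
    # 1-based position inside each run of equal values.
    position = s.groupby(run_id).cumcount() + 1
    # Keep the position only where the value matches, otherwise 0.
    return (s == value).astype(int) * position


s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])
print(run_length_of(s).tolist())
```

Passing value=2 with a Series such as pd.Series([1, 2, 3, 1, 4, 2, 2, 3, 2, 2, 2, 1]) counts runs of 2 instead.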


2016-03-26 06:11:15 jezrael

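I assume this needs one full pass over the rows for each of `cumsum`, `groupby`, `cumcount`, and `s1 != 1`. Does the fact that it takes 4 passes make it slower than a (hypothetical) pandas method that could do everything in one go? (Of course, I realize that even so it is still much faster than a pure-Python loop.) – max

I think it is faster/better because it uses pandas functions, even though it takes 4 passes. – jezrael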
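
For anyone who wants to measure the question raised in the comments, a single-pass plain-Python reference might look like the sketch below. It is not from the thread; the function name running_count_loop is made up, and no timing claim is attached to it. Timing it against the groupby version with timeit on realistic data would show which one is actually faster.

```python
import pandas as pd


def running_count_loop(s, value=1):
    """Single pass over the values, resetting the counter on any other value."""
    out = []
    count = 0
    for x in s.tolist():
        # Extend the current run, or reset to 0 when the value differs.
        count = count + 1 if x == value else 0
        out.append(count)
    return pd.Series(out, index=s.index)


s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])
print(running_count_loop(s).tolist())
```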

Not the prettiest way (and probably not optimal), but the following gets the job done (about 4.5 times faster than the other answer):

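import pandas as pd

s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])

def consecutive_n(s, n=1):
    # reindex() replaces the original [s.index] lookup, which raises
    # KeyError on missing labels in recent pandas versions
    a = s[s == n].cumsum().reindex(s.index).fillna(0) / n
    b = a[a.diff() > 1]
    c = (b.rank() - b).reindex(s.index).fillna(0).cumsum()
    # the lambda replaces negative values with 0
    return (a + c).apply(lambda x: max(x, 0)).astype(int)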

>>> consecutive_n(s, n=1) 
0  0 
1  1 
2  0 
3  1 
4  2 
5  0 
6  0 
7  1 
8  2 
9  3 
10 0 
dtype: int64

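Some explanation of the intermediate values:
a: cumulative count of the occurrences of 1 (or n) over the whole series.
c: the "reset" amount that has to be added to a each time a different number appears between the 1s (or n's).
Return value: the lambda replaces the negative numbers produced by a + c with 0.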
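
To make the intermediate values concrete, the sketch below prints a, c and a + c for the Series from the question. It assumes a recent pandas, so reindex(s.index) is used here in place of the [s.index] lookup, which raises on missing labels in newer versions; that substitution is mine, not the answer author's.

```python
import pandas as pd

s = pd.Series([5, 1, 4, 1, 1, 2, 3, 1, 1, 1, 4])
n = 1

# a: cumulative count of n at the positions holding n, 0 elsewhere.
a = s[s == n].cumsum().reindex(s.index).fillna(0) / n
# b: entries of a that jump by more than 1 from the previous position.
b = a[a.diff() > 1]
# c: cumulative correction that is added to a.
c = (b.rank() - b).reindex(s.index).fillna(0).cumsum()

print(a.astype(int).tolist())        # [0, 1, 0, 2, 3, 0, 0, 4, 5, 6, 0]
print(c.astype(int).tolist())        # [0, 0, 0, -1, -1, -1, -1, -3, -3, -3, -3]
print((a + c).astype(int).tolist())  # [0, 1, 0, 1, 2, -1, -1, 1, 2, 3, -3]
```

In this example the first run of 1s has length one. On an input such as [1, 1, 0, 1, 0, 1] the same steps appear to yield 0 rather than 1 for the last element, so it may be worth checking the function against one of the groupby versions before relying on it.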

Edit: changed the code slightly so that it works for any positive integer. For example:

>>> t = pd.Series([1, 2, 3, 1, 4, 2, 2, 3, 2, 2, 2, 1]) 
>>> consecutive_n(t, 2) 
0  0 
1  1 
2  0 
3  0 
4  0 
5  1 
6  2 
7  0 
8  1 
9  2 
10 3 
11 0 
dtype: int64


2016-03-26 05:16:07
