2016-06-01 146 views
0

我想操纵一个熊猫df,这样我就可以计算出一列中某个数据点出现的频率,在另一列发生特定事件之后。下面的伪代码可能总结得最好。任何帮助将非常感激!熊猫列操纵

import datetime 
import time 
import pandas as pd 

# Set number of rows to skip 
rows_to_skip = 0 
# Rows to use after skipped rows 
rows_to_use = 10000 

# Read the file (Adjust arguments accordingly) 
data = pd.read_csv('example.csv',skiprows=rows_to_skip, error_bad_lines=False, nrows=rows_to_use, low_memory=False) 

# Add headers when skipping rows 
data.columns = ["X","Y","Z"] 

# Psuedo Code Below 

for variable in data['X']: 
    if variable > 0: 
     # Count number of times the following conditions are met in all subsequent rows: 
     condition 1) Y > 0 
     condition 2) Z <= Z of the row where variable was > 0 

     # Then I want to add the total count to a new column, and have it in the same row as X when the "variable" > 0. 

任何帮助?

回答

1

假设你想为每一个实例,其中X>0之间的案件数,而不是计数为整个DataFrame其余各X>0后:

您可以创建一个新的column,指示该X>0条件结果为True,.fillna(method='ffill').groupby()。那么您只需要.apply()即可获得grouplen(),其他条件为True

一些样本数据:

df = pd.DataFrame(data=np.random.randint(-10, 10, size=(100, 3)), columns=list('XYZ')) 

    X Y Z 
0 -3 6 -7 
1 -4 -10 -1 
2 9 -10 -9 
3 5 0 -8 
4 -2 1 -8 

步骤如下:

df['condition'] = df.index.to_series().where(df.X > 0).fillna(method='ffill') 
df['count'] = df.groupby('condition').apply(lambda x: len(x[(x.Y>0) & (x.Z > x.Z.iloc[0])])) 

获得:

X Y Z condition count 
0 -3 6 -7  NaN NaN 
1 -4 -10 -1  NaN NaN 
2 9 -10 -9  2.0 0.0 
3 5 0 -8  3.0 0.0 
4 -2 1 -8  3.0 NaN 
5 6 -6 -3  5.0 1.0 
6 0 6 3  5.0 NaN 
7 -6 -7 -6  5.0 NaN 
8 7 -2 -5  8.0 0.0 
9 0 -1 5  8.0 NaN 
10 5 8 -3  10.0 0.0 
11 -2 -2 1  10.0 NaN 
12 3 4 2  12.0 1.0 
13 -5 1 -9  12.0 NaN 
14 -7 2 6  12.0 NaN 
15 1 -10 6  15.0 0.0 
16 1 -8 6  16.0 0.0 
17 -4 -9 -8  16.0 NaN 
18 -9 4 6  16.0 NaN 
19 5 -6 2  19.0 0.0 
20 5 7 -1  20.0 0.0 
21 2 -2 -3  21.0 0.0 
22 -6 -10 -2  21.0 NaN 
23 -7 -9 3  21.0 NaN 
24 -8 7 -8  21.0 NaN 
25 3 -3 6  25.0 0.0 
26 1 -6 -3  26.0 1.0 
27 -4 6 -1  26.0 NaN 
28 6 -4 9  28.0 0.0 
29 -8 2 1  28.0 NaN 
.. .. .. ..  ... ... 
70 -5 7 -6  68.0 NaN 
71 6 6 -7  71.0 1.0 
72 -3 0 3  71.0 NaN 
73 -5 3 2  71.0 NaN 
74 -6 -8 8  71.0 NaN 
75 1 0 -4  75.0 0.0 
76 7 -9 -5  76.0 0.0 
77 1 0 -1  77.0 0.0 
78 5 9 -2  78.0 0.0 
79 -8 -9 -6  78.0 NaN 
80 2 -3 3  80.0 3.0 
81 -7 -5 8  80.0 NaN 
82 -4 -5 -7  80.0 NaN 
83 -3 5 -6  80.0 NaN 
84 -5 1 4  80.0 NaN 
85 -1 6 7  80.0 NaN 
86 -7 4 4  80.0 NaN 
87 -7 -4 -1  80.0 NaN 
88 -2 -8 2  80.0 NaN 
89 4 6 4  89.0 0.0 
90 4 -10 -8  90.0 0.0 
91 -7 -9 5  90.0 NaN 
92 5 3 -1  92.0 0.0 
93 6 6 6  93.0 0.0 
94 9 -2 0  94.0 1.0 
95 -1 5 5  94.0 NaN 
96 2 8 -9  96.0 2.0 
97 -6 7 -4  96.0 NaN 
98 -1 7 -8  96.0 NaN 
99 -4 0 -1  96.0 NaN 
+0

斯特凡,谢谢。很有帮助。但是我试图计算在X> 0的实例之后但在X> 0的下一行之前两个条件同时满足的次数(data.Y> 0和data.Z <= row.Z)。再次感谢! –

+0

这就是我认为的第一个例子。 – Stefan

+0

我已经更新,只留下解决您的问题的第一个例子。请让我知道,如果你有任何问题。 – Stefan