2015-04-20 67 views
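Resampling a pandas DataFrame with coefficients

I have a DataFrame with the following columns: {'day', 'measurement'}

A single day may have several measurements (or none at all).

For example: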

day  | measurement 
1  |  20.1 
1  |  20.9 
3  |  19.2 
4  |  20.0 
4  |  20.2 

and a dictionary of coefficients: coef={-1:0.2, 0:0.6, 1:0.2}

My goal is to resample the data and take a coefficient-weighted average (missing data should be omitted).
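Written out, the target is roughly the following. The notation is assumed here for illustration: $m_d$ is the mean of the measurements on day $d$, $c_i$ is the coefficient for offset $i$, and $W_d$ is the set of offsets whose day has data.

```latex
\hat{m}_d = \frac{\sum_{i \in W_d} c_i \, m_{d-i}}{\sum_{i \in W_d} c_i},
\qquad
W_d = \{\, i \in \{-1, 0, 1\} : \text{day } d - i \text{ has at least one measurement} \,\}
```

Dividing by the sum of the coefficients actually used is what keeps the result a proper average when a neighbouring day is missing.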

This is the code I wrote to compute it:

window = [-1, 0, 1]
mask = df['day'] == d  # d is the day currently being resampled
present = [i for i in window if (df['day'] == d - i).any()]  # offsets whose day has data
df.loc[mask, 'resampled_measurement'] = sum(coef[i] * df.loc[df['day'] == d - i, 'measurement'].mean() for i in present)
df.loc[mask, 'resampled_measurement'] /= sum(coef[i] for i in present)

For the sample data above, the output should be:

Day measurement 
1 20.500 
2 19.850 
3 19.425 
4 19.875 
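As a sanity check on these numbers, here is a minimal pure-Python sketch that works from the daily means of the example table. The names `daily_mean` and `weighted_day` are made up for illustration, and it is far too slow for real data; it only spells out the arithmetic.

```python
# Daily means taken from the example table (day 2 has no measurements).
daily_mean = {1: (20.1 + 20.9) / 2, 3: 19.2, 4: (20.0 + 20.2) / 2}
coef = {-1: 0.2, 0: 0.6, 1: 0.2}


def weighted_day(d):
    """Coefficient-weighted mean around day d, skipping days without data."""
    # Keep only the offsets whose day actually has a measurement.
    offsets = [i for i in coef if d - i in daily_mean]
    total = sum(coef[i] * daily_mean[d - i] for i in offsets)
    weight = sum(coef[i] for i in offsets)
    return total / weight


expected = {d: round(weighted_day(d), 3) for d in range(1, 5)}
print(expected)
```

For day 4, for instance, only offsets 0 and 1 have data, so the value is (0.6 * 20.1 + 0.2 * 19.2) / 0.8.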

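The problem is that my code takes forever to run, and I'm fairly sure there is a better way to resample with coefficients.

Any advice would be greatly appreciated!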

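Could you help me understand how the coefficients translate into the expected output above? My understanding is that, for example, on day 4 you would want '(0.2 * 19.2 + 0.6 * 20.1) / 0.8', which is '19.875', not '19.97'. It would help if you could walk through the calculation for day 3 or day 4. –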

My mistake, thanks @SAnand –

@UriGoren Are the results for days 2 and 3 as expected? I think you should update those! – Zero

Answer

Here is one possible solution for what you are looking for:

In [1]: import numpy as np
   ...: import pandas as pd

# This is your data
In [2]: data = pd.DataFrame({
   ...:     'day': [1, 1, 3, 4, 4],
   ...:     'measurement': [20.1, 20.9, 19.2, 20.0, 20.2]
   ...: })

# Pre-compute every day's average
In [3]: measurement = data.groupby('day')['measurement'].mean()

# Fill the gaps: days without measurements become NaN
In [4]: measurement = measurement.reindex(np.arange(data.day.min(), data.day.max() + 1))

In [5]: coef = pd.Series({-1: 0.2, 0: 0.6, 1: 0.2})

# Create a matrix with the time-shifted measurements
In [6]: matrix = pd.DataFrame({key: measurement.shift(key) for key, val in coef.items()})

In [7]: matrix
Out[7]:
       -1     0     1
day
1     NaN  20.5   NaN
2    19.2   NaN  20.5
3    20.1  19.2   NaN
4     NaN  20.1  19.2

# Take a weighted average of the matrix
In [8]: (matrix * coef).sum(axis=1)/(matrix.notnull() * coef).sum(axis=1) 
Out[8]: 
day 
1 20.500 
2 19.850 
3 19.425 
4 19.875 
dtype: float64
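
For reuse, the same steps could be wrapped in a function. This is only a sketch: the name `resample_with_coef` and its parameters are made up for illustration and are not part of pandas, and it assumes integer day numbers.

```python
import pandas as pd


def resample_with_coef(df, coef, day_col="day", value_col="measurement"):
    """Coefficient-weighted average of daily means, ignoring missing days."""
    weights = pd.Series(coef, dtype=float)
    # Mean per day, reindexed so that days without measurements become NaN.
    daily = df.groupby(day_col)[value_col].mean()
    daily = daily.reindex(range(int(daily.index.min()), int(daily.index.max()) + 1))
    # One column per offset: column k holds the daily mean from k days earlier.
    shifted = pd.DataFrame({k: daily.shift(k) for k in weights.index})
    # NaN entries are skipped by sum(), so they add nothing to the numerator.
    numerator = (shifted * weights).sum(axis=1)
    # Only the coefficients of days that have data count towards the total weight.
    denominator = (shifted.notna().astype(float) * weights).sum(axis=1)
    return numerator / denominator


data = pd.DataFrame({
    "day": [1, 1, 3, 4, 4],
    "measurement": [20.1, 20.9, 19.2, 20.0, 20.2],
})
result = resample_with_coef(data, {-1: 0.2, 0: 0.6, 1: 0.2})
print(result.round(3))
```

A day whose whole window is empty ends up as 0 / 0, so it comes out as NaN rather than raising an error.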