Calculating the mean of the previous n elements in a column in pandas

I have the following DataFrame:

df1
index  year  week    a    b    c
  -10  2017    10   45   26   19
   -9  2017    11   37   23   14
   -8  2017    12   21   66   19
   -7  2017    13   47   36   92
   -6  2017    14   82   65   18
   -5  2017    15   68   68   19
   -4  2017    16   30   95   24
   -3  2017    17   21   15   94
   -2  2017    18   67   30   16
   -1  2017    19   10   13   13
    0  2017    20   26   22   18
    1  2017    21  NaN  NaN  NaN
    2  2017    22  NaN  NaN  NaN
    3  2017    23  NaN  NaN  NaN
    4  2017    24  NaN  NaN  NaN
  ...
   53  2018    20  NaN  NaN  NaN

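For each empty cell, I need to compute the mean of the previous n values in that column and write the result into the cell. n is the number of index values from zero downward. For example, for the first empty cell in column a, I have to compute the mean of the values between index 0 and -10. Then for the next cell, between 1 and -9, and so on. The same goes for columns b and c. The calculation always starts at index = 1.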

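The problem is that the number of columns like a, b, c can vary. I do know, however, that these columns will always come after the week column. Is it possible to apply these calculations to an arbitrary number of columns, given that they are always located after the week column?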

I tried hard to find something, but couldn't find anything suitable.

UPD: In case it helps, the maximum number of rows from index = 0 onward is 53.

Source

2017-07-04 Yana Dolyuk

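When you say "then for the next cell, between '1' and '-9', and so on", does that mean a) computing the mean between '-9' and '0' and ignoring the NaN in '1', or b) computing the mean between '-9' and '1' using the new value computed for '1' in the previous "iteration"? – jdehesa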

@jdehesa, yes, I need to use the new value in cell '1', as you described in b). –

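You can actually use the loc slicing operator and then drop week to get only the a, b, c columns: df1.loc[:, 'week':].drop('week', axis=1). I don't think there is a pure pandas solution (unless some pandas wizard comes up with one) for this kind of moving average, because you want to average over previously computed averages, so you will probably have to use a Python loop. If performance is critical, you could look at Cython or Numba to speed up the loop. –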
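The suggestion in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the commenter's own code: the helper name `fill_recursive_mean`, the small demo frame, and the window size `n = 3` are all made up for the example. It also assumes that at least one known value precedes every NaN.

```python
import numpy as np
import pandas as pd


def fill_recursive_mean(df, after="week", n=11):
    """Hypothetical helper: fill NaNs in every column after `after`.

    Rows are processed top to bottom, so a value filled earlier
    takes part in the means computed for later rows.
    """
    out = df.copy()
    # Slice from 'week' to the last column, then drop 'week' itself.
    value_cols = out.loc[:, after:].drop(after, axis=1).columns
    for col in value_cols:
        values = out[col].to_numpy(dtype=float, copy=True)
        for i in range(len(values)):
            if np.isnan(values[i]):
                # Mean of up to n values directly above position i.
                values[i] = values[max(0, i - n):i].mean()
        out[col] = values
    return out


# Illustrative data, not the data from the question.
demo = pd.DataFrame(
    {
        "year": [2017] * 5,
        "week": [18, 19, 20, 21, 22],
        "a": [3.0, 6.0, 9.0, np.nan, np.nan],
        "b": [1.0, 2.0, 3.0, np.nan, np.nan],
    },
    index=pd.Index([-2, -1, 0, 1, 2], name="index"),
)
filled = fill_recursive_mean(demo, n=3)
print(filled)
```

Because the column list is derived from the position of week, the same call works no matter how many value columns follow it.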

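You can play around with pandas and NumPy a bit. Assuming you know the position of the week column (and even if you don't, a simple search will give you the index), say week is the 3rd column, you can do this: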

import numpy as np
import pandas as pd

# data is your DataFrame
column_list = list(data.columns)[3:]  # every column after 'week'
for column_name in column_list:
    # work on a writable NumPy copy of the column
    column = data[column_name].to_numpy(dtype=float, copy=True)
    for index in range(column.shape[0]):
        # iterate over the entries in the column
        if np.isnan(column[index]):
            # the window also covers the current NaN, which nanmean ignores
            column[index] = np.nanmean(column.take(range(index - 10, index + 1), mode='wrap'))
    data[column_name] = column  # write the filled values back

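This is not an elegant solution, but it should work. It replaces every NaN entry with the mean of the previous 10 entries. Note that mode='wrap' wraps around to the end of the column when fewer than 10 entries precede the NaN. If you don't want the wrap-around and would rather average only the entries that actually precede it (fewer than 10 near the top), take just those, like:
column[index] = np.nanmean(column[max(0, index - 10):index + 1])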
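To make the difference between the two windows concrete, here is a small sketch. The array and the window width `k = 2` are invented for the illustration (the answer uses 10), and the variable names are hypothetical.

```python
import numpy as np

k = 2  # illustrative window width (the answer above uses 10)
column = np.array([5.0, np.nan, 1.0, 9.0])
index = 1  # position of the NaN to fill

# Wrapped window: positions -1, 0, 1, so the last element (9.0) leaks in.
wrapped = column.take(np.arange(index - k, index + 1), mode="wrap")
wrapped_mean = np.nanmean(wrapped)

# Clipped window: positions 0 and 1 only, so just 5.0 contributes.
clipped = column[max(0, index - k):index + 1]
clipped_mean = np.nanmean(clipped)

print(wrapped_mean, clipped_mean)
```

With the wrapped window the value from the end of the column influences the result, which is rarely what you want for time-ordered data; the clipped slice avoids that.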

Hope this helps!

Source

2017-07-04 12:47:55

This can be done as follows:

n = 11 # in the example of your explanation 
df = df1.loc[range(1,df1.index[-1]+1)] # select rows from index 1 above

df should look like this:

       year  week    a    b    c
index
1      2017    21  NaN  NaN  NaN
2      2017    22  NaN  NaN  NaN
3      2017    23  NaN  NaN  NaN
4      2017    24  NaN  NaN  NaN

Then:

for s in list(df.index):  # iterate through the rows with NaN values
    for i in range(2, df.columns.size):  # iterate through the value columns ('a', 'b', 'c' or more)
        df1.loc[s, df.columns[i]] = df1.loc[range(s - n, s), df.columns[i]].sum() / n
print(df1)

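Note that here I also followed your example and assumed that year will always be the first column and week the second (with index as the DataFrame index), so that every column after week gets selected.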

Output:

       year  week          a          b          c
index
-10    2017    10  45.000000  26.000000  19.000000
-9     2017    11  37.000000  23.000000  14.000000
-8     2017    12  21.000000  66.000000  19.000000
-7     2017    13  47.000000  36.000000  92.000000
-6     2017    14  82.000000  65.000000  18.000000
-5     2017    15  68.000000  68.000000  19.000000
-4     2017    16  30.000000  95.000000  24.000000
-3     2017    17  21.000000  15.000000  94.000000
-2     2017    18  67.000000  30.000000  16.000000
-1     2017    19  10.000000  13.000000  13.000000
0      2017    20  26.000000  22.000000  18.000000
1      2017    21  41.272727  41.727273  31.454545
2      2017    22  40.933884  43.157025  32.586777
3      2017    23  41.291510  44.989482  34.276484
4      2017    24  43.136193  43.079434  35.665255
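If the position of week is not fixed, the hard-coded column offset can be avoided. The following is a sketch of one possible variant, not the answerer's code: it looks up week with `columns.get_loc` and fills all value columns of a row at once. The frame is rebuilt from the rows shown in the question (index -10 to 4 only), and the names `known`, `start` and `value_cols` are assumptions introduced for the example.

```python
import numpy as np
import pandas as pd

# Known rows from the question (index -10..0) plus four empty rows (1..4).
known = {
    "a": [45, 37, 21, 47, 82, 68, 30, 21, 67, 10, 26],
    "b": [26, 23, 66, 36, 65, 68, 95, 15, 30, 13, 22],
    "c": [19, 14, 19, 92, 18, 19, 24, 94, 16, 13, 18],
}
df1 = pd.DataFrame(
    {"year": 2017, "week": list(range(10, 25))},
    index=pd.RangeIndex(-10, 5, name="index"),
)
for name, values in known.items():
    df1[name] = values + [np.nan] * 4

n = 11
# Every column to the right of 'week', however many there are.
start = df1.columns.get_loc("week") + 1
value_cols = df1.columns[start:]

for s in df1.index[df1.index >= 1]:
    # Label slice s-n .. s-1 (inclusive); rows filled earlier are included.
    df1.loc[s, value_cols] = df1.loc[s - n:s - 1, value_cols].mean().to_numpy()

print(df1.round(6))
```

On these rows the sketch should give the same filled values as the output above, since it computes the same 11-row means in the same order.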

Source

2017-07-04 12:48:58
