2016-11-13 54 views

Speeding up an apply function on a DataFrame

I have a population DataFrame like the one below:

   RegionName     State  2000-01  2000-02  2000-03  2000-04  ...  2016-10  2016-11  2016-12
0  New York       NY         204      300      300      124  ...      456      566      344
1  Mountain View  CA         204      300      300      124  ...      456      566      344

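The dataset has ~10K rows. For each year from 2000 to 2016, I want to add a column for every quarter holding the average population over that quarter's three months.

I wrote a function to apply to the DataFrame, as follows: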

import numpy as np

def quarterize(row):
    quarter_to_months_map = {
        'q1': ['01', '02', '03'],
        'q2': ['04', '05', '06'],
        'q3': ['07', '08', '09'],
        'q4': ['10', '11', '12'],
    }
    for year in range(2000, 2017):
        year = '{}'.format(year)
        for quarter in quarter_to_months_map.keys():
            # Collect the three monthly values for this year and quarter
            values = []
            for month in quarter_to_months_map[quarter]:
                values.append(row['{}-{}'.format(year, month)])
            # Store the NaN-aware mean under a name like '2000q1'
            row['{}{}'.format(year, quarter)] = np.nanmean(values)
    return row

df = df.apply(quarterize, axis = 1) 

This works fine for smaller datasets, but for the ~10K-row dataset it takes ~10 min. Is there a way to make this more efficient and faster?

Answer


Yes. Never operate on rows; operate on columns instead.

import numpy as np
import pandas as pd
import random

# Random sample data: 1000 rows, one column per month for 2000 through 2009
df = pd.DataFrame([[random.randint(150, 300) for x in range(12 * 10)] for _ in range(1000)],
                  columns=['{}-{:02d}'.format(year, month) for month in range(1, 13) for year in range(2000, 2010)])

quarter_to_months_map = {
    'q1': ['01', '02', '03'],
    'q2': ['04', '05', '06'],
    'q3': ['07', '08', '09'],
    'q4': ['10', '11', '12'],
}

for year in range(2000, 2010):
    for quarter, months in quarter_to_months_map.items():
        # Names of the three month columns that make up this quarter
        months = ['{}-{}'.format(year, month) for month in months]
        # Row-wise mean over whole columns; mean() skips NaN by default, like np.nanmean
        df['{}{}'.format(year, quarter)] = df[months].mean(axis=1)
Something along those lines.
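
As a further illustration of working on columns rather than rows, here is a minimal sketch that computes every quarterly mean in one NumPy call. It is not part of the original answer. It assumes the month columns are named 'YYYY-MM' and that all twelve months exist for every year in the range; the helper name add_quarterly_means and the synthetic demo frame are hypothetical.

```python
import numpy as np
import pandas as pd


def add_quarterly_means(df, first_year, last_year):
    """Hypothetical helper: append '<year>q<n>' columns holding quarterly means."""
    years = range(first_year, last_year + 1)
    month_cols = ['{}-{:02d}'.format(y, m) for y in years for m in range(1, 13)]
    quarter_cols = ['{}q{}'.format(y, q) for y in years for q in range(1, 5)]

    # Select the months in chronological order so that every consecutive
    # group of three columns is exactly one quarter.
    values = df[month_cols].to_numpy(dtype=float)

    # Reshape to (rows, quarters, 3 months) and average over the last axis,
    # ignoring NaN as np.nanmean did in the question.
    means = np.nanmean(values.reshape(len(df), len(quarter_cols), 3), axis=2)

    quarters = pd.DataFrame(means, columns=quarter_cols, index=df.index)
    return pd.concat([df, quarters], axis=1)


# Synthetic data (an assumption, standing in for the real population table)
rng = np.random.default_rng(0)
cols = ['{}-{:02d}'.format(y, m) for y in range(2000, 2017) for m in range(1, 13)]
demo = pd.DataFrame(rng.integers(150, 300, size=(1000, len(cols))), columns=cols)

result = add_quarterly_means(demo, 2000, 2016)
```

Building all the new columns first and attaching them with a single pd.concat also avoids inserting columns into the frame one at a time.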