How can I make my Python code run faster?

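I am working on code that loops over multiple netCDF files (large, ~28 GB). The netCDF files have several 4D variables [time, east-west, south-north, height] covering the whole domain. The goal is to loop over these files, visit every location in the domain for all of these variables, and store certain variables in one large array. When a file is missing or incomplete, I fill the values with 99.99. Right now I am only testing by looping over 2 daily netCDF files, but for some reason it is taking forever (~14 hours). I am not sure whether there is a way to optimize this code. I don't think Python should take this long for this task, but maybe the problem is Python or my code. Below is my code; I hope it is readable. Any suggestions on how to make it faster are greatly appreciated: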

#Domain to loop over 
k_space = np.arange(0,37) 
j_space = np.arange(80,170) 
i_space = np.arange(200,307) 

predictors_wrf=[] 
names_wrf=[] 

counter = 0 
cdate = start_date 
while cdate <= end_date: 
    if cdate.month not in month_keep: 
     cdate+=inc 
     continue 
    yy = cdate.strftime('%Y')   
    mm = cdate.strftime('%m') 
    dd = cdate.strftime('%d') 
    filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00' 
    for i in i_space: 
     for j in j_space: 
      for k in k_space: 
        if os.path.isfile(filename): 
         f = nc.Dataset(filename,'r') 
         times = f.variables['Times'][1:] 
         num_lines = times.shape[0] 
         if num_lines == 144: 
          u = f.variables['U'][1:,k,j,i] 
          v = f.variables['V'][1:,k,j,i] 
          wspd = np.sqrt(u**2.+v**2.) 
          w = f.variables['W'][1:,k,j,i] 
          p = f.variables['P'][1:,k,j,i] 
          t = f.variables['T'][1:,k,j,i] 
         if num_lines < 144: 
          print "partial files for WRF: "+ filename 
          u = np.ones((144,))*99.99 
          v = np.ones((144,))*99.99 
          wspd = np.ones((144,))*99.99 
          w = np.ones((144,))*99.99 
          p = np.ones((144,))*99.99 
          t = np.ones((144,))*99.99 
        else: 
         u = np.ones((144,))*99.99 
         v = np.ones((144,))*99.99 
         wspd = np.ones((144,))*99.99 
         w = np.ones((144,))*99.99 
         p = np.ones((144,))*99.99 
         t = np.ones((144,))*99.99 
         counter=counter+1 
        predictors_wrf.append(u) 
        predictors_wrf.append(v) 
        predictors_wrf.append(wspd) 
        predictors_wrf.append(w) 
        predictors_wrf.append(p) 
        predictors_wrf.append(t) 
        u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i) 
        v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i) 
        wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i) 
        w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i) 
        p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i) 
        t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i) 
        names_wrf.append(u_names) 
        names_wrf.append(v_names) 
        names_wrf.append(wspd_names) 
        names_wrf.append(w_names) 
        names_wrf.append(p_names) 
        names_wrf.append(t_names) 
    cdate+=inc

Source

2017-02-22 HM14

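You can use multiprocessing to handle several files at the same time. Split the k, j, i space across different processes and let each of them do its own task. – haifzhan

What is `nc.Dataset`? Also, before you can improve the speed, you need to know why it is slow. You need to profile your code and *measure*. –

That is how I read netCDF files in Python. I have a statement earlier in the code that is not shown here: import netCDF4 as nc – HM14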

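This is a lame first pass at tightening up your for loops. Since the file shape is only used once per file, you can move that handling outside the inner loops, which should cut down on the amount of data loading in the inner processing. I still don't get what counter and inc do, as they don't seem to be updated in the loop. As starting points, you definitely want to look into the performance of the repeated string concatenation and of appending to predictors_wrf and names_wrf.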

k_space = np.arange(0,37) 
j_space = np.arange(80,170) 
i_space = np.arange(200,307) 

predictors_wrf=[] 
names_wrf=[] 

counter = 0 
cdate = start_date 
while cdate <= end_date: 
    if cdate.month not in month_keep: 
     cdate+=inc 
     continue 
    yy = cdate.strftime('%Y')   
    mm = cdate.strftime('%m') 
    dd = cdate.strftime('%d') 
    filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00' 
    file_exists = os.path.isfile(filename) 
    if file_exists: 
     f = nc.Dataset(filename,'r') 
     times = f.variables['Times'][1:] 
     num_lines = times.shape[0] 
    for i in i_space: 
     for j in j_space: 
      for k in k_space: 
        if file_exists:  
         if num_lines == 144: 
          u = f.variables['U'][1:,k,j,i] 
          v = f.variables['V'][1:,k,j,i] 
          wspd = np.sqrt(u**2.+v**2.) 
          w = f.variables['W'][1:,k,j,i] 
          p = f.variables['P'][1:,k,j,i] 
          t = f.variables['T'][1:,k,j,i] 
         if num_lines < 144: 
          print "partial files for WRF: "+ filename 
          u = np.ones((144,))*99.99 
          v = np.ones((144,))*99.99 
          wspd = np.ones((144,))*99.99 
          w = np.ones((144,))*99.99 
          p = np.ones((144,))*99.99 
          t = np.ones((144,))*99.99 
        else: 
         u = np.ones((144,))*99.99 
         v = np.ones((144,))*99.99 
         wspd = np.ones((144,))*99.99 
         w = np.ones((144,))*99.99 
         p = np.ones((144,))*99.99 
         t = np.ones((144,))*99.99 
         counter=counter+1 
        predictors_wrf.append(u) 
        predictors_wrf.append(v) 
        predictors_wrf.append(wspd) 
        predictors_wrf.append(w) 
        predictors_wrf.append(p) 
        predictors_wrf.append(t) 
        u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i) 
        v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i) 
        wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i) 
        w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i) 
        p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i) 
        t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i) 
        names_wrf.append(u_names) 
        names_wrf.append(v_names) 
        names_wrf.append(wspd_names) 
        names_wrf.append(w_names) 
        names_wrf.append(p_names) 
        names_wrf.append(t_names) 
    cdate+=inc
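
On the appending point, one option is to preallocate a single array filled with 99.99 and overwrite only the slots for days whose file is present and complete. Missing or partial days then need no special-case fill code. This is only a sketch under assumptions: the helper names `make_store` and `store_day`, the tiny shapes, and the zero-filled stand-in data are hypothetical and not taken from the question.

```python
import numpy as np

FILL = 99.99      # fill value used in the question
N_TIMES = 144     # expected number of time steps per daily file

def make_store(n_days, n_points):
    """Array of shape (n_days, n_points, N_TIMES), pre-filled with FILL."""
    return np.full((n_days, n_points, N_TIMES), FILL)

def store_day(store, day_index, data):
    """Copy one day's (n_points, N_TIMES) block into the store.

    Returns True if the block was stored. Returns False, leaving the
    slot at FILL, when the data is missing or has the wrong shape.
    """
    if data is None or data.shape != store.shape[1:]:
        return False
    store[day_index] = data
    return True

# Tiny demonstration with stand-in data: 3 days, 4 grid points.
store = make_store(n_days=3, n_points=4)
ok0 = store_day(store, 0, np.zeros((4, N_TIMES)))  # complete day
ok1 = store_day(store, 1, np.zeros((4, 100)))      # partial file
ok2 = store_day(store, 2, None)                    # missing file
```

For the full domain in the question (37 x 90 x 107 points, six variables, 144 time steps) such an array would take roughly 2.5 GB per day in float64, so a smaller dtype or a reduced domain may be necessary.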

Source

2017-02-22 04:38:36 Selecsosi

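I don't have a lot of suggestions, but here are a few things to point out.

Don't open the file so many times

First, you define this filename variable, and then inside this loop (deep inside: three for-loops deep), you check whether the file exists and presumably open it there (I don't know what nc.Dataset does, but I'm guessing it must open the file and read it):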

filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00' 
    for i in i_space: 
     for j in j_space: 
      for k in k_space: 
        if os.path.isfile(filename): 
         f = nc.Dataset(filename,'r')

This is going to be pretty inefficient. If the file doesn't change, you can certainly open it once before all your loops.

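Try to use fewer for-loops

All of these nested for-loops compound the number of operations you need to perform. General suggestion: try to use numpy operations instead.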
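
As a rough sketch of what "use numpy operations" could look like here: instead of reading one (k, j, i) point per iteration, read the whole sub-domain of a variable with one slice and reshape it. The helper names `extract_block` and `point_names` are made up for illustration, and small random arrays stand in for `f.variables['U']` and `f.variables['V']`. netCDF4 variables accept the same slice syntax, but this has not been run against the asker's files, and masked values are not handled.

```python
import numpy as np

def extract_block(var, k_slice, j_slice, i_slice):
    """Read every time series in the sub-domain with a single slice.

    `var` is anything indexable as [time, k, j, i]: a NumPy array here,
    or (by assumption) a netCDF4 variable such as f.variables['U'].
    Returns an array of shape (n_points, n_times). Points are ordered
    with i slowest, then j, then k, like the original nested loops.
    """
    block = np.asarray(var[1:, k_slice, j_slice, i_slice])  # (t, k, j, i)
    # Reorder axes to (i, j, k, t), then flatten the three spatial axes.
    return block.transpose(3, 2, 1, 0).reshape(-1, block.shape[0])

def point_names(prefix, k_values, j_values, i_values):
    """Names like 'u_k_j_i', in the same order as extract_block rows."""
    return ['%s_%d_%d_%d' % (prefix, k, j, i)
            for i in i_values for j in j_values for k in k_values]

# Stand-in data: 5 time steps on a tiny 3 x 4 x 6 grid.
rng = np.random.RandomState(0)
u = rng.rand(5, 3, 4, 6)
v = rng.rand(5, 3, 4, 6)

k_sl, j_sl, i_sl = slice(0, 2), slice(1, 3), slice(2, 5)
u_pts = extract_block(u, k_sl, j_sl, i_sl)
v_pts = extract_block(v, k_sl, j_sl, i_sl)
wspd_pts = np.sqrt(u_pts**2 + v_pts**2)  # computed for all points at once
u_names = point_names('u', range(0, 2), range(1, 3), range(2, 5))
```

Note that this groups the results by variable (all u rows, then all v rows, and so on) rather than interleaving u, v, wspd, w, p, t per point as the original lists do. If the interleaved order matters downstream, the arrays would need to be stacked accordingly.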

Use cProfile

If you want to know why your programs are taking a long time, one of the best ways to find out is to profile them.
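
A minimal sketch of profiling with the standard-library `cProfile` and `pstats` modules, written for Python 3. The function `slow_sum` is a hypothetical stand-in for the real per-file processing; in practice you would wrap the call that runs your loops.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Hypothetical stand-in for the real per-file processing."""
    total = 0
    for x in range(n):
        total += x * x
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(200000)
profiler.disable()

# Sort by cumulative time and report the ten most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats('cumulative')
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

To profile a whole script without editing it, `python -m cProfile -s cumulative script.py` produces a similar report, where `script.py` is a placeholder for your own file.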

Source

2017-02-22 04:29:54 erewok

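For your question, I think multiprocessing will help a lot. I went through your code and have a few suggestions.

Instead of using the start time, use the filenames as the iterator in your code.

Write a function that finds all the filenames based on the dates and returns them as a list.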

def fileNames(start_date, end_date):
    # Build the list of daily WRF filenames between start_date and end_date.
    # Relies on the globals month_keep, inc, wrf_path and hour_str from the question.
    cdate = start_date
    fileNameList = []
    while cdate <= end_date:
        if cdate.month not in month_keep:
            cdate += inc
            continue
        yy = cdate.strftime('%Y')
        mm = cdate.strftime('%m')
        dd = cdate.strftime('%d')
        filename = wrf_path+'\\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
        fileNameList.append(filename)
        cdate += inc

    return fileNameList

Wrap the code that extracts your data and fills in 99.99 into a function whose input is a filename.

def dataExtraction(filename):
    # Results are collected locally and returned, because lists or counters
    # modified inside a worker process are not visible to the parent process.
    predictors_wrf = []
    names_wrf = []
    fill = np.ones((144,))*99.99  # used for missing or incomplete files
    f = None
    complete = False
    if os.path.isfile(filename):
        f = nc.Dataset(filename,'r')
        num_lines = f.variables['Times'][1:].shape[0]
        complete = (num_lines == 144)
        if num_lines < 144:
            print("partial files for WRF: "+ filename)
    for i in i_space:
        for j in j_space:
            for k in k_space:
                if complete:
                    u = f.variables['U'][1:,k,j,i]
                    v = f.variables['V'][1:,k,j,i]
                    wspd = np.sqrt(u**2.+v**2.)
                    w = f.variables['W'][1:,k,j,i]
                    p = f.variables['P'][1:,k,j,i]
                    t = f.variables['T'][1:,k,j,i]
                else:
                    # Missing or incomplete file: every variable gets the fill array.
                    u = v = wspd = w = p = t = fill
                predictors_wrf.extend([u, v, wspd, w, p, t])
                suffix = '_'+str(k)+'_'+str(j)+'_'+str(i)
                names_wrf.extend([name+suffix for name in ('u','v','wspd','w','p','t')])
    if f is not None:
        f.close()

    return list(zip(predictors_wrf, names_wrf))

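Use multiprocessing to do the work. Most computers have more than one CPU core, and when the job is CPU-heavy, multiprocessing helps increase the speed. In my previous experience, multiprocessing cut the time spent on huge datasets by up to 2/3.

Update: after testing my code and files again on Feb 25, 2017, I found that using 8 cores on a huge dataset saved me 90% of the elapsed time.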
```
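if __name__ == '__main__':
    from multiprocessing import Pool  # normally placed with the other imports at the top
    from datetime import datetime
    # fileNames() uses cdate.month and cdate += inc, so the dates must be datetime objects.
    start_date = datetime(2017, 1, 1)
    end_date = datetime(2017, 1, 15)
    file_list = fileNames(start_date, end_date)  # separate name, so the function is not overwritten
    p = Pool(4)  # number of worker processes to use
    results = p.map(dataExtraction, file_list)
    p.close()
    p.join()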
```
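Finally, pay attention to the data structure here, as it is fairly complicated. Hope this helps. Please leave a comment if you have any further questions.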

Source

2017-02-22 15:55:48
