
I am using Python. I have 100 zip files, and each zip file contains more than 100 XML files. From the XML files I create CSV files. Python, multiprocessing: how can I optimize the code and make it faster?

import csv
import zipfile
from multiprocessing import Process
from xml.etree.ElementTree import fromstring

def parse_xml_for_csv1(data, writer1):
    root = fromstring(data)
    for node in root.iter('name'):
        # writerow expects a sequence; wrap the single value in a list
        writer1.writerow([node.get('value')])

def create_csv1():
    # newline='' avoids blank rows in the CSV on Windows
    with open('output1.csv', 'w', newline='') as f1:
        writer1 = csv.writer(f1)

        for i in range(1, 101):  # 100 zip files: xml1.zip .. xml100.zip
            z = zipfile.ZipFile('xml' + str(i) + '.zip')
            # z.namelist() contains more than 100 xml files
            for finfo in z.namelist():
                data = z.read(finfo)
                parse_xml_for_csv1(data, writer1)


def create_csv2():
    with open('output2.csv', 'w', newline='') as f2:
        writer2 = csv.writer(f2)

        for i in range(1, 101):
            ...


if __name__ == "__main__":
    p1 = Process(target=create_csv1)
    p2 = Process(target=create_csv2)
    p1.start()
    p2.start()
    p1.join()
    p2.join()

Please tell me: how can I optimize my code and make it faster?


How big is each uncompressed XML file? And the CSVs you are writing? – goncalopp


goncalopp, the XML files are small (about 10 lines each). I only need 2 CSV files. – Olga


I would use lxml for the parsing and keep as much of the processing as possible at the C level: http://lxml.de/FAQ.html#id1 –
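A minimal sketch of that lxml suggestion, assuming the same <name value="..."> attribute structure as in the question (lxml is a third-party package, installed with pip install lxml):

    # Sketch only: drop-in replacement for parse_xml_for_csv1, using lxml
    # so that parsing and tree traversal run in compiled C code.
    from lxml import etree

    def parse_xml_for_csv1(data, writer1):
        root = etree.fromstring(data)    # parsing happens at the C level
        for node in root.iter('name'):   # C-level tree traversal
            writer1.writerow([node.get('value')])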

Answer


You just need to define a single method that takes parameters, and split the processing of the 100 .zip files across a given number of threads or processes. The more processes you add, the more CPUs you use, so it may well be faster with more than 2 processes (although disk I/O is likely to become the bottleneck at some point).

In the code below, I can switch to 4 or 10 processes without copy/pasting any code; each process handles a different range of zip files.

Your current code processes the same 100 files twice in parallel: that is slower than no multiprocessing at all!

# Reuses csv, zipfile, Process and parse_xml_for_csv1 from the question
def create_csv(start_index, step):
    # start_index//step numbers the output files 0, 1, 2, ...
    with open('output{0}.csv'.format(start_index//step), 'w', newline='') as f1:
        writer1 = csv.writer(f1)

        for i in range(start_index, start_index + step):
            z = zipfile.ZipFile('xml' + str(i) + '.zip')
            # z.namelist() contains more than 100 xml files
            for finfo in z.namelist():
                data = z.read(finfo)
                parse_xml_for_csv1(data, writer1)


if __name__ == "__main__":
    nb_files = 100
    nb_processes = 2  # raise to 4 or 8 depending on your machine

    step = nb_files // nb_processes
    lp = []
    for start_index in range(1, nb_files, step):
        p = Process(target=create_csv, args=(start_index, step))
        p.start()
        lp.append(p)
    for p in lp:
        p.join()
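
The same fan-out can also be written with multiprocessing.Pool, which manages the worker processes for you; a sketch, assuming the create_csv defined above (Pool.starmap needs Python 3.3+):

    # Sketch of the same split using a process pool instead of managing
    # Process objects by hand; assumes create_csv as defined above.
    from multiprocessing import Pool

    if __name__ == "__main__":
        nb_files = 100
        nb_processes = 2

        step = nb_files // nb_processes
        tasks = [(start, step) for start in range(1, nb_files, step)]
        with Pool(nb_processes) as pool:
            pool.starmap(create_csv, tasks)  # blocks until every chunk is done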