Tips for optimizing a Python filtering program

I have been working on a very simple program, the gist of which is as follows:

post = open(INPUTFILE1, "rb")
for line in post:
    cut = line.split(',')
    pre = open(INPUTFILE2, "rb")
    for otherline in pre:
        cuttwo = otherline.split(',')
        if cut[1] == cuttwo[1] and cut[3] == cuttwo[3] and cut[9] == cuttwo[9]:
            OUTPUTFILE.write(otherline)
            break
post.close()
pre.close()
OUTPUTFILE.close()

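Effectively, this takes two CSV files as input (a "pre" and a "post"). It looks at the first line of the "post" data and tries to find a line in the "pre" data that matches on columns 2, 4 and 10. If there is a match, it writes the "pre" line to a new file, then moves on to the next "post" line.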

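It works fine, but it takes forever. My "post" data may only have a few hundred lines (a thousand at most), but my "pre" data may have as many as 15 million. As a result, a run can take around 10 hours to finish.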

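I'm fairly new to Python, so I haven't learned many optimization techniques yet. Does anyone have any pointers on what I could try? Obviously I understand that the logjam happens when I search the entire "pre" data for a match. Is there a way to speed this up?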

Source

2012-12-09 user1889922

This is more a question for [Code Review](http://codereview.stackexchange.com/) than for SO.

If you've only got a few hundred lines at most, then use something like:

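from operator import itemgetter

# Columns 2, 4 and 10 (zero-based indices 1, 3, 9) form the match key.
key = itemgetter(1, 3, 9)

# Read the small file once and keep its keys in a set.
with open('smallfile') as fin:
    valid = set(key(line.split(',')) for line in fin)

# Stream the large file once, testing each row's key against the set.
with open('largerfile') as fin:
    lines = (line.split(',') for line in fin)
    for line in lines:
        if key(line) in valid:
            pass  # do something....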

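This avoids re-reading the large file for every line of the small one, and makes the most of Python's built-in types for efficient lookups.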
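Applied to the files in the question, the set-based approach might look like the sketch below. The function name `filter_pre_by_post` and its three path parameters are hypothetical, chosen only for illustration. Note one difference from the original loop: this writes every "pre" line whose key appears in "post", in "pre" order, rather than the first match per "post" line. It also assumes every line has at least 10 columns.

```python
from operator import itemgetter

# Zero-based indices of columns 2, 4 and 10, as in the question.
key = itemgetter(1, 3, 9)


def filter_pre_by_post(post_path, pre_path, out_path):
    """Copy every line of pre_path whose key also occurs in post_path.

    Hypothetical helper; returns the number of lines written.
    """
    # One pass over the small file: collect its keys in a set.
    with open(post_path) as post:
        wanted = set(key(line.rstrip('\n').split(',')) for line in post)

    written = 0
    # One pass over the large file: one set membership test per line.
    with open(pre_path) as pre, open(out_path, 'w') as out:
        for line in pre:
            if key(line.rstrip('\n').split(',')) in wanted:
                out.write(line)
                written += 1
    return written
```

With around a thousand "post" lines and 15 million "pre" lines, this reads the large file once instead of up to a thousand times, which is where most of the saving should come from.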

If you want to use the small file's entire line in the output when there is a match, then use a dictionary instead of a set:

from operator import itemgetter
key = itemgetter(1, 3, 9)
# Map each key to the full small-file line it came from.
with open('smallfile') as fin:
    valid = dict((key(line.split(',')), line) for line in fin)

Then your processing loop would look like this:

with open('largerfile') as fin:
    lines = (line.split(',') for line in fin)
    for line in lines:
        # Fetch the small-file line stored under this row's key, if any.
        otherline = valid.get(key(line), None)
        if otherline is not None:
            pass  # do something....
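As a rough, self-contained sketch of the dictionary variant, the generator below pairs each matching large-file line with the small-file line that shares its key. The name `matched_pairs` and its parameters are hypothetical. It assumes every line has at least 10 columns; if two small-file lines share a key, the later one wins.

```python
from operator import itemgetter

key = itemgetter(1, 3, 9)  # columns 2, 4 and 10, zero-based


def matched_pairs(small_path, large_path):
    """Yield (small_line, large_line) for each large-file line with a match.

    Hypothetical helper; both lines are yielded without trailing newlines.
    """
    # Map each key to the full small-file line; a later duplicate key
    # overwrites an earlier one.
    by_key = {}
    with open(small_path) as small:
        for line in small:
            line = line.rstrip('\n')
            by_key[key(line.split(','))] = line

    # Stream the large file once, doing one dictionary lookup per line.
    with open(large_path) as large:
        for line in large:
            line = line.rstrip('\n')
            small_line = by_key.get(key(line.split(',')))
            if small_line is not None:
                yield small_line, line
```

A caller could then loop with `for post_line, pre_line in matched_pairs('smallfile', 'largerfile'):` and write out whichever of the two lines it needs.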

Source

2012-12-09 18:58:32

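+1. Don't process the file repeatedly; do it just once and cache the results. @JonClements - I've added an example using a dictionary, so the whole line can be retrieved later if needed. Hope that's OK with you. – Blair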
