2017-05-22 30 views

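Finding the differences between two files is slow

I'm a real beginner at Python, but I'm trying to compare data extracted from two databases into files. In the script I use one dictionary per database's content, and whenever I find a row that differs I add it to the dictionary. The keys are the combination of the first two values (code and subcode), and the value is a list of the longCodes associated with that code/subcode combination. Overall my script works, but I wouldn't be surprised if it is horribly constructed and inefficient. The sample data being processed looks like this: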

0,0,83 
0,1,157 
1,1,158 
1,2,159 
1,3,210 
2,0,211 
2,1,212 
2,2,213 
2,2,214 
2,2,215 

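The idea is that the data should be in sync, but sometimes it isn't, and I want to detect the differences. In practice, when I extract the data from the databases, each file has more than 1 million lines. Performance doesn't seem that great (though maybe it is as good as can be expected): it takes about 35 minutes to process and produce the results. If there are any suggestions for improving performance, I'd gladly take them!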

import difflib, sys, csv, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
with open('masterDbCodes.lst','r') as f1, open('slaveDbCodes.lst','r') as f2:
    diff = difflib.ndiff(f1.readlines(),f2.readlines())
    for line in diff:
        if line.startswith('-'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if not codeSubCode in masterDb:
                masterDb[codeSubCode] = [(longCode)]
            else:
                masterDb[codeSubCode].append(longCode)
        elif line.startswith('+'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if not codeSubCode in slaveDb:
                slaveDb[codeSubCode] = [(longCode)]
            else:
                slaveDb[codeSubCode].append(longCode)

f1.close()
f2.close()

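I don't know whether this would be any faster, but the 'ordereddefaultdict' class defined at the beginning of [this answer](https://stackoverflow.com/a/4127426/355230) of mine to another question would let you get rid of the four lines starting with 'if not codeSubCode in xxxDb:' in each of the two cases and replace them with an unconditional 'xxxDb[codeSubCode].append(longCode)'. Note that you also don't need to close the two files; the 'with' does that automatically. – martineau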
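As a rough illustration of the comment above, here is a minimal sketch of the unconditional append. It does not reproduce the linked 'ordereddefaultdict' class; it swaps in the standard library's collections.defaultdict(list), which also keeps insertion order on Python 3.7 and later. The group_rows helper and the sample lines are hypothetical.

```python
import collections

def group_rows(lines):
    """Group 'code,subCode,longCode' lines by 'code,subCode' (hypothetical helper)."""
    grouped = collections.defaultdict(list)
    for line in lines:
        # Strip the trailing space/newline, then split into exactly three fields.
        code, sub_code, long_code = line.rstrip().split(",", 2)
        # No membership test needed: a missing key starts out as an empty list.
        grouped[code + "," + sub_code].append(long_code)
    return grouped

sample = ["2,1,212 \n", "2,2,213 \n", "2,2,214 \n"]
print(dict(group_rows(sample)))
```

With a plain OrderedDict, 'xxxDb.setdefault(codeSubCode, []).append(longCode)' should achieve the same effect in one line.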

Answers


Try this:

import difflib, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
with open('masterDbCodes.lst','r') as f1, open('slaveDbCodes.lst','r') as f2:
    diff = difflib.ndiff(f1.readlines(),f2.readlines())
    for line in diff:
        if line.startswith('-'):
            # Split the line only once and reuse the pieces
            sp = line[2:].split(",", 2)
            codeSubCode = ",".join(sp[:2])
            longCode = ",".join(sp[2:]).rstrip()
            # Try the append first; only a missing key needs a new list
            try:
                masterDb[codeSubCode].append(longCode)
            except KeyError:
                masterDb[codeSubCode] = [longCode]
        elif line.startswith('+'):
            sp = line[2:].split(",", 2)
            codeSubCode = ",".join(sp[:2])
            longCode = ",".join(sp[2:]).rstrip()
            try:
                slaveDb[codeSubCode].append(longCode)
            except KeyError:
                slaveDb[codeSubCode] = [longCode]

You might want to point out the changes, perhaps explaining in words what you changed and why.


I did try your code changes; processing took 33 minutes, so there was a slight improvement. Thanks for your input. – ssbsts


So I ended up using different logic to come up with a more efficient script. Many thanks to https://stackoverflow.com/users/100297/martijn-pieters for the help.

#!/usr/bin/python
# Python 2 script: there the csv module expects files opened in binary mode ('rb'/'wb')

import csv, sys, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
outFile = open('results.csv', 'wb')

# First find entries in SLAVE that don't match MASTER
with open('masterDbCodes.lst', 'rb') as master:
    reader1 = csv.reader(master)
    # A set of row tuples makes each membership test fast
    master_rows = {tuple(r) for r in reader1}

with open('slaveDbCodes.lst', 'rb') as slave:
    reader = csv.reader(slave)

    for row in reader:
        if tuple(row) not in master_rows:
            code = row[0]
            subCode = row[1]
            codeSubCode = ",".join([code, subCode])
            longCode = row[2]
            if not codeSubCode in slaveDb:
                slaveDb[codeSubCode] = [(longCode)]
            else:
                slaveDb[codeSubCode].append(longCode)

# Now find entries in MASTER that don't match SLAVE
with open('slaveDbCodes.lst', 'rb') as slave:
    reader1 = csv.reader(slave)
    slave_rows = {tuple(r) for r in reader1}

with open('masterDbCodes.lst', 'rb') as master:
    reader = csv.reader(master)

    for row in reader:
        if tuple(row) not in slave_rows:
            code = row[0]
            subCode = row[1]
            codeSubCode = ",".join([code, subCode])
            longCode = row[2]
            if not codeSubCode in masterDb:
                masterDb[codeSubCode] = [(longCode)]
            else:
                masterDb[codeSubCode].append(longCode)

This solution processes the data (twice, in fact) in about 10 seconds.
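
For reference, the same set-based idea can be sketched for Python 3 so that each file is read only once. This is a variant under stated assumptions, not the script above: load_rows, group_missing and diff_files are hypothetical helper names, every row is assumed to have exactly three comma-separated fields, and both files are assumed to fit in memory.

```python
import collections
import csv

def load_rows(path):
    """Read a CSV file into a list of row tuples, keeping file order."""
    # Python 3: open in text mode with newline='' as the csv docs recommend.
    with open(path, newline='') as f:
        return [tuple(row) for row in csv.reader(f)]

def group_missing(rows, other_rows):
    """Group the rows that are absent from other_rows by 'code,subCode'."""
    other = set(other_rows)  # set membership tests are O(1) on average
    grouped = collections.OrderedDict()
    for code, sub_code, long_code in rows:  # assumes exactly three fields
        if (code, sub_code, long_code) not in other:
            grouped.setdefault(code + "," + sub_code, []).append(long_code)
    return grouped

def diff_files(master_path, slave_path):
    """Return (rows only in master, rows only in slave), each grouped by key."""
    master_rows = load_rows(master_path)
    slave_rows = load_rows(slave_path)
    return (group_missing(master_rows, slave_rows),
            group_missing(slave_rows, master_rows))
```

Calling diff_files('masterDbCodes.lst', 'slaveDbCodes.lst') would return the two dictionaries. The speedup over the difflib version most likely comes from replacing sequence alignment with hash lookups, at the cost of ignoring line order and duplicate rows.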