2016-06-07 39 views

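Fast comparison of 2 text files using Python

I have two large text files (17 MB at the moment, but they could grow to gigabytes), so I don't want to load them into memory, because their size could exceed my available memory.

The code I have written so far looks like this: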

import os
import time

def stopIfFileExist(filename):
    # Refuse to overwrite an existing diff file.
    if os.path.isfile(filename):
        raise Exception("%s already exists" % filename)

def compareDump(before_filename, after_filename, diff_filename):
    """
    Compare 2 dumps generated via makeDump(output_filename) and generate
    a file containing the differences
     -before_filename : (string) filename of the first dump
     -after_filename : (string) filename of the second dump
     -diff_filename : (string) filename of the diff
    """

    stopIfFileExist(diff_filename)

    # Count the lines once so progress can be reported as a percentage.
    with open(after_filename, "r") as afterFile:
        num_lines = sum(1 for line in afterFile)
    one_percent = num_lines / float(100)

    diff = []

    start = time.time()

    with open(after_filename, "r") as afterFile:
        counter = 0
        for a_line in afterFile:
            print("completion : %.9f percents" % (counter / float(one_percent)))
            counter = counter + 1
            diff.append(a_line)
            # The whole "before" file is rescanned for every "after" line.
            with open(before_filename, "r") as beforeFile:
                for b_line in beforeFile:
                    if a_line.rstrip() == b_line.rstrip():
                        # Line exists in both dumps, so it is not a difference.
                        diff.pop()
                        break

    end = time.time()
    print("task completed in %s seconds" % (end - start))

    with open(diff_filename, "a") as diffFile:
        for line in diff:
            diffFile.write(line)

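What I would like to do is remove from beforeFile the lines that were successfully matched (i.e. when if a_line.rstrip() == b_line.rstrip(): is triggered).

However, since I am reading the file at that very moment, I don't see how to do it.

Any ideas?

Thanks.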


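You are reading a GB file once for every line of another GB file. That will never be fast. Think about the content of the files to find a more efficient solution. If nothing else, consider a database. – pacholik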
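As a rough illustration of the database suggestion in the comment above, here is a minimal sketch using Python's built-in sqlite3 module. The function name compare_dump_sqlite, the table name before_lines and the db_filename parameter are all hypothetical, chosen only for this example. It assumes the same matching rule as the question (lines are equal after rstrip()) and keeps the index on disk, so memory use should stay small at the cost of extra disk I/O.

```python
import sqlite3


def compare_dump_sqlite(before_filename, after_filename, diff_filename,
                        db_filename):
    # db_filename is a scratch SQLite file (hypothetical parameter).
    conn = sqlite3.connect(db_filename)
    try:
        # The PRIMARY KEY gives an on-disk index for fast lookups.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS before_lines (line TEXT PRIMARY KEY)"
        )
        # Stream the "before" file into the table, one stripped line per row.
        with open(before_filename, "r") as before_file:
            conn.executemany(
                "INSERT OR IGNORE INTO before_lines (line) VALUES (?)",
                ((line.rstrip(),) for line in before_file),
            )
        conn.commit()

        # Stream the "after" file and keep the lines the table does not know.
        with open(after_filename, "r") as after_file, \
                open(diff_filename, "w") as diff_file:
            for line in after_file:
                row = conn.execute(
                    "SELECT 1 FROM before_lines WHERE line = ?",
                    (line.rstrip(),),
                ).fetchone()
                if row is None:
                    diff_file.write(line)
    finally:
        conn.close()
```

Each lookup is an index search instead of a full scan of the other file. Whether this is faster in practice than an in-memory approach depends on the data and the disk, so treat it as a starting point only.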


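Let me rephrase: 'what is the fastest way to compare two large files'. If I could delete a line once it has been 'found', the next iteration would take a little less time, and so on. –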


That is still *O(n²)*. – pacholik

Answers
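One possible way around the O(n²) behaviour discussed in these comments is to read each file only once and test membership in a set. The sketch below is an assumption-laden illustration, not the asker's method. The names line_digest and compare_dump_hashed are made up for this example, and it stores a 16-byte BLAKE2 digest per line instead of the line itself. Memory then grows with the number of distinct lines in the "before" file rather than with their length, so it may still be too much for very large files with many short lines. Hash collisions are extremely unlikely but not impossible.

```python
import hashlib


def line_digest(line):
    # Hash the right-stripped line so trailing whitespace is ignored,
    # mirroring the rstrip() comparison used in the question.
    data = line.rstrip().encode("utf-8")
    return hashlib.blake2b(data, digest_size=16).digest()


def compare_dump_hashed(before_filename, after_filename, diff_filename):
    # Pass 1: remember one small digest per line of the "before" file.
    with open(before_filename, "r") as before_file:
        seen = {line_digest(line) for line in before_file}

    # Pass 2: stream the "after" file and write lines with an unknown digest.
    with open(after_filename, "r") as after_file, \
            open(diff_filename, "w") as diff_file:
        for line in after_file:
            if line_digest(line) not in seen:
                diff_file.write(line)
```

Both files are read exactly once, so the running time is roughly linear in the total number of lines. The output should match what the nested loops in the question produce: every line of the "after" file that has no equal line in the "before" file.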


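Using the test code below, I was able to compare two 20-megabyte files in a little over 3 minutes.

Every 10,000 lines I insert a random number, so you can see the differing numbers in the results.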

import random
import difflib
import os
import time

start = time.time()

NUM_LINES = 10000000 // 4
t1 = 'test1'
t2 = 'test2'

def write_test_file(path):
    # Write the numbers 1..NUM_LINES-1, one per line. Every 10,000th line
    # is multiplied by a random factor so that the two files differ.
    with open(path, 'w') as f:
        for number in range(1, NUM_LINES):
            if number % 10000 == 0:
                r = random.randint(1, number)
            else:
                r = 1
            f.write(str(number * r) + '\n')

# Start from fresh test files on every run.
for path in (t1, t2):
    if os.path.exists(path):
        os.remove(path)
    write_test_file(path)

# Note: readlines() loads both files completely into memory.
with open(t1) as f1, open(t2) as f2:
    lines1 = f1.readlines()
    lines2 = f2.readlines()

for l in difflib.unified_diff(lines1, lines2, lineterm=''):
    print(l.strip())

print('Execution took: {:.2f} seconds'.format(time.time() - start))

I pasted the output on GitHub because it is obscenely long.
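Because the difflib approach reads both files into memory, it may not suit the gigabyte-sized case from the question. If the two files are known to line up position by position, as the generated test files here do, a streaming comparison could be enough. The sketch below rests on that assumption. positional_diff is a hypothetical helper name, and unlike difflib it will not resynchronise after an inserted or deleted line.

```python
from itertools import zip_longest


def positional_diff(filename_a, filename_b):
    # Yield (line_number, line_a, line_b) wherever the two files differ.
    # Only one line per file is held in memory at a time.
    with open(filename_a, "r") as file_a, open(filename_b, "r") as file_b:
        pairs = zip_longest(file_a, file_b)  # pads the shorter file with None
        for number, (line_a, line_b) in enumerate(pairs, start=1):
            if line_a != line_b:
                yield number, line_a, line_b
```

Looping over positional_diff('test1', 'test2') and printing each tuple would list the lines that received different random factors.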