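Fast comparison of 2 text files using Python

I have two large text files (17 MB right now, but they could grow to gigabytes), so I don't want to load them into memory, because their size may exceed my available RAM.

The code I have written so far looks like this: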
import os
import time


def stopIfFileExist(filename):
    # Refuse to overwrite an existing diff file.
    if os.path.isfile(filename):
        raise Exception("%s already exists" % filename)


def compareDump(before_filename, after_filename, diff_filename):
    """
    Compare 2 dumps generated via makeDump(output_filename) and generate
    a file containing the differences
    -before_filename : (string) filename of the first dump
    -after_filename : (string) filename of the second dump
    -diff_filename : (string) filename of the diff
    """
    stopIfFileExist(diff_filename)
    # Count the lines once so that progress can be reported.
    with open(after_filename, "r") as afterFile:
        num_lines = sum(1 for line in afterFile)
    one_percent = num_lines / float(100)
    diff = []
    start = time.time()
    with open(after_filename, "r") as afterFile:
        counter = 0
        for a_line in afterFile:
            print("completion : %.9f percents" % (counter / float(one_percent)))
            counter = counter + 1
            diff.append(a_line)
            # The whole "before" file is rescanned for every "after" line.
            with open(before_filename, "r") as beforeFile:
                for b_line in beforeFile:
                    if a_line.rstrip() == b_line.rstrip():
                        diff.pop()
                        break
    end = time.time()
    print("task completed in %s seconds" % (end - start))
    with open(diff_filename, "a") as diffFile:
        for line in diff:
            diffFile.write(line)
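What I want to do is remove from beforeFile each line that was successfully matched (that is, when `if a_line.rstrip() == b_line.rstrip():` is triggered), so that later iterations have fewer lines to scan.

However, since I am reading the file at that moment, I don't see how to do it.

Any ideas?

Thanks.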
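You are reading a gigabyte file once for every line of another gigabyte file. That will never be fast. Think about the contents of the files to find a more efficient solution. Failing that, consider a database. – pacholik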
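As a rough illustration of the database suggestion, here is a minimal sketch using the standard-library `sqlite3` module. The function name `compare_dump_sqlite`, the table name `before_lines`, and the `db_filename` parameter (a fresh scratch file on disk) are assumptions made for this example, not something taken from the question. It keeps the lines of the first dump in an indexed on-disk table, so neither file has to fit in memory and each lookup is an index search instead of a full rescan.

```python
import sqlite3


def compare_dump_sqlite(before_filename, after_filename, diff_filename,
                        db_filename):
    """Write the lines of after_filename that do not occur in before_filename.

    db_filename is a hypothetical scratch database; it should not exist yet.
    """
    conn = sqlite3.connect(db_filename)
    try:
        # PRIMARY KEY creates an index, so lookups do not scan the table.
        conn.execute("CREATE TABLE before_lines (line TEXT PRIMARY KEY)")
        with open(before_filename, "r") as before_file:
            # OR IGNORE skips lines that are already stored.
            conn.executemany(
                "INSERT OR IGNORE INTO before_lines VALUES (?)",
                ((line.rstrip(),) for line in before_file))
        conn.commit()

        with open(after_filename, "r") as after_file, \
                open(diff_filename, "w") as diff_file:
            for line in after_file:
                found = conn.execute(
                    "SELECT 1 FROM before_lines WHERE line = ?",
                    (line.rstrip(),)).fetchone()
                if found is None:
                    diff_file.write(line)
    finally:
        conn.close()
```

Whether this beats other approaches depends on the data; the point is only that the inner loop over the first file disappears.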
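Let me rephrase: 'the fastest way to compare two big files', given that if I could remove a line once it has been 'found', the next iteration would take a little less time, and so on. –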
That is still *O(n²)*. – pacholik
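To make the complexity point concrete: removing matched lines only shrinks the inner loop by a constant factor, whereas replacing it with a hash lookup makes the comparison roughly linear in the total number of lines. The sketch below is one way to do that, assuming that a 20-byte SHA-1 digest per distinct line of the first file fits in memory (memory then grows with the number of distinct lines, not with their length). The names `line_digest` and `compare_dump_linear` are made up for this example.

```python
import hashlib


def line_digest(line):
    """Return a fixed-size digest of a line, ignoring trailing whitespace."""
    return hashlib.sha1(line.rstrip()).digest()


def compare_dump_linear(before_filename, after_filename, diff_filename):
    """Write the lines of after_filename that do not occur in before_filename.

    Returns the number of lines written to diff_filename.
    """
    # One pass over the first file: keep one digest per distinct line.
    with open(before_filename, "rb") as before_file:
        seen = {line_digest(line) for line in before_file}

    # One pass over the second file: a set lookup replaces the inner loop.
    missing = 0
    with open(after_filename, "rb") as after_file, \
            open(diff_filename, "wb") as diff_file:
        for line in after_file:
            if line_digest(line) not in seen:
                diff_file.write(line)
                missing += 1
    return missing
```

Like the original code, this compares lines after `rstrip()` and writes every unmatched line of the second file, duplicates included. If even the set of digests is too large for memory, this approach no longer applies and a disk-based method would be needed instead.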