I found some promising code on activestate.com for sorting large files. I'm trying to run it with the default Python 2.6.5 interpreter on Ubuntu 10.04. When I run it against a small test file, I get the error traceback below. I asked for help on activestate.com, but that thread has been silent for over 18 months. Does anyone see an obvious solution? How can I sort a large file with Python?
Thanks.
## {{{ http://code.activestate.com/recipes/576755/ (r3)
# based on Recipe 466302: Sorting big files the Python 2.4 way
# by Nicolas Lehuen
import os
from tempfile import gettempdir
from itertools import islice, cycle
from collections import namedtuple
import heapq
Keyed = namedtuple("Keyed", ["key", "obj"])
def merge(key=None, *iterables):
    # based on code posted by Scott David Daniels in c.l.p.
    # http://groups.google.com/group/comp.lang.python/msg/484f01f1ea3c832d
    if key is None:
        keyed_iterables = iterables
    else:
        keyed_iterables = [(Keyed(key(obj), obj) for obj in iterable)
                           for iterable in iterables]
    for element in heapq.merge(*keyed_iterables):
        yield element.obj

def batch_sort(input, output, key=None, buffer_size=32000, tempdirs=None):
    if tempdirs is None:
        tempdirs = []
    if not tempdirs:
        tempdirs.append(gettempdir())
    chunks = []
    try:
        with open(input, 'rb', 64*1024) as input_file:
            input_iterator = iter(input_file)
            for tempdir in cycle(tempdirs):
                current_chunk = list(islice(input_iterator, buffer_size))
                if not current_chunk:
                    break
                current_chunk.sort(key=key)
                output_chunk = open(os.path.join(tempdir, '%06i' % len(chunks)), 'w+b', 64*1024)
                chunks.append(output_chunk)
                output_chunk.writelines(current_chunk)
                output_chunk.flush()
                output_chunk.seek(0)
        with open(output, 'wb', 64*1024) as output_file:
            output_file.writelines(merge(key, *chunks))
    finally:
        for chunk in chunks:
            try:
                chunk.close()
                os.remove(chunk.name)
            except Exception:
                pass
Error traceback:
Traceback (most recent call last):
  File "./batch_sort.py", line 108, in <module>
    batch_sort(args[0],args[1],options.key,options.buffer_size,options.tempdirs)
  File "./batch_sort.py", line 54, in batch_sort
    output_file.writelines(merge(key, *chunks))
  File "./batch_sort.py", line 30, in merge
    yield element.obj
AttributeError: 'str' object has no attribute 'obj'
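The traceback points at the `merge()` generator: when `key` is None, `keyed_iterables` is just the raw chunk files, so `heapq.merge()` yields plain lines that have no `.obj` attribute. A minimal sketch of a fix (keeping the recipe's `Keyed` namedtuple) is to yield the merged elements directly in the no-key branch:

```python
import heapq
from collections import namedtuple

Keyed = namedtuple("Keyed", ["key", "obj"])

def merge(key=None, *iterables):
    if key is None:
        # heapq.merge yields the raw elements here; there is no
        # Keyed wrapper to unwrap, so yield them as-is.
        for element in heapq.merge(*iterables):
            yield element
    else:
        # Wrap each element so heapq.merge compares on key(obj),
        # then unwrap before yielding.
        keyed_iterables = [(Keyed(key(obj), obj) for obj in iterable)
                           for iterable in iterables]
        for element in heapq.merge(*keyed_iterables):
            yield element.obj
```

With this change, `merge(None, chunk1, chunk2)` merges pre-sorted chunks of plain lines without raising, while the keyed path behaves as before.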
You don't make clear what "huge" means, so I'll take it to mean "really big". If you are truly sorting huge files, you probably don't want to do it in Python: its interpreted nature combined with dynamic memory allocation is likely to make it slow going. Go find a standalone sort utility; those are designed to sort massive amounts of data as fast as possible. –
Good point. I define "huge" as UTF-8 files with 14 million or more lines, averaging 175 characters per line, totaling between 2.5 and 7.5 GB (many of the files contain 3-byte UTF-8 characters). The alternative is to use Linux sort from a bash script/terminal. An older version of this code performed reasonably well, but this one should be faster. – tahoar