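Parsing a huge structured file in Python 2.7

I am new to both Python and bioinformatics. I am processing a structured file of almost 50 GB and need to write parts of it back out, so I would like to gather some advice from you.

The file looks like this (the format is actually called FASTQ):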

@Machinename:~:Team1:atcatg 1st line. 
atatgacatgacatgaca   2nd line.  
+        3rd line.   
[email protected]#$#%$    4th line.

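These four lines repeat in order; each group of four lines belongs together like a team. I also have nearly 30 candidate DNA sequences, for example atgcat and tttagc.

What I do is go through the huge file with each candidate DNA sequence and check whether the candidate is similar to the team's DNA sequence, where "similar" means one mismatch is allowed (e.g. taaaaa = aaaata). If they are similar or identical, I store the record in a dictionary to write it out later. The key is the candidate DNA sequence, and the value is a list holding the record's 4 lines in their original order.

So what I have done is: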
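The four-line grouping described above can be read one record at a time without loading the whole file. The following is only a minimal sketch, assuming every record is exactly four lines as the question states; the name `read_fastq_records` is made up for illustration and is not part of the question or of any library.

```python
from itertools import islice


def read_fastq_records(handle):
    """Yield each record as a list of its 4 lines (newlines kept).

    Assumption: every record is exactly 4 lines, as described in the question.
    """
    while True:
        record = list(islice(handle, 4))  # pull the next 4 lines lazily
        if not record:
            break  # clean end of file
        if len(record) < 4:
            raise ValueError("truncated record: %r" % (record,))
        yield record
```

Because `islice` only advances the file iterator, at most one record is held in memory at any time.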
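The asker's comparison function is not shown, so the following is a hypothetical stand-in rather than the actual implementation. It assumes "one mismatch allowed" means two equal-length sequences that differ at no more than one position (Hamming distance of at most 1). Note that under this reading the example pair taaaaa and aaaata differs at two positions, so the asker's own definition may be looser than this sketch.

```python
def within_one_mismatch(a, b):
    """Return True if a and b have equal length and differ at <= 1 position.

    Hypothetical helper; the asker's real definition of "similar" is not shown.
    """
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > 1:
                return False  # stop early, no need to scan the rest
    return True
```

Returning as soon as a second mismatch is seen keeps the per-record cost low, which matters when the check runs for roughly 30 candidates against every record of a 50 GB file.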

def myfunction(str1, str2):
    # Returns True if the two sequences are similar (one mismatch allowed).
    pass  # body not shown

hugefile = 'hugefile'
diction = {}
mylist = ['candidate dna sequences1', 'dna2', 'dna3', 'dna4']  # ... and so on
f = open(hugefile)
while True:
    line = f.readline()
    if not line:
        break
    if "machine name" in line:
        teamseq = line.strip().split(':')[-1]
        # The header line plus the next three lines form one record.
        record = [line, f.readline(), f.readline(), f.readline()]
        for candidate in mylist:
            if myfunction(candidate, teamseq):
                # The same team DNA may be repeated, so append to any existing list.
                diction.setdefault(candidate, []).extend(record)
f.close()

wf = open(hugefile + ".out", 'w')
for candidate in mylist:  # dna1, dna2, dna3
    wf.write(''.join(diction.get(candidate, [])))
wf.close()

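My function does not use any global variables (I think I am happy with my function), whereas the dictionary is a global variable that takes in all the data and creates a large number of list instances. The code is simple, but it is very slow and a huge pain for both CPU and memory, even though I am using pypy.

So, any tips for writing the records out in order, line by line?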

2014-06-27 jk_kim

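Take a look at [BioPython](http://biopython.org/wiki/Main_Page) – Korem

Do you have to store the whole huge file, or just write it out? Also, it looks like your code snippet is missing quotes or underscores. – cmd

Just write it out. I have just checked the quotes etc... could you be more specific? I looked at biopython, but it is not easy to see how to parse this format... –

I would suggest opening the input and output files at the same time and writing the output as you step through the input. As it stands, you are reading 50 GB into memory and then writing it out. That is both slow and unnecessary.

In pseudocode: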

with open(hugefile) as fin, open(hugefile + ".out", 'w') as fout:
    for line in fin:
        if "machine name" in line:
            # read this line and the following 3 lines from fin as one 4-line record
            # process that record
            # write the record to fout
            # the input record is no longer needed -- allow it to be garbage collected...

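As I have outlined it, each 4-line record is written out as it is encountered and then disposed of. If you need to refer to diction.keys() for previous records, keep only the minimum necessary, as a set(), to cut down the total size of the in-memory data.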
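The streaming approach outlined above might look roughly like the sketch below. It makes several assumptions: records are exactly four lines, the team sequence is the last colon-separated field of the header line, and `one_mismatch` is a hypothetical stand-in for the asker's comparison function. The names `filter_fastq` and `one_mismatch` are invented for illustration.

```python
from itertools import islice


def one_mismatch(a, b):
    # Hypothetical stand-in for the asker's comparison function:
    # equal length and at most one differing position.
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


def filter_fastq(fin, fout, candidates):
    """Copy matching 4-line records from fin to fout as they are read.

    Returns the set of candidates that matched at least one record, so only
    the keys (not the records themselves) stay in memory.
    """
    matched = set()
    while True:
        record = list(islice(fin, 4))
        if len(record) < 4:
            break  # end of input; a trailing partial record is ignored
        team_seq = record[0].strip().split(':')[-1]
        hits = [c for c in candidates if one_mismatch(c, team_seq)]
        if hits:
            fout.writelines(record)  # write immediately, then let it go
            matched.update(hits)
    return matched


# Possible usage (file names are placeholders):
# with open("hugefile") as fin, open("hugefile.out", "w") as fout:
#     filter_fastq(fin, fout, ["atgcat", "tttagc"])
```

With this layout the records come out in input order rather than grouped by candidate. If grouping is required, one option is to open a separate output file per candidate (about 30 handles) and write each matching record to the corresponding file.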

2014-06-27 14:58:31 dawg

'four_line_list = [f.readline() for _ in range(4)]' – dawg

Thank you very much! –
