2017-06-16 75 views

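Using recursion to split a text file based on a delimiter in Python

I have a plaintext file that I want to split into multiple files. The file is formatted like this: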

-----BEGIN CERTIFICATE----- 
text1 
text2 
text3 
-----END CERTIFICATE----- 
-----BEGIN CERTIFICATE----- 
text4 
text5 
text6 
-----END CERTIFICATE----- 
-----BEGIN CERTIFICATE----- 
text7 
text8 
text9 
-----END CERTIFICATE----- 
-----BEGIN CERTIFICATE----- 
text10 
text11 
text12 
-----END CERTIFICATE----- 

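I want to split it into one chunk per certificate, each running from (and including) the BEGIN line to (and including) the END line.

This is what I have written so far: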

with open('/Users/arl/Downloads/bundle.pem', 'r') as cert_file: 
    cert = cert_file.readlines() 

def parse_file(filename=None, variable=None): 
    with open(filename, "w") as variable: 
     for line in cert: 
      if "BEGIN" in line: 
       variable.write(line) 
       continue 
      elif "END" in line: 
       variable.write(line) 
       parse_file(filename="int1.pem", variable="int1_file") 
       parse_file(filename="int2.pem", variable="int2_file") 
       parse_file(filename="end.pem", variable="end_file") 
      print line.rstrip() 
      variable.write(line) 
     variable.close() 

parse_file(filename="root.pem", variable="root_file") 

The error I currently get:

parse_file(filename="int1.pem", variable="int1_file") 
    File "splitter.py", line 12, in parse_file 
    parse_file(filename="int1.pem", variable="int1_file") 
    File "splitter.py", line 17, in parse_file 
    variable.close() 
RuntimeError: maximum recursion depth exceeded while calling a Python object 

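Also, only root.pem and int1.pem are written (and both have the same content, which they should not).

What do I need to do to parse the file and write each new chunk to a new file? At what point in the loop should the function call itself with new arguments?

Thanks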


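Your function only ever reads from the global cert, so each recursive call simply reads it again from the beginning, over and over, hence the infinite recursion. –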


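It is entirely unclear what you are trying to do and why you are using recursion. Also, you overwrite 'variable' with the file handle, so that parameter has no effect. Do you want root.pem, int1.pem, int2.pem and end.pem to each contain one section of bundle.pem? – Stuart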


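@AlanLeuthard: Yes, I see that now. I am trying to work out how to continue reading from where I last left off, rather than starting again from the beginning of the file. – ARL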
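
To illustrate the point made in these comments: an iterator over the lines keeps its position, so any function handed the same iterator carries on from where the previous reader stopped. The sketch below is one possible way to keep the recursive shape of the original attempt. The helper name write_blocks, the temporary directory and the in-memory sample are assumptions made for illustration; they are not part of the question.

```python
import os
import tempfile


def write_blocks(lines, filenames):
    """Write one BEGIN..END block per filename, recursing once per block.

    `lines` must be an iterator (not a list), so each recursive call
    resumes where the previous call stopped reading.
    """
    if not filenames:
        return  # base case: no output names left
    wrote_block = False
    inside = False
    with open(filenames[0], "w") as out:
        for line in lines:
            if line.startswith("-----BEGIN"):
                inside = True
            if inside:
                out.write(line)
                if line.startswith("-----END"):
                    wrote_block = True
                    break  # stop here; the iterator remembers its position
    if wrote_block:
        # same, partly consumed iterator; one fewer filename
        write_blocks(lines, filenames[1:])
    else:
        os.remove(filenames[0])  # input ran out before a full block was found


# Hypothetical sample data standing in for bundle.pem
sample = (
    "-----BEGIN CERTIFICATE-----\ntext1\n-----END CERTIFICATE-----\n"
    "-----BEGIN CERTIFICATE-----\ntext2\n-----END CERTIFICATE-----\n"
)
out_dir = tempfile.mkdtemp()
names = [os.path.join(out_dir, n) for n in ("root.pem", "end.pem")]
write_blocks(iter(sample.splitlines(True)), names)
```

The recursion depth here equals the number of blocks, so it terminates, although a plain loop does the same job with less machinery.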

Answers


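I don't see that recursion is useful here. Instead, you can make a list of output filenames and step through them with iter and next, opening a file when you encounter "BEGIN" and closing that same file when you encounter "END".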

def parse_file(input_file, output_files):
    # iterator that hands out one output filename per block
    filenames = iter(output_files)
    with open(input_file, 'r') as cert_file:
        for line in cert_file:
            if "BEGIN" in line:
                output = open(next(filenames), 'w')
            output.write(line)
            if "END" in line:
                output.close()
    output.close()  # just in case not already closed

input_file = '/Users/arl/Downloads/bundle.pem'
output_files = ['root.pem', 'int1.pem', 'int2.pem', 'end.pem']
parse_file(input_file=input_file, output_files=output_files)

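This raises an error if there is any whitespace or other content outside the 'BEGIN' ... 'END' blocks (for example, a blank line after one 'END' and before the next 'BEGIN'), because the line would be written to a file that is already closed. If that is a problem, you can add a line that checks whether the output file is open: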

def parse_file(input_file, output_files):
    filenames = iter(output_files)
    output = None
    with open(input_file, 'r') as cert_file:
        for line in cert_file:
            if "BEGIN" in line:
                output = open(next(filenames), 'w')
            # skip lines that fall outside a block
            if output and not output.closed:
                output.write(line)
            if "END" in line:
                output.close()
    if output:  # stays None when the input contains no block
        output.close()

Or equivalently, using nested loops:

def parse_file(input_file, output_files):
    output = None
    with open(input_file, 'r') as cert_file:
        for output_file in output_files:
            # the inner loop resumes where the previous pass stopped reading
            for line in cert_file:
                if "BEGIN" in line:
                    output = open(output_file, 'w')
                if output and not output.closed:
                    output.write(line)
                if "END" in line:
                    output.close()
                    break  # breaks out of inner loop and gets next output_file
    if output:  # stays None when the input contains no block
        output.close()
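
The same idea can also be split into two smaller pieces: a generator that yields one block at a time, and a loop that pairs each block with a filename. This is only a sketch under a few assumptions: the names iter_blocks and split_bundle are made up here, the markers are matched with startswith rather than a substring test (so a body line that happens to contain "END" is not mistaken for a marker), and the demo writes into a temporary directory.

```python
import os
import tempfile


def iter_blocks(lines):
    """Yield each BEGIN..END block as a list of lines, markers included."""
    block = None
    for line in lines:
        if line.startswith("-----BEGIN"):
            block = []  # start collecting a new block
        if block is not None:
            block.append(line)
            if line.startswith("-----END"):
                yield block
                block = None  # ignore everything until the next BEGIN


def split_bundle(input_file, output_files):
    """Write the n-th block of input_file to the n-th name in output_files."""
    with open(input_file) as bundle:
        # zip stops at the shorter input, so surplus blocks or names are ignored
        for name, block in zip(output_files, iter_blocks(bundle)):
            with open(name, "w") as out:
                out.writelines(block)


# Demo on a throwaway bundle; the blank line between the blocks is skipped
work_dir = tempfile.mkdtemp()
bundle_path = os.path.join(work_dir, "bundle.pem")
with open(bundle_path, "w") as f:
    f.write(
        "-----BEGIN CERTIFICATE-----\ntext1\n-----END CERTIFICATE-----\n"
        "\n"
        "-----BEGIN CERTIFICATE-----\ntext2\n-----END CERTIFICATE-----\n"
    )
targets = [os.path.join(work_dir, n) for n in ("root.pem", "end.pem")]
split_bundle(bundle_path, targets)
```

Using with for every output file means nothing is left open if an exception occurs partway through.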

Using a regular expression:

import re 

content = ''' 
-----BEGIN CERTIFICATE----- 
text1 
text2 
text3 
-----END CERTIFICATE----- 
-----BEGIN CERTIFICATE----- 
text4 
text5 
text6 
-----END CERTIFICATE----- 
-----BEGIN CERTIFICATE----- 
text7 
text8 
text9 
-----END CERTIFICATE----- 
-----BEGIN CERTIFICATE----- 
text10 
text11 
text12 
-----END CERTIFICATE----- 
''' 

content = content.strip('\n')

# raw string, no escaping needed for '-'; re.DOTALL lets '.' match newlines
pattern = re.compile(r'-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----', re.DOTALL)
certs = pattern.findall(content)
for cert in certs:
    # with a single group, findall returns the captured body as a string
    cert_content = cert.strip()
    print(cert_content)
    print()
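
The pattern above captures only the body of each certificate. If the BEGIN and END lines should be kept, as the question asks, one option is to drop the capture group so that findall returns the whole match. A minimal sketch, where BLOCK_RE, extract_certs and the sample string are names and data assumed for illustration:

```python
import re

# No capture group, so findall returns the entire match, markers included.
# re.DOTALL lets '.' span newlines; '.*?' is non-greedy so each match
# stops at the first END marker.
BLOCK_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----\n?",
    re.DOTALL,
)


def extract_certs(text):
    """Return every complete certificate block found in text, in order."""
    return BLOCK_RE.findall(text)


# Hypothetical sample with a stray blank line between the two blocks
sample = (
    "-----BEGIN CERTIFICATE-----\ntext1\n-----END CERTIFICATE-----\n"
    "\n"
    "-----BEGIN CERTIFICATE-----\ntext2\n-----END CERTIFICATE-----\n"
)
blocks = extract_certs(sample)
for number, block in enumerate(blocks):
    print("block %d has %d lines" % (number, len(block.splitlines())))
```

Each element of blocks could then be written to its own file. This approach reads the whole bundle into memory, which is fine for certificate bundles of ordinary size.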

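Similar to the other answer, but it is more flexible about what appears between BEGIN and END and does not require you to list the file names manually. The script renames the last file it outputs. As others have said, there is no need for recursion. (Recursion will drive you mad.)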

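import os

collect = False
file_number = -1
with open('big_file.txt') as big:
    for line in big:
        if line.startswith('-----BEGIN'):
            collect = True
            file_number += 1
            little = open('int%s.pem' % file_number, 'w')
            little.write(line)  # keep the BEGIN line in the output
        elif line.startswith('-----END'):
            little.write(line)  # keep the END line in the output
            little.close()
            collect = False
        elif collect:
            # only lines inside a block are written
            little.write(line)

# rename the last file written, if any block was found
if file_number >= 0:
    os.rename('int%s.pem' % file_number, 'end.pem')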
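
The rename step gives the last file the name end.pem, but the first one is still called int0.pem. If the root.pem / intN.pem / end.pem naming from the question is wanted, the names could be generated once the number of blocks is known. The helper below is hypothetical, and the choice to call a lone certificate root.pem is an assumption:

```python
def output_names(count):
    """Return `count` file names: root.pem, int1.pem, ..., end.pem."""
    if count <= 0:
        return []
    if count == 1:
        return ["root.pem"]  # assumption: a single certificate is the root
    # everything between the first and the last block is an intermediate
    middle = ["int%d.pem" % n for n in range(1, count - 1)]
    return ["root.pem"] + middle + ["end.pem"]


print(output_names(4))  # ['root.pem', 'int1.pem', 'int2.pem', 'end.pem']
```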