2013-10-10
0

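Python: a faster way to read a file and create files

I have a file with many fields separated by the "|" (pipe) character. I want to read this file and create as many files as there are distinct values of a particular field. Here is an example: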

L219| |791|P|PIPPO|PLUTO|1|18081926|I262|XYZXCV12D35F345S|| 
L219| |1241|P|PAPERINO|TOPOLINO|2|21041937|F335|FVGHWU54G56S456U|| 
L219| |437793|G|TOPOLANDIA SAS|L219|12345678910| 
L219| |437794|G|PAPERANDIA|L219|10987654321| 

If the fourth field equals "G", the record goes into "file_pg.txt"; otherwise, if it equals "P", it goes into "file_pf.txt".

I wrote the code below (I am new to Python), but it takes a very long time to run on a large file (300 MB). Do you have any suggestions for improving it?

file = open('D:\\mydirectory\\soggetti.txt','r') 
file_pf = open("D:\\mydirectory\\file_pf.txt","w") 
file_pg = open("D:\\mydirectory\\file_pg.txt","w") 
file_pf.close() 
file_pg.close() 

i = 0 
with file: 
    for line in file: 
     i = 0 
     c = 0 
     while i < len(line): 
      carattere = line[i] 
      if carattere == "|": 
       c = c + 1 
       if c == 4: 
        if line[i-1] == "P": 
         file_pf = open("D:\\mydirectory\\file_pf.txt","a") 
         file_pf.write(line) 
         file_pf.close() 
         break 
        elif line[i-1] == "G": 
         file_pg = open("D:\\mydirectory\\file_pg.txt","a") 
         file_pg.write(line) 
         file_pg.close() 
         break 
      i = i + 1 
file.close() 

Thanks!

Alberto

+0

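'line.split('|')[3]' should give you 'P' or 'G' for each line. Opening and closing the output file for every write is also very expensive. Open the files once at the start and close them at the end. If you are worried about exceptions, use the 'closing' context manager. – PaulMcG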

Answers

0

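Opening and closing a file are relatively slow operations. If possible, open and close each file only once. In your case, you can store the P and G lines in lists and then write them all at once after the loop ends.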

file = open('D:\\mydirectory\\soggetti.txt','r') 
file_pf = open("D:\\mydirectory\\file_pf.txt","w") 
file_pg = open("D:\\mydirectory\\file_pg.txt","w") 
file_pf.close() 
file_pg.close() 


p_lines = [] 
g_lines = [] 
i = 0 
with file: 
    for line in file: 
     i = 0 
     c = 0 
     while i < len(line): 
      carattere = line[i] 
      if carattere == "|": 
       c = c + 1 
       if c == 4: 
        if line[i-1] == "P": 
         p_lines.append(line) 
         break 
        elif line[i-1] == "G": 
         g_lines.append(line) 
         break 
      i = i + 1 
file.close() 

file_pf = open("D:\\mydirectory\\file_pf.txt","w") 
file_pf.writelines(p_lines) 
file_pf.close() 

file_pg = open("D:\\mydirectory\\file_pg.txt","w") 
file_pg.writelines(g_lines) 
file_pg.close() 

You can also identify the contents of the fields in each line more easily by using split.

file = open('D:\\mydirectory\\soggetti.txt','r') 
file_pf = open("D:\\mydirectory\\file_pf.txt","w") 
file_pg = open("D:\\mydirectory\\file_pg.txt","w") 
file_pf.close() 
file_pg.close() 


p_lines = [] 
g_lines = [] 
with file: 
    for line in file: 
     fields = line.split("|") 
     if fields[3] == "P": 
      p_lines.append(line) 
     elif fields[3] == "G": 
      g_lines.append(line) 
file.close() 

file_pf = open("D:\\mydirectory\\file_pf.txt","w") 
file_pf.writelines(p_lines) 
file_pf.close() 

file_pg = open("D:\\mydirectory\\file_pg.txt","w") 
file_pg.writelines(g_lines) 
file_pg.close() 

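By the way, strictly speaking, you don't need to close the file explicitly once you are done with it if you use with. You can do one or the other. There is also no need to open and immediately close file_pf and file_pg at the start of the script.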

p_lines = [] 
g_lines = [] 
with open('D:\\mydirectory\\soggetti.txt','r') as file: 
    for line in file: 
     fields = line.split("|") 
     if fields[3] == "P": 
      p_lines.append(line) 
     elif fields[3] == "G": 
      g_lines.append(line) 

file_pf = open("D:\\mydirectory\\file_pf.txt","w") 
file_pf.writelines(p_lines) 
file_pf.close() 

file_pg = open("D:\\mydirectory\\file_pg.txt","w") 
file_pg.writelines(g_lines) 
file_pg.close() 

If you expect to have more line types than "P" and "G" in the future, it may save you some time to store the lines of each type in a dictionary:

from collections import defaultdict
lines_to_write = defaultdict(list)
with open('D:\\mydirectory\\soggetti.txt','r') as file:
    for line in file:
        fields = line.split("|")
        lineType = fields[3].lower()
        lines_to_write[lineType].append(line)

# note: this pattern gives file_pf.txt for "P" lines but file_gf.txt for "G" lines
for lineType, lines in lines_to_write.items():
    filename = "D:\\mydirectory\\file_{}f.txt".format(lineType)
    with open(filename,"w") as file:
        file.writelines(lines)
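
If holding all 300 MB in lists is a concern, the dictionary idea can also hold open file handles instead of lines. The following is only a sketch under a few assumptions: the function name split_by_field, the out_dir parameter and the file_<value>.txt naming scheme are made up for illustration, and it uses contextlib.ExitStack, which needs Python 3.3 or later.

```python
import os
from contextlib import ExitStack


def split_by_field(source_path, out_dir, field_index=3):
    """Write each line of source_path to a file chosen by one of its fields.

    Returns a dict mapping each field value seen to the number of lines written.
    """
    handles = {}  # field value -> open output file
    counts = {}   # field value -> number of lines written so far
    with ExitStack() as stack, open(source_path, "r") as source:
        for line in source:
            fields = line.split("|")
            if len(fields) <= field_index:
                continue  # too few fields to classify, skip the line
            key = fields[field_index].lower()
            if key not in handles:
                # Hypothetical naming scheme: file_p.txt, file_g.txt, ...
                path = os.path.join(out_dir, "file_{}.txt".format(key))
                # ExitStack closes every registered file when the block ends
                handles[key] = stack.enter_context(open(path, "w"))
                counts[key] = 0
            handles[key].write(line)
            counts[key] += 1
    return counts
```

Each output file is opened once, on first use, and all of them are closed together when the with block exits. This should be fine for a handful of line types; with very many distinct values you could run into the operating system's limit on open files.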

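You can report to the user how many lines have been processed by keeping track of the line number you are on and printing a message periodically.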

how_often_to_report = 100 # prints a message every one hundred lines
with open('D:\\mydirectory\\soggetti.txt','r') as file:
    for line_number, line in enumerate(file):
        if line_number % how_often_to_report == 0:
            print("{} lines processed".format(line_number))
        # do rest of processing work here
+0

Is it possible to insert a counter to see how many records have been processed while the procedure is running? – user2867049

+0

Yes, you can use 'enumerate' to keep track of the current line number and so determine how many records have been processed. Edited. – Kevin

0
P = empty list
G = empty list
for each line read from file
    split line on |
    if splitted_line[index] is equal to P
        add line to P
    elif splitted_line[index] is equal to G
        add line to G
open file for P
write all lines in P
close file for P
open file for G
write all lines in G
close file for G
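
The steps above could be turned into Python roughly as follows. This is a sketch rather than a tested solution for your file: the function names partition_lines and split_file and their parameters are my own, and index 3 is assumed to be the fourth field, as in the question.

```python
def partition_lines(lines, field_index=3):
    """Return (p_lines, g_lines), chosen by the value of one pipe-separated field."""
    p_lines = []
    g_lines = []
    for line in lines:
        fields = line.split("|")
        if len(fields) <= field_index:
            continue  # not enough fields, ignore the line
        if fields[field_index] == "P":
            p_lines.append(line)
        elif fields[field_index] == "G":
            g_lines.append(line)
    return p_lines, g_lines


def split_file(source_path, pf_path, pg_path):
    """Read the source once, then write each group to its own file."""
    with open(source_path, "r") as source:
        p_lines, g_lines = partition_lines(source)
    with open(pf_path, "w") as file_pf:
        file_pf.writelines(p_lines)
    with open(pg_path, "w") as file_pg:
        file_pg.writelines(g_lines)
```

Keeping the partitioning separate from the file handling makes it easy to try the logic on a small list of lines before running it on the real file.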
1

I would go with:

with open('D:\\mydirectory\\soggetti.txt','r') as source_file: 
    with open("D:\\mydirectory\\file_pf.txt","w") as file_pf: 
     with open("D:\\mydirectory\\file_pg.txt","w") as file_pg: 

      for line in source_file: 
       if line.split("|")[3] == "P": 
        file_pf.write(line) 
       elif line.split("|")[3] == "G": 
        file_pg.write(line) 

If speed is what you care about, it may be better to do:

with open('D:\\mydirectory\\soggetti.txt','r') as source_file:
    listP = []
    listG = []
    for line in source_file:
        char = line.split("|")[3]
        # the output files are not open yet, so only collect the lines here
        if char == "P":
            listP.append(line)
        elif char == "G":
            listG.append(line)

with open("D:\\mydirectory\\file_pf.txt","w") as file_pf:
    for line in listP:
        file_pf.write(line)

with open("D:\\mydirectory\\file_pg.txt","w") as file_pg:
    for line in listG:
        file_pg.write(line)
0

I haven't tested this, but something like the following should be faster:

file = open('D:\\mydirectory\\soggetti.txt','r') 
file_pf = open("D:\\mydirectory\\file_pf.txt","a") 
file_pg = open("D:\\mydirectory\\file_pg.txt","a") 

for line in file: 
    bits = line.split("|") 
    if bits[3] == "P": 
     file_pf.write(line) 
    if bits[3] == "G": 
     file_pg.write(line) 


file.close() 
file_pf.close() 
file_pg.close() 
0

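The code below should be faster than what you are doing, because:

  1. You are not looping over every character.
  2. You are not opening the file for every write.
  3. There are fewer if conditions to evaluate.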

file = open('D:\\mydirectory\\soggetti.txt','r') 
file_pf = open("D:\\mydirectory\\file_pf.txt","w") 
file_pg = open("D:\\mydirectory\\file_pg.txt","w") 
file_pf.close() 
file_pg.close() 


file_pf = open("D:\\mydirectory\\file_pf.txt","a") 
file_pg = open("D:\\mydirectory\\file_pg.txt","a") 
with file: 
    for line in file: 
     switch = line.split('|')[3] 
     write = file_pf.write if 'P' in switch else file_pg.write 
     write(line) 

file_pg.close() 
file_pf.close()
file.close() 
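
If you want to check the first point on your own machine, something like the following could be used. It is only a sketch: the function names are invented for illustration, the scan mimics the loop from the question (so it returns just the single character before the fourth pipe), and the split version passes a maxsplit of 4 so the rest of the line is not split at all. I make no claim about the numbers you will see.

```python
import timeit


def fourth_field_by_scan(line):
    """Mimic the question's loop: walk the characters and count pipes."""
    pipes = 0
    for i, character in enumerate(line):
        if character == "|":
            pipes += 1
            if pipes == 4:
                return line[i - 1]  # the character just before the 4th pipe
    return None


def fourth_field_by_split(line):
    """Split at most four times and take the fourth field."""
    return line.split("|", 4)[3]


def time_both(line, repeats=100000):
    """Return the seconds each approach takes for `repeats` calls."""
    return {
        "scan": timeit.timeit(lambda: fourth_field_by_scan(line), number=repeats),
        "split": timeit.timeit(lambda: fourth_field_by_split(line), number=repeats),
    }
```

Calling time_both on one of the sample lines from the question returns a dictionary with the two timings. The two functions only agree when the fourth field is a single character, which is the case for "P" and "G".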
+0

I believe you need to omit the parentheses in your 'write = ...' line; otherwise 'write' will not refer to the function object you want. – Kevin