2016-03-24 142 views
0

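Python: delete a specific line by its line number

I'm trying to delete a specific line (10884121) from a text file of about 30 million lines. This is the first approach I tried, but when I run it, it works for about 20 seconds and then gives me a "Memory Error". Is there a better way? Thanks!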

import fileinput 
import sys 

f_in = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned2.txt' 
f_out = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned3.txt' 

with open(f_in, 'r') as fin:
    with open(f_out, 'w') as fout:
        linenums = [10884121]
        s = [y for x, y in enumerate(fin) if x not in [line - 1 for line in linenums]]
        fin.seek(0)
        fin.write(''.join(s))
        fin.truncate(fin.tell())
+1

Don't read the whole file into memory with the list built from 'enumerate(fin)' and with 'fin.write(''.join(s))'.

Answers

1

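First, you are not using your imports; you are trying to write to the input file; and your code reads the whole file into memory.

Something like this should get you there with less trouble: we read the file line by line, using enumerate to count the line numbers, and write each line to the output unless its number is in the list of ignored lines: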

f_in = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned2.txt' 
f_out = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned3.txt' 

ignored_lines = [10884121] 
with open(f_in, 'r') as fin, open(f_out, 'w') as fout:
    for lineno, line in enumerate(fin, 1):
        if lineno not in ignored_lines:
            fout.write(line)
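
To check the line-by-line approach without waiting on a 30-million-line file, it can be wrapped in a function and run on a tiny temporary file. This is only a sketch: the name copy_without_lines and the sample file contents are made up for illustration and are not part of the answer above. It also stores the ignored numbers in a set, which keeps the membership test cheap if many lines are skipped.

```python
import os
import tempfile


def copy_without_lines(src_path, dst_path, skip_linenos):
    """Copy src_path to dst_path, omitting the given 1-based line numbers."""
    skip = set(skip_linenos)  # set lookup stays fast even with many entries
    with open(src_path, 'r') as fin, open(dst_path, 'w') as fout:
        # enumerate(fin, 1) pulls one line at a time; nothing is accumulated
        for lineno, line in enumerate(fin, 1):
            if lineno not in skip:
                fout.write(line)


# Throwaway demonstration files (illustrative data only)
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'in.txt')
dst = os.path.join(tmp_dir, 'out.txt')
with open(src, 'w') as f:
    f.write('a\nb\nc\nd\n')

copy_without_lines(src, dst, [2])

with open(dst) as f:
    result = f.read()  # contains lines 'a', 'c' and 'd'
```

With the real files, the call would be copy_without_lines(f_in, f_out, ignored_lines).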
+0

Thanks for the help! I'm new to Python, so I'm still learning how things work. – lsch91

0

Please try this:

import fileinput 

f_in = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned2.txt' 
f_out = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned3.txt' 

counter = 0

# Open the output in a with block so it is closed even if an error occurs
with open(f_out, 'w') as f:
    # fileinput.input() yields the lines of f_in one at a time
    for line in fileinput.input([f_in]):
        counter = counter + 1
        if counter != 10884121:
            # each line keeps its trailing newline, so write it back unchanged
            f.write(line)
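
If a second file is not wanted, the fileinput module can also filter a file in place. The sketch below is an alternative to the answer's approach, not a restatement of it; the function name delete_line_in_place and the sample data are hypothetical. With inplace=True, fileinput moves the original to a temporary backup and redirects standard output into a new file at the same path, so the file is still processed one line at a time.

```python
import fileinput
import os
import tempfile


def delete_line_in_place(path, target_lineno):
    """Rewrite path without line target_lineno (1-based), one line at a time."""
    # inplace=True sends everything printed inside the loop to the file
    with fileinput.input(files=[path], inplace=True) as lines:
        for line in lines:
            # lineno() is the number of the line that was just read
            if lines.lineno() != target_lineno:
                print(line, end='')  # the line already ends with '\n'


# Throwaway demonstration file (illustrative data only)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'data.txt')
with open(path, 'w') as f:
    f.write('one\ntwo\nthree\n')

delete_line_in_place(path, 2)

with open(path) as f:
    remaining = f.read()  # 'two' is gone
```

Because the original is replaced, it is worth trying this on a copy first.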
0

Also, there is a high chance you ran out of memory because you are trying to store the whole file in a list. Try the following instead:

f_in = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned2.txt'
f_out = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned3.txt'
linenums = set([10884121])  # 1-based numbers of the lines to drop
with open(f_in, 'r') as _fileOne, open(f_out, 'w') as _fileTwo:
    # start counting at 1 so that lineNumber matches the numbers in linenums
    for lineNumber, line in enumerate(_fileOne, 1):
        if lineNumber not in linenums:
            _fileTwo.write(line)  # file objects have write(), not writeLine()

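Here we read the file line by line and leave out the lines we don't need, so this should not exhaust memory. You could also try reading the file with buffering. Hope this helps.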
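
One way to read "with buffering" is to pass an explicit buffer size to open(). The sketch below is an assumption about what was meant, not the answerer's code: the 1 MiB size is arbitrary, the function name is invented, and whether it is any faster than the default buffering depends on the system. It opens both files in binary mode, so the lines are copied as raw bytes without decoding or newline translation.

```python
import os
import tempfile

BUFFER_SIZE = 1024 * 1024  # 1 MiB, an arbitrary illustrative value


def copy_without_lines_binary(src_path, dst_path, skip_linenos,
                              buffer_size=BUFFER_SIZE):
    """Copy a file as bytes, omitting the given 1-based line numbers."""
    skip = set(skip_linenos)
    # buffering=N asks for an N-byte buffer on each file object
    with open(src_path, 'rb', buffering=buffer_size) as fin, \
            open(dst_path, 'wb', buffering=buffer_size) as fout:
        # iterating a binary file still yields one b'\n'-terminated line at a time
        for lineno, line in enumerate(fin, 1):
            if lineno not in skip:
                fout.write(line)


# Throwaway demonstration files (illustrative data only)
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'in.bin')
dst = os.path.join(tmp_dir, 'out.bin')
with open(src, 'wb') as f:
    f.write(b'first\nsecond\nthird\n')

copy_without_lines_binary(src, dst, {3})

with open(dst, 'rb') as f:
    copied = f.read()  # the third line is omitted
```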

0

How about a generic file-filtering function?

def file_filter(file_path, condition=None):
    """Yield lines from a file if condition(n, line) is true.
    The condition parameter is a callback that receives two
    parameters: the line number (first line is 1) and the
    line content."""

    if condition is None:
        condition = lambda n, line: True

    with open(file_path) as source:
        for n, line in enumerate(source):
            if condition(n + 1, line):
                yield line

# f_in and f_out are the paths defined in the question
with open(f_out, 'w') as destination:
    condition = lambda n, line: n != 10884121

    for line in file_filter(f_in, condition):
        destination.write(line)
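
Since the condition is just a callback, the same generator can drop several lines at once. The sketch below restates file_filter in compact form so that it runs by itself; the helper not_in and the sample data are hypothetical additions for illustration. Because file_filter is a generator, writelines() consumes it lazily and the file is never held in memory as a whole.

```python
import os
import tempfile


def file_filter(file_path, condition=None):
    """Yield the lines for which condition(n, line) is true (n starts at 1)."""
    if condition is None:
        condition = lambda n, line: True
    with open(file_path) as source:
        for n, line in enumerate(source, 1):
            if condition(n, line):
                yield line


def not_in(linenos):
    """Hypothetical helper: build a condition that rejects these line numbers."""
    skip = set(linenos)
    return lambda n, line: n not in skip


# Throwaway demonstration files (illustrative data only)
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'in.txt')
dst = os.path.join(tmp_dir, 'out.txt')
with open(src, 'w') as f:
    f.write('a\nb\nc\nd\ne\n')

with open(dst, 'w') as destination:
    # writelines() accepts any iterable of strings, including a generator
    destination.writelines(file_filter(src, not_in([2, 4])))

with open(dst) as f:
    filtered = f.read()  # lines 2 and 4 are dropped
```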