2013-01-04 228 views
0

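Reading lines from a file using Python

I have a file with nearly 100,000 lines. I want to run a cleaning process on it (lowercasing, removing stop words, etc.), but it takes a long time.

For a sample of 10,000 lines, the script takes 15 minutes. For the whole file I expected 150 minutes. However, it takes 5 hours.

At the start, I read the file using: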

fileinput = open('tweets.txt', 'r') 

lines = fileinput.read().lower()  # lowercases everything, but loads the whole file into memory

for line in fileinput: 
    lines = line.lower() 

Question: Is there a way to read the first 10,000 lines, run the cleaning process on them, then read the next block of lines, and so on?

+1

This may help: http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python

Answers

0

Change your script as follows:

with open('tweets.txt', 'r') as fileinput:
    for line in fileinput:
        # do what you need to do with each line
        line = line.lower()

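So, basically, don't read the whole file into memory with read(); just iterate over the lines of the open file. When you read a huge file into memory, your process may grow to the point where the system has to swap part of it out, which makes it very slow.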
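As a rough illustration of that advice applied to the cleanup described in the question, here is a minimal sketch that lowercases each line, drops stop words, and writes the result straight to a second file. The `STOP_WORDS` set, the helper names `clean_line` and `clean_file`, and the output path are assumptions made for the example, not part of the original script.

```python
# Hypothetical stop-word list; replace it with whatever list the cleanup uses.
STOP_WORDS = {"a", "an", "the", "and", "or", "to", "of"}


def clean_line(line):
    """Lowercase a line and drop stop words (an assumed cleaning step)."""
    words = line.lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)


def clean_file(src_path, dst_path):
    """Stream src_path line by line and write the cleaned lines to dst_path."""
    count = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:  # the file is never loaded as a whole
            dst.write(clean_line(line) + "\n")
            count += 1
    return count  # number of lines processed
```

A call such as `clean_file('tweets.txt', 'tweets_clean.txt')` would process the file from the question; the output name is only a placeholder. Keeping the stop words in a set makes each membership check cheap, which may matter when the check runs for every word of every line.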

+0

There's no reason to use '.readlines()'; you can iterate over the file object itself. – Amber

+0

@Amber Right, corrected. – piokuc

2

I would strongly recommend working line by line rather than reading the whole file at once (in other words, don't use .read()).

with open('tweets.txt', 'r') as fileinput:
    for line in fileinput:
        line = line.lower()
        # ... do something with line ...
        # (for example, write the line to a new file, or print it)

This will automatically take advantage of Python's built-in buffering capabilities.
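If the cleaning step really does need to see batches of 10,000 lines at a time, as the question asks, one possible approach is to pull fixed-size slices from the file iterator with `itertools.islice`. The function name `read_in_chunks` and its `chunk_size` parameter are made up for this sketch.

```python
from itertools import islice


def read_in_chunks(path, chunk_size=10000):
    """Yield lists of at most chunk_size lines without loading the whole file."""
    with open(path, "r") as handle:
        while True:
            # islice consumes the next chunk_size lines from the open file
            chunk = list(islice(handle, chunk_size))
            if not chunk:  # an empty list means the file is exhausted
                break
            yield chunk
```

It could be used as `for chunk in read_in_chunks('tweets.txt'): cleaned = [line.lower() for line in chunk]`, so only one batch of lines is held in memory at any time.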

+0

With this, I run the process once for each line. Could that take more time?

+0

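Depends on the process. In the average case, the time saved by using file buffering will more than make up for any extra time spent on additional function calls. – Amber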

1

Try working on the file one line at a time:

lowered = []

with open('tweets.txt', 'r') as handle:
    for line in handle:
        # keep accumulating the results ...
        lowered.append(line.lower())
        # ... or just dump them to stdout right away
        print(line.lower())

for line in lowered:
    # print or write to file or whatever you require
    pass

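This way you reduce the memory overhead, which for large files could otherwise lead to swapping and kill performance.

Here are some benchmarks on a file with about 1M lines: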

# (1) real 0.223 user 0.195 sys 0.026 pcpu 98.71
with open('medium.txt') as handle:
    for line in handle:
        pass

# (2) real 0.295 user 0.262 sys 0.025 pcpu 97.21
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        pass
    print(i)  # 1031124

# (3) real 21.561 user 5.072 sys 3.530 pcpu 39.89
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        print(line.lower())

# (4) real 1.702 user 1.605 sys 0.089 pcpu 99.50
lowered = []
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        lowered.append(line.lower())

# (5) real 2.307 user 1.983 sys 0.159 pcpu 92.89
lowered = []
with open('medium.txt', 'r') as handle:
    for i, line in enumerate(handle):
        lowered.append(line.lower())

with open('lowered.txt', 'w') as handle:
    for line in lowered:
        handle.write(line)

You can also iterate over two files at once:

# (6) real 1.944 user 1.666 sys 0.115 pcpu 91.59
with open('medium.txt', 'r') as src, open('lowered.txt', 'w') as sink:
    for i, line in enumerate(src):
        sink.write(line.lower())

The results as a table:

#     variant                   real (s)
# (1) noop                       0.223
# (2) w/ enumerate               0.295
# (4) list buffer                1.702
# (6) on-the-fly                 1.944
# (5) r -> list buffer -> w      2.307
# (3) stdout print              21.561
+0

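A better approach is to write out or print the lines as they are processed, so you don't have to buffer the entire list of processed lines in memory. – Amber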

+0

@Amber, yes, I added a note. – miku