Try working on the file one line at a time:
lowered = []

with open('tweets.txt', 'r') as handle:
    for line in handle:
        # keep accumulating the results ...
        lowered.append(line.lower())
        # ... or just dump them to stdout right away
        print(line)

for line in lowered:
    # print, write to a file, or whatever else you require
    print(line)
This way you reduce the memory overhead, which for large files could otherwise cause swapping and kill performance.
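For contrast, here is a minimal sketch of the memory-hungry approach this avoids (the file name is just a placeholder): reading the whole file at once forces every line into RAM, while the loop above only ever holds the current line plus whatever you explicitly accumulate.

# Memory-hungry: pulls the entire file into RAM before any processing starts.
with open('tweets.txt', 'r') as handle:
    all_lines = handle.readlines()   # one big list holding every line

# Line-by-line: only the current line lives in memory at any moment.
with open('tweets.txt', 'r') as handle:
    for line in handle:
        processed = line.lower()     # do whatever work you need here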
Here are some benchmarks on a file with roughly 1M lines:
# (1) real 0.223  user 0.195  sys 0.026  pcpu 98.71
with open('medium.txt') as handle:
    for line in handle:
        pass
# (2) real 0.295  user 0.262  sys 0.025  pcpu 97.21
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        pass
print(i)  # 1031124
# (3) real 21.561  user 5.072  sys 3.530  pcpu 39.89
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        print(line.lower())
# (4) real 1.702  user 1.605  sys 0.089  pcpu 99.50
lowered = []
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        lowered.append(line.lower())
# (5) real 2.307  user 1.983  sys 0.159  pcpu 92.89
lowered = []
with open('medium.txt', 'r') as handle:
    for i, line in enumerate(handle):
        lowered.append(line.lower())

with open('lowered.txt', 'w') as handle:
    for line in lowered:
        handle.write(line)
You can also iterate over two files at the same time:
# (6) real 1.944  user 1.666  sys 0.115  pcpu 91.59
with open('medium.txt', 'r') as src, open('lowered.txt', 'w') as sink:
    for i, line in enumerate(src):
        sink.write(line.lower())
Results as a table:
# (1) noop                     0.223
# (2) w/ enumerate             0.295
# (4) list buffer              1.702
# (6) on-the-fly               1.944
# (5) r -> list buffer -> w    2.307
# (3) stdout print            21.561
This might also be helpful: http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python
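If the file has no useful line structure, the linked question describes reading it lazily in fixed-size chunks instead. A minimal sketch of that idea (the function name, chunk size, and file name here are just placeholders):

def read_in_chunks(handle, chunk_size=1024 * 1024):
    # Yield the file piece by piece instead of loading it all at once.
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        yield chunk

with open('medium.txt', 'r') as handle:
    for chunk in read_in_chunks(handle):
        pass  # process each chunk here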