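Python - How do I read a file with NUL-delimited lines?
I usually use the following Python code to read lines from a file:
f = open('./my.csv', 'r')
for line in f:
    print(line)
But what if the lines in the file are delimited by "\0" (instead of "\n")? Is there a Python module that can handle this?
Thanks for any advice.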
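If your file is small enough to read entirely into memory, you can use split:
for line in f.read().split('\0'):
    print(line)
Otherwise, you might want to try this recipe from the discussion of this feature request: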
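As a minimal sketch of the split approach, assuming Python 3 and a file that ends with a trailing NUL (as `find -print0` output does): the helper name `read_nul_records` and the UTF-8 default are illustrative assumptions, not part of the original answer.

```python
import os
import tempfile


def read_nul_records(path, encoding="utf-8"):
    """Return all NUL-delimited records in the file at `path` as a list of str."""
    # Binary mode avoids any newline translation; decode after splitting.
    with open(path, "rb") as f:
        chunks = f.read().split(b"\0")
    # A file that ends with a NUL yields one trailing empty chunk; drop it.
    if chunks and chunks[-1] == b"":
        chunks.pop()
    return [chunk.decode(encoding) for chunk in chunks]


if __name__ == "__main__":
    # Build a small throwaway file so the example runs anywhere.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"first\0second line\nwith newline\0third\0")
    try:
        for record in read_nul_records(tmp.name):
            print(repr(record))
    finally:
        os.remove(tmp.name)
```

Note that a record may itself contain "\n", which is usually the reason for choosing NUL as the delimiter in the first place.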
def fileLineIter(inputFile,
                 inputNewline="\n",
                 outputNewline=None,
                 readSize=8192):
    """Like the normal file iter but you can set what string indicates newline.

    The newline string can be arbitrarily long; it need not be restricted to a
    single character. You can also set the read size and control whether or not
    the newline string is left on the end of the iterated lines. Setting
    newline to '\0' is particularly good for use with an input file created with
    something like "os.popen('find -print0')".
    """
    if outputNewline is None: outputNewline = inputNewline
    partialLine = ''
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead: break
        partialLine += charsJustRead
        lines = partialLine.split(inputNewline)
        partialLine = lines.pop()
        for line in lines: yield line + outputNewline
    if partialLine: yield partialLine
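For Python 3, the same buffering idea might be sketched on a binary stream as follows. This is an adaptation under stated assumptions, not the recipe itself: the name `iter_records` is hypothetical, the delimiter is stripped from each record rather than replaced with an output newline, and `io.BytesIO` stands in for a real file opened with mode "rb".

```python
import io


def iter_records(stream, delimiter=b"\0", read_size=8192):
    """Yield delimiter-separated records from a binary stream, without the delimiter."""
    pending = b""  # bytes read so far that do not yet form a complete record
    while True:
        chunk = stream.read(read_size)
        if not chunk:
            break
        pending += chunk
        records = pending.split(delimiter)
        # The last piece may be incomplete; keep it for the next read.
        pending = records.pop()
        for record in records:
            yield record
    # Whatever is left at EOF is an unterminated final record.
    if pending:
        yield pending


if __name__ == "__main__":
    data = io.BytesIO(b"alpha\0beta\0gamma")
    # A tiny read_size forces records to straddle chunk boundaries.
    for record in iter_records(data, read_size=4):
        print(record.decode("utf-8"))
```

Only one read buffer plus the current partial record is held in memory at a time, so this should scale to files that do not fit in memory.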
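I also noticed that your file has a "csv" extension. Python has a built-in CSV module (import csv). It has an attribute called Dialect.lineterminator, but the reader does not currently honor it:
Dialect.lineterminator
The string used to terminate lines produced by the writer. It defaults to '\r\n'.
Note: The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future.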
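One possible workaround, sketched here for Python 3: csv.reader accepts any iterable of strings, so the records can be split on NUL first and then handed to the reader one at a time. The helper `split_records` and the sample data are assumptions made for illustration, and the sketch assumes no record contains a bare newline outside quotes.

```python
import csv


def split_records(text, delimiter="\0"):
    """Split text on the delimiter, dropping the empty piece after a trailing delimiter."""
    records = text.split(delimiter)
    if records and records[-1] == "":
        records.pop()
    return records


if __name__ == "__main__":
    # "\x00" is written out explicitly so that no following digit can be
    # misread as part of an octal escape.
    raw = 'name,city\x00"Smith, Jo",Paris\x00'
    # Each NUL-delimited record is parsed by csv.reader as one CSV row.
    for row in csv.reader(split_records(raw)):
        print(row)
```

For a large file, the same reader could be fed from a streaming iterator instead of a fully split string.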
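I modified Mark Byers's suggestion so that we can do a READLINE on a file with NUL-delimited lines in Python. This approach reads a potentially large file line by line and should be more memory efficient. Here is the Python code (with comments):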
import sys

# Variables for "fileReadLine()"
inputFile = sys.stdin  # The input file. Use "stdin" as an example for receiving data from a pipe.
lines = []  # Extracted complete lines (delimited with "inputNewline").
partialLine = ''  # Extracted last incomplete partial line.
inputNewline = "\0"  # Newline character(s) in input file.
outputNewline = "\n"  # Newline character(s) in output lines.
readSize = 8192  # Size of read buffer.
# End - Variables for "fileReadLine()"

# This function reads NUL-delimited lines sequentially and is memory efficient.
def fileReadLine():
    """Like the normal file readline but you can set what string indicates newline.

    The newline string can be arbitrarily long; it need not be restricted to a
    single character. You can also set the read size and control whether or not
    the newline string is left on the end of the read lines. Setting
    newline to '\0' is particularly good for use with an input file created with
    something like "os.popen('find -print0')".
    """
    # Declare that we want to use these related global variables.
    global inputFile, partialLine, lines, inputNewline, outputNewline, readSize
    if lines:
        # If complete lines have already been extracted, pop the first line and return it + outputNewline.
        line = lines.pop(0)
        return line + outputNewline
    # If no complete lines are waiting, try to read more from the input file.
    while True:  # Here "lines" must be an empty list.
        charsJustRead = inputFile.read(readSize)  # The read buffer size, "readSize", can be changed as you like.
        if not charsJustRead:
            # Reached EOF.
            if partialLine:
                # If partialLine is not empty here, treat it as a complete line and return a copy of it.
                popedPartialLine = partialLine
                partialLine = ""  # Reset so that later "fileReadLine" calls see nothing left to return.
                return popedPartialLine  # This should be the last line of the input file.
            else:
                # EOF reached and partialLine is empty: every line has been read. Return None to indicate this.
                return None
        partialLine += charsJustRead  # If the read buffer is not empty, add it to partialLine.
        lines = partialLine.split(inputNewline)  # Split partialLine to get some complete lines.
        partialLine = lines.pop()  # The last item may not be a complete line; move it to partialLine.
        if not lines:
            # Empty "lines" means we have not finished reading any complete line yet. So continue.
            continue
        else:
            # At least one complete line has been read. Pop the first line and return it + outputNewline (exits the while loop).
            line = lines.pop(0)
            return line + outputNewline

# As an example, read NUL-delimited lines from "stdin" and print them out (using "\n" to delimit output lines).
while True:
    line = fileReadLine()
    if line is None: break
    sys.stdout.write(line)  # "write" does not append "\n".
    sys.stdout.flush()
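A variation on the same readline idea, sketched for Python 3, keeps the buffer state inside a closure rather than in module-level globals, so several streams can be read independently. The factory name `make_line_reader` is hypothetical, and unlike the code above it returns each line without an output newline. The two-argument form of the built-in `iter` then turns the readline-style function into a loop that stops at None.

```python
import io
from collections import deque


def make_line_reader(stream, delimiter="\0", read_size=8192):
    """Return a readline-style function; it returns None once the stream is exhausted."""
    lines = deque()  # complete lines waiting to be returned
    partial = ""     # trailing text that has not yet seen its delimiter

    def read_line():
        nonlocal partial
        while not lines:
            chunk = stream.read(read_size)
            if not chunk:
                # EOF: hand back any unterminated final line exactly once.
                last, partial = partial, ""
                return last or None
            pieces = (partial + chunk).split(delimiter)
            partial = pieces.pop()
            lines.extend(pieces)
        return lines.popleft()

    return read_line


if __name__ == "__main__":
    read_line = make_line_reader(io.StringIO("one\0two\0three"), read_size=5)
    # iter(callable, sentinel) keeps calling read_line until it returns None.
    for line in iter(read_line, None):
        print(line)
```

Using None as the sentinel means an empty line between two consecutive NULs is still returned as an empty string rather than ending the loop.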
Hope it helps.
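My file will have a few thousand to a few tens of thousands of lines. – user1129812 2012-02-11 02:22:25
@user1129812: How long is one line? 100 bytes? 100 bytes * 50000 lines = 5 MB. – 2012-02-11 02:27:51
Each line should be about 100 characters. Assuming Unicode, each line would be about 200 bytes, so a 50000-line file would be about 200 x 50000 = 9.54 MB. – user1129812 2012-02-11 02:37:26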