2017-05-15 64 views

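I'm just starting out with Python and I can't get my head around a basic file navigation method. Why does file_object.tell() return the same byte offset for different positions in the file?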

The tell() tutorial I read states that it returns the position (in bytes) where I am currently sitting in the file.

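My reasoning is that every character in the file adds to the byte coordinate, right? That means that after each new line (a line being just a run of characters split on the newline character) my byte coordinate should change... but that doesn't seem to be the case.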

I generated a quick toy text file in bash:

$ for i in {1..10}; do echo "@ this is the "$i"th line" ; done > toy.txt 
$ for i in {11..20}; do echo " this is the "$i"th line" ; done >> toy.txt 

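Now I iterate over this toy file and print, on each cycle, the line count and the result of the tell() call. The @ marks some lines that delimit blocks of the file that I'd like to return to (see below).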

My guess is that the for loop runs through the whole file object first and reaches its end, so tell() always stays the same.

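This is a toy example. For my real problem the file is gigabytes long, and applying the same method I get tell() results that I think reflect how the for loop traverses the file object. Is that correct? Could you shed some light on the concept I'm missing?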

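My ultimate goal is to find specific coordinates in the file and then process these huge files in parallel from distributed starting points, which I can't keep track of with the approach I'm using now.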

# Python 2 syntax (print statement)
import os

os.path.getsize("toy.txt")  # returns 451

fa = open("toy.txt")
fa.seek(0)  # let's double check
fa.tell()
count = 0
for line in fa:
    if line.startswith("@"):
        print line,
        print "tell {} count {}".format(fa.tell(), count)
    else:
        if count < 32775:
            print line,
            print "tell {} count {}".format(fa.tell(), count)
    count += 1

Output:

@ this is the 1th line 
tell 451 count 0 
@ this is the 2th line 
tell 451 count 1 
@ this is the 3th line 
tell 451 count 2 
@ this is the 4th line 
tell 451 count 3 
@ this is the 5th line 
tell 451 count 4 
@ this is the 6th line 
tell 451 count 5 
@ this is the 7th line 
tell 451 count 6 
@ this is the 8th line 
tell 451 count 7 
@ this is the 9th line 
tell 451 count 8 
@ this is the 10th line 
tell 451 count 9 
this is the 11th line 
tell 451 count 10 
this is the 12th line 
tell 451 count 11 
this is the 13th line 
tell 451 count 12 
this is the 14th line 
tell 451 count 13 
this is the 15th line 
tell 451 count 14 
this is the 16th line 
tell 451 count 15 
this is the 17th line 
tell 451 count 16 
this is the 18th line 
tell 451 count 17 
this is the 19th line 
tell 451 count 18 
this is the 20th line 
tell 451 count 19 

Answer


You are using a for loop to read the file line by line:

for line in fa: 

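Files don't normally work that way; you read blobs of data, usually in blocks. For Python to give you lines, it has to read up to the next newline. Reading byte by byte just to find newlines would be inefficient, so a buffer is used: a large block is read, the newlines inside that block are located, and one line is yielded for each. Once the buffer is exhausted, a new block is read.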
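The buffering described above can be imitated in a few lines. This is only a minimal sketch written for Python 3, not how the interpreter implements it: `iter_lines_buffered` and the tiny `chunk_size=16` are hypothetical, chosen so the effect shows up on a very small input. Each yielded line is paired with the position of the underlying stream, which only moves when a new block is read.

```python
import io


def iter_lines_buffered(raw, chunk_size=8192):
    """Yield (line, raw_position) pairs, reading `raw` one block at a time."""
    pending = b""
    while True:
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        lines = pending.split(b"\n")
        pending = lines.pop()  # last piece is an incomplete line (or empty)
        for line in lines:
            # raw.tell() reports where the *block* reads have got to,
            # not where this particular line ends.
            yield line + b"\n", raw.tell()
    if pending:
        yield pending, raw.tell()


data = "".join("line %d\n" % i for i in range(5)).encode("ascii")
raw = io.BytesIO(data)
pairs = list(iter_lines_buffered(raw, chunk_size=16))
positions = [pos for _, pos in pairs]
print(positions)
```

With 7-byte lines and 16-byte blocks, several consecutive lines report the same position, which is the same effect as every line reporting 451 in the question, only on a smaller scale.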

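Your file isn't large enough to need more than one block; at only 451 bytes it is small, whereas buffers are usually measured in kilobytes. If you created a larger file, you would see the file position jump in large steps as you iterate.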

See the file.next documentation (next is the method responsible for producing the next line when iterating, which is what the for loop does):

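In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer.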

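If you need to keep track of the absolute file position while looping over lines, you'll have to use binary mode if you are on Windows (to prevent newline translation from taking place) and track the line lengths yourself: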

position = 0  
for line in fa: 
    position += len(line) 
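As a fuller illustration of the length-tracking idea, the sketch below (Python 3, binary mode) rebuilds a toy file like the one in the question in a temporary directory, records the byte offset at which each @ line starts, and then seeks straight back to one of them. The temporary path and the name `marker_offsets` are assumptions made for the example.

```python
import os
import tempfile

# Recreate a toy file like the one in the question (assumed layout).
lines = ["@ this is the %dth line\n" % i for i in range(1, 11)]
lines += [" this is the %dth line\n" % i for i in range(11, 21)]
path = os.path.join(tempfile.mkdtemp(), "toy.txt")
with open(path, "wb") as fh:
    fh.write("".join(lines).encode("ascii"))

marker_offsets = []  # byte offset where each "@" line starts
position = 0
with open(path, "rb") as fa:  # binary mode: len(line) is a byte count
    for line in fa:
        if line.startswith(b"@"):
            marker_offsets.append(position)
        position += len(line)

# Jump directly to the third marked line without rereading the file.
with open(path, "rb") as fa:
    fa.seek(marker_offsets[2])
    third = fa.readline()

print(marker_offsets)
print(third)
```

Offsets collected this way can be handed to separate workers, each of which opens the file itself and calls seek() to reach its own starting point.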

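An alternative is to use the io library; this is the framework Python 3 uses for handling files. Its file.tell() method takes the buffer into account and produces an accurate file position even while iterating.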

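Take into account that when you use io.open() to open a file in text mode, you get unicode strings. In Python 2, if you must have str bytestrings, you can use binary mode (open with 'rb'). In fact, only in binary mode will you have access to IOBase.tell() while iterating; in text mode an exception is thrown: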

>>> import io 
>>> fa = io.open("toy.txt") 
>>> next(fa) 
u'@ this is the 1th line\n' 
>>> fa.tell() 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
IOError: telling position disabled by next() call 
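The same restriction can be reproduced on Python 3, where IOError is an alias of OSError. The sketch below uses an in-memory stream wrapped in io.TextIOWrapper as a stand-in for a text-mode file, which is an assumption made so the example needs no file on disk. It also shows that reading with readline() instead of next() leaves tell() usable, although in text mode the returned value is documented as an opaque number rather than a guaranteed byte offset.

```python
import io

payload = b"first\nsecond\n"

# Iterating with next() disables tell() on a text-mode stream.
text = io.TextIOWrapper(io.BytesIO(payload), encoding="ascii")
next(text)
try:
    text.tell()
    tell_failed = False
except (IOError, OSError):
    tell_failed = True

# readline() does not use the iteration fast path, so tell() still works.
text2 = io.TextIOWrapper(io.BytesIO(payload), encoding="ascii")
first = text2.readline()
cookie = text2.tell()  # opaque position, usable with text2.seek(cookie)

print(tell_failed, first, cookie)
```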

In binary mode, you get accurate output from file.tell():

>>> import os.path 
>>> os.path.getsize("toy.txt") 
461 
>>> fa = io.open("toy.txt", 'rb') 
>>> count = 0 
>>> for line in fa: 
...  if line.startswith("@"): 
...   print line , 
...   print "tell {} count {}".format(fa.tell(), count) 
...  else: 
...   if count < 32775: 
...    print line, 
...    print "tell {} count {}".format(fa.tell(), count) 
...  count += 1 
... 
@ this is the 1th line 
tell 23 count 0 
@ this is the 2th line 
tell 46 count 1 
@ this is the 3th line 
tell 69 count 2 
@ this is the 4th line 
tell 92 count 3 
@ this is the 5th line 
tell 115 count 4 
@ this is the 6th line 
tell 138 count 5 
@ this is the 7th line 
tell 161 count 6 
@ this is the 8th line 
tell 184 count 7 
@ this is the 9th line 
tell 207 count 8 
@ this is the 10th line 
tell 231 count 9 
this is the 11th line 
tell 254 count 10 
this is the 12th line 
tell 277 count 11 
this is the 13th line 
tell 300 count 12 
this is the 14th line 
tell 323 count 13 
this is the 15th line 
tell 346 count 14 
this is the 16th line 
tell 369 count 15 
this is the 17th line 
tell 392 count 16 
this is the 18th line 
tell 415 count 17 
this is the 19th line 
tell 438 count 18 
this is the 20th line 
tell 461 count 19 
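For readers on Python 3, where the built-in open() is io.open(), a comparable check might look like the sketch below. It rebuilds the 461-byte toy file in a temporary directory (the path and variable names are assumptions) and collects tell() after every line while iterating in binary mode.

```python
import os
import tempfile

# Recreate the 461-byte toy file used in the session above (assumed layout).
lines = ["@ this is the %dth line\n" % i for i in range(1, 11)]
lines += [" this is the %dth line\n" % i for i in range(11, 21)]
path = os.path.join(tempfile.mkdtemp(), "toy.txt")
with open(path, "wb") as fh:
    fh.write("".join(lines).encode("ascii"))

offsets = []
with open(path, "rb") as fa:  # binary mode returns a buffered reader
    for line in fa:
        # tell() subtracts the unread part of the buffer, so it reports
        # the end of the line just returned, not the end of the block.
        offsets.append(fa.tell())

print(os.path.getsize(path))
print(offsets)
```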