Encoding error when opening an Excel file with xlrd

I want to open an Excel file (.xls) with xlrd. Here is a summary of the code I am using:

import xlrd 
workbook = xlrd.open_workbook('thefile.xls')

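This works for most files, but fails for files I get from one particular organization. Below is the error I get when I try to open an Excel file from this organization.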

Traceback (most recent call last): 
    File "<console>", line 1, in <module> 
    File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/__init__.py", line 435, in open_workbook 
    ragged_rows=ragged_rows, 
    File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 116, in open_workbook_xls 
    bk.parse_globals() 
    File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 1180, in parse_globals 
    self.handle_writeaccess(data) 
    File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 1145, in handle_writeaccess 
    strg = unpack_unicode(data, 0, lenlen=2) 
    File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/biffh.py", line 303, in unpack_unicode 
    strg = unicode(rawstrg, 'utf_16_le') 
    File "/app/.heroku/python/lib/python2.7/encodings/utf_16_le.py", line 16, in decode 
    return codecs.utf_16_le_decode(input, errors, True) 
UnicodeDecodeError: 'utf16' codec can't decode byte 0x40 in position 104: truncated data

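It looks as though xlrd is trying to open an Excel file encoded in something other than UTF-16. How can I avoid this error? Was the file written incorrectly, or is there just a specific character that causes the problem? If I open and re-save the Excel file, xlrd opens it without any problem.

I have tried opening the workbook with different encoding overrides, but that does not work either.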

The file I am trying to open is available at:

https://dl.dropboxusercontent.com/u/6779408/Stackoverflow/AEPUsageHistoryDetail_RequestID_00183816.xls

The issue is reported here: https://github.com/python-excel/xlrd/issues/128

Source

2015-02-05 Erik

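What are they using to generate the file?

They are using some Java Excel API (see below), probably on an IBM mainframe or similar.

From the stack trace, the writeaccess information cannot be decoded to Unicode because of the @ characters. See 5.112 WRITEACCESS or page 277.

This field contains the user name of the user who saved the file.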

import xlrd
# xlrd.dump prints the record dump to stdout; it does not return it
xlrd.dump('thefile.xls')

Running xlrd.dump on the original file gives

36: 005c WRITEACCESS len = 0070 (112) 
    40:  d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40 ????@?????@???@@
    56:  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 @@@@@@@@@@@@@@@@ 
    72:  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 @@@@@@@@@@@@@@@@ 
    88:  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 @@@@@@@@@@@@@@@@ 
    104:  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 @@@@@@@@@@@@@@@@ 
    120:  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 @@@@@@@@@@@@@@@@ 
    136:  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 @@@@@@@@@@@@@@@@

After re-saving it with Excel, or in my case LibreOffice Calc, the write access information is overwritten with something like

36: 005c WRITEACCESS len = 0070 (112) 
40:  04 00 00 43 61 6c 63 20 20 20 20 20 20 20 20 20 ?~~Calc   
56:  20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20     
72:  20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20     
88:  20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20     
104:  20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20     
120:  20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20     
136:  20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20

Based on the spaces being encoded as 40, I believe the encoding is EBCDIC, and when we decode d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40 as EBCDIC we get "Java Excel API".
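The decoding can be checked with Python's built-in codecs. This is only a quick sketch: it assumes the EBCDIC variant is code page 037 (Python's cp037 codec), and the bytes are copied from the first row of the WRITEACCESS dump.

```python
# First 16 bytes of the WRITEACCESS payload, copied from the dump
raw = bytes.fromhex("d181a58140c5a783859340c1d7c94040")

# Assumption: the EBCDIC variant is code page 037 (Python codec "cp037")
decoded = raw.decode("cp037")

print(repr(decoded))  # 'Java Excel API  '
```

The trailing 0x40 bytes decode to spaces, which fits a fixed-width field padded with EBCDIC spaces.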

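So the file is written in a flawed way: in BIFF8 and later this field should be a unicode string, and in BIFF3 to BIFF5 it should be a byte string in the encoding given by the CODEPAGE information, which is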

152: 0042 CODEPAGE len = 0002 (2) 
156:  12 52           ?R

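1252 is Windows CP-1252 (Latin I) (BIFF4-BIFF5), which is not EBCDIC_037.

The fact that xlrd tried to use unicode means that it determined the version of the file to be BIFF8.

In this case, you have two options:

1. Fix the file before opening it with xlrd. You could use dump to check whether a file is non-standard and, if it is, overwrite the writeaccess information with xlutils.save or another library.
2. Patch xlrd to handle your special case: in handle_writeaccess, add a try block and set strg to an empty string when unpack_unicode fails.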
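To see why the BIFF8 path ends in "truncated data", here is a rough sketch that walks the WRITEACCESS payload by hand. The header layout it uses (2-byte length, 1 flags byte, optional rich-text and phonetic size fields) is my reading of the BIFF8 string format and should be treated as an assumption, not as xlrd's actual code.

```python
import struct

# WRITEACCESS payload from the dump: 16 leading bytes plus 96 bytes of 0x40
data = bytes.fromhex("d181a58140c5a783859340c1d7c94040") + b"\x40" * 96

# Assumed BIFF8 string header: 2-byte little-endian length, then 1 flags byte
(nchars,) = struct.unpack("<H", data[0:2])
flags = data[2]
pos = 3
if flags & 0x08:  # rich-text run count follows (assumed layout)
    pos += 2
if flags & 0x04:  # phonetic block size follows (assumed layout)
    pos += 4

# Flag 0x01 is set here, so the characters are treated as 16-bit
raw = data[pos:pos + 2 * nchars]

try:
    raw.decode("utf_16_le")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(nchars, hex(flags), len(raw))
print(error)
```

Under these assumptions the EBCDIC bytes are misread as a huge character count, the slice ends up with an odd number of bytes, and the last 0x40 has no partner byte, which lines up with the position reported in the traceback.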

The code snippet below

def handle_writeaccess(self, data):
    DEBUG = 0
    if self.biff_version < 80:
        if not self.encoding:
            # No encoding known yet: keep the raw bytes as the user name
            self.raw_user_name = True
            self.user_name = data
            return
        strg = unpack_string(data, 0, self.encoding, lenlen=1)
    else:
        try:
            strg = unpack_unicode(data, 0, lenlen=2)
        except Exception:
            # Malformed WRITEACCESS record: fall back to an empty user name
            strg = ""
    if DEBUG:
        fprintf(self.logfile, "WRITEACCESS: %d bytes; raw=%s %r\n",
                len(data), self.raw_user_name, strg)
    strg = strg.rstrip()
    self.user_name = strg

together with

workbook=xlrd.open_workbook('thefile.xls',encoding_override="cp1252")

seems to open the file successfully.

Without the encoding override it complains: ERROR *** codepage 21010 -> encoding 'unknown_codepage_21010' -> LookupError: unknown encoding: unknown_codepage_21010

Source
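The number 21010 in that message is consistent with the CODEPAGE bytes 12 52 from the dump being read as a little-endian 16-bit integer. A small sketch of the arithmetic follows; that the writer meant code page 1252 is my assumption.

```python
import struct

# CODEPAGE payload exactly as shown in the dump: 12 52
payload = bytes.fromhex("1252")

# Read as a little-endian unsigned 16-bit integer, this is 0x5212
(codepage,) = struct.unpack("<H", payload)
print(codepage)  # 21010

# What a correct record for code page 1252 would contain (assumed intent)
correct = struct.pack("<H", 1252)
print(correct.hex())  # e404
```

So the writer appears to have stored the decimal digits "1252" as if they were hex bytes, which is why encoding_override="cp1252" is needed.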

2015-02-07 11:18:37 Appleman1234

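I don't know what the organization uses to write the Excel files, but I have a question in to them about it. I will try your second option, since you said it worked for you, and post my results here. I am hopeful - thanks for the excellent response. – Erik

This worked great. Thanks. – Erik

I can't award the bounty for another 15 hours, but I will. Thanks again. – Erik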
