2013-06-19 113 views
4

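Reading unicode from xls in Python

I am trying to read an .xls file with Python. The file contains several non-ASCII characters (namely ä, ö, ü). I have tried both openpyxl and xlrd (I had high hopes for xlrd, since it is supposed to read everything as unicode), but neither worked.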

I have found several answers that deal with encoding/decoding when printing information from an xls file, but I can't even seem to get that far.

import xlrd 
workbook = xlrd.open_workbook('export_data.xls') 

This results in:

Traceback (most recent call last): 
    File "C:\Users\Administrator\workspace\tufinderxlstoxml\tufinderxlstoxml2.py", line 2, in <module> 
    workbook = xlrd.open_workbook('export_data.xls') 
    File "C:\Python27_32\lib\site-packages\xlrd\__init__.py", line 435, in open_workbook 
    ragged_rows=ragged_rows, 
    File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 119, in open_workbook_xls 
    bk.get_sheets() 
    File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 705, in get_sheets 
    self.get_sheet(sheetno) 
    File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 696, in get_sheet 
    sh.read(self) 
    File "C:\Python27_32\lib\site-packages\xlrd\sheet.py", line 796, in read 
    strg = unpack_string(data, 6, bk.encoding or bk.derive_encoding(), lenlen=2) 
    File "C:\Python27_32\lib\site-packages\xlrd\biffh.py", line 269, in unpack_string 
    return unicode(data[pos:pos+nchars], encoding) 
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 55: ordinal not in range(128) 
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero 
*** No CODEPAGE record, no encoding_override: will use 'ascii' 
*** No CODEPAGE record, no encoding_override: will use 'ascii' 

I have also tried:

workbook = xlrd.open_workbook('export_data.xls', encoding_override="utf-8") 

which results in:

Traceback (most recent call last): 
    File "C:\Users\Administrator\workspace\tufinderxlstoxml\tufinderxlstoxml2.py", line 2, in <module> 
    workbook = xlrd.open_workbook('export_data.xls', encoding_override="utf-8") 
    File "C:\Python27_32\lib\site-packages\xlrd\__init__.py", line 435, in open_workbook 
    ragged_rows=ragged_rows, 
    File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 119, in open_workbook_xls 
    bk.get_sheets() 
    File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 705, in get_sheets 
    self.get_sheet(sheetno) 
    File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 696, in get_sheet 
    sh.read(self) 
    File "C:\Python27_32\lib\site-packages\xlrd\sheet.py", line 796, in read 
    strg = unpack_string(data, 6, bk.encoding or bk.derive_encoding(), lenlen=2) 
    File "C:\Python27_32\lib\site-packages\xlrd\biffh.py", line 269, in unpack_string 
    return unicode(data[pos:pos+nchars], encoding) 
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 55: invalid start byte 
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero 
The script fails right away, just from trying to read the file.
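
For what it is worth, both tracebacks come down to the same thing: byte 0x92 is not valid ASCII and is not a valid UTF-8 start byte. Below is a minimal sketch that reproduces this without xlrd. It assumes the string was stored in a single-byte Windows code page; cp1252 is a guess here, not something the file confirms.

```python
# Python 3 syntax. The byte reported in both tracebacks.
raw = b"\x92"

def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# ascii and utf-8 are the two codecs from the tracebacks;
# cp1252 is an assumed candidate for the real encoding.
results = {enc: try_decode(raw, enc) for enc in ("ascii", "utf-8", "cp1252")}
print(results)
```

Under that assumption, 0x92 would be a typographic apostrophe rather than one of the umlauts, so the failing cell may simply contain a curly quote.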

I have also tried including various versions of the following at the top:

# -*- coding: utf-8 -*- 
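
Note that this declaration only tells the interpreter how to decode the source file itself. It has no effect on data read at run time, so it would not be expected to change what xlrd does. Here is a small sketch of the distinction, using a made-up two-line module compiled from bytes (the snippet and the cp1252 choice are illustrative assumptions):

```python
# Python 3 syntax. Hypothetical two-line module, held as raw bytes.
# The string literal contains the single byte 0xE4, which is "ä" in cp1252.
source = b"# -*- coding: cp1252 -*-\ns = '\xe4'\n"

namespace = {}
exec(compile(source, "<demo>", "exec"), namespace)

# The coding declaration decided how the literal in the source was decoded...
literal = namespace["s"]

# ...but bytes obtained at run time still need an explicit codec.
runtime_bytes = b"\xe4"
decoded = runtime_bytes.decode("cp1252")

print(literal == decoded)
```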

I am running this with Python 2.7 on a Windows Server 2008 machine.

Answers

0

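From my reading of the OOo documentation, XLS uses the utf_16_le flavour of Unicode, not UTF-8 (that is, each character is stored as 2 bytes, little-endian), so try: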

workbook = xlrd.open_workbook('export_data.xls', encoding_override="utf_16_le") 

(See page 17 of http://www.openoffice.org/sc/excelfileformat.pdf)
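
To illustrate what "2 bytes per character, little-endian" means, here is a short sketch using only Python's built-in codecs. It says nothing about how a particular .xls file stores its strings. As far as I understand it, encoding_override mainly matters for older files whose strings are stored in a code page rather than as Unicode, so treat this purely as an illustration of the encoding itself.

```python
# Python 3 syntax.
text = "äöü"

# Each of these characters becomes two bytes, low byte first.
encoded = text.encode("utf_16_le")

# Split the byte string into 2-byte units, one per character.
pairs = [encoded[i:i + 2] for i in range(0, len(encoded), 2)]
print(pairs)

# Decoding with the same codec recovers the original text.
roundtrip = encoded.decode("utf_16_le")
```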

1

Thanks everyone for the feedback!

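I ended up fixing it with the encoding_override argument. I could not find any Microsoft documentation on which cp code corresponds to German characters, so I experimented. Eventually I got to CP1251, and it works!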

workbook = xlrd.open_workbook(path, encoding_override="cp1251")
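
One caveat worth checking: cp1251 is the Windows Cyrillic code page, while cp1252 is the one normally used for Western European languages such as German. Both map byte 0x92 to the same apostrophe, which would explain why the error goes away, but they disagree on the bytes used for umlauts. Here is a quick sketch to compare the two. The helper open_with_codepage is hypothetical and only wraps the call shown above; it is not invoked here.

```python
# Python 3 syntax.
text = "äöü"
raw = text.encode("cp1252")           # the bytes a cp1252 file would contain

as_cp1252 = raw.decode("cp1252")      # round-trips to the umlauts
as_cp1251 = raw.decode("cp1251")      # decodes without error, but to Cyrillic

# Byte 0x92 from the tracebacks decodes identically in both code pages.
quote_1251 = b"\x92".decode("cp1251")
quote_1252 = b"\x92".decode("cp1252")

print(as_cp1252, as_cp1251, quote_1251 == quote_1252)

def open_with_codepage(path, codepage="cp1252"):
    """Hypothetical helper: open an .xls file with an explicit code page."""
    import xlrd  # third-party; imported lazily so the comparison above runs without it
    return xlrd.open_workbook(path, encoding_override=codepage)
```

If the umlauts come out as Cyrillic letters after opening the workbook with cp1251, trying cp1252 instead would be the next thing to test.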