Python: file encoding error

For a few days now I have been struggling with an annoying file encoding problem in a small Python program of mine.

I work with MediaWiki a lot, and recently I have been converting documents from .doc to Wikisource.

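The Microsoft Word documents are opened in Libre Office and then exported to a .txt file with Wikisource formatting. My program searches for an [[Image:]] tag and replaces it with an image name taken from a list, and that mechanism works really well (many thanks to brjaga!). When I ran some tests on .txt files I had created myself, everything worked fine, but when I fed it a .txt file containing the wiki markup, the whole thing was not so funny anymore :D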

I get this message from Python:

Traceback (most recent call last): 
    File "C:\Python33\final.py", line 15, in <module> 
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()]) 
    File "C:\Python33\lib\encodings\cp1250.py", line 23, in decode 
    return codecs.charmap_decode(input,self.errors,decoding_table)[0] 
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7389: character maps to <undefined>

This is my Python code:

li = [ 
    "[[Image:124_BPP_PL_PL_Page_03_Image_0001.jpg]]", 
    "[[Image:124_BPP_PL_PL_Page_03_Image_0002.jpg]]", 
    "[[Image:124_BPP_PL_PL_Page_03_Image_0003.jpg]]", 
    "[[Image:124_BPP_PL_PL_Page_03_Image_0004.jpg]]", 
    "[[Image:124_BPP_PL_PL_Page_03_Image_0005.jpg]]", 
    "[[Image:124_BPP_PL_PL_Page_03_Image_0006.jpg]]", 
    "[[Image:124_BPP_PL_PL_Page_03_Image_0007.jpg]]", 
    "[[Image:124_BPP_PL_PL_Page_05_Image_0001.jpg]]", 
    "[[Image:124_BPP_PL_PL_Page_05_Image_0002.jpg]]" 
    ] 


with open ("C:\\124_BPP_PL_PL.txt") as myfile: 
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()]) 

dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w') 

for item in li: 
    s = s.replace("[[Image:]]", item, 1) 

dest.write(s) 
dest.close()

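OK, so I did some research and found out that this is an encoding problem. I installed Notepad++, changed the encoding of my .txt file with the Wikisource markup to UTF-8, and saved it. Then I made some changes to my code: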

with open ("C:\\124_BPP_PL_PL.txt", encoding="utf8") as myfile: 
     s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])

But I get this new error message:

Traceback (most recent call last): 
    File "C:\Python33\final.py", line 22, in <module> 
    dest.write(s) 
    File "C:\Python33\lib\encodings\cp1250.py", line 19, in encode 
    return codecs.charmap_encode(input,self.errors,encoding_table)[0] 
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>

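And I am really stuck on this one. I thought that once I changed the encoding manually in Notepad++ and then told Python which encoding I had set, everything would be fine.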

Please help, and thanks in advance.

Source

2013-11-23 exxon

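What codec does Notepad++ think the input file is in when you open it? Why don't you use that encoding to read the file in Python (instead of changing it to UTF-8)? –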

Hi, the codec is "ANSI as UTF-8". I don't know what that means, and I don't know how to set this codec in Python's open() function. Do you know what it is, and how to set it in Python? – exxon

'UTF-8' is fine; there is no ANSI codec, really, it is just the 'local Windows code page dialect', which can be anything between 'cp1250' and 'cp1255' IIRC. –

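When Python 3 opens a text file, it uses the system's default encoding to decode the file so that it can give you full Unicode text (the str type is fully Unicode-aware). The same applies when writing such Unicode text values out again.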
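To check which codec Python falls back to when no encoding is given, you can ask the locale module. This is only a minimal sketch: the value printed depends on your system (judging by the traceback, it was cp1250 on the asker's machine), and the byte string is a made-up sample, not taken from the actual file.

```python
import locale

# The locale's preferred encoding, which open() uses in text mode
# when no encoding is passed (unless Python's UTF-8 mode is enabled).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Byte 0x81 has no character assigned in cp1250, so decoding it fails
# in the same way as in the first traceback.
sample = b"abc\x81"  # hypothetical sample bytes
try:
    sample.decode("cp1250")
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)
```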

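You have already solved the input side; you specified an encoding when reading. Do the same when writing: specify a codec for the output file that can handle Unicode, including the zero-width no-break space (byte order mark) at code point U+FEFF. UTF-8 is usually a good default choice: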

dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8')

When writing, too, you can use a with statement and save yourself the .close() call:

for item in li: 
    s = s.replace("[[Image:]]", item, 1) 

with open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8') as dest:   
    dest.write(s)
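The U+FEFF at position 0 is most likely a byte order mark that Notepad++ wrote at the start of the file. If you would rather not carry it over into the output, Python's utf-8-sig codec drops a leading BOM while reading. A small sketch, assuming a temporary file with made-up content instead of the real document:

```python
import codecs
import os
import tempfile

# Hypothetical sample: a UTF-8 file that starts with a byte order mark
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "zażółć [[Image:]]".encode("utf-8"))

# Plain utf-8 keeps the BOM as the first character of the decoded text
with open(path, encoding="utf-8") as f:
    with_bom = f.read()

# utf-8-sig strips a leading BOM if one is present
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()

print(repr(with_bom[0]))
print(repr(without_bom[0]))
```

Reading with utf-8-sig also works for files that have no BOM, so it is a reasonable choice when you do not control how the input was saved.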

Source

2013-11-23 16:16:26

Thanks! It works perfectly! – exxon
