使python默认使用字符串替换不可编码的字符

我想让python忽略不能编码的字符，只需用字符串"<could not encode>"替换即可。使python默认使用字符串替换不可编码的字符

E.g，假设默认编码是ASCII，命令

'%s is the word'%'ébác'

会产生

'<could not encode>b<could not encode>c is the word'

有什么办法，使这个默认的行为，在所有我的项目？

来源

2009-12-19 olamundo

如果默认编码是ascii，那么''ébác''字符串的编码是什么？ –

@Peter Hansen - 你是对的:)它只是解释我想要的......不好的例子。 – olamundo

的str.encode函数采用限定所述错误处理的可选参数：

str.encode([encoding[, errors]])

从文档：

返回字符串的编码版本。默认编码是当前的默认字符串编码。可能会给出错误来设置不同的错误处理方案。错误的默认值是'strict'，这意味着编码错误会引发UnicodeError。其他可能的值有'ignore'，'replace'，'xmlcharrefreplace'，'backslashreplace'以及通过codecs.register_error（）注册的任何其他名称，请参见编解码器基类。有关可能的编码列表，请参见标准编码部分。

在你的情况下，codecs.register_error函数可能是感兴趣的。

[备注坏字符]

顺便说一句，请注意使用register_error时，你可能会发现自己与你的字符串替换不只是个别坏人的角色，但连续的坏字符组，除非你支付注意。每次运行不好的字符都会得到一个错误处理程序的调用，而不是每个字符。

来源

2009-12-19 15:39:20 miku

在[这个Python测试文件]（https://github.com/python-git/python/blob/master/Lib/test/test_codeccallbacks.py）中有一些如何使用'codecs.register_error'的例子。 –

>>> help("".encode) 
Help on built-in function encode: 

encode(...) 
S.encode([encoding[,errors]]) -> object 

Encodes S using the codec registered for encoding. encoding defaults 
to the default encoding. errors may be given to set a different error 
handling scheme. Default is 'strict' meaning that encoding errors raise 
a UnicodeEncodeError. **Other possible values are** 'ignore', **'replace'** and 
'xmlcharrefreplace' as well as any other name registered with 
codecs.register_error that is able to handle UnicodeEncodeErrors.

所以，举例来说：

>>> x 
'\xc3\xa9b\xc3\xa1c is the word' 
>>> x.decode("ascii") 
Traceback (most recent call last): 
File "<stdin>", line 1, in <module> 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128) 
>>> x.decode("ascii", "replace") 
u'\ufffd\ufffdb\ufffd\ufffdc is the word'

添加您自己的回调codecs.register_error与您所选择的字符串替换。

来源

2009-12-19 15:45:28

使python默认使用字符串替换不可编码的字符

回答

相关问题