How to permissively decode a UTF-8 bytearray?

I need to decode a UTF-8 sequence, which is stored in a bytearray, to a string.

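The UTF-8 sequence might contain erroneous parts. In that case I need to decode as much as possible and (optionally) substitute the invalid parts with something like "?".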

# First part decodes to "ABÄC" 
b = bytearray([0x41, 0x42, 0xC3, 0x84, 0x43]) 
s = str(b, "utf-8") 
print(s) 

# Second part, invalid sequence, wanted to decode to something like "AB?C" 
b = bytearray([0x41, 0x42, 0xC3, 0x43]) 
s = str(b, "utf-8") 
print(s)

What is the best way to accomplish this in Python 3?


2017-01-04 Joe

There are several built-in error-handling schemes for encoding str to, and decoding it from, bytes and bytearray, for example with bytearray.decode():

>>> b = bytearray([0x41, 0x42, 0xC3, 0x43])

>>> b.decode('utf8', errors='ignore') # discard malformed bytes 
'ABC'

>>> b.decode('utf8', errors='replace') # replace with U+FFFD 
'AB�C'

>>> b.decode('utf8', errors='backslashreplace') # replace with backslash-escape 
'AB\\xc3C'
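
For comparison, the same handlers can be exercised outside the interactive prompt. The following is a minimal sketch, not part of the original answer; the helper name `decode_with_each_handler` is made up for illustration. It also shows that the default `errors='strict'` raises `UnicodeDecodeError`, whose `start` and `end` attributes delimit the offending bytes. Using `backslashreplace` for decoding requires Python 3.5 or later.

```python
def decode_with_each_handler(data):
    """Decode `data` as UTF-8 once per built-in lenient handler.

    Hypothetical helper for illustration: returns a dict that maps
    the handler name to the decoded string.
    """
    results = {}
    for handler in ("ignore", "replace", "backslashreplace"):
        results[handler] = data.decode("utf-8", errors=handler)
    return results


# 0xC3 opens a two-byte sequence, but 0x43 ('C') is not a continuation byte.
data = bytearray([0x41, 0x42, 0xC3, 0x43])

try:
    data.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError as exc:
    # exc.start / exc.end are offsets into the input bytes
    print("strict failed at bytes", exc.start, "to", exc.end, ":", exc.reason)

for name, text in decode_with_each_handler(data).items():
    print(name, "->", repr(text))
```

Which handler fits depends on whether the position of the damage matters: `ignore` silently loses it, while `replace` and `backslashreplace` leave a visible marker.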

In addition, you can write your own error handler and register it:

import codecs 

def my_handler(exception): 
    """Replace unexpected bytes with '?'.""" 
    return '?', exception.end 

codecs.register_error('my_handler', my_handler)

>>> b.decode('utf8', errors='my_handler') 
'AB?C'
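
As a variation on the handler above, here is a slightly more defensive sketch. The names `question_mark_handler`, `'question_mark'`, and `decode_permissively` are assumptions chosen for this example, not anything defined by the codecs module. It refuses to handle anything other than decoding errors, and it emits one "?" per undecodable byte instead of one per error, so the output keeps a hint of how much input was damaged.

```python
import codecs


def question_mark_handler(exc):
    """Replace every undecodable byte with '?' and resume after it."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch only supports decoding; re-raise anything else.
        raise exc
    bad_bytes = exc.object[exc.start:exc.end]
    # Return the replacement text and the position to continue from.
    return "?" * len(bad_bytes), exc.end


# The name under which the handler is registered is arbitrary.
codecs.register_error("question_mark", question_mark_handler)


def decode_permissively(data):
    """Decode UTF-8 bytes or bytearray, marking invalid bytes with '?'."""
    return data.decode("utf-8", errors="question_mark")


print(decode_permissively(bytearray([0x41, 0x42, 0xC3, 0x84, 0x43])))
print(decode_permissively(bytearray([0x41, 0x42, 0xC3, 0x43])))
```

Registration is global for the running interpreter, so the handler only needs to be registered once, for instance at import time of the module that defines it.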

All of these error-handling schemes can also be used with the str() constructor, as in your question:

>>> str(b, 'utf8', errors='my_handler') 
'AB?C'

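... although it is more idiomatic to call the decode() method of the bytes or bytearray object explicitly (str itself has no decode() method in Python 3).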


2017-01-04 12:34:25
