UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I know there are many questions about encoding and decoding in Python, but I can't seem to figure this one out:

def content(title, sents):
    """Build an HTML document with one numbered anchor per sentence."""
    sent_elems = []
    for sent_i, sent in enumerate(sents, 1):
        elem = u"<a name=\"{i}\">[{i}]</a> <a href=\"#{i}\" id={i}>{text}</a>".format(i=sent_i, text=sent.text)
        sent_elems.append(elem)
    # title and the joined elements are substituted into a unicode template
    doc = u"""<html>
<head>
<title>{title}</title>
</head>
<body>{elems}</body>
</html>""".format(title=title, elems="\n".join(sent_elems))

    return doc

Calling the content function gives me this error in very rare cases (maybe once or twice in my whole dataset):

  File "processing.py", line 68, in score_summary
    self._write_config(references, summary)
  File "processing.py", line 56, in _write_config
    reference_files = self._write_references(references, reference_dir)
  File "processing.py", line 44, in _write_references
    f.write(rouge_summary_content(reference.id, reference.sents))
  File "processing.py", line 154, in rouge_summary_content
    </html>""".format(title=title, elems="\n".join(sent_elems))
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I have tried changing the append line to:

sent_elems.append(elem.decode("utf-8", "ignore"))

and also to:

sent_elems.append(elem.decode("utf-8", "replace"))

Still the same error.
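One possible reason the two decode variants change nothing: elem is already a unicode string, and the traceback points at the final .format(title=...) call, so the byte string that fails is probably one of the other arguments. A minimal sketch of decoding such a value explicitly before it reaches the template follows. The helper name to_text and the sample value raw_title are hypothetical and not part of the original code.

```python
def to_text(value, encoding="utf-8", errors="replace"):
    """Return value as text, decoding it only if it is a byte string.

    to_text is a hypothetical helper for illustration, not part of the
    original code.
    """
    if isinstance(value, bytes):
        # With errors="replace", undecodable bytes become U+FFFD
        # instead of raising UnicodeDecodeError.
        return value.decode(encoding, errors)
    return value


# Hypothetical title consisting of the byte named in the error message.
raw_title = bytes(bytearray([0x80]))
doc = u"<title>{title}</title>".format(title=to_text(raw_title))
```

With errors="replace" the bad byte shows up as a visible replacement character, which makes it easier to find where the data went wrong than silently dropping it with "ignore".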

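I looked at the data but could not figure out why this happens. I checked the file for which the error occurs, and it contains no non-UTF-8 characters.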

I also added this to my file:

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")

The problem is still there. Any suggestions?

2014-10-01 user3430235

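Do not use 'sys.setdefaultencoding()'. It is like binding a broken leg and walking on instead of going to the ER to have a cast set. Things are still broken, you will feel the pain later on, and the bone will have to be reset anyway. – 2014-10-01 21:01:42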

Most likely your 'title' is a byte string, not unicode. – 2014-10-01 21:02:51

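That would cause more problems. By setting sys.setdefaultencoding("utf-8") I got past almost all of the encoding and decoding errors. Only a few persistent cases remain, and I need to either get rid of them or find out where they come from. – user3430235 2014-10-01 21:04:58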
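As an illustration of the alternative the first comment hints at, here is a minimal sketch that keeps text as unicode inside the program and names the encoding only where the file is opened, instead of changing the interpreter-wide default. The helpers write_text and read_text and the file name reference.html are assumptions made for this example, not names from the original code.

```python
import io
import os
import tempfile


def write_text(path, text, encoding="utf-8"):
    """Write unicode text, encoding it at the file boundary."""
    # io.open takes an explicit encoding on both Python 2 and Python 3.
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)


def read_text(path, encoding="utf-8"):
    """Read a file back as unicode text."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Hypothetical output file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "reference.html")
write_text(path, u"<body>caf\u00e9</body>")
round_tripped = read_text(path)
```

With this layout, a byte string that sneaks into the text path fails at the point where it is mixed in, rather than being decoded implicitly somewhere else.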

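My title was chr(65+index), so once the index ran past the uppercase letters it eventually produced bytes that are not valid UTF-8. I changed it to str(index), and that solved my original problem.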
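The arithmetic behind this can be checked directly. In Python 2, chr(n) returns a one-byte string, and a lone byte of 0x80 or above is not a valid UTF-8 start byte, which matches the error message. The sketch below looks for the first index at which the byte 65 + index stops decoding; the function name first_undecodable_index is hypothetical, and bytearray is used so that the same one-byte string is built on Python 2 and Python 3.

```python
def first_undecodable_index(base=65):
    """Return the first index whose single byte (base + index) is not valid UTF-8."""
    for index in range(256 - base):
        # bytearray builds the same one-byte string that Python 2's chr() returns.
        byte = bytes(bytearray([base + index]))
        try:
            byte.decode("utf-8")
        except UnicodeDecodeError:
            return index
    return None


bad_index = first_undecodable_index()
bad_byte = 65 + bad_index
```

Under these assumptions the failure only appears for documents with at least 64 entries, which would be consistent with it showing up just once or twice in a whole dataset.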

2014-10-01 21:27:04 user3430235

Unfortunately, this did not solve the problem. I now get a different error. – user3430235 2014-10-02 16:58:39

If your data looks like the following:

data="0\x80\x06\t*\x86H\x86\xf7\r\x01\x07\x04\xa0\x800\x80\x02\x01\x01\x0e0\x0c\x06\b*\x86H\x86\xf7\r\x02\x05\x05....."

then Base64-encoding it, as below, yields ASCII text that decodes as UTF-8 without error:

import base64
import urllib
encoded = base64.b64encode(data)
decoded = urllib.unquote(encoded).decode('utf8')  # urllib.unquote is the Python 2 API

The result will look something like this:

MIAGCSqGSIb3DQEHAq...
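Note that this approach represents the bytes as text rather than decoding them: Base64 output uses only ASCII characters, so it always decodes, and the original bytes come back only after a Base64 decode. A small sketch of that round trip follows, using just the first twelve bytes of the sample data because the rest is elided in the post. The unquote step is left out here on the assumption that it is a no-op, since Base64 output contains no percent escapes.

```python
import base64

# First twelve bytes of the sample data; the remainder is elided in the post.
data = b"0\x80\x06\t*\x86H\x86\xf7\r\x01\x07"

encoded = base64.b64encode(data)      # ASCII-only byte string
text = encoded.decode("ascii")        # always succeeds for Base64 output
restored = base64.b64decode(encoded)  # back to the original raw bytes
```

This is useful when binary data has to travel through a text-only channel, but it does not turn the underlying bytes into meaningful UTF-8 text.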

2016-10-11 09:52:19 vijay
