0
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128) 

This is the error I get when trying to clean a list of names extracted from an HTML page using spaCy. How do I fix UnicodeDecodeError: 'ascii' codec can't decode byte?

My code:

from __future__ import unicode_literals  # future imports must come before any other statement
import urllib
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English

nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url): 
    r = requests.get(url) 
    soup = BeautifulSoup(r.content, "lxml") 

    # delete unwanted tags: 
    for s in soup(['figure', 'script', 'style']):
        s.decompose()

    # use separator to separate paragraphs and subtitles! 
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': 'story-body__inner'})] 

    text = ''.join(article_soup) 
    return text 

# using spacy 
def get_names(all_tags):
    names = []
    for ent in all_tags.ents:
        if ent.label_ == "PERSON":
            names.append(str(ent))
    return names

def cleaning_names(names):
    new_names = [s.strip("'s") for s in names]  # strip 's from the ends of names
    myset = list(set(new_names))  # remove duplicates
    return myset

def main():
    url = "http://www.bbc.co.uk/news/uk-politics-39784164"
    text = get_text(url)
    text = u"{}".format(text)
    all_tags = nlp(text)
    names = get_names(all_tags)  # was get_person, which is not defined
    print "names:"
    print names
    mynewlist = cleaning_names(names)
    print mynewlist

if __name__ == '__main__': 
    main() 

For this particular URL I get a list of names that includes characters like £ or $ ('\xc2\xa3' is the UTF-8 byte sequence for '£'):

['Nick Clegg', 'Brexit', '\xc2\xa359bn', 'Theresa May', 'Brexit', 'Brexit', 'Mr Clegg', 'Mr Clegg', 'Mr Clegg', 'Brexit', 'Mr Clegg', 'Theresa May']

Then the error:

Traceback (most recent call last)
<ipython-input-19-8582e806c94a> in <module>()
    47 
    48 if __name__ == '__main__': 
---> 49  main() 

<ipython-input-19-8582e806c94a> in main() 
    43  print "names:" 
    44  print names 
---> 45  mynewlist = cleaning_names(names) 
    46  print mynewlist 
    47 

<ipython-input-19-8582e806c94a> in cleaning_names(names) 
    31 
    32 def cleaning_names(names): 
---> 33  new_names = [s.strip("'s") for s in names] # remove 's' from names 
    34  myset = list(set(new_names)) #remove duplicates 
    35  return myset 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128) 
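
The failure reproduces in isolation in Python 2: calling strip with a Unicode argument on a UTF-8 byte string forces an implicit decode of the bytes with the default ASCII codec:

>>> '\xc2\xa359bn'.strip(u"'s")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)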

I have tried different ways of fixing the Unicode (including sys.setdefaultencoding('utf8')), and nothing has worked. I hope someone has had the same problem before and can suggest a fix. Thanks!

+0

Clean up your traceback. It's unreadable. – Kanak

+0

We don't know where the error occurs, and it can't be reproduced because of the libraries involved. If you fix the list of names manually, does it work? – handle

+1

Have you checked the related questions, shown on the right? – handle

Answers

0

I finally fixed my code. I'm surprised how easy it looks, but it took me a long time to get there, and I've seen many people puzzled by the same problem, so I decided to post my answer.

Adding this small function before doing any further cleaning of the names solved my problem:

def decode(names):
    decodednames = []
    for name in names:
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames
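
Note that in Python 2, unicode(name, errors='ignore') decodes with the default ASCII codec and silently drops every non-ASCII byte, which is why the '£' disappears entirely:

>>> unicode('\xc2\xa359bn', errors='ignore')
u'59bn'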

spaCy still thinks £59bn is a person, but that's OK with me; I can deal with it later in my code.

Working code:

from __future__ import unicode_literals  # future imports must come before any other statement
import urllib
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English

nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url): 
    r = requests.get(url) 
    soup = BeautifulSoup(r.content, "lxml") 

    # delete unwanted tags: 
    for s in soup(['figure', 'script', 'style']):
        s.decompose()

    # use separator to separate paragraphs and subtitles! 
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': 'story-body__inner'})] 

    text = ''.join(article_soup) 
    return text 

# using spacy 
def get_names(all_tags):
    names = []
    for ent in all_tags.ents:
        if ent.label_ == "PERSON":
            names.append(str(ent))
    return names

def decode(names):
    decodednames = []
    for name in names:
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames

def cleaning_names(names):
    new_names = [s.strip("'s") for s in names]  # strip 's from the ends of names
    myset = list(set(new_names))  # remove duplicates
    return myset

def main():
    url = "http://www.bbc.co.uk/news/uk-politics-39784164"
    text = get_text(url)
    text = u"{}".format(text)
    all_tags = nlp(text)
    names = get_names(all_tags)  # was get_person, which is not defined
    print "names:"
    print names
    decodednames = decode(names)
    mynewlist = cleaning_names(decodednames)
    print mynewlist

if __name__ == '__main__': 
    main() 

This gives me the following without any errors:

names:
['Nick Clegg', 'Brexit', '\xc2\xa359bn', 'Theresa May', 'Brexit', 'Brexit', 'Mr Clegg', 'Mr Clegg', 'Mr Clegg', 'Brexit', 'Mr Clegg', 'Theresa May']
[u'Mr Clegg', u'Brexit', u'Nick Clegg', u'59bn', u'Theresa May']

+1

Sure, you can simply throw away every character that isn't ASCII; that's the easy way out. It may come back to bite you later, though. The proper way to do the conversion is to let the libraries do it for you, since they know the proper encoding and you don't. –

1

When you get a decode error from the 'ascii' codec, it's usually an indication that a byte string is being used in a context where a Unicode string is required (in Python 2; Python 3 won't allow it at all).

Since you did from __future__ import unicode_literals, the string "'s" is Unicode. That means the string you're trying to strip must also be a Unicode string. Fix that and you won't get the error anymore.
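
A minimal sketch of that fix, relying on spaCy's Span.text attribute (which returns a Unicode string), is to avoid creating byte strings in the first place:

def get_names(all_tags):
    names = []
    for ent in all_tags.ents:
        if ent.label_ == "PERSON":
            names.append(ent.text)  # ent.text is already Unicode; str(ent) produces UTF-8 bytes
    return names

With Unicode strings throughout, cleaning_names can strip with "'s" without triggering the implicit ASCII decode.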

+0

That's exactly what I was trying to fix. – aviss

+0

@aviss You had an answer, since deleted, that told you how to fix it. I don't know requests or BeautifulSoup well enough to give specifics. –

0

As @MarkRansom commented, ignoring non-ASCII characters will come back to bite you.

First, take a look at the related questions on this error. Also, note that sys.setdefaultencoding is an anti-pattern: Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
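
In that spirit, a sketch of letting the libraries do the decoding for you in Python 2 (r.text applies the encoding requests detects from the HTTP headers, rather than handing raw bytes around):

>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.text, 'lxml')  # feed BeautifulSoup Unicode, not bytes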

The simplest fix, though, is just to use Python 3, which removes a lot of this pain:

>>> import requests 
>>> from bs4 import BeautifulSoup 
>>> import spacy 
>>> nlp = spacy.load('en') 

>>> url = "http://www.bbc.co.uk/news/uk-politics-39784164" 
>>> html = requests.get(url).content 
>>> bsoup = BeautifulSoup(html, 'html.parser') 
>>> text = '\n'.join(p.text for d in bsoup.find_all('div', {'class': 'story-body__inner'}) for p in d.find_all('p') if p.text.strip()) 

>>> doc = nlp(text) 
>>> names = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']  # Span entities use .label_, not .ent_type_ 
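
To mirror the question's cleaning_names step, a set comprehension handles the deduplication (a sketch; rstrip("'s") trims trailing ' and s characters, matching the original intent):

>>> sorted({name.rstrip("'s") for name in names}) 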
