0
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128) 

This is the error I get when trying to clean a list of names extracted from an HTML page using spaCy. How do I fix UnicodeDecodeError: 'ascii' codec can't decode byte?

My code:

from __future__ import unicode_literals  # future imports must come before any other statement
import urllib
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English

nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url): 
    r = requests.get(url) 
    soup = BeautifulSoup(r.content, "lxml") 

    # delete unwanted tags: 
    for s in soup(['figure', 'script', 'style']):
        s.decompose()

    # use separator to separate paragraphs and subtitles! 
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': 'story-body__inner'})] 

    text = ''.join(article_soup) 
    return text 

# using spacy 
def get_names(all_tags):
    names = []
    for ent in all_tags.ents:
        if ent.label_ == "PERSON":
            names.append(str(ent))
    return names

def cleaning_names(names):
    new_names = [s.strip("'s") for s in names]  # strip 's from the ends of names
    myset = list(set(new_names))  # remove duplicates
    return myset

def main():
    url = "http://www.bbc.co.uk/news/uk-politics-39784164"
    text = get_text(url)
    text = u"{}".format(text)
    all_tags = nlp(text)
    names = get_names(all_tags)  # was get_person, which is not defined
    print "names:"
    print names
    mynewlist = cleaning_names(names)
    print mynewlist

if __name__ == '__main__': 
    main() 

For this particular URL I get a list of names that includes characters like £ or $ ('\xc2\xa3' is the UTF-8 byte sequence for '£'):

['Nick Clegg', 'Brexit', '\xc2\xa359bn', 'Theresa May', 'Brexit', 'Brexit', 'Mr Clegg', 'Mr Clegg', 'Mr Clegg', 'Brexit', 'Mr Clegg', 'Theresa May']

Then the error:

Traceback (most recent call last)
<ipython-input-19-8582e806c94a> in <module>()
    47 
    48 if __name__ == '__main__': 
---> 49  main() 

<ipython-input-19-8582e806c94a> in main() 
    43  print "names:" 
    44  print names 
---> 45  mynewlist = cleaning_names(names) 
    46  print mynewlist 
    47 

<ipython-input-19-8582e806c94a> in cleaning_names(names) 
    31 
    32 def cleaning_names(names): 
---> 33  new_names = [s.strip("'s") for s in names] # remove 's' from names 
    34  myset = list(set(new_names)) #remove duplicates 
    35  return myset 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128) 
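
The failure reproduces in isolation in Python 2: calling strip with a Unicode argument on a UTF-8 byte string forces an implicit decode of the bytes with the default ASCII codec:

>>> '\xc2\xa359bn'.strip(u"'s")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)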

I have tried different ways of fixing the Unicode (including sys.setdefaultencoding('utf8')), and nothing has worked. I hope someone has had the same problem before and can suggest a fix. Thanks!

+0

Clean up your traceback. It's unreadable. – Kanak

+0

We don't know where the error occurs, and it can't be reproduced because of the libraries involved. If you fix the list of names manually, does it work? – handle

+1

Have you checked the related questions, shown on the right? – handle

Answers

0

I finally fixed my code. I'm surprised how easy it looks, but it took me a long time to get there, and I've seen many people puzzled by the same problem, so I decided to post my answer.

Adding this small function before doing any further cleaning of the names solved my problem:

def decode(names):
    decodednames = []
    for name in names:
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames
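
Note that in Python 2, unicode(name, errors='ignore') decodes with the default ASCII codec and silently drops every non-ASCII byte, which is why the '£' disappears entirely:

>>> unicode('\xc2\xa359bn', errors='ignore')
u'59bn'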

spaCy still thinks £59bn is a person, but that's OK with me; I can deal with it later in my code.

Working code:

from __future__ import unicode_literals  # future imports must come before any other statement
import urllib
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English

nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url): 
    r = requests.get(url) 
    soup = BeautifulSoup(r.content, "lxml") 

    # delete unwanted tags: 
    for s in soup(['figure', 'script', 'style']):
        s.decompose()

    # use separator to separate paragraphs and subtitles! 
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all('div', {'class': 'story-body__inner'})] 

    text = ''.join(article_soup) 
    return text 

# using spacy 
def get_names(all_tags):
    names = []
    for ent in all_tags.ents:
        if ent.label_ == "PERSON":
            names.append(str(ent))
    return names

def decode(names):
    decodednames = []
    for name in names:
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames

def cleaning_names(names):
    new_names = [s.strip("'s") for s in names]  # strip 's from the ends of names
    myset = list(set(new_names))  # remove duplicates
    return myset

def main():
    url = "http://www.bbc.co.uk/news/uk-politics-39784164"
    text = get_text(url)
    text = u"{}".format(text)
    all_tags = nlp(text)
    names = get_names(all_tags)  # was get_person, which is not defined
    print "names:"
    print names
    decodednames = decode(names)
    mynewlist = cleaning_names(decodednames)
    print mynewlist

if __name__ == '__main__': 
    main() 

This gives me the following without any errors:

names:
['Nick Clegg', 'Brexit', '\xc2\xa359bn', 'Theresa May', 'Brexit', 'Brexit', 'Mr Clegg', 'Mr Clegg', 'Mr Clegg', 'Brexit', 'Mr Clegg', 'Theresa May']
[u'Mr Clegg', u'Brexit', u'Nick Clegg', u'59bn', u'Theresa May']

+1

Sure, you can simply throw away every character that isn't ASCII; that's the easy way out. It may come back to bite you later, though. The proper way to do the conversion is to let the libraries do it for you, since they know the proper encoding and you don't. –

1

When you get a decode error from the 'ascii' codec, it's usually an indication that a byte string is being used in a context where a Unicode string is required (in Python 2; Python 3 won't allow it at all).

Since you did from __future__ import unicode_literals, the string "'s" is Unicode. That means the string you're trying to strip must also be a Unicode string. Fix that and you won't get the error anymore.
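
A minimal sketch of that fix, relying on spaCy's Span.text attribute (which returns a Unicode string), is to avoid creating byte strings in the first place:

def get_names(all_tags):
    names = []
    for ent in all_tags.ents:
        if ent.label_ == "PERSON":
            names.append(ent.text)  # ent.text is already Unicode; str(ent) produces UTF-8 bytes
    return names

With Unicode strings throughout, cleaning_names can strip with "'s" without triggering the implicit ASCII decode.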

+0

That's exactly what I was trying to fix. – aviss

+0

@aviss You had an answer, since deleted, that told you how to fix it. I don't know requests or BeautifulSoup well enough to give specifics. –

0

As @MarkRansom commented, ignoring non-ASCII characters will come back to bite you.

First, take a look at the related questions on this error. Also, note that sys.setdefaultencoding is an anti-pattern: Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
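
In that spirit, a sketch of letting the libraries do the decoding for you in Python 2 (r.text applies the encoding requests detects from the HTTP headers, rather than handing raw bytes around):

>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.text, 'lxml')  # feed BeautifulSoup Unicode, not bytes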

The simplest fix, though, is just to use Python 3, which removes a lot of this pain:

>>> import requests 
>>> from bs4 import BeautifulSoup 
>>> import spacy 
>>> nlp = spacy.load('en') 

>>> url = "http://www.bbc.co.uk/news/uk-politics-39784164" 
>>> html = requests.get(url).content 
>>> bsoup = BeautifulSoup(html, 'html.parser') 
>>> text = '\n'.join(p.text for d in bsoup.find_all('div', {'class': 'story-body__inner'}) for p in d.find_all('p') if p.text.strip()) 

>>> doc = nlp(text) 
>>> names = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']  # Span entities use .label_, not .ent_type_ 
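
To mirror the question's cleaning_names step, a set comprehension handles the deduplication (a sketch; rstrip("'s") trims trailing ' and s characters, matching the original intent):

>>> sorted({name.rstrip("'s") for name in names}) 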
