Encoding error while scraping

I am scraping a Chinese website whose pages contain Chinese characters. How can I scrape the Chinese characters?

from urllib.request import urlopen 
from urllib.parse import urljoin 

from lxml.html import fromstring 

URL = 'http://list.suning.com/0-258003-0.html' 
ITEM_PATH = '.clearfix .product .border-out .border-in .wrap .res-info .sell-point' 

def parse_items(): 
    f = urlopen(URL) 
    list_html = f.read().decode('utf-8') 
    list_doc = fromstring(list_html) 

    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        title = a.text
        em = elem.cssselect('em')[0]
        title2 = em.text
        print(href, title, title2)


def main(): 
    parse_items() 

if __name__ == '__main__': 
    main()

The error looks like this:

http://product.suning.com/0000000000/146422477.html Traceback (most recent call last): 
    File "parser.py", line 27, in <module> 
    main() 
    File "parser.py", line 24, in main 
    parse_items() 
    File "parser.py", line 20, in parse_items 
    print(href, title, title2) 
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Source

2016-06-14 Andrew Gowa

Please provide the full error stack you were given for this problem. – DomTomCat

I have some problems with UTF-8 encoding. I have added the full error stack. –

Maybe this answer can help you: [http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20](http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) – gzc

From the print syntax and the imports, I assume you are using Python 3. That matters, because it determines how Unicode strings are handled.

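So we can expect href, title and title2 to all be Unicode strings (that is, Python 3 str objects). But the print function has to convert those strings to an encoding that the output system accepts. For a reason I cannot know, your system uses ASCII by default, hence the error.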
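The failure can be reproduced without any scraping. The following is a minimal sketch, assuming the output encoding really is ASCII as the traceback suggests. The helper name can_encode and the sample string are made up for illustration.

```python
import sys

def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

sample = "苏宁"  # made-up sample text, not taken from the scraped page

# print() must encode str objects into the encoding of sys.stdout.
# sys.stdout.encoding can be None if stdout was replaced by a custom object.
stdout_encoding = sys.stdout.encoding or "ascii"

ascii_ok = can_encode(sample, "ascii")   # Chinese characters are outside ASCII
utf8_ok = can_encode(sample, "utf-8")    # UTF-8 can represent any str

print("stdout encoding:", stdout_encoding)
print("ASCII can encode the sample:", ascii_ok)
print("UTF-8 can encode the sample:", utf8_ok)
```

If the first line printed on your machine is something like ANSI_X3.4-1968 or US-ASCII, that would explain the exception in the question.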

How to fix it:

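The best way would be to make your system accept Unicode. On Linux or other Unix systems, you can declare a UTF-8 charset in the LANG environment variable (export LANG=en_US.UTF-8). On Windows you can try chcp 65001, but that one is far from guaranteed to work. If it does not work, or does not meet your needs, you can force an explicit encoding, or more precisely filter out the offending characters, because Python 3 natively uses Unicode strings.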
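One way to force an explicit encoding from inside the script, sketched below, is to wrap the binary buffer behind standard output in a text layer that always writes UTF-8. The helper names utf8_writer and install_utf8_stdout are hypothetical; io.TextIOWrapper and sys.stdout.buffer come from the standard library. This assumes the terminal can actually display UTF-8 bytes. If it cannot, nothing is raised but the output may look garbled.

```python
import io
import sys

def utf8_writer(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)

def install_utf8_stdout():
    """Replace sys.stdout so that print() no longer depends on the locale.

    Assumption: sys.stdout is the regular console stream and therefore has
    a .buffer attribute.
    """
    sys.stdout = utf8_writer(sys.stdout.buffer)

# Demonstration on an in-memory buffer, so it behaves the same on any console.
buffer = io.BytesIO()
out = utf8_writer(buffer)
print("苏宁", file=out)   # made-up sample text
out.flush()
written = buffer.getvalue()  # the UTF-8 bytes that were written
```

In the script from the question, calling install_utf8_stdout() once at the top of main() should be enough, provided the assumption about the terminal holds.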

I would use:

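import sys

def u_filter(s, encoding=sys.stdout.encoding):
    # Round-trip str values through the output encoding; characters it
    # cannot represent become '?'. Non-str values pass through unchanged.
    return (s.encode(encoding, errors='replace').decode(encoding)
            if isinstance(s, str) else s)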

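That means: if s is a Unicode string, encode it in the encoding used for standard output, replacing any character that cannot be converted with a replacement character, then decode it back into a string that is now clean.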

And next:

def fprint(*args, **kwargs): 
    fargs = [ u_filter(arg) for arg in args ] 
    print(*fargs, **kwargs)

That means: filter any offending characters out of the Unicode strings and print everything else unchanged.

That way, you can safely replace the print call that throws the exception with:

fprint(href, title, title2)
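To see what the filter does to the data, here is a small self-contained sketch. It uses a variant of u_filter that resolves the default encoding at call time and falls back to ASCII when sys.stdout.encoding is unavailable (that fallback is an assumption, not part of the answer above). The sample strings are made up.

```python
import sys

def u_filter(s, encoding=None):
    # Assumption: fall back to ASCII if the stdout encoding is unknown.
    encoding = encoding or sys.stdout.encoding or "ascii"
    return (s.encode(encoding, errors="replace").decode(encoding)
            if isinstance(s, str) else s)

filtered = u_filter("苏宁 TV", "ascii")    # non-ASCII characters become '?'
unchanged = u_filter("苏宁 TV", "utf-8")   # UTF-8 can represent everything
number = u_filter(42, "ascii")             # non-str values are returned as-is

print(ascii(filtered))  # ascii() keeps this print safe on any console
```

The trade-off is that the filtered text is lossy: every character the console cannot show is replaced by a question mark, so this suits log output better than data you intend to store.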

Source

2016-06-14 14:34:15
