Python: printing a binary string

I have a crawler that parses the HTML of a given site and prints part of the source code. Here is my script:

#!/usr/bin/env python 
# -*- encoding: utf-8 -*- 
from bs4 import BeautifulSoup 
import requests 
import urllib.request 
import re 

class Crawler:

    headers = {'User-Agent': 'Mozilla/5.0'}
    keyword = 'arroz'

    def extra(self):
        url = "http://buscando.extra.com.br/search?w=" + self.keyword
        r = requests.head(url, allow_redirects=True)
        print(r.url)
        html = urllib.request.urlopen(urllib.request.Request(url, None, self.headers)).read()
        soup = BeautifulSoup(html, 'html.parser')
        return soup.encode('utf-8')

    def __init__(self):
        extra = self.extra()
        print(extra)

Crawler()

My code works fine, but it prints the source with an annoying b' prefix. I have already tried using decode('utf-8'), but it did not work. Any ideas?

UPDATE

If I don't use encode('utf-8'), I get the following error:

Traceback (most recent call last): 
    File "crawler.py", line 25, in <module> 
    Crawler() 
    File "crawler.py", line 23, in __init__ 
    print(extra) 
    File "c:\Python34\lib\encodings\cp850.py", line 19, in encode 
    return codecs.charmap_encode(input,self.errors,encoding_map)[0] 
UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position 
13345: character maps to <undefined>

Source

2015-11-01 bodruk

So why are you using 'encode' here? Try 'return soup'. –

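Without it, it returns the following error: Traceback (most recent call last): File "crawler.py", line 25, in <module> Crawler() File "crawler.py", line 23, in __init__ print(extra) File "c:\Python34\lib\encodings\cp850.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_map)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u2030' in position 13345: character maps to <undefined> – bodruk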

'bytes' has no 'encode' method in Python 3. You are starting with a string and converting it to a byte string. – chucksmash
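To illustrate the point about str versus bytes, here is a minimal sketch of where the b' prefix comes from. The sample text is made up for the example and is not taken from the crawled page; the only claim is about how encode(), decode() and print() behave in Python 3.

```python
# A str holds text; encode() converts it to bytes in the given encoding.
text = "pre\u00e7o: 10\u2030"  # hypothetical sample text, not from the real page
data = text.encode("utf-8")

# print() displays the repr of a bytes object, which is why the
# output starts with b'. The repr is plain ASCII, so it prints anywhere.
print(type(data).__name__)
print(data)

# decode() reverses encode() and gives back the original str.
roundtrip = data.decode("utf-8")
print(roundtrip == text)
```

Note that the sketch prints only a comparison result rather than the decoded text itself, since printing the text would depend on what the terminal can display.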

When I run the code as given, except with return soup.encode('utf-8') replaced by return soup, it works fine. My environment:

OS: Ubuntu 15.10
Python: 3.4.3
python3 dist-packages bs4 version: beautifulsoup4==4.3.2

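This makes me suspect that the problem lies in your environment, not in your code. Your stack trace mentions cp850.py, and your source code is hitting a .com.br site, which makes me think your shell's default encoding may not be able to handle some Unicode characters. Here is the Wikipedia page for cp850 (Code Page 850).

You can check which encoding your terminal uses by default with: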

>>> import sys 
>>> sys.stdout.encoding

My terminal responds with:

'UTF-8'

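I assume yours won't, and that this is the root of the problem you are running into.

Edit:

Actually, I can reproduce your error exactly with: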

>>> print("\u2030".encode("cp850"))

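So that is the problem: because of your computer's locale settings, print implicitly encodes the string to the system's default encoding, and that raises a UnicodeEncodeError.

Updating Windows to display Unicode characters from the command prompt is outside my wheelhouse, so apart from pointing you to the relevant question/answer, I can't offer any advice.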
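As a rough sketch of the failure and one possible way around it: the snippet below reproduces the encoding error without printing anything risky, then replaces characters the target encoding cannot represent. The helper name safe_text is hypothetical (it is not part of any library), and replacing characters loses information, so treat this as a stopgap rather than a real fix for the terminal.

```python
import sys


def safe_text(text, encoding=None):
    """Replace characters that `encoding` cannot represent with '?'.

    Hypothetical helper for illustration; defaults to the terminal's encoding.
    """
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)


# Reproduce the error: cp850 has no mapping for U+2030 (per mille sign).
error_name = None
try:
    "\u2030".encode("cp850")
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
print(error_name)

# With errors="replace" the unmappable character becomes '?'.
cleaned = safe_text("10\u2030", "cp850")
print(cleaned)
```

In the crawler this would mean something like print(safe_text(str(soup))), assuming that losing a few characters in the console output is acceptable.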

Source

2015-11-01 04:26:45 chucksmash

Yeah! It returned 'cp850'. What should I do? – bodruk

I tried the linked solution, with no success... Thank you. – bodruk
