2017-04-24 104 views
0

Python 3 Beautiful Soup web scraping

I am currently working with BeautifulSoup and seem to have an encoding-related problem.

Here is my code:

import requests 
from bs4 import BeautifulSoup 
req = requests.get('https://pythonprogramming.net/parsememcparseface/') 
soup = BeautifulSoup(req.content.decode('utf-8','ignore')) 
print(soup.find_all('p')) 

Here is my error:

UnicodeEncodeError: 'ascii' codec can't encode character '\u1d90' in position 602: ordinal not in range(128) 

Any help would be appreciated.
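
A note on where this error usually comes from: the message names the 'ascii' codec, which suggests the failure happens when print() encodes the text for the terminal rather than inside BeautifulSoup. The sketch below reproduces that situation without any network access. The sample string and the helper name printable are assumptions made up for illustration; they are not part of the original post.

```python
# Minimal sketch: reproduce the encode failure, then print defensively.
# SAMPLE and printable() are hypothetical, chosen only for illustration.
import sys

SAMPLE = "Wh\u1d90t h\u03b1pp\u00e9ns now\u00bf"  # contains U+1D90, as in the error


def printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by backslash escapes, so printing it cannot raise."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Strict encoding to ASCII fails with the same exception type as in the question.
try:
    SAMPLE.encode("ascii")
except UnicodeEncodeError as exc:
    print("reproduced:", exc.reason)

# With an ASCII target, unsupported characters become visible escapes.
print(printable(SAMPLE, "ascii"))
```

In the question's script, something like print(printable(str(soup.find_all('p')))) would apply the same idea, at the cost of showing escapes instead of the real characters.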

Sorry, the link you just sent me is the link to this very post. –

Why are you decoding 'req.content'? – Afaq

I can't reproduce any problem in Python 2 or 3. In any case, I'd suggest replacing req.content.decode('utf-8','ignore') with req.text. –

Answers

0

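I can reproduce your error message and eliminate the troublesome character.

First, this code simply requests the page and tries to save it. The attempt fails with the message you saw. I then create a copy of the page by converting it to bytes, ignoring the ugly character codes, and converting it back to characters. Now the page can be saved successfully.

I use that copy to make the soup and find the paragraph tags.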

>>> from bs4 import BeautifulSoup 
>>> import requests 
>>> req = requests.get('https://pythonprogramming.net/parsememcparseface/').text 
>>> open('c:/scratch/temp.htm', 'w').write(req) 
Traceback (most recent call last): 
    File "<interactive input>", line 1, in <module> 
    File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode 
    return codecs.charmap_encode(input,self.errors,encoding_table)[0] 
UnicodeEncodeError: 'charmap' codec can't encode character '\u1d90' in position 6702: character maps to <undefined> 
>>> modReq = str(req.encode('utf-8', 'ignore')) 
>>> open('c:/scratch/temp.htm', 'w').write(modReq) 
12556 
>>> soup = BeautifulSoup(modReq, 'lxml') 
>>> paras = soup.findAll('p') 
>>> len(paras) 
12 
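
One caveat about the transcript above: in Python 3, calling str() on a bytes object returns its repr (text beginning with b'), so the saved file holds escaped bytes rather than the original characters. An alternative, sketched here under the assumption that the goal is simply to save the page text, is to open the file with an explicit encoding. The sample text and the temporary path are placeholders, not part of the original answer.

```python
# Sketch: save text containing non-ASCII characters by naming the encoding.
import os
import tempfile

# Stand-in for the string returned by requests' .text attribute.
page_text = "<p>Wh\u1d90t h\u03b1pp\u00e9ns now\u00bf</p>"

# str() on bytes yields the repr, not decoded text.
as_repr = str(page_text.encode("utf-8"))

# Writing with encoding="utf-8" avoids the platform default (cp1252 in the traceback).
path = os.path.join(tempfile.mkdtemp(), "temp.htm")
with open(path, "w", encoding="utf-8") as fh:
    fh.write(page_text)

# Reading back with the same encoding recovers the original characters.
with open(path, encoding="utf-8") as fh:
    round_tripped = fh.read()

print(as_repr.startswith("b'"))
print(round_tripped == page_text)
```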

Thank you very much. I appreciate the help. –

You're very welcome! –

0

Please add "html5lib" or "html.parser" as the parser argument:

#!/usr/bin/python 
# -*- coding: utf-8 -*- 

... 

# Python 3.6.0 
soup = BeautifulSoup(req.content.decode('utf-8','ignore'), "html5lib") 

# Python 2.7.12 
soup = BeautifulSoup(req.content.decode('utf-8','ignore'), "html.parser") 

Thanks for the suggestion. I tried it, but it didn't work. Same error. –

Can you give me the output of the 'pip freeze' command? And your Python and operating system versions? – Dariusz

Python 3.6.0. OS X Yosemite 10.10.5. pyperclip==1.5.27 PyScreeze==0.1.9 PyTweening==1.0.3 pytz==2016.10 requests==2.12.5 Send2Trash==1.3.0 six==1.10.0 virtualenv==15.1.0 webencodings==0.5.1 –
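
Alongside package versions, the encoding settings of the interpreter are often the more telling detail for a UnicodeEncodeError raised by print(). The following is a small sketch that gathers them using only the standard library; the function name encoding_report is made up for illustration.

```python
# Sketch: collect the settings that decide how print() encodes text.
import locale
import sys


def encoding_report():
    """Return the interpreter version and the encodings in effect."""
    return {
        "python": sys.version.split()[0],
        "platform": sys.platform,
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


for key, value in encoding_report().items():
    print(f"{key}: {value}")
```

If stdout_encoding comes back as an ASCII variant such as US-ASCII or ANSI_X3.4-1968, setting the PYTHONIOENCODING environment variable to utf-8 before starting Python is one common remedy.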

0

I tried to reproduce the problem you are facing, but couldn't.

Here is what I tried.

>>> import requests 
>>> from bs4 import BeautifulSoup 

>>> req = requests.get('https://pythonprogramming.net/parsememcparseface/') 

>>> soup = BeautifulSoup(req.content.decode('utf-8','ignore')) 


Warning (from warnings module): 
    File "C:\Python34\lib\site-packages\bs4\__init__.py", line 166 
    markup_type=markup_type)) 
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. 

To get rid of this warning, change this: 

BeautifulSoup([your markup]) 

to this: 

BeautifulSoup([your markup], "html.parser") 

>>> soup = BeautifulSoup(req.content.decode('utf-8','ignore'), 'html.parser') 
>>> print(soup.find_all('p')) 
[<p class="introduction">Oh, hello! This is a <span style="font-size:115%">wonderful</span> page meant to let you practice web scraping. This page was originally created to help people work with the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a> library.</p>, <p>The following table gives some general information for the following <code>programming languages</code>:</p>, <p>I think it's clear that, on a scale of 1-10, python is:</p>, <p>Javascript (dynamic data) test:</p>, <p class="jstest" id="yesnojs">y u bad tho?</p>, <p>Whᶐt hαppéns now¿</p>, <p><a href="/sitemap.xml" target="blank"><strong>sitemap</strong></a></p>, <p> 
<a class="btn btn-flat white modal-close" href="#">Cancel</a>   
         <a class="waves-effect waves-blue blue btn btn-flat modal-action modal-close" href="#">Login</a> 
</p>, <p> 
<a class="btn btn-flat white modal-close" href="#">Cancel</a>   
           <button class="btn" type="submit" value="Register">Sign Up</button> 
</p>, <p class="grey-text text-lighten-4">Contact: [email protected]</p>, <p class="grey-text right" style="padding-right:10px">Programming is a superpower.</p>] 

Thanks for trying. I may be somewhat confused about this problem. Nobody else seems to run into it when they try the code. –
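
A plausible explanation, not confirmed anywhere in this thread, is that the other posters' consoles could encode the character while the asker's standard output was configured as ASCII. Under that assumption, one workaround is to wrap the output stream so it always encodes as UTF-8. The sketch below demonstrates the wrapper against an in-memory buffer; the helper name utf8_writer is hypothetical.

```python
# Sketch: force UTF-8 on an output stream regardless of the locale.
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(
        binary_stream, encoding="utf-8", newline="\n", line_buffering=True
    )


# Demonstration against an in-memory buffer instead of the real terminal.
buffer = io.BytesIO()
out = utf8_writer(buffer)
print("Wh\u1d90t h\u03b1pp\u00e9ns now\u00bf", file=out)
out.flush()
captured = buffer.getvalue()

# In a real script the same wrapper could be applied to the terminal stream
# (after "import sys"):
# sys.stdout = utf8_writer(sys.stdout.buffer)
```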