HTTP error 403 in Python 3 Web Scraping

I was trying to scrape a website for practice, but I keep getting HTTP Error 403 (does it think I'm a bot)?

Here is my code:

#import requests 
import urllib.request 
from bs4 import BeautifulSoup 
#from urllib import urlopen 
import re 

webpage = urllib.request.urlopen('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1').read 
findrows = re.compile('<tr class="- banding(?:On|Off)>(.*?)</tr>') 
findlink = re.compile('<a href =">(.*)</a>') 

row_array = re.findall(findrows, webpage) 
links = re.finall(findlink, webpate) 

print(len(row_array)) 

iterator = []

The error I get is:

File "C:\Python33\lib\urllib\request.py", line 160, in urlopen 
    return opener.open(url, data, timeout) 
    File "C:\Python33\lib\urllib\request.py", line 479, in open 
    response = meth(req, response) 
    File "C:\Python33\lib\urllib\request.py", line 591, in http_response 
    'http', request, response, code, msg, hdrs) 
    File "C:\Python33\lib\urllib\request.py", line 517, in error 
    return self._call_chain(*args) 
    File "C:\Python33\lib\urllib\request.py", line 451, in _call_chain 
    result = func(*args) 
    File "C:\Python33\lib\urllib\request.py", line 599, in http_error_default 
    raise HTTPError(req.full_url, code, msg, hdrs, fp) 
urllib.error.HTTPError: HTTP Error 403: Forbidden

Source

2013-05-18 Josh

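This is probably because of mod_security or some similar server security feature that blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, which is easily detected). Try setting a known browser user agent: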

from urllib.request import Request, urlopen 

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/5.0'}) 
webpage = urlopen(req).read()

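This works for me.

By the way, in your code you are missing the () after .read in the urlopen line, but I think that's a typo.

TIP: since this is an exercise, choose a different, non-restrictive site. Maybe they are blocking urllib for some reason...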
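To make the user-agent point concrete, here is a minimal sketch that first prints the User-Agent urllib sends by default and then builds a request that overrides it. The helper names build_request and fetch, and the 10-second timeout, are illustrative assumptions, not part of the original answer.

```python
import urllib.request
from urllib.request import Request, urlopen

# The header urllib attaches when no User-Agent is set explicitly.
default_headers = dict(urllib.request.build_opener().addheaders)
default_user_agent = default_headers["User-agent"]
print(default_user_agent)  # something like Python-urllib/3.x


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent="Mozilla/5.0", timeout=10):
    """Download url and return the raw response body as bytes."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        return response.read()


# Example (performs a real network request, so it is left commented out):
# webpage = fetch("http://www.cmegroup.com/trading/products/")
```

Using a with block closes the response once the body has been read, and the timeout keeps a stalled server from hanging the script.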

Source

2013-05-18 17:52:11

Still not working... – Martian2049

I had the exact problem described above, and this definitely worked for me. – Samuurai

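Since the page works in a browser but not when called from within a Python program, it seems that the web app serving that URL recognizes that the content is not being requested by a browser.

Demonstration: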

curl --dump-header r.txt 'http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1'

... 
<HTML><HEAD> 
<TITLE>Access Denied</TITLE> 
</HEAD><BODY> 
<H1>Access Denied</H1> 
You don't have permission to access ... 
</HTML>

and the content of r.txt has the status line:

HTTP/1.1 403 Forbidden
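The same status can be read from Python by catching urllib.error.HTTPError instead of letting the traceback propagate. This is only a sketch: the helper names describe_http_error and try_fetch are hypothetical, and the timeout value is an assumption.

```python
import urllib.error
from urllib.request import urlopen


def describe_http_error(err):
    """Return a one-line summary such as '403 Forbidden' for an HTTPError."""
    return "{} {}".format(err.code, err.reason)


def try_fetch(url, timeout=10):
    """Return (body, None) on success or (None, summary) on an HTTP error."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read(), None
    except urllib.error.HTTPError as err:
        # err.code is the numeric status, err.reason the status text.
        return None, describe_http_error(err)


# Example (real network request, left commented out):
# body, problem = try_fetch("http://www.cmegroup.com/trading/products/")
```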

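Try sending a 'User-Agent' header that fakes a web client.

NOTE: The page contains an Ajax call that creates the table you probably want to parse. You'll need to check the page's JavaScript logic, or simply use a browser debugger (like the Firebug / Net tab) to see which URL you need to call to get the table's content.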
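A related detail: everything after the # in the URL is a fragment, which HTTP clients do not send to the server. The short sketch below splits the question's URL with urllib.parse to show what actually reaches the server. It suggests, though it does not prove, that the sort and paging options are applied by the page's JavaScript, which fits the Ajax note above.

```python
from urllib.parse import parse_qs, urldefrag

url = ("http://www.cmegroup.com/trading/products/"
       "#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1")

# urldefrag separates the part sent to the server from the fragment.
server_url, fragment = urldefrag(url)

# The fragment happens to look like a query string, so parse_qs can read it.
client_side_options = parse_qs(fragment)

print(server_url)
print(client_side_options)
```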

Source

2013-05-18 17:55:26

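It's definitely blocking you because of urllib's user agent. The same thing happened to me with OfferUp. You can create a new class called AppURLopener which overrides the user agent with Mozilla.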

import urllib.request 

class AppURLopener(urllib.request.FancyURLopener): 
    version = "Mozilla/5.0" 

opener = AppURLopener() 
response = opener.open('http://httpbin.org/user-agent')
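FancyURLopener has been deprecated since Python 3.3, so a similar effect can be sketched with urllib.request.build_opener instead. The helper name make_opener and its optional context parameter are assumptions for illustration. The context argument shows one way an SSL configuration could be attached to the opener.

```python
import ssl
import urllib.request


def make_opener(user_agent="Mozilla/5.0", context=None):
    """Build an opener that sends user_agent on every request."""
    handlers = []
    if context is not None:
        # Route HTTPS requests through a handler that uses the given SSL context.
        handlers.append(urllib.request.HTTPSHandler(context=context))
    opener = urllib.request.build_opener(*handlers)
    # Replace the default Python-urllib header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


opener = make_opener(context=ssl.create_default_context())
# response = opener.open("http://httpbin.org/user-agent")  # real network call
```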

Source

2015-08-01 06:00:29 zeta

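The top answer didn't work for me, while yours did. Thank you very much! – tarunuday

This works just fine, but I need to attach the SSL configuration to this. How do I do that? Before, I just added it as a second parameter (urlopen(request, context=ctx)) – Hauke

It looks like it did open, but it says 'ValueError: read of closed file' – Martian2049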

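"This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected)" - as already mentioned by Stefano Sanfilippo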

from urllib.request import Request, urlopen 
url="https://stackoverflow.com/search?q=html+error+403" 
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'}) 

web_byte = urlopen(req).read() 

webpage = web_byte.decode('utf-8')

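web_byte is a bytes object returned by the server, and the content of a web page is mostly encoded as UTF-8. Therefore you need to decode web_byte using the decode method.

This solved the whole problem I had while trying to scrape a website using PyCharm.

P.S. I am using Python 3.4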
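Since UTF-8 is only the most common case, a slightly more defensive sketch reads the charset from the Content-Type header when the server provides one and falls back to UTF-8 otherwise. The helper name decode_body is hypothetical, and the sample headers below are made up for the demonstration. With a real response, resp.headers could be passed in directly.

```python
from email.message import Message


def decode_body(raw, headers, fallback="utf-8"):
    """Decode raw bytes using the charset named in the response headers."""
    charset = headers.get_content_charset() or fallback
    # errors="replace" avoids an exception on the occasional bad byte.
    return raw.decode(charset, errors="replace")


# Made-up headers standing in for a real response's headers.
sample_headers = Message()
sample_headers["Content-Type"] = "text/html; charset=ISO-8859-1"
sample_text = decode_body(b"caf\xe9", sample_headers)
print(sample_text)

# With urllib (real network request, left commented out):
# with urlopen(req) as resp:
#     webpage = decode_body(resp.read(), resp.headers)
```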

Source

2017-12-25 07:57:59 royatirek
