I posted a similar question before. I tried to scrape a web page with Python as shown below, but it did not work:
import requests
url = 'https://www.zameen.com/'
res = requests.get(url)
data = res.text
print(data)
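The response says that either I am a bot or I do not have JavaScript enabled to scrape the web page. I checked, and JavaScript is enabled. So I tried a fake user agent with the code below: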
from fake_useragent import UserAgent
ua = UserAgent()  # provides browser-like User-Agent strings
headers = {}
headers['User-Agent'] = str(ua.chrome)
web_page = requests.get(url, headers=headers)
print(web_page.content)
This time the response was different:
b'<!DOCTYPE html>\n\n\t\n\n\t\n\t\n\t\n\n\t\n\t\n\n\t\n\t\n\t\n\n<head>\n<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">\n<meta http-equiv="cache-control" content="max-age=0" />\n<meta http-equiv="cache-control" content="no-cache" />\n<meta http-equiv="expires" content="0" />\n<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />\n<meta http-equiv="pragma" content="no-cache" />\n<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?Ref=/&distil_RID=053235A2-0030-11E7-8429-B03805AB611E&distil_TID=20170303163950" />\n<script type="text/javascript">\n\t(function(window){\n\t\ttry {\n\t\t\tif (typeof sessionStorage !== \'undefined\'){\n\t\t\t\tsessionStorage.setItem(\'distil_referrer\', document.referrer);\n\t\t\t}\n\t\t} catch (e){}\n\t})(window);\n</script>\n<script type="text/javascript" src="/ga368490.js" defer></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#caexxxzxycbzutyvy{display:none!important}</style></head>\n<body>\n<div id="distil_ident_block"> </div>\n</body>\n</html>\n'
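The response above carries two telling meta tags: a robots directive and a refresh that redirects to a captcha page. As a rough sketch, such a page could be flagged with the standard library's HTMLParser. The class name `MetaTagCollector`, the function `looks_like_bot_challenge`, and the "captcha in the refresh target" heuristic are assumptions made for illustration, not part of any library, and the sample string is a shortened version of the response.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta> tags into a dict keyed by their name / http-equiv."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("name") or attrs.get("http-equiv")
        if key and attrs.get("content") is not None:
            # Later duplicates overwrite earlier ones; fine for a quick check.
            self.meta[key.lower()] = attrs["content"]


def looks_like_bot_challenge(html):
    """Heuristic (assumption): a meta refresh that points at a captcha page."""
    collector = MetaTagCollector()
    collector.feed(html)
    refresh = collector.meta.get("refresh", "")
    return "captcha" in refresh.lower(), collector.meta


# Shortened sample based on the response shown above.
sample = (
    '<head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">'
    '<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?Ref=/" />'
    "</head>"
)
blocked, meta = looks_like_bot_challenge(sample)
print(blocked, meta["robots"])
```

With a real request, `web_page.text` could be passed in place of `sample`. A check like this only reports that the challenge page was served; it does not get around it.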
It detected me as a bot again. So I checked whether I am allowed to fetch data from the site at all, using robotparser from urllib:
from urllib import robotparser
req = robotparser.RobotFileParser()
req.set_url(url)
req.read()
print(req.can_fetch('*','https://www.zameen.com/'))
robotparser returned:
TRUE # Means I can fetch the data from the website.
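One thing that may be worth double-checking: according to the Python documentation, `RobotFileParser.set_url` expects the URL of the robots.txt file itself, not the site's home page. If it is pointed at a page that contains no robots rules, the parser may find nothing to disallow and answer True. The sketch below builds the robots.txt URL and evaluates rules offline; the helper names and the example rules are hypothetical and are not zameen.com's actual robots.txt.

```python
from urllib import robotparser
from urllib.parse import urljoin


def robots_url_for(site_url):
    """Build the robots.txt URL for a site root."""
    return urljoin(site_url, "/robots.txt")


def can_fetch_from_rules(robots_txt, user_agent, page_url):
    """Evaluate robots.txt text that has already been downloaded."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


# Hypothetical rules, for illustration only.
example_rules = """\
User-agent: *
Disallow: /private/
"""

print(robots_url_for("https://www.zameen.com/"))
print(can_fetch_from_rules(example_rules, "*", "https://www.zameen.com/"))
print(can_fetch_from_rules(example_rules, "*", "https://www.zameen.com/private/x"))
```

For a live check, `set_url(robots_url_for(url))` followed by `read()` would replace the offline `parse` call. Note that robots.txt is only advisory; a site's bot detection works independently of it, so a True result here does not mean requests will get through.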
Is there any way to get the data from this web page? Thanks.
Please check this answer: http://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python – foobar
I'm not sure what is going on. I tried using mechanize with robots_handle set to false, but for some reason it gives a 405 error. – Shashank
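I should note that the site in question does not want to be accessed by bots, as indicated by the meta tag in the response: '<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">' –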