I need to set the referer URL before scraping a site. The site uses referer-based authentication, so it won't let me log in if the referer is invalid. How do I set the referer URL in Scrapy?
Can someone tell me how to do this in Scrapy?
If you want to change the referer in your spider's requests, you can change DEFAULT_REQUEST_HEADERS in the settings.py file.
Example:
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://www.google.com',
}
Override BaseSpider.start_requests and create your custom Requests there, passing your referer header to them.
Just set the Referer URL in the Request headers.
class scrapy.http.Request(url[, method='GET', body, headers, ...
headers (dict) – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
Example:
return Request(url=your_url, headers={'Referer': 'http://your_referer_url'})
You should do exactly as @warwaruk indicated; below is my example elaboration with a crawl spider:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://example.com/foo',
        'http://example.com/bar',
        'http://example.com/baz',
    ]
    rules = [(...)]

    def start_requests(self):
        requests = []
        for item in self.start_urls:
            requests.append(Request(url=item, headers={'Referer': 'http://www.example.com/'}))
        return requests

    def parse_me(self, response):
        (...)
This should generate the following logs in your terminal:
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/foo> (referer: http://www.example.com/)
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/bar> (referer: http://www.example.com/)
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/baz> (referer: http://www.example.com/)
(...)
This will work the same way with BaseSpider. In the end, start_requests is a BaseSpider method, which CrawlSpider inherits from.
The documentation explains more Request options besides headers, such as cookies, the callback function, request priority, etc.