Asked 2012-10-25 · 7 votes · 59 views

I need to set the Referer URL before crawling a site. The site validates logins based on the Referer, so if the referer is invalid it will not let me log in. How do I set the Referer URL in Scrapy?

Can someone show how to do this in Scrapy?

Answers

Answer 1 (11 votes)

If you want to change the referer in your spider's requests, you can change DEFAULT_REQUEST_HEADERS in the settings.py file.

Example:

DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://www.google.com'
}
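
Note that Scrapy's built-in RefererMiddleware (enabled by default) fills in the Referer of follow-up requests from the page they were found on, so this setting mainly affects the initial requests. A minimal settings.py sketch, listing Scrapy's shipped default headers explicitly so they are preserved regardless of whether your Scrapy version merges or replaces dict settings (the Accept values are the documented defaults; adjust them for your target site):

# settings.py
DEFAULT_REQUEST_HEADERS = {
    # Scrapy's shipped defaults, kept so the Referer is added alongside them
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # The referer the target site expects to see
    'Referer': 'http://www.google.com',
}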

Answer 2 (3 votes)

Just set the Referer URL in the request headers:

class scrapy.http.Request(url[, method='GET', body, headers, ...

headers (dict) – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).

Example:

return Request(url=your_url, headers={'Referer':'http://your_referer_url'})
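
For the login scenario in the question, the same headers argument works on the login request itself. A minimal sketch, assuming a hypothetical login form at /login; the URL and the username/password field names are placeholders, not part of the original answer:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class LoginSpider(BaseSpider):
    name = "loginspider"
    start_urls = ['http://example.com/login']

    def parse(self, response):
        # Submit the login form, sending the Referer the site validates.
        # 'username' and 'password' are hypothetical form field names.
        return FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'secret'},
            headers={'Referer': 'http://example.com/login'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Continue crawling as an authenticated user from here on.
        pass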

Answer 3 (6 votes)

You should do exactly what @warwaruk said. Here is my elaboration, using a crawl spider as an example:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://example.com/foo',
        'http://example.com/bar',
        'http://example.com/baz',
    ]
    rules = [(...)]

    def start_requests(self):
        # Build the initial requests by hand so each one carries the Referer header.
        requests = []
        for item in self.start_urls:
            requests.append(Request(url=item, headers={'Referer': 'http://www.example.com/'}))
        return requests

    def parse_me(self, response):
        (...)

This should produce log lines like the following in your terminal:

(...) 
[myspider] DEBUG: Crawled (200) <GET http://example.com/foo> (referer: http://www.example.com/) 
(...) 
[myspider] DEBUG: Crawled (200) <GET http://example.com/bar> (referer: http://www.example.com/) 
(...) 
[myspider] DEBUG: Crawled (200) <GET http://example.com/baz> (referer: http://www.example.com/) 
(...) 

This works just the same with BaseSpider. In the end, start_requests is a BaseSpider method, which CrawlSpider inherits from.

The Documentation explains more options of Request besides headers, such as cookies, a callback function, the priority of the request, and so on.
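
For instance, a single request that combines several of those options might look like this (a sketch; parse_private and the cookie value are illustrative, not from the documentation):

from scrapy.http import Request

def parse(self, response):
    # Inside a spider: set referer, cookies, callback and priority on one request.
    yield Request(
        url='http://example.com/private',
        headers={'Referer': 'http://www.example.com/'},  # referer the site checks
        cookies={'sessionid': 'abc123'},                  # cookies to send along
        callback=self.parse_private,                      # method that handles the response
        priority=10,                                      # higher values are scheduled earlier
    )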