2012-05-19

I want to scrape http://www.3andena.com/. The site starts in Arabic and stores the language setting in a cookie. If you try to access the English version directly via a URL, it causes a problem and returns a server error. How can I override/use cookies in Scrapy?

So I want to set the cookie "store_language" to "en" and then start scraping the site with that cookie value.

I'm using a CrawlSpider with some rules.

Here's the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log
from bkam.items import Product
from scrapy.http import Request
import re

class AndenaSpider(CrawlSpider):
    name = "andena"
    domain_name = "3andena.com"
    start_urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]

    product_urls = []

    rules = (
        # The following rule is for pagination
        Rule(SgmlLinkExtractor(allow=(r'\?page=\d+$',)), follow=True),
        # The following rule is for product details
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "products-dialog")]//table//tr[contains(@class, "product-name-row")]/td'), unique=True), callback='parse_product', follow=True),
    )

    def start_requests(self):
        yield Request('http://3andena.com/home.php?sl=en', cookies={'store_language': 'en'})

        for url in self.start_urls:
            yield Request(url, callback=self.parse_category)

    def parse_category(self, response):
        hxs = HtmlXPathSelector(response)

        self.product_urls.extend(hxs.select('//td[contains(@class, "product-cell")]/a/@href').extract())

        for product in self.product_urls:
            yield Request(product, callback=self.parse_product)

    def parse_product(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = Product()

        '''
        some parsing
        '''

        items.append(item)
        return items

SPIDER = AndenaSpider()

Here's the log:

2012-05-30 19:27:13+0000 [andena] DEBUG: Redirecting (301) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://3andena.com/home.php?sl=en> 
2012-05-30 19:27:14+0000 [andena] DEBUG: Redirecting (302) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> 
2012-05-30 19:27:14+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/Kettles/?objects_per_page=10> (referer: None) 
2012-05-30 19:27:15+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/B-and-D-Concealed-coil-pan-kettle-JC-62.html> (referer: http://www.3andena.com/Kettles/?objects_per_page=10) 

Answers

Scrapy documentation for Requests and Responses.

You just need something like this:

request_with_cookies = Request(url="http://www.3andena.com", cookies={'store_language':'en'}) 
I already tried that before posting my question, but it doesn't work –

Could you post your source code? – VenkatH

I just added it –


Modify your code as follows:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, cookies={'store_language': 'en'}, callback=self.parse_category)

The scrapy.Request object accepts an optional cookies keyword argument; see documentation here.
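As a rough illustration (this models the idea, not Scrapy's internals), the dict passed via the cookies argument is what ultimately gets serialized into the request's Cookie header. A minimal standard-library sketch, with `cookie_header` being a hypothetical helper:

```python
from http.cookies import SimpleCookie

def cookie_header(cookies):
    """Serialize a cookies dict into a Cookie header value (illustrative only)."""
    jar = SimpleCookie()
    for name, value in cookies.items():
        jar[name] = value
    # Join each morsel as "name=value", the format sent in a Cookie header
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())

print(cookie_header({'store_language': 'en'}))  # store_language=en
```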


Here's what I do with Scrapy 0.24.6:

from scrapy.contrib.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):

    ...

    def make_requests_from_url(self, url):
        request = super(MySpider, self).make_requests_from_url(url)
        request.cookies['foo'] = 'bar'
        return request

Scrapy calls make_requests_from_url with the URLs in the spider's start_urls attribute. What the code above does is let the default implementation create the request and then add a foo cookie with the value bar (or change the cookie's value to bar if, against all odds, a foo cookie already exists in the request produced by the default implementation).
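The override pattern itself — call the parent's factory method, tweak the returned object, return it — can be sketched without Scrapy installed. The Request and BaseSpider classes below are stand-ins for illustration, not Scrapy's actual classes:

```python
class Request:
    """Stand-in for scrapy.http.Request."""
    def __init__(self, url, cookies=None):
        self.url = url
        self.cookies = cookies if cookies is not None else {}

class BaseSpider:
    """Stand-in for the spider base class with a default request factory."""
    def make_requests_from_url(self, url):
        return Request(url)  # default: request with no cookies

class MySpider(BaseSpider):
    def make_requests_from_url(self, url):
        # Let the default implementation build the request, then add a cookie.
        request = super(MySpider, self).make_requests_from_url(url)
        request.cookies['store_language'] = 'en'
        return request

req = MySpider().make_requests_from_url('http://www.3andena.com/')
print(req.cookies)  # {'store_language': 'en'}
```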

In case you are wondering what happens with requests that are not created from start_urls, let me add that Scrapy's cookie middleware will remember the cookies set with the code above and set them on all future requests sharing the same domain as the request to which you explicitly added the cookies.
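That per-domain remember-and-replay behavior can be modeled with a simple dict. This mimics the idea only, not the implementation of Scrapy's CookiesMiddleware:

```python
from urllib.parse import urlparse

class DomainCookieJar:
    """Toy model of per-domain cookie persistence (not Scrapy's middleware)."""
    def __init__(self):
        self._jar = {}  # maps domain -> {cookie name: value}

    def store(self, url, cookies):
        # Remember cookies under the request's domain
        self._jar.setdefault(urlparse(url).netloc, {}).update(cookies)

    def cookies_for(self, url):
        # Replay remembered cookies on any later request to the same domain
        return dict(self._jar.get(urlparse(url).netloc, {}))

jar = DomainCookieJar()
jar.store('http://www.3andena.com/home.php?sl=en', {'store_language': 'en'})

# A later request to the same domain gets the remembered cookie:
print(jar.cookies_for('http://www.3andena.com/Kettles/'))  # {'store_language': 'en'}
# A different domain gets nothing:
print(jar.cookies_for('http://example.com/'))  # {}
```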