我想要报废http://www.3andena.com/,本网站首先以阿拉伯语开头,并将语言设置存储在cookie中。如果您尝试直接通过URL()访问语言版本,则会造成问题并返回服务器错误。如何在scrapy中覆盖/使用cookies
因此,我想将cookie值“store_language”设置为“en”,然后使用此cookie值开始废弃网站。
我使用CrawlSpider与一些规则。
这里的代码
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log
from bkam.items import Product
from scrapy.http import Request
import re
class AndenaSpider(CrawlSpider):
name = "andena"
domain_name = "3andena.com"
start_urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]
product_urls = []
rules = (
# The following rule is for pagination
Rule(SgmlLinkExtractor(allow=(r'\?page=\d+$'),), follow=True),
# The following rule is for produt details
Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "products-dialog")]//table//tr[contains(@class, "product-name-row")]/td'), unique=True), callback='parse_product', follow=True),
)
def start_requests(self):
yield Request('http://3andena.com/home.php?sl=en', cookies={'store_language':'en'})
for url in self.start_urls:
yield Request(url, callback=self.parse_category)
def parse_category(self, response):
hxs = HtmlXPathSelector(response)
self.product_urls.extend(hxs.select('//td[contains(@class, "product-cell")]/a/@href').extract())
for product in self.product_urls:
yield Request(product, callback=self.parse_product)
def parse_product(self, response):
hxs = HtmlXPathSelector(response)
items = []
item = Product()
'''
some parsing
'''
items.append(item)
return items
SPIDER = AndenaSpider()
这里的日志:
2012-05-30 19:27:13+0000 [andena] DEBUG: Redirecting (301) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://3andena.com/home.php?sl=en>
2012-05-30 19:27:14+0000 [andena] DEBUG: Redirecting (302) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098>
2012-05-30 19:27:14+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/Kettles/?objects_per_page=10> (referer: None)
2012-05-30 19:27:15+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/B-and-D-Concealed-coil-pan-kettle-JC-62.html> (referer: http://www.3andena.com/Kettles/?objects_per_page=10)
我在发布我的问题之前已经尝试过了,但它不起作用 –
您能否提交您的源代码? – VenkatH
我刚添加它 –