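Below is the code. Basically, I am scraping movie information from IMDB.com. But somehow the Request never scrapes the URL stored in 'addr'. The 'print' I put in parse_item2 never shows up at all. Scrapy is not following the Request URL.
This is driving me crazy. I have spent hours on it. Can anyone with experience help? Thanks a lot.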
# code for the spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request, Response
from beta.items import BetaItem
import urllib2


class AlphaSpider(CrawlSpider):
    name = 'alpha'
    allowed_domains = ['amazon.com', 'imdb.com']
    start_urls = ['http://www.imdb.com/search/title?at=0&sort=boxoffice_gross_us&title_type=feature&year=2005,2005']
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//td/a',), allow=('/title/')), callback='parse_item1'),
    )

    def parse_item1(self, response):
        sel = Selector(response)
        item = BetaItem()
        idb = sel.xpath('//link[@rel="canonical"]/@href').extract()
        idb = idb[0].split('/')[-2]
        item['idb'] = idb
        title = sel.xpath('//h1[@class="header"]/span[@class="itemprop"]/text()').extract()
        item['title'] = title
        addr = 'http://www.imdb.com/title/' + idb + '/business'
        request = Request(addr, callback=self.parse_item2)
        request.meta['item'] = item
        return request

    def parse_item2(self, response):
        print 'I am here'
        item = response.meta['item']
        sel = Selector(response)
        # BLA BLA BLA
        return item
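
To rule out a malformed 'addr', the URL-building step from parse_item1 can be checked on its own. This is only a sketch: the helper name business_url and the sample canonical href are assumptions made up for illustration and are not part of the spider.

```python
def business_url(canonical_href):
    """Build the /business URL the same way parse_item1 does."""
    # With a trailing slash the last split element is empty,
    # so the title ID sits at index -2.
    idb = canonical_href.split('/')[-2]
    return 'http://www.imdb.com/title/' + idb + '/business'


# Assumed sample value for the canonical link; the real page may differ.
example = 'http://www.imdb.com/title/tt0121766/'
print(business_url(example))
# → http://www.imdb.com/title/tt0121766/business
```

Note that this relies on the canonical href ending in a slash. Without it, index -2 would pick up 'title' instead of the ID.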
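Does 'parse_item1' work? Does the page that 'addr' points to exist? – Blender
Hi Blender, yes, 'idb' and 'title' are scraped correctly. – maxwell
Since Scrapy's crawl queue is LIFO, it may take a while before it reaches the extracted links. Can you test it on a specific page? – Blender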
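
As a rough illustration of what LIFO ordering means for pending requests, here is a toy model. It uses a plain Python list, not Scrapy's actual scheduler, and the function name drain and the URL labels are made up for the example.

```python
def drain(pending, lifo=True):
    """Return the order in which queued entries would be handled."""
    pending = list(pending)  # work on a copy
    order = []
    while pending:
        # LIFO takes the newest entry first; FIFO takes the oldest.
        order.append(pending.pop() if lifo else pending.pop(0))
    return order


queued = ['title/A', 'title/B', 'title/C']  # hypothetical labels, queued in this order
print(drain(queued, lifo=True))   # → ['title/C', 'title/B', 'title/A']
print(drain(queued, lifo=False))  # → ['title/A', 'title/B', 'title/C']
```

Under LIFO the entry that was queued first is the last one to be handled, so an early link can sit in the queue for a while.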