Scrapy scraper doesn't scrape past page 1

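I'm following the Scrapy tutorial here. I believe I have the same code as the tutorial, but my scraper only scrapes the first page, then prints the message below about my first Request to another page and finishes. Could I have put the second yield statement in the wrong place?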

DEBUG: Filtered offsite request to 'newyork.craigslist.org': <GET https://newyork.craigslist.org/search/egr?s=120>

2017-05-20 18:21:31 [scrapy.core.engine] INFO: Closing spider (finished)

Here is my code:

import scrapy
from scrapy import Request


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["https://newyork.craigslist.org/search/egr"]
    start_urls = ['https://newyork.craigslist.org/search/egr/']

    def parse(self, response):
        jobs = response.xpath('//p[@class="result-info"]')

        for job in jobs:
            title = job.xpath('a/text()').extract_first()
            address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
            relative_url = job.xpath('a/@href').extract_first("")
            absolute_url = response.urljoin(relative_url)

            yield {'URL': absolute_url, 'Title': title, 'Address': address}

        # scrape all pages
        next_page_relative_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        next_page_absolute_url = response.urljoin(next_page_relative_url)

        yield Request(next_page_absolute_url, callback=self.parse)

Source

2017-05-20 Totem

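OK, I figured it out. I had to change this line:

allowed_domains = ["https://newyork.craigslist.org/search/egr"]

to this:

allowed_domains = ["newyork.craigslist.org"]

and now it works. allowed_domains expects bare domain names rather than full URLs, so with the URL in place the request for the next page was filtered as offsite.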
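To see why the URL form fails, here is a rough sketch of a hostname-based offsite check. It is a simplified stand-in written for illustration, not Scrapy's actual implementation; `host_of` and `is_onsite` are hypothetical helper names. It assumes only that the check compares the request's hostname against the entries in `allowed_domains`, and it uses nothing beyond the standard library.

```python
import re
from urllib.parse import urlparse


def host_of(url):
    """Return the hostname part of a URL, e.g. 'newyork.craigslist.org'."""
    return urlparse(url).hostname or ""


def is_onsite(url, allowed_domains):
    """Simplified offsite check (hypothetical helper, not Scrapy's code).

    A request counts as on-site when its hostname equals an allowed
    domain or is a subdomain of one.
    """
    if not allowed_domains:
        return True  # no restriction configured
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.match(pattern, host_of(url)) is not None


next_page = "https://newyork.craigslist.org/search/egr?s=120"

# The original setting is a full URL, so it can never match a hostname.
print(is_onsite(next_page, ["https://newyork.craigslist.org/search/egr"]))  # False

# The corrected setting is a bare domain name.
print(is_onsite(next_page, ["newyork.craigslist.org"]))  # True

# Deriving the domain from the start URL avoids typing it twice.
start_urls = ["https://newyork.craigslist.org/search/egr/"]
allowed_domains = sorted({host_of(u) for u in start_urls})
print(allowed_domains)  # ['newyork.craigslist.org']
```

In a real spider you could probably derive `allowed_domains` from `start_urls` the same way, though writing the domain out by hand is just as valid.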

Source

2017-05-20 17:41:46 Totem
