Scrapy stops at seemingly random points

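I am scraping ads for housing in London from this site.

Housing ads can be searched at three levels of geographic scope: all of London, a specific area (e.g. Central London), or a specific sub-area (e.g. Aldgate).

The site only lets you view 50 pages of 30 ads each per area, regardless of the size of the area. That is, if I select X, I can view 1500 ads in X, whether X is Central London or Aldgate.

At the time of writing this question, there are over 37,000 ads on the site.

Since I want to scrape as many ads as possible, this limit means I need to scrape ads at the sub-area level.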
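
To make the effect of the cap concrete, here is a quick back-of-the-envelope calculation using the figures quoted above. The variable names are my own, and the total is only the approximate figure from the question, so treat the results as rough.

```python
import math

PAGES_PER_AREA = 50   # pages the site exposes per area (from the question)
ADS_PER_PAGE = 30     # ads shown per page (from the question)
TOTAL_ADS = 37_000    # approximate total quoted in the question

# Maximum number of ads visible for any single search scope.
ads_visible_per_area = PAGES_PER_AREA * ADS_PER_PAGE

# Share of all ads reachable when searching "all of London" only.
share_visible_london_wide = ads_visible_per_area / TOTAL_ADS

# Lower bound on the number of non-overlapping scopes needed to see every ad.
min_scopes_needed = math.ceil(TOTAL_ADS / ads_visible_per_area)

print(ads_visible_per_area)
print(f"{share_visible_london_wide:.1%}")
print(min_scopes_needed)
```

Searching London as a whole would therefore expose only a small fraction of the ads, which is why the spider descends to sub-areas.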

To do this, I wrote the following spider:

import scrapy

# XPath to the area / sub-area links
area_links = ('''//*[@id="fullListings"]/div[1]/div/div/nav/aside/'''
              '''section[1]/div/ul/li/a/@href''')


class ApartmentSpider(scrapy.Spider):
    name = 'apartments2'
    start_urls = [
        "https://www.gumtree.com/property-to-rent/london"
    ]

    # obtain links to London areas
    def parse(self, response):
        for url in response.xpath(area_links).extract():
            yield scrapy.Request(response.urljoin(url),
                                 callback=self.parse_sub_area)

    # obtain links to London sub-areas
    def parse_sub_area(self, response):
        for url in response.xpath(area_links).extract():
            yield scrapy.Request(response.urljoin(url),
                                 callback=self.parse_ad_overview)

    # obtain ads per sub-area page
    def parse_ad_overview(self, response):
        for ads in response.xpath(
                '//*[@id="srp-results"]/div[1]/div/div[2]'
                ).css('ul').css('li').css('a').xpath('@href').extract():
            yield scrapy.Request(response.urljoin(ads),
                                 callback=self.parse_ad)

            next_page = response.css(
                '#srp-results > div.grid-row > div > ul > li.pagination-next > a'
                ).xpath('@href').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

    # obtain info per ad
    def parse_ad(self, response):
        # the code that extracts the data for each ad goes here
        pass

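This works fine.

That is, it obtains links to the areas from the initial page, to the sub-areas from their respective area pages, and to the housing ads on each sub-area page, iterating over all pages of each sub-area, and finally scrapes the data from each ad.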

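Problem

The code stops scraping at seemingly random points, and I don't know why.

I suspect it has hit some limit, since it is told to scrape a large number of links and items, but I'm not sure whether that is right.

When it stops, it reports: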

{'downloader/request_bytes': 1295950, 
'downloader/request_count': 972, 
'downloader/request_method_count/GET': 972, 
'downloader/response_bytes': 61697740, 
'downloader/response_count': 972, 
'downloader/response_status_count/200': 972, 
'dupefilter/filtered': 1806, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 9, 4, 17, 13, 35, 53156), 
'item_scraped_count': 865, 
'log_count/DEBUG': 1839, 
'log_count/ERROR': 5, 
'log_count/INFO': 11, 
'request_depth_max': 2, 
'response_received_count': 972, 
'scheduler/dequeued': 971, 
'scheduler/dequeued/memory': 971, 
'scheduler/enqueued': 971, 
'scheduler/enqueued/memory': 971, 
'spider_exceptions/TypeError': 5, 
'start_time': datetime.datetime(2017, 9, 4, 17, 9, 56, 132388)}

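I'm not sure whether one can tell from this output if I have hit a limit. If anyone can, please let me know whether I did, and how to prevent the code from stopping.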
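
As a rough aid for reading the dump above, the sketch below pulls out the fields that bear on the "did I hit a limit?" question. The values are copied from the stats shown; the `summarize` helper and its key names are my own invention, not part of Scrapy. It relies on two points about Scrapy that I believe hold: a `finish_reason` of `'finished'` means the spider closed because no requests were left to process, while limit-based shutdowns via the CloseSpider extension report reasons such as `'closespider_itemcount'`.

```python
from datetime import datetime

# Values copied from the stats dump in the question.
stats = {
    "downloader/response_count": 972,
    "downloader/response_status_count/200": 972,
    "dupefilter/filtered": 1806,
    "finish_reason": "finished",
    "finish_time": datetime(2017, 9, 4, 17, 13, 35, 53156),
    "item_scraped_count": 865,
    "spider_exceptions/TypeError": 5,
    "start_time": datetime(2017, 9, 4, 17, 9, 56, 132388),
}


def summarize(stats):
    """Hypothetical helper: condense the stats relevant to an early stop."""
    elapsed = (stats["finish_time"] - stats["start_time"]).total_seconds()
    non_200 = (stats["downloader/response_count"]
               - stats["downloader/response_status_count/200"])
    return {
        "elapsed_seconds": elapsed,
        "non_200_responses": non_200,  # 0 means no error or block responses
        "ran_out_of_requests": stats["finish_reason"] == "finished",
        "duplicates_dropped": stats["dupefilter/filtered"],
        "callback_errors": stats["spider_exceptions/TypeError"],
    }


summary = summarize(stats)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Read this way, the run ended after a few minutes with no non-200 responses and simply ran out of requests to schedule, which points toward the spider's own link-following logic rather than an external limit. The five `TypeError` exceptions are worth checking in the log as well.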

Source

2017-09-04 LucSpan

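You are only getting status 200 responses. If something had really gone wrong, or you had been blocked, you would be getting Service Unavailable (503) or similar responses. Do you think the code stops prematurely because the number of items varies between runs? – Andras

Hi Andras, I'm afraid I don't understand what you mean by 'the number of items varies between runs'. – LucSpan

Why do you think your code stops scraping prematurely? – Andras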

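Although a full, or at least partial, log of the crawl would help with troubleshooting, I'm going to take the risk and post this answer, because I see one thing that I assume is the problem: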

def parse_ad_overview(self, response):
    for ads in response.xpath(
            '//*[@id="srp-results"]/div[1]/div/div[2]'
            ).css('ul').css('li').css('a').xpath('@href').extract():
        yield scrapy.Request(response.urljoin(ads),
                             callback=self.parse_ad)

        next_page = response.css(
            '#srp-results > div.grid-row > div > ul > li.pagination-next > a'
            ).xpath('@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

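I'm pretty sure I know what's going on, since I have run into similar issues in the past. Looking at your script, when you request the next page from the last function, the callback sends the response back to parse, whereas I assume the ads and the link to the following page are in that HTTP response and need to be handled by parse_ad_overview. So simply change the callback to parse_ad_overview.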

Source

2017-09-05 00:00:26 scriptso
