Scrapy: Wait for some URLs to be parsed, then do something

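I have a spider that needs to find product prices. The products are grouped together in batches (they come from a database), and it would be nice to have a batch status (RUNNING, DONE) along with start_time and finished_time attributes. So I have something like this: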

import scrapy
from datetime import datetime

class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            for prod in batch.get_products():
                yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod})
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # <-- NOT COOL: This is going to
                          # execute before the last product
                          # url is scraped, right?

    def parse(self, response):
        #...

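The problem here is that, because of Scrapy's asynchronous nature, the second status update on the batch object is going to run too soon, right? Is there a way to group these requests together somehow and update the batch object only when the last one has been parsed?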
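To make the timing concern concrete, here is a minimal sketch that does not use Scrapy at all. It assumes only that the engine pulls requests out of the generator and handles the responses later; fake_start_requests, fake_engine, events and the URLs are hypothetical names for illustration. It is a simplification: the real engine interleaves scheduling and parsing, but nothing makes it wait for the last response before resuming the generator.

```python
def fake_start_requests(events, urls):
    """Stand-in for start_requests: yields URLs, then marks the batch DONE."""
    events.append("batch RUNNING")
    for url in urls:
        events.append(f"scheduled {url}")
        yield url
    # This line runs as soon as the consumer asks for another item after
    # the last one, whether or not any response has been handled yet.
    events.append("batch DONE")


def fake_engine(events, requests):
    """Stand-in for the engine: schedule everything first, parse afterwards."""
    pending = list(requests)  # exhausts the generator
    for url in pending:
        events.append(f"parsed {url}")


events = []
fake_engine(events, fake_start_requests(events, ["a", "b"]))
print(events)
```

In this run the "batch DONE" event is recorded before either "parsed" event, which is the ordering the question is worried about.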

2017-02-14 Tony Lâmpada

I made some adaptations to @Umair's suggestion and came up with a solution that works great for my case:

import scrapy
from datetime import datetime

class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            products = batch.get_products()
            # the counter dictionary for this batch
            counter = {'curr': 0, 'total': len(products)}
            for prod in products:
                # trick = add the counter in the meta dict
                yield scrapy.Request(prod.get_scrape_url(),
                                     meta={'prod': prod,
                                           'batch': batch,
                                           'counter': counter})

    def parse(self, response):
        # process the response as desired
        batch = response.meta['batch']
        counter = response.meta['counter']
        # increment counter only after the work is done
        self.increment_counter(batch, counter)

    def increment_counter(self, batch, counter):
        counter['curr'] += 1
        if counter['curr'] == counter['total']:
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # GOOD!
                          # Well, almost...

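This works fine as long as all the requests yielded by start_requests have different URLs.

If there are any duplicates, Scrapy will filter them out and never call your parse method for them, so you end up with counter['curr'] < counter['total'] and the batch status is left RUNNING forever.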
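The shortfall is easy to reproduce without Scrapy. In this rough simulation a plain set stands in for the duplicate filter, and crawl and the URLs are hypothetical names used only for illustration:

```python
def crawl(urls):
    """Simulate one batch and return its counter after all 'responses'."""
    counter = {"curr": 0, "total": len(urls)}
    seen = set()  # stand-in for Scrapy's duplicate filter
    for url in urls:
        if url in seen:
            continue  # dropped request: parse is never called for it
        seen.add(url)
        counter["curr"] += 1  # what parse -> increment_counter would do
    return counter


unique = crawl(["/p/1", "/p/2", "/p/3"])
with_dupe = crawl(["/p/1", "/p/2", "/p/1"])
print(unique["curr"] == unique["total"])        # True: batch would be marked DONE
print(with_dupe["curr"] == with_dupe["total"])  # False: batch stays RUNNING
```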

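As it turns out, you can override Scrapy's behavior for duplicates.

First, we need to change settings.py to specify an alternative "duplicates filter" class: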

DUPEFILTER_CLASS = 'myspiders.shopping.MyDupeFilter'

Then we create the MyDupeFilter class, which lets the spider know when there is a duplicate:

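from scrapy.dupefilters import RFPDupeFilter

class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        # log() is called for each request the filter drops as a duplicate
        super(MyDupeFilter, self).log(request, spider)
        spider.look_a_dupe(request)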

Then we modify our spider so that it also increments the counter when a duplicate is found:

class PriceSpider(scrapy.Spider):
    name = 'prices'

    #...

    def look_a_dupe(self, request):
        batch = request.meta['batch']
        counter = request.meta['counter']
        self.increment_counter(batch, counter)

And we are good to go.

2017-02-21 20:13:30

For this kind of thing you can use the spider_closed signal: you can bind a function to it that runs when the spider is done crawling.

2017-02-14 13:40:57

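Interesting, I can see these signals being useful. In this case, though, "closed" is probably not the right one (the spider will process many batches, and ideally I want to know when each batch finishes).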

Here is the trick:

With each request, send batch_id, total_products_in_this_batch and processed_this_batch, and then check them anywhere, in any function:

for batch in Batches.objects.all():
    processed_this_batch = 0
    # Assumption: the batch model exposes an `id` attribute
    batch_id = batch.id
    products = batch.get_products()
    # Assumption: the products collection supports len()
    total_products_in_this_batch = len(products)

    for prod in products:
        processed_this_batch = processed_this_batch + 1
        yield scrapy.Request(prod.get_scrape_url(),
                             meta={'prod': prod,
                                   'batch_id': batch_id,
                                   'total_products_in_this_batch': total_products_in_this_batch,
                                   'processed_this_batch': processed_this_batch})

Then, anywhere in the code, for any particular batch, check if processed_this_batch == total_products_in_this_batch and, if so, save the batch.
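One way to keep that bookkeeping in a single place is a small tracker keyed by batch_id. This is only a sketch under assumptions: BatchTracker, register and mark_processed are hypothetical names, and it presumes mark_processed is called once per product after its response has actually been handled.

```python
from datetime import datetime


class BatchTracker:
    """Counts handled products per batch and reports when a batch completes."""

    def __init__(self):
        self._totals = {}     # batch_id -> number of products expected
        self._processed = {}  # batch_id -> number of products handled so far

    def register(self, batch_id, total_products_in_this_batch):
        self._totals[batch_id] = total_products_in_this_batch
        self._processed[batch_id] = 0

    def mark_processed(self, batch_id):
        """Return a finish time once the batch is complete, else None."""
        self._processed[batch_id] += 1
        if self._processed[batch_id] == self._totals[batch_id]:
            return datetime.now()
        return None


tracker = BatchTracker()
tracker.register(batch_id=1, total_products_in_this_batch=2)
first = tracker.mark_processed(1)   # None: one product is still outstanding
second = tracker.mark_processed(1)  # a datetime: the batch is complete
```

When mark_processed returns a timestamp, the caller would set the batch status to DONE and save it.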

2017-02-14 19:09:39 Umair

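This indeed looks like a good idea. Will test, thanks!

It didn't work exactly as you suggested (I had to increment the counter in the 'parse' method; when I incremented it before making the request, the batch ended up marked as finished before it was actually done). But your suggestion DID point in the right direction, so thanks a lot!

By the way, I ended up posting my full solution as an answer to this question.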
