Does calling crawler.engine.crawl() in Scrapy bypass the throttling mechanism?

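For some background: I am writing a spider that listens on a RabbitMQ topic for new URLs to crawl. When it pulls a URL off the queue, it adds it to the crawl queue by calling crawler.engine.crawl(request). I have noticed that if I put 200 URLs on the queue (all for the same domain) I sometimes get timeouts, but this does not happen if I add the same 200 URLs through the start_urls attribute.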

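So I am wondering whether the normal throttling mechanisms (concurrent requests per domain, download delay, etc.) apply when URLs are added via crawler.engine.crawl()?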

Here is a small code sample:

    @defer.inlineCallbacks
    def read(self, queue_object):
        # Pull a URL from the RabbitMQ topic
        ch, method, properties, body = yield queue_object.get()
        if body:
            req = Request(url=body)
            log.msg('Scheduling ' + body + ' for crawl')
            # Hand the request directly to the engine
            self.crawler.engine.crawl(req, spider=self)
        # Acknowledge the message so RabbitMQ does not redeliver it
        yield ch.basic_ack(delivery_tag=method.delivery_tag)


2014-10-31 Toby Hobson

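It does not bypass the DownloaderMiddlewares or the Downloader. Requests passed to crawler.engine.crawl() go straight to the Scheduler, skipping the SpiderMiddlewares. Per-domain concurrency and download delay are enforced in the Downloader, so those limits still apply.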
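
For reference, the limits discussed here are ordinary Scrapy settings. The fragment below is only a sketch of a project's settings.py; the values are illustrative assumptions and are not taken from the question's project.

```python
# settings.py (illustrative values, not from the original question)

# Maximum number of simultaneous requests to any single domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Seconds to wait between consecutive requests to the same site
DOWNLOAD_DELAY = 0.5

# Seconds the downloader waits for a response before timing out
DOWNLOAD_TIMEOUT = 180
```

If timeouts appear only when many same-domain URLs arrive at once, lowering the per-domain concurrency or raising the delay is one thing to try.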


IMO you should use a SpiderMiddleware and its process_start_requests hook to override your spider.start_requests.
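
A minimal sketch of that idea follows, assuming the Scrapy versions that support the process_start_requests(start_requests, spider) hook. The class name QueueStartRequestsMiddleware, the pending_urls argument (an in-memory stand-in for the RabbitMQ queue) and the make_request callable are all hypothetical names introduced for illustration. In a real project the middleware would be built through from_crawler and enabled in SPIDER_MIDDLEWARES, and make_request would typically be scrapy.Request.

```python
from collections import deque


class QueueStartRequestsMiddleware:
    """Sketch of a spider middleware that appends queued URLs to the start requests."""

    def __init__(self, pending_urls, make_request):
        # pending_urls: hypothetical in-memory stand-in for the RabbitMQ queue
        # make_request: callable that builds a request from a URL (e.g. scrapy.Request)
        self.pending_urls = deque(pending_urls)
        self.make_request = make_request

    def process_start_requests(self, start_requests, spider):
        # First pass through whatever the spider itself produces.
        for request in start_requests:
            yield request
        # Then drain the queued URLs, turning each into a request.
        while self.pending_urls:
            yield self.make_request(self.pending_urls.popleft())
```

Requests yielded this way enter the crawl through the same path as start_urls, rather than being handed to the engine directly.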


2014-10-31 15:01:43 nramirezuy

Thanks. I am not actually bypassing start_requests; I have a separate task.LoopingCall() that basically polls the Rabbit queue and calls crawler.engine.crawl() – 2014-10-31 15:16:16
