
Scrapy: run multiple spiders from a main spider?

I have two spiders that need a main spider to scrape URLs and data for them. My approach was to use CrawlerProcess inside the main spider and pass the data on to the two other spiders. Here is my approach:

import scrapy
from scrapy.crawler import CrawlerProcess


class LightnovelSpider(scrapy.Spider):

    name = "novelDetail"
    allowed_domains = ["readlightnovel.com"]

    def __init__(self, novels=None, *args, **kwargs):
        super(LightnovelSpider, self).__init__(*args, **kwargs)
        self.novels = novels or []

    def start_requests(self):
        for novel in self.novels:
            self.logger.info(novel)
            yield scrapy.Request(novel, callback=self.parseNovel)

    def parseNovel(self, response):
        # stuff here
        pass

class chapterSpider(scrapy.Spider):
    name = "chapters"
    # not done here

class initCrawler(scrapy.Spider):
    name = "main"
    fromMongo = {}
    toChapter = {}
    toNovel = []
    fromScraper = []

    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a[not(@href="#")]/@href').extract():
            initCrawler.fromScraper.append(novel)

        self.checkchanged()

    def checkchanged(self):
        # some scraped data processing here
        self.dispatchSpiders()

    def dispatchSpiders(self):
        # this tries to start a second reactor from inside a running crawl
        process = CrawlerProcess()
        process.crawl(LightnovelSpider, novels=initCrawler.toNovel)
        process.start()
        self.logger.info("Main Spider Finished")

I run "scrapy crawl main" and get a beautiful error. [screenshot of the traceback]

The main error I can see is a "twisted.internet.error.ReactorAlreadyRunning". I'm not sure what to do about it. Is there a better way to run multiple spiders from another spider, and/or how can I get rid of this error?

Answers


After some research I was able to solve the problem by retrieving the data from the main spider through the "@property" decorator, like this:

class initCrawler(scrapy.Spider):

    # stuff here from question

    @property
    def getNovel(self):
        return self.toNovel

    @property
    def getChapter(self):
        return self.toChapter

and then by using CrawlerRunner, which, unlike CrawlerProcess, leaves starting and stopping the reactor to the caller, like this:

from spiders.lightnovel import chapterSpider, LightnovelSpider, initCrawler
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor, defer
from scrapy.utils.log import configure_logging

configure_logging()

runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # run the main spider first, then hand its results to the other two
    yield runner.crawl(initCrawler)
    toNovel = initCrawler.toNovel
    toChapter = initCrawler.toChapter
    yield runner.crawl(chapterSpider, chapters=toChapter)
    yield runner.crawl(LightnovelSpider, novels=toNovel)

    reactor.stop()

crawl()
reactor.run()
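This script is meant to be run directly (for example as "python run_spiders.py", a hypothetical file name) rather than through "scrapy crawl": that way only this one script starts the Twisted reactor, which is what avoids the ReactorAlreadyRunning error.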

Wow, I didn't know something like that could work; I've never tried it.

What I do instead, when multiple scraping stages have to work together, is one of these two options:

Option 1 - use a database

When the scrapers have to run in a continuous mode, re-scanning the site and so on, I simply let the scraper push its results into a database (through an item pipeline),

and the follow-up spiders pull the data they need (in your case, for example, the novel URLs) out of the same database.

A scheduler or cron then keeps everything running, and the spiders work hand in hand.
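A minimal sketch of such a pipeline, assuming MongoDB through pymongo (the database, collection and field names are invented for illustration):

import pymongo


class NovelUrlPipeline(object):

    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["lightnovel"]["novel_urls"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # upsert by URL so continuous re-scans do not create duplicates
        self.collection.update_one({"url": item["url"]},
                                   {"$set": dict(item)},
                                   upsert=True)
        return item

The follow-up spiders can then read their start URLs from the same collection in their start_requests().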

Option 2 - merge everything into one spider

That's what I choose when everything has to run as a single script: I create one spider that chains the several steps together through its requests.

import scrapy


class LightnovelSpider(scrapy.Spider):

    name = "novels"
    allowed_domains = ["readlightnovel.com"]

    # was initCrawler.start_requests
    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_novel_list)

    # a mix of initCrawler.parse and parts of LightnovelScraper.start_requests
    def parse_novel_list(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a[not(@href="#")]/@href').extract():
            yield scrapy.Request(novel, callback=self.parse_novel)

    def parse_novel(self, response):
        # stuff here
        # ... and create requests with callback=self.parse_chapters
        pass

    def parse_chapters(self, response):
        # do stuff
        pass

(The code isn't tested, it just shows the basic concept.)
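If a later step needs data scraped in an earlier step, the request's meta dict can carry it between the chained callbacks. A small sketch (the "novel_title" key and the XPaths are invented for illustration):

    def parse_novel(self, response):
        title = response.xpath('//title/text()').extract_first()
        for href in response.xpath('//a[@class="chapter"]/@href').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_chapters,
                                 meta={"novel_title": title})

    def parse_chapters(self, response):
        yield {"novel": response.meta["novel_title"],
               "chapter_url": response.url}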

If things get too complex, I pull some parts out and move them into mixin classes.
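For example, a minimal sketch of the mixin idea (the class name and XPath are invented for illustration):

import scrapy


class ChapterParsingMixin(object):
    # chapter-related callbacks, reusable by several spiders

    def parse_chapters(self, response):
        for href in response.xpath('//ul[@class="chapter-list"]//a/@href').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_chapter)

    def parse_chapter(self, response):
        yield {"chapter_url": response.url}


class LightnovelSpider(ChapterParsingMixin, scrapy.Spider):
    name = "novels"
    # novel-list and novel parsing as above, eventually yielding
    # requests with callback=self.parse_chapters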

In your case I would most likely lean toward option 2.