
Scrapy - run a spider multiple times

I have set up my crawler this way:

import json

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def crawler(mood):
    process = CrawlerProcess(get_project_settings())
    # crawl music selected by critics on the web
    process.crawl('allmusic_{}_tracks'.format(mood), domain='allmusic.com')
    # the script will block here until the crawling is finished
    process.start()
    # create containers for the scraped data
    allmusic = []
    allmusic_tracks = []
    allmusic_artists = []
    # process the pipelined .jl file
    with open('blogs/spiders/allmusic_data/{}_tracks.jl'.format(mood), 'r+') as t:
        for line in t:
            allmusic.append(json.loads(line))
    # fetch artists and their corresponding tracks
    for item in allmusic:
        allmusic_artists.append(item['artist'])
        allmusic_tracks.append(item['track'])
    return (allmusic_artists, allmusic_tracks)

I can run it like this:

artist_list, song_list = crawler('bitter')
print(artist_list)

and it works fine.

But if I want to run it several times in a row:

artist_list, song_list = crawler('bitter') 
artist_list2, song_list2 = crawler('harsh') 

I get:

twisted.internet.error.ReactorNotRestartable

Is there a simple way to build a wrapper around this kind of spider so that I can run it multiple times?

Answer


It turned out to be simple.

The problem was that a single CrawlerProcess was defined inside the function, and Twisted's reactor can only be started once per Python process, so a second call to process.start() raises ReactorNotRestartable.

So instead I can do this:

def crawler(mood1, mood2):
    process = CrawlerProcess(get_project_settings())
    # crawl music selected by critics on the web
    process.crawl('allmusic_{}_tracks'.format(mood1), domain='allmusic.com')
    process.crawl('allmusic_{}_tracks'.format(mood2), domain='allmusic.com')
    # the script will block here until both crawls are finished
    process.start()

The caveat is that you must already have a spider class defined for each mood you pass in.
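
One limitation of this approach is that both spiders are scheduled into the same process.start() call, so crawler() still cannot be called a second time, and the per-mood results have to be read from the output files afterwards. If the runs need to stay sequential and separate, a pattern from the Scrapy documentation is to chain the crawls as Deferreds with CrawlerRunner inside a single reactor run. Below is a minimal sketch of that idea, assuming the same spider names and .jl output paths as in the question; crawl_moods and load_tracks are hypothetical helper names:

import json

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_moods(moods):
    # run the spiders one after another inside a single reactor run
    for mood in moods:
        yield runner.crawl('allmusic_{}_tracks'.format(mood), domain='allmusic.com')
    reactor.stop()

def load_tracks(mood):
    # read the pipelined .jl file written by the spider for this mood
    artists, tracks = [], []
    with open('blogs/spiders/allmusic_data/{}_tracks.jl'.format(mood)) as t:
        for line in t:
            item = json.loads(line)
            artists.append(item['artist'])
            tracks.append(item['track'])
    return artists, tracks

crawl_moods(['bitter', 'harsh'])
reactor.run()  # blocks here until both crawls have finished

artist_list, song_list = load_tracks('bitter')
artist_list2, song_list2 = load_tracks('harsh')

The key design point is that reactor.run() is still called exactly once; the sequencing happens inside Twisted through the chained Deferreds rather than by restarting the reactor, which is exactly what ReactorNotRestartable forbids.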