
I want to schedule my spider to run again one hour after a crawl finishes. In my code, the spider_closed method is called once the crawl ends. How can I run the spider again from this method? Or is there any setting available for scheduling a Scrapy spider? How do I schedule a Scrapy spider to crawl again after a specific interval?

Here is my basic spider code.

import scrapy 
import codecs 
from a2i.items import A2iItem 
from scrapy.selector import Selector 
from scrapy.http import HtmlResponse 
from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.conf import settings 
from scrapy.crawler import CrawlerProcess 
from scrapy import signals 
from scrapy.utils.project import get_project_settings 
from scrapy.xlib.pydispatch import dispatcher 


class A2iSpider(scrapy.Spider):
    name = "notice"
    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()
    allowed_domains = ["prothom-alo.com"]

    def __init__(self):
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            print "*"*70
            print url
            print "\n\n"
            yield scrapy.Request(url, callback=self.parse_page, meta={'depth': 2, 'url': url})

    def parse_page(self, response):
        filename = "response.txt"
        depth = response.meta['depth']

        with open(filename, 'a') as f:
            f.write(str(depth))
            f.write("\n")
            f.write(response.meta['url'])
            f.write("\n")

        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_page, meta={'depth': depth + 1, 'url': url})

    def spider_closed(self, spider):
        print "$"*2000

1 Answer

You can use cron.

Run crontab -e to create the schedule and run the script as root, or crontab -u [user] -e to run it as a specific user.

At the bottom you can add 0 * * * * cd /path/to/your/scrapy && scrapy crawl [yourScrapy] >> /path/to/log/scrapy_log.log

0 * * * * makes the script run once every hour; you can find the details of the schedule syntax online.
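
If you would rather keep the re-scheduling inside Scrapy itself, so that the next crawl starts one hour after the previous one finishes (as asked above), a minimal sketch using CrawlerRunner and the Twisted reactor could look like the following. The script name run_notice.py and the 3600-second delay are illustrative assumptions; it assumes the script is run from inside the a2i project so the "notice" spider can be found by name.

# run_notice.py -- a minimal sketch, not part of the original answer.
# Assumes it is executed from inside the Scrapy project so that the
# "notice" spider can be located by name; the 3600-second delay is
# the "one hour after the crawl finishes" from the question.
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

def crawl():
    # runner.crawl() returns a Deferred that fires when the crawl is done.
    d = runner.crawl("notice")
    # One hour after the crawl finishes, start it again.
    d.addCallback(lambda _: reactor.callLater(3600, crawl))

crawl()
reactor.run()

With this approach the delay is measured from the end of one crawl to the start of the next, whereas the cron entry above starts a crawl at the top of every hour regardless of how long the previous crawl took.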