2014-02-14 42 views

Scrapy - can't write to the log from a spider's __init__ method

I'm trying to write to the log from my spider's __init__ method, but I can't seem to get it working, even though it works fine from the parse method.

The call to self.log in the __init__ method happens inside the method get_urls_from_file. I know that method is being called, because I see its print statement on stdout, so I'm wondering if someone can point me in the right direction. I'm using Scrapy v0.18. Thanks!

My code is below:

from scrapy.spider import BaseSpider
from scrapy_redis import connection
from importlib import import_module
from scrapy import log
from scrapy.settings import CrawlerSettings

class StressS(BaseSpider):
    name = 'stress_s_spider'
    allowed_domains = ['www.example.com']

    def __init__(self, url_file=None, *args, **kwargs):
        super(StressS, self).__init__(*args, **kwargs)
        settings = CrawlerSettings(import_module('stress_test.settings'))
        if url_file:
            self.url_file = url_file
        else:
            self.url_file = settings.get('URL_FILE')
        self.start_urls = self.get_urls_from_file(self.url_file)
        self.server = connection.from_settings(settings)
        self.count_key = settings.get('ITEM_COUNT')

    def parse(self, response):
        self.log('Processed: %s, status code: %s' % (response.url, response.status), level=log.INFO)
        self.server.incr(self.count_key)

    def get_urls_from_file(self, fn):
        urls = []
        if fn:
            try:
                with open(fn, 'r') as f:
                    urls = [line.strip() for line in f]
            except IOError:
                msg = 'File %s could not be opened' % fn
                print msg
                self.log(msg, level=log.ERROR)
        return urls

Where do you use 'self.log' in your '__init__' method? –


Just edited the question to reflect this - in __init__, I call self.log inside the get_urls_from_file method. – user2871292

Answers


You can override the start_requests method instead:

# Default value for the argument in case it's missing.
url_file = None

def start_requests(self):
    settings = self.crawler.settings
    url_file = self.url_file if self.url_file else settings['URL_FILE']
    # set up server and count_key ...
    # finally yield the requests (Request comes from scrapy.http)
    for url in self.get_urls_from_file(url_file):
        yield Request(url, dont_filter=True)
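As an aside, the file-reading helper that feeds start_requests has no Scrapy dependency, so it can be sanity-checked on its own. Here is a minimal, Scrapy-free sketch of the same idea (with one small tweak over the original: blank lines are skipped, and the error is only printed since there is no spider log here):

```python
import os
import tempfile

def get_urls_from_file(fn):
    """Read one URL per line from fn, skipping blank lines."""
    urls = []
    if fn:
        try:
            with open(fn, 'r') as f:
                urls = [line.strip() for line in f if line.strip()]
        except IOError:
            print('File %s could not be opened' % fn)
    return urls

# Quick check against a throwaway file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('http://www.example.com/a\nhttp://www.example.com/b\n')
print(get_urls_from_file(tmp.name))
os.unlink(tmp.name)
```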

You can also override the set_crawler method and set the attributes there:

def set_crawler(self, crawler):
    super(MySpider, self).set_crawler(crawler)
    settings = crawler.settings
    # set up start_urls ...

This seems like a reasonable workaround, especially given the set_crawler method for needs like this. Out of curiosity, do you know why writing to the log from __init__ doesn't work? Is it because the log hasn't been initialized for writing at the point it's called? – user2871292


@user2871292, the spider is instantiated very early, so you can't access many objects that haven't been set up yet, such as self.crawler. – Rolando
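The ordering Rolando describes can be mimicked without Scrapy at all. In this toy sketch (all names hypothetical, not Scrapy's actual classes), the "framework" only attaches the crawler after the spider has been constructed, which is why anything that needs the crawler must wait until a later hook such as start_requests:

```python
class FakeCrawler(object):
    """Stand-in for the crawler the framework attaches later."""
    settings = {'URL_FILE': 'urls.txt'}

class ToySpider(object):
    def __init__(self):
        # At construction time no crawler is attached yet; this mirrors
        # why crawler-backed facilities are unusable inside __init__.
        self.crawler = None

    def set_crawler(self, crawler):
        # The framework calls this after __init__ has already finished.
        self.crawler = crawler

    def start_requests(self):
        # By the time this hook runs, the crawler (and its settings) exist.
        return self.crawler.settings['URL_FILE']

spider = ToySpider()               # __init__ runs first: spider.crawler is None
spider.set_crawler(FakeCrawler())  # the framework wires things up afterwards
print(spider.start_requests())     # now the settings are reachable
```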

As of Scrapy 0.22, it does not look possible.