2013-07-19

I followed the advice in these two posts on passing arguments to Scrapy, since I also want to create a generic Scrapy spider:

How to pass a user defined argument in scrapy spider

Creating a generic scrapy spider

But I am getting an error saying that the variable I should be passing as an argument is undefined. Am I missing something in my __init__ method?

Code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from data.items import DataItem

class companySpider(BaseSpider):
    name = "woz"

    def __init__(self, domains=""):
        '''
        domains is a string
        '''
        self.domains = domains

    deny_domains = [""]
    start_urls = [domains]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('/html')
        items = []
        for site in sites:
            item = DataItem()
            item['text'] = site.select('text()').extract()
            items.append(item)
        return items

Here is my command line:

scrapy crawl woz -a domains="http://www.dmoz.org/Computers/Programming/Languages/Python/Books/" 

And here is the error:

NameError: name 'domains' is not defined 

I forgot to reference the variable in start_urls as self.domains, but now the error says self is not defined. I have the answer to my own question, but I have to wait 4 hours before I can post it. To be continued... – jstaker7

Answer


You should call super(companySpider, self).__init__(*args, **kwargs) at the start of your __init__:

def __init__(self, domains="", *args, **kwargs): 
    super(companySpider, self).__init__(*args, **kwargs) 
    self.domains = domains 
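As for the NameError itself: it happens because the class body runs once, at class-definition time, and the __init__ parameter domains only exists inside __init__, never at class scope. A minimal sketch (no Scrapy required; BrokenSpider is a hypothetical name) that reproduces the same error:

```python
# The class body executes at definition time, where the __init__
# parameter `domains` is not in scope, so the class-level reference
# raises NameError before any instance is ever created.
try:
    class BrokenSpider:
        def __init__(self, domains=""):
            self.domains = domains

        start_urls = [domains]  # NameError: name 'domains' is not defined
except NameError as e:
    print(e)  # name 'domains' is not defined
```

This is why moving the URL logic into instance methods (or into start_requests(), as below) avoids the error.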

In cases like yours, where the first request depends on a spider argument, I usually just override the start_requests() method instead of overriding __init__(). Argument names passed on the command line are already available as attributes on the spider:

from scrapy.http import Request

class companySpider(BaseSpider):
    name = "woz"
    deny_domains = [""]

    def start_requests(self):
        yield Request(self.domains)  # for example, if domains is a single URL

    def parse(self, response):
        ...
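To see why self.domains is available without writing any __init__ at all, here is a simplified model of the mechanism (FakeBaseSpider is a stand-in, not Scrapy's actual source): the base spider's __init__ receives each -a name=value pair as a keyword argument and stores it as an instance attribute.

```python
# Simplified sketch of how `scrapy crawl woz -a domains=...` reaches
# the spider: each -a argument becomes a keyword argument to __init__,
# which the base class stores as an instance attribute.
class FakeBaseSpider:
    def __init__(self, name=None, **kwargs):
        # stand-in for Scrapy's base spider: keep every -a argument
        self.__dict__.update(kwargs)

class CompanySpider(FakeBaseSpider):
    name = "woz"

    def start_requests(self):
        # self.domains exists because the base __init__ stored it
        yield self.domains  # a real spider would yield Request(self.domains)

url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
spider = CompanySpider(domains=url)
print(next(spider.start_requests()))  # prints the URL passed via domains=
```

This is also why, if you do override __init__, you must call super(...).__init__(*args, **kwargs): otherwise the base class never gets a chance to process those keyword arguments.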