
I have a Scrapy project that I want to use to scrape some websites. When I try to save all the information to a MySQL database, the error in the title pops up. I've read around and found that it's a "list" problem, probably related to the items[] list... Could you help me understand what this error means and where I should fix the code? Please also explain why, because I want to understand. Thanks a lot. Error: 'GuideSpider' object is not subscriptable

Spider code:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders.crawl import Rule, CrawlSpider
from scrapy.selector import HtmlXPathSelector

from gscrape.items import GscrapeItem

class GuideSpider(CrawlSpider):
    name = "Dplay"
    allowed_domains = ['www.example.com']
    start_urls = [
        "http://www.examplea.com/forums/forumdisplay.php?f=108&order=desc&page=1"
    ]
    rules = (
        # Follow paginated forum listing pages and parse each one
        Rule(SgmlLinkExtractor(allow=("forumdisplay.php.*f=108.*page=")), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        sites = hxs.select('//div')
        for site in sites:
            item = GscrapeItem()
            # Thread title, link and subject text, relative to each <div>
            item['title'] = site.select('a[@class="threadcolor"]/text()').extract()
            item['guide_url'] = site.select('a[@class="threadcolor"]/@href').extract()
            item['subject'] = site.select('./text()[1]').extract()
            items.append(item)
        return items

Pipeline code:

from scrapy.exceptions import DropItem
from string import join
from scrapy import log
from twisted.enterprise import adbapi

import MySQLdb.cursors

class GscrapePipeline(object):

    def process_item(self, item, spider):
        if item['guide_url']:
            # Turn the extracted relative href into an absolute URL
            item['guide_url'] = "http://www.example.com/forums/" + join(item['guide_url'])
            return item
        else:
            raise DropItem()

class MySQLStorePipeline(object):

    def __init__(self):
        # @@@ hardcoded db settings
        # TODO: make settings configurable through settings
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            db='prova',
            host='127.0.0.1',
            user='root',
            passwd='',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=True
        )

    def process_item(self, spider, item):
        # run db query in thread pool
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)

        return item

    def _conditional_insert(self, tx, item):
        # create the record if it doesn't exist;
        # this whole block runs in its own thread
        tx.execute("select * from prova where guide_url = %s", (item['guide_url'],))
        result = tx.fetchone()
        if result:
            log.msg("Item already stored in db: %s" % item, level=log.DEBUG)
        else:
            tx.execute(
                "insert into prova (title, guide_url, subject) "
                "values (%s, %s, %s)",
                (item['title'],
                 item['guide_url'],
                 item['subject']
                ))
            log.msg("Item stored in db: %s" % item, level=log.DEBUG)

    def handle_error(self, e):
        log.err(e)

Error: exceptions.TypeError: 'GuideSpider' object is not subscriptable (line 47 of pipelines.py)


Could you include the whole stack trace, with all the error lines? –

Answer


As per the docs, the signature is:

process_item(item, spider) 

while in your pipeline you have:

def process_item(self, spider, item): 

You have the arguments in the wrong order, which means you are passing the spider, not the item, to your _conditional_insert.
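
A minimal sketch of the fix (the rest of MySQLStorePipeline can stay as posted; only the parameter order changes):

    def process_item(self, item, spider):
        # Scrapy calls process_item(item, spider), so 'item' must come first;
        # otherwise the GuideSpider object ends up being handed to _conditional_insert.
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

With the parameters swapped back, item['guide_url'] inside _conditional_insert is looked up on the actual GscrapeItem rather than on the GuideSpider, which is exactly what "object is not subscriptable" was complaining about.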

Learn to use a debugger. Install ipdb (along with IPython) and put this at line 47:

import ipdb; ipdb.set_trace() 

When the program reaches the offending line, you will be able to inspect the variables and the traceback.
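
For example, a sketch of where the breakpoint could go in this pipeline (assuming ipdb is installed with pip install ipdb; its commands are the same as pdb's):

    def _conditional_insert(self, tx, item):
        import ipdb; ipdb.set_trace()  # execution pauses here
        # At the prompt: 'p item' prints the object that was passed in,
        # 'p type(item)' shows it is a GuideSpider rather than a GscrapeItem,
        # 'n' steps to the next line, 'c' continues execution.
        tx.execute("select * from prova where guide_url = %s", (item['guide_url'],))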


The ipdb link is broken. If you want to install ipdb for Python 3, do the following: http://stackoverflow.com/questions/17242207/install-ipdb-for-python-3#17242289 worked for me. –