2014-03-04 48 views

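Scrapy - Non-ASCII character declared, but no encoding declared

I'm trying to scrape some basic data from this website as an exercise to learn more about Scrapy, and as a proof of concept for a university project: http://steamdb.info/sales/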

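When I use the Scrapy shell, I can get the information I want with the following XPath:

sel.xpath('//tbody/tr[1]/td[2]/a/text()').extract()

This should return the title of the game in the first row of the table, given the structure: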

<tbody> 
    <tr> 
      <td></td> 
      <td><a>stuff I want here</a></td> 
... 

And it does in the shell.

However, when I try to turn this into a spider (steam.py):

1 from scrapy.spider import BaseSpider 
2 from scrapy.selector import HtmlXPathSelector 
3 from steam_crawler.items import SteamItem 
4 from scrapy.selector import Selector 
5 
6 class SteamSpider(BaseSpider): 
7  name = "steam" 
8  allowed_domains = ["http://steamdb.info/"] 
9  start_urls = ['http://steamdb.info/sales/?displayOnly=all&category=0&cc=uk'] 
10  def parse(self, response): 
11   sel = Selector(response) 
12   sites = sel.xpath("//tbody") 
13   items = [] 
14   count = 1 
15   for site in sites: 
16    item = SteamItem() 
17    item ['title'] = sel.xpath('//tr['+ str(count) +']/td[2]/a/text()').extract().encode('utf-8') 
18    item ['price'] = sel.xpath('//tr['+ str(count) +']/td[@class=“price-final”]/text()').extract().encode('utf-8') 
19    items.append(item) 
20    count = count + 1 
21   return items 

I get the following error:

ricks-mbp:steam_crawler someuser$ scrapy crawl steam -o items.csv -t csv 
Traceback (most recent call last): 
    File "/usr/local/bin/scrapy", line 5, in <module> 
    pkg_resources.run_script('Scrapy==0.20.0', 'scrapy') 
    File "build/bdist.macosx-10.9-intel/egg/pkg_resources.py", line 492, in run_script 

    File "build/bdist.macosx-10.9-intel/egg/pkg_resources.py", line 1350, in run_script 
    for name in eagers: 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module> 
    execute() 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/cmdline.py", line 143, in execute 
    _run_print_help(parser, _run_command, cmd, args, opts) 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/cmdline.py", line 89, in _run_print_help 
    func(*a, **kw) 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/cmdline.py", line 150, in _run_command 
    cmd.run(args, opts) 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/commands/crawl.py", line 47, in run 
    crawler = self.crawler_process.create_crawler() 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/crawler.py", line 87, in create_crawler 
    self.crawlers[name] = Crawler(self.settings) 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/crawler.py", line 25, in __init__ 
    self.spiders = spman_cls.from_crawler(self) 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/spidermanager.py", line 35, in from_crawler 
    sm = cls.from_settings(crawler.settings) 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/spidermanager.py", line 31, in from_settings 
    return cls(settings.getlist('SPIDER_MODULES')) 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/spidermanager.py", line 22, in __init__ 
    for module in walk_modules(name): 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/utils/misc.py", line 68, in walk_modules 
    submod = import_module(fullpath) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module 
    __import__(name) 
    File "/xxx/scrape/steam/steam_crawler/spiders/steam.py", line 18 
SyntaxError: Non-ASCII character '\xe2' in file /xxx/scrape/steam/steam_crawler/spiders/steam.py on line 18, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details 

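I have a feeling that what I need to do is somehow tell Scrapy that these characters will be UTF-8 rather than ASCII, since some of them are. But from what I can gather, Scrapy is supposed to pick this up from the page's head, which for this site is: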

<meta charset="utf-8"> 

Which leaves me baffled! Any insight, or any reading other than the Scrapy documentation itself, would be of interest to me!

Answers


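It looks like you are using curly quotes (“ and ”) instead of straight double quotes (") around price-final on line 18 of steam.py. The error comes from Python itself rather than from Scrapy: \xe2 is the first byte of the UTF-8 encoding of “, and Python 2 rejects non-ASCII bytes in a source file that has no encoding declaration.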
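To see where such a character hides, a minimal sketch (Python 3, standard library only) can scan source text for non-ASCII characters and show their UTF-8 bytes. The helper name `find_non_ascii` and the two-line sample are made up for illustration; they are not part of the original spider.

```python
def find_non_ascii(source):
    """Return (line_number, column, character, utf8_bytes) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch, ch.encode("utf-8")))
    return hits


# Hypothetical sample: line 2 uses curly quotes (U+201C, U+201D) pasted from a web page.
sample = (
    "title = row.xpath('td[2]/a/text()')\n"
    "price = row.xpath('td[@class=\u201cprice-final\u201d]/text()')\n"
)

for line_no, col, ch, raw in find_non_ascii(sample):
    print(line_no, col, repr(ch), raw)
```

Both hits are on line 2 of the sample, and each encodes to a three-byte sequence starting with `\xe2`, the byte named in the SyntaxError. Replacing them with straight quotes removes the need for an encoding declaration.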

By the way, a better way to loop over all the table rows would be something like this:

for tr in sel.xpath("//tr"):
    item = SteamItem()
    item['title'] = tr.xpath('td[2]/a/text()').extract()
    item['price'] = tr.xpath('td[@class="price-final"]/text()').extract()
    yield item
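The same row-relative pattern can be tried without Scrapy installed. The sketch below swaps in the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (so `text()` becomes `.text`), and runs on a made-up, well-formed table. The function name `parse_rows`, the game titles, and the prices are invented for illustration and are not data from steamdb.info.

```python
import xml.etree.ElementTree as ET

# Invented sample markup with the same shape as the table in the question.
HTML = """
<table>
  <tbody>
    <tr><td></td><td><a>Game A</a></td><td class="price-final">1.99</td></tr>
    <tr><td></td><td><a>Game B</a></td><td class="price-final">4.49</td></tr>
  </tbody>
</table>
"""


def parse_rows(markup):
    """Yield one dict per table row, using paths relative to that row."""
    root = ET.fromstring(markup)
    for tr in root.iter("tr"):
        link = tr.find("td[2]/a")
        price = tr.find("td[@class='price-final']")
        # Skip rows that lack either cell (for example header rows).
        if link is None or price is None:
            continue
        yield {"title": link.text, "price": price.text}


rows = list(parse_rows(HTML))
for row in rows:
    print(row)
```

Because each path starts at the current `tr`, no counter is needed, and a title cannot be paired with a price from a different row.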

This looks much simpler, and it works like a dream. How did you learn Scrapy? Books/tutorials? – Lorienas