2014-01-05

I am trying to write a Scrapy spider that crawls the following JSON response: http://gdata.youtube.com/feeds/api/standardfeeds/UK/most_popular?v=2&alt=json

What should the spider look like if I want to crawl all of the video titles? None of my spiders work.

from scrapy.spider import BaseSpider
import json
from youtube.items import YoutubeItem

class MySpider(BaseSpider):
    name = "youtubecrawler"
    allowed_domains = ["gdata.youtube.com"]
    start_urls = ['http://www.gdata.youtube.com/feeds/api/standardfeeds/DE/most_popular?v=2&alt=json']

    def parse(self, response):
        items = []
        jsonresponse = json.loads(response)
        for video in jsonresponse["feed"]["entry"]:
            item = YoutubeItem()
            print jsonresponse
            print video["media$group"]["yt$videoid"]["$t"]
            print video["media$group"]["media$description"]["$t"]
            item["title"] = video["title"]["$t"]
            print video["author"][0]["name"]["$t"]
            print video["category"][1]["term"]
            items.append(item)
        return items

I always get the following error:

2014-01-05 16:55:21+0100 [youtubecrawler] ERROR: Spider error processing <GET http://gdata.youtube.com/feeds/api/standardfeeds/DE/most_popular?v=2&alt=json> 
     Traceback (most recent call last): 
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1201, in mainLoop 
      self.runUntilCurrent() 
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 824, in runUntilCurrent 
      call.func(*call.args, **call.kw) 
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 382, in callback 
      self._startRunCallbacks(result) 
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks 
      self._runCallbacks() 
     --- <exception caught here> --- 
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks 
      current.result = callback(current.result, *args, **kw) 
      File "/home/bxxxx/svn/ba_txxxxx/scrapy/youtube/spiders/test.py", line 15, in parse 
      jsonresponse = json.loads(response) 
      File "/usr/lib/python2.7/json/__init__.py", line 326, in loads 
      return _default_decoder.decode(s) 
      File "/usr/lib/python2.7/json/decoder.py", line 365, in decode 
      obj, end = self.raw_decode(s, idx=_w(s, 0).end()) 
     exceptions.TypeError: expected string or buffer 
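The `TypeError` at the bottom of the traceback is the key: `json.loads` expects a string, but Scrapy passes a `Response` object to `parse`. A standalone sketch of the failure (using a made-up stand-in class, no Scrapy required):

```python
import json

class FakeResponse:
    """Illustrative stand-in for a Scrapy Response (not the real class)."""
    def __init__(self, body):
        self.text = body  # Scrapy's TextResponse also exposes the decoded body as .text

response = FakeResponse('{"feed": {"entry": []}}')

# Passing the object itself fails, exactly like the traceback above:
try:
    json.loads(response)
except TypeError as exc:
    print("json.loads rejected the object:", exc)

# Passing the decoded body string works:
data = json.loads(response.text)
print(data["feed"]["entry"])  # []
```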

Answer


I found two problems in the code:

  1. The start URL was not reachable, so I removed the www from it
  2. Changed json.loads(response) to json.loads(response.body_as_unicode())

This works well for me:

class MySpider(BaseSpider):
    name = "youtubecrawler"
    allowed_domains = ["gdata.youtube.com"]
    start_urls = ['http://gdata.youtube.com/feeds/api/standardfeeds/DE/most_popular?v=2&alt=json']

    def parse(self, response):
        items = []
        jsonresponse = json.loads(response.body_as_unicode())
        for video in jsonresponse["feed"]["entry"]:
            item = YoutubeItem()
            print video["media$group"]["yt$videoid"]["$t"]
            print video["media$group"]["media$description"]["$t"]
            item["title"] = video["title"]["$t"]
            print video["author"][0]["name"]["$t"]
            print video["category"][1]["term"]
            items.append(item)
        return items

'body_as_unicode' has been deprecated, see https://doc.scrapy.org/en/latest/topics/request-response.html?highlight=body_as_unicode#scrapy.http.TextResponse.body_as_unicode – timfeirg


Right! Thanks @timfeirg. Future readers, please use 'response.text' instead –
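As the comments note, `body_as_unicode()` was deprecated in favour of `response.text`. The title extraction itself can be sketched as a plain helper that works on any decoded body string (the sample feed below is a made-up miniature in the same shape as the old gdata response):

```python
import json

def extract_titles(body):
    """Return every video title from a gdata-style JSON feed body."""
    data = json.loads(body)
    return [video["title"]["$t"] for video in data["feed"]["entry"]]

# A made-up miniature feed in the same shape as the old gdata response:
sample = '{"feed": {"entry": [{"title": {"$t": "First"}}, {"title": {"$t": "Second"}}]}}'
print(extract_titles(sample))  # ['First', 'Second']
```

Inside a modern spider, the parse callback would then be roughly `for title in extract_titles(response.text): yield {"title": title}`.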