2017-07-18 149 views

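Scrapy - creating nested JSON objects

I am learning how to use Scrapy while refreshing my Python and coding knowledge from school.

At the moment I am playing with the IMDb Top 250 list, but I am struggling with the JSON output file.

My current code is: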

# -*- coding: utf-8 -*-
import scrapy

from top250imdb.items import Top250ImdbItem


class ActorsSpider(scrapy.Spider):
    name = "actors"
    allowed_domains = ["imdb.com"]
    start_urls = ['http://www.imdb.com/chart/top']

    # Parsing each movie and preparing the url for the actors list
    def parse(self, response):
        for film in response.css('.titleColumn'):
            url = film.css('a::attr(href)').extract_first()
            actors_url = 'http://imdb.com' + url[:17] + 'fullcredits?ref_=tt_cl_sm#cast'
            yield scrapy.Request(actors_url, self.parse_actor)

    # Finding all actors and storing them on item
    # Refer to items.py
    def parse_actor(self, response):
        final_list = []
        item = Top250ImdbItem()
        item['poster'] = response.css('#main img::attr(src)').extract_first()
        item['title'] = response.css('h3[itemprop~=name] a::text').extract()
        item['photo'] = response.css('#fullcredits_content .loadlate::attr(loadlate)').extract()
        item['actors'] = response.css('td[itemprop~=actor] span::text').extract()

        final_list.append(item)

        updated_list = []

        for item in final_list:
            for i in range(len(item['title'])):
                sub_item = {}
                sub_item['movie'] = {}
                sub_item['movie']['poster'] = [item['poster']]
                sub_item['movie']['title'] = [item['title'][i]]
                sub_item['movie']['photo'] = [item['photo']]
                sub_item['movie']['actors'] = [item['actors']]
                updated_list.append(sub_item)
            return updated_list

My output file then contains JSON with this structure:

[
    {
        "movie": {
            "poster": ["https://images-na.ssl-images-amazon.com/poster..."],
            "title": ["The Shawshank Redemption"],
            "photo": [["https://images-na.ssl-images-amazon.com/photo..."]],
            "actors": [["Tim Robbins","Morgan Freeman",...]]
        }
    },
    {
        "movie": {
            "poster": ["https://images-na.ssl-images-amazon.com/poster..."],
            "title": ["The Godfather"],
            "photo": [["https://images-na.ssl-images-amazon.com/photo..."]],
            "actors": [["Alexandre Rodrigues", "Leandro Firmino", "Phellipe Haagensen",...]]
        }
    }
]

But this is what I am trying to achieve:

{
    "movies": [
        {
            "poster": "https://images-na.ssl-images-amazon.com/poster...",
            "title": "The Shawshank Redemption",
            "actors": [
                {"photo": "https://images-na.ssl-images-amazon.com/photo...",
                 "name": "Tim Robbins"},
                {"photo": "https://images-na.ssl-images-amazon.com/photo...",
                 "name": "Morgan Freeman"},...
            ]
        },
        {
            "poster": "https://images-na.ssl-images-amazon.com/poster...",
            "title": "The Godfather",
            "actors": [
                {"photo": "https://images-na.ssl-images-amazon.com/photo...",
                 "name": "Marlon Brando"},
                {"photo": "https://images-na.ssl-images-amazon.com/photo...",
                 "name": "Al Pacino"},...
            ]
        }
    ]
}

In my items.py file I have the following:

import scrapy 


class Top250ImdbItem(scrapy.Item): 
    # define the fields for your item here like: 
    # name = scrapy.Field() 

    # Items from actors.py 
    poster = scrapy.Field() 
    title = scrapy.Field() 
    photo = scrapy.Field() 
    actors = scrapy.Field() 
    movie = scrapy.Field() 
    pass 

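I am aware of the following:

  1. My results do not come out in order. The first movie on the web page list is always the first movie in my output file, but the rest are not. I am still working on that.

  2. I could do the same thing using Top250ImdbItem(); I am still looking into how to do that in a more detailed way.

  3. This may not be the ideal layout for my JSON. Suggestions are welcome, or if it is fine, please tell me, even though I know there is no perfect way or "only way".

  4. Some actors have no photo, and in that case the image is actually loaded through a different CSS selector. For now I want to avoid fetching the "no picture" thumbnail, so those entries can be left empty.

For example: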

{"photo": "", "name": "Al Pacino"} 
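The empty-photo idea in point 4 could be sketched roughly like this. It assumes each cast row has already been reduced to a (name, photo) pair, with the photo set to None when no real picture was found for that row. The helper name `cast_rows_to_actors`, the sample rows and the example.com URL are hypothetical and not part of the spider above.

```python
def cast_rows_to_actors(rows):
    """Turn (name, photo_or_None) pairs into {"photo": ..., "name": ...} dicts."""
    actors = []
    for name, photo in rows:
        # A missing photo becomes an empty string instead of a placeholder image
        actors.append({"photo": photo or "", "name": name.strip()})
    return actors


# Hypothetical sample data; the URL is a stand-in, not a real IMDb address
rows = [
    ("Marlon Brando", "https://example.com/brando.jpg"),
    ("Al Pacino", None),  # no real photo found for this row
]
actors = cast_rows_to_actors(rows)
print(actors)
```

Pairing name and photo per row, rather than extracting two separate flat lists, keeps each actor aligned with the right picture even when some photos are missing.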

Don't use `(scrapy.Item)`; use a `dict` that starts with `'movies': []`. – stovfl


Hey @stovfl, could you elaborate a bit? – ricardoNava

Answer


Question: ... struggling with a JSON output file


Note: I can't run your ActorsSpider; I get the error: Pseudo-elements are not supported.

# Define a `dict` **once**
top250ImdbItem = {'movies': []}

def parse_actor(self, response):
    poster = response.css(...
    title = response.css(...
    photos = response.css(...
    actors = response.css(...

    # Assuming List of Actors are in sync with List of Photos
    actors_list = []
    for i, actor in enumerate(actors):
        actors_list.append({"name": actor, "photo": photos[i]})

    one_movie = {"poster": poster,
                 "title": title,
                 "actors": actors_list
                 }

    # Append One Movie to Top250 'movies' List
    top250ImdbItem['movies'].append(one_movie)
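The accumulated dict still has to be written out somewhere, and the ordering problem from the question remains, because Scrapy handles requests asynchronously and movies are appended in the order their responses arrive. The following is only a sketch under assumptions: each movie dict is assumed to carry a hypothetical `rank` key (for example, the position on the chart page, passed along with the request), and the helper names `finalize_movies` and `dump_movies` are made up for illustration. In a spider, something like this could be called from the `closed()` method, which Scrapy invokes when the spider finishes.

```python
import json
import os
import tempfile


def finalize_movies(collected):
    """Sort the collected movies by rank and drop the helper 'rank' key."""
    ordered = sorted(collected["movies"], key=lambda movie: movie["rank"])
    return {"movies": [
        {key: value for key, value in movie.items() if key != "rank"}
        for movie in ordered
    ]}


def dump_movies(collected, path):
    """Write the nested {'movies': [...]} structure to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(finalize_movies(collected), handle, indent=4, ensure_ascii=False)


# Hypothetical sample data, appended out of order as responses might arrive
top250 = {"movies": [
    {"rank": 2, "poster": "", "title": "The Godfather", "actors": []},
    {"rank": 1, "poster": "", "title": "The Shawshank Redemption", "actors": []},
]}
output_path = os.path.join(tempfile.gettempdir(), "top250_sketch.json")
dump_movies(top250, output_path)

with open(output_path, encoding="utf-8") as handle:
    result = json.load(handle)
print([movie["title"] for movie in result["movies"]])
```

Sorting once at the end avoids depending on the order in which responses come back.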

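OK, I will check. It is a bit odd that you can't run it, since I am still using exactly the same code. I will look into that problem as well and post an update so you can see whether it runs for you. I will try these suggestions. In fact, the photos and actors are not in sync, and I am still figuring out how to handle that, but your help is really great. – ricardoNava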


Should I post the modified working code here as a comment, edit the current code, or leave it as it is? – ricardoNava


[Edit] your question and add only the changed parts. – stovfl