How to get structured JSON output with Scrapy?

I'm new to Python, and I recently tried using Scrapy to scrape a website with multiple pages. Below is a snippet from my "spider.py":

def parse(self, response):
    sel = Selector(response)
    tuples = sel.xpath('//*[td[@class = "caption"]]')
    items = []

    for tuple in tuples:
        item = DataTuple()

        keyTemp = tuple.xpath('td[1]').extract()[0]
        key = html2text.html2text(keyTemp).rstrip()
        valueTemp = tuple.xpath('td[2]').extract()[0]
        value = html2text.html2text(valueTemp).rstrip()

        item[key] = value
        items.append(item)
    return items

I run the code with this command:

scrapy crawl dumbSpider -o items.json -t json

It outputs:

{"a":"a-Value"}, 
{"b":"b-Value"}, 
{"c":"c-Value"}, 
{"a":"another-a-Value"}, 
{"b":"another-b-Value"}, 
{"c":"another-c-Value"}

But what I actually want is this:

{"a":"a-Value", "b":"b-Value", "c":"c-Value"}, 
{"a":"another-a-Value", "b":"another-b-Value", "c":"another-c-Value"}

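I tried a few ways of adjusting spider.py, e.g. using a temporary list to store all the "items" of a single web page and then appending that temporary list to "items", but somehow it doesn't work.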

Update: the indentation has been fixed.


2016-01-24 mightyheptagon

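Consider building a dictionary and filling it with your key/value pairs until you find that a particular key, e.g. 'a', already exists. When that happens, create a new dictionary and do the same again. – PatNowak

@PatNowak Thanks for your comment! But the data shown on that site is too flexible to keep track of. I actually have no way of knowing when I'm getting close to the end of a particular page. – mightyheptagon

Is it always in order? I mean, does it always come in groups of 3, so that you want the first 3, then the next 3, and so on? – eLRuLL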
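
PatNowak's suggestion above can be tried without Scrapy. The following is only a minimal sketch, assuming the scraped rows are already available as (key, value) pairs in page order. The function name `group_by_repeated_key` is made up for this illustration and is not part of Scrapy.

```python
def group_by_repeated_key(pairs):
    """Yield dicts, starting a new dict whenever a key shows up again."""
    current = {}
    for key, value in pairs:
        if key in current:
            # The key repeats, so the previous record is complete.
            yield current
            current = {}
        current[key] = value
    if current:
        # Emit the last (possibly partial) record.
        yield current


# Sample data taken from the output shown in the question.
pairs = [
    ("a", "a-Value"), ("b", "b-Value"), ("c", "c-Value"),
    ("a", "another-a-Value"), ("b", "another-b-Value"), ("c", "another-c-Value"),
]
records = list(group_by_repeated_key(pairs))
```

This does not need to know how many keys belong to one record, but it does rely on at least one key repeating at the start of every new record.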

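Below I've put together a quick mock-up of what I would recommend, provided you know the number of TDs per page. Take some or all of it as you see fit. It is probably over-engineered for your problem (sorry!); you could just take the chunk_by_number bit and be done....

A few things to note: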

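1) Avoid using "tuple" as a variable name, because it shadows the built-in type of the same name

2) Learn to use generators/built-ins, because they will be faster and lighter if you process a lot of sites at once (see parse_to_kv and chunk_by_number below)

3) Try to isolate the parsing logic so that, if it changes, you can easily swap it out in one place (see extract_td below)

4) Your function doesn't use 'self', so you should use the @staticmethod decorator and remove that parameter from the function

5) The output is currently a dict, but you can import json and dump it if you need a JSON object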
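
For point 5, here is a small sketch using the standard library's `json` module. The sample record is taken from the question; `sort_keys=True` is only there to make the key order predictable.

```python
import json

# One merged record, as desired in the question.
record = {"a": "a-Value", "b": "b-Value", "c": "c-Value"}

# json.dumps returns a str; json.dump would write to a file object instead.
as_json = json.dumps(record, sort_keys=True)
print(as_json)
```

This is only needed when you serialize the dicts yourself. When a spider returns or yields items and is run with -o items.json, Scrapy's feed exporter does the serialization.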

import html2text
from scrapy.selector import Selector

def extract_td(item, index):
    # Extraction logic for my websites: pulls either a key or a value
    # out of a table row. XPath positions are 1-based, so index 1 is
    # the first <td>. Returns the text of that cell as a string.
    # This is very page/tool specific!
    td_as_str = "td[%i]" % index
    val = item.xpath(td_as_str).extract()[0]
    return html2text.html2text(val).rstrip()

def parse_to_kv(rows):
    # Yields (key, value) pairs from the given row selectors.
    # This is also page specific.
    for row in rows:
        yield extract_td(row, 1), extract_td(row, 2)

def chunk_by_number(alist, num):
    # Splits alist into chunks of num elements; a shorter trailing
    # chunk is dropped. This is a very generic, reusable operation.
    for chunk in zip(*(iter(alist),) * num):
        yield chunk

def parse(response, td_per_page):
    # Extracts key/value pairs based on the table rows in response and
    # prints one dict per group of td_per_page pairs.
    # This is very specific to our parse patterns.
    sel = Selector(response)
    tuples = sel.xpath('//*[td[@class = "caption"]]')
    kv_generator = parse_to_kv(tuples)

    for page in chunk_by_number(kv_generator, td_per_page):
        print(dict(page))
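
The chunking idiom can be checked on plain data, without a Scrapy response. The sketch below assumes the key/value pairs are already extracted. It also shows one possible variant, under the made-up name `chunk_keep_remainder`, that keeps a shorter final chunk instead of dropping it.

```python
from itertools import islice


def chunk_by_number(iterable, num):
    """Yield tuples of num consecutive elements; a shorter final chunk is dropped."""
    iterator = iter(iterable)
    # zip pulls from the same iterator num times per output tuple.
    return zip(*[iterator] * num)


def chunk_keep_remainder(iterable, num):
    """Like chunk_by_number, but also yields a shorter final chunk."""
    iterator = iter(iterable)
    while True:
        chunk = tuple(islice(iterator, num))
        if not chunk:
            return
        yield chunk


# Seven pairs: two full groups of three plus one leftover pair.
pairs = [("a", "1"), ("b", "2"), ("c", "3"),
         ("a", "4"), ("b", "5"), ("c", "6"),
         ("a", "7")]

full = [dict(chunk) for chunk in chunk_by_number(pairs, 3)]
with_tail = [dict(chunk) for chunk in chunk_keep_remainder(pairs, 3)]
```

Here `full` contains only the two complete records, while `with_tail` also contains the partial record built from the leftover pair. Which behavior is preferable depends on whether an incomplete record at the end of a page is still useful.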


2016-01-24 22:27:52
