Export scraped data in multiple formats using scrapy

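I'm scraping a website to export the data into a semantic format (n3). However, I also want to run some data analysis on that data, so having it in CSV format as well would be more convenient.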

To get the data in both formats I can run:

scrapy crawl spider -t n3 -o data.n3
scrapy crawl spider -t csv -o data.csv

However, this scrapes the data twice, and I cannot afford that with large amounts of data.

Is there a way to export the same scraped data into multiple formats (without downloading the data more than once)?

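It would be interesting to have an intermediate representation of the scraped data that could then be exported into different formats. But there seems to be no way to do this with scrapy.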
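One way to approximate such an intermediate representation is to crawl once into JSON Lines (for example `scrapy crawl spider -o data.jl`, one JSON object per line) and derive the other formats offline. The sketch below covers only the CSV half and assumes flat items (no nested lists or dicts); the function name `jsonlines_to_csv` and the file names are illustrative, not part of Scrapy. It reads the file in two passes rather than loading it into memory, since the data set is large.

```python
import csv
import json


def jsonlines_to_csv(jl_path, csv_path):
    """Convert a JSON Lines file of flat items into CSV; return the row count."""
    # Pass 1: collect the union of field names, in first-seen order.
    fieldnames = []
    with open(jl_path, encoding="utf-8") as src:
        for line in src:
            if line.strip():
                for key in json.loads(line):
                    if key not in fieldnames:
                        fieldnames.append(key)

    # Pass 2: stream the items into the CSV file one row at a time.
    count = 0
    with open(jl_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for line in src:
            if line.strip():
                # Fields missing from an item are written as empty cells.
                writer.writerow(json.loads(line))
                count += 1
    return count
```

A second converter from the same `.jl` file to n3 would follow the same pattern, so the site is only downloaded once.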


2015-06-24 kiril

As alecxe suggested, I posted this on scrapy's GitHub: https://github.com/scrapy/scrapy/issues/1336 – kiril

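From what I understand after exploring the source code and the documentation, the -t option refers to the FEED_FORMAT setting, which cannot have multiple values. Also, the built-in FeedExporter extension (source) works with a single exporter only.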

In fact, consider filing a feature request on the Scrapy issue tracker.

As more of a workaround, define a pipeline and export through several exporters at once. For example, here is how to export to both CSV and JSON:

from collections import defaultdict

from scrapy import signals
from scrapy.exporters import JsonItemExporter, CsvItemExporter


class MyExportPipeline(object):
    """Write every scraped item to both a CSV file and a JSON file."""

    def __init__(self):
        # Open file handles, grouped by spider
        self.files = defaultdict(list)

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        # Set up the exporters when the spider starts, tear them down when it ends
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # The exporters write bytes, so the files are opened in binary mode
        csv_file = open('%s_products.csv' % spider.name, 'w+b')
        json_file = open('%s_products.json' % spider.name, 'w+b')

        self.files[spider].append(csv_file)
        self.files[spider].append(json_file)

        self.exporters = [
            JsonItemExporter(json_file),
            CsvItemExporter(csv_file),
        ]

        for exporter in self.exporters:
            exporter.start_exporting()

    def spider_closed(self, spider):
        # Finish each export first, then close the underlying files
        for exporter in self.exporters:
            exporter.finish_exporting()

        files = self.files.pop(spider)
        for file in files:
            file.close()

    def process_item(self, item, spider):
        # Send the same item to every exporter, then pass it down the pipeline
        for exporter in self.exporters:
            exporter.export_item(item)
        return item
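For the pipeline to take effect it also has to be enabled in the project settings. A minimal sketch, assuming the class lives in a hypothetical module `myproject/pipelines.py` (adjust the dotted path to wherever the class actually is):

```python
# settings.py
# "myproject.pipelines" is a hypothetical module path used for illustration.
ITEM_PIPELINES = {
    # The integer sets the run order among pipelines (0-1000, lower runs first).
    "myproject.pipelines.MyExportPipeline": 300,
}
```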


2015-06-24 18:02:09 alecxe

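A feature request for sure. This is a suitable workaround, but I would like to be able to configure the whole export through parameters, so that I only have to edit 'settings.py' to change the export configuration. – kiril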
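Scrapy releases from 2.1 onward added a FEEDS setting that accepts several output targets, which appears to cover this settings-only configuration for the built-in formats. A sketch with placeholder file names; a format that is not built in, such as n3, would presumably still need its own exporter class registered through FEED_EXPORTERS:

```python
# settings.py (assumes Scrapy 2.1 or newer)
# Each key is an output URI; the file names here are placeholders.
FEEDS = {
    "data.json": {"format": "json"},
    "data.csv": {"format": "csv"},
}
```

With this in place a single crawl writes both files, and changing the set of outputs only requires editing this dictionary.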
