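Can't get Scrapy to create an export file when run from a script
I want to run scrapy from a script, but no data gets exported.
I have tried to get the file to export in two different ways:
- With a pipeline
- With a feed exporter.
Both ways work when I run scrapy from the command line, but neither works when I run scrapy from a script.
I am not alone with this problem. Here are two other similar unanswered questions. I didn't notice these until after I had posted my question.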
- JSON not working in scrapy when calling spider through a python script?
- Calling scrapy from a python script not creating JSON output file
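Here is my code to run scrapy from a script. It includes the settings for writing the output file with both the pipeline and the feed exporter.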
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.xlib.pydispatch import dispatcher
import logging
from external_links.spiders.test import MySpider
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
#manually set settings here
settings.set('ITEM_PIPELINES',{'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline':200},priority='cmdline')
settings.set('DEPTH_LIMIT',1,priority='cmdline')
settings.set('LOG_FILE','Log.log',priority='cmdline')
settings.set('FEED_URI','output.csv',priority='cmdline')
settings.set('FEED_FORMAT', 'csv',priority='cmdline')
settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline')
settings.set('FEED_STORE_EMPTY',True,priority='cmdline')
def stop_reactor():
    reactor.stop()
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider()
crawler = Crawler(settings)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=logging.DEBUG)
log.msg('reactor running...')
reactor.run()
log.msg('Reactor stopped...')
After I run this code, the log says "Stored csv feed (341 items) in: output.csv", but no output.csv can be found.
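Because `output.csv` is a relative path, one thing worth ruling out is that the file is being written relative to a different working directory than the one being checked. This is only a guess; nothing in the log confirms it. A minimal Python 3 sketch that turns the relative name into an absolute `file://` URI, so the location no longer depends on where the script is launched (the helper name `absolute_feed_uri` is hypothetical):

```python
from pathlib import Path


def absolute_feed_uri(filename):
    """Return an absolute file:// URI for filename, resolved against the current working directory."""
    return Path(filename).resolve().as_uri()


if __name__ == "__main__":
    # Print where a relative 'output.csv' would actually land.
    print("cwd:", Path.cwd())
    print("feed uri:", absolute_feed_uri("output.csv"))
```

The result could then be passed as the `FEED_URI` value in place of the bare `'output.csv'`.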
Here is my feed exporter code:
settings = get_project_settings()
#manually set settings here
settings.set('ITEM_PIPELINES', {'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline': 200},priority='cmdline')
settings.set('DEPTH_LIMIT',1,priority='cmdline')
settings.set('LOG_FILE','Log.log',priority='cmdline')
settings.set('FEED_URI','output.csv',priority='cmdline')
settings.set('FEED_FORMAT', 'csv',priority='cmdline')
settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline')
settings.set('FEED_STORE_EMPTY',True,priority='cmdline')
from scrapy.contrib.exporter import CsvItemExporter
class CsvOptionRespectingItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        delimiter = settings.get('CSV_DELIMITER', ',')
        kwargs['delimiter'] = delimiter
        super(CsvOptionRespectingItemExporter, self).__init__(*args, **kwargs)
Here is my pipeline code:
import csv
class CsvWriterPipeline(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('items2.csv', 'wb'))
    def process_item(self, item, spider): #item needs to be second in this list otherwise get spider object
        self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
        return item
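The pipeline above never closes `items2.csv`, so buffered rows may not reach disk if the process ends abruptly. Whether that explains the missing file here is an assumption. A sketch in Python 3 that uses Scrapy's `open_spider` and `close_spider` pipeline hooks to open and close the file explicitly (the class name `ClosingCsvWriterPipeline` and the `path` parameter are hypothetical):

```python
import csv


class ClosingCsvWriterPipeline(object):
    def __init__(self, path='items2.csv'):
        # path defaults to the file name used in the original pipeline.
        self.path = path
        self.file = None
        self.csvwriter = None

    def open_spider(self, spider):
        # newline='' lets the csv module control line endings (Python 3).
        self.file = open(self.path, 'w', newline='')
        self.csvwriter = csv.writer(self.file)

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
        return item

    def close_spider(self, spider):
        # Closing the file flushes any buffered rows to disk.
        self.file.close()
```

Keeping a reference to the file object, rather than passing `open(...)` straight to `csv.writer`, is what makes the explicit close possible.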
Did you ever figure out a solution? – ccdpowell