
I can't get Scrapy to create an export file when running it from a script. I want to run Scrapy from a script, but no data gets exported.

I have tried to get the file exported in two different ways:

  1. With a pipeline
  2. With a feed exporter

Both approaches work when I run Scrapy from the command line, but neither works when I run Scrapy from a script.

I'm not alone with this problem. There are two other similar, unanswered questions, which I only noticed after posting this one:

  1. JSON not working in scrapy when calling spider through a python script?
  2. Calling scrapy from a python script not creating JSON output file

Here is the code I use to run Scrapy from a script. It includes the settings for writing the output file with both the pipeline and the feed exporter.

from twisted.internet import reactor 

from scrapy import log, signals 
from scrapy.crawler import Crawler 
from scrapy.xlib.pydispatch import dispatcher 
import logging 

from external_links.spiders.test import MySpider 
from scrapy.utils.project import get_project_settings 
settings = get_project_settings() 

#manually set settings here 
settings.set('ITEM_PIPELINES',{'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline':200},priority='cmdline') 
settings.set('DEPTH_LIMIT',1,priority='cmdline') 
settings.set('LOG_FILE','Log.log',priority='cmdline') 
settings.set('FEED_URI','output.csv',priority='cmdline') 
settings.set('FEED_FORMAT', 'csv',priority='cmdline') 
settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline') 
settings.set('FEED_STORE_EMPTY',True,priority='cmdline') 

def stop_reactor(): 
    reactor.stop() 

dispatcher.connect(stop_reactor, signal=signals.spider_closed) 
spider = MySpider() 
crawler = Crawler(settings) 
crawler.configure() 
crawler.crawl(spider) 
crawler.start() 
log.start(loglevel=logging.DEBUG) 
log.msg('reactor running...') 
reactor.run() 
log.msg('Reactor stopped...') 

After I run this code, the log says: "Stored csv feed (341 items) in: output.csv", but no output.csv can be found.
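Since the log claims the feed was stored, one thing worth checking (my suggestion, not part of the original post) is whether the relative FEED_URI was resolved against a different working directory than the one being inspected. A minimal sketch:

import os

# Where a relative FEED_URI like 'output.csv' gets resolved from
# depends on the process's current working directory:
print(os.getcwd())
print(os.path.abspath('output.csv'))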

Here is my feed exporter code:

settings = get_project_settings() 

#manually set settings here 
settings.set('ITEM_PIPELINES', {'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline': 200},priority='cmdline') 
settings.set('DEPTH_LIMIT',1,priority='cmdline') 
settings.set('LOG_FILE','Log.log',priority='cmdline') 
settings.set('FEED_URI','output.csv',priority='cmdline') 
settings.set('FEED_FORMAT', 'csv',priority='cmdline') 
settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline') 
settings.set('FEED_STORE_EMPTY',True,priority='cmdline') 


from scrapy.contrib.exporter import CsvItemExporter 


class CsvOptionRespectingItemExporter(CsvItemExporter): 

    def __init__(self, *args, **kwargs):
        delimiter = settings.get('CSV_DELIMITER', ',')
        kwargs['delimiter'] = delimiter
        super(CsvOptionRespectingItemExporter, self).__init__(*args, **kwargs)
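As written, this exporter reads its delimiter from a CSV_DELIMITER setting. For illustration (the setting name comes from the exporter above; the value is my example), a semicolon-separated file could be requested the same way the other settings are set:

settings.set('CSV_DELIMITER', ';', priority='cmdline')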

Here is my pipeline code:

import csv


class CsvWriterPipeline(object):

    def __init__(self):
        self.csvwriter = csv.writer(open('items2.csv', 'wb'))

    def process_item(self, item, spider):  # item needs to be second in this list, otherwise we get the spider object
        self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
        return item
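One caveat about this pipeline (my observation, not from the original post): the file handle is never closed, so buffered rows may be lost if the process does not exit cleanly. Scrapy pipelines provide open_spider/close_spider hooks for managing the file; a minimal sketch:

import csv


class CsvWriterPipeline(object):

    def open_spider(self, spider):
        # Open the file when the spider starts ...
        self.csvfile = open('items2.csv', 'wb')
        self.csvwriter = csv.writer(self.csvfile)

    def close_spider(self, spider):
        # ... and close it when the spider finishes, flushing any buffers.
        self.csvfile.close()

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
        return item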

Did you ever figure out a solution? – ccdpowell

Answer


I had the same problem.

Here is what worked for me:

  1. Put the export URI in settings.py:

    FEED_URI='file:///tmp/feeds/filename.jsonlines'

  2. Create a scrape.py script next to scrapy.cfg, with the following contents:

    
    from scrapy.crawler import CrawlerProcess 
    from scrapy.utils.project import get_project_settings 
    
    
    process = CrawlerProcess(get_project_settings()) 
    
    process.crawl('yourspidername') #'yourspidername' is the name of one of the spiders of the project. 
    process.start() # the script will block here until the crawling is finished 
    
    
  3. Run: python scrape.py

Result: the file gets created.
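For what it's worth (my addition, assuming Scrapy 1.0+), process.crawl also accepts the spider class directly, which avoids depending on the spider's name string:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from external_links.spiders.test import MySpider

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)  # pass the class instead of the name
process.start()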

Note: I don't have pipelines in my project, so I can't say whether a pipeline would filter your results or not.
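To illustrate what such filtering would look like (a hypothetical sketch; the actual rule in your FilterPipeline is unknown to me): a pipeline drops items by raising DropItem, and dropped items never reach the feed exporter or later pipelines:

from scrapy.exceptions import DropItem


class FilterPipeline(object):

    def process_item(self, item, spider):
        # Hypothetical rule: discard items that collected no links.
        if not item.get('all_links'):
            raise DropItem('no links found in %s' % item)
        return item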

Edit: it was the Common Pitfalls section in the docs that put me on the right track.
