Running 2 consecutive Scrapy CrawlerProcess from a script with different settings

I have 2 different Scrapy spiders that currently work when launched with:

scrapy crawl spidername -o data\whatever.json

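Of course, I know I could use a system call from the script to replicate that command, but I would prefer to stick to CrawlerProcess, or any other way of making it work from a script.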
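For reference, the system-call route mentioned above could look roughly like the sketch below. It is only an illustration: the helper names build_crawl_command and run_spiders are made up for this example, and it assumes the scrapy executable is on the PATH and the script is run from inside the Scrapy project directory.

```python
import subprocess


def build_crawl_command(spider_name, output_path):
    """Return the argv list equivalent to `scrapy crawl <name> -o <path>`."""
    return ['scrapy', 'crawl', spider_name, '-o', output_path]


def run_spiders(jobs):
    """Run each (spider_name, output_path) pair in its own OS process, in order.

    Every crawl gets a fresh Python interpreter, so nothing is shared
    between consecutive runs.
    """
    for spider_name, output_path in jobs:
        # check=True raises CalledProcessError if scrapy exits with a non-zero status.
        subprocess.run(build_crawl_command(spider_name, output_path), check=True)
```

Calling run_spiders with two pairs, each a spider name and its own output path, would then write two separate files, one crawl after the other.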

The thing is: as I read in this SO question (and also in the Scrapy docs), I have to set the output file in the settings passed to the CrawlerProcess constructor:

process = CrawlerProcess({ 
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)', 
'FEED_FORMAT': 'json', 
'FEED_URI': 'data.json' 
})

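The problem is that I don't want both spiders to store their data in the same output file, but in two different files. So my first attempt was, obviously, to create a new CrawlerProcess with different settings once the first job is done: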

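import os 
import time 
from datetime import datetime 

from scrapy.crawler import CrawlerProcess 

# MyFirstSpider and MyOtherSpider are the project's spider classes (import not shown) 

session_date_format = '%Y%m%d' 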
session_date = datetime.now().strftime(session_date_format) 

try: 
    process = CrawlerProcess({ 
     'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)', 
     'FEED_FORMAT': 'json', 
     'FEED_URI': os.path.join('data', 'an_origin', '{}.json'.format(session_date)), 
     'DOWNLOAD_DELAY': 3, 
     'LOG_STDOUT': True, 
     'LOG_FILE': 'scrapy_log.txt', 
     'ROBOTSTXT_OBEY': False, 
     'RETRY_ENABLED': True, 
     'RETRY_HTTP_CODES': [500, 503, 504, 400, 404, 408], 
     'RETRY_TIMES': 5 
    }) 
    process.crawl(MyFirstSpider) 
    process.start() # the script will block here until the crawling is finished 
except Exception as e: 
    print('ERROR while crawling: {}'.format(e)) 
else: 
    print('Data successfuly crawled') 

time.sleep(3) # Wait 3 seconds 

try: 
    process = CrawlerProcess({ 
     'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)', 
     'FEED_FORMAT': 'json', 
     'FEED_URI': os.path.join('data', 'other_origin', '{}.json'.format(session_date)), 
     'DOWNLOAD_DELAY': 3, 
     'LOG_STDOUT': True, 
     'LOG_FILE': 'scrapy_log.txt', 
     'ROBOTSTXT_OBEY': False, 
     'RETRY_ENABLED': True, 
     'RETRY_HTTP_CODES': [500, 503, 504, 400, 404, 408], 
     'RETRY_TIMES': 5 
    }) 
    process.crawl(MyOtherSpider) 
    process.start() # the script will block here until the crawling is finished 
except Exception as e: 
    print('ERROR while crawling: {}'.format(e)) 
else: 
    print('Data successfuly crawled')

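When I do this, the first crawler works as expected. But the second one creates an empty output file and fails. This also happens if I store the second CrawlerProcess in a different variable, such as process2. Obviously, I tried changing the order of the spiders to check whether this was an issue with one particular spider, but the one that fails is always whichever runs second.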

If I check the log file, after the first job is done it looks as if 2 Scrapy bots are started, so maybe something weird is happening:

2017-05-29 23:51:41 [scrapy.extensions.feedexport] INFO: Stored json feed (2284 items) in: data\one_origin\20170529.json 
2017-05-29 23:51:41 [scrapy.core.engine] INFO: Spider closed (finished) 
2017-05-29 23:51:41 [stdout] INFO: Data successfuly crawled 
2017-05-29 23:51:44 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: scrapybot) 
2017-05-29 23:51:44 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: scrapybot) 
2017-05-29 23:51:44 [scrapy.utils.log] INFO: Overridden settings: {'LOG_FILE': 'scrapy_output.txt', 'FEED_FORMAT': 'json', 'FEED_URI': 'data\\other_origin\\20170529.json', 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)', 'LOG_STDOUT': True, 'RETRY_TIMES': 5, 'RETRY_HTTP_CODES': [500, 503, 504, 400, 404, 408], 'DOWNLOAD_DELAY': 3} 
2017-05-29 23:51:44 [scrapy.utils.log] INFO: Overridden settings: {'LOG_FILE': 'scrapy_output.txt', 'FEED_FORMAT': 'json', 'FEED_URI': 'data\\other_origin\\20170529.json', 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)', 'LOG_STDOUT': True, 'RETRY_TIMES': 5, 'RETRY_HTTP_CODES': [500, 503, 504, 400, 404, 408], 'DOWNLOAD_DELAY': 3} 
... 
2017-05-29 23:51:44 [scrapy.core.engine] INFO: Spider opened 
2017-05-29 23:51:44 [scrapy.core.engine] INFO: Spider opened 
2017-05-29 23:51:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-05-29 23:51:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-05-29 23:51:44 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024 
2017-05-29 23:51:44 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024 
2017-05-29 23:51:44 [stdout] INFO: ERROR while crawling: 
2017-05-29 23:51:44 [stdout] INFO: ERROR while crawling:
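The empty text after "ERROR while crawling:" means the caught exception has an empty string form, so '{}'.format(e) hides what went wrong. A frequent cause of this symptom when a second CrawlerProcess is started is Twisted's ReactorNotRestartable, which carries no message, but the log above does not confirm that. The sketch below uses a local stand-in class (not the real Twisted one) to show how printing repr(e) would reveal the exception type; the describe helper is a made-up name.

```python
class ReactorNotRestartable(Exception):
    """Local stand-in for twisted.internet.error.ReactorNotRestartable."""


def describe(exc):
    """Format the error with repr(), which keeps the class name even if the message is empty."""
    return 'ERROR while crawling: {!r}'.format(exc)


try:
    raise ReactorNotRestartable()  # an exception without a message, as in the log
except Exception as e:
    terse = 'ERROR while crawling: {}'.format(e)  # the format used in the question
    verbose = describe(e)

print(terse)    # ERROR while crawling: 
print(verbose)  # ERROR while crawling: ReactorNotRestartable()
```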

Any ideas about what is happening here and how to fix it?

Source

2017-05-30 Roman Rdgz

You can have a look at my answer at https://stackoverflow.com/a/42512653/2572383, which uses 'CrawlerRunner'
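The linked answer is not reproduced here, but the general CrawlerRunner pattern from the Scrapy docs can be sketched as follows: start the Twisted reactor once, and chain the crawls so the second begins only after the first has finished, each with its own settings. This is a minimal sketch, not a definitive implementation. The names BASE_SETTINGS, settings_for and crawl_sequentially are made up for the example, and it targets the Scrapy 1.x API used in the question. Newer Scrapy releases replace FEED_URI/FEED_FORMAT with the FEEDS setting and may need a different reactor setup.

```python
import os

# Settings shared by both crawls, copied from the question.
BASE_SETTINGS = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'DOWNLOAD_DELAY': 3,
    'ROBOTSTXT_OBEY': False,
    'RETRY_ENABLED': True,
    'RETRY_HTTP_CODES': [500, 503, 504, 400, 404, 408],
    'RETRY_TIMES': 5,
}


def settings_for(origin, session_date):
    """Return a copy of the shared settings with a per-origin output file."""
    settings = dict(BASE_SETTINGS)
    settings['FEED_URI'] = os.path.join('data', origin, '{}.json'.format(session_date))
    return settings


def crawl_sequentially(jobs):
    """Run (spider_class, settings) pairs one after another on a single reactor."""
    # Imported here so that Twisted's reactor is only installed when a crawl is requested.
    from twisted.internet import defer, reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()

    @defer.inlineCallbacks
    def run_all():
        try:
            for spider_cls, settings in jobs:
                # A fresh CrawlerRunner per job carries that job's settings;
                # the yield waits until this crawl has finished.
                yield CrawlerRunner(settings).crawl(spider_cls)
        finally:
            reactor.stop()  # stop once, after the last crawl or on failure

    reactor.callWhenRunning(run_all)
    reactor.run()  # started exactly once; blocks until reactor.stop()
```

With the spiders from the question, the call would be crawl_sequentially([(MyFirstSpider, settings_for('an_origin', session_date)), (MyOtherSpider, settings_for('other_origin', session_date))]). Because the reactor is started only once, the restart problem does not arise.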

Put

process.start()

at the very end of your script, and all your spiders will run at the same time.

PS: I have already done something like this.

Here is a small piece of code I can share.

batches = 10 
while batches > 0: 
    process = CrawlerProcess(SETTINGS HERE) 
    process.crawl(AmazonSpider()) 
    batches = batches - 1 

process.start() # then finally run your Spiders.

Source

2017-05-30 09:15:31 Umair

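If I do that, I can't give each spider different settings, which I need to do if I want different output files

@RomanRdgz see the edited code... I think you can set the settings like this – Umair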
