
Let me just mention one of the questions I had already looked at before posting this one (I don't currently have links to all of them): Passing arguments to a Scrapy spider from a Python script.

I am able to run this code perfectly well if I don't pass the arguments and instead ask for the input from the user inside the BBSpider class itself (without the main function, just below the name = "dmoz" line), or if I supply them as pre-defined (i.e., static) arguments.

My code is here.

I am basically trying to execute a Scrapy spider from a Python script without needing any extra files (not even a settings file). That is why I have specified the settings inside the code itself as well.
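For context, a single-file setup of the kind described above, reconstructed to match the output below, would look roughly like this. This is my sketch, not the linked code: the class internals, the print placement, and the parse stub are assumptions inferred from the question and the tracebacks, and the one deliberate flaw (passing a spider instance to crawl) is exactly what the accepted answer at the bottom corrects.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

class BBSpider(scrapy.Spider):
    name = "dmoz"

    def __init__(self, *args, **kwargs):
        super(BBSpider, self).__init__(*args, **kwargs)
        print(kwargs.get('start_url'))              # the lone print statement
        self.start_urls = [kwargs.get('start_url')]

    def parse(self, response):
        pass  # real parsing logic omitted

def main():
    url = 'http://bigbasket.com/ps/?q=apple'
    spider = BBSpider(start_url=url)     # first instantiation: prints the URL
    process = CrawlerProcess(Settings()) # settings live in code, no settings file
    process.crawl(spider)                # problematic call; see the answer below
    process.start()

if __name__ == '__main__':
    main()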

This is the output I get on executing the script:

http://bigbasket.com/ps/?q=apple 
2015-06-26 12:12:34 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot) 
2015-06-26 12:12:34 [scrapy] INFO: Optional features available: ssl, http11 
2015-06-26 12:12:34 [scrapy] INFO: Overridden settings: {} 
2015-06-26 12:12:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
None 
2015-06-26 12:12:35 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2015-06-26 12:12:35 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2015-06-26 12:12:35 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:12:35 [scrapy] INFO: Spider opened 
2015-06-26 12:12:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2015-06-26 12:12:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2015-06-26 12:12:35 [scrapy] ERROR: Error while obtaining start requests 
Traceback (most recent call last): 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request 
    request = next(slot.start_requests) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests 
    yield self.make_requests_from_url(url) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url 
    return Request(url, dont_filter=True) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__ 
    self._set_url(url) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 57, in _set_url 
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__) 
TypeError: Request url must be str or unicode, got NoneType: 
2015-06-26 12:12:35 [scrapy] INFO: Closing spider (finished) 
2015-06-26 12:12:35 [scrapy] INFO: Dumping Scrapy stats: 
{'finish_reason': 'finished', 
'finish_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 342543), 
'log_count/DEBUG': 1, 
'log_count/ERROR': 1, 
'log_count/INFO': 7, 
'start_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 339158)} 
2015-06-26 12:12:35 [scrapy] INFO: Spider closed (finished) 

The problems I am currently facing:

  • If you look carefully at line 1 and line 6 of my output, the start_url gets printed twice, even though I have written the print statement only once, on line 31 of my code (linked above). Why is that happening, and with different values too: the print on line 1 of the output gives the correct result, while the one on line 6 of the output gives None?
  • Not only that, why do I get this line in my output: TypeError: Request url must be str or unicode, got NoneType: (even though the question I linked above does exactly the same thing)? I don't know how to fix it. I even tried `self.start_urls = [str(kwargs.get('start_url'))]`, but then it gives the following output:
http://bigbasket.com/ps/?q=apple 
2015-06-26 12:28:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot) 
2015-06-26 12:28:00 [scrapy] INFO: Optional features available: ssl, http11 
2015-06-26 12:28:00 [scrapy] INFO: Overridden settings: {} 
2015-06-26 12:28:00 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
None 
2015-06-26 12:28:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2015-06-26 12:28:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2015-06-26 12:28:01 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:28:01 [scrapy] INFO: Spider opened 
2015-06-26 12:28:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2015-06-26 12:28:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2015-06-26 12:28:01 [scrapy] ERROR: Error while obtaining start requests 
Traceback (most recent call last): 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request 
    request = next(slot.start_requests) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests 
    yield self.make_requests_from_url(url) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url 
    return Request(url, dont_filter=True) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__ 
    self._set_url(url) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url 
    raise ValueError('Missing scheme in request url: %s' % self._url) 
ValueError: Missing scheme in request url: None 
2015-06-26 12:28:01 [scrapy] INFO: Closing spider (finished) 
2015-06-26 12:28:01 [scrapy] INFO: Dumping Scrapy stats: 
{'finish_reason': 'finished', 
'finish_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 248350), 
'log_count/DEBUG': 1, 
'log_count/ERROR': 1, 
'log_count/INFO': 7, 
'start_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 236056)} 
2015-06-26 12:28:01 [scrapy] INFO: Spider closed (finished) 

Please help me resolve the above 2 errors.


Have you checked this answer? [How to run Scrapy from within a Python script](http://stackoverflow.com/questions/13437402/how-to-run-scrapy-from-within-a-python-script) – eLRuLL


@eLRuLL: Yes, I have checked it. First of all, it makes no mention of what changes need to be made to the spider class (which is the main core of my problem; both of the issues I listed lie in exactly that part of the code). Secondly, it does exactly the same thing I am doing when I invoke the spider's crawl (if you look at my code). Please let me know how to solve this! Thanks! –

Answer


You need to pass the arguments on the crawl method of the CrawlerProcess, so you need to run it like this:

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

crawler = CrawlerProcess(Settings())
crawler.crawl(BBSpider, start_url=url)  # keyword arguments reach the spider's __init__
crawler.start()
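Note that crawl receives the spider class here, not an instance: Scrapy constructs the spider itself and forwards the extra keyword arguments to its __init__, so self.start_urls picks up the real URL instead of None.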

Thanks, it really works. Just one clarification and one doubt. Why did problem 1 happen (why was it printed twice)? And the doubt: if I want to execute 2 spiders in parallel using the multiprocessing library, can I pass a queue like this, have the spiders call queue.put(items), and then finally fetch the spiders' output in the main script with queue.get()? Is it possible to do that? Could you give me sample code for this? It would be really great, and I would be very grateful, if you could provide that code. Thanks! –


Well, the duplicate print happens because you instantiated the spider object before calling the crawler, so that was the first print, and then you passed that spider instance to the crawler, where it didn't receive any arguments, which caused the second print. As for the second question, I think it may be possible, but I don't have an example right now, sorry. – eLRuLL
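To make the mechanism described in this comment concrete, here is a minimal, hypothetical demonstration; it reflects Scrapy 1.0 (the version in the logs above), where passing a spider instance to crawl silently causes a second, argument-less instantiation. Newer Scrapy versions reject spider instances outright, so this is illustrative only.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

class DemoSpider(scrapy.Spider):
    name = "demo"

    def __init__(self, *args, **kwargs):
        super(DemoSpider, self).__init__(*args, **kwargs)
        print(kwargs.get('start_url'))  # runs once per instantiation

spider = DemoSpider(start_url='http://example.com')  # print #1: the real URL
process = CrawlerProcess(Settings())
process.crawl(spider)  # Scrapy 1.0 builds a fresh DemoSpider from the class with
                       # no kwargs, so __init__ runs again: print #2 shows None
process.start()        # no start_urls were set, so the crawl finishes empty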


Thank you so much for your reply; you have cleared up my doubt. For the second part, could you provide the multiprocessing code (using Python's multiprocessing library) to run 2 spiders of the same BBSpider class with 2 different start_urls? I tried it, but it gave me some strange errors. It would be awesome, and I would be really grateful, if you could provide that code. Thanks! –
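Since no example was given in the thread, here is a rough, untested sketch of the pattern being asked about. Everything in it is an assumption: BBSpider is taken to accept a queue keyword argument and to call self.queue.put(item) from its parse callback, the import path and the second search URL are made up, and each spider runs in its own OS process so that each gets a fresh Twisted reactor (a reactor cannot be restarted within one process).

import multiprocessing

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

from bbspider import BBSpider  # hypothetical module holding the spider class

def run_spider(url, queue):
    # One CrawlerProcess, and therefore one Twisted reactor, per child process.
    process = CrawlerProcess(Settings())
    process.crawl(BBSpider, start_url=url, queue=queue)
    process.start()  # blocks until this spider finishes

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    urls = [
        'http://bigbasket.com/ps/?q=apple',
        'http://bigbasket.com/ps/?q=banana',  # hypothetical second search
    ]
    workers = [multiprocessing.Process(target=run_spider, args=(url, queue))
               for url in urls]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    # Collect whatever the spiders put on the queue from their parse callbacks.
    while not queue.empty():
        print(queue.get())

One caveat with this shape: a child process will not exit while unconsumed data sits in its queue buffer, so for anything beyond small result sets the queue should be drained before (or while) joining the workers.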