2013-07-04 64 views
6

I have designed a crawler that contains two spiders, both built with Scrapy.
The spiders run independently, pulling their input from a database.

We run these spiders using the Twisted reactor, and as we know, the reactor cannot be restarted once it has stopped. The second spider is given roughly 500+ links to crawl, and when we run it this way we hit a port error, i.e. Scrapy keeps binding to a single port:

Error caught on signal handler: <bound method ?.start_listening of <scrapy.telnet.TelnetConsole instance at 0x0467B440>>
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1070, in _inlineCallbacks
    result = g.send(result)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\core\engine.py", line 75, in start
    yield self.signals.send_catch_log_deferred(signal=signals.engine_started)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\signalmanager.py", line 23, in send_catch_log_deferred
    return signal.send_catch_log_deferred(*a, **kw)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\utils\signal.py", line 53, in send_catch_log_deferred
    *arguments, **named)
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 137, in maybeDeferred
    result = f(*args, **kw)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\xlib\pydispatch\robustapply.py", line 47, in robustApply
    return receiver(*arguments, **named)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\telnet.py", line 47, in start_listening
    self.port = listen_tcp(self.portrange, self.host, self)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\utils\reactor.py", line 14, in listen_tcp
    return reactor.listenTCP(x, factory, interface=host)
  File "C:\Python27\lib\site-packages\twisted\internet\posixbase.py", line 489, in listenTCP
    p.startListening()
  File "C:\Python27\lib\site-packages\twisted\internet\tcp.py", line 980, in startListening
    raise CannotListenError(self.interface, self.port, le)
twisted.internet.error.CannotListenError: Couldn't listen on 0.0.0.0:6073: [Errno 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted.

So what is going wrong here, and what is the best way to handle this situation? Please help...

P.S.: I have already increased the number of ports, but it always ends up at 6073.
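For context, the telnet and web consoles each bind the first free port in a configurable range (the traceback above shows the telnet console exhausting its range at 6073). A minimal settings.py sketch, assuming the Scrapy 0.16 setting names; the exact ranges shown are illustrative:

    # settings.py -- a sketch; widen the ranges or disable the consoles
    # so that many concurrent Crawler instances do not collide on ports
    TELNETCONSOLE_ENABLED = False      # or True with a wider range below
    TELNETCONSOLE_PORT = [6023, 6073]  # [min, max]; first free port is used

    WEBSERVICE_ENABLED = False
    WEBSERVICE_PORT = [6080, 7030]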

+0

Can you show how you are running your spiders and how you configure them? – alecxe

+1

This is a duplicate of http://stackoverflow.com/questions/1767553/twisted-errors-in-scrapy-spider –

+0

@Jean-PaulCalderone No, it is not the same; I have disabled the web and telnet consoles, but it still shows the same error. – sathish

Answer

1

Your problem can be solved by running fewer concurrent crawlers. Below is a recipe I wrote for issuing crawls sequentially. This particular class only runs a single crawler at a time, but the modifications needed to make it run them in batches (e.g. 10 at a time) are trivial.

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings


class SequentialCrawlManager(object):
    """Start spiders sequentially, one Crawler instance per website."""

    def __init__(self, spider, websites):
        self.spider = spider
        self.websites = websites
        # load the project settings once; a fresh Crawler is created per site
        self.settings = get_project_settings()
        self.current_site_idx = 0

    def next_site(self):
        if self.current_site_idx < len(self.websites):
            self.crawler = Crawler(self.settings)
            # wait for one spider to finish before starting the next one
            self.crawler.signals.connect(self.next_site,
                                         signal=signals.spider_closed)
            self.crawler.configure()
            spider = self.spider()  # pass per-site arguments here if desired
            self.crawler.crawl(spider)
            self.crawler.start()
            self.current_site_idx += 1
        else:
            reactor.stop()  # required for the program to terminate

    def start(self):
        log.start()
        self.next_site()
        reactor.run()  # blocking call
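
A usage sketch under the same assumptions (MySpider and the URL list are placeholders, not part of the original answer):

    # hypothetical driver code; MySpider and the URLs are assumptions
    manager = SequentialCrawlManager(MySpider, ['http://example.com',
                                                'http://example.org'])
    manager.start()  # blocks until every site has been crawled

Because only one Crawler runs at a time, only one telnet/web console port is held open at any moment, which avoids exhausting the port range.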