exited: scrapy (exit status 0; not expected)

I am trying to run a bash script that starts several spiders inside my Docker container. My supervisor.conf, placed in "/etc/supervisor/conf.d/", looks like this:

[program:scrapy]                
command=/tmp/start_spider.sh 
autorestart=false 
startretries=0 
stderr_logfile=/tmp/start_spider.err.log 
stdout_logfile=/tmp/start_spider.out.log 

But supervisord returns this error:

2015-08-21 10:50:30,466 CRIT Supervisor running as root (no user in config file)
2015-08-21 10:50:30,466 WARN Included extra file "/etc/supervisor/conf.d/tor.conf" during parsing
2015-08-21 10:50:30,478 INFO RPC interface 'supervisor' initialized
2015-08-21 10:50:30,478 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2015-08-21 10:50:30,478 INFO supervisord started with pid 5
2015-08-21 10:50:31,481 INFO spawned: 'scrapy' with pid 8
2015-08-21 10:50:31,555 INFO exited: scrapy (exit status 0; not expected)
2015-08-21 10:50:32,557 INFO gave up: scrapy entered FATAL state, too many start retries too quickly

And my program stops running. But if I run the script manually, it works perfectly fine...

How can I fix this problem? Any ideas?

What does start_spider.sh look like? – Michael

Answer

I found the solution to my problem. In supervisor.conf, change

[program:scrapy]              
     command=/tmp/start_spider.sh 
     autorestart=false 
     startretries=0 

to:

[program:scrapy] 
command=/bin/bash -c "exec /tmp/start_spider.sh > /dev/null 2>&1 -DFOREGROUND" 
autostart=true 
autorestart=false 
startretries=0 

Here is my code:

start_spider.sh

#!/bin/bash 

# list letter 
parseLetter=('a' 'b') 


# change path 
cd $path/scrapy/scrapyTodo/scrapyTodo 

tLen=${#parseLetter[@]} 
for ((i=0; i<${tLen}; i++)); 
do 
    scrapy crawl root -a alpha=${parseLetter[$i]} & 
done 
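
For completeness: start_spider.sh backgrounds every crawl with "&" and then exits immediately, which is why supervisord reports "exit status 0; not expected" — the child dies before supervisord considers it successfully started, and after the retries it gives up. A possible alternative (just a sketch, not part of my setup, assuming Scrapy 1.0's CrawlerProcess and that it is run from the project directory) is to keep all the crawls inside one foreground process that supervisord can watch directly:

#!/usr/bin/env python
# run_spiders.py - hypothetical replacement for start_spider.sh: run every
# crawl in a single foreground process instead of backgrounding shell jobs.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

parse_letters = ['a', 'b']

process = CrawlerProcess(get_project_settings())
for letter in parse_letters:
    # 'root' is the spider name defined by the studentCrawler class below.
    process.crawl('root', alpha=letter)

# start() blocks until every crawl has finished, keeping the process in the
# foreground for supervisord.
process.start()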

And here is my scrapy code:

#!/usr/bin/python -tt 
# -*- coding: utf-8 -*- 

from scrapy.selector import Selector 
from elasticsearch import Elasticsearch 
from scrapy.contrib.spiders import CrawlSpider 
from scrapy.http import Request 
from urlparse import urljoin 
from bs4 import BeautifulSoup 
from scrapy.spider import BaseSpider 
from tools import sendEmail 
from tools import ElasticAction 
from tools import runlog 
from scrapy import signals 
from scrapy.xlib.pydispatch import dispatcher 
from datetime import datetime 
import re 

class studentCrawler(BaseSpider): 
    # Crawling Start 
    CrawlSpider.started_on = datetime.now() 

    name = "root" 


    DOWNLOAD_DELAY = 0 

    allowed_domains = ['website.com'] 

    ES_Index = "website" 
    ES_Type = "root" 
    ES_Ip = "127.0.0.1" 

    child_type = "level1" 

    handle_httpstatus_list = [404, 302, 503, 999, 200] #add any other code you need 

    es = ElasticAction(ES_Index, ES_Type, ES_Ip) 

    # Init 
    def __init__(self, alpha=''): 

     base_domain = 'https://www.website.com/directory/student-' + str(alpha) + "/" 

     self.start_urls = [base_domain] 
     super(studentCrawler, self).__init__(self.start_urls)


    def is_empty(self, any_structure): 
     """ 
     Function that allow to check if the data is empty or not 
     :arg any_structure: any data 
     """ 
     if any_structure: 
      return 1 
     else: 
      return 0 

    def parse(self, response): 
     """ 
     main method that parse the web page 
     :param response: 
     :return: 
     """ 

     if response.status == 404: 
      self.es.insertIntoES(response.url, "False") 
     if str(response.status) == "503": 
      self.es.insertIntoES(response.url, "False") 
     if response.status == 999: 
      self.es.insertIntoES(response.url, "False") 

     if str(response.status) == "200": 
      # Selector 
      sel = Selector(response) 

      self.es.insertIntoES(response.url, "True") 
      body = self.getAllTheUrl(u''.join(sel.xpath(".//*[@id='seo-dir']/div/div[3]").extract()).strip(), response.url)


    def getAllTheUrl(self, data, parent_id): 
     dictCompany = dict() 
     soup = BeautifulSoup(data,'html.parser') 
     for a in soup.find_all('a', href=True): 
      self.es.insertChildAndParent(self.child_type, str(a['href']), "False", parent_id) 
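
The tools module imported at the top (sendEmail, ElasticAction, runlog) is not shown here. For reference, the ElasticAction calls above boil down to something like the following — a simplified sketch using the official elasticsearch-py client, not the exact module; the method and field names simply mirror the calls made by the spider:

# tools.py (illustrative sketch only): thin wrapper around elasticsearch-py
# that stores crawl results.
from elasticsearch import Elasticsearch

class ElasticAction(object):
    def __init__(self, index, doc_type, ip):
        self.index = index
        self.doc_type = doc_type
        self.es = Elasticsearch([{'host': ip}])

    def insertIntoES(self, url, crawled):
        # Record whether a URL could be crawled.
        self.es.index(index=self.index, doc_type=self.doc_type,
                      body={'url': url, 'crawled': crawled})

    def insertChildAndParent(self, child_type, url, crawled, parent_id):
        # Record a child link together with the page it was found on.
        self.es.index(index=self.index, doc_type=child_type,
                      body={'url': url, 'crawled': crawled, 'parent': parent_id})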

I found that BeautifulSoup does not work when the spider is launched by supervisor. ...
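
If BeautifulSoup is the part that misbehaves under supervisord, the same links can be pulled out with Scrapy's own selectors instead. An untested sketch of that variant (shown only as a possible workaround, replacing parse()/getAllTheUrl() in the spider above):

    # Hypothetical BeautifulSoup-free variant: take every href inside the
    # same '#seo-dir' block straight from the response.
    def parse(self, response):
        if response.status == 200:
            self.es.insertIntoES(response.url, "True")
            for href in response.xpath(".//*[@id='seo-dir']/div/div[3]//a/@href").extract():
                self.es.insertChildAndParent(self.child_type, str(href), "False", response.url)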