Sending an email with attachments after scraping a website

I am working on a school project using Scrapy to find dead links and missing pages. I have written pipelines that write the relevant scraped information to text files. I am having trouble figuring out how to send an email at the end of the spider run, with the files that were created attached.

Scrapy has built-in email functionality, and it fires a signal when the spider finishes, but how to put it all together is somehow eluding me. Any help would be greatly appreciated.

Here is my pipeline that writes the scraped data to files:

class saveToFile(object):

    def __init__(self):
        # open files
        self.old = open('old_pages.txt', 'wb')
        self.date = open('pages_without_dates.txt', 'wb')
        self.missing = open('missing_pages.txt', 'wb')

        # write table headers
        line = "{0:15} {1:40} {2:} \n\n".format("Domain", "Last Updated", "URL")
        self.old.write(line)

        line = "{0:15} {1:} \n\n".format("Domain", "URL")
        self.date.write(line)

        line = "{0:15} {1:70} {2:} \n\n".format("Domain", "Page Containing Broken Link", "URL of Broken Link")
        self.missing.write(line)

    def process_item(self, item, spider):
        # add items to the files as they are scraped
        if item['group'] == "Old Page":
            line = "{0:15} {1:40} {2:} \n".format(item['domain'], item["lastUpdated"], item["url"])
            self.old.write(line)
        elif item['group'] == "No Date On Page":
            line = "{0:15} {1:} \n".format(item['domain'], item["url"])
            self.date.write(line)
        elif item['group'] == "Page Not Found":
            line = "{0:15} {1:70} {2:} \n".format(item['domain'], item["referrer"], item["url"])
            self.missing.write(line)

        return item
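One detail I realize matters here: these files are never closed, so buffered writes may not have hit disk by the time anything tries to attach them. A close_spider hook on the same pipeline, along these lines (my untested sketch), should flush them first:

    def close_spider(self, spider):
        # close the files so buffered writes reach disk before
        # anything tries to attach them to an email
        self.old.close()
        self.date.close()
        self.missing.close()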

I would like to create a separate pipeline to send the email. What I have so far is below:

from scrapy.xlib.pydispatch import dispatcher
from scrapy.signals import spider_opened, spider_closed
from scrapy.mail import MailSender


class emailResults(object):

    def __init__(self):
        dispatcher.connect(self.spider_closed, spider_closed)
        dispatcher.connect(self.spider_opened, spider_opened)

        old = open('old_pages.txt', 'wb')
        date = open('pages_without_dates.txt', 'wb')
        missing = open('missing_pages.txt', 'wb')
        oldOutput = open('twenty_oldest_pages.txt', 'wb')

        attachments = [
            ("old_pages", "text/plain", old),
            ("date", "text/plain", date),
            ("missing", "text/plain", missing),
            ("oldOutput", "text/plain", oldOutput),
        ]

        self.mailer = MailSender()

    def spider_closed(SPIDER_NAME):
        self.mailer.send(to=["[email protected]"], attachs=attachments, subject="test email", body="Some body")

It seems that in previous versions of Scrapy you could pass self into the spider_closed function, but in the current version (0.21) spider_closed is only passed the spider name.

Any help and/or suggestions would be greatly appreciated.

Answer

Creating the mail-sending class as a pipeline is not the best idea. It is better to create it as your own extension. You can read more about extensions here: http://doc.scrapy.org/en/latest/topics/extensions.html

The most important part is the class method from_crawler. It is called for every crawler, and it is where you can register your callbacks for the signals you want to intercept. For example, this method from my mailer class looks like this:

@classmethod
def from_crawler(cls, crawler):
    recipients = crawler.settings.getlist('STATUSMAILER_RECIPIENTS')
    if not recipients:
        raise NotConfigured

    mail = MailSender.from_settings(crawler.settings)
    instance = cls(recipients, mail, crawler)

    crawler.signals.connect(instance.item_scraped, signal=signals.item_scraped)
    crawler.signals.connect(instance.spider_error, signal=signals.spider_error)
    crawler.signals.connect(instance.spider_closed, signal=signals.spider_closed)
    crawler.signals.connect(instance.item_dropped, signal=signals.item_dropped)

    return instance
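To get from there to the question's goal of mailing the report files, the remainder of such an extension might look like the sketch below. Everything apart from from_crawler (shown above) and the MailSender API is an illustrative assumption: the class name, the handler bodies, and the file names are taken from the question, not from a real class. MailSender.send accepts attachments as (name, mimetype, file_object) tuples:

from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.mail import MailSender


class StatusMailer(object):

    def __init__(self, recipients, mail, crawler):
        self.recipients = recipients
        self.mail = mail
        self.crawler = crawler

    # from_crawler as shown above

    # Stubs for the other signals registered in from_crawler; Scrapy
    # delivers signal arguments by keyword, so accepting *args/**kwargs
    # keeps these compatible across versions.
    def item_scraped(self, *args, **kwargs):
        pass

    def item_dropped(self, *args, **kwargs):
        pass

    def spider_error(self, *args, **kwargs):
        pass

    def spider_closed(self, spider):
        # reopen the files written by the pipeline and attach them
        names = ['old_pages.txt', 'pages_without_dates.txt', 'missing_pages.txt']
        attachs = [(name, 'text/plain', open(name, 'rb')) for name in names]
        # returning the Deferred makes Scrapy wait for the mail to be
        # sent before shutting down
        return self.mail.send(
            to=self.recipients,
            subject="Crawl finished: %s" % spider.name,
            body="Scrape results attached.",
            attachs=attachs,
        )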

For ease of use, remember to set all the required values in your settings:

EXTENSIONS = { 
    'your.mailer': 80 
} 

STATUSMAILER_RECIPIENTS = ["who should get mail"] 

MAIL_HOST = '***' 
MAIL_PORT = *** 
MAIL_USER = '***' 
MAIL_PASS = '***' 
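MailSender.from_settings(crawler.settings) picks up those MAIL_* values automatically, so the extension itself never needs to handle the connection details.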

Thank you for the suggestions, very helpful. – bornytm