2010-06-16 93 views
12

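Parsing email with Python

I am writing a Python script to process emails handed over by Procmail. As suggested in this question, I am using the following Procmail configuration: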

:0: 
|$HOME/process_mail.py 

My process_mail.py script receives an email on standard input, like this:

From hostname Tue Jun 15 21:43:30 2010 
Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400 
Received: from mail-fx0-f44.google.com (209.85.161.44) 
by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400 
Received: by fxm19 with SMTP id 19so170709fxm.3 
for <[email protected]>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT) 
MIME-Version: 1.0 
Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15 
Jun 2010 18:47:33 -0700 (PDT) 
Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT) 
Date: Tue, 15 Jun 2010 20:47:33 -0500 
Message-ID: <[email protected]> 
Subject: TEST 12 
From: Full Name <[email protected]> 
To: [email protected] 
Content-Type: text/plain; charset=ISO-8859-1 

ONE 
TWO 
THREE 

I am trying to parse the message this way:

>>> import email 
>>> msg = email.message_from_string(full_message) 

I want to get message fields such as 'From', 'To' and 'Subject'. However, the message object does not contain any of these fields.

What am I doing wrong?

Answers

9

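You have to make sure that the lines are not accidentally broken (as they are above, although it is hard to tell whether that is a copy-and-paste problem). With an intact message such as: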

Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400 
Received: from mail-fx0-f44.google.com (209.85.161.44) by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400 
Received: by fxm19 with SMTP id 19so170709fxm.3 for <[email protected]>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT) 
MIME-Version: 1.0 
Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15 Jun 2010 18:47:33 -0700 (PDT) 
Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT) 
Date: Tue, 15 Jun 2010 20:47:33 -0500 
Message-ID: <[email protected]> 
Subject: TEST 12 
From: Full Name <[email protected]> 
To: [email protected] 
Content-Type: text/plain; charset=ISO-8859-1 

ONE 
TWO 
THREE 

then

import email
msg = email.message_from_string(msgtxt)  # msgtxt holds the intact message above
print(msg['Subject'])

prints TEST 12 as expected.
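
For the Procmail setup in the question, the same idea can be sketched as a small helper that parses a text stream. This is only an illustration assuming Python 3: the function name `read_headers` and the example.com addresses are hypothetical, and in process_mail.py the stream would be `sys.stdin` rather than the `StringIO` used here.

```python
import email
import io


def read_headers(stream):
    """Parse one message from a text stream and return a few header fields."""
    msg = email.message_from_file(stream)
    # A missing header comes back as None rather than raising KeyError.
    return {name: msg[name] for name in ("From", "To", "Subject")}


# Stand-in for the message Procmail would pipe to the script on stdin.
sample = io.StringIO(
    "From hostname Tue Jun 15 21:43:30 2010\n"
    "Subject: TEST 12\n"
    "From: Full Name <sender@example.com>\n"
    "To: recipient@example.com\n"
    "\n"
    "ONE\nTWO\nTHREE\n"
)
fields = read_headers(sample)
print(fields["Subject"])
```

The leading "From hostname ..." envelope line is not a header; the parser stores it separately, so it does not get in the way of the real From: field.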

+0

How do I get the body of the email? – Anuj 2014-02-10 13:09:43

+0

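If you really want the entire RFC 2822 email body, with the raw MIME structure and all, parsing the message in Python is basically superfluous; the body is everything after the first empty line. Usually, for modern messages, you want to parse the MIME structure and extract one or more body parts. – tripleee 2016-06-22 09:35:14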
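
One possible way to do what the comment above describes is to walk the MIME tree and decode the first text/plain part. This is a sketch assuming Python 3; the helper name `first_text_part` is hypothetical, and real mail may need more care (HTML-only messages, attachments, unknown charsets).

```python
import email


def first_text_part(msg):
    """Return the decoded text of the first text/plain part, or None."""
    for part in msg.walk():
        # Multipart containers are skipped; only leaf text parts match.
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # bytes, transfer-decoded
            charset = part.get_content_charset() or "us-ascii"
            return payload.decode(charset, errors="replace")
    return None


raw = (
    "Subject: TEST 12\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "ONE\nTWO\nTHREE\n"
)
msg = email.message_from_string(raw)
body = first_text_part(msg)
print(body)
```

For a single-part message like the one in the question, `msg.get_payload()` alone returns the same text; the walk only matters once the message is multipart.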

1

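I am answering my own question.

I found a bug in the code that builds the message. It was adding newlines between some lines, which prevented the parser from working properly.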
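
One way such a bug can show up is a stray blank line inside the header block: the parser treats the first empty line as the end of the headers, so everything after it becomes body text. The message below is a made-up illustration (the address is a placeholder), assuming Python 3.

```python
import email

raw = (
    "Date: Tue, 15 Jun 2010 20:47:33 -0500\n"
    "\n"  # stray blank line: the header block ends here
    "Subject: TEST 12\n"
    "To: recipient@example.com\n"
    "\n"
    "ONE\n"
)
msg = email.message_from_string(raw)

print(msg.keys())         # only the headers before the blank line
print(msg["Subject"])     # the Subject line was swallowed by the body
print(msg.get_payload())
```

Printing `msg.keys()` and the start of the payload is a quick check when expected fields seem to be missing.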

3

It looks like your continuation lines are not prefixed with whitespace, which is illegal per RFC 2822 §2.2.3:

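Each header field is logically a single line of characters comprising the field name, the colon, and the field body. For convenience however, and to deal with the 998/78 character limitations per line, the field body portion of a header field can be split into a multiple line representation; this is called "folding". The general rule is that wherever this standard allows for folding white space (not simply WSP characters), a CRLF may be inserted before any WSP. For example, the header field: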

Subject: This is a test 

can be represented as:

Subject: This 
    is a test 
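
As a small illustration of how Python's email package treats such a folded field (assuming Python 3): the default legacy policy returns the value with the fold still in it, while `email.policy.default` returns it unfolded.

```python
import email
from email import policy

raw = "Subject: This\n is a test\n\nbody\n"

legacy = email.message_from_string(raw)                         # compat32 policy
modern = email.message_from_string(raw, policy=policy.default)  # newer API

print(repr(legacy["Subject"]))  # fold (newline + space) kept in the value
print(repr(modern["Subject"]))  # value unfolded into a single line
```

Either way, the field is recognized as one Subject header because the second line starts with whitespace.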

It should look like this:

From hostname Tue Jun 15 21:43:30 2010 
Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400 
Received: from mail-fx0-f44.google.com (209.85.161.44) 
    by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400 
Received: by fxm19 with SMTP id 19so170709fxm.3 
    for <[email protected]>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT) 
MIME-Version: 1.0 
Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15 
    Jun 2010 18:47:33 -0700 (PDT) 
Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT) 
Date: Tue, 15 Jun 2010 20:47:33 -0500 
Message-ID: <[email protected]> 
Subject: TEST 12 
From: Full Name <[email protected]> 
To: [email protected] 
Content-Type: text/plain; charset=ISO-8859-1 

ONE 
TWO 
THREE
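
The difference can be checked with a shortened, made-up version of the message (host names and addresses are placeholders, Python 3 assumed). When a continuation line has no leading whitespace, the parser stops reading headers at that line, so later fields such as Subject are never seen; with the indentation restored, they are.

```python
import email

broken = (
    "Received: from mail.example.com (192.0.2.1)\n"
    "by mx.example.net with SMTP; 15 Jun 2010 21:43:22 -0400\n"  # not indented
    "Subject: TEST 12\n"
    "\n"
    "ONE\n"
)
# Restore the fold by indenting the continuation line.
folded = broken.replace("\nby ", "\n    by ")

bad = email.message_from_string(broken)
good = email.message_from_string(folded)

print(bad["Subject"], bad.defects)  # defects lists what the parser complained about
print(good["Subject"])
```

On recent Python versions the `defects` list on the parsed message is a convenient place to look when fields go missing, since the parser records problems there instead of raising.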