Python/Django/MySQL "Incorrect string value" error

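I'm running a Django 1.4.2/Python 2.7.3/MySQL 5.5.28 site. One feature of the site is that an administrator can send an email to the server, which uses procmail to invoke a Python script that parses the email and puts it into the database. I maintain two versions of the site: a development site and a production site. The two sites use separate but identical virtualenvs (I even deleted them and reinstalled all the packages to make sure).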

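I'm running into a strange problem. The exact same script succeeds on the development server and fails on the production server. It fails with this error: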

...django/db/backends/mysql/base.py:114: Warning: Incorrect string value: '\x92t kno...' for column 'message' at row 1

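I'm well aware of Django's Unicode issues, and I know there are a ton of questions on SO about this error, but I made sure to set up the database as UTF-8 from the start: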

mysql> show variables like "character_set_database"; 
+------------------------+-------+ 
| Variable_name   | Value | 
+------------------------+-------+ 
| character_set_database | utf8 | 
+------------------------+-------+ 
1 row in set (0.00 sec) 

mysql> show variables like "collation_database"; 
+--------------------+-----------------+ 
| Variable_name  | Value   | 
+--------------------+-----------------+ 
| collation_database | utf8_general_ci | 
+--------------------+-----------------+ 
1 row in set (0.00 sec)

Additionally, I know that each column can have its own character set, but the message column is indeed UTF-8:

mysql> show full columns in listserv_post; 
+------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+ 
| Field  | Type   | Collation  | Null | Key | Default | Extra   | Privileges      | Comment | 
+------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+ 
| id   | int(11)  | NULL   | NO | PRI | NULL | auto_increment | select,insert,update,references |   | 
| thread_id | int(11)  | NULL   | NO | MUL | NULL |    | select,insert,update,references |   | 
| timestamp | datetime  | NULL   | NO |  | NULL |    | select,insert,update,references |   | 
| from_name | varchar(100) | utf8_general_ci | NO |  | NULL |    | select,insert,update,references |   | 
| from_email | varchar(75) | utf8_general_ci | NO |  | NULL |    | select,insert,update,references |   | 
| message | longtext  | utf8_general_ci | NO |  | NULL |    | select,insert,update,references |   | 
+------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+ 
6 rows in set (0.00 sec)

Does anyone have any idea why I'm getting this error? Why does it happen under the production config but not the dev config?

Thanks!

[Edit 1]
To be clear, the data is the same too. I send one email to the server and procmail dispatches it. This is what the .procmailrc file looks like:

VERBOSE=off 
:0 
{ 
    :0c 
    | <path>/dev/ein/scripts/process_new_mail.py dev > outputdev 

    :0 
    | <path>/prd/ein/scripts/process_new_mail.py prd > outputprd 
}

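There are two copies of process_new_mail.py, but that's only because it's version controlled so that I can maintain two separate environments. If I diff the two output files (which contain the received message), they are identical.

I actually just discovered that both the dev and prd configs are failing. The difference is that the dev config fails silently (perhaps related to the DEBUG setting?). The problem is that one of the messages contains some Unicode characters and, for some reason, Django is choking on them. I'm making progress...

I tried editing the code to explicitly encode the message as ASCII and as UTF-8, but it still doesn't work. I'm getting closer, though.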
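The `\x92` in the warning is a useful clue. Here is a minimal sketch of why re-encoding alone would not help, written for Python 3 (the post itself uses Python 2.7). The sample sentence is hypothetical, since only the fragment `\x92t kno` appears in the warning, and Windows-1252 as the sender's encoding is an assumption:

```python
# Python 3 sketch; the sample bytes are hypothetical, built around the
# '\x92t kno' fragment shown in the MySQL warning.
raw = b"I don\x92t know"

# 0x92 is a UTF-8 continuation byte, so it cannot start a character:
# the raw bytes are not valid UTF-8 (and not ASCII either).
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: the mail client used Windows-1252, where 0x92 is the
# right single quotation mark (U+2019).
text = raw.decode("cp1252")

# Only after decoding with the correct source charset can the text be
# encoded as UTF-8 for a utf8 column.
stored = text.encode("utf-8")
```

The point is that the bytes have to be decoded with the charset the email declares before anything is handed to the database; picking ASCII or UTF-8 up front does not match what the sender actually sent.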

2012-11-19 Geoff

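You say the **code** is the same, but what about the **data**? Maybe the bug also exists in the development environment and you just don't know it. Try adding that row in the development environment and see whether the same error occurs. – mgibsonbr

The data is exactly the same too. I'm sending one email to the server, and I have procmail invoke both the development and production scripts. I even output the message just to make sure and diffed the two outputs; they are identical. – Geoff

My gut feeling could be wrong, but it seems like this should be a database problem. There are two different databases, one for dev and one for prd, but I can't find any difference between the two that would cause this. – Geoff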

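I fixed it! The problem was that I wasn't parsing the email correctly with respect to its character sets. My fixed email-parsing code is adapted from the two posts credited in the comments: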

#get the charset of an email
#courtesy http://ginstrom.com/scribbles/2007/11/19/parsing-multilingual-email-with-python/
def get_charset(message, default='ascii'):
    #Prefer the charset declared in the Content-Type header.
    if message.get_content_charset():
        return message.get_content_charset()

    #get_charset() returns a Charset object (or None), so convert it to its name.
    if message.get_charset():
        return str(message.get_charset())

    return default

#courtesy https://stackoverflow.com/questions/7166922/extracting-the-body-of-an-email-from-mbox-file-decoding-it-to-plain-text-regard
def get_body(message):
    body = None

    #Walk through the parts of the email to find the text body.
    if message.is_multipart():
        for part in message.walk():
            #If part is multipart, walk through the subparts.
            if part.is_multipart():
                for subpart in part.walk():
                    if subpart.get_content_type() == 'text/plain':
                        #Get the subpart payload (i.e., the message body).
                        charset = get_charset(subpart, get_charset(message))
                        body = unicode(subpart.get_payload(decode=True), charset)
            #Part isn't multipart so get the email body.
            elif part.get_content_type() == 'text/plain':
                charset = get_charset(part, get_charset(message))
                body = unicode(part.get_payload(decode=True), charset)
    #If this isn't a multi-part message then get the payload (i.e., the message body).
    elif message.get_content_type() == 'text/plain':
        charset = get_charset(message)
        body = unicode(message.get_payload(decode=True), charset)

    return body
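For readers on Python 3, the same idea can be sketched as follows. This is not the code from the post: `get_text_body` is a hypothetical helper name, the sample message is made up, and `errors="replace"` is an added assumption to keep a mislabeled message from raising. Since `walk()` already visits nested parts, a single loop is enough here:

```python
import email


def get_text_body(msg, default="ascii"):
    """Return the last text/plain part decoded to str, or None if there is none."""
    body = None
    # walk() yields the message itself and every nested part, depth-first.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # Use the part's declared charset, then the top-level one, then the default.
            charset = part.get_content_charset() or msg.get_content_charset() or default
            payload = part.get_payload(decode=True)  # raw bytes, transfer encoding undone
            body = payload.decode(charset, errors="replace")
    return body


# Hypothetical sample: a Windows-1252 message containing the 0x92 byte.
raw = (
    b"From: sender@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="windows-1252"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"I don\x92t know\r\n"
)
msg = email.message_from_bytes(raw)
body = get_text_body(msg)
```

The resulting `body` is a proper Unicode string, so the database driver can encode it as UTF-8 when saving it to the `message` column.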

Thanks so much for your help!

2012-11-20 20:15:02 Geoff
