2016-01-17

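Python script receiving UnicodeEncodeError: 'ascii' codec can't encode character

I have a simple Python script that pulls posts from reddit and posts them on Twitter. Unfortunately, tonight it started failing, which I assume is because someone's title on reddit has a formatting problem. The error I'm receiving is: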

File "redditbot.py", line 82, in <module>
    main()
File "redditbot.py", line 64, in main
    tweeter(post_dict, post_ids)
File "redditbot.py", line 74, in tweeter
    print post+" "+post_dict[post]+" #python"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)

Here is my script:

# encoding=utf8 
import praw 
import json 
import requests 
import tweepy 
import time 
import urllib2 
import sys 
reload(sys) 
sys.setdefaultencoding('utf8') 

access_token = 'hidden' 
access_token_secret = 'hidden' 
consumer_key = 'hidden' 
consumer_secret = 'hidden' 


def strip_title(title):
    if len(title) < 75:
        return title
    else:
        return title[:74] + "..."

def tweet_creator(subreddit_info):
    post_dict = {}
    post_ids = []
    print "[bot] Getting posts from Reddit"
    for submission in subreddit_info.get_hot(limit=2000):
        post_dict[strip_title(submission.title)] = submission.url
        post_ids.append(submission.id)
    print "[bot] Generating short link using goo.gl"
    mini_post_dict = {}
    for post in post_dict:
        post_title = post
        post_link = post_dict[post]

        mini_post_dict[post_title] = post_link
    return mini_post_dict, post_ids

def setup_connection_reddit(subreddit):
    print "[bot] setting up connection with Reddit"
    r = praw.Reddit('PythonReddit PyReTw'
                    'monitoring %s' %(subreddit))
    subreddit = r.get_subreddit('python')
    return subreddit



def duplicate_check(id):
    found = 0
    with open('posted_posts.txt', 'r') as file:
        for line in file:
            if id in line:
                found = 1
    return found

def add_id_to_file(id):
    with open('posted_posts.txt', 'a') as file:
        file.write(str(id) + "\n")

def main():
    subreddit = setup_connection_reddit('python')
    post_dict, post_ids = tweet_creator(subreddit)
    tweeter(post_dict, post_ids)

def tweeter(post_dict, post_ids):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    for post, post_id in zip(post_dict, post_ids):
        found = duplicate_check(post_id)
        if found == 0:
            print "[bot] Posting this link on twitter"
            print post+" "+post_dict[post]+" #python"
            api.update_status(post+" "+post_dict[post]+" #python")
            add_id_to_file(post_id)
            time.sleep(3000)
        else:
            print "[bot] Already posted"

if __name__ == '__main__':
    main()

Any help would be greatly appreciated. Thanks in advance!

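Would you mind fixing the indentation of your example? Also, have you tried encoding post explicitly before formatting and printing the bytes? – karlson

You may find this article useful: [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html), written by SO veteran Ned Batchelder.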

Answers

1

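The problem probably stems from mixing byte strings and unicode strings in the concatenation. As an alternative to prefixing every string literal with u, perhaps

from __future__ import unicode_literals

fixes some things for you. See here for a more in-depth explanation, and decide whether it suits you.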
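As a rough illustration of what keeping every piece as text looks like, the sketch below writes the u prefixes out explicitly, which is what unicode_literals would do implicitly for bare literals in Python 2. The title and URL are made-up sample values, not data from the question.

```python
# Hypothetical sample values; a real title would come from reddit.
title = u"\u201cWhy Print Fails\u201d"
url = u"http://example.com/post"

# Every operand is a unicode string, so no implicit ascii conversion
# happens while the tweet text is being assembled.
tweet = title + u" " + url + u" #python"

# Encode once, explicitly, when bytes are actually needed.
encoded = tweet.encode("utf-8")
```

The concatenation itself never touches the ascii codec here; only the final, explicit encode call turns the text into bytes.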

2

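You are trying to print a unicode string to the terminal (or possibly to a file via IO redirection), but the encoding used by your terminal (or file system) is ASCII. Python therefore tries to convert the string from its unicode representation to ASCII, and fails because the code point u'\u201c' cannot be represented in ASCII. Effectively, your code is doing this: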

>>> print u'\u201c'.encode('ascii') 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128) 

You can try encoding to UTF-8:

print (post + " " + post_dict[post] + " #python").encode('utf8') 

or encode to ASCII like this:

print (post + " " + post_dict[post] + " #python").encode('ascii', 'replace') 

which replaces every character that ASCII cannot represent with ?.
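To compare the options side by side, here is a small sketch that applies each encoding choice to the same sample string. The sample text is made up, and the backslashreplace handler is an extra option not mentioned above, included only for comparison.

```python
text = u"\u201cpython\u201d"  # sample text wrapped in curly quotes

# Strict ASCII encoding fails, just like the implicit conversion does.
try:
    text.encode("ascii")
    strict_result = "encoded"
except UnicodeEncodeError:
    strict_result = "UnicodeEncodeError"

# UTF-8 can represent every character, so nothing is lost.
utf8_bytes = text.encode("utf-8")

# 'replace' is lossy: each unencodable character becomes a question mark.
replaced = text.encode("ascii", "replace")

# 'backslashreplace' keeps the code point visible as an escape sequence.
escaped = text.encode("ascii", "backslashreplace")
```

Of these, only the UTF-8 result can be decoded back to the original text; the other two are best suited to logs and debugging output.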

Another option, useful if you are printing for debugging purposes, is to print the repr of the string:

print repr(post + " " + post_dict[post] + " #python") 

This outputs something like:

>>> s = u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'
>>> print repr(s)
u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'

3

Consider this simple program:

print(u'\u201c' + "python") 

If you run it and print to a terminal (with a suitable character encoding), you get

“python 

However, if you redirect the output to a file, you get a UnicodeEncodeError:

script.py > /tmp/out
Traceback (most recent call last):
  File "/home/unutbu/pybin/script.py", line 4, in <module>
    print(u'\u201c' + "python")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)

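When you print to a terminal, Python uses the terminal's character encoding to encode the unicode. (A terminal can only print bytes, so unicode must be encoded before it can be printed.)

When output is redirected to a file, Python cannot determine the character encoding, because files do not declare an encoding. So by default, Python 2 implicitly encodes all unicode with the ascii codec before writing it to the file. Since u'\u201c' cannot be encoded as ascii, a UnicodeEncodeError is raised. (Only the first 128 unicode code points, U+0000 through U+007F, can be encoded as ascii.)

This problem is explained in detail in the Why Print Fails wiki.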
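The behaviour described above can be worked around by choosing the encoding yourself whenever the output stream does not declare one. The following is a minimal sketch under that assumption; encode_for_stream and FakeStream are hypothetical names introduced for illustration, and the UTF-8 fallback is a choice, not something Python does for you.

```python
class FakeStream(object):
    """Hypothetical stand-in for sys.stdout with a chosen encoding."""

    def __init__(self, encoding):
        self.encoding = encoding


def encode_for_stream(text, stream, fallback="utf-8"):
    """Encode text with the stream's encoding, or a fallback if it has none."""
    # Under Python 2, sys.stdout.encoding is None when output is redirected.
    encoding = getattr(stream, "encoding", None) or fallback
    # 'replace' avoids an exception if the stream's encoding is too narrow.
    return text.encode(encoding, "replace")


redirected = FakeStream(None)     # behaves like stdout redirected to a file
ascii_only = FakeStream("ascii")  # behaves like an ASCII-only terminal

to_file = encode_for_stream(u"\u201cpython", redirected)
to_ascii = encode_for_stream(u"\u201cpython", ascii_only)
```

In a real script the stream argument would be sys.stdout. Setting the PYTHONIOENCODING environment variable (for example to utf-8) before running the script is another way to give Python an encoding for redirected output.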


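To fix this, first avoid adding unicode and byte strings together. Doing so causes an implicit conversion using the ascii codec in Python 2, and an exception in Python 3. To future-proof your code, it is best to be explicit: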

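# Build the line as unicode first, then encode once when printing.
# Encoding post before the lookup would change the dictionary key.
line = u'{} {} #python'.format(post, post_dict[post])
print(line.encode('utf-8'))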
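One way to stay explicit throughout a script is to decode any bytes on the way in, work only with text, and encode once on the way out. The sketch below assumes UTF-8 input; to_text and build_tweet are hypothetical helpers, not part of the original script.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def build_tweet(title, url, tag=u"#python"):
    """Join the parts as text so no implicit ascii conversion can occur."""
    return u" ".join([to_text(title), to_text(url), tag])


# Mixed inputs: a text title and a byte-string URL (sample values).
tweet = build_tweet(u"\u201cHi\u201d", b"http://example.com/post")

# Encode only at the output boundary.
payload = tweet.encode("utf-8")
```

Because the conversion happens in one place, the same helpers behave the same way under Python 2 and Python 3.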