Handling non-ASCII strings in Python

Handling non-ASCII characters in Python is really confusing. Can anyone explain?

I want to read a plain text file and replace all non-alphabetic characters with spaces.

I have a list of the characters to replace:

ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—')

For each token I get, I replace any of these characters in the token with a space by calling

for punc in ignorelist:
    token = token.replace(punc, ' ')

Notice that there is a non-ASCII character at the end of ignorelist: u'—'

Whenever my code encounters that character, it crashes and says:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position

I tried declaring the encoding by adding # -*- coding: utf-8 -*- at the top of the file, but it still doesn't work. Does anyone know why? Thanks!

2013-04-01 bolei

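You are using Python 2.x, which tries to convert automatically between unicode and plain str objects, but that conversion usually fails when non-ASCII characters are involved.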
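
To make the failure concrete, here is a minimal sketch of the conversion Python 2 attempts behind the scenes when a str meets a unicode object: it decodes the bytes with the ASCII codec. The helper name explain_ascii_decode is made up for illustration, and the sketch assumes the token holds UTF-8 bytes, in which the dash is encoded as 0xe2 0x80 0x94.

```python
def explain_ascii_decode(raw_bytes):
    """Return the error message from decoding raw_bytes as ASCII, or None on success."""
    try:
        raw_bytes.decode('ascii')
    except UnicodeDecodeError as exc:
        return str(exc)
    return None

# UTF-8 encoding of u'\u2014', the dash at the end of ignorelist (assumed encoding)
dash_bytes = b'\xe2\x80\x94'
message = explain_ascii_decode(dash_bytes)
print(message)
```

The first byte, 0xe2, is outside the ASCII range, which is why the traceback in the question names exactly that byte.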

You should not mix unicode and str objects. You can stick to unicode:

ignorelist = (u'!', u'-', u'_', u'(', u')', u',', u'.', u':', u';', u'"', u'\'', u'?', u'#', u'@', u'$', u'^', u'&', u'*', u'+', u'=', u'{', u'}', u'[', u']', u'\\', u'|', u'<', u'>', u'/', u'—') 

if not isinstance(token, unicode): 
    token = token.decode('utf-8') # assumes you are using UTF-8 
for punc in ignorelist: 
    token = token.replace(punc, u' ')
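
The same idea can also be expressed with a translation table instead of a loop of replace calls. This is only a sketch, not a drop-in fix: the names PUNCTUATION, TABLE and clean_token are hypothetical, the input is assumed to be UTF-8, and the dash is written as the escape \u2014 so the snippet does not depend on the source file's encoding.

```python
# Characters to blank out; u'\u2014' is the dash from the question.
PUNCTUATION = u'!-_(),.:;"\'?#@$^&*+={}[]\\|<>/\u2014'

# Map each code point to a space, in the form expected by translate().
TABLE = dict.fromkeys(map(ord, PUNCTUATION), u' ')

def clean_token(token):
    """Decode byte strings as UTF-8 (an assumption), then replace punctuation."""
    if isinstance(token, bytes):  # bytes is an alias of str on Python 2
        token = token.decode('utf-8')
    return token.translate(TABLE)

print(clean_token(b'foo\xe2\x80\x94bar!'))
```

Decoding once at the boundary and working with text afterwards means non-ASCII letters in the token pass through unchanged.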

Or use only plain str objects (note the last item):

ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—'.encode('utf-8')) 
# and other parts do not need to change

By manually encoding your u'—' into a str, Python no longer needs to attempt the conversion itself.
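
As a rough illustration of the byte-string route, the sketch below keeps everything as bytes, so no implicit decoding is ever needed. The sample token and the name replace_dash_bytes are invented for the example, and it assumes the text is UTF-8 encoded.

```python
# The dash encoded by hand; in UTF-8 it is the three bytes e2 80 94.
DASH_BYTES = u'\u2014'.encode('utf-8')

def replace_dash_bytes(token):
    """Replace the encoded dash in a byte string with a single space byte."""
    return token.replace(DASH_BYTES, b' ')

# Hypothetical token, encoded the same way as the file is assumed to be.
sample = u'foo\u2014bar'.encode('utf-8')
print(replace_dash_bytes(sample) == b'foo bar')
```

This only works if the bytes in the token and the bytes in the list come from the same encoding; a file in another encoding would need a different byte sequence for the dash.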

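I recommend using unicode throughout your program to avoid this kind of error. If that is too much work, you can use the latter approach. Take care, though, when you call functions from the standard library or third-party modules.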

# -*- coding: utf-8 -*- only tells Python that your source code is written in UTF-8 (without it you would get a SyntaxError).

2013-04-01 03:08:03 lilydjwg

Your file input is not being decoded as UTF-8. So when the comparison hits that character it chokes, because your input is treated as ASCII.

Try reading the file with this instead.

import codecs 
f = codecs.open("test", "r", "utf-8")
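
Putting the pieces together, a self-contained sketch might read the file through codecs.open and then blank out every non-letter. The temporary file, its contents, and the helper letters_only are made up for the demonstration, and using isalpha() as the definition of "letter" is an assumption about what the asker wants.

```python
import codecs
import os
import tempfile

def letters_only(text):
    """Replace every character that is not a letter with a space."""
    return u''.join(ch if ch.isalpha() else u' ' for ch in text)

# Create a small UTF-8 file to stand in for the real input.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with codecs.open(path, 'w', 'utf-8') as out:
        out.write(u'na\u00efve\u2014text, ok')
    # codecs.open decodes while reading, so read() returns text, not bytes.
    with codecs.open(path, 'r', 'utf-8') as f:
        cleaned = letters_only(f.read())
finally:
    os.remove(path)
```

Because the text is already decoded when it reaches letters_only, the dash is an ordinary single character and no explicit ignore list is needed.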

2013-04-01 03:08:23 klobucar

Thanks, this works! – bolei

I would like to upvote you, but my reputation is below 15, so I can't vote... sorry! – bolei
