UnicodeDecodeError: 'ascii' codec can't decode byte - Python

This is related to the following question - UnicodeDecodeError: 'ascii' codec can't decode byte - Python

I have a Python application that performs the following tasks -

# -*- coding: utf-8 -*-

1. Read a Unicode text file (non-English) -

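import codecs

def readfile(file, access, encoding):
    # codecs.open decodes the file while reading, so f.read() returns unicode text
    with codecs.open(file, access, encoding) as f:
        return f.read()

# 'utf-8-sig' also strips a leading byte order mark (BOM), if present
text = readfile('teststory.txt', 'r', 'utf-8-sig')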

This returns the contents of the text file as a string.
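As a minimal sketch of what the `utf-8-sig` codec changes here: it strips a leading byte order mark (BOM) if the file has one, so the returned text starts with the first real character. The temporary file and the sample text are made up for illustration, and `io.open` is used instead of `codecs.open` only because it behaves the same on Python 2 and 3.

```python
# -*- coding: utf-8 -*-
import codecs
import io
import os
import tempfile

def readfile(path, encoding):
    # io.open decodes while reading and returns text (unicode), not bytes.
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read()

# Hypothetical sample file: a UTF-8 BOM followed by non-ASCII text.
sample = u'caf\u00e9 \u20ac'
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF8 + sample.encode('utf-8'))

with_bom = readfile(path, 'utf-8')         # BOM survives as u'\ufeff'
without_bom = readfile(path, 'utf-8-sig')  # BOM is stripped
os.remove(path)
```

If the file has no BOM, both codecs should return the same text, so `utf-8-sig` is a reasonable default for files of unknown origin.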

2. Split the text into sentences.

3. Go through each sentence and identify verbs, nouns, etc.

Refer - Searching for Unicode characters in Python and Find word infront and behind of a Python list

4. Add them to separate variables as below

nouns = "CAR" | "BUS" |

verbs = "DRIVES" | "HITS"

5. Now I want to pass them into an NLTK context-free grammar as below -

grammar = nltk.parse_cfg(''' 
    S -> NP VP 
    NP -> N 
    VP -> V | NP V 

    N -> '''+nouns+''' 
    V -> '''+verbs+''' 
    ''')

It gives me the following error -

line 40, in V -> '''+verbs+''' UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 114: ordinal not in range(128)
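A likely cause, assuming the code runs on Python 2: when a byte string and a unicode string are concatenated, Python 2 first decodes the byte string with the ASCII codec, and any byte above 0x7f fails. The sketch below reproduces the same message explicitly. The sample character is an arbitrary one whose UTF-8 encoding starts with 0xe0; it is not taken from the original file.

```python
# Hypothetical sample: one non-ASCII character encoded as UTF-8 bytes.
raw = u'\u0d9a'.encode('utf-8')   # b'\xe0\xb6\x9a', first byte is 0xe0

try:
    # This is what Python 2 does implicitly for str + unicode.
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding explicitly with the codec the data was written in succeeds.
text = raw.decode('utf-8')
```

Here `message` holds the "ordinal not in range(128)" error, which suggests that one of the pieces being concatenated is still a byte string.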

How can I overcome this problem and pass the variables to the NLTK CFG?

Full code - https://dl.dropboxusercontent.com/u/4959382/new.zip

2013-08-18 ChamingaD

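Can you show the *full* traceback of the error? – Bakuriu

I am using PyCharm. How can I print the full traceback? print_stack() doesn't work. Anyway, is it possible to find the problem from the given exception? – ChamingaD

'import logging; try: your code; except: logging.exception("ouch")' # for clarity, use newlines and indentation instead of ';' –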
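Written out with newlines and indentation, the suggestion in this comment could look like the sketch below. The function `your_code` and the logger name are placeholders, and the in-memory handler is there only so the traceback can be inspected. A plain `logging.exception("ouch")` on the root logger writes to stderr instead.

```python
import io
import logging

# Dedicated logger writing to an in-memory buffer (hypothetical name).
buffer = io.StringIO()
logger = logging.getLogger("traceback_demo")
logger.addHandler(logging.StreamHandler(buffer))
logger.propagate = False

def your_code():
    # Placeholder for the failing code: decode non-ASCII bytes as ASCII.
    return b'\xe0\xb6\x9a'.decode('ascii')

try:
    your_code()
except Exception:
    # Logs the message at ERROR level, followed by the full traceback.
    logger.exception("ouch")

full_traceback = buffer.getvalue()
```

The captured text contains the message, the call chain down to the failing line, and the exception itself, which is what the first comment asks for.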

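In short, you have these strategies:

Treat the input as a sequence of bytes; then both the input and the grammar are UTF-8 encoded data (bytes).
Treat the input as a sequence of Unicode code points; then both the input and the grammar are unicode.
Rename the Unicode code points to ASCII, i.e. use escape sequences.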
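Whichever strategy is chosen, the point is that every piece spliced into the grammar has the same type. A minimal sketch of the second strategy (everything is unicode) follows. The helper `to_text`, the sample words, and the template are illustrative assumptions, not part of the original code.

```python
def to_text(value, encoding='utf-8'):
    # Decode byte strings; leave text that is already unicode untouched.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Hypothetical inputs: one piece arrived as UTF-8 bytes, the other as text.
nouns = u'"caf\u00e9"'.encode('utf-8')
verbs = u'"drives"'

template = u'S -> NP VP\nNP -> N\nVP -> V | NP V\nN -> {nouns}\nV -> {verbs}\n'
grammar_text = template.format(nouns=to_text(nouns), verbs=to_text(verbs))
```

Normalizing at the point where the pieces are combined avoids the implicit ASCII decoding that Python 2 applies when bytes and unicode are mixed.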

nltk installed with pip (2.0.4 in my case) does not accept unicode directly, but it does accept quoted Unicode constants; that is, all of the following appear to work:

In [26]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar') 
Out[26]: <Grammar with 2 productions> 

In [27]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("utf-8")) 
Out[27]: <Grammar with 2 productions> 

In [28]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("unicode_escape")) 
Out[28]: <Grammar with 2 productions>

Note that I quoted the Unicode text but not the plain text: "€" vs bar.
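Applied to word lists like the ones in the question, the quoting could be sketched as follows. The helper name `alternatives` and the sample words are assumptions for illustration, and the sketch only builds the rule text. It does not call NLTK, and it does not handle words that themselves contain a double quote.

```python
def alternatives(words):
    # Quoted tokens are terminals in the grammar; unquoted ones are nonterminals.
    return u' | '.join(u'"{0}"'.format(w) for w in words)

# Hypothetical terminals, one of them non-ASCII.
rule = u'N -> ' + alternatives([u'\N{EURO SIGN}', u'bus'])

# Third strategy: the same rule as pure ASCII with escape sequences.
ascii_rule = rule.encode('unicode_escape')
```

The resulting `rule` (or `ascii_rule`) can then be spliced into the grammar string in place of the hand-built `nouns` and `verbs` values.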

2013-08-19 14:23:48