Finding ngrams in Turkish text with NLTK

I'm struggling to find ngrams in Turkish text that contains Unicode characters. Here is my code:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 
import nltk 
from nltk import word_tokenize 
from nltk.util import ngrams 

def find_bigrams(): 
    t = "çağlar boyunca geldik çağlar aktı gitti. çağlar aktı" 
    token = nltk.word_tokenize(t) 
    bigrams = ngrams(token,2) 
    for i in bigrams: 
        print i

find_bigrams()

Output:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)

When I change the text like this:

t = "çağlar boyunca geldik çağlar aktı gitti"

the output changes too:

('\xc3\xa7a\xc4\x9flar', 'boyunca') 
('boyunca', 'geldik') 
('geldik', '\xc3\xa7a\xc4\x9flar') 
('\xc3\xa7a\xc4\x9flar', 'akt\xc4\xb1') 
('akt\xc4\xb1', 'gitti')

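How can I fix this Unicode problem? My other question is how to convert these tuples to plain strings (without the quote and parenthesis characters).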


2015-11-29 JayGatsby

This is less an NLTK problem than a Unicode problem.

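It can be solved by adding the right import from __future__; in this case, you need unicode_literals, which makes every string literal in the file a unicode string instead of a byte string.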

Here is an example from the Python 2.7.10 installation on my Mac:

>>> from __future__ import unicode_literals 
>>> t = "çağlar boyunca geldik çağlar aktı gitti. çağlar aktı" 
>>> print(t) 
çağlar boyunca geldik çağlar aktı gitti. çağlar aktı
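
To see why the original output was full of \xc3\xa7 escapes, it can help to compare the text with its UTF-8 bytes. The following is only a minimal sketch; the variable names are made up for illustration, and it uses explicit u"..." and b"..." prefixes so that it behaves the same on Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
# Illustrative names only; nothing here comes from NLTK.
word = u"çağlar"                 # text: a sequence of Unicode characters
raw = word.encode("utf-8")       # bytes: what a UTF-8 file or terminal stores

print(len(word))                 # number of characters
print(len(raw))                  # number of bytes (ç and ğ need two bytes each)

# The escaped form from the question is just these bytes written out.
same_bytes = (raw == b"\xc3\xa7a\xc4\x9flar")

# Decoding the bytes gives the original text back.
round_trip = raw.decode("utf-8")
print(round_trip == word)
```

Without unicode_literals, a plain "..." literal in Python 2 is the byte form, which is what ends up inside the tuples in the question.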

bigrams yields tuples, so to get rid of the parentheses you can unpack each pair as you iterate over it.

>>> tup = ("hello", "world") 
>>> print tup 
(u'hello', u'world') 
>>> l = [tup] 
>>> for i in l: 
...     print(i) 
... 
(u'hello', u'world') 
>>> for i,j in l: 
...     print("{0} {1}".format(i, j)) 
... 
hello world
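
Printing a tuple shows the repr of each element, which is where the quotes and parentheses come from. Joining the elements yourself avoids that. As a rough sketch that runs without NLTK, the snippet below swaps in str.split and zip for word_tokenize and ngrams (an assumption made only to keep the example self-contained; the real tokenizer also splits off punctuation):

```python
# -*- coding: utf-8 -*-
# Stand-ins for NLTK: split() instead of word_tokenize, zip() instead of ngrams.
text = u"çağlar boyunca geldik çağlar aktı gitti"
tokens = text.split()

# Pair every token with its successor to form bigrams.
bigrams = list(zip(tokens, tokens[1:]))

# Join each tuple into one plain string, e.g. for printing or writing to a file.
lines = [u" ".join(pair) for pair in bigrams]

for line in lines:
    print(line)
```

The same u" ".join(pair) call works for trigrams or longer ngrams, since it does not depend on the tuple length.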

Combining these ideas in your script:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 
from __future__ import unicode_literals 
import nltk 
from nltk import word_tokenize 
from nltk.util import ngrams 

def find_bigrams(): 
    t = "çağlar boyunca geldik çağlar aktı gitti. çağlar aktı" 
    token = nltk.word_tokenize(t) 
    bigrams = ngrams(token,2) 
    for i, j in bigrams: 
        print("{0} {1}".format(i, j)) 

find_bigrams()


2015-11-29 16:04:43 erip

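It works with this string, but when I try to load a large string from a text file I still get the first error – JayGatsby

I get this error: TypeError: 'encoding' is an invalid keyword argument for this function – JayGatsby

This one worked, thanks: t = codecs.open('siirclear.txt', 'r', encoding='utf-8').read() – JayGatsby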
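
The TypeError above is what Python 2's built-in open() raises, because it has no encoding parameter. Besides codecs.open, io.open accepts encoding on both Python 2.7 and Python 3. A small sketch, assuming a throwaway file name (sample.txt is hypothetical, created in a temporary directory so the example is self-contained):

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

text = u"çağlar boyunca geldik çağlar aktı gitti"

# Hypothetical file, written first so that there is something to read back.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading with an explicit encoding returns unicode text, not raw bytes.
with io.open(path, "r", encoding="utf-8") as f:
    loaded = f.read()

print(loaded == text)
```

Text read this way is already unicode, so it can be passed to the tokenizer without triggering the ascii codec error.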
