Python - pyparsing Unicode characters

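:) I tried using w = Word(printables), but it does not work. How should I write this specification? 'w' is meant to match Hindi characters (UTF-8).

The code specifies a grammar and parses the input accordingly.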

671.assess :: अहसास ::2 
x=number + "." + src + "::" + w + "::" + number + "." + number

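It works when the input contains only English characters, so the code is correct for ASCII input, but it does not work for Unicode input.

What I mean is that the code works when the input has the form 671.assess :: ahsaas :: 2

That is, it parses the words when they are written in English (romanized) form, but I do not know how to parse and then print the characters in Unicode form. I need this for English-Hindi word alignment.

The Python code is as follows: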

# -*- coding: utf-8 -*- 
from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables, Group, alphas8bit
# grammar 
src = Word(printables) 
trans = Word(printables) 
number = Word(nums) 
x=number + "." + src + "::" + trans + "::" + number + "." + number 
#parsing for eng-dict 
efiledata = open('b1aop_or_not_word.txt').read() 
eresults = x.parseString(efiledata) 
edict1 = {} 
edict2 = {} 
counter=0 
xx=list() 
for result in eresults: 
    trans=""#translation string 
    ew=""#english word 
    xx=result[0] 
    ew=xx[2] 
    trans=xx[4] 
    edict1 = { ew:trans } 
    edict2.update(edict1) 
print len(edict2) #no of entries in the english dictionary 
print "edict2 has been created" 
print "english dictionary" , edict2 

#parsing for hin-dict 
hfiledata = open('b1aop_or_not_word.txt').read() 
hresults = x.scanString(hfiledata) 
hdict1 = {} 
hdict2 = {} 
counter=0 
for result in hresults: 
    trans=""#translation string 
    hw=""#hin word 
    xx=result[0] 
    hw=xx[2] 
    trans=xx[4] 
    #print trans 
    hdict1 = { trans:hw } 
    hdict2.update(hdict1) 

print len(hdict2) #no of entries in the hindi dictionary 
print"hdict2 has been created" 
print "hindi dictionary" , hdict2 
''' 
####################################################################################################################### 

def translate(d, ow, hinlist): 
    if ow in d.keys():#ow=old word d=dict 
        print ow , "exists in the dictionary keys" 
        transes = d[ow] 
        transes = transes.split() 
        print "possible transes for" , ow , " = ", transes 
        for word in transes: 
            if word in hinlist: 
                print "trans for" , ow , " = ", word 
                return word 
        return None 
    else: 
        print ow , "absent" 
        return None 

f = open('bidir','w') 
#lines = ["'\ 
#5# 10 # and better performance in business in turn benefits consumers . # 0 0 0 0 0 0 0 0 0 0 \ 
#5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI . # 0 0 0 0 0 0 0 0 0 0 0 \ 
#'"] 
data=open('bi_full_2','rb').read() 
lines = data.split('[email protected]#$%') 
loc=0 
for line in lines: 
    eng, hin = [subline.split(' # ') 
       for subline in line.strip('\n').split('\n')] 

    for transdict, source, dest in [(edict2, eng, hin), 
            (hdict2, hin, eng)]: 
     sourcethings = source[2].split() 
     for word in source[1].split(): 
      tl = dest[1].split() 
      otherword = translate(transdict, word, tl) 
      loc = source[1].split().index(word) 
      if otherword is not None: 
       otherword = otherword.strip() 
       print word, ' <-> ', otherword, 'meaning=good' 
       if otherword in dest[1].split(): 
        print word, ' <-> ', otherword, 'trans=good' 
        sourcethings[loc] = str(
         dest[1].split().index(otherword) + 1) 

     source[2] = ' '.join(sourcethings) 

    eng = ' # '.join(eng) 
    hin = ' # '.join(hin) 
    f.write(eng+'\n'+hin+'\n\n\n') 
f.close() 
'''

If an example input sentence pair from the source file is:

1# 5 # modern markets : confident consumers # 0 0 0 0 0 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 0 0 0 0 0 0 
[email protected]#$%

then the output looks like this:

1# 5 # modern markets : confident consumers # 1 2 3 4 5 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 1 2 3 4 5 0 
[email protected]#$%

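Explanation of the output: this implements bidirectional alignment. The first word of the English sentence, 'modern', maps to the first word of the Hindi sentence, 'AddhUnIk', and vice versa. Punctuation characters are also treated as words here, because they too take part in the bidirectional mapping. So if you look at the Hindi word '.', it has a null alignment (0): the English sentence has no full stop, so this token maps to nothing in the English sentence. The third line of the output is simply a delimiter, used when we process several sentence pairs for which we are trying to build the bidirectional mapping.

What modifications should I make if my Hindi sentences are in Unicode (UTF-8) format?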
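To make the record layout concrete, here is a minimal Python 3 sketch that splits one line into its fields and reads the alignment back as word pairs. It assumes the ' # ' field separator and the 1-based indices (0 meaning unaligned) used by the question's code; parse_record and aligned_pairs are hypothetical helper names, not part of the original script.

```python
def parse_record(line):
    """Split 'header # text # alignment' into (header, tokens, indices)."""
    header, text, alignment = line.strip().split(' # ')
    return header, text.split(), [int(n) for n in alignment.split()]


def aligned_pairs(source, dest):
    """Pair each source token with the destination token it points at.

    Alignment indices are 1-based; 0 means the token has no partner.
    """
    _, src_tokens, src_align = source
    _, dst_tokens, _ = dest
    pairs = []
    for token, target in zip(src_tokens, src_align):
        partner = dst_tokens[target - 1] if target > 0 else None
        pairs.append((token, partner))
    return pairs


eng = parse_record('1# 5 # modern markets : confident consumers # 1 2 3 4 5')
hin = parse_record('1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 1 2 3 4 5 0')

eng_to_hin = aligned_pairs(eng, hin)
hin_to_eng = aligned_pairs(hin, eng)
```

With the sample pair, the Hindi full stop ends up paired with None, which is the null alignment described above.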

Source

2010-02-26 boddhisattva

Please edit this question to use proper formatting so that it is readable. –

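In general, do not process encoded byte strings: turn them into proper Unicode strings (by calling their .decode method) as soon as possible, always do your processing on Unicode strings, and then, if you have to for I/O purposes, .encode them back into whatever byte-string encoding you require.

If you are talking about literals, as it seems you are in your code, "as soon as possible" means at once: use u'...' to express your literals. In the more general case, where you are forced to do I/O in encoded form, decode immediately after input (just as you encode immediately before output if you need to write output in a specific encoding).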
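As a rough illustration of that decode-early, encode-late pattern, here is a minimal sketch written for Python 3, where str is already Unicode and bytes is the encoded form. The sample line is the one from the question; the variable names are made up for the example, and the UTF-8 encoding is an assumption about the input.

```python
# -*- coding: utf-8 -*-

# Pretend these bytes were just read from a UTF-8 file or socket.
raw = u'671.assess :: अहसास :: 2'.encode('utf-8')

# 1. Decode immediately after input: bytes -> Unicode string.
text = raw.decode('utf-8')

# 2. Do all processing on the Unicode string.
fields = [field.strip() for field in text.split('::')]
hindi_word = fields[1]

# 3. Encode only at the output boundary: Unicode string -> bytes.
out = hindi_word.encode('utf-8')
```

Lengths and indexing on hindi_word now count characters, while out counts bytes (three per Devanagari character in UTF-8).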

Source

2010-02-26 06:08:08

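Hello sir.. :) Thank you for your answer.. What you said in the second paragraph applies exactly to my case.. I tried it on the following line of code: trans = u'Word(printables)', and I could not get the expected output. Please correct me if I modified the wrong line, because after this change an error comes up (for the line that defines the grammar, saying printables is expected at that position). – boddhisattva

@mgj, do not assign a unicode string literal to 'trans', that makes no sense. Just make sure that 'printables' is a unicode object (**not** a utf8-encoded byte string! - nor a byte string in any other encoding!), and use 'trans = Word(printables)'. If your _file_ is utf-8 encoded, or encoded with any other encoding, use 'codecs.open' from the 'codecs' module to decode it, rather than the built-in 'open' as you are doing, so that each 'line' is a unicode object, not a byte string (in any encoding). –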
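A minimal sketch of reading the file as Unicode follows. The comment recommends codecs.open; this example swaps in io.open with an explicit encoding, which serves the same purpose and is the same function as the built-in open on Python 3. The temporary file, its name, and the UTF-8 encoding are assumptions made only so the example is self-contained.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

sample = u'671.assess :: अहसास :: 2\n'

# Write a small UTF-8 file to read back (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), 'sample_dict.txt')
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(sample)

# Text mode with an encoding: every line arrives already decoded.
with io.open(path, 'r', encoding='utf-8') as handle:
    lines = handle.readlines()

# Binary mode: the same content as undecoded bytes, for comparison.
with open(path, 'rb') as handle:
    raw = handle.read()
```

Feeding the decoded lines to the parser, rather than raw, is what keeps the grammar working on characters instead of on individual UTF-8 bytes.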

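Pyparsing's printables only covers characters in the ASCII range. You want printables over the full Unicode range, like this:

import sys
unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode) 
             if not unichr(c).isspace())

Now you can define trans using this more complete set of non-space characters:

trans = Word(unicodePrintables)

I was unable to test this against your Hindi test string, but I think it will do the trick.

If you are using Python 3, there is no separate unichr function and no xrange generator; just use:

unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode) 
             if not chr(c).isspace())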
Source

2010-02-26 09:43:50 PaulMcG

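Thank you for your answer, sir.. :) – boddhisattva

This answer has been outdated for a long time: unicode is no longer 16 bit, and looping over everything is not performant. –

@flyingsheep - good tip, updated to use 'sys.maxunicode' instead of a hard-coded constant, so it will track Python's 'sys' module. As for looping over everything, this bit runs only once, when the parser is first defined, and when it is used to create a pyparsing 'Word' it gets stored as a set(), so performance at parse time is still quite good. – PaulMcG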
