2012-10-06 78 views
3

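Fast transliteration of Arabic text with Python

I work with Arabic text files all the time, and to avoid encoding problems I transliterate the Arabic characters into English according to Buckwalter's scheme (http://www.qamus.org/transliteration.htm).

Here is my code, but it is very slow, even for small files of about 400 KB. Any ideas for making it faster?

Thanks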

 def transliterate(file): 
      data = open(file).read() 
      buckArab = {"'":"ء", "|":"آ", "?":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"}  
      for char in data: 
       for k, v in arabBuck.iteritems(): 
        data = data.replace(k,v)     
      return data 

Answers

6

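By the way, someone has already written a script that does this, so you may want to check it out before spending too much time on your own: buckwalter2unicode.py

It probably does more than you need, but you don't have to use all of it: I copied just the two dictionaries and the transliterateString function (with a few tweaks, I think) and use it on my site.

Edit: The script above is what I had been using, but I just discovered that it is much slower than using replace, especially on a large corpus. This is the code I finally ended up with, which seems to be both simpler and faster (it references a dictionary buck2uni):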

def transString(string, reverse=0): 
    '''Given a Unicode string, transliterate into Buckwalter. To go from 
    Buckwalter back to Unicode, set reverse=1''' 

    for k, v in buck2uni.items(): 
        if not reverse: 
            string = string.replace(v, k) 
        else: 
            string = string.replace(k, v) 

    return string 
+0

Is there any such dictionary for the Urdu language? –

+0

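@ShanKhan - Not that I know of (not that I would know), but you could take the script above and modify the dictionary to work with Urdu. You would just need to look up the Unicode code points for all the letters. Good luck! – larapsodia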

+0

Thanks, I did this and it works –

3

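You are redoing the same work for every character. When you execute data = data.replace(k, v), it replaces every occurrence of the given character in the whole file. But you repeat that over and over in a loop, when you only need to do it once per transliteration pair. Just remove the outermost loop and the code will speed up enormously.

If you need to optimize it further, you could look at the string translate method. I'm not sure how it compares in terms of performance.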
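A minimal sketch of that fix, written for Python 3 (so `.items()` rather than `.iteritems()`). The function name `transliterate_text` and the five-entry `arab_buck` table are illustrative only, not the asker's full mapping:

```python
# Illustrative subset of an Arabic -> Buckwalter table (hypothetical name).
arab_buck = {
    "\u0645": "m",  # miim
    "\u0631": "r",  # raa'
    "\u062d": "H",  # Haa'
    "\u0628": "b",  # baa'
    "\u0627": "A",  # bare 'alif
}


def transliterate_text(data, table):
    """Replace every key of `table` found in `data` with its value."""
    # One pass per transliteration pair; no outer loop over the characters,
    # because str.replace already rewrites every occurrence in the string.
    for arabic, latin in table.items():
        data = data.replace(arabic, latin)
    return data


print(transliterate_text("\u0645\u0631\u062d\u0628\u0627", arab_buck))  # mrHbA
```

The number of `replace` calls now depends on the size of the table rather than on the length of the file.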

4

Whenever you have to do a transliteration, str.translate is the method to use:

>>> import timeit 
>>> buckArab = {"'":"ء", "|":"آ", "?":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"} 
>>> def repl(data, table): 
...  for k,v in table.iteritems(): 
...   data = data.replace(k, v) 
...  return data 
... 
>>> def trans(data, table): 
...  return data.translate(table) 
... 
>>> T = u'This is a test to see how fast is translitteration' 
>>> timeit.timeit('trans(T, buckArab)', 'from __main__ import trans, T, buckArab', number=10**6) 
6.766200065612793 
>>> T = 'This is a test to see how fast is translitteration' #in python2 requires ASCII string 
>>> timeit.timeit('repl(T, buckArab)', 'from __main__ import repl, T, buckArab', number=10**6) 
12.668706893920898 

As you can see, even for small strings str.translate is 2 times faster.
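For anyone rerunning this comparison on Python 3, here is a rough sketch. It assumes a translation table built with `str.maketrans`, which turns single-character keys into the code-point keys that `str.translate` looks up. The names `arab_buck`, `TABLE`, `with_replace` and `with_translate` are made up for the example, and only five letters are mapped. No timings are quoted, since they depend on the machine and the text.

```python
import timeit

# Illustrative Arabic -> Buckwalter subset (hypothetical, not the full scheme).
arab_buck = {
    "\u0645": "m",  # miim
    "\u0631": "r",  # raa'
    "\u062d": "H",  # Haa'
    "\u0628": "b",  # baa'
    "\u0627": "A",  # bare 'alif
}

# str.maketrans converts the one-character keys to integer code points.
TABLE = str.maketrans(arab_buck)


def with_replace(data):
    # One str.replace call per transliteration pair.
    for arabic, latin in arab_buck.items():
        data = data.replace(arabic, latin)
    return data


def with_translate(data):
    # A single pass over the string using the precomputed table.
    return data.translate(TABLE)


sample = "\u0645\u0631\u062d\u0628\u0627 " * 1000

# Both approaches should agree before their speed is compared.
assert with_replace(sample) == with_translate(sample)

for func in (with_replace, with_translate):
    seconds = timeit.timeit(lambda: func(sample), number=100)
    print(func.__name__, seconds)
```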

+0

buck = u"'|>&<}AbptvjHxd*rzs$SDTZEg_fqklmnhwYyFNKaui~o" arabic = u"ءآأؤإئابةتثجحخدذرزسشصضطظعغفقكلمنهوىي" trant = maketrans(arabic, buck) My code stops at the third line with this error message: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-44: ordinal not in range(128) – Sabba

+0

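I think the problem is that 'string.maketrans' only works with ASCII strings, whereas you want to do this for unicode. You already have a dictionary that maps Arabic to English, so why don't you use it like I did? – Bakuriu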

+0

Is there a dictionary like this for the Urdu language? –

3

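Whenever I use str.translate on a unicode object, it returns exactly the same object. Perhaps this is due to the change in behavior alluded to by Martijn Peters.

If anyone out there is struggling to transliterate unicode, such as Arabic, to ASCII, I found that mapping ordinals to unicode literals works well.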

>>> buckArab = {"'":"ء", "|":"آ", "?":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"} 
>>> ordbuckArab = {ord(v.decode('utf8')): unicode(k) for (k, v) in buckArab.iteritems()} 
>>> ordbuckArab 
{1569: u"'", 1570: u'|', 1571: u'?', 1572: u'&', 1573: u'<', 1574: u'}', 1575: u'A', 1576: u'b', 1577: u'p', 1578: u't', 1579: u'v', 1580: u'g', 1581: u'H', 1582: u'x', 1583: u'd', 1584: u'*', 1585: u'r', 1586: u'z', 1587: u's', 1588: u'$', 1589: u'S', 1590: u'D', 1591: u'T', 1592: u'Z', 1593: u'E', 1594: u'G', 1600: u'_', 1601: u'f', 1602: u'q', 1603: u'k', 1604: u'l', 1605: u'm', 1606: u'n', 1607: u'h', 1608: u'w', 1609: u'Y', 1610: u'y', 1611: u'F', 1612: u'N', 1613: u'K', 1614: u'a', 1615: u'u', 1616: u'i', 1617: u'~', 1618: u'o'} 
>>> u'طعصط'.translate(ordbuckArab) 
u'TEST' 
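On Python 3, where every `str` is already Unicode, the same idea needs neither `decode` nor `unicode`. A sketch under that assumption, using a three-letter subset and the hypothetical names `buck_arab`, `to_buckwalter` and `to_arabic`:

```python
# Buckwalter -> Arabic subset, following the dictionary above (illustrative).
buck_arab = {
    "T": "\u0637",  # Taa'
    "E": "\u0639",  # cayn
    "S": "\u0635",  # Saad
}

# str.translate looks characters up by code point, so key both tables by ord().
to_buckwalter = {ord(arabic): latin for latin, arabic in buck_arab.items()}
to_arabic = {ord(latin): arabic for latin, arabic in buck_arab.items()}

arabic_text = "\u0637\u0639\u0635\u0637"
print(arabic_text.translate(to_buckwalter))        # TEST
print("TEST".translate(to_arabic) == arabic_text)  # True
```

Characters that have no entry in the table pass through unchanged, so mixed text (digits, spaces, punctuation) survives the conversion.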
1

Expanding on @larapsodia's answer, here is the complete code, including the dictionary:

# -*- coding: utf-8 -*- 

# Arabic Transliteration based on Buckwalter 
# dictionary source is buckwalter2unicode.py http://www.redhat.com/archives/fedora-extras-commits/2007-June/msg03617.html 

buck2uni = {"'": u"\u0621", # hamza-on-the-line 
      "|": u"\u0622", # madda 
      ">": u"\u0623", # hamza-on-'alif 
      "&": u"\u0624", # hamza-on-waaw 
      "<": u"\u0625", # hamza-under-'alif 
      "}": u"\u0626", # hamza-on-yaa' 
      "A": u"\u0627", # bare 'alif 
      "b": u"\u0628", # baa' 
      "p": u"\u0629", # taa' marbuuTa 
      "t": u"\u062A", # taa' 
      "v": u"\u062B", # thaa' 
      "j": u"\u062C", # jiim 
      "H": u"\u062D", # Haa' 
      "x": u"\u062E", # khaa' 
      "d": u"\u062F", # daal 
      "*": u"\u0630", # dhaal 
      "r": u"\u0631", # raa' 
      "z": u"\u0632", # zaay 
      "s": u"\u0633", # siin 
      "$": u"\u0634", # shiin 
      "S": u"\u0635", # Saad 
      "D": u"\u0636", # Daad 
      "T": u"\u0637", # Taa' 
      "Z": u"\u0638", # Zaa' (DHaa') 
      "E": u"\u0639", # cayn 
      "g": u"\u063A", # ghayn 
      "_": u"\u0640", # taTwiil 
      "f": u"\u0641", # faa' 
      "q": u"\u0642", # qaaf 
      "k": u"\u0643", # kaaf 
      "l": u"\u0644", # laam 
      "m": u"\u0645", # miim 
      "n": u"\u0646", # nuun 
      "h": u"\u0647", # haa' 
      "w": u"\u0648", # waaw 
      "Y": u"\u0649", # 'alif maqSuura 
      "y": u"\u064A", # yaa' 
      "F": u"\u064B", # fatHatayn 
      "N": u"\u064C", # Dammatayn 
      "K": u"\u064D", # kasratayn 
      "a": u"\u064E", # fatHa 
      "u": u"\u064F", # Damma 
      "i": u"\u0650", # kasra 
      "~": u"\u0651", # shaddah 
      "o": u"\u0652", # sukuun 
      "`": u"\u0670", # dagger 'alif 
      "{": u"\u0671", # waSla 
} 

def transString(string, reverse=0): 
    '''Given a Unicode string, transliterate into Buckwalter. To go from 
    Buckwalter back to Unicode, set reverse=1''' 

    for k, v in buck2uni.items(): 
        if not reverse: 
            string = string.replace(v, k) 
        else: 
            string = string.replace(k, v) 

    return string 


>>> print(transString(u'مرحبا')) 
mrHbA 
>>> print(transString('mrHbA', 1)) 
مرحبا 
>>> 
+0

Could you please provide a reference for a dictionary like this for the Urdu language? –

+0

I needed '>>> print(transString(variable.decode('utf8')))' – shadi
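That comment points at the underlying issue: in Python 2 a plain `str` holds bytes, so it has to be decoded before a Unicode-keyed dictionary can match anything. One way to sidestep this, sketched here for illustration, is to decode while reading the file with `io.open`, which accepts an `encoding` argument on both Python 2 and 3. The helper name `read_unicode` and the UTF-8 default are assumptions for this example, not something stated in the thread.

```python
import io
import os
import tempfile


def read_unicode(path, encoding="utf-8"):
    """Return the file's contents as Unicode text (assumes UTF-8 by default)."""
    with io.open(path, encoding=encoding) as handle:
        return handle.read()


# Round-trip demo with a temporary file holding the Arabic word from above.
word = u"\u0645\u0631\u062d\u0628\u0627"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(word)
    text = read_unicode(path)
finally:
    os.remove(path)

print(text == word)  # True
```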