Python regular expressions and the letters ØÆÅ

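I'm new to Python, so this probably looks easy. I'm trying to remove every # and every digit, and when the same letter repeats more than twice in a row, I need to reduce the run to just two letters. This works perfectly, except with ØÆÅ.

Any ideas how to make this work with the letters ØÆÅ?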

#!/usr/bin/python 
# -*- coding: utf-8 -*- 

import math, re, sys, os, codecs 
reload(sys) 
sys.setdefaultencoding('utf-8') 
text = "ån9d ånd ååååånd d9d flllllløde... :)asd " 

# Remove '#' and digits, then collapse runs of a repeated letter to two
text = re.sub(r'#', "", text) 
text = re.sub(r"\d", "", text) 
text = re.sub(r'(\w)\1+', r'\1\1', text) 
print "Phone Num : "+ text

The result I currently get is:

Phone Num : ånd ånd ååååånd dd flløde... :)asd

What I want is:

Phone Num : ånd ånd åånd dd flløde... :)asd

Source

2013-05-15 boje

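We've covered this before, haven't we? Use Unicode, not byte strings.

From [my answer to your previous question](http://stackoverflow.com/questions/16549161/python-re-compile-and-split-with-charcters/16549766#16549766): *In Python 2, you'd use [unicode string example]; note the leading u prefix on the string* and *[regular expression with re.UNICODE set]*.

Hi @MartijnPieters, after reading your comments and trying a few things, I found a solution. – boje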

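You need to work with Unicode values, not byte strings. A UTF-8 encoded å is two bytes, and in the default, non-Unicode-aware mode \w matches only ASCII letters, digits and the underscore.

From the re module documentation on \w:

When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.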
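The byte-versus-character point above can be checked directly. The following is only a minimal sketch and assumes Python 3 (the thread itself uses Python 2), where str patterns are Unicode-aware by default and re.ASCII gives the [a-zA-Z0-9_] behaviour described in the quote; the variable names are illustrative.

```python
import re

# 'å' is a single code point, but two bytes once encoded as UTF-8.
letter = "å"
encoded = letter.encode("utf-8")
print(len(letter), len(encoded))

# A bytes pattern sees the two bytes separately; neither is an ASCII word character.
bytes_hits = re.findall(rb"\w", encoded)

# re.ASCII restricts \w to [a-zA-Z0-9_], like the "no flags" case in the quote.
ascii_hits = re.findall(r"\w", "ånd", flags=re.ASCII)

# Without re.ASCII, a Python 3 str pattern treats å as a word character.
unicode_hits = re.findall(r"\w", "ånd")

print(bytes_hits, ascii_hits, unicode_hits)
```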

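Even after you switch to proper Unicode values (a u'' literal, or source data decoded to Unicode), use a Unicode pattern (re.sub(ur'...')), and add re.UNICODE so that \w matches Unicode alphanumeric characters, there is a pitfall: the fourth positional argument of re.sub() is count, not flags. Passed positionally, re.UNICODE is taken as a replacement limit and \w stays in ASCII-only mode:

>>> print re.sub(ur'(\w)\1+', r'\1\1', text, re.UNICODE)
ånd ånd ååååånd dd flløde... :)asd

because å is still not recognized as alphanumeric:

>>> print re.sub(ur'\w', '', text, re.UNICODE)
å å ååååå ø... :)

Passing the flag by keyword (flags=re.UNICODE, available from Python 2.7) fixes this. Another option is the external regex library, a version of the re library with fuller Unicode support:

>>> import regex
>>> print regex.sub(ur'(\w)\1+', r'\1\1', text, flags=regex.UNICODE)
ånd ånd åånd dd flløde... :)asd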

That module can do more than recognize additional alphanumeric characters in Unicode values; see the linked package page for more details.
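As a side note to the examples in this answer: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a flag passed as the fourth positional argument is read as count. A minimal Python 3 sketch of the clean-up with the flag passed by keyword follows; collapse_repeats is a hypothetical helper name, and in Python 3 the explicit re.UNICODE is redundant for str patterns (it is kept only to mirror the Python 2 call).

```python
import re

def collapse_repeats(text):
    """Drop '#' and digits, then shorten runs of a repeated word character to two."""
    text = re.sub(r"[#\d]", "", text)
    # flags= must be passed by keyword; the fourth positional parameter is count.
    return re.sub(r"(\w)\1+", r"\1\1", text, flags=re.UNICODE)

sample = "ån9d ånd ååååånd d9d flllllløde... :)asd "
result = collapse_repeats(sample)
print(result)
```

Because \w is Unicode-aware here, the run of å is shortened while the "..." is left alone.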

Source

2013-05-15 09:01:47

Change:

text = u"ån9d ånd åååååååånd d9d flllllløde... :)asd "

and

text = re.sub(r'(\w)\1+', r'\1\1', text)

COMPLETE SOLUTION

import math, re, sys, os, codecs 
reload(sys) 
sys.setdefaultencoding('utf-8') 
text = u"ån9d ånd åååååååånd d9d flllllløde... :)asd " 

# Remove '#' and digits, then collapse runs of any repeated character to two
text = re.sub(r'#', "", text) 
text = re.sub(r"\d", "", text) 
text = re.sub(r'(\w)\1+', r'\1\1', text) 
text = re.sub(r'(\W)\1+', r'\1\1', text) 
print "1: "+ text

Prints:

1: ånd ånd åånd dd flløde.. :)asd
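Why this works without re.UNICODE may not be obvious: in ASCII-only mode å is not a word character, so it is the (\W) rule, not the (\w) rule, that shortens the run of å, and the same rule also shortens "...". A rough Python 3 sketch of the two steps, assuming re.ASCII as a stand-in for Python 2's default matching mode and starting from the text with digits already removed:

```python
import re

text = "ånd ånd åååååååånd dd flllllløde... :)asd "

# With re.ASCII, \w is [a-zA-Z0-9_], so å counts as a non-word character.
step1 = re.sub(r"(\w)\1+", r"\1\1", text, flags=re.ASCII)   # shortens only the run of l
step2 = re.sub(r"(\W)\1+", r"\1\1", step1, flags=re.ASCII)  # shortens the å and '.' runs

print(step1)
print(step2)
```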

Source

2013-05-15 09:14:16 boje

That's an option too; note that you are now changing '...' to '..', but that may suit your needs.
