Python: removing odd apostrophes and other odd characters that are not in string.punctuation

This is my string:

mystring = "How’s it going?"

Here is what I have done:

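import string
exclude = set(string.punctuation)

def strip_punctuations(mystring):
    # Drop every ASCII punctuation character listed in string.punctuation
    new_string = ''.join(ch for ch in mystring if ch not in exclude)
    # Manually remove the UTF-8 bytes of the right single quotation mark (U+2019)
    new_string = new_string.replace("\xe2\x80\x99", "")
    # Manually remove two consecutive UTF-8 encoded no-break spaces (U+00A0)
    new_string = new_string.replace("\xc2\xa0\xc2\xa0", "")
    return new_string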

Output:

If I leave out the line that calls .replace("\xe2\x80\x99", ""), this is the output:

'How\xe2\x80\x99s it going'

I realized that exclude does not contain the odd apostrophe:

print set(exclude) 
set(['!', '#', '"', '%', '$', "'", '&', ')', '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', '<', '?', '>', '@', '[', ']', '\\', '_', '^', '`', '{', '}', '|', '~'])

How can I make sure all such characters are removed, without having to replace each one by hand in the future?

Source

2016-06-21 jxn

Python 2, I assume? –

Yep, Python 2.7. – jxn

You should not be handling these strings as UTF-8 byte strings. Decode them first. – Daniel
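
To illustrate the comment above, here is a minimal sketch of the decode-first approach. It assumes the input is UTF-8 encoded bytes; the function name strip_unicode_punctuation is made up for this example. Once the text is decoded, each character can be checked against its Unicode general category, so no hand-written list of characters is needed.

```python
import unicodedata

def strip_unicode_punctuation(raw_bytes, encoding="utf-8"):
    """Decode a byte string, then drop every character whose Unicode
    general category starts with 'P' (punctuation)."""
    # Assumption: the input really is encoded with `encoding`
    text = raw_bytes.decode(encoding)
    return u"".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

# The curly apostrophe U+2019 arrives as the bytes \xe2\x80\x99
cleaned = strip_unicode_punctuation(b"How\xe2\x80\x99s it going?")
print(cleaned)  # Hows it going
```

Note that the no-break space (U+00A0) is classified as a space separator, not punctuation, so this sketch leaves it in place; it would need to be handled separately.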

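If you are working with long texts such as news articles or web-scraping output, you can use the "goose" or "NLTK" Python libraries. Neither of them is preinstalled. Here are the links to the libraries: goose, NLTK

You can browse their documentation to learn how to do this.

If you do not want to use these libraries, you may need to build your own "exclude" list by hand.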
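
A hand-built exclude list could look roughly like the sketch below. The names EXTRA_CHARS, EXCLUDE and strip_excluded, and the particular extra characters chosen, are assumptions for illustration only; it also assumes the text has already been decoded to Unicode.

```python
import string

# Hypothetical extras: curly quotes, dashes, ellipsis, no-break space.
# Extend this with whatever characters show up in your own data.
EXTRA_CHARS = u"\u2018\u2019\u201c\u201d\u2013\u2014\u2026\u00a0"

# ASCII punctuation plus the hand-picked extras
EXCLUDE = set(string.punctuation) | set(EXTRA_CHARS)

def strip_excluded(text):
    """Return text without any character found in EXCLUDE."""
    return u"".join(ch for ch in text if ch not in EXCLUDE)

result = strip_excluded(u"\u201cHow\u2019s it going?\u201d")
print(result)  # Hows it going
```

The drawback is the one raised in the question: any character missing from the list passes through untouched, so the list has to be maintained as new data arrives.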

Source

2016-06-21 17:19:11

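import re

toReplace = "how's it going?"
# Character class of ASCII punctuation; the hyphen is escaped so it is
# not read as a range, and a raw string keeps the backslashes readable
regex = re.compile(r'[!#%$"&)\'(+*,\-/.;:=<?>@\[\]_^`{}|~\\]')
newVal = regex.sub('', toReplace)
print(newVal)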

The regular expression matches every character you listed and replaces it with an empty string.
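
The character class above covers ASCII punctuation only, so it would not match the curly apostrophe from the question. As a hedged alternative, the pattern can be inverted to remove everything that is neither a word character nor whitespace. This sketch assumes the text is already decoded to Unicode; the name strip_non_word is made up for the example.

```python
import re

# Anything that is not a word character (\w) or whitespace (\s).
# re.UNICODE makes \w and \s follow Unicode rules on Python 2 as well.
NON_WORD = re.compile(r"[^\w\s]", flags=re.UNICODE)

def strip_non_word(text):
    """Remove every character matched by NON_WORD from decoded text."""
    return NON_WORD.sub(u"", text)

new_val = strip_non_word(u"How\u2019s it going?")
print(new_val)  # Hows it going
```

Because the underscore counts as a word character, it is kept by this pattern, unlike in the explicit character class.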

Source

2016-06-21 17:32:04 Brunaldo
