How to remove special characters (like `'ŒðŸ'`) from tweets

I have to clean special characters such as ðŸ‘‰ðŸ‘ŒðŸ’¦âœ¨ out of tweets. To do this, I followed this strategy (I am using Python 3):

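Encode the tweet to bytes and convert the result back to a string, so that the special characters show up as hexadecimal escapes (for example, Ã turns into an escape beginning with \xc3);
Use a regular expression to remove the b' or b" (at the start of the string) and the ' or " (at the end of the string) that Python adds during the conversion;
Finally, remove the hexadecimal escapes, also with a regular expression.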

Here is my code:

import re
tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6 "'

# encode the string to UTF-8 bytes
tweet_en = tweet.encode('utf8')
# convert the bytes object to its string representation
tweet_str = str(tweet_en)
# remove the b' and b" at the beginning of the string
tweet_nob = re.sub(r'^(b\'b\")', '', tweet_str)
# remove the single or double quotation mark at the end of the string
tweet_noendquot = re.sub(r'\'\"$', '', tweet_nob)
# remove the hexadecimal escapes
tweet_regex = re.sub(r'\\x[a-f0-9]{2,}', '', tweet_noendquot)
print('this is tweet_regex: ', tweet_regex)

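The final output is: [/Very seldom~ will someone enter your life] to question " (from which I still cannot remove the final "). I would like to know whether there is a better, more direct way to clean special characters out of Twitter data. Any help would be appreciated.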

Source

2017-02-21 norpa

I think this will work fine if you are only looking for ASCII characters:

initial_str = 'Some text ðŸ‘‰ðŸ‘ŒðŸ’¦âœ¨ and some more text'
clean_str = ''.join([c for c in initial_str if ord(c) < 128])
print(clean_str)  # Some text  and some more text (two spaces remain around the removed characters)
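The same ASCII-only filter can also be expressed with the codec error handler built into `str.encode`. This is only a sketch and not part of the original answer; the helper name `strip_non_ascii` is made up for illustration, and the sample string spells the garbled characters with escapes so that it survives copy and paste.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    # errors='ignore' silently discards characters outside the ASCII range
    return text.encode('ascii', errors='ignore').decode('ascii')


# '\u00f0\u0178\u2018\u2030' is the garbled sequence 'ðŸ‘‰' written with escapes
sample = 'Some text \u00f0\u0178\u2018\u2030 and some more text'
cleaned = strip_non_ascii(sample)
# Collapse the extra whitespace left behind by the removed characters
cleaned = ' '.join(cleaned.split())
print(cleaned)  # Some text and some more text
```

Collapsing whitespace afterwards is optional; without it, the spaces on both sides of the removed characters stay in the result.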

You can also use ord(c) in range() and give it the range of characters you want to keep (which could include emoji).
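A range-based filter along those lines might look like the sketch below. The names `KEEP_RANGES` and `keep_allowed` are hypothetical, and the ranges are an illustrative assumption (printable ASCII plus two Unicode emoji blocks), not a complete list of emoji code points.

```python
# Code-point ranges to keep. Both the name and the choice of ranges are
# illustrative assumptions, not an exhaustive emoji definition.
KEEP_RANGES = [
    range(0x20, 0x7F),        # printable ASCII
    range(0x1F300, 0x1F650),  # Misc Symbols and Pictographs + Emoticons blocks
]


def keep_allowed(text: str) -> str:
    """Keep only characters whose code point falls in one of KEEP_RANGES."""
    return ''.join(c for c in text if any(ord(c) in r for r in KEEP_RANGES))


# The thumbs-up emoji (U+1F44D) is kept, the accented letter is dropped
print(keep_allowed('good \U0001F44D caf\u00e9'))
```

Note that this only keeps emoji that are correctly decoded in the string; garbled sequences such as ðŸ‘‰ are made of ordinary non-ASCII characters and would still be removed.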

Source

2017-02-21 15:44:49 squgeim
