String preprocessing

I am processing a list of strings that may contain extra letters added to their original spelling, for example:

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']

I want to preprocess these strings so that they are spelled correctly, producing a new list:

cleaned_words = ['why', 'hey', 'alright', 'cool', 'monday']

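The length of the run of repeated letters can vary; however, cool should obviously keep its spelling.

I don't know of any Python library that does this, and I would prefer to avoid hard-coding it.

I tried this: http://norvig.com/spell-correct.html but the more words you put in the text file, the more chances it seems to have of suggesting an incorrect spelling, so it never actually gets the spelling right, even before any extra letters are removed. For example, eel becomes teel ...

Thanks in advance.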

Source

2016-02-05 user47467

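Since the task is very language-dependent, Python itself can't do it for you. Try looking for a spelling-correction package, for example https://pypi.python.org/pypi/autocorrect/0.1.0 – javad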

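Please take a look at this post: http://stackoverflow.com/questions/4500752/python-check-whether-a-word-is-spelled-correctly. I suggest: 1) check the spelling of each word; 2) if it is incorrect, use a loop to try removing repeated letters until the spelling is correct. – Quinn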

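I don't think you will get any real answers unless you provide some code you have written, or any reasoning you have come up with (algorithms, papers, links). – Markon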
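Quinn's two-step suggestion could be sketched roughly as follows. This is only an illustration: `KNOWN_WORDS` is a tiny stand-in for a real dictionary, and the helper names `candidates` and `clean` are hypothetical. It assumes that no English word needs three identical letters in a row, so each run of a repeated letter is tried at length one and length two, and the shortest candidate found in the word list wins.

```python
import itertools
import re

# Assumption: a tiny stand-in for a real English word list.
KNOWN_WORDS = {'why', 'hey', 'alright', 'cool', 'monday', 'eel'}

def candidates(word):
    """Yield every spelling obtained by shrinking each run of a
    repeated letter to either one or two copies."""
    runs = [m.group(0) for m in re.finditer(r'(.)\1*', word)]
    options = [(run[0],) if len(run) == 1 else (run[0], run[0] * 2)
               for run in runs]
    for combo in itertools.product(*options):
        yield ''.join(combo)

def clean(word, known=KNOWN_WORDS):
    """Return the word if it is known, else the shortest known
    candidate, else the (lowercased) input unchanged."""
    word = word.lower()
    if word in known:
        return word
    matches = [c for c in candidates(word) if c in known]
    return min(matches, key=len) if matches else word

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']
print([clean(w) for w in words])
# ['why', 'hey', 'alright', 'cool', 'monday']
```

The number of candidates doubles with every repeated run, which is fine for single words but worth keeping in mind for long inputs. Words that are not found are returned unchanged rather than guessed at.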

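If you download a text file of all English words to check against, this is another approach that could work.

I haven't tested it, but you get the idea. It iterates over the letters, and if the current letter matches the previous one, it removes that letter from the word. If it has reduced those letters to one and still has no valid word, it resets the word to its original form and continues until it finds the next repeated character.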

import urllib.request

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']

# Download the word list; splitlines() copes with both '\n' and '\r\n' endings
url = 'https://raw.githubusercontent.com/eneko/data-repository/master/data/words.txt'
word_list = set(i.strip().lower() for i in urllib.request.urlopen(url).read().decode('utf-8').splitlines())

found_words = []
for word in (i.lower() for i in words):

    # Keep the word as-is if it is already valid
    if word in word_list:
        found_words.append(word)
        continue

    last_char = None
    i = 0
    current_word = word
    while i < len(current_word):

        # Duplicate of the previous character: remove it and stay at this index
        if current_word[i] == last_char:
            current_word = current_word[:i] + current_word[i + 1:]

        # Not a duplicate: reset the word, remember this character, move on
        else:
            current_word = word
            last_char = current_word[i]
            i += 1

        # Word has been found
        if current_word in word_list:
            found_words.append(current_word)
            break

print(found_words)
# Expected, provided all five words are in the word list:
# ['why', 'hey', 'alright', 'cool', 'monday']

Source

2016-02-05 14:25:54 Peter

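Upvote. I like the idea. The current output is ['whyyy', 'heyy', 'alrighttt', 'cool', 'mmmmonday'], so it removes some of the trailing characters, but not all. Any idea why? – user47467

Sorry, I was stepping through each character once as if the word stayed the same size, which it doesn't; I have managed to fix it. For the record, you need to check that 'word_list' is correct: I had to use 'f.read().split('\r\n')', since it is a text file with each word on a new line. – Peter

For some reason I get a string index out of range error, on the line 'last_char = current_word[i]' – user47467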
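Peter's remark about line endings matters when the raw text is split by hand: splitting on '\n' alone leaves a trailing '\r' on every word from a file with Windows line endings, so no lookup ever succeeds. A minimal sketch of a parser that sidesteps this (the name `parse_word_list` and the sample text are made up for illustration):

```python
def parse_word_list(text):
    """Turn raw word-list text (one word per line) into a lowercase set."""
    # str.splitlines() handles '\n', '\r\n' and '\r';
    # strip() drops stray whitespace, and blank lines are skipped.
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

raw = 'Why\r\nhey\r\n\r\ncool\r\n'   # made-up sample with Windows line endings

naive = set(raw.split('\n'))          # every entry keeps its '\r'
print('hey' in naive)                 # False
print(sorted(parse_word_list(raw)))   # ['cool', 'hey', 'why']
```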

If it is only repeated letters that need stripping, then the regular expression module re may help (it leaves "cool" unchanged):

>>> import re 
>>> re.sub(r'(.)\1+$', r'\1', 'cool') 
'cool' 
>>> re.sub(r'(.)\1+$', r'\1', 'coolllll') 
'cool'

The corresponding substitution for leading repeated characters would be:

>>> re.sub(r'^(.)\1+', r'\1', 'mmmmonday') 
'monday'

Of course, this fails for words that legitimately begin or end with repeated letters...
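One possible way to soften that limitation, offered as a sketch rather than as part of the answer itself: apply the substitutions only when the word is not already in a known-word set, and accept a result only if it is in that set. `KNOWN_WORDS` and `strip_repeats` are hypothetical names, and the tiny set stands in for a real dictionary.

```python
import re

# Assumption: a tiny stand-in for a real English word list.
KNOWN_WORDS = {'cool', 'monday', 'eel', 'why'}

def strip_repeats(word, known=KNOWN_WORDS):
    """Collapse leading/trailing repeats, but only if that yields a known word."""
    if word in known:
        return word  # already valid, e.g. 'eel' must not become 'el'
    trailing = re.sub(r'(.)\1+$', r'\1', word)
    leading = re.sub(r'^(.)\1+', r'\1', word)
    both = re.sub(r'^(.)\1+', r'\1', trailing)
    for candidate in (trailing, leading, both):
        if candidate in known:
            return candidate
    return word  # nothing matched, so leave the input untouched

print(strip_repeats('eel'))        # eel
print(strip_repeats('mmmmonday'))  # monday
print(strip_repeats('whyyyyyy'))   # why
print(strip_repeats('eeel'))       # eeel
```

The last call shows what remains unsolved: collapsing 'eeel' to a single leading letter gives 'el', which is not in the set, so the input comes back unchanged. Recovering 'eel' would need candidates that keep a doubled letter.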

Source

2016-02-05 14:18:24 haavee

OK, the brute-force way:

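words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']

res = []
for word in words:
    # Strip repeated trailing letters (the length check avoids an IndexError on short words)
    while len(word) > 1 and word[-2] == word[-1]:
        word = word[:-1]
    # Strip repeated leading letters
    while len(word) > 1 and word[0] == word[1]:
        word = word[1:]
    res.append(word)
print(res)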

Result: ['why', 'hey', 'alright', 'cool', 'monday']

Source

2016-02-05 16:47:56 Clodion
