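Fastest way to determine whether two strings differ by a single letter in a large set of strings

I want to compare two strings and add one of them to a list if they are almost equal (they differ by a single letter). What is the fastest way to do this? My set of words has more than 90k entries, and doing this often takes far too long.

EDIT: One of the words (comparison_word in the code below) does not change.

EDIT2: The words must be of equal length.

This is my current code: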

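def find_words(set_of_words, comparison_word):  # hypothetical wrapper; the post shows only the body
    list_of_words = []
    for word in set_of_words:
        amount = 0
        if len(word) == len(comparison_word):
            for i in range(len(word)):
                if comparison_word[i] != word[i]:
                    amount += 1
            if amount == 1:
                list_of_words.append(word)
    return list_of_words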

Source

2016-03-24 sleepless

What about 'foo' and 'fo'? –

How does your set_of_words change? –

This reduces the amount of work being done:

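def find_words(set_of_words, comparison_word):  # hypothetical wrapper; the post shows only the body
    list_of_words = []
    n_comparison_word = len(comparison_word)  # computed once
    for word in set_of_words:
        amount = 0
        n_word = len(word)  # computed once per iteration
        if n_word != n_comparison_word:
            continue
        for i in range(n_word):
            if comparison_word[i] != word[i]:
                amount += 1
            if amount == 2:
                break  # a second difference rules the word out
        if amount == 1:
            list_of_words.append(word)
    return list_of_words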

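Some remarks:

The value of len(comparison_word) needs to be computed only once (ever).
The value of len(word) needs to be computed once (per loop iteration).
You know you can stop looking at a word as soon as amount reaches 2 (or more; either way the word can no longer be part of the result).

It may be worth reading this part of the Python documentation about the continue and break statements, both of which are used in the code.

Source

2016-03-24 19:02:21

I think they are trying to compare all the words. So I suppose the first step is to create a dictionary keyed by word length, where each entry holds a list of unique words. –

This reduced my average time from 40 seconds to 30 seconds – sleepless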
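
The length-keyed dictionary suggested in the comment above could look roughly like the following. This is only a sketch: group_by_length and differs_by_one are illustrative names that do not appear in the thread, and the sample words are made up.

```python
from collections import defaultdict

def group_by_length(words):
    # Map each word length to the set of unique words of that length.
    by_length = defaultdict(set)
    for word in words:
        by_length[len(word)].add(word)
    return by_length

def differs_by_one(a, b):
    # Assumes a and b have the same length.
    return sum(x != y for x, y in zip(a, b)) == 1

set_of_words = ["foo", "bar", "foob", "foe", "foo"]
comparison_word = "foa"

by_length = group_by_length(set_of_words)
# Only words of the right length are ever compared letter by letter.
same_length = by_length.get(len(comparison_word), ())
list_of_words = sorted(w for w in same_length if differs_by_one(w, comparison_word))
print(list_of_words)  # ['foe', 'foo']
```

Building the dictionary costs one pass over the words, so it mainly pays off when several comparison words are looked up against the same word set.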

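I have not tested this exhaustively, but if comparison_word is not too long (fewer than 6 letters) and your set_of_words can change, it may be worth computing every acceptable word up front, storing them in a set, and then simply iterating over set_of_words and testing word in acceptable_words.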
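
A minimal sketch of that precomputation, assuming the words use only lowercase ASCII letters. The helper name build_acceptable_words and its alphabet parameter are illustrative, not something the answer specifies.

```python
import string

def build_acceptable_words(comparison_word, alphabet=string.ascii_lowercase):
    # Every string obtained by replacing exactly one letter of comparison_word.
    variants = set()
    for i, original in enumerate(comparison_word):
        for letter in alphabet:
            if letter != original:
                variants.add(comparison_word[:i] + letter + comparison_word[i + 1:])
    return variants

set_of_words = {"foo", "bar", "foob", "foe", "foa"}
acceptable_words = build_acceptable_words("foa")

# Each membership test is a hash lookup instead of a letter-by-letter comparison.
list_of_words = sorted(word for word in set_of_words if word in acceptable_words)
print(list_of_words)  # ['foe', 'foo']
```

With a 26-letter alphabet this produces 25 variants per position, so the set stays small for short words.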

If that does not apply, here is my take on your code:

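for word in set_of_words:
    different_letter_exists = False
    if len(word) == len(comparison_word):
        for i, letter in enumerate(word):
            if letter != comparison_word[i]:
                if different_letter_exists:
                    break  # second differing letter: reject the word
                different_letter_exists = True
        else:
            # The loop ran to the end without a break: at most one difference.
            if different_letter_exists:
                list_of_words.append(word)

In essence: for each word, the first time you meet a differing letter, different_letter_exists is set to True. If you meet another one, you break out of the loop. The word is appended only in the else branch of the for loop, which runs only when enumerate reaches the end without a break, and only if different_letter_exists is True. Together these conditions hold only when exactly one letter differs.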

Good luck :)

Source

2016-03-24 19:18:38 thecardkid

You may find zip more efficient than indexing:

def almost_equal(set_of_words, comp):
    ln = len(comp)
    for word in set_of_words:
        count = 0
        if len(word) == ln:
            for a, b in zip(word, comp):
                count += a != b  # True counts as 1
                if count == 2:
                    break
            else:
                # No break: fewer than two differences; skip exact matches.
                if count == 1:
                    yield word

Demo:

In [5]: list(almost_equal(["foo","bar","foob","foe"],"foa")) 
Out[5]: ['foo', 'foe']

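Source

2016-03-24 19:22:30

This, together with [this solution](http://stackoverflow.com/a/36208085/3566755), brought my average time down from 40 seconds to 28 seconds – sleepless

Appending to a list might be slightly faster –

The following searches a 61K-word dictionary in about 25 milliseconds for me.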

import re

def search(word, text):
    # One alternative per position: that position may be anything but the original letter.
    ws = [r'\b{}[^{}]{}\b'.format(word[:i], word[i], word[i+1:]) for i in range(len(word))]

    for mo in re.finditer('|'.join(ws), text):
        yield mo.group()

with open("/12dicts/5desk.txt") as f:
    text = f.read()

for hit in search('zealoos', text):
    print(hit)  # prints zealous

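This assumes the list of strings is in a file, one string per line. It reads the file into one long string and uses a regular expression to search for matching strings.

search() takes a word such as "what" and turns it into a regular expression like this: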

\b[^w]hat\b|\bw[^h]at\b|\bwh[^a]t\b|\bwha[^t]\b

It then scans all the words and finds all the near misses, at C speed.
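
To try the idea without a dictionary file, here is a small self-contained sketch that runs on an in-memory string. It departs from the answer in two ways, both my own assumptions: the pieces are passed through re.escape, and the wildcard position uses [^\W…] instead of [^…], so that it matches only a word character and cannot swallow the line break in front of a shorter word. The names build_pattern and sample_text are illustrative.

```python
import re

def build_pattern(word):
    # One alternative per position; that position accepts any word
    # character except the original letter.
    alternatives = [
        r'\b{}[^\W{}]{}\b'.format(
            re.escape(word[:i]), re.escape(word[i]), re.escape(word[i + 1:])
        )
        for i in range(len(word))
    ]
    return '|'.join(alternatives)

def search(word, text):
    # Collect every whole word in text that differs from word in one position.
    return [mo.group() for mo in re.finditer(build_pattern(word), text)]

sample_text = "what\nthat\nhat\nwhit\nwheat\nchat\nwham\n"
hits = search('what', sample_text)
print(hits)  # ['that', 'whit', 'chat', 'wham']
```

The exact match "what", the shorter "hat" and the longer "wheat" are all skipped, because each alternative is anchored by \b on both sides and fixes every letter but one.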

Source

2016-03-25 07:45:39 RootTwo
