What is a good strategy for grouping similar words?

Say I have a list of movie names with misspellings and small variations, like this:

"Pirates of the Caribbean: The Curse of the Black Pearl" 
"Pirates of the carribean" 
"Pirates of the Caribbean: Dead Man's Chest" 
"Pirates of the Caribbean trilogy" 
"Pirates of the Caribbean" 
"Pirates Of The Carribean"

How do I group or find such sets of words, preferably using Python and/or Redis?

Source

2011-07-05 abc def foo bar

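What result are you trying to get? Do you want to find all of these variants within a whole string? – JMax

I want to group these together into one combined object and run a check when adding to the database. –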

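Have a look at "fuzzy matching". The thread linked below lists some great tools that calculate the similarity between strings.

I'm especially fond of the difflib module: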

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison
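As a rough illustration of how difflib could be applied to the titles in the question, here is a minimal sketch that groups them greedily with `difflib.SequenceMatcher`. The function names `normalize` and `group_similar` and the 0.85 threshold are assumptions made for this example; they are not part of difflib, and the threshold would need tuning on real data.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase and collapse whitespace so case differences do not count."""
    return " ".join(title.lower().split())


def group_similar(titles, threshold=0.85):
    """Greedy grouping: each title joins the first group whose first member
    is at least `threshold` similar to it, otherwise it starts a new group."""
    groups = []
    for title in titles:
        key = normalize(title)
        for group in groups:
            representative = normalize(group[0])
            if SequenceMatcher(None, representative, key).ratio() >= threshold:
                group.append(title)
                break
        else:
            # No existing group was similar enough.
            groups.append([title])
    return groups


titles = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the carribean",
    "Pirates of the Caribbean: Dead Man's Chest",
    "Pirates of the Caribbean trilogy",
    "Pirates of the Caribbean",
    "Pirates Of The Carribean",
]

groups = group_similar(titles)
for group in groups:
    print(group)
```

With these inputs and this threshold, the three spellings of the bare title land in one group, while the two subtitled films and the trilogy entry each stay on their own. A lower threshold merges more aggressively. Comparing only against the first member of each group keeps the sketch short, but it makes the result depend on input order.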

Source

2011-07-05 07:41:07

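The linked question seems to have been deleted. It looks like it was – hardmooth

Once you reach a certain reputation score you can still see deleted questions, so I'm leaving the link as it is. –

@FredrikPihl could you please post the definition of 'get_close_matches' here (or edit it into the answer) for us undeserving low-reputation peasants? –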

To add another tip to Fredrik's answer, you could also take inspiration from search-engine-like code, such as this:

import re

# BASE_DIR, remove_tags() and dofind() are not defined in this snippet;
# they are assumed to exist elsewhere in the program.
def dosearch(terms, searchtype, case, adddir, files=None):
    """Return alternating [title, filename, ...] entries for matching HTML files."""
    found = []
    if files is not None:
        titlesrch = re.compile('<title>.*</title>')
        for file in files:
            title = ""
            # Only look at HTML files.
            if not file.lower().endswith(("html", "htm")):
                continue
            with open(BASE_DIR + adddir + file, 'r') as handle:
                filecontents = handle.read()
            titletmp = titlesrch.search(filecontents)
            if titletmp is not None:
                # Skip the 7-character <title> and 8-character </title> tags.
                title = filecontents[titletmp.start() + 7:titletmp.end() - 8]
            filecontents = remove_tags(filecontents).strip()
            if dofind(filecontents, case, searchtype, terms) > 0:
                found.append(title)
                found.append(file)
    return found

Regards,

Max

Source

2011-07-05 07:50:10 JMax

I believe there are in fact two distinct problems.

The first is spelling correction. You can find one written in Python here:

http://norvig.com/spell-correct.html
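To give an idea of what such a corrector does, here is a compact sketch in the spirit of that approach, not a copy of the linked code. It generates every string within one or two simple edits of a word and keeps the known candidate with the highest count. The `WORD_COUNTS` table is hypothetical and tiny; a real corrector would build its counts from a large text corpus.

```python
# Hypothetical word counts, for illustration only.
WORD_COUNTS = {"pirates": 6, "of": 8, "the": 9, "caribbean": 4,
               "curse": 1, "black": 1, "pearl": 1, "dead": 1, "chest": 1}
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that appear in the vocabulary."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Prefer the word itself, then known words 1 edit away, then 2 edits away."""
    word = word.lower()
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or {word})
    return max(candidates, key=lambda w: WORD_COUNTS.get(w, 0))


print(" ".join(correct(w) for w in "Pirates of the carribean".split()))
```

Note that "carribean" is two edits away from "caribbean", so the two-edit step is what catches it here. Words that match nothing in the vocabulary are returned unchanged.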

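The second is more functional: it is what I would do after the spelling correction. I would define a relation function.

related(sentence1, sentence2) holds if and only if sentence1 and sentence2 have rare words in common. By rare, I mean words other than very frequent ones (the, what, is, etc.). You can look at the TF/IDF system to determine whether two documents are related based on the words they use. Just googling a bit, I found this: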

https://code.google.com/p/tfidf/
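The relation function described above could be sketched as follows. This is only an illustration under stated assumptions: it uses a plain inverse document frequency, log(N / document count), as the measure of rarity, and the names `tokenize`, `idf_weights`, `related`, the `min_weight` cutoff, and the small sample corpus are all made up for the example. It does not use the linked tfidf project.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def idf_weights(documents):
    """Map each word to log(N / number of documents containing it)."""
    total = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {word: math.log(total / count) for word, count in doc_freq.items()}


def related(sentence1, sentence2, idf, min_weight=1.0):
    """True if the sentences share at least one word rarer than min_weight.

    Words missing from `idf` are ignored in this sketch.
    """
    shared = set(tokenize(sentence1)) & set(tokenize(sentence2))
    return any(idf.get(word, 0.0) >= min_weight for word in shared)


# Illustrative corpus; a real one would be the whole title database.
docs = [
    "Pirates of the Caribbean: Dead Man's Chest",
    "Pirates of the Caribbean trilogy",
    "The Lord of the Rings: The Return of the King",
    "The Return of the Jedi",
    "The Matrix",
    "The Silence of the Lambs",
]
idf = idf_weights(docs)

print(related(docs[0], docs[1], idf))  # share "pirates" and "caribbean"
print(related(docs[0], docs[5], idf))  # share only "of" and "the"
```

Because "the" appears in every sample title its weight is zero, and "of" is nearly as common, so neither is enough to relate two titles. The cutoff of 1.0 is arbitrary and depends on the size of the corpus.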

Source

2011-07-05 10:38:08 yogsototh