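Search: match all words across multiple columns

I'm trying to add a large number of names (1,700) to an existing database, but I need to check for duplicates. In fact, we assume that most of them are duplicates. Unfortunately, the names come from address labels and are not separated into fields (some are organization names, some are people's names). To lighten the load on the humans, I want to search for good matches on the name first. By a good match, I mean that I want all the words in the name (John Julie Smith) to match across several database fields (title, firstname, lastname, suffix, spousename). So if firstname is John, lastname is Smith, and spousename is Julie, that's a match; or if firstname (in the db) is "John Julie" and lastname is "Smith", that's a match too.

I'm partway through a script that would do it all in PHP and run a separate query for every possibility. Like lastname = 'john julie smith', firstname = 'john julie smith' ... lastname = 'john julie' AND firstname = 'smith', and so on and so forth! That's 105 queries for a three-word name, and I have 1,700 names to process. That sounds ridiculous.

I'm fairly comfortable with PHP, but I'm not that good with MySQL. Is there a single query that can try to match all the words across multiple columns? Even if it only handles one of the name combinations ("John", "Julie", "Smith" or "John Julie", "Smith"). Maybe even using regex?

Here's where I am so far: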

foreach ($a as $name) {
    // There's some more stuff up here to prepare the strings,
    // removing &/and, punctuation, making everything lower case...

    $na = explode(" ", $name);

    // A name of n words has n - 1 gaps; each gap is either a split or not,
    // so there are 2^(n - 1) ways to group the words.
    $divisions = count($na) - 1;
    $poss = array();
    for ($i = 0; $i < pow(2, $divisions); $i++) {
        // Character k of $div (0-based) says whether to split before word k + 1.
        $div = str_pad(decbin($i), $divisions, '0', STR_PAD_LEFT);
        $tpa = array();
        $tps = '';
        foreach ($na as $nak => $nav) {
            if ($nak > 0 && substr($div, $nak - 1, 1)) {
                // Split here: close the current group and start a new one.
                $tpa[] = $tps;
                $tps = $nav;
            } else {
                // No split: append the word to the current group.
                $tps = trim($tps . ' ' . $nav);
            }
        }
        $tpa[] = $tps;
        $poss[] = $tpa;
    }
    foreach ($poss as $possk => $possv) {
        $count = count($possv);
        // Here's where I am...
        // I could use $count and some math to come up with all the possible searches here,
        // but my head is starting to spin as I try to think of how to do that.
    }

    die();
}

So far, the PHP builds an array ($poss) of every possible way to group the words of the name into strings. For "John Julie Smith", the array looks like this:

Array 
(
    [0] => Array 
     (
      [0] => john julie smith 
     ) 

    [1] => Array 
     (
      [0] => john julie 
      [1] => smith 
     ) 

    [2] => Array 
     (
      [0] => john 
      [1] => julie smith 
     ) 

    [3] => Array 
     (
      [0] => john 
      [1] => julie 
      [2] => smith 
     ) 

)

The original idea was to iterate over the array and create a bazillion queries. For [0], there would be 5 queries:

... WHERE firstname = 'john julie smith'; 
... WHERE lastname = 'john julie smith'; 
... WHERE spousename = 'john julie smith'; 
... WHERE title = 'john julie smith'; 
... WHERE suffix = 'john julie smith';

But for [1] there would be 20 queries:

... WHERE firstname = 'john julie' AND lastname = 'smith';
... WHERE firstname = 'john julie' AND spousename = 'smith';
... WHERE firstname = 'john julie' AND title = 'smith';
... WHERE firstname = 'john julie' AND suffix = 'smith';
... WHERE lastname = 'john julie' AND firstname = 'smith';
... WHERE lastname = 'john julie' AND spousename = 'smith';
... WHERE lastname = 'john julie' AND title = 'smith';
... WHERE lastname = 'john julie' AND suffix = 'smith';
//and on and on

For [3] there would be 60 queries! At this rate I'm looking at more than 170,000 queries!

There has to be a better way...
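The counts above (5, 20 and 60) follow from assigning each group of words to a different column. A minimal sketch of that arithmetic, assuming the five columns named in the question; the helper names `columnAssignments` and `conditionsForGrouping` are made up for illustration and are not part of the original script. Each condition uses `?` placeholders, so the conditions for one name could in principle be joined with OR into a single prepared query instead of being run one at a time.

```php
<?php
// Hypothetical helper: every ordered choice of $k distinct columns.
function columnAssignments(array $columns, int $k): array
{
    if ($k === 0) {
        return [[]];
    }
    $result = [];
    foreach ($columns as $i => $column) {
        // Remove the chosen column so it cannot be used twice.
        $rest = $columns;
        unset($rest[$i]);
        foreach (columnAssignments(array_values($rest), $k - 1) as $tail) {
            $result[] = array_merge([$column], $tail);
        }
    }
    return $result;
}

// Hypothetical helper: one parenthesised condition per assignment.
// The values bound to the placeholders are the entries of $parts, in order.
function conditionsForGrouping(array $parts, array $columns): array
{
    $conditions = [];
    foreach (columnAssignments($columns, count($parts)) as $assignment) {
        $terms = [];
        foreach ($assignment as $column) {
            $terms[] = "$column = ?";
        }
        $conditions[] = '(' . implode(' AND ', $terms) . ')';
    }
    return $conditions;
}

// Column names as listed in the question.
$columns = ['title', 'firstname', 'lastname', 'suffix', 'spousename'];

// The four groupings of "john julie smith" shown above.
$poss = [
    ['john julie smith'],
    ['john julie', 'smith'],
    ['john', 'julie smith'],
    ['john', 'julie', 'smith'],
];

$counts = [];
foreach ($poss as $parts) {
    $counts[] = count(conditionsForGrouping($parts, $columns));
}
// $counts is now [5, 20, 20, 60], which adds up to 105.
```

Multiplying 105 by 1,700 names gives 178,500 conditions, which is where the "more than 170,000" figure comes from.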

Source

2014-05-16 Stevish

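I added the "what I've done so far" part. It's pretty hairy, and I'm only halfway there. I'm hoping for a more streamlined approach and would appreciate any ideas. – Stevish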

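Load the 1,700 names into a table in MySQL.

Then I think the following approach will help. It looks for matches in the fields and orders the results by the ones with the most matches. This isn't 100% perfect, but I suspect it will help. The query is: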

-- `table` is a placeholder for the existing table's real name
select n.name, t.*,
     ((n.name like concat('%', t.firstname, '%')) +
      (n.name like concat('%', t.lastname, '%')) +
      (n.name like concat('%', t.suffix, '%')) +
      (n.name like concat('%', t.spousename, '%'))
     ) as NumMatches
from table t join
    names n
    on n.name like concat('%', t.firstname, '%') or
     n.name like concat('%', t.lastname, '%') or
     n.name like concat('%', t.suffix, '%') or
     n.name like concat('%', t.spousename, '%')
group by t.firstname, t.lastname, t.suffix, t.spousename, n.name
order by NumMatches desc;

Edit:

I left this out the first time, but you can count the words in each name and compare that to the number of matches. Put this clause before the order by:

having NumMatches = length(n.name) - length(replace(n.name, ' ', '')) + 1

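This is still not perfect, because the same name could be in more than one field. In practice, it should work pretty well. If you want to be more pedantic, you can do: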

having concat_ws(':', firstname, lastname, suffix, spousename) like concat('%', substring_index(n.name, ' ', 1), '%') and
     concat_ws(':', firstname, lastname, suffix, spousename) like concat('%', substring_index(substring_index(n.name, ' ', 2), ' ', -1), '%') and
     concat_ws(':', firstname, lastname, suffix, spousename) like concat('%', substring_index(substring_index(n.name, ' ', 3), ' ', -1), '%') and
     concat_ws(':', firstname, lastname, suffix, spousename) like concat('%', substring_index(substring_index(n.name, ' ', 4), ' ', -1), '%')

This tests each word of the name independently (the first four words, as written).
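The same "every word of the name must appear somewhere in the row" test can also be run on the PHP side, once candidate rows have been fetched. This is a different technique from the HAVING clause above, not a translation of it: a minimal sketch, assuming each row is an associative array keyed by the column names from the question, and that names contain only ASCII letters and digits. The helper names `words` and `allWordsMatch` are hypothetical. Unlike the LIKE version, it compares whole words, so "john" does not match "johnson".

```php
<?php
// Hypothetical helper: lower-case a string and split it into words,
// treating anything other than ASCII letters and digits as a separator.
function words(string $text): array
{
    $clean = preg_replace('/[^a-z0-9 ]+/', ' ', strtolower($text));
    return preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
}

// True when every word of $name occurs in at least one column of $row.
// NULL columns are treated as empty strings.
function allWordsMatch(string $name, array $row): bool
{
    $nameWords = words($name);
    $rowWords  = words(implode(' ', array_map('strval', $row)));

    // array_diff() returns the name words that are missing from the row.
    return $nameWords !== [] && array_diff($nameWords, $rowWords) === [];
}

// Example rows (invented for illustration).
$split = ['title' => null, 'firstname' => 'John', 'lastname' => 'Smith',
          'suffix' => '', 'spousename' => 'Julie'];
$joined = ['title' => null, 'firstname' => 'John Julie', 'lastname' => 'Smith',
           'suffix' => '', 'spousename' => null];
$other = ['title' => null, 'firstname' => 'John', 'lastname' => 'Smith',
          'suffix' => '', 'spousename' => 'Mary'];

$results = [
    allWordsMatch('John Julie Smith', $split),   // true
    allWordsMatch('John Julie Smith', $joined),  // true
    allWordsMatch('John Julie Smith', $other),   // false: "julie" is missing
];
```

The check runs in one direction only: extra words in the row (a title, say) do not prevent a match. Whether that is strict enough depends on how many false positives a human reviewer can tolerate.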

Source

2014-05-16 03:18:26

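This might help when someone goes through each entry by hand and compares it with the existing entries, but there doesn't seem to be a way to make sure that all the words in 'n.name' match. I want to eliminate most of these programmatically before sending a real person through them to see whether the "matches" really match – Stevish

This is good stuff, and thanks especially for the concat idea. When I have time to get back to this project I'll refer to this, and "accept" your answer if it works out. Thanks again! – Stevish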
