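Search: match all words from multiple columns
I'm trying to add a large number of names (1,700) to an existing database, but I need to check for duplicates. In fact, we assume most of them are duplicates. Unfortunately, the names come from address labels and aren't separated into fields (some are organization names, some are people's names). To lighten the load on the humans, I'd like to search for good matches on the name first. By a good match, I mean that all the words in the name (John Julie Smith) match across several database fields (title, firstname, lastname, suffix, spousename). So if firstname is John, lastname is Smith, and spousename is Julie, that's a match; or if firstname (in the db) is "John Julie" and lastname is "Smith", that matches too.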
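I've worked out a bit of a script that would do all of this in PHP, running a separate query for each possibility, like:
    lastname = 'john julie smith'
    firstname = 'john julie smith'
    ... lastname = 'john julie' AND firstname = 'smith'
And so on! That's 105 queries for a three-word name, and I have 1,700 names to process. That sounds ridiculous.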
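For reference, the 105 figure can be reproduced with a little counting. The sketch below is only an illustration: `countQueries` and `binomial` are hypothetical helper names, and the final total assumes every one of the 1,700 names has exactly three words, which real address labels will not.

```php
<?php
// Hypothetical helper: binomial coefficient C(n, k), computed with integers only.
function binomial(int $n, int $k): int
{
    $result = 1;
    for ($i = 1; $i <= $k; $i++) {
        $result = intdiv($result * ($n - $k + $i), $i);
    }
    return $result;
}

// Hypothetical helper: number of exact-match queries needed for one name.
// $words   = number of words in the name
// $columns = number of candidate columns (5 in this question)
function countQueries(int $words, int $columns): int
{
    $total = 0;
    for ($parts = 1; $parts <= min($words, $columns); $parts++) {
        // Ways to cut the word sequence into $parts contiguous groups.
        $splits = binomial($words - 1, $parts - 1);

        // Ways to put those groups into distinct columns (order matters).
        $assignments = 1;
        for ($i = 0; $i < $parts; $i++) {
            $assignments *= $columns - $i;
        }

        $total += $splits * $assignments;
    }
    return $total;
}

echo countQueries(3, 5), "\n";        // 105
echo countQueries(3, 5) * 1700, "\n"; // 178500, if every name had three words
```

For three words and five columns that is 1 × 5 + 2 × 20 + 1 × 60 = 105.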
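I'm fairly comfortable with PHP, but I'm not very good with MySQL. Is there a single query that can try to match all the words across multiple columns? Even if it only handles one of the name arrangements ("John, Julie, Smith" or "John Julie, Smith"). Maybe even using regex?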
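To make the question concrete, here is a rough sketch of the kind of single query being asked about. It is an assumption-laden illustration, not a known-good answer: the table name `people` and the helper `buildWordMatchQuery` are made up, and the match is looser than the one described above, since it only checks that every word appears somewhere in the five columns and does not reject rows that contain extra words.

```php
<?php
// Hypothetical helper: builds one parameterized query that requires every
// word of the name to appear as a whole word somewhere in the five columns.
// Column names come from the question; the table name `people` is an assumption.
function buildWordMatchQuery(array $words): array
{
    if (count($words) === 0) {
        throw new InvalidArgumentException('At least one word is required.');
    }

    // Glue the columns into one comma-separated word list per row.
    // CONCAT_WS skips NULL columns; REPLACE swaps spaces for commas so that
    // FIND_IN_SET can test for whole words.
    $haystack = "REPLACE(LOWER(CONCAT_WS(' ', title, firstname, lastname, suffix, spousename)), ' ', ',')";

    $conditions = [];
    $params = [];
    foreach ($words as $word) {
        $conditions[] = "FIND_IN_SET(?, $haystack) > 0";
        $params[] = strtolower($word);
    }

    $sql = 'SELECT * FROM people WHERE ' . implode(' AND ', $conditions);
    return [$sql, $params];
}

[$sql, $params] = buildWordMatchQuery(explode(' ', 'john julie smith'));
echo $sql, "\n";

// With PDO, assuming $pdo is an already-open connection:
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
```

This would be one query per name instead of 105, but it cannot use an index on the name columns, so every row gets scanned.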
Here's where I am so far:
    foreach($a as $name) {
        //There's some more stuff up here to prepare the strings,
        //removing &/and, punctuation, making everything lower case...
        $na = explode(" ", $name);
        $divisions = count($na) - 1;
        $poss = array();
        for($i = 0; $i < pow(2, $divisions); $i++) {
            $div = str_pad(decbin($i), $divisions, '0', STR_PAD_LEFT);
            $tpa = array();
            $tps = '';
            foreach($na as $nak => $nav) {
                if ($nak > 0 && substr($div, $nak - 1, 1)) {
                    $tpa[] = $tps;
                    $tps = $nav;
                } else {
                    $tps = trim($tps . ' ' . $nav);
                }
            }
            $tpa[] = $tps;
            $poss[] = $tpa;
        }
        foreach($poss as $possk => $possv) {
            $count = count($possv);
            //Here's where I am...
            //I could use $count and some math to come up with all the possible searches here,
            //But my head is starting to spin as I try to think of how to do that.
        }
        die();
    }
So far, the PHP builds an array ($poss) of every possible arrangement of the words in the name string. For "John Julie Smith", the array looks like this:
    Array
    (
        [0] => Array
            (
                [0] => john julie smith
            )
        [1] => Array
            (
                [0] => john julie
                [1] => smith
            )
        [2] => Array
            (
                [0] => john
                [1] => julie smith
            )
        [3] => Array
            (
                [0] => john
                [1] => julie
                [2] => smith
            )
    )
The original idea was to iterate over the array and create a bazillion queries. For [0], there would be 5 queries:
    ... WHERE firstname = 'john julie smith';
    ... WHERE lastname = 'john julie smith';
    ... WHERE spousename = 'john julie smith';
    ... WHERE title = 'john julie smith';
    ... WHERE suffix = 'john julie smith';
But for [1] there would be 20 queries:
    ... WHERE firstname = 'john julie' AND lastname = 'smith';
    ... WHERE firstname = 'john julie' AND spousename = 'smith';
    ... WHERE firstname = 'john julie' AND title = 'smith';
    ... WHERE firstname = 'john julie' AND suffix = 'smith';
    ... WHERE lastname = 'john julie' AND firstname = 'smith';
    ... WHERE lastname = 'john julie' AND spousename = 'smith';
    ... WHERE lastname = 'john julie' AND title = 'smith';
    ... WHERE lastname = 'john julie' AND suffix = 'smith';
    //and on and on
For [3] there would be 60 queries! At this rate I'm looking at more than 170,000 queries!
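The "some math" for turning one split into its list of searches amounts to picking distinct columns in every possible order. The following is a minimal sketch of that brute-force enumeration, not a recommended solution; `columnArrangements` and `whereClausesForSplit` are hypothetical names, and the clauses use `?` placeholders so the parts of the split can be bound in order.

```php
<?php
// Columns named in the question.
const COLUMNS = ['title', 'firstname', 'lastname', 'suffix', 'spousename'];

// Hypothetical helper: every ordered choice of $k distinct columns.
function columnArrangements(array $columns, int $k): array
{
    if ($k === 0) {
        return [[]];
    }
    $result = [];
    foreach ($columns as $i => $column) {
        // Remove the chosen column, then arrange the remaining ones.
        $rest = $columns;
        unset($rest[$i]);
        foreach (columnArrangements(array_values($rest), $k - 1) as $tail) {
            $result[] = array_merge([$column], $tail);
        }
    }
    return $result;
}

// Hypothetical helper: one parameterized WHERE clause per arrangement.
// The parts of the split are meant to be bound to the placeholders in order.
function whereClausesForSplit(array $parts): array
{
    $clauses = [];
    foreach (columnArrangements(COLUMNS, count($parts)) as $arrangement) {
        $conditions = array_map(fn($column) => "$column = ?", $arrangement);
        $clauses[] = implode(' AND ', $conditions);
    }
    return $clauses;
}

echo count(whereClausesForSplit(['john julie smith'])), "\n";       // 5
echo count(whereClausesForSplit(['john julie', 'smith'])), "\n";    // 20
echo count(whereClausesForSplit(['john', 'julie', 'smith'])), "\n"; // 60
```

A split with more parts than there are columns simply yields no clauses.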
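There has to be a better way...
I've added "what I've done so far". It's pretty hairy, and I'm only halfway there. I'm hoping for a more streamlined approach and would appreciate any ideas. – Stevish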