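Optimizing a SQL duplicate search
2014-07-22

We are working with a table of roughly 13 million rows. Our goal is to find duplicates in this table within a single restaurant only (about 300,000 rows). Our criteria for a duplicate are: the same last name, the same first two letters of the first name, and the same phone or email. Each of these is its own column. Our current strategy is to create two identical temporary tables holding all of the restaurant's rows, join them on the criteria above, and return the id, first name, last name, phone, and email from the first table.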

SELECT 
    DISTINCT t1.id, t1.firstname, t1.lastname, t1.phone, t1.email 
FROM 
(
    SELECT lmoc.id, lmoc.firstname, lmoc.lastname, lmoc.phone, lmoc.email 
    FROM loyalty_member_opentable_customer lmoc 
    WHERE lmoc.opentable_restaurant_id=2296 
     AND lmoc.lastname NOT LIKE '%Tour%' 
) AS t1 
INNER JOIN 
(
    SELECT lmoc2.id, lmoc2.firstname, lmoc2.lastname, lmoc2.phone, lmoc2.email 
    FROM loyalty_member_opentable_customer lmoc2 
    WHERE lmoc2.opentable_restaurant_id=2296 
     AND lmoc2.lastname NOT LIKE '%Tour%' 
) AS t2 
    ON STRCMP(t1.lastname,t2.lastname)=0 
    AND t1.id!=t2.id 
    AND STRCMP(LEFT(t1.firstname,2),LEFT(t2.firstname,2))=0 
    AND (STRCMP(t1.phone,t2.phone)=0 OR STRCMP(t1.email,t2.email)=0) 
ORDER BY t1.lastname, t1.firstname 

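The problem is that this query takes north of 48 hours to run. Can anyone think of a more efficient way to run it? We need all of the duplicates so that the restaurant can merge them as they see fit.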

+2

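Sounds like a good strategy. Have fun. – Strawberry

+1

This question appears to be off-topic because there is no question. – Strawberry

+0

It would be useful if you posted the table structure and the SQL query. Some information about the current performance would also help gauge what can be improved. Try rephrasing this as a question. –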
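
The details requested in this comment can be collected with two standard MySQL statements. This is a minimal sketch, assuming the server is MySQL (the query uses STRCMP, which is MySQL-specific); the filter is copied from the subqueries in the question.

```sql
-- Table definition, including any existing indexes
SHOW CREATE TABLE loyalty_member_opentable_customer;

-- Execution plan for the restaurant filter used by both subqueries
EXPLAIN
SELECT lmoc.id, lmoc.firstname, lmoc.lastname, lmoc.phone, lmoc.email
FROM loyalty_member_opentable_customer lmoc
WHERE lmoc.opentable_restaurant_id = 2296
  AND lmoc.lastname NOT LIKE '%Tour%';
```

In the EXPLAIN output, a type of ALL means a full table scan, and a NULL in the key column means no index was chosen for that step.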

Answers

1

Why can't you simply do:

SELECT lmoc.lastname, lmoc.firstname, lmoc.phone, lmoc.email 
FROM loyalty_member_opentable_customer lmoc 
WHERE lmoc.opentable_restaurant_id=2296 
    AND lmoc.lastname NOT LIKE '%Tour%' 
GROUP BY lmoc.lastname, LEFT(lmoc.firstname, 2), lmoc.phone, lmoc.email 
HAVING COUNT(*) > 1; 

+0

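This drops the phone OR email part of the criteria. Some duplicates have matching phones and some have matching emails, but very few match on both. We would also like to have the ids of both duplicates so that we can merge them. – Zak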
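
The point in this comment can be seen on a tiny sample. The sketch below uses a throwaway table named dup_demo with three made-up rows (the table name and the data are hypothetical, not taken from the real table). It is created as a regular table because MySQL cannot reference a TEMPORARY table twice in one query.

```sql
-- Hypothetical sample data, for illustration only
CREATE TABLE dup_demo (
    id        INT PRIMARY KEY,
    firstname VARCHAR(50),
    lastname  VARCHAR(50),
    phone     VARCHAR(20),
    email     VARCHAR(100)
);

INSERT INTO dup_demo VALUES
    (1, 'John',     'Smith', '555-0100', 'john@example.com'),
    (2, 'Jonathan', 'Smith', '555-0100', 'jsmith@example.com'),
    (3, 'Joan',     'Smith', '555-0199', 'john@example.com');

-- Grouping on phone AND email: every group holds one row, so nothing is returned
SELECT lastname, LEFT(firstname, 2) AS fn2, phone, email, COUNT(*) AS n
FROM dup_demo
GROUP BY lastname, fn2, phone, email
HAVING COUNT(*) > 1;

-- Pairing on phone OR email, keeping both ids: returns (1, 2) and (1, 3)
SELECT a.id AS id_a, b.id AS id_b
FROM dup_demo a
INNER JOIN dup_demo b
    ON a.lastname = b.lastname
   AND LEFT(a.firstname, 2) = LEFT(b.firstname, 2)
   AND a.id < b.id
   AND (a.phone = b.phone OR a.email = b.email)
ORDER BY id_a, id_b;

DROP TABLE dup_demo;
```

Rows 1 and 2 share only a phone and rows 1 and 3 share only an email, so the GROUP BY finds no group larger than one, while the self-join reports both pairs. The condition a.id < b.id lists each pair once instead of twice.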

1

This SQL will help you find the duplicates:

SELECT lmoc.id, lmoc.firstname, lmoc.lastname, lmoc.phone, lmoc.email 
FROM loyalty_member_opentable_customer lmoc 
WHERE lmoc.opentable_restaurant_id=2296 
    AND lmoc.lastname NOT LIKE '%Tour%' 
    AND lmoc.lastname BETWEEN 'ha' AND 'i' 
GROUP BY lmoc.opentable_restaurant_id, lmoc.id, LEFT(lmoc.firstname,2), lmoc.lastname, lmoc.phone, lmoc.email 
HAVING COUNT(*) > 1  

If you have a primary key, you can easily keep the most recent row and purge the older ones with this SQL:

/* MySQL multi-table DELETE: name the table alias to delete from, not a column */
DELETE lmoc
FROM loyalty_member_opentable_customer lmoc
LEFT JOIN
    (SELECT
     MAX(lmoc.primary_id) AS id   /* newest row of each group is kept */
    FROM loyalty_member_opentable_customer lmoc
    WHERE lmoc.opentable_restaurant_id=2296
     AND lmoc.lastname NOT LIKE '%Tour%'
     AND lmoc.lastname BETWEEN 'ha' AND 'i'
    GROUP BY lmoc.opentable_restaurant_id, lmoc.id, LEFT(lmoc.firstname,2), lmoc.lastname, lmoc.phone, lmoc.email
    ) nodup
    ON lmoc.primary_id = nodup.id
WHERE lmoc.opentable_restaurant_id=2296
     AND lmoc.lastname NOT LIKE '%Tour%'
     AND lmoc.lastname BETWEEN 'ha' AND 'i'
     AND nodup.id IS NULL;   /* no match in nodup: an older duplicate */

+0

I don't see "lmoc.lastname BETWEEN 'ha' AND 'i'" in the question? – ForguesR

+0

I just took the WHERE conditions from Zak's question. It looks like he edited it afterwards. –

1

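You are not creating a temporary table; you are using subqueries, and that is going to be slow with 13 million rows. Create a real temporary table (SELECT INTO) containing all the data you need.

Here is what I would try: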

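/* Creating a staging table. MySQL has no SELECT ... INTO <table>;
   CREATE TABLE ... AS SELECT is its equivalent. A regular table is used
   because a MySQL TEMPORARY table cannot be referenced twice in one query,
   which the self-join below requires. */
CREATE TABLE tempRestaurant AS
SELECT lmoc.id, lmoc.firstname, lmoc.lastname, lmoc.phone, lmoc.email
FROM loyalty_member_opentable_customer AS lmoc
WHERE
    lmoc.opentable_restaurant_id=2296 AND
    lmoc.lastname NOT LIKE '%Tour%';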

/* Select duplicates */ 
SELECT * FROM 
    tempRestaurant AS t1 
INNER JOIN tempRestaurant AS t2 ON 
    STRCMP(t1.lastname,t2.lastname)=0 
    AND t1.id!=t2.id 
WHERE 
    STRCMP(LEFT(t1.firstname,2), LEFT(t2.firstname,2))=0 AND 
    (STRCMP(t1.phone,t2.phone)=0 OR STRCMP(t1.email,t2.email)=0)
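
One possible refinement of the join above, offered as a sketch rather than a measured fix: index the staging table and split the phone OR email condition into two equality joins combined with UNION. It assumes tempRestaurant exists as a regular (non-TEMPORARY) table with the columns selected above. The index names idx_last_phone and idx_last_email are made up, and since the thread does not give the column types, prefix lengths may be needed if the columns are too wide to index whole.

```sql
-- Hypothetical index names; one index per branch of the OR
ALTER TABLE tempRestaurant
    ADD INDEX idx_last_phone (lastname, phone),
    ADD INDEX idx_last_email (lastname, email);

-- Each branch is a plain equality join, so the optimizer may use an index.
-- t1.id < t2.id reports each pair once and returns both ids.
(SELECT t1.id AS id_a, t2.id AS id_b,
        t1.firstname, t1.lastname, t1.phone, t1.email
 FROM tempRestaurant AS t1
 INNER JOIN tempRestaurant AS t2
     ON t1.lastname = t2.lastname
    AND t1.phone = t2.phone
    AND t1.id < t2.id
 WHERE LEFT(t1.firstname, 2) = LEFT(t2.firstname, 2))
UNION
(SELECT t1.id AS id_a, t2.id AS id_b,
        t1.firstname, t1.lastname, t1.phone, t1.email
 FROM tempRestaurant AS t1
 INNER JOIN tempRestaurant AS t2
     ON t1.lastname = t2.lastname
    AND t1.email = t2.email
    AND t1.id < t2.id
 WHERE LEFT(t1.firstname, 2) = LEFT(t2.firstname, 2))
ORDER BY lastname, firstname;
```

Comparing columns directly with = instead of wrapping them in STRCMP(...) = 0 matters here, because a condition on a function result generally cannot be resolved through an index on the underlying column. UNION (rather than UNION ALL) removes the repeated row for pairs that match on both phone and email.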