2013-09-30

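How to detect duplicate records that have records in a sub-table

Let's say I'm creating an address book in which the main table contains the basic contact information, with a sub-table of phone numbers: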

Contact 
=============== 
Id   [PK] 
Name 

PhoneNumber 
=============== 
Id   [PK] 
Contact_Id [FK] 
Number 

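So, a Contact record may have zero or more related records in the PhoneNumber table. There is no uniqueness constraint on any column other than the primary keys. In fact, this must be true because:

  1. two contacts with different names may share a phone number, and
  2. two contacts may have the same name but different phone numbers.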

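I want to import a large dataset that may contain duplicate records into my database and then filter out the duplicates using SQL. The rule for identifying duplicate records is simple: they must share the same name and have the same number of phone records with the same content.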

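Of course, the following works quite well for selecting duplicates from the Contact table by name alone, but it does not help me detect actual duplicates under my rule: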

SELECT * FROM Contact 
WHERE EXISTS 
    (SELECT 'x' FROM Contact t2 
    WHERE t2.Name = Contact.Name AND 
      t2.Id > Contact.Id); 

It seems that what I want is a logical extension of what I already have, but I must be overlooking it. Any help?

Thanks!
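
To make the duplicate rule stated above concrete, here is a minimal sketch in plain Python rather than SQL. The in-memory data and the function name `find_duplicate_groups` are made up for illustration; the sketch only restates the rule (same name, same collection of phone numbers) and is not the query the question is asking for.

```python
from collections import defaultdict

# Hypothetical in-memory copies of the two tables (illustrative data only).
contacts = {1: "John", 2: "John", 3: "Peter", 4: "Peter"}  # Id -> Name
phone_numbers = [  # (Contact_Id, Number)
    (1, "123"), (1, "456"),
    (2, "456"), (2, "123"),
    (3, "123"),
]


def find_duplicate_groups(contacts, phone_numbers):
    """Return lists of contact Ids that share a name and identical phone numbers."""
    numbers_by_contact = defaultdict(list)
    for contact_id, number in phone_numbers:
        numbers_by_contact[contact_id].append(number)

    groups = defaultdict(list)
    for contact_id, name in sorted(contacts.items()):
        # A sorted tuple ignores row order but still counts repeated numbers.
        signature = tuple(sorted(numbers_by_contact.get(contact_id, [])))
        groups[(name, signature)].append(contact_id)

    return [ids for ids in groups.values() if len(ids) > 1]


duplicate_groups = find_duplicate_groups(contacts, phone_numbers)
print(duplicate_groups)
```

Here the two Johns hold the same numbers in a different order and are grouped together, while the two Peters share only a name and are not.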

You need to join the two tables, group by name, and then use a `HAVING` clause to get `COUNT(Id) > 1`. – mrtig

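There should be a unique constraint on `(PhoneNumber.Contact_Id, PhoneNumber.Number)`, though. Otherwise you risk storing the same number for the same contact ID more than once (which, incidentally, may make it harder to identify duplicate *sets* of data when importing that large dataset). –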

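Andriy's comment is a good one. However, if the utility is expected to ingest data with minimal validation and clean it up later, it may be better to create a set of staging tables without this constraint and keep the constraint, as he says, on the final tables. – 240DL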
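
As a rough illustration of the unique constraint suggested in the comments, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table layout follows the question; the sample rows are invented, and SQLite is only an assumption here because a later answer names it as the production engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Contact (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE PhoneNumber (
        Id         INTEGER PRIMARY KEY,
        Contact_Id INTEGER REFERENCES Contact (Id),
        Number     TEXT,
        UNIQUE (Contact_Id, Number)  -- the constraint proposed in the comment
    );
    INSERT INTO Contact (Id, Name) VALUES (1, 'John'), (2, 'John');
    INSERT INTO PhoneNumber (Contact_Id, Number) VALUES (1, '123');
    -- Same number for a different contact: still allowed.
    INSERT INTO PhoneNumber (Contact_Id, Number) VALUES (2, '123');
""")

try:
    # Same number for the same contact a second time: violates the constraint.
    conn.execute("INSERT INTO PhoneNumber (Contact_Id, Number) VALUES (1, '123')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM PhoneNumber").fetchone()[0]
print(rejected, row_count)
```

Two contacts can still share a number, which the question requires, but one contact cannot hold the same number twice.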

Answers

0

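The author states the requirements for "two people being the same person" as:

  1. having the same name, and
  2. having the same number of phone numbers, all of which are the same.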

So the problem is a bit more complex than it seems (or maybe I just overthought it).

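Below are sample data and a sample query (an ugly one, I know, but the general idea is there). I tested it on this test data and it seems to work correctly (I'm using Oracle 11g R2):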

CREATE TABLE contact (
    id NUMBER PRIMARY KEY, 
    name VARCHAR2(40)) 
; 

CREATE TABLE phone_number (
    id NUMBER PRIMARY KEY, 
    contact_id REFERENCES contact (id), 
    phone VARCHAR2(10) 
); 

INSERT INTO contact (id, name) VALUES (1, 'John'); 
INSERT INTO contact (id, name) VALUES (2, 'John'); 
INSERT INTO contact (id, name) VALUES (3, 'Peter'); 
INSERT INTO contact (id, name) VALUES (4, 'Peter'); 
INSERT INTO contact (id, name) VALUES (5, 'Mike'); 
INSERT INTO contact (id, name) VALUES (6, 'Mike'); 
INSERT INTO contact (id, name) VALUES (7, 'Mike'); 

INSERT INTO phone_number (id, contact_id, phone) VALUES (1, 1, '123'); -- John having number 123 
INSERT INTO phone_number (id, contact_id, phone) VALUES (2, 1, '456'); -- John having number 456 

INSERT INTO phone_number (id, contact_id, phone) VALUES (3, 2, '123'); -- John the second having number 123 
INSERT INTO phone_number (id, contact_id, phone) VALUES (4, 2, '456'); -- John the second having number 456 

INSERT INTO phone_number (id, contact_id, phone) VALUES (5, 3, '123'); -- Peter having number 123 
INSERT INTO phone_number (id, contact_id, phone) VALUES (6, 3, '456'); -- Peter having number 456 
INSERT INTO phone_number (id, contact_id, phone) VALUES (7, 3, '789'); -- Peter having number 789 

INSERT INTO phone_number (id, contact_id, phone) VALUES (8, 4, '456'); -- Peter the second having number 456 

INSERT INTO phone_number (id, contact_id, phone) VALUES (9, 5, '123'); -- Mike having number 123 
INSERT INTO phone_number (id, contact_id, phone) VALUES (10, 5, '456'); -- Mike having number 456 

INSERT INTO phone_number (id, contact_id, phone) VALUES (11, 6, '123'); -- Mike the second having number 123 
INSERT INTO phone_number (id, contact_id, phone) VALUES (12, 6, '789'); -- Mike the second having number 789 

-- Mike the third having no number 
COMMIT; 

-- does not meet the requirements described in the question - will return Peter when it should not 
SELECT DISTINCT c.name 
    FROM contact c JOIN phone_number pn ON (pn.contact_id = c.id) 
GROUP BY c.name, pn.phone 
HAVING COUNT(c.id) > 1 
; 

-- returns correct results for provided test data 
-- take all people that have a namesake in contact table and 
-- take all this person's phone numbers that this person's namesake also has 
-- finally (outer query) check that the number of both persons' phone numbers is the same and 
-- the number of the same phone numbers is equal to the number of (either) person's phone numbers 
SELECT c1_id, name 
    FROM (
    SELECT c1.id AS c1_id, c1.name, c2.id AS c2_id, COUNT(1) AS cnt 
     FROM contact c1 
     JOIN contact c2 ON (c2.id != c1.id AND c2.name = c1.name) 
     JOIN phone_number pn ON (pn.contact_id = c1.id) 
    WHERE 
     EXISTS (SELECT 1 
       FROM phone_number 
       WHERE contact_id = c2.id 
       AND phone = pn.phone) 
    GROUP BY c1.id, c1.name, c2.id 
) 
WHERE cnt = (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) 
    AND (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) = (SELECT COUNT(1) FROM phone_number WHERE contact_id = c2_id) 
; 

-- cleanup 
DROP TABLE phone_number; 
DROP TABLE contact; 

Check it on SQL Fiddle: http://www.sqlfiddle.com/#!4/36cdf/1

Edit

In answer to the author's comment: indeed, I did not take that case into account. Here is a revised solution:

-- new test data 
INSERT INTO contact (id, name) VALUES (8, 'Jane'); 
INSERT INTO contact (id, name) VALUES (9, 'Jane'); 

SELECT c1_id, name 
    FROM (
    SELECT c1.id AS c1_id, c1.name, c2.id AS c2_id, COUNT(1) AS cnt 
     FROM contact c1 
     JOIN contact c2 ON (c2.id != c1.id AND c2.name = c1.name) 
     LEFT JOIN phone_number pn ON (pn.contact_id = c1.id) 
    WHERE pn.contact_id IS NULL 
     OR EXISTS (SELECT 1 
       FROM phone_number 
       WHERE contact_id = c2.id 
       AND phone = pn.phone) 
    GROUP BY c1.id, c1.name, c2.id 
) 
WHERE (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) IN (0, cnt) 
    AND (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) = (SELECT COUNT(1) FROM phone_number WHERE contact_id = c2_id) 
; 

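We now allow for contacts that have no phone numbers (the LEFT JOIN), and in the outer query we compare the person's count of phone numbers against two values: it must equal either 0 or the count returned by the inner query.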
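
For readers without an Oracle instance, the sketch below runs the revised query against the same sample data using Python's built-in `sqlite3` module. This is an adaptation, not the setup the answer was tested on: the column types are changed to SQLite ones, the derived table is given the alias `pairs`, and running it on SQLite at all is an assumption of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phone_number (
        id         INTEGER PRIMARY KEY,
        contact_id INTEGER REFERENCES contact (id),
        phone      TEXT
    );
    INSERT INTO contact (id, name) VALUES
        (1, 'John'), (2, 'John'), (3, 'Peter'), (4, 'Peter'),
        (5, 'Mike'), (6, 'Mike'), (7, 'Mike'), (8, 'Jane'), (9, 'Jane');
    INSERT INTO phone_number (id, contact_id, phone) VALUES
        (1, 1, '123'), (2, 1, '456'),
        (3, 2, '123'), (4, 2, '456'),
        (5, 3, '123'), (6, 3, '456'), (7, 3, '789'),
        (8, 4, '456'),
        (9, 5, '123'), (10, 5, '456'),
        (11, 6, '123'), (12, 6, '789');
""")

# The revised query from the answer, with only the alias "pairs" added.
REVISED_QUERY = """
SELECT c1_id, name
  FROM (
    SELECT c1.id AS c1_id, c1.name AS name, c2.id AS c2_id, COUNT(1) AS cnt
      FROM contact c1
      JOIN contact c2 ON (c2.id != c1.id AND c2.name = c1.name)
      LEFT JOIN phone_number pn ON (pn.contact_id = c1.id)
     WHERE pn.contact_id IS NULL
        OR EXISTS (SELECT 1
                     FROM phone_number
                    WHERE contact_id = c2.id
                      AND phone = pn.phone)
     GROUP BY c1.id, c1.name, c2.id
  ) AS pairs
 WHERE (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) IN (0, cnt)
   AND (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id)
     = (SELECT COUNT(1) FROM phone_number WHERE contact_id = c2_id)
"""

# Sort in Python because the query itself does not define an order.
duplicates = sorted(conn.execute(REVISED_QUERY).fetchall())
print(duplicates)
```

On this data the two Johns and the two phone-less Janes come back, while the third Mike, who has no numbers but whose namesakes do, is excluded by the final count comparison.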

Thanks! I think this is the right path. But what about the case where the contact records have zero phone records? – 240DL

Zero related records means there are no duplicate records. Aren't you looking for duplicates? –

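Well, they share the same name and have the same number of phone numbers... If the comment's author meant "What will this query return when there are people with the same name who all have no phone numbers?", then: the first version of the query will not return them; the revised query will. –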

0

The keyword `HAVING` is your friend. The general usage is:

select field1, field2, count(*) records 
from whereever 
where whatever 
group by field1, field2 
having records > 1 

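Whether you can use the alias in the HAVING clause depends on the database engine. You should be able to apply this basic principle to your situation.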
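
A small runnable sketch of that pattern, using Python's built-in `sqlite3` module and invented sample rows. It repeats the aggregate in `HAVING` instead of using the alias, so it does not rely on engine-specific alias support. Note that it reports names that occur more than once, which is only the first half of the question's rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Contact (Id INTEGER PRIMARY KEY, Name TEXT);
    INSERT INTO Contact (Id, Name) VALUES
        (1, 'John'), (2, 'John'), (3, 'Peter'),
        (4, 'Mike'), (5, 'Mike'), (6, 'Mike');
""")

repeated_names = conn.execute("""
    SELECT Name, COUNT(*) AS records
      FROM Contact
     GROUP BY Name
    HAVING COUNT(*) > 1   -- aggregate repeated rather than the alias "records"
     ORDER BY Name
""").fetchall()
print(repeated_names)
```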

1

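In my question, I created a greatly simplified schema that mirrors the real-world problem I am solving. Przemyslaw's answer is indeed correct and does what I asked, both for the sample schema and, once extended, for the real one.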

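However, after some experiments with the real schema and a larger (~10k records) dataset, I found that performance was an issue. I don't claim to be an indexing guru, but I could not find a better combination of indexes than the ones already in the schema.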

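So I came up with an alternative solution that meets the same requirements but executes in a small fraction (< 10%) of the time, at least with SQLite3, my production engine. In the hope that it helps someone else, I offer it as an alternative answer to my own question.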

DROP TABLE IF EXISTS PhoneNumber; 
DROP TABLE IF EXISTS Contact; 

CREATE TABLE Contact (
    Id INTEGER PRIMARY KEY, 
    Name TEXT 
); 

CREATE TABLE PhoneNumber (
    Id   INTEGER PRIMARY KEY, 
    Contact_Id INTEGER REFERENCES Contact (Id) ON UPDATE CASCADE ON DELETE CASCADE, 
    Number  TEXT 
); 

INSERT INTO Contact (Id, Name) VALUES 
    (1, 'John Smith'), 
    (2, 'John Smith'), 
    (3, 'John Smith'), 
    (4, 'Jane Smith'), 
    (5, 'Bob Smith'), 
    (6, 'Bob Smith'); 

INSERT INTO PhoneNumber (Id, Contact_Id, Number) VALUES 
    (1, 1, '555-1212'), 
    (2, 1, '222-1515'), 
    (3, 2, '222-1515'), 
    (4, 2, '555-1212'), 
    (5, 3, '111-2525'), 
    (6, 4, '111-2525'); 

-- No COMMIT is needed here: without an explicit BEGIN, SQLite autocommits each statement. 

SELECT * 
FROM Contact c1 
WHERE EXISTS (
    SELECT 1 
    FROM Contact c2 
    WHERE c2.Id > c1.Id 
    AND c2.Name = c1.Name 
    AND (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c2.Id) = (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c1.Id) 
    AND (
     SELECT COUNT(*) 
     FROM PhoneNumber p1 
     WHERE p1.Contact_Id = c2.Id 
     AND EXISTS (
      SELECT 1 
      FROM PhoneNumber p2 
      WHERE p2.Contact_Id = c1.Id 
      AND p2.Number = p1.Number 
     ) 
    ) = (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c1.Id) 
) 
; 

The results are as expected:

Id  Name 
====== ============= 
1  John Smith 
5  Bob Smith 

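Other engines are bound to perform differently, and their performance may well be perfectly acceptable. This solution seems to work quite well with SQLite for this schema.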
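
The question's stated goal is to filter the duplicates out after import. One possible follow-up step, sketched here with Python's built-in `sqlite3` module, is to delete the rows that the query above selects. This step is not part of the original answer. It assumes foreign key enforcement is switched on so that the schema's `ON DELETE CASCADE` also removes the phone numbers, and it keeps the highest Id in each duplicate group because the query returns only contacts that have a later duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE Contact (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE PhoneNumber (
        Id         INTEGER PRIMARY KEY,
        Contact_Id INTEGER REFERENCES Contact (Id) ON UPDATE CASCADE ON DELETE CASCADE,
        Number     TEXT
    );
    INSERT INTO Contact (Id, Name) VALUES
        (1, 'John Smith'), (2, 'John Smith'), (3, 'John Smith'),
        (4, 'Jane Smith'), (5, 'Bob Smith'), (6, 'Bob Smith');
    INSERT INTO PhoneNumber (Id, Contact_Id, Number) VALUES
        (1, 1, '555-1212'), (2, 1, '222-1515'), (3, 2, '222-1515'),
        (4, 2, '555-1212'), (5, 3, '111-2525'), (6, 4, '111-2525');
""")

# The query from the answer above, selecting only the Id column.
DUPLICATES_QUERY = """
SELECT c1.Id
FROM Contact c1
WHERE EXISTS (
    SELECT 1
    FROM Contact c2
    WHERE c2.Id > c1.Id
    AND c2.Name = c1.Name
    AND (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c2.Id)
      = (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c1.Id)
    AND (
        SELECT COUNT(*)
        FROM PhoneNumber p1
        WHERE p1.Contact_Id = c2.Id
        AND EXISTS (
            SELECT 1
            FROM PhoneNumber p2
            WHERE p2.Contact_Id = c1.Id
            AND p2.Number = p1.Number
        )
    ) = (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c1.Id)
)
"""

# Collect the Ids first, then delete, so the SELECT is finished before rows change.
duplicate_ids = sorted(row[0] for row in conn.execute(DUPLICATES_QUERY).fetchall())
conn.executemany("DELETE FROM Contact WHERE Id = ?", [(i,) for i in duplicate_ids])
conn.commit()

remaining_contacts = [r[0] for r in conn.execute("SELECT Id FROM Contact ORDER BY Id")]
remaining_phone_ids = [r[0] for r in conn.execute("SELECT Id FROM PhoneNumber ORDER BY Id")]
print(duplicate_ids, remaining_contacts, remaining_phone_ids)
```

With the sample data, contacts 1 and 5 are removed together with the two phone rows that belonged to contact 1; whether to keep the lowest or the highest Id in each group is a design choice this sketch does not settle.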