2010-05-20 58 views
7

How to find duplicate values in SQL Server

I'm using SQL Server 2008 and I have a table

Customers 

customer_number int 

field1 varchar 

field2 varchar 

field3 varchar 

field4 varchar 

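...and many more columns that don't matter for my query.

Column customer_number is the primary key. I'm trying to find duplicate values and some differences between them.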

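Please help me find all rows that have the same

1) field1, field2, field3 and field4

2) only 3 columns equal and one of them different (excluding rows from list 1)

3) only 2 columns equal and two of them different (excluding rows from lists 1 and 2)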

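In the end I'll have 3 tables containing these results plus an additional groupId, which will be the same for a group of similar rows (for example, for 3 equal columns, the rows that share the same 3 columns will form a separate group).

Thanks.

Answers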

4

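The easiest approach would probably be to write a stored procedure that iterates over each group of customers with duplicates and inserts the matching rows under their group number.

However, I've thought about it, and you can probably do this with a subquery. Hopefully I haven't made it more complicated than it needs to be, but this should get you the first table of duplicates (all four fields). Note that this is untested, so it might need a little tweaking.

Basically, it takes each set of field values that occurs more than once, assigns a group number to each set, then fetches all customers with those field values and gives them the same group number.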

-- Target column is customer_no; the key column in Customers is customer_number.
INSERT INTO FourFieldsDuplicates(group_no, customer_no) 
SELECT Groups.group_no, custs.customer_number 
FROM (SELECT ROW_NUMBER() OVER(ORDER BY c.field1) AS group_no, 
      c.field1, c.field2, c.field3, c.field4 
     FROM Customers c 
     GROUP BY c.field1, c.field2, c.field3, c.field4 
     HAVING COUNT(*) > 1) Groups   -- keep only value sets that occur more than once
INNER JOIN Customers custs ON custs.field1 = Groups.field1 
          AND custs.field2 = Groups.field2 
          AND custs.field3 = Groups.field3 
          AND custs.field4 = Groups.field4 
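
As a possible alternative to the join above, window functions can number the groups in a single pass over the table. This is only a sketch: it assumes the `Customers` table from the question (key column `customer_number`) and the same `FourFieldsDuplicates(group_no, customer_no)` target table, and it has not been run against real data.

```sql
-- Sketch only: assumes Customers(customer_number, field1..field4) from the question
-- and the target table FourFieldsDuplicates(group_no, customer_no) used above.
INSERT INTO FourFieldsDuplicates (group_no, customer_no)
SELECT DENSE_RANK() OVER (ORDER BY t.field1, t.field2, t.field3, t.field4) AS group_no,
       t.customer_number
FROM (SELECT c.customer_number, c.field1, c.field2, c.field3, c.field4,
             -- number of rows sharing the same four values
             COUNT(*) OVER (PARTITION BY c.field1, c.field2, c.field3, c.field4) AS cnt
      FROM Customers c) t
WHERE t.cnt > 1;   -- the filter runs before DENSE_RANK, so only duplicates are numbered
```

One difference to be aware of: `PARTITION BY` puts NULLs into the same group, whereas the `=` comparisons in the join version never match NULL to NULL.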

The others are a bit more complicated, because you need to expand the possibilities. The three-field groups would then be:

INSERT INTO ThreeFieldsDuplicates(group_no, customer_no) 
SELECT Groups.group_no, custs.customer_number 
FROM (SELECT ROW_NUMBER() OVER(ORDER BY GroupsInner.field1) AS group_no, 
      GroupsInner.field1, GroupsInner.field2, 
      GroupsInner.field3, GroupsInner.field4 
     -- One branch per ignored field; NULL marks the field that may differ.
     -- HAVING sits inside each branch: the inner GROUP BY already returns one
     -- row per combination, so counting again outside would never exceed 1.
     FROM (SELECT c.field1, c.field2, c.field3, NULL AS field4 
      FROM Customers c 
      WHERE NOT EXISTS(SELECT d.customer_no 
         FROM FourFieldsDuplicates d 
         WHERE d.customer_no = c.customer_number) 
      GROUP BY c.field1, c.field2, c.field3 
      HAVING COUNT(*) > 1 
      UNION ALL 
      SELECT c.field1, c.field2, NULL AS field3, c.field4 
      FROM Customers c 
      WHERE NOT EXISTS(SELECT d.customer_no 
          FROM FourFieldsDuplicates d 
          WHERE d.customer_no = c.customer_number) 
      GROUP BY c.field1, c.field2, c.field4 
      HAVING COUNT(*) > 1 
      UNION ALL 
      SELECT c.field1, NULL AS field2, c.field3, c.field4 
      FROM Customers c 
      WHERE NOT EXISTS(SELECT d.customer_no 
          FROM FourFieldsDuplicates d 
          WHERE d.customer_no = c.customer_number) 
      GROUP BY c.field1, c.field3, c.field4 
      HAVING COUNT(*) > 1 
      UNION ALL 
      SELECT NULL AS field1, c.field2, c.field3, c.field4 
      FROM Customers c 
      WHERE NOT EXISTS(SELECT d.customer_no 
          FROM FourFieldsDuplicates d 
          WHERE d.customer_no = c.customer_number) 
      GROUP BY c.field2, c.field3, c.field4 
      HAVING COUNT(*) > 1) GroupsInner) Groups 
INNER JOIN Customers custs ON (Groups.field1 IS NULL OR custs.field1 = Groups.field1) 
          AND (Groups.field2 IS NULL OR custs.field2 = Groups.field2) 
          AND (Groups.field3 IS NULL OR custs.field3 = Groups.field3) 
          AND (Groups.field4 IS NULL OR custs.field4 = Groups.field4) 
-- Rows already reported as four-field duplicates are left out here as well.
WHERE NOT EXISTS(SELECT d.customer_no 
         FROM FourFieldsDuplicates d 
         WHERE d.customer_no = custs.customer_number) 

Hopefully this produces the right results; I'll leave the last one as an exercise. :-D

+0

@Ic Is writing "c.field1 as group_no" a technique that refers back to the original table? group_no is int, field1 is varchar. Maybe I should use a temp table? – hgulyan 2010-05-20 10:18:57

+0

@hgulyan It is actually ROW_NUMBER() AS group_no. – 2010-05-20 10:25:48

+0

@Ic Yes, I've changed it. Still trying to run the first script... – hgulyan 2010-05-20 10:27:49

2

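I'm not sure whether you need equality checks across different fields (such as field1 = field2).
If not, this might be enough.

Edit

Feel free to adjust the test data so that it gives us input that produces wrong output according to your specifications.

Test data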

DECLARE @Customers TABLE (
    customer_number INTEGER IDENTITY(1, 1) 
    , field1 INTEGER 
    , field2 INTEGER 
    , field3 INTEGER 
    , field4 INTEGER) 

INSERT INTO @Customers 
      SELECT 1, 1, 1, 1 
UNION ALL SELECT 1, 1, 1, 1 
UNION ALL SELECT 1, 1, 1, NULL 
UNION ALL SELECT 1, 1, 1, 2 
UNION ALL SELECT 1, 1, 1, 3 
UNION ALL SELECT 2, 1, 1, 1 

All fields equal

SELECT ROW_NUMBER() OVER (ORDER BY c1.customer_number) 
     , c1.field1 
     , c1.field2 
     , c1.field3 
     , c1.field4 
FROM @Customers c1 
     INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number 
            AND ISNULL(c2.field1, 0) = ISNULL(c1.field1, 0) 
            AND ISNULL(c2.field2, 0) = ISNULL(c1.field2, 0) 
            AND ISNULL(c2.field3, 0) = ISNULL(c1.field3, 0) 
            AND ISNULL(c2.field4, 0) = ISNULL(c1.field4, 0) 
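
The `ISNULL(..., 0)` comparisons above treat NULL and 0 as the same value. If that distinction matters, one possible alternative is to compare through `INTERSECT`, which treats two NULLs as equal without substituting a placeholder value. The following is a minimal sketch with a reduced, hypothetical two-field version of the test table, not a drop-in replacement for the query above.

```sql
-- Sketch only: a small table variable standing in for the test data above.
DECLARE @Customers TABLE (
    customer_number INTEGER IDENTITY(1, 1)
    , field1 INTEGER
    , field2 INTEGER);

INSERT INTO @Customers (field1, field2)
          SELECT 1, NULL
UNION ALL SELECT 1, NULL
UNION ALL SELECT 1, 0;

-- INTERSECT treats two NULLs as equal, but does not treat NULL as equal to 0.
SELECT c1.customer_number
     , c2.customer_number AS matching_customer_number
FROM @Customers c1
     INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number
            AND EXISTS (SELECT c1.field1, c1.field2
                        INTERSECT
                        SELECT c2.field1, c2.field2);
```

With this data the two (1, NULL) rows are paired with each other, while the (1, 0) row is not matched to either of them.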

One field different

SELECT ROW_NUMBER() OVER (ORDER BY field1, field2, field3, field4) 
     , field1 
     , field2 
     , field3 
     , field4 
FROM (
      SELECT DISTINCT c1.field1 
        , c1.field2 
        , c1.field3 
        , field4 = NULL 
      FROM @Customers c1 
        INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number 
              AND c2.field1 = c1.field1 
              AND c2.field2 = c1.field2 
              AND c2.field3 = c1.field3 
              AND ISNULL(c2.field4, 0) <> ISNULL(c1.field4, 0) 
      UNION ALL 
      SELECT DISTINCT c1.field1 
        , c1.field2 
        , NULL 
        , c1.field4 
      FROM @Customers c1 
        INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number 
              AND c2.field1 = c1.field1 
              AND c2.field2 = c1.field2 
              AND ISNULL(c2.field3, 0) <> ISNULL(c1.field3, 0) 
              AND c2.field4 = c1.field4 
      UNION ALL 
      SELECT DISTINCT c1.field1 
        , NULL 
        , c1.field3 
        , c1.field4 
      FROM @Customers c1 
        INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number 
              AND c2.field1 = c1.field1 
              AND ISNULL(c2.field2, 0) <> ISNULL(c1.field2, 0) 
              AND c2.field3 = c1.field3 
              AND c2.field4 = c1.field4 
      UNION ALL 
      SELECT DISTINCT NULL 
        , c1.field2 
        , c1.field3 
        , c1.field4 
      FROM @Customers c1 
        INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number 
              AND ISNULL(c2.field1, 0) <> ISNULL(c1.field1, 0) 
              AND c2.field2 = c1.field2 
              AND c2.field3 = c1.field3 
              AND c2.field4 = c1.field4 
    ) c 
+0

Will the INNER JOIN work if there are some NULL values? – hgulyan 2010-05-20 11:19:08

+0

This is nice, and it's actually how I did it at first, but the problem was inserting the groups into a new table again... – 2010-05-20 11:51:36

+0

Is there any way to add a row number to this query? – hgulyan 2010-05-20 12:02:44

52

Here is a handy query for finding duplicates in a table. Suppose you want to find all email addresses that appear more than once in a table:

SELECT email, COUNT(email) AS NumOccurrences 
FROM users 
GROUP BY email 
HAVING (COUNT(email) > 1) 

You can also use this technique to find rows that occur exactly once:

SELECT email 
FROM users 
GROUP BY email 
HAVING (COUNT(email) = 1) 
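
The GROUP BY queries above return only the duplicated value, not the rows themselves. If the full rows are needed, a windowed count is one option. The sketch below assumes the `users` table has a key column, here called `id`, which is a hypothetical name not given in the answer.

```sql
-- Sketch only: id is a hypothetical key column of the users table.
SELECT u.id, u.email, u.NumOccurrences
FROM (SELECT id, email,
             -- same count as in the GROUP BY version, repeated on every row
             COUNT(email) OVER (PARTITION BY email) AS NumOccurrences
      FROM users) u
WHERE u.NumOccurrences > 1
ORDER BY u.email, u.id;
```

The count has to be computed in a derived table because a window function cannot be referenced directly in a WHERE clause.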
+2

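Simple, beautiful answer. I could have thought of this myself, but I asked Google because I didn't want to think. I wasn't disappointed. This is the real answer. – 2012-07-18 19:45:08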

+0

Well played, sir. – Induster 2012-11-28 19:36:58

+0

Simple and works great. Thank you very much! – Migs 2013-01-31 17:52:08

0

You can write something simple like this to count duplicate entries; I think it works:

use *DATABASE_NAME* 
go 
SELECT  *YOUR_FIELD*, COUNT(*) AS dupes 
FROM   *YOUR_TABLE_NAME* 
GROUP BY *YOUR_FIELD* 
HAVING  (COUNT(*) > 1) 

Enjoy

+0

Duplicate of the answer above – 2013-09-10 10:01:52

0

There is a clean way of doing this with CUBE(), which aggregates by every possible combination of the columns:

SELECT 
    field1,field2,field3,field4 
,duplicate_row_count = COUNT(*) 
,grp_id = GROUPING_ID(field1,field2,field3,field4) 
INTO #duplicate_rows 
FROM table_name 
GROUP BY CUBE(field1,field2,field3,field4) 
HAVING COUNT(*) > 1 
    AND GROUPING_ID(field1,field2,field3,field4) IN (0,1,2,4,8,3,5,6,9,10,12) 

The numbers (0,1,2,4,8,3,5,6,9,10,12) are just the bitmasks (0000,0001,0010,0100,...,1010,1100) of the grouping sets we care about: those with 4, 3 or 2 matching columns.
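
To see how those bitmasks line up with the columns, a small sketch like the following can help. It uses a hypothetical one-row table variable `@t`. In `GROUPING_ID(field1, field2, field3, field4)` the leftmost argument is the most significant bit, so a column that CUBE has rolled up (and that will later act as a wildcard) contributes 8, 4, 2 or 1 respectively.

```sql
-- Sketch only: @t is a hypothetical one-row table used to list the grouping sets.
DECLARE @t TABLE (field1 INT, field2 INT, field3 INT, field4 INT);
INSERT INTO @t (field1, field2, field3, field4) VALUES (1, 1, 1, 1);

SELECT GROUPING_ID(field1, field2, field3, field4) AS grp_id
     , GROUPING(field1) AS f1_rolled_up   -- contributes 8 to grp_id
     , GROUPING(field2) AS f2_rolled_up   -- contributes 4
     , GROUPING(field3) AS f3_rolled_up   -- contributes 2
     , GROUPING(field4) AS f4_rolled_up   -- contributes 1
FROM @t
GROUP BY CUBE(field1, field2, field3, field4)
ORDER BY grp_id;
```

For example, grp_id 1 (0001) means only field4 was rolled up, so the group matches on field1, field2 and field3; grp_id 12 (1100) means field1 and field2 were rolled up, so the group matches on field3 and field4 only.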

Then join this back to the original table, treating the NULLs in #duplicate_rows as wildcards:

SELECT a.* 
FROM table_name a 
INNER JOIN #duplicate_rows b 
    ON NULLIF(b.field1,a.field1) IS NULL 
    AND NULLIF(b.field2,a.field2) IS NULL 
    AND NULLIF(b.field3,a.field3) IS NULL 
    AND NULLIF(b.field4,a.field4) IS NULL 
--WHERE grp_id IN (0)    --Use this for 4 matches 
--WHERE grp_id IN (1,2,4,8)  --Use this for 3 matches 
--WHERE grp_id IN (3,5,6,9,10,12) --Use this for 2 matches