2015-09-20 117 views
1

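I collected some data from an API to build a history. At first I saved all values every 5 minutes. Later I changed my program to save only the data that had changed. Goal: delete only consecutive duplicate rows.

Now I want to clean up my old data and delete every row whose count did not change compared with the previous value for the same account + item (id).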

account  id     count  time
42       12147  492    2015-09-20 11:31:14.0
42       12147  492    2015-09-20 11:36:19.0  // delete
13       12147  246    2015-09-20 11:31:14.0
2        12253  183    2015-09-20 11:36:19.0
2        19684  805    2015-09-20 12:00:41.0
2        19684  810    2015-09-20 12:05:41.0
2        19684  805    2015-09-20 12:10:41.0  // we had this combination but don't delete this row because the previous value was different
2        19684  805    2015-09-20 12:15:41.0  // delete
2        19684  805    2015-09-20 12:20:41.0  // delete
2        19684  806    2015-09-20 12:25:41.0

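I already tried to solve this with a GROUP BY over account, id, count, but if the same value shows up again after some time it falls into the same group, so that does not work.

Is this possible with a single SQL statement? I also thought about writing a small script that iterates over all the data and deletes the current row if account, id and count are the same as in the previous row, but I am curious whether it can be done in SQL.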
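The script idea mentioned above could look roughly like the following. This is only a sketch: the function name `drop_consecutive_duplicates` and the tuple layout are made up for illustration, and it works on an in-memory list instead of a real table.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical row layout: (account, id, count, time)
Row = Tuple[int, int, int, str]


def drop_consecutive_duplicates(rows: Iterable[Row]) -> List[Row]:
    """Keep a row only if its count differs from the previous row
    of the same (account, id) series."""
    kept: List[Row] = []
    last_count: Dict[Tuple[int, int], int] = {}
    # Sort by account, id, then time so each series is walked chronologically.
    # The timestamps share one fixed format, so string order is time order.
    for row in sorted(rows, key=lambda r: (r[0], r[1], r[3])):
        account, item_id, count, _time = row
        key = (account, item_id)
        if last_count.get(key) != count:
            kept.append(row)  # first row of the series, or the value changed
        last_count[key] = count
    return kept


sample = [
    (42, 12147, 492, "2015-09-20 11:31:14.0"),
    (42, 12147, 492, "2015-09-20 11:36:19.0"),
    (13, 12147, 246, "2015-09-20 11:31:14.0"),
    (2, 12253, 183, "2015-09-20 11:36:19.0"),
    (2, 19684, 805, "2015-09-20 12:00:41.0"),
    (2, 19684, 810, "2015-09-20 12:05:41.0"),
    (2, 19684, 805, "2015-09-20 12:10:41.0"),
    (2, 19684, 805, "2015-09-20 12:15:41.0"),
    (2, 19684, 805, "2015-09-20 12:20:41.0"),
    (2, 19684, 806, "2015-09-20 12:25:41.0"),
]

kept = drop_consecutive_duplicates(sample)
```

With the sample data from the question, the three rows marked `// delete` are dropped and the 805 row at 12:10:41 is kept, because the value before it was 810.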

+0

The third row belongs to a different "account", so the rows are sorted in ascending order, aren't they? –

Answers

2

You can use the following query:

DELETE history 
FROM history 
INNER JOIN (SELECT MIN(time) AS minTime, account, id, count 
      FROM history 
      GROUP BY account, id, count) AS h 
ON history.account = h.account AND history.id = h.id AND history.count = h.count 
WHERE history.time > h.minTime 

Demo here
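To see why grouping on account, id, count alone is not enough for the clarified question, here is a small check of which rows the MIN(time) join selects. It is a sketch using Python's built-in `sqlite3` instead of MySQL, and the `pk` column is an assumption added only to name the rows.

```python
import sqlite3

rows = [
    (42, 12147, 492, "2015-09-20 11:31:14.0"),  # pk 1
    (42, 12147, 492, "2015-09-20 11:36:19.0"),  # pk 2
    (13, 12147, 246, "2015-09-20 11:31:14.0"),  # pk 3
    (2, 12253, 183, "2015-09-20 11:36:19.0"),   # pk 4
    (2, 19684, 805, "2015-09-20 12:00:41.0"),   # pk 5
    (2, 19684, 810, "2015-09-20 12:05:41.0"),   # pk 6
    (2, 19684, 805, "2015-09-20 12:10:41.0"),   # pk 7
    (2, 19684, 805, "2015-09-20 12:15:41.0"),   # pk 8
    (2, 19684, 805, "2015-09-20 12:20:41.0"),   # pk 9
    (2, 19684, 806, "2015-09-20 12:25:41.0"),   # pk 10
]

conn = sqlite3.connect(":memory:")
# pk is an assumed surrogate key; the question's table does not show one.
conn.execute(
    "CREATE TABLE history (pk INTEGER PRIMARY KEY, account INTEGER, "
    "id INTEGER, count INTEGER, time TEXT)"
)
conn.executemany(
    "INSERT INTO history (account, id, count, time) VALUES (?, ?, ?, ?)", rows
)

# Same join condition as the DELETE above, written as a SELECT so that
# nothing is removed: every row later than the earliest row of its group.
selected = [
    r[0]
    for r in conn.execute(
        """
        SELECT h.pk
        FROM history AS h
        INNER JOIN (SELECT MIN(time) AS minTime, account, id, count
                    FROM history
                    GROUP BY account, id, count) AS g
          ON h.account = g.account AND h.id = g.id AND h.count = g.count
        WHERE h.time > g.minTime
        ORDER BY h.pk
        """
    )
]
```

On the sample data this selects pk 7 (the 805 at 12:10:41) together with the real duplicates, because all four 805 rows land in the same group no matter what came in between.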

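Edit:

After the edit, I think there is still something wrong in the OP's sample data (the `time` field should be in ascending order).

With the additional assumption that the table has a primary key (PK), you can use the following query: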

SELECT pk 
FROM history AS h1 
WHERE account = (SELECT account 
       FROM history AS h2 
       WHERE h1.account = h2.account AND 
         h1.id = h2.id AND      
         h2.time < h1.time 
       ORDER BY time DESC 
       LIMIT 1) 
     AND 
     id = (SELECT id 
      FROM history AS h2 
      WHERE h1.account = h2.account AND 
        h1.id = h2.id AND     
        h2.time < h1.time 
      ORDER BY time DESC 
      LIMIT 1) 
     AND 
     count = (SELECT count 
       FROM history AS h2 
       WHERE h1.account = h2.account AND 
        h1.id = h2.id AND      
        h2.time < h1.time 
       ORDER BY time DESC 
       LIMIT 1) 

to identify the records to delete (see this demo).

DELETE FROM history 
WHERE pk IN (
SELECT x.pk 
FROM (    
    SELECT pk 
    FROM history AS h1 
    WHERE 
    account = (SELECT account 
       FROM history AS h2 
       WHERE h1.account = h2.account AND 
         h1.id = h2.id AND      
         h2.time < h1.time 
         ORDER BY time DESC 
         LIMIT 1) 

    AND 

    id = (SELECT id 
      FROM history AS h2 
      WHERE h1.account = h2.account AND 
       h1.id = h2.id AND     
       h2.time < h1.time 
      ORDER BY time DESC 
      LIMIT 1) 

    AND 

    count = (SELECT count 
       FROM history AS h2 
       WHERE h1.account = h2.account AND 
        h1.id = h2.id AND      
        h2.time < h1.time 
       ORDER BY time DESC 
       LIMIT 1)) AS x) 

Demo here
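A shorter variant of the same idea, for what it is worth: the inner subquery already restricts `h2` to the same account and id, so comparing only `count` against the previous row appears to be enough. The sketch below runs it with Python's built-in `sqlite3` rather than MySQL (in MySQL the extra derived table `x` from the query above would still be needed), and the `pk` column is the assumed primary key.

```python
import sqlite3

rows = [
    (42, 12147, 492, "2015-09-20 11:31:14.0"),  # pk 1
    (42, 12147, 492, "2015-09-20 11:36:19.0"),  # pk 2
    (13, 12147, 246, "2015-09-20 11:31:14.0"),  # pk 3
    (2, 12253, 183, "2015-09-20 11:36:19.0"),   # pk 4
    (2, 19684, 805, "2015-09-20 12:00:41.0"),   # pk 5
    (2, 19684, 810, "2015-09-20 12:05:41.0"),   # pk 6
    (2, 19684, 805, "2015-09-20 12:10:41.0"),   # pk 7
    (2, 19684, 805, "2015-09-20 12:15:41.0"),   # pk 8
    (2, 19684, 805, "2015-09-20 12:20:41.0"),   # pk 9
    (2, 19684, 806, "2015-09-20 12:25:41.0"),   # pk 10
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (pk INTEGER PRIMARY KEY, account INTEGER, "
    "id INTEGER, count INTEGER, time TEXT)"
)
conn.executemany(
    "INSERT INTO history (account, id, count, time) VALUES (?, ?, ?, ?)", rows
)

# Delete a row when the latest earlier row of the same (account, id)
# has the same count. For the first row of a series the subquery yields
# NULL, the comparison is not true, and the row is kept.
conn.execute(
    """
    DELETE FROM history
    WHERE pk IN (
        SELECT h1.pk
        FROM history AS h1
        WHERE h1.count = (SELECT h2.count
                          FROM history AS h2
                          WHERE h2.account = h1.account
                            AND h2.id = h1.id
                            AND h2.time < h1.time
                          ORDER BY h2.time DESC
                          LIMIT 1)
    )
    """
)
conn.commit()

remaining = [r[0] for r in conn.execute("SELECT pk FROM history ORDER BY pk")]
```

On the sample data only the three rows marked `// delete` in the question are removed.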

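Edit 2:

With the NOT IN operator you can now easily delete the unwanted rows.

Using variables to locate the `pk` values to be deleted may result in a considerably faster query: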

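SELECT pk 
FROM (
    SELECT pk, account, id, count, time, 
     -- rn restarts at 1 whenever account, id or count differs from the previous row
     @rn := IF (account = @acc AND id = @id AND count = @count, 
        @rn + 1, 1) AS rn, 
     @acc := account, 
     @id := id, 
     @count := count 
    FROM history 
    -- variables must be initialized with :=, a plain = is only a comparison
    CROSS JOIN (SELECT @rn := 0, @acc := 0, @id := 0, @count := 0) AS vars 
    ORDER BY account, id, time, count) AS t 
WHERE t.rn > 1 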

Demo here
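On engines with window functions (MySQL 8.0 and later, for example), the same "compare with the previous row" logic can be written with `LAG` instead of user variables. This is a sketch under assumptions, not something from the original answer: it runs against Python's built-in `sqlite3`, needs an SQLite build of 3.25 or newer, and again assumes a `pk` column.

```python
import sqlite3

rows = [
    (42, 12147, 492, "2015-09-20 11:31:14.0"),  # pk 1
    (42, 12147, 492, "2015-09-20 11:36:19.0"),  # pk 2
    (13, 12147, 246, "2015-09-20 11:31:14.0"),  # pk 3
    (2, 12253, 183, "2015-09-20 11:36:19.0"),   # pk 4
    (2, 19684, 805, "2015-09-20 12:00:41.0"),   # pk 5
    (2, 19684, 810, "2015-09-20 12:05:41.0"),   # pk 6
    (2, 19684, 805, "2015-09-20 12:10:41.0"),   # pk 7
    (2, 19684, 805, "2015-09-20 12:15:41.0"),   # pk 8
    (2, 19684, 805, "2015-09-20 12:20:41.0"),   # pk 9
    (2, 19684, 806, "2015-09-20 12:25:41.0"),   # pk 10
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (pk INTEGER PRIMARY KEY, account INTEGER, "
    "id INTEGER, count INTEGER, time TEXT)"
)
conn.executemany(
    "INSERT INTO history (account, id, count, time) VALUES (?, ?, ?, ?)", rows
)

# LAG(count) is the count of the previous row within the same
# (account, id) series ordered by time; it is NULL for the first row.
duplicates = [
    r[0]
    for r in conn.execute(
        """
        SELECT pk
        FROM (SELECT pk, count,
                     LAG(count) OVER (PARTITION BY account, id
                                      ORDER BY time) AS prev_count
              FROM history) AS t
        WHERE count = prev_count
        ORDER BY pk
        """
    )
]
```

The resulting list of `pk` values can then be fed to a `DELETE ... WHERE pk IN (...)` statement like the one shown earlier.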

+0

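That demo is great. But the thing with account 2, id 19684 is that it was 805, then rose to 810 and went back to 805. These are all valid changes. Only the two following 805s should then be deleted, not the first one, which comes after 810 and before 806. –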

+0

I just noticed that I marked one row too many for deletion in my question. Edited it. Sorry... –

+0

@dasKeks I edited my answer, it should work now. –

0

You can delete all but the first with this (untested) code:

delete from history h1 
where exists (select 1 
       from history h2 
       where 
       h1.account = h2.account and 
       h1.id = h2.id and 
       h1.count = h2.count and 
       h1.time < h2.time 
      ) 
+1

I think it should be `h1.time > h2.time`, because the OP wants to keep the older records –