2013-09-30 70 views
3

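I'm trying to crunch some binge-watching statistics, and I'd like to find out how long each user's longest binge-watching streak is. A binge is a run of several programmes watched in succession, with consecutive viewings no more than about 2 hours apart. I'd like to calculate this using SQL. The data looks like this: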

datetime    user_id program 
2013-09-01 00:01:18  1  A 
2013-09-10 14:03:14  1  B 
2013-09-20 17:02:12  2  A 
2013-09-21 00:03:22  2  C <-- user 2 binge start 
2013-09-21 01:23:22  2  M 
2013-09-21 03:03:22  2  E 
2013-09-21 04:03:22  2  F 
2013-09-21 06:03:22  2  G <-- user 2 binge end 
2013-09-21 09:03:22  2  H 
2013-09-03 18:21:09  3  D 
2013-09-21 09:03:22  2  H 
2013-09-24 19:21:00  2  X <-- user 2 second binge start 
2013-09-24 20:21:00  2  Y 
2013-09-24 21:21:00  2  Z <-- user 2 second binge end 

SQL Fiddle

In this example, user 2 has one binge lasting 6 hours, followed later by another lasting 2 hours.

The final result I'd like looks like this:

user_id  binge  length 
2   1   6 hours 
2   2   2 hours 

Can this be calculated directly in the database?

+1

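I'm afraid I don't know what a "binge" is, nor how to measure whether one is "long" enough. If you can describe this properly in your question and add a more complete example, I'll be able to look at a possible SQL solution. – vyegorov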

+1

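What does "about" mean? Do you mean you want consecutive rows where the next row's time is 1.5 to 2.5 hours after the previous one? Oh, and this is Postgres: you can do anything, even plenty of things you shouldn't. –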

+0

@JakubKania To be precise, the next viewing should come less than, say, 2 hours after the previous one. I'll experiment with that limit. – jenswirf

Answers

3

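This is a problem of identifying sequences (streaks) in the data. My preferred way of doing this is to:

  • use the LAG function to identify the start of each streak
  • use the SUM function as a running total to assign a unique number to each streak
  • then group by this unique number for further processing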

Query:

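-- Step 1: flag each row that starts a new streak (gap to the previous viewing > 2 hours)
with start_grp as (
    select dt, user_id, programme,
           case when dt - lag(dt, 1) over (partition by user_id order by dt)
                     > interval '0 day 2:00:00'
                then 1
                else 0
           end as grp_start
    from binge
),
-- Step 2: a running sum of the flags gives each streak its own number per user
assign_grp as (
    select dt, user_id, programme,
           sum(grp_start) over (partition by user_id order by dt) as grp
    from start_grp
)
-- Step 3: group by streak and keep only streaks with more than one viewing
select user_id, grp as binge, max(dt) - min(dt) as binge_length
from assign_grp
group by user_id, grp
having count(programme) > 1;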

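Note that the values in the binge column may not be sequential, because single-viewing streaks are filtered out after the numbers are assigned. You can correct that by using the ROW_NUMBER function in the final query.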
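For readers who want to check the logic outside the database, here is a rough Python sketch of the same three steps, including the sequential renumbering. It is an illustration only, not part of the answer's query: the function name `find_binges`, the `rows` sample (user 2's viewings from the question) and the `max_gap` parameter are all made up for this example.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Hypothetical sample: (user_id, viewing time) for user 2 from the question.
rows = [
    (2, "2013-09-20 17:02:12"),
    (2, "2013-09-21 00:03:22"),
    (2, "2013-09-21 01:23:22"),
    (2, "2013-09-21 03:03:22"),
    (2, "2013-09-21 04:03:22"),
    (2, "2013-09-21 06:03:22"),
    (2, "2013-09-21 09:03:22"),
    (2, "2013-09-24 19:21:00"),
    (2, "2013-09-24 20:21:00"),
    (2, "2013-09-24 21:21:00"),
]


def find_binges(rows, max_gap=timedelta(hours=2)):
    """Return (user_id, binge_number, length) for streaks of two or more viewings."""
    parsed = sorted((u, datetime.strptime(t, "%Y-%m-%d %H:%M:%S")) for u, t in rows)
    result = []
    for user_id, group in groupby(parsed, key=lambda r: r[0]):
        streaks = {}  # streak number -> viewing times, in insertion order
        grp = 0
        prev = None
        for _, dt in group:
            # Step 1 (LAG): a gap above the threshold marks the start of a new streak.
            # Step 2 (running SUM): each start bumps the streak number.
            if prev is not None and dt - prev > max_gap:
                grp += 1
            streaks.setdefault(grp, []).append(dt)
            prev = dt
        # Step 3 (GROUP BY ... HAVING count > 1), renumbered 1, 2, ... per user,
        # which is the role ROW_NUMBER would play in the SQL version.
        kept = [s for s in streaks.values() if len(s) > 1]
        for number, times in enumerate(kept, start=1):
            result.append((user_id, number, times[-1] - times[0]))
    return result
```

As in the SQL, a gap of exactly 2 hours does not start a new streak, because the comparison is strictly greater than the threshold.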

Demo on sqlfiddle

+0

Using 'grp_start' is indeed simpler than the recursive CTE I wrote. – Bruno

1

Here is a solution using a recursive CTE (it isn't really "recursive", but that's what they're called) and window functions. It requires at least PostgreSQL 8.4.

SQL Fiddle

PostgreSQL 9.1.9 Schema Setup:

CREATE TABLE viewings (
    user_id INTEGER NOT NULL, 
    datetime TIMESTAMPTZ NOT NULL, 
    programme TEXT NOT NULL, 
    PRIMARY KEY (user_id, datetime) 
); 

INSERT INTO viewings (datetime, user_id, programme) VALUES 
('2013-09-01 00:01:18', 1, 'A'), 
('2013-09-10 14:03:14', 1, 'B'), 
('2013-09-20 17:02:12', 2, 'A'), 
('2013-09-21 00:03:22', 2, 'C'), 
('2013-09-21 01:23:22', 2, 'M'), 
('2013-09-21 03:03:22', 2, 'E'), 
('2013-09-21 04:03:22', 2, 'F'), 
('2013-09-21 06:03:22', 2, 'G'), 
('2013-09-21 09:03:22', 2, 'H'), 
('2013-09-03 18:21:09', 3, 'D'), 
('2013-09-22 09:03:22', 2, 'H'), 
('2013-09-24 19:21:00', 2, 'X'), 
('2013-09-24 20:21:00', 2, 'Y'), 
('2013-09-24 21:21:00', 2, 'Z'); 

Query 1:

WITH RECURSIVE consecutive_viewings(user_id, first_dt, last_dt) AS (
    WITH lagged_viewings AS (
    SELECT user_id, LAG(user_id) OVER w AS prev_user_id, 
      datetime, LAG(datetime) OVER w AS prev_datetime, 
      programme 
    FROM viewings WINDOW w AS (PARTITION BY user_id ORDER BY datetime) 
) 
    SELECT user_id, datetime AS first_dt, datetime AS last_dt 
    FROM lagged_viewings 
    WHERE prev_datetime IS NULL OR (prev_datetime + '2 hours'::interval) < datetime 
    UNION ALL 
    SELECT lv.user_id, cv.first_dt, lv.datetime AS last_dt 
    FROM consecutive_viewings cv 
     INNER JOIN lagged_viewings lv 
     ON lv.user_id=cv.user_id AND 
     lv.prev_datetime=cv.last_dt 
     WHERE (lv.prev_datetime + '2 hours'::interval) >= lv.datetime 
) 
SELECT user_id, first_dt, MAX(last_dt) AS last_dt 
    FROM consecutive_viewings 
    WHERE first_dt != last_dt 
    GROUP BY user_id, first_dt 
    ORDER BY user_id, first_dt 

Results

| USER_ID |       FIRST_DT |       LAST_DT | 
|---------|----------------------------------|----------------------------------| 
|  2 | September, 21 2013 00:03:22+0000 | September, 21 2013 06:03:22+0000 | 
|  2 | September, 24 2013 19:21:00+0000 | September, 24 2013 21:21:00+0000 | 

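To understand this, it is probably easier to start with the innermost CTE. It orders the viewings by user_id and datetime, and adds an extra column holding the timestamp of the previous viewing so that viewings can be linked together later. This part is not recursive (and the query below does not even need a CTE on its own):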

Query 2:

WITH lagged_viewings AS (
    SELECT user_id, LAG(user_id) OVER w AS prev_user_id, 
      datetime, LAG(datetime) OVER w AS prev_datetime, 
      programme 
    FROM viewings WINDOW w AS (PARTITION BY user_id ORDER BY datetime) 
) 
SELECT * FROM lagged_viewings 

Results

| USER_ID | PREV_USER_ID |       DATETIME |     PREV_DATETIME | PROGRAMME | 
|---------|--------------|----------------------------------|----------------------------------|-----------| 
|  1 |  (null) | September, 01 2013 00:01:18+0000 |       (null) |   A | 
|  1 |   1 | September, 10 2013 14:03:14+0000 | September, 01 2013 00:01:18+0000 |   B | 
|  2 |  (null) | September, 20 2013 17:02:12+0000 |       (null) |   A | 
|  2 |   2 | September, 21 2013 00:03:22+0000 | September, 20 2013 17:02:12+0000 |   C | 
|  2 |   2 | September, 21 2013 01:23:22+0000 | September, 21 2013 00:03:22+0000 |   M | 
|  2 |   2 | September, 21 2013 03:03:22+0000 | September, 21 2013 01:23:22+0000 |   E | 
|  2 |   2 | September, 21 2013 04:03:22+0000 | September, 21 2013 03:03:22+0000 |   F | 
|  2 |   2 | September, 21 2013 06:03:22+0000 | September, 21 2013 04:03:22+0000 |   G | 
|  2 |   2 | September, 21 2013 09:03:22+0000 | September, 21 2013 06:03:22+0000 |   H | 
|  2 |   2 | September, 22 2013 09:03:22+0000 | September, 21 2013 09:03:22+0000 |   H | 
|  2 |   2 | September, 24 2013 19:21:00+0000 | September, 22 2013 09:03:22+0000 |   X | 
|  2 |   2 | September, 24 2013 20:21:00+0000 | September, 24 2013 19:21:00+0000 |   Y | 
|  2 |   2 | September, 24 2013 21:21:00+0000 | September, 24 2013 20:21:00+0000 |   Z | 
|  3 |  (null) | September, 03 2013 18:21:09+0000 |       (null) |   D | 

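The recursive CTE itself may be a bit harder to understand. The "recursion" relies on a UNION ALL between two SELECT statements.

  • The first one seeds the iteration (it is the non-recursive part): it finds the rows that start a chain of viewings, i.e. rows where the previous datetime is null (the first viewing for that user) or where the previous datetime is further back than your cut-off interval.
  • The second one extends each chain with the next viewing. Some of the resulting durations overlap, because this step doesn't know when a chain ends. This is where the conditions in the full query at the top come in: MAX picks the longest duration for each chain, and the first_dt != last_dt filter eliminates single-viewing periods.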

Query 3:

WITH RECURSIVE consecutive_viewings(user_id, first_dt, last_dt) AS (
    WITH lagged_viewings AS (
    SELECT user_id, LAG(user_id) OVER w AS prev_user_id, 
      datetime, LAG(datetime) OVER w AS prev_datetime, 
      programme 
    FROM viewings WINDOW w AS (PARTITION BY user_id ORDER BY datetime) 
) 
    -- These are the starts of the "binge" durations 
    SELECT user_id, datetime AS first_dt, datetime AS last_dt 
    FROM lagged_viewings 
    WHERE prev_datetime IS NULL OR (prev_datetime + '2 hours'::interval) < datetime 
    UNION ALL 
    -- These are the extended periods 
    SELECT lv.user_id, cv.first_dt, lv.datetime AS last_dt 
    FROM consecutive_viewings cv 
     INNER JOIN lagged_viewings lv 
     ON lv.user_id=cv.user_id AND 
     lv.prev_datetime=cv.last_dt 
     WHERE (lv.prev_datetime + '2 hours'::interval) >= lv.datetime 
) 
SELECT * FROM consecutive_viewings 
    ORDER BY user_id, first_dt, last_dt 

Results

| USER_ID |       FIRST_DT |       LAST_DT | 
|---------|----------------------------------|----------------------------------| 
|  1 | September, 01 2013 00:01:18+0000 | September, 01 2013 00:01:18+0000 | 
|  1 | September, 10 2013 14:03:14+0000 | September, 10 2013 14:03:14+0000 | 
|  2 | September, 20 2013 17:02:12+0000 | September, 20 2013 17:02:12+0000 | 
|  2 | September, 21 2013 00:03:22+0000 | September, 21 2013 00:03:22+0000 | 
|  2 | September, 21 2013 00:03:22+0000 | September, 21 2013 01:23:22+0000 | 
|  2 | September, 21 2013 00:03:22+0000 | September, 21 2013 03:03:22+0000 | 
|  2 | September, 21 2013 00:03:22+0000 | September, 21 2013 04:03:22+0000 | 
|  2 | September, 21 2013 00:03:22+0000 | September, 21 2013 06:03:22+0000 | 
|  2 | September, 21 2013 09:03:22+0000 | September, 21 2013 09:03:22+0000 | 
|  2 | September, 22 2013 09:03:22+0000 | September, 22 2013 09:03:22+0000 | 
|  2 | September, 24 2013 19:21:00+0000 | September, 24 2013 19:21:00+0000 | 
|  2 | September, 24 2013 19:21:00+0000 | September, 24 2013 20:21:00+0000 | 
|  2 | September, 24 2013 19:21:00+0000 | September, 24 2013 21:21:00+0000 | 
|  3 | September, 03 2013 18:21:09+0000 | September, 03 2013 18:21:09+0000 |
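The seed-and-extend behaviour described above can also be mimicked in plain Python, which may make the overlapping rows in this result easier to follow. This is a sketch under assumptions, not PostgreSQL's actual execution: the helper names `ts`, `consecutive_viewings` and `binges`, the `max_gap` parameter and the three-row sample (user 2's second binge) are all made up for illustration.

```python
from datetime import datetime, timedelta


def ts(text):
    """Parse a timestamp in the format used by the sample data."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Hypothetical (user_id, datetime, prev_datetime) rows, shaped like lagged_viewings.
lagged_rows = [
    (2, ts("2013-09-24 19:21:00"), ts("2013-09-22 09:03:22")),
    (2, ts("2013-09-24 20:21:00"), ts("2013-09-24 19:21:00")),
    (2, ts("2013-09-24 21:21:00"), ts("2013-09-24 20:21:00")),
]


def consecutive_viewings(lagged_rows, max_gap=timedelta(hours=2)):
    # Seed (non-recursive term): rows that start a chain.
    working = [(u, dt, dt) for u, dt, prev in lagged_rows
               if prev is None or prev + max_gap < dt]
    result = list(working)
    # Iterative term: extend only the rows produced by the previous round,
    # and stop once a round produces nothing new.
    while working:
        working = [(u, first_dt, dt)
                   for (cu, first_dt, last_dt) in working
                   for (u, dt, prev) in lagged_rows
                   if u == cu and prev == last_dt and prev + max_gap >= dt]
        result.extend(working)
    return result


def binges(lagged_rows):
    # Outer query: latest last_dt per (user_id, first_dt), single viewings dropped.
    latest = {}
    for u, first_dt, last_dt in consecutive_viewings(lagged_rows):
        if first_dt != last_dt:
            key = (u, first_dt)
            latest[key] = max(latest.get(key, last_dt), last_dt)
    return sorted((u, first, last) for (u, first), last in latest.items())
```

Here `consecutive_viewings` returns one row per chain prefix, like the three rows for 24 September above, and `binges` collapses them the way the GROUP BY with MAX does in Query 1.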