2014-10-05 80 views

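Window functions inside window functions in PostgreSQL

I'm facing a query design problem and I'm not sure whether my approach to it is unnecessarily complicated. One of my current analytical queries, for example, is: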

with intervals as (
    select 
    (select '09/27/2014'::date) + (n  || ' minutes')::interval start_time, 
    (select '09/27/2014'::date) + ((n+60) || ' minutes')::interval end_time 
     from generate_series(0, (24*60*7), 60 * 4) n 
) 
    select 
    extract(epoch from i.start_time)::numeric * 1000 as ts, 
    extract(epoch from i.end_time)::numeric * 1000 as end_ts, 
    sum(avg(messages.score)) over (order by i.start_time) as score 

    from messages 
    right join intervals i 
    on messages.timestamp >= i.start_time and messages.timestamp < i.end_time 

    where messages.timestamp between '09/27/2014' and '10/04/2014' 

    group by i.start_time, i.end_time 
    order by i.start_time 

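As you can probably tell, this query computes the average of the "score" attribute for the messages that fall into each time bucket, and then computes a cumulative sum of those averages across the buckets (using a window).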
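To see the aggregate-inside-a-window pattern in isolation, here is a minimal sketch on made-up data. The `scored` rows and the `bucket` column are illustrative only and do not come from the real `messages` table. The inner avg() is evaluated once per group, and the outer sum() ... over then runs across the grouped rows:

```sql
-- Hypothetical sample data: (bucket, score) pairs standing in for bucketed messages.
with scored(bucket, score) as (
    values (1, 2.0), (1, 4.0), (2, 10.0), (3, 1.0), (3, 3.0)
)
select bucket,
       avg(score)                             as bucket_avg,    -- aggregate, one value per bucket
       sum(avg(score)) over (order by bucket) as running_total  -- window over the aggregated rows
from scored
group by bucket
order by bucket;
```

With these sample rows the per-bucket averages are 3, 10 and 2, so the running total goes 3, 13, 15.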

What I want to do next is find the top 5 (for example) messages.text values whose score is closest to the average of each bucket.

Right now, the only plan I have is:

1) Join messages with the time-buckets 
2) Compute a score - avg(score) over (partition by start_time) as deviation and save it against each record of the joined relation 
3) Compute a rank() over (order by deviation) as rank 
4) Select where rank between 1 and 5 

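The reason I wrote this down as imperative steps is that my first attempt at a design involved using a window function inside a window function (rank() over (partition by start_time order by score - avg(score) over (partition by start_time))), and I didn't even try it to see whether it was feasible.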
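For what it's worth, PostgreSQL does not accept a window function call inside another window's PARTITION BY or ORDER BY expression, so the nested form above is expected to be rejected. A common workaround, sketched here on made-up rows (`scored`, `bucket` and `msg` are hypothetical names), is to compute the deviation in one query level and rank it in the next:

```sql
-- Hypothetical sample data standing in for messages already joined to buckets.
with scored(bucket, msg, score) as (
    values (1, 'a', 1.0), (1, 'b', 2.0), (1, 'c', 6.0),
           (2, 'd', 4.0), (2, 'e', 5.0)
), deviations as (
    -- First level: distance of each row's score from its bucket average.
    select bucket, msg, score,
           abs(score - avg(score) over (partition by bucket)) as deviation
    from scored
)
-- Second level: rank rows within each bucket by that distance.
select bucket, msg, score, deviation,
       rank() over (partition by bucket order by deviation) as rnk
from deviations
order by bucket, rnk;
```

Splitting the work into two levels keeps each window definition free of other window calls, which is the part the nested attempt runs into.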

Could I get some advice pointing me in the right direction?

Note: 'generate_series()' also works on timestamps. 'generate_series('2014-09-27','2014-10-04','1 hour'::interval)' might do what you want. – wildplasser 2014-10-05 10:40:57

Correction: that should be 'generate_series('2014-09-27 00:00:00','2014-10-04 00:00:00','1 hour'::interval)' – wildplasser 2014-10-05 11:29:38
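A sketch of how that suggestion might look if the question's original bucket layout is kept (one-hour buckets starting every four hours, rather than the contiguous one-hour step in the comment). The alias `g(start_time)` is just an illustrative name:

```sql
-- Generate bucket start times directly as timestamps instead of minute offsets.
select g.start_time,
       g.start_time + interval '1 hour' as end_time   -- each bucket is one hour wide
from generate_series(
         timestamp '2014-09-27 00:00:00',
         timestamp '2014-10-04 00:00:00',
         interval '4 hours'                            -- a new bucket every four hours
     ) as g(start_time);
```

This yields the same start times as the integer version (0 to 24*60*7 minutes in steps of 240), without the string concatenation and cast to interval.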

@wildplasser Ah yes, you're right. That's a good refactoring suggestion and I'll fix it! ^_^ – Slania 2014-10-05 14:25:30

Answers

0

Welp, here is what I have so far, and it seems to work:

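Now open to criticism of the performance, structure, and redundancy of my query! ^_^ (Apart from generating the time series directly instead of all the convoluted interval math, which I will fix eventually.)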

with intervals as (
    select 
     (select '09/29/2014'::date) + (n  || ' minutes')::interval start_time, 
     (select '09/29/2014'::date) + ((n+60) || ' minutes')::interval end_time 
     from generate_series(0, (24*60*7), 60 * 4) n 
), intervaled_messages as (
    select 
     extract(epoch from i.start_time)::numeric * 1000 as ts, 
     extract(epoch from i.end_time)::numeric * 1000 as end_ts, 
     abs(score - avg(score) over (partition by i.start_time)) as deviation 
    from messages 
    right join intervals i 
     on messages.timestamp >= i.start_time and messages.timestamp < i.end_time 
    where messages.timestamp between '09/29/2014' and '10/06/2014' 
), ranked_messages as (
    select ts, end_ts, deviation, 
    rank() over (partition by ts order by deviation) as rank, 
    row_number() over (partition by ts order by deviation) as row_number 
    from intervaled_messages 
) 
select ts, end_ts, deviation, rank 
from ranked_messages 
where rank between 1 and 5 
    and row_number between 1 and 5 
order by ts; 
0

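The direction you should head in (this is just my suggestion):

  1. Get the average score (over all records)
  2. Compute MINUS(row score, avg(score))

-- This leaves you with both positive and negative values

  3. Apply abs() to each result from step 2, in the same calculation
  4. Use rank() and order the rows appropriately
  5. WHERE rank BETWEEN 1 AND 5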
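The steps above can be sketched as a single query against the question's tables. This is only an outline under a few assumptions: the average is taken per time bucket (as the question asks) rather than over all records, the CTE names `deviations` and `ranked` and the alias `message_text` are made up for illustration, and `messages` is assumed to have the `timestamp`, `score` and `text` columns the question mentions:

```sql
with intervals as (
    -- One-hour buckets starting every four hours, as in the question.
    select g.start_time, g.start_time + interval '1 hour' as end_time
    from generate_series(timestamp '2014-09-29',
                         timestamp '2014-10-06',
                         interval '4 hours') as g(start_time)
), deviations as (
    -- Steps 1 to 3: absolute distance of each score from its bucket average.
    select i.start_time, i.end_time,
           m.text as message_text, m.score,
           abs(m.score - avg(m.score) over (partition by i.start_time)) as deviation
    from intervals i
    join messages m
      on m.timestamp >= i.start_time and m.timestamp < i.end_time
), ranked as (
    -- Step 4: rank messages within each bucket, closest to the average first.
    select *, rank() over (partition by start_time order by deviation) as rnk
    from deviations
)
-- Step 5: keep the five closest messages per bucket.
select start_time, end_time, message_text, score, deviation, rnk
from ranked
where rnk between 1 and 5
order by start_time, rnk;
```

Note that rank() can return more than five rows per bucket when deviations tie; swapping in row_number() would cap the result at exactly five, at the cost of breaking ties arbitrarily.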