Optimizing nested join window functions for large PostgreSQL tables

2012-07-11

I have been running the following query against a 56 GB table (789,700,760 rows) and am hitting a bottleneck in execution time. From some earlier examples I have seen, there may be a way to 'nest' the INNER JOIN so that the query performs better on large data sets. In particular, the query below took 7.651 hours to execute on an MPP PostgreSQL deployment.

create table large_table as 
select column1, column2, column3, column4, column5, column6 
from 
(
    select 
    a.column1, a.column2, a.start_time, 
    rank() OVER( 
     PARTITION BY a.column2, a.column1 order by a.start_time DESC 
    ) as rank, 
    last_value(a.column3) OVER (
     PARTITION BY a.column2, a.column1 order by a.start_time ASC 
     RANGE BETWEEN unbounded preceding and unbounded following 
    ) as column3, 
    a.column4, a.column5, a.column6 
    from 
    (table2 s 
     INNER JOIN table3 t 
     ON s.column2=t.column2 and s.event_time > t.start_time 
    ) a 
) b 
where rank = 1;

Question 1: Is there a way to modify the above sql code to speed up the overall execution time of the query?
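The shape of this query, keeping only the rank() = 1 row per group while also computing last_value() over the full partition, can be sketched on a toy table. The sketch below uses SQLite through Python's sqlite3 module purely as a self-contained illustration (the table and column names are made up, not the asker's schema); the window-function semantics are the same as in PostgreSQL.

```python
import sqlite3

# Toy reproduction of the "latest row per group" pattern from the question.
# Names here (events, grp, payload) are illustrative only.
# Requires SQLite 3.25+ for window-function support.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (grp TEXT, start_time INTEGER, payload TEXT);
INSERT INTO events VALUES
  ('x', 1, 'first'), ('x', 3, 'latest'),
  ('y', 2, 'only');
""")

rows = con.execute("""
SELECT grp, payload, last_payload
FROM (
  SELECT grp, payload,
         rank() OVER (PARTITION BY grp ORDER BY start_time DESC) AS rnk,
         last_value(payload) OVER (
           PARTITION BY grp ORDER BY start_time ASC
           RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
         ) AS last_payload
  FROM events
) b
WHERE rnk = 1
ORDER BY grp
""").fetchall()

# One row per group: the row with the largest start_time. Note that on the
# rnk = 1 row, last_payload always equals payload, since both pick out the
# value at the maximum start_time of the partition.
print(rows)
```

Running this prints `[('x', 'latest', 'latest'), ('y', 'only', 'only')]`, which also illustrates the point raised in the comment below the question: on the rank = 1 rows, the last_value() column duplicates the row's own value.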


If rank returns only one row for each column2, column1 combination, then last_value() seems redundant. Are you expecting multiple rows? Otherwise, the value of column3 on the rank = 1 row should be the same as the computed value. – 2012-07-11 18:45:30

Answer


You can move the LAST_VALUE into the outer query, which may buy you some improvement in performance. The LAST_VALUE picks up the value of column3 at the largest start_time in each partition, which is exactly the row where rank = 1:

select column1, column2, 
     last_value(column3) OVER (PARTITION BY column2, column1 order by start_time ASC 
            RANGE BETWEEN unbounded preceding and unbounded following 
           ) as column3, 
     column4, column5, column6 
from (select a.column1, a.column2, a.start_time, 
      rank() OVER (PARTITION BY a.column2, a.column1 order by a.start_time DESC 
         ) as rank, 
      a.column3, a.column4, a.column5, a.column6 
     from (table2 s INNER JOIN 
      table3 t 
      ON s.column2 = t.column2 and s.event_time > t.start_time 
      ) a 
    ) b 
where rank = 1 

Otherwise, you will need to share the execution plan and more information about table2 and table3 to get further help.
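The equivalence this answer relies on can be checked on a toy data set: because WHERE is applied before window functions at the same query level, a last_value() computed after the rank = 1 filter sees exactly one row per partition, so it returns the same value as the original nested version. The sketch below uses SQLite through Python's sqlite3 module as a self-contained stand-in (illustrative names, not the asker's schema); SQLite 3.25+ follows the same window-function semantics as PostgreSQL here.

```python
import sqlite3

# Compare the original shape (last_value computed in the inner subquery)
# with the answer's restructured shape (last_value computed after the
# rank = 1 filter). Both should return the latest payload per group.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (grp TEXT, start_time INTEGER, payload TEXT);
INSERT INTO events VALUES
  ('x', 1, 'first'), ('x', 3, 'latest'),
  ('y', 2, 'only');
""")

# Original shape: window functions in the subquery, filter outside.
nested = con.execute("""
SELECT grp, last_payload
FROM (
  SELECT grp,
         rank() OVER (PARTITION BY grp ORDER BY start_time DESC) AS rnk,
         last_value(payload) OVER (
           PARTITION BY grp ORDER BY start_time ASC
           RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
         ) AS last_payload
  FROM events
) b
WHERE rnk = 1 ORDER BY grp
""").fetchall()

# Restructured shape: rank in the subquery, last_value in the outer query.
# WHERE rnk = 1 runs before the outer window function, so each partition
# it sees contains exactly one row.
restructured = con.execute("""
SELECT grp,
       last_value(payload) OVER (
         PARTITION BY grp ORDER BY start_time ASC
         RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS last_payload
FROM (
  SELECT grp, start_time, payload,
         rank() OVER (PARTITION BY grp ORDER BY start_time DESC) AS rnk
  FROM events
) b
WHERE rnk = 1 ORDER BY grp
""").fetchall()

print(nested == restructured)  # True: both return the latest payload per group
```

Whether this restructuring actually speeds up the real 56 GB query depends on the planner; the potential win is that the expensive full-partition window frame is evaluated over far fewer rows after the filter.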


Thanks for your help. I am timing the updated query, but I ran into a small issue when I used last_value(a.column3): the error given was ERROR: missing FROM-clause entry for table "a". I replaced it with last_value(column3); does that still work the same way? – user7980 2012-07-11 23:30:05