2015-08-27

This question is related to this one. I have a table containing device power readings, and I need to compute the power consumption over a given time range and return the 10 most power-hungry devices. I have generated 192 devices and 7,742,208 measurement records (40,324 per device), which is roughly how many records the devices would produce in one month. How can I optimize this SQL query that uses a window function?

With this amount of data my current query takes more than 40 seconds to execute, which is a problem because the time span and the number of devices and measurements could be much higher. Should I try a different approach instead of lag() OVER PARTITION BY, and what other optimizations could be made? I would really appreciate suggestions with code examples.

PostgreSQL version: 9.4

Query with example values:

SELECT
    t.device_id,
    sum(len_y * extract(epoch from len_x)) AS total_consumption
FROM (
    SELECT
        m.id,
        m.device_id,
        m.power_total,
        m.created_at,
        m.power_total + lag(m.power_total) OVER (
            PARTITION BY device_id
            ORDER BY m.created_at
        ) AS len_y,
        m.created_at - lag(m.created_at) OVER (
            PARTITION BY device_id
            ORDER BY m.created_at
        ) AS len_x
    FROM
        measurements AS m
    WHERE m.created_at BETWEEN '2015-07-30 13:05:24.403552+00'::timestamp
        AND '2015-08-27 12:34:59.826837+00'::timestamp
) AS t
GROUP BY t.device_id
ORDER BY total_consumption DESC
LIMIT 10;

Table information:

   Column    |           Type           |                        Modifiers
-------------+--------------------------+-----------------------------------------------------------
 id          | integer                  | not null default nextval('measurements_id_seq'::regclass)
 created_at  | timestamp with time zone | default timezone('utc'::text, now())
 power_total | real                     |
 device_id   | integer                  | not null
Indexes:
    "measurements_pkey" PRIMARY KEY, btree (id)
    "measurements_device_id_idx" btree (device_id)
    "measurements_created_at_idx" btree (created_at)
Foreign-key constraints:
    "measurements_device_id_fkey" FOREIGN KEY (device_id) REFERENCES devices(id)

Query plan:

Limit  (cost=1317403.25..1317403.27 rows=10 width=24) (actual time=41077.091..41077.094 rows=10 loops=1)
  ->  Sort  (cost=1317403.25..1317403.73 rows=192 width=24) (actual time=41077.089..41077.092 rows=10 loops=1)
        Sort Key: (sum(((m.power_total + lag(m.power_total) OVER (?)) * date_part('epoch'::text, (m.created_at - lag(m.created_at) OVER (?))))))
        Sort Method: top-N heapsort  Memory: 25kB
        ->  GroupAggregate  (cost=1041700.67..1317399.10 rows=192 width=24) (actual time=25361.013..41076.562 rows=192 loops=1)
              Group Key: m.device_id
              ->  WindowAgg  (cost=1041700.67..1201314.44 rows=5804137 width=20) (actual time=25291.797..37839.727 rows=7742208 loops=1)
                    ->  Sort  (cost=1041700.67..1056211.02 rows=5804137 width=20) (actual time=25291.746..30699.993 rows=7742208 loops=1)
                          Sort Key: m.device_id, m.created_at
                          Sort Method: external merge  Disk: 257344kB
                          ->  Seq Scan on measurements m  (cost=0.00..151582.05 rows=5804137 width=20) (actual time=0.333..5112.851 rows=7742208 loops=1)
                                Filter: ((created_at >= '2015-07-30 13:05:24.403552'::timestamp without time zone) AND (created_at <= '2015-08-27 12:34:59.826837'::timestamp without time zone))

Planning time: 0.351 ms
Execution time: 41114.883 ms

Query to generate the test table and data:

CREATE TABLE measurements (
    id          serial primary key,
    device_id   integer,
    power_total real,
    created_at  timestamp
);

INSERT INTO measurements (
    device_id,
    created_at,
    power_total
)
SELECT
    device_id,
    now() + (i * interval '1 minute'),
    random() * (50 - 1) + 1
FROM (
    SELECT
        DISTINCT(device_id),
        generate_series(0, 10) AS i
    FROM (
        SELECT generate_series(1, 5) AS device_id
    ) AS dev_ids
) AS gen_table;

How about a composite index on (device_id, created_at)? BTW, IMHO you should divide 'm.power_total + lag(m.power_total)' by two before using it (or just take the average). – joop
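The averaging joop suggests is the trapezoidal rule: the mean of two consecutive readings times the elapsed interval. A minimal sketch of the corrected inner expression, using a named window for brevity (column alias is illustrative):

```sql
-- Trapezoidal rule: average the two consecutive power readings,
-- then multiply by the elapsed seconds (the /2 was missing above)
SELECT
    m.device_id,
    (m.power_total + lag(m.power_total) OVER w) / 2
        * extract(epoch from m.created_at - lag(m.created_at) OVER w)
        AS interval_consumption
FROM measurements AS m
WINDOW w AS (PARTITION BY device_id ORDER BY m.created_at);
```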


+1 The best question I've seen in a long time. Well written, with a proper sample. I created the sample database in a second. Now what values should I put into the 'series' calls to generate a db similar to your current size? –
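To approximate the sizes stated in the question (192 devices × 40,324 measurements ≈ 7.7M rows), the series bounds in the setup script can be raised; a sketch using a cross join in place of the DISTINCT subquery:

```sql
-- 192 devices x 40324 readings each = 7,742,208 rows, one reading per minute
INSERT INTO measurements (device_id, created_at, power_total)
SELECT
    device_id,
    now() + (i * interval '1 minute'),
    random() * (50 - 1) + 1
FROM generate_series(1, 192)   AS device_id,
     generate_series(0, 40323) AS i;
```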


Your where condition doesn't remove any rows. Is that intended? The sort is also done on disk: 'external merge Disk: 257344kB', which takes quite a while (your execution plan lost its indentation, so it's a bit hard to read). If you increase 'work_mem' for your session until the sort is done in memory, you should see much better performance. –
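Both of these suggestions can be sketched as follows (the index name and the exact work_mem value are illustrative; size work_mem to your hardware):

```sql
-- Composite index matching the PARTITION BY device_id ORDER BY created_at sort
CREATE INDEX measurements_device_id_created_at_idx
    ON measurements (device_id, created_at);

-- Let the ~257 MB sort from the plan above happen in memory for this session
SET work_mem = '512MB';
```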

Answers


I would try to move part of the calculation to the row-insert phase.

Add a new column:

alter table measurements add consumption real; 

Backfill the column:

with m1 as (
    select
        id, power_total, created_at,
        lag(power_total) over (partition by device_id order by created_at) prev_power_total,
        lag(created_at) over (partition by device_id order by created_at) prev_created_at
    from measurements
)
update measurements m2
set consumption =
    (m1.power_total + m1.prev_power_total) *
    extract(epoch from m1.created_at - m1.prev_created_at)
from m1
where m2.id = m1.id;

Create a trigger:

create or replace function before_insert_on_measurements()
returns trigger language plpgsql
as $$
declare
    rec record;
begin
    select power_total, created_at into rec
    from measurements
    where device_id = new.device_id
    order by created_at desc
    limit 1;
    new.consumption :=
        (new.power_total + rec.power_total) *
        extract(epoch from new.created_at - rec.created_at);
    return new;
end $$;

create trigger before_insert_on_measurements
before insert on measurements
for each row execute procedure before_insert_on_measurements();
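One detail worth making explicit: for the first measurement of a device the SELECT finds no previous row, so rec's fields are NULL and consumption ends up NULL, which matches lag(). A variant of the trigger body with an explicit guard (cosmetic; the behavior is the same):

```sql
-- Inside the trigger body, after the SELECT ... INTO rec:
if not found then
    new.consumption := null;  -- no previous reading for this device
else
    new.consumption :=
        (new.power_total + rec.power_total) *
        extract(epoch from new.created_at - rec.created_at);
end if;
return new;
```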

Query:

select device_id, sum(consumption) total_consumption 
from measurements 
-- where conditions 
group by 1 
order by 1 

Thanks! With this approach I was able to get the execution time down to 9 seconds. By the way, it should sort by the second column. =) –
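As this comment notes, the ordering should use the consumption column; a corrected sketch matching the original query's top-10 shape:

```sql
SELECT device_id, sum(consumption) AS total_consumption
FROM measurements
-- WHERE created_at BETWEEN ... AND ...
GROUP BY device_id
ORDER BY total_consumption DESC
LIMIT 10;
```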


I think your problem is something else.

I created sample data with 8M rows (200 devices, 40,000 measurements each), and the response is very fast (2 seconds).

Postgres 9.3 - iCore 5 / 3.2 GHz / 8 GB / SATA HDD / Windows 7.
I haven't created the indexes, though (that part is missing from your setup script).



Are you sure the 'where' condition isn't leaving all 8 million rows in place? Because that's what happens in the original query. If I run it on the 8-million-row sample, it takes about 12 seconds (still faster than the original time) –


@a_horse_with_no_name I just copied the select from the OP's question. Will check again. –


@a_horse_with_no_name why do you say it doesn't filter any rows? The inner select with 'WHERE m.created_at BETWEEN '2015-07-30 13:05:24.403552+00'::timestamp AND '2015-08-27 12:34:59.826837+00'::timestamp' returns 5800 rows; without the 'where' it returns 8M records and takes 300 seconds. –