Optimize groupwise maximum query

select * 
from records 
where id in (select max(id) from records group by option_id)

This query works fine even on millions of rows. However, as you can see from the result of the explain statement:

           QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------------------- 
Nested Loop (cost=30218.84..31781.62 rows=620158 width=44) (actual time=1439.251..1443.458 rows=1057 loops=1) 
-> HashAggregate (cost=30218.41..30220.41 rows=200 width=4) (actual time=1439.203..1439.503 rows=1057 loops=1) 
    -> HashAggregate (cost=30196.72..30206.36 rows=964 width=8) (actual time=1438.523..1438.807 rows=1057 loops=1) 
      -> Seq Scan on records records_1 (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.103..527.914 rows=1240315 loops=1) 
-> Index Scan using records_pkey on records (cost=0.43..7.80 rows=1 width=44) (actual time=0.002..0.003 rows=1 loops=1057) 
    Index Cond: (id = (max(records_1.id))) 
Total runtime: 1443.752 ms

(cost=0.00..23995.15 rows=1240315 width=8) <- This says it is scanning all rows, which is obviously inefficient.

I also tried reordering the query:

select r.* from records r 
inner join (select max(id) id from records group by option_id) r2 on r2.id= r.id; 

               QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------- 

Nested Loop (cost=30197.15..37741.04 rows=964 width=44) (actual time=835.519..840.452 rows=1057 loops=1) 
-> HashAggregate (cost=30196.72..30206.36 rows=964 width=8) (actual time=835.471..835.836 rows=1057 loops=1) 
    -> Seq Scan on records (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.336..348.495 rows=1240315 loops=1) 
-> Index Scan using records_pkey on records r (cost=0.43..7.80 rows=1 width=44) (actual time=0.003..0.003 rows=1 loops=1057) 
    Index Cond: (id = (max(records.id))) 
Total runtime: 840.809 ms

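(cost=0.00..23995.15 rows=1240315 width=8) <- Still scanning all rows.

I tried with and without an index on (option_id), (option_id, id), and (option_id, id desc); none of them had any effect on the query plan.

Is there a way to execute a groupwise maximum query in Postgres without scanning all rows?

What I am looking for, programmatically, is an index that stores the maximum id for each option_id as rows are inserted into the records table. That way, when I query for the maximums per option_id, I should only need to scan as many index entries as there are distinct option_ids.

I've seen select distinct on answers all over SO from high-ranking users (thanks to @Clodoaldo Neto for giving me keywords to search for). Here's why it doesn't work: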

create index index_name on records(option_id, id desc);

select distinct on (option_id) * 
from records 
order by option_id, id desc 
               QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------------------------------------ 
Unique (cost=0.43..76053.10 rows=964 width=44) (actual time=0.049..1668.545 rows=1056 loops=1) 
    -> Index Scan using records_option_id_id_idx on records (cost=0.43..73337.25 rows=1086342 width=44) (actual time=0.046..1368.300 rows=1086342 loops=1) 
Total runtime: 1668.817 ms

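That's great, it's using an index. However, using an index to scan all ids doesn't really make much sense. According to my runs, it is in fact slower than a simple sequential scan.

Interestingly, MySQL 5.5 is able to optimize the query using just an index on records(option_id, id):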

mysql> select count(1) from records; 

+----------+ 
| count(1) | 
+----------+ 
| 1086342 | 
+----------+ 

1 row in set (0.00 sec) 

mysql> explain extended select * from records 
     inner join (select max(id) max_id from records group by option_id) mr 
                 on mr.max_id= records.id; 

+------+----------+--------------------------+ 
| rows | filtered | Extra     | 
+------+----------+--------------------------+ 
| 1056 | 100.00 |       | 
| 1 | 100.00 |       | 
| 201 | 100.00 | Using index for group-by | 
+------+----------+--------------------------+ 

3 rows in set, 1 warning (0.02 sec)

Source

2014-06-16 nurettin

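"However using an index to scan all rows doesn't really make much sense" --- it does. Indexes are smaller than the whole dataset, so there is a better chance that they are in cache. It doesn't scan the actual rows, though, only the index. – zerkms

What's the plan for the *original* query with the index created? – zerkms

@zerkms Indexing option_id made no difference (as I stated in the question). Indexing option_id_id_desc or option_id_id also makes no difference in the query plan. – nurettin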

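Assuming relatively few rows in options for many rows in records.

Typically, you would have a look-up table options that is referenced from records.option_id, ideally with a foreign key constraint. If you don't, I suggest creating one to enforce referential integrity: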

CREATE TABLE options (
    option_id int PRIMARY KEY 
, option text UNIQUE NOT NULL 
); 

INSERT INTO options 
SELECT DISTINCT option_id, 'option' || option_id -- dummy option names 
FROM records;
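The statements above create and fill the look-up table but stop short of the foreign key constraint mentioned earlier. A minimal sketch of that last step, assuming the table and column names used in this answer (the constraint name records_option_id_fkey is only an illustrative choice):

```sql
-- Assumes options was just populated from records, so every
-- records.option_id already has a matching row in options.
-- The constraint name is hypothetical; pick any name you like.
ALTER TABLE records
    ADD CONSTRAINT records_option_id_fkey
    FOREIGN KEY (option_id) REFERENCES options (option_id);
```

Adding the constraint validates the existing rows, so on a large records table it can take a while.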

Then there is no need to emulate a loose index scan any more, and this becomes very simple and fast. Correlated subqueries can use a plain index on (option_id, id).

SELECT option_id 
     ,(SELECT max(id) 
     FROM records 
     WHERE option_id = o.option_id 
     ) AS max_id 
FROM options o 
ORDER BY 1;

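This includes options with no match in table records. You get NULL for max_id in that case, and you can easily remove such rows in an outer SELECT if needed.

Or (same result):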

SELECT option_id 
    , (SELECT id 
     FROM records 
     WHERE option_id = o.option_id 
     ORDER BY id DESC NULLS LAST 
     LIMIT 1  -- keep only the top row; a scalar subquery must not return more than one
     ) AS max_id 
FROM options o 
ORDER BY 1;

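May be slightly faster. The subquery uses the sort order DESC NULLS LAST, which matches the aggregate function max(), since max() ignores NULL values. Sorting with just DESC would put NULL first: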

Why do NULL values come first when ordering DESC in a PostgreSQL query?

So, the perfect index for this:

CREATE INDEX on records (option_id, id DESC NULLS LAST);

This doesn't matter much while the columns are defined NOT NULL.

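There can still be a sequential scan on the small table options; that is simply the fastest way to fetch all rows. The ORDER BY may bring in an index (only) scan to fetch pre-sorted rows.
The big table records is only accessed via (bitmap) index scan or, if possible, index-only scan.

SQL Fiddle showing two index-only scans for the simple case.

Or use LATERAL joins for a similar effect in Postgres 9.3+: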

Optimize GROUP BY query to retrieve latest record per user
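As a rough sketch of what such a LATERAL query could look like here (assuming the options look-up table from above and Postgres 9.3 or later; this is an illustration, not the query from the linked answer):

```sql
-- For each option, fetch the complete row with the greatest id.
-- CROSS JOIN LATERAL drops options that have no rows in records.
SELECT r.*
FROM   options o
CROSS  JOIN LATERAL (
   SELECT *
   FROM   records
   WHERE  option_id = o.option_id
   ORDER  BY id DESC NULLS LAST
   LIMIT  1
   ) r
ORDER  BY r.option_id;
```

Unlike the correlated subqueries, this returns whole rows from records, which matches the select * in the question. Use LEFT JOIN LATERAL ... ON true instead if options without records should be kept.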

Source

2014-06-24 02:16:35

select distinct on (option_id) * 
from records 
order by option_id, id desc

The index will only be used if the cardinality is favorable. That said, you can try a composite index:

create index index_name on records(option_id, id desc)

Source

2014-06-16 12:57:15

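You mention wanting an index that only indexes the max(id) for each option_id. PostgreSQL does not currently support this. If such a feature is added in the future, it would probably be done by making a materialized view on the aggregate query and then indexing the materialized view. I wouldn't expect that for at least a couple of years, though.

What you can do now, though, is use a recursive query to make it skip through the index to each unique value of option_id. See the PostgreSQL wiki page for a general description of the technique.

For your case, write the recursive query so that it returns the distinct values of option_id, and then for each of those subselect the max(id):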

with recursive dist as (
    select min(option_id) as option_id from records 
union all 
    select (select min(option_id) from records where option_id > dist.option_id) 
    from dist where dist.option_id is not null 
) 

select option_id, 
    (select max(id) from records where records.option_id=dist.option_id) 
from dist where option_id is not null;

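It is ugly, but you can hide it behind a view.

In my hands this runs in 43 ms, rather than 513 ms for the distinct on variety.

It could probably be made about twice as fast if you could find a way to incorporate the max(id) into the recursive query, but I couldn't find a way to do that. The problem is that these queries have a rather restrictive syntax: you cannot use "limit" or "order by" in conjunction with the UNION ALL.

This query touches pages widely scattered throughout the index, and if those pages don't fit in the cache, you will be doing a lot of inefficient IO. However, if this type of query is popular, the 1057 leaf index pages will have little trouble staying in cache.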
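One possible way around the syntax restriction, offered as a sketch rather than as part of this answer: wrapping a UNION ALL leg in parentheses lets it carry its own ORDER BY and LIMIT, and a LATERAL subquery (Postgres 9.3+) can do the same in the recursive leg. This assumes an index on (option_id, id) and that both columns are NOT NULL; it has not been timed against the test case here.

```sql
WITH RECURSIVE cte AS (
   (  -- parentheses so ORDER BY / LIMIT apply to this leg only
   SELECT option_id, id
   FROM   records
   ORDER  BY option_id DESC, id DESC
   LIMIT  1
   )
   UNION ALL
   SELECT r.option_id, r.id
   FROM   cte c
   CROSS  JOIN LATERAL (
      -- jump to the last index entry of the next smaller option_id
      SELECT option_id, id
      FROM   records
      WHERE  option_id < c.option_id
      ORDER  BY option_id DESC, id DESC
      LIMIT  1
      ) r
   )
SELECT option_id, id AS max_id
FROM   cte
ORDER  BY option_id;
```

Each step reads the greatest id for one option_id directly, so the separate max(id) subselect is no longer needed. The recursion stops once no smaller option_id exists.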

This is how I set up my test case:

create table records as select floor(random()*1057)::integer as option_id, floor(random()*50000000)::integer as id from generate_series(1,1240315); 
create index on records (option_id ,id); 
explain analyze;
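The suggestion above to hide the recursive query behind a view could look roughly like this. The view name records_max_id and the column alias max_id are illustrative assumptions, not something from the original setup:

```sql
-- Hypothetical view wrapping the recursive query shown earlier.
CREATE VIEW records_max_id AS
WITH RECURSIVE dist AS (
    SELECT min(option_id) AS option_id FROM records
  UNION ALL
    SELECT (SELECT min(option_id) FROM records WHERE option_id > dist.option_id)
    FROM dist WHERE dist.option_id IS NOT NULL
)
SELECT option_id,
       (SELECT max(id) FROM records WHERE records.option_id = dist.option_id) AS max_id
FROM dist
WHERE option_id IS NOT NULL;

-- Fetch the full row with the greatest id per option_id.
-- Joining on both columns stays correct even when id is not unique,
-- as in the randomly generated test table.
SELECT r.*
FROM   records_max_id m
JOIN   records r ON r.option_id = m.option_id AND r.id = m.max_id;
```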

Source

2014-06-23 19:33:30 jjanes

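PostgreSQL does not support the loose index scan that MySQL is able to use for queries like this. That is the Using index for group-by you see in the MySQL plan.

Basically, it returns the first or last entry in a range matching a subset of a composite key, then searches for the next or previous value of this subset.

In your case, it first returns the last value of the whole index on (option_id, id) (which by definition holds the MAX(id) for the greatest option_id), then searches for the last value with the next-largest option_id, and so on.

PostgreSQL's optimizer is not able to build such a plan; however, PostgreSQL lets you emulate it in SQL. If you have lots of records but few distinct option_id values, it's worth doing.

To do this, first create the index: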

CREATE INDEX ix_records_option_id ON records (option_id, id);

Then run this query:

WITH RECURSIVE q (option_id) AS 
     (
     SELECT MIN(option_id) 
     FROM records 
     UNION ALL 
     SELECT (
       SELECT MIN(option_id) 
       FROM records 
       WHERE option_id > q.option_id 
       ) 
     FROM q 
     WHERE option_id IS NOT NULL 
     ) 
SELECT option_id, 
     (
     SELECT MAX(id) 
     FROM records r 
     WHERE r.option_id = q.option_id 
     ) 
FROM q 
WHERE option_id IS NOT NULL

See it on sqlfiddle.com: http://sqlfiddle.com/#!15/4d77d/4

Source

2014-06-23 20:17:51 Quassnoi
