Why does Postgres not use the index on a simple GROUP BY?

I have created a 36M-row table with an index on the type column:

CREATE TABLE items AS 
    SELECT 
    (random()*36000000)::integer AS id, 
    (random()*10000)::integer AS type, 
    md5(random()::text) AS s 
    FROM 
    generate_series(1,36000000); 
CREATE INDEX items_type_idx ON items USING btree ("type");

I run this simple query and expect PostgreSQL to use my index:

explain select count(*) from "items" group by "type";

But the query planner decides to use a sequential scan instead:

HashAggregate (cost=734592.00..734627.90 rows=3590 width=12) (actual time=6477.913..6478.344 rows=3601 loops=1) 
    Group Key: type 
    -> Seq Scan on items (cost=0.00..554593.00 rows=35999800 width=4) (actual time=0.044..1820.522 rows=36000000 loops=1) 
Planning time: 0.107 ms 
Execution time: 6478.525 ms

Time without explain: 5s 979ms

I tried several solutions from here and here:

Running VACUUM ANALYZE
Configuring default_statistics_target, random_page_cost, work_mem

but nothing helps apart from setting enable_seqscan = OFF:

SET enable_seqscan = OFF; 
explain select count(*) from "items" group by "type"; 

GroupAggregate (cost=0.56..1114880.46 rows=3590 width=12) (actual time=5.637..5256.406 rows=3601 loops=1) 
    Group Key: type 
    -> Index Only Scan using items_type_idx on items (cost=0.56..934845.56 rows=35999800 width=4) (actual time=0.074..2783.896 rows=36000000 loops=1) 
     Heap Fetches: 0 
Planning time: 0.103 ms 
Execution time: 5256.667 ms

Time without explain: 659ms

The query with the index scan is about 10 times faster on my machine.

Is there a better solution than setting enable_seqscan?
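
If the setting does have to be changed, one way to keep it from affecting other queries is to scope it to a single transaction with SET LOCAL. This is only a sketch of that scoping technique, not a recommendation from the thread; it reuses the table and query from the question.

```sql
-- Sketch: limit enable_seqscan = off to one transaction.
-- SET LOCAL reverts automatically at COMMIT or ROLLBACK.
BEGIN;
SET LOCAL enable_seqscan = off;
SELECT count(*) FROM "items" GROUP BY "type";
COMMIT;

-- Outside the transaction the session default applies again.
SHOW enable_seqscan;
```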

UPD1

My PostgreSQL version is 9.6.3, work_mem = 4MB (tried 64MB), random_page_cost = 4 (tried 1.1), max_parallel_workers_per_gather = 0 (tried 4).

UPD2

I tried filling the type column not with random numbers but with i/10000, so that pg_stats.correlation = 1. It is still a seq scan.
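
The correlation mentioned in UPD2 can be read from the pg_stats view after the table has been analyzed. A minimal sketch, assuming the table and column names from the question:

```sql
-- Refresh planner statistics, then inspect what the planner knows
-- about the "type" column.
ANALYZE items;

SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'items'
  AND attname = 'type';
```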

UPD3

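@jgh is 100% right:

This typically only happens when the table's row width is much wider than some indexes'

I made a large, wider column data, and now Postgres uses the index. Thanks, everyone!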
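
The update does not show how the table was widened. A hypothetical reproduction might look like the following; the table name items_wide, the index name, and the 320-character filler are assumptions made for illustration, not the asker's actual statements.

```sql
-- Hypothetical sketch of UPD3: same data as in the question, plus a
-- wide filler column so that rows are much wider than index entries.
CREATE TABLE items_wide AS
    SELECT
    (random()*36000000)::integer AS id,
    (random()*10000)::integer AS type,
    md5(random()::text) AS s,
    repeat(md5(random()::text), 10) AS data  -- 320-char filler (assumed width)
    FROM
    generate_series(1,36000000);
CREATE INDEX items_wide_type_idx ON items_wide USING btree ("type");

-- VACUUM sets the visibility map that index-only scans rely on;
-- ANALYZE refreshes the statistics used for costing.
VACUUM ANALYZE items_wide;

EXPLAIN SELECT count(*) FROM items_wide GROUP BY "type";
```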

Source

2017-07-06 Denis Drozdov

What is your PostgreSQL version? Also, please provide the output of 'EXPLAIN ANALYZE'.

http://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=0c5c410657513d1bda7f2e21a4d36eb9 - just compare the actual timings for 'enable_seqscan = ON' and 'enable_seqscan = OFF' – Abelisto

What are your settings for work_mem and random_page_cost? [And: why does the table have no primary key?] – wildplasser

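The Index-only scans wiki says

It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.

and

Index-only scans are only used when the planner surmises that that will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This all heavily depends on visibility of tuples, if an index would be used anyway (i.e. how selective a predicate is, etc), and if there is actually an index available that could be used by an index-only scan in principle

Thus, your index is not considered "significantly smaller", and the entire dataset has to be read, which leads the planner to use a seq scan.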
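
Whether the index counts as "significantly smaller" can be checked by comparing on-disk sizes. A small sketch, assuming the table and index names from the question:

```sql
-- Compare the heap size with the index size. The closer the two are,
-- the less I/O an index-only scan saves over a sequential scan.
SELECT
    pg_size_pretty(pg_relation_size('items'))          AS table_size,
    pg_size_pretty(pg_relation_size('items_type_idx')) AS index_size;
```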

Source

2017-07-06 18:52:19 JGH
