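Optimizing a PostgreSQL inner join on multiple columns (specifically a self-join)

I am running PostgreSQL 9.6.2 and have a table with 7 columns and about 2,900,000 rows. The table is temporary; it is part of a subject deduplication process that assigns a new ID (s_id_new) to identical subjects according to different rule sets. In total I run about 10-12 inner joins, each one similar but with a slightly different data subset, different WHERE conditions, or different join columns.
At the moment the query is so inefficient that it never finishes (I had to cancel it after 2 hours).
For optimization purposes I created a subset of the data (50,000 rows).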
\d subject_subset;
     Column     |          Type          | Modifiers
----------------+------------------------+-----------
 s_id           | text                   |
 surname_clean  | character varying(20)  |
 name_clean     | character varying(20)  |
 fullname_clean | character varying(100) |
 id1            | character varying(20)  |
 id2            | character varying(20)  |
 id3            | character varying(20)  |
 s_id_new       | character varying(20)  |
Indexes:
    "subject_subset_s_id_new_idx" btree (s_id_new)
The query I want to optimize:
select s_id_new, max(I_s_id) as s_id_deduplicated
from (select a.*, b.s_id_new as I_s_id
from public.subject_subset a
inner join public.subject_subset b on a.surname_clean=b.surname_clean
and a.id2=b.id2
where
a.id1 is null
and a.id2 is not null
and a.surname_clean is not null) h
group by s_id_new;
The result of the EXPLAIN ANALYZE:
https://explain.depesz.com/s/7knH
"GroupAggregate (cost=5616.65..5620.39 rows=142 width=90) (actual time=32542.127..46938.858 rows=2889 loops=1)"
" Group Key: a.s_id_new"
" -> Sort (cost=5616.65..5617.42 rows=310 width=116) (actual time=32542.116..43194.626 rows=18356220 loops=1)"
" Sort Key: a.s_id_new"
" Sort Method: external merge Disk: 531760kB"
" -> Hash Join (cost=1114.72..5603.82 rows=310 width=116) (actual time=13.159..4892.011 rows=18356220 loops=1)"
" Hash Cond: (((b.surname_clean)::text = (a.surname_clean)::text) AND ((b.id2)::text = (a.id2)::text))"
" -> Seq Scan on subject_subset b (cost=0.00..1111.00 rows=50000 width=174) (actual time=0.011..10.775 rows=50000 loops=1)"
" -> Hash (cost=1111.00..1111.00 rows=248 width=174) (actual time=13.137..13.137 rows=15044 loops=1)"
" Buckets: 16384 (originally 1024) Batches: 1 (originally 1) Memory Usage: 1151kB"
" -> Seq Scan on subject_subset a (cost=0.00..1111.00 rows=248 width=174) (actual time=0.005..9.330 rows=15044 loops=1)"
" Filter: ((id1 IS NULL) AND (id2 IS NOT NULL) AND (surname_clean IS NOT NULL))"
" Rows Removed by Filter: 34956"
"Planning time: 0.236 ms"
"Execution time: 47013.839 ms"
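As far as I can tell, the subquery is what causes the problem: the whole join result is sorted, and that sort spills a huge amount of data to disk. I cannot figure out how to optimize it.
The only thing that improved performance slightly was assigning new integer IDs with dense_rank, but that is not enough.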
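The plan above estimates 310 rows for the hash join but actually produces 18,356,220, so the expensive sort is fed by a join that fans out heavily on (surname_clean, id2). One way to see where the fan-out comes from without materializing the join is to count rows per join key on both sides and multiply the counts. The following is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the demo rows and the helper names make_demo_conn and join_fanout are made up for illustration. Only the table and column names come from the question.

```python
import sqlite3

# Hypothetical demo rows: (s_id, surname_clean, id1, id2, s_id_new).
DEMO_ROWS = [
    ("1", "smith", None, "X", "10"),
    ("2", "smith", None, "X", "20"),
    ("3", "smith", "A", "X", "30"),
    ("4", "jones", None, "Y", "40"),
    ("5", "jones", None, None, "50"),
]

# Per join key: rows on the filtered side (a), rows on the unfiltered
# side (b), and their product, which is the number of joined rows
# that key contributes to the self-join.
FANOUT_SQL = """
SELECT a.surname_clean, a.id2, a.n AS rows_a, b.n AS rows_b,
       a.n * b.n AS joined_rows
FROM (SELECT surname_clean, id2, count(*) AS n
      FROM subject_subset
      WHERE id1 IS NULL AND id2 IS NOT NULL AND surname_clean IS NOT NULL
      GROUP BY surname_clean, id2) a
JOIN (SELECT surname_clean, id2, count(*) AS n
      FROM subject_subset
      GROUP BY surname_clean, id2) b
  ON a.surname_clean = b.surname_clean AND a.id2 = b.id2
ORDER BY joined_rows DESC
"""


def make_demo_conn():
    """Create an in-memory table with the columns the query touches."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subject_subset "
        "(s_id TEXT, surname_clean TEXT, id1 TEXT, id2 TEXT, s_id_new TEXT)"
    )
    conn.executemany(
        "INSERT INTO subject_subset VALUES (?, ?, ?, ?, ?)", DEMO_ROWS
    )
    return conn


def join_fanout(conn):
    """Return one row per join key, largest contribution first."""
    return conn.execute(FANOUT_SQL).fetchall()


conn = make_demo_conn()
fanout = join_fanout(conn)
for row in fanout:
    print(row)
print("total joined rows:", sum(r[4] for r in fanout))
```

Run against the real table, the statement in FANOUT_SQL should work in PostgreSQL without changes, and the keys at the top of its output are the ones responsible for most of the 18 million joined rows.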
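It would help if you explained in words what this particular query is trying to accomplish. Otherwise we have to guess the task from the query. – 2017-06-02 12:49:20
The query is meant to deduplicate subjects (companies and natural persons) by assigning them the same ID. Two John Smiths with the same document ID have different IDs (s_id) in the database -> the code assigns them one new ID, equal to the maximum of the s_id values they currently have. Sometimes auxiliary data is used for deduplication (address, phone, etc.), but the idea stays the same. – Dominix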
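Since the rule only needs the maximum ID per (surname_clean, id2) group, one possible rewrite is to compute that maximum once per key and join against the aggregated result. Each filtered row then matches at most one row, so the join cannot produce more rows than the filtered side contains (15,044 on the subset). This is a sketch rather than a verified fix: it runs on SQLite through Python's sqlite3 module with made-up demo rows, the question's query is adapted only by dropping the public. schema prefix, and whether PostgreSQL 9.6 actually picks a faster plan would have to be confirmed with EXPLAIN ANALYZE on the real data.

```python
import sqlite3

# Hypothetical demo rows: (s_id, surname_clean, id1, id2, s_id_new).
DEMO_ROWS = [
    ("1", "smith", None, "X", "10"),
    ("2", "smith", None, "X", "20"),
    ("3", "smith", "A", "X", "30"),
    ("4", "jones", None, "Y", "40"),
    ("5", "jones", None, None, "50"),
]

# The query from the question (schema prefix removed for SQLite).
ORIGINAL_SQL = """
SELECT s_id_new, max(I_s_id) AS s_id_deduplicated
FROM (SELECT a.*, b.s_id_new AS I_s_id
      FROM subject_subset a
      INNER JOIN subject_subset b
        ON a.surname_clean = b.surname_clean AND a.id2 = b.id2
      WHERE a.id1 IS NULL
        AND a.id2 IS NOT NULL
        AND a.surname_clean IS NOT NULL) h
GROUP BY s_id_new
"""

# Assumed rewrite: aggregate the b side per join key first, then join.
PREAGGREGATED_SQL = """
SELECT a.s_id_new, max(m.max_s_id_new) AS s_id_deduplicated
FROM subject_subset a
INNER JOIN (SELECT surname_clean, id2, max(s_id_new) AS max_s_id_new
            FROM subject_subset
            WHERE surname_clean IS NOT NULL AND id2 IS NOT NULL
            GROUP BY surname_clean, id2) m
  ON a.surname_clean = m.surname_clean AND a.id2 = m.id2
WHERE a.id1 IS NULL
  AND a.id2 IS NOT NULL
  AND a.surname_clean IS NOT NULL
GROUP BY a.s_id_new
"""


def make_demo_conn():
    """Create an in-memory table with the columns the query touches."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subject_subset "
        "(s_id TEXT, surname_clean TEXT, id1 TEXT, id2 TEXT, s_id_new TEXT)"
    )
    conn.executemany(
        "INSERT INTO subject_subset VALUES (?, ?, ?, ?, ?)", DEMO_ROWS
    )
    return conn


conn = make_demo_conn()
original = sorted(conn.execute(ORIGINAL_SQL).fetchall())
rewritten = sorted(conn.execute(PREAGGREGATED_SQL).fetchall())
print(original)
print(rewritten)
```

Both statements return the same pairs on the demo rows because the maximum over all joined rows equals the maximum of the per-key maxima. An index on (surname_clean, id2) might help the aggregation further, but that is untested here.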