我有一个表示有向图一套Django的ORM模型,我试图找回所有的相邻顶点给定的顶点忽略的边缘方向:查询检索邻居太慢
class Vertex(models.Model):
pass
class Edge(models.Model):
orig = models.ForeignKey(Vertex, related_name='%(class)s_orig', null=True, blank=True)
dest = models.ForeignKey(Vertex, related_name='%(class)s_dest', null=True, blank=True)
# ... other data about this edge ...
的查询Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct()
返回正确的结果,但在我的情况下,执行需要很长的时间。
通常,对于我的应用程序,在任何给定时间都会有大约50-100个顶点,并且大约有一百万条边。即使它减少到只有20个顶点和100000层的边缘,该查询需要大约一分半钟的执行:
for i in range(20):
Vertex().save()
vxs = list(Vertex.objects.all())
for i in tqdm.tqdm(range(100000)):
Edge(orig = random.sample(vxs,1)[0], dest = random.sample(vxs,1)[0]).save()
v = vxs[0]
def f1():
return list(Vertex.objects.filter(
Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct())
t1 = timeit.Timer(f1)
print(t1.timeit(number=1)) # 84.21138522100227
在另一方面,如果我起来将查询分为两个部分,我可以得到完全相同的导致只有毫秒屈指可数:
def f2():
q1 = Vertex.objects.filter(edge_orig__dest=v).distinct()
q2 = Vertex.objects.filter(edge_dest__orig=v).distinct()
return list({x for x in itertools.chain(q1, q2)})
t2 = timeit.Timer(f2)
print(t2.timeit(number=100)/100) # 0.0109818680600074
这第二个版本虽然有问题:
- 这不是原子。边界列表几乎可以保证在我的应用程序中的两个查询之间发生变化,这意味着结果不会代表单个时间点。
- 我无法对结果执行额外的处理和聚合,无需手动循环。 (例如,如果我想要
Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct().aggregate(avg=Avg('some_field'))
)
为什么第二个版本比第一个版本跑得快得多? 我该如何做到这一点,有没有办法让第一个跑得足够快,以供实际使用?
我目前正在使用Python 3.5.2,PostgreSQL 9.5.6和Django 1.11进行测试,但是如果这是我遇到的问题之一。
这里是由每个查询生成的SQL,以及Postgres的的explan:
第一个:
Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v))
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
LEFT OUTER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
LEFT OUTER JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
OR T4."orig_id" = 1061)
HashAggregate (cost=8275151.47..8275151.67 rows=20 width=4)
Group Key: app_vertex.id
-> Hash Left Join (cost=3183.45..8154147.45 rows=48401608 width=4)
Hash Cond: (app_vertex.id = app_edge.orig_id)
Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061))
-> Hash Right Join (cost=1.45..2917.45 rows=100000 width=8)
Hash Cond: (t4.dest_id = app_vertex.id)
-> Seq Scan on app_edge t4 (cost=0.00..1541.00 rows=100000 width=8)
-> Hash (cost=1.20..1.20 rows=20 width=4)
-> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4)
-> Hash (cost=1541.00..1541.00 rows=100000 width=8)
-> Seq Scan on app_edge (cost=0.00..1541.00 rows=100000 width=8)
第二的:
Vertex.objects.filter(edge_orig__dest=v).distinct()
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
INNER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
WHERE "app_edge"."dest_id" = 1061
HashAggregate (cost=1531.42..1531.62 rows=20 width=4)
Group Key: app_vertex.id
-> Hash Join (cost=848.11..1519.04 rows=4950 width=4)
Hash Cond: (app_edge.orig_id = app_vertex.id)
-> Bitmap Heap Scan on app_edge (cost=846.65..1449.53 rows=4950 width=4)
Recheck Cond: (dest_id = 1061)
-> Bitmap Index Scan on app_edge_dest_id_4254b90f (cost=0.00..845.42 rows=4950 width=0)
Index Cond: (dest_id = 1061)
-> Hash (cost=1.20..1.20 rows=20 width=4)
-> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4)
@ khampson的版本也需要一分半钟才能运行,所以它也是不可行的。
Vertex.objects.raw(...)
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
OR T4."orig_id" = 1061);
HashAggregate (cost=8275347.47..8275347.67 rows=20 width=4)
Group Key: app_vertex.id
-> Hash Join (cost=3183.45..8154343.45 rows=48401608 width=4)
Hash Cond: (app_vertex.id = app_edge.orig_id)
Join Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061))
-> Hash Join (cost=1.45..2917.45 rows=100000 width=12)
Hash Cond: (t4.dest_id = app_vertex.id)
-> Seq Scan on app_edge t4 (cost=0.00..1541.00 rows=100000 width=8)
-> Hash (cost=1.20..1.20 rows=20 width=4)
-> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4)
-> Hash (cost=1541.00..1541.00 rows=100000 width=8)
-> Seq Scan on app_edge (cost=0.00..1541.00 rows=100000 width=8)
你能分享两个版本正在生成的SQL吗? – Hamms
@Hamms增加了SQL和Postgres对每个查询的解释。 – AJMansfield
我可能会误解你的模型(实际上,我可能是),但你为什么要过滤'edge_orig__dest = v | edge_orig__dest = v',而不仅仅是'edge_dest = v | edge_origt = v'? – Hamms