2014-10-29

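Using JOIN instead of HAVING (COUNT > n) to improve performance

I have a users table and a "Facebook friends" relationship table. Given a (known) list of users, I want to quickly find all users who have 2 or more Facebook friends within that group.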

(This basically boils down to one question: can I rewrite a GROUP BY/HAVING query using JOINs?)

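Here is a simplified version of my working schema. I use VARCHAR here to make the user names in my sample data (below) easier to read; in real life the relevant columns are INT: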

-- Simplified Schema 
CREATE TABLE _users (
    user_name VARCHAR NOT NULL PRIMARY KEY, 
    fb_id  VARCHAR NULL UNIQUE 
); 
CREATE TABLE _fb_friends (
    id   SERIAL PRIMARY KEY, 
    user_name VARCHAR NULL REFERENCES _users(user_name), 
    friend_fb_id VARCHAR NULL REFERENCES _users(fb_id), 
    UNIQUE (user_name, friend_fb_id) 
); 

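Note that there is no (accessible) index on friend_fb_id.

Also note that the _fb_friends table is gigantic, several orders of magnitude larger than the _users table, which makes the obvious GROUP BY/HAVING solution prohibitively slow. I.e., this is not feasible: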

-- Using GROUP BY/HAVING: Obvious solution, but way too slow. 
-- Does a SEQ SCAN on the gigantic table 
SELECT me.* 
FROM 
    _users me 
    LEFT OUTER JOIN _fb_friends ff ON (
     ff.user_name = me.user_name 
    ) 
    LEFT OUTER JOIN _users friend ON (
     friend.fb_id = ff.friend_fb_id 
    ) 
GROUP BY me.user_name 
HAVING COUNT(friend.user_name) >= 2; 

I rewrote this using JOINs, but I'm not sure whether the solution I came up with is valid or optimal:

-- Using JOINs: Way faster, but is it correct? Better way? 
SELECT DISTINCT me.* 
FROM (
    _users me 
    LEFT OUTER JOIN _fb_friends ff1 ON (
     ff1.user_name = me.user_name 
    ) 
    LEFT OUTER JOIN _fb_friends ff2 ON (
     ff2.user_name = me.user_name 
     AND ff2.friend_fb_id <> ff1.friend_fb_id 
    ) 
    LEFT OUTER JOIN _users friend ON (
     friend.fb_id = ff1.friend_fb_id 
    ) 
    LEFT OUTER JOIN _users friend_2 ON (
     friend_2.fb_id = ff2.friend_fb_id 
    ) 
) 
WHERE (
    friend.user_name IS NOT NULL 
    AND friend_2.user_name IS NOT NULL 
); 

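For what it's worth, I wrote a simple test case, and it seems to work. But I'm really not sure whether it is correct, or whether I'm going about this the best way. Both strategies return the same users: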

BEGIN; 

CREATE TABLE _users (
    user_name VARCHAR NOT NULL PRIMARY KEY, 
    fb_id  VARCHAR NULL UNIQUE 
); 
CREATE TABLE _fb_friends (
    id   SERIAL PRIMARY KEY, 
    user_name VARCHAR NULL REFERENCES _users(user_name), 
    friend_fb_id VARCHAR NULL REFERENCES _users(fb_id) 
); 
INSERT INTO _users (user_name, fb_id) VALUES 
    ('Bob', 'bob'), 
    ('Joe', 'joe'), 
    ('Will', 'will'), 
    ('Marcus', 'marcus'), 
    ('Mitch', 'mitch'), 
    ('Rick', 'rick'); 
INSERT INTO _fb_friends (user_name, friend_fb_id) VALUES 
    ('Bob', 'joe'), 
    ('Will', 'marcus'), 
    ('Joe', 'bob'), 
    ('Joe', 'marcus'), 
    ('Joe', 'mitch'), 
    ('Marcus', 'will'), 
    ('Marcus', 'joe'), 
    ('Mitch', 'joe'); 

SELECT 'GROUP BY/HAVING' AS Strategy, me.* 
FROM 
    _users me 
    LEFT OUTER JOIN _fb_friends ff ON (
     ff.user_name = me.user_name 
    ) 
    LEFT OUTER JOIN _users friend ON (
     friend.fb_id = ff.friend_fb_id 
    ) 
GROUP BY me.user_name 
HAVING COUNT(friend.user_name) >= 2; 

SELECT DISTINCT 'JOIN' AS Strategy, me.* 
FROM (
    _users me 
    LEFT OUTER JOIN _fb_friends ff1 ON (
     ff1.user_name = me.user_name 
    ) 
    LEFT OUTER JOIN _fb_friends ff2 ON (
     ff2.user_name = me.user_name 
     AND ff2.friend_fb_id <> ff1.friend_fb_id 
    ) 
    LEFT OUTER JOIN _users friend ON (
     friend.fb_id = ff1.friend_fb_id 
    ) 
    LEFT OUTER JOIN _users friend_2 ON (
     friend_2.fb_id = ff2.friend_fb_id 
    ) 
) 
WHERE (
    friend.user_name IS NOT NULL 
    AND friend_2.user_name IS NOT NULL 
); 

DROP TABLE _fb_friends; 
DROP TABLE _users; 

COMMIT; 
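
One way to check mechanically that the two strategies agree, rather than comparing the two result sets by eye, is to take their symmetric difference. The following is a minimal sketch and not part of the original test: it assumes the sample tables above exist (so it would have to run before the DROP TABLE statements), and the CTE names by_group and by_join are made up for illustration. Because the HAVING and WHERE conditions discard users without matching friends anyway, plain inner joins stand in for the LEFT OUTER JOINs.

```sql
-- Sketch: returns zero rows when both strategies select the same users.
WITH by_group AS (
    -- GROUP BY/HAVING strategy
    SELECT me.user_name
    FROM _users me
    JOIN _fb_friends ff ON ff.user_name = me.user_name
    JOIN _users friend  ON friend.fb_id = ff.friend_fb_id
    GROUP BY me.user_name
    HAVING COUNT(*) >= 2
),
by_join AS (
    -- self-join strategy: two different friend rows for the same user
    SELECT DISTINCT me.user_name
    FROM _users me
    JOIN _fb_friends ff1 ON ff1.user_name = me.user_name
    JOIN _fb_friends ff2 ON ff2.user_name = me.user_name
                        AND ff2.friend_fb_id <> ff1.friend_fb_id
    JOIN _users friend   ON friend.fb_id = ff1.friend_fb_id
    JOIN _users friend_2 ON friend_2.fb_id = ff2.friend_fb_id
)
(SELECT user_name FROM by_group EXCEPT SELECT user_name FROM by_join)
UNION ALL
(SELECT user_name FROM by_join EXCEPT SELECT user_name FROM by_group);
```

Any user_name that comes back was selected by only one of the two strategies. One difference to keep in mind: the test schema omits the UNIQUE (user_name, friend_fb_id) constraint, and with duplicate friend rows the COUNT-based version would count the same friend twice, while the self-join version requires two different friend_fb_id values.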

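So basically, my questions are:

  1. Is my JOIN solution correct?
  2. Is there a better/canonical way to do this?

Indexing friend_fb_id and changing the schema are considered off limits. I need to do the best I can with what I currently have.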

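I didn't impose these constraints; this is just the situation I have to deal with. So there is no "magic" here; the question is whether the query can be rewritten in a more efficient way. I haven't been able to find any examples of this JOIN strategy and was hoping for feedback from other developers. – danonanimal 2014-10-29 21:16:30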

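Without an index it will be a full scan. Full scans are slow. If you want to improve performance, the first step is proper indexing. You can't change the schema? There is no second step until the first step is done. I insist: what you want is "magic". You can't get performance out of thin air (unless you buy more expensive hardware). – zerkms 2014-10-29 22:13:06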

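For the record, the JOIN solution does not perform a SEQ SCAN; if it did, it would perform just as badly as the GROUP BY and I wouldn't be asking this question. On a production database with 100 million rows, the GROUP BY strategy takes 1-30 minutes, while the JOIN takes about 3 seconds. – danonanimal 2014-10-29 23:15:59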

Answers

Can you use a temporary table? If so, try this...

drop table if exists friend_count; 

create temporary table friend_count ( 
    user_name varchar not null primary key, 
    friend_count int not null 
); 

create index on friend_count (friend_count); 

insert into friend_count select 
    user_name, 
    count(*) 
from _fb_friends 
/* add whatever filtering is needed here to count only the friends within
    the smaller group of users */ 
group by user_name; 

select 
    me.user_name, 
    me.fb_id 
from _users me 
join friend_count fc on fc.user_name = me.user_name 
where fc.friend_count >= 2; 
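
The answer leaves the placeholder comment in the INSERT open. As one possible reading, assuming the "smaller group" means friends who are themselves rows in _users (as in the question's own queries), the counting step might look like the sketch below. The join to _users and the IS NOT NULL filter are assumptions, not part of the original answer; the filter is there because _fb_friends.user_name is nullable while friend_count.user_name is declared NOT NULL.

```sql
-- Sketch only: counts, per user, the friends that are also present in _users.
INSERT INTO friend_count
SELECT
    ff.user_name,
    COUNT(*)
FROM _fb_friends ff
JOIN _users friend ON friend.fb_id = ff.friend_fb_id  -- assumed group filter
WHERE ff.user_name IS NOT NULL                        -- target column is NOT NULL
GROUP BY ff.user_name;
```

This would take the place of the INSERT above; the final SELECT stays the same. It still aggregates over every row of _fb_friends once, so it may not avoid the scan the question is trying to get away from.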

I don't have a large enough data set to check, but see whether this executes faster.

select me.* 
from _users me 
where 2=(select count(1) from 
      (select 1 from _fb_friends ff 
      join _users friend on friend.fb_id=ff.friend_fb_id 
      where ff.user_name=me.user_name 
      limit 2) x 
     );
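
One way to see whether it is faster is to look at the plan and the timing. A sketch, assuming PostgreSQL (which the question's SERIAL column and SEQ SCAN wording suggest): prefixing a query with EXPLAIN (ANALYZE, BUFFERS) actually executes it and prints the chosen plan with real row counts, timings and buffer usage.

```sql
-- Runs the query and reports the actual plan, timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT me.*
FROM _users me
WHERE 2 = (
    SELECT count(1)
    FROM (
        SELECT 1
        FROM _fb_friends ff
        JOIN _users friend ON friend.fb_id = ff.friend_fb_id
        WHERE ff.user_name = me.user_name
        LIMIT 2   -- stop after two matching friends per user
    ) x
);
```

The same prefix can be put in front of the GROUP BY/HAVING and JOIN versions from the question to compare them. A Seq Scan node on _fb_friends in the output would indicate the full scan the question wants to avoid.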