2013-10-24 60 views
1

有一个非常大的表(超过200万行)
SID INT,wordID的INT(PK SID,wordID的)找到确切的FK匹配

要查找的SID的具有完全相同的wordID的(没有额外)
对于SID超过100的wordID完全匹配的几率下降,从而愿意将其限制在100
(但想进入1000)

如果这是学校和SID是类和wordID的我们是学生。
然后,我想找到具有完全相同学生的课程。

SID,的wordID
1,1
1,2
1,3
2,2
2,3
3,1
3,4
5,1
5 ,2
6,2
6,3
7,1
7,2
8,1
8,1

SID 6和2具有完全相同的的wordID的
SID 5,图7和8有完全相同的的wordID的

这是我迄今为止
我想,以消除两个删除#temp3_sID1_sID2和照顾,在插入件之上
但我会尝试任何想法
它不喜欢你可以很容易地创建一个200个百万行的表

测试
drop table #temp_sID_wordCount 
    drop table #temp_count_wordID_sID 
    drop table #temp3_wordID_sID_forThatCount 
    drop table #temp3_sID1_sID2 
    drop table #temp3_sID1_sID2_keep 
    create table #temp_sID_wordCount (sID int primary key, ccount int not null) 
    create table #temp_count_wordID_sID (ccount int not null, wordID int not null, sID int not null, primary key (ccount, wordID, sID)) 
    create table #temp3_wordID_sID_forThatCount (wordID int not null, sID int not null, primary key(wordID, sID)) 
    create table #temp3_sID1_sID2_keep (sID1 int not null, sID2 int not null, primary key(sID1, sID2)) 
    create table #temp3_sID1_sID2 (sID1 int not null, sID2 int not null, primary key(sID1, sID2)) 
    insert into #temp_sID_wordCount 
    select sID, count(*) as ccount 
    FROM [FTSindexWordOnce] with (nolock) 
    group by sID 
    order by sID; 
    select count(*) from #temp_sID_wordCount where ccount <= 100; -- 701,966 
    truncate table #temp_count_wordID_sID 
    insert into #temp_count_wordID_sID 
    select #temp_sID_wordCount.ccount, [FTSindexWordOnce].wordID, [FTSindexWordOnce].sID 
    from #temp_sID_wordCount 
    join [FTSindexWordOnce] with (nolock) 
     on [FTSindexWordOnce].sID = #temp_sID_wordCount.sID 
    and ccount >= 1 and ccount <= 10 
    order by #temp_sID_wordCount.ccount, [FTSindexWordOnce].wordID, [FTSindexWordOnce].sID; 
    select count(*) from #temp_sID_wordCount; -- 34,860,090 

    truncate table #temp3_sID1_sID2_keep 
    declare cur cursor for 
    select top 10 ccount from #temp_count_wordID_sID group by ccount order by ccount 

    open cur 
    declare @count int, @sIDcur int 
    fetch next from cur into @count 
    while (@@FETCH_STATUS = 0) 
    begin 
     --print (@count) 
     --select count(*), @count from #temp_sID_wordCount where #temp_sID_wordCount.ccount = @count 
     truncate table #temp3_wordID_sID_forThatCount 
     truncate table #temp3_sID1_sID2 

     -- wordID and sID for that unique word count 
     -- they can only be exact if they have the same word count 
     insert into #temp3_wordID_sID_forThatCount 
     select  #temp_count_wordID_sID.wordID 
       , #temp_count_wordID_sID.sID 
     from #temp_count_wordID_sID 
     where #temp_count_wordID_sID.ccount = @count 
     order by #temp_count_wordID_sID.wordID, #temp_count_wordID_sID.sID 

     -- select count(*) from #temp3_wordID_sID_forThatCount 

     -- this has some duplicates 
     -- sID1 is the group 
     insert into #temp3_sID1_sID2 
     select w1.sID, w2.sID 
     from #temp3_wordID_sID_forThatCount as w1 with (nolock) 
     join #temp3_wordID_sID_forThatCount as w2 with (nolock) 
      on w1.wordID = w2.wordID 
     and w1.sID <= w2.sID   
     group by w1.sID, w2.sID 
     having count(*) = @count 
     order by w1.sID, w2.sID 

     -- get rid of the goups of 1  
     delete #temp3_sID1_sID2 
     where sID1 in (select sID1 from #temp3_sID1_sID2 group by sID1 having count(*) = 1) 

     -- get rid of the double dips   
     delete #temp3_sID1_sID2 
     where #temp3_sID1_sID2.sID1 in 
       (select distinct s1del.sID1 -- these are the double dips 
       from #temp3_sID1_sID2 as s1base with (nolock) 
       join #temp3_sID1_sID2 as s1del with (nolock) 
        on s1del.sID1 > s1base.sID1 
       and s1Del.sID1 = s1base.sID2) 

     insert into #temp3_sID1_sID2_keep  
     select #temp3_sID1_sID2.sID1 
      , #temp3_sID1_sID2.sID2 
     from #temp3_sID1_sID2 with (nolock) 
     order by #temp3_sID1_sID2.sID1, #temp3_sID1_sID2.sID2 

    fetch next from cur into @count 
    end 
    close cur 
    deallocate cur 

select * 
FROM #temp3_sID1_sID2_keep with (nolock) 
order by 1,2 

回答

1

所以,正如我所看到的,任务是找到相等的子集。

首先,我们可以找到对等的子集:

;with tmp1 as (select sID, cnt = count(wordID) from [Table] group by sID) 
select s1.sID, s2.sID 
from tmp1 s1 
    cross join tmp1 s2 
    cross apply (
     select count(1) 
     from [Table] d1 
      join [Table] d2 on d2.wordID = d1.wordID 
     where d1.sID = s1.sID and d2.sID = s2.sID 
    ) c(cnt) 
where s1.cnt = s2.cnt 
    and s1.sID > s2.sID 
    and s1.cnt = c.cnt 

输出是:

sID  sID 
----------- ----------- 
6   2 
7   5 
8   5 
8   7 

然后对可以合并成组,如果必要的话:

sID   gNum 
----------- ----------- 
2   1 
6   1 
5   2 
7   2 
8   2 

见下面的SqlFiddle示例中的详细信息。

SqlFiddle Sample


另一种方法是计算哈希函数为每个子集数据:

;with a as (
    select distinct sID from [Table] 
) 
select sID, 
    hashbytes('sha1', (
     select cast(wordID as varchar(10)) + '|' 
     from [Table] 
     where sID = a.sID 
     order by wordID 
     for xml path(''))) 
from a 

然后子集可以基于哈希值进行分组。

SqlFiddle Sample

最后一个花了不到一分钟,我的机器上的约1000万行(20K SID值多达一千wordID的每个)的测试数据。您也可以通过排除没有任何其他wordID计数匹配的sID来优化它。

+0

它适用于示例数据,但在大型表格上它没有在7小时内完成,所以我无法验证。 – Paparazzi

+0

@Blam,增加了其他的方法,请看到更新的答案。 –

+0

令人敬畏的方法,但我有一个轻微的问题。我需要去最大字数600或我得到字符串或二进制数据将被截断错误。如果我拿出哈希字节,我仍然得到错误,所以它是在XML中。但是在600的极限时间里,它运行时间不到两分钟。 – Paparazzi