2011-03-31 91 views
5

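Matching words in SQL Server

I have a requirement to suggest matches between data in two database tables. The basic requirement is: a "match" should be suggested for the pair with the highest number of matching words (regardless of order) between the two columns in question.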

For example, given the data:

Table A                           Table B
1,'What other text in here'       5,'Other text in here'
2,'What am I doing here'          6,'I am doing what here'
3,'I need to find another job'    7,'Purple unicorns'
4,'Other text in here'            8,'What are you doing in here'

Ideally, my desired matches would look as follows:
1 -> 8 (3 words matched) 
2 -> 6 (5 words matched) 
3 -> Nothing 
4 -> 5 (4 words matched) 

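I found word count functions, which look promising, but I can't work out how to use them in a SQL statement that gives me the matches I want. Also, the linked function isn't quite what I need, because it uses charindex, which I believe searches for a word within a word (i.e. 'in' would match 'bin').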
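To make the charindex concern concrete, here is a minimal sketch. It assumes words are separated by single spaces and carry no punctuation; the variable names `@sentence` and `@word` are made up for illustration, and the `DECLARE ... =` syntax needs SQL Server 2008 or later.

```sql
-- CHARINDEX finds substrings, so 'in' is reported inside 'bin'.
SELECT CHARINDEX('in', 'the bin is full') AS SubstringHit;  -- 6: the 'in' inside 'bin'

-- Hypothetical variables, for illustration only.
DECLARE @sentence VARCHAR(100) = 'the bin is full';
DECLARE @word     VARCHAR(20)  = 'in';

-- Padding both the sentence and the search word with spaces restricts
-- the match to whole, space-delimited words.
SELECT CHARINDEX(' ' + @word + ' ', ' ' + @sentence + ' ') AS WholeWordHit;  -- 0: no standalone 'in'
```

The padding trick stops working as soon as punctuation or repeated spaces appear, which is one reason to split the sentences into words first.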

Can anyone help me with this?

Thanks.

+0

3 matches 6. Both contain the word "I". And 1 has a better match than 8 (with 5); they share 4 words. – 2011-03-31 15:08:44

+0

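You're right, but I forgot to mention that there should be no duplicates among the matches. Once a given row has been matched, it can't be matched again. You're also right about 5 versus 8, but as I just commented on your answer, ideally the percentage of overall words matched should be taken into account. – 2011-03-31 15:17:48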

Answers

5

I use sys.dm_fts_parser below to split the sentences into words. There are plenty of TSQL split functions around if you are not on SQL Server 2008 or find this unsuitable for some reason.
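As a rough illustration of the splitting step on its own, the call below parses one of the sample sentences. This is a sketch based on my reading of the documentation: the arguments are the query string, the language LCID (1033 is English), the stoplist id (0 for the system stoplist, NULL for none) and accent sensitivity (0 for insensitive). The function needs full-text search installed and, as far as I know, sysadmin membership.

```sql
-- Split a single sentence into terms; wrapping the text in double quotes
-- makes the parser treat it as one phrase.
SELECT display_term,   -- the word as parsed
       occurrence,     -- position of the word within the phrase
       special_term    -- e.g. whether the word is an exact match or a noise word
FROM sys.dm_fts_parser('"What other text in here"', 1033, 0, 0);
```

This should return one row per word, which is what lets the words of two sentences be joined on display_term.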

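Given the requirement that each A.id can only be paired with a B.id that has not been used before, and vice versa, I couldn't think of an efficient set-based solution.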

-- Sample data from the question
;WITH A(Id, sentence) AS
(
    SELECT 1, 'What other text in here'    UNION ALL
    SELECT 2, 'What am I doing here'       UNION ALL
    SELECT 3, 'I need to find another job' UNION ALL
    SELECT 4, 'Other text in here'
),
B(Id, sentence) AS
(
    SELECT 5, 'Other text in here'         UNION ALL
    SELECT 6, 'I am doing what here'       UNION ALL
    SELECT 7, 'Purple unicorns'            UNION ALL
    SELECT 8, 'What are you doing in here'
),
-- One row per word in each A sentence, plus that sentence's word count
A_Split AS
(
    SELECT Id AS A_Id,
           display_term,
           COUNT(*) OVER (PARTITION BY Id) AS A_Cnt
    FROM A
         CROSS APPLY sys.dm_fts_parser('"' + REPLACE(sentence, '"', '""') + '"', 1033, 0, 0)
),
-- Same split for the B sentences
B_Split AS
(
    SELECT Id AS B_Id,
           display_term,
           COUNT(*) OVER (PARTITION BY Id) AS B_Cnt
    FROM B
         CROSS APPLY sys.dm_fts_parser('"' + REPLACE(sentence, '"', '""') + '"', 1033, 0, 0)
),
-- Every (A, B) pair sharing at least one word, with the shared-word count
-- and that count as a fraction of each sentence's length
Joined AS
(
    SELECT A_Id,
           B_Id,
           B_Cnt,
           Cnt = COUNT(*),
           CAST(COUNT(*) AS FLOAT) / B_Cnt AS PctMatchBToA,
           CAST(COUNT(*) AS FLOAT) / A_Cnt AS PctMatchAToB
    FROM A_Split A
         JOIN B_Split B
           ON A.display_term = B.display_term
    GROUP BY A_Id,
             B_Id,
             B_Cnt,
             A_Cnt
)
-- Store the candidate pairs, numbered from best to worst
SELECT IDENTITY(int, 1, 1) AS id, *
INTO #IntermediateResults
FROM Joined
ORDER BY PctMatchBToA DESC,
         PctMatchAToB DESC

DECLARE @A_Id INT,
        @B_Id INT,
        @Cnt  INT

DECLARE @Results TABLE (
    A_Id INT,
    B_Id INT,
    Cnt  INT)

-- Pick the best remaining candidate pair
SELECT TOP(1) @A_Id = A_Id,
              @B_Id = B_Id,
              @Cnt  = Cnt
FROM #IntermediateResults
ORDER BY id

-- @@ROWCOUNT reflects the SELECT TOP(1) above (and the one at the end of
-- each iteration), so the loop ends when no candidates are left
WHILE (@@ROWCOUNT > 0)
    BEGIN

        INSERT INTO @Results
        SELECT @A_Id,
               @B_Id,
               @Cnt

        -- Neither side of a chosen pair may be matched again
        DELETE FROM #IntermediateResults
        WHERE A_Id = @A_Id
           OR B_Id = @B_Id

        SELECT TOP(1) @A_Id = A_Id,
                      @B_Id = B_Id,
                      @Cnt  = Cnt
        FROM #IntermediateResults
        ORDER BY id
    END

DROP TABLE #IntermediateResults

SELECT *
FROM @Results
ORDER BY A_Id

Returns

A_Id        B_Id        Cnt
----------- ----------- -----------
1           8           3
2           6           5
4           5           4
+0

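Wow! I thought I knew something about SQL, but you have just shown me there is a lot I don't know :) This is certainly very useful. One thing I forgot to mention is that there should be no duplicates in the matches. Really, the match with the highest percentage of matching words should take priority. That is why in my example I have 4 matched to 5, since the text is identical (100% match), which leaves 1 to match with 8, as that is its next best match. I really like your answer though. It's great food for thought. +1 for you, if I had any reputation. – 2011-03-31 15:13:33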

+0

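@穆尔: Don't you? :) @Martin: Although it has since been stated that a row can only match once, I still think your solution is useful. Even if you don't intend to rework it, it's a good start. – 2011-03-31 16:05:29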

+0

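@安德瑞 - Thanks. I might rework it. It's easy enough to do ROW_NUMBER() OVER (PARTITION BY A_Id ...) to get the TOP 1 per A and ROW_NUMBER() OVER (PARTITION BY B_Id ...) to get the TOP 1 per B, but a good way of combining the two escapes me at the moment. – 2011-03-31 16:23:27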
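For what it's worth, one possible way to combine the two ROW_NUMBER() rankings is to keep only the pairs that rank first on both sides. This is only a sketch, not the answerer's method: the `Pairs` rows are the shared-word counts worked out by hand from the sample data, and the names `Pairs`, `Ranked`, `RankForA` and `RankForB` are invented for the example.

```sql
-- Hand-computed candidate pairs from the sample data:
-- shared-word count plus the word count of each sentence.
;WITH Pairs (A_Id, B_Id, Cnt, A_Cnt, B_Cnt) AS
(
    SELECT *
    FROM (VALUES (1, 5, 4, 5, 4), (1, 6, 2, 5, 5), (1, 8, 3, 5, 6),
                 (2, 5, 1, 5, 4), (2, 6, 5, 5, 5), (2, 8, 3, 5, 6),
                 (3, 6, 1, 6, 5),
                 (4, 5, 4, 4, 4), (4, 6, 1, 4, 5), (4, 8, 2, 4, 6)
         ) v (A_Id, B_Id, Cnt, A_Cnt, B_Cnt)
),
Ranked AS
(
    SELECT A_Id,
           B_Id,
           Cnt,
           -- Best B for each A (ids added as tie-breakers)
           ROW_NUMBER() OVER (PARTITION BY A_Id
                              ORDER BY Cnt * 1.0 / B_Cnt DESC,
                                       Cnt * 1.0 / A_Cnt DESC,
                                       B_Id) AS RankForA,
           -- Best A for each B
           ROW_NUMBER() OVER (PARTITION BY B_Id
                              ORDER BY Cnt * 1.0 / B_Cnt DESC,
                                       Cnt * 1.0 / A_Cnt DESC,
                                       A_Id) AS RankForB
    FROM Pairs
)
-- Keep only pairs that are each other's first choice
SELECT A_Id, B_Id, Cnt
FROM Ranked
WHERE RankForA = 1
  AND RankForB = 1
ORDER BY A_Id;
```

On the sample data this should return only 2 -> 6 and 4 -> 5. Row 1 is left out because its first choice (5) prefers 4, so the remaining rows would need another pass with the matched ids removed. That is essentially what the loop in the answer does.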