2014-06-24

I have a situation where I need to do the following (splitting and comparing names):

Company name:

a. Split the text before and after " - "
b. Generate a report where the text before and after " - " is an exact match
c. Generate a report where the text before and after " - " is a similar match

I can get as far as point b, where I get the names whose first half and second half are identical (e.g. ABC, INC. - ABC, INC.) using the following:

RTRIM(substring(c.companyname,0,charindex('-',c.companyname)))= LTRIM(substring(c.companyname, charindex('-',c.companyname,0)+1, len(c.companyname))) 

However, I can't produce the next report (e.g. ABC - abc, or abc, inc - abc).

Can anyone help?

+1

Please edit your question and provide sample data and the desired results. –

+0

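Name: Walt Disney - Walt Disney; Fun Food - Fun Food; Fun Food, Inc. - Fun Food; Walt Disney - Walt Disney, Inc. I already have a query that pulls rows 1 and 2 only: RTRIM(substring(c.companyname,0,charindex('-',c.companyname))) = LTRIM(substring(c.companyname,charindex('-',c.companyname,0)+1,len(c.companyname))). I want a query that also returns rows 3 and 4, because in those rows the two halves are still the same name, only slightly different (e.g. the presence of a dot (.) or an additional word (Inc.)). – user3769697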

Answers

0

Try this?

DECLARE @CompanyNames TABLE (
    CompanyName VARCHAR(512)); 
INSERT INTO @CompanyNames VALUES ('Walt Disney - Walt Disney'); 
INSERT INTO @CompanyNames VALUES ('Fun Food - Fun Food'); 
INSERT INTO @CompanyNames VALUES ('Fun Food, Inc. - Fun Food'); 
INSERT INTO @CompanyNames VALUES ('Walt Disney - Walt Disney, Inc.'); 

--Split names 
DECLARE @SplitNames TABLE (
    MatchLeft VARCHAR(128), 
    MatchRight VARCHAR(128)); 
INSERT INTO 
    @SplitNames 
SELECT 
    RTRIM(SUBSTRING(CompanyName, 0, CHARINDEX('-', CompanyName))), 
    LTRIM(SUBSTRING(CompanyName, CHARINDEX('-', CompanyName, 0) + 1, LEN(CompanyName))) 
FROM 
    @CompanyNames; 

--Exact matches 
SELECT 
    MatchLeft, 
    MatchRight, 
    CASE WHEN MatchLeft = MatchRight THEN 1 ELSE 0 END AS Exact 
FROM 
    @SplitNames; 

--Inexact matches: strip '.', 'Inc' and ',' from both halves, then compare 
WITH CleansedCompanyNames AS (
    SELECT 
        MatchLeft AS OriginalMatchLeft, 
        MatchRight AS OriginalMatchRight, 
        REPLACE(REPLACE(REPLACE(MatchLeft, '.', ''), 'Inc', ''), ',', '') AS MatchLeft, 
        REPLACE(REPLACE(REPLACE(MatchRight, '.', ''), 'Inc', ''), ',', '') AS MatchRight 
    FROM 
        @SplitNames) 
SELECT 
    OriginalMatchLeft, 
    OriginalMatchRight, 
    MatchLeft, 
    MatchRight, 
    CASE WHEN MatchLeft = MatchRight THEN 1 ELSE 0 END AS Inexact 
FROM 
    CleansedCompanyNames; 

--Using SOUNDEX 
SELECT 
    MatchLeft, 
    MatchRight, 
    CASE WHEN DIFFERENCE(MatchLeft, MatchRight) >= 3 THEN 1 ELSE 0 END AS Score 
FROM 
    @SplitNames; 

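There are two ways to handle inexact matches:

  • either strip punctuation and unwanted words before matching (but this requires building a list of what to replace); or
  • use SOUNDEX to test how similar the strings are.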
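
For the first option, one way to avoid hard-coding nested REPLACE calls is to keep the unwanted tokens in a table and strip them in a loop. This is only a sketch, assuming SQL Server 2008 or later; the @Noise table, its tokens, and the CleanLeft/CleanRight columns are made up for illustration. Note that REPLACE works on substrings, so a token like 'Inc' would also be removed from a longer word such as 'Incorporated'.

```sql
-- Hypothetical list of tokens to strip before comparing (adjust to your data)
DECLARE @Noise TABLE (Id INT IDENTITY(1, 1) PRIMARY KEY, Token VARCHAR(32));
INSERT INTO @Noise (Token) VALUES ('.'), (','), ('Inc'), ('Ltd');

-- Sample halves, already split on '-'
DECLARE @Names TABLE (
    MatchLeft  VARCHAR(128),
    MatchRight VARCHAR(128),
    CleanLeft  VARCHAR(128),
    CleanRight VARCHAR(128));
INSERT INTO @Names (MatchLeft, MatchRight)
VALUES ('Fun Food, Inc.', 'Fun Food'), ('Walt Disney', 'Walt Disney, Inc.');

-- Start from the original text
UPDATE @Names SET CleanLeft = MatchLeft, CleanRight = MatchRight;

-- Remove each token in turn from both halves
DECLARE @i INT = 1, @max INT, @token VARCHAR(32);
SELECT @max = MAX(Id) FROM @Noise;
WHILE @i <= @max
BEGIN
    SELECT @token = Token FROM @Noise WHERE Id = @i;
    UPDATE @Names
    SET CleanLeft  = REPLACE(CleanLeft,  @token, ''),
        CleanRight = REPLACE(CleanRight, @token, '');
    SET @i += 1;
END;

-- Compare the cleansed halves, ignoring leftover leading/trailing spaces
SELECT
    MatchLeft,
    MatchRight,
    CASE WHEN LTRIM(RTRIM(CleanLeft)) = LTRIM(RTRIM(CleanRight)) THEN 1 ELSE 0 END AS Inexact
FROM
    @Names;
```

Keeping the tokens in a table means new suffixes can be added without touching the query, at the cost of one UPDATE pass per token.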

Or, to use your original example, you could apply SOUNDEX like this:

SELECT ... 
WHERE 
DIFFERENCE(RTRIM(substring(c.companyname,0,charindex('-',c.companyname))), LTRIM(substring(c.companyname, charindex('-',c.companyname,0)+1, len(c.companyname)))) >= 3 

And with your latest example:

DECLARE @Company TABLE (
    companyname VARCHAR(500)); 
INSERT INTO @Company VALUES ('Allen Limited - Allen Corporation'); 
INSERT INTO @Company VALUES ('Sweden Corp. - Sweden Corp.'); 
INSERT INTO @Company VALUES ('Alaska Limited - Alaska Limited, Inc.'); 
INSERT INTO @Company VALUES ('New York Inc. - New York Steel Limited'); 
INSERT INTO @Company VALUES ('India Plc - India Plc.'); 
INSERT INTO @Company VALUES ('Dubai International - Dubai International'); 
INSERT INTO @Company VALUES ('Nigera Falls Pvt. Ltd. - Amazing Nigeria Falls'); 
SELECT 
    c.companyname, 
    DIFFERENCE(RTRIM(SUBSTRING(c.companyname, 0, CHARINDEX('-', c.companyname))), LTRIM(SUBSTRING(c.companyname, CHARINDEX('-', c.companyname, 0) + 1, LEN(c.companyname)))) AS Similarity 
FROM 
    @Company c; 

With the results:

companyname                                     Similarity
Allen Limited - Allen Corporation               4
Sweden Corp. - Sweden Corp.                     4
Alaska Limited - Alaska Limited, Inc.           4
New York Inc. - New York Steel Limited          4
India Plc - India Plc.                          4
Dubai International - Dubai International       4
Nigera Falls Pvt. Ltd. - Amazing Nigeria Falls  1

So it doesn't work so well for your last example, but it seems fine for the others?
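
To see why a row scores the way it does, it can help to look at the SOUNDEX codes themselves, since DIFFERENCE returns a value from 0 to 4 based on how closely the two codes agree. The query below is a minimal sketch along those lines; the CROSS APPLY and the LeftName/RightName/LeftCode/RightCode aliases are my own additions. As far as I recall, SQL Server's SOUNDEX only encodes the leading run of letters, which would mean multi-word names are effectively compared on their first word; that is an assumption worth checking against the codes on your own server.

```sql
DECLARE @Company TABLE (
    companyname VARCHAR(500));
INSERT INTO @Company VALUES ('Allen Limited - Allen Corporation');
INSERT INTO @Company VALUES ('Nigera Falls Pvt. Ltd. - Amazing Nigeria Falls');

SELECT
    s.LeftName,
    s.RightName,
    SOUNDEX(s.LeftName)  AS LeftCode,   -- four-character code for the left half
    SOUNDEX(s.RightName) AS RightCode,  -- four-character code for the right half
    DIFFERENCE(s.LeftName, s.RightName) AS Similarity
FROM
    @Company c
CROSS APPLY (
    -- Split once here so the expressions are not repeated in every column
    SELECT
        RTRIM(SUBSTRING(c.companyname, 0, CHARINDEX('-', c.companyname))) AS LeftName,
        LTRIM(SUBSTRING(c.companyname, CHARINDEX('-', c.companyname) + 1, LEN(c.companyname))) AS RightName
) s;
```

If the two codes start with different letters, as they do when one half has an extra leading word, the score drops sharply even though the names look alike to a human.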

+0

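It works, but I only cited a few examples in the query above. There are a lot of records I want to check, and they may differ not only because of the word 'Inc'; it could be anything (e.g. variants along the lines of Magic Land - Magic Land, Fun Food - Fun Food). – user3769697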

+0

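Which is why I provided two examples. SOUNDEX DIFFERENCE will evaluate how similar the two strings are, so it should be worth a try? Personally, I prefer a more robust distance algorithm (e.g. Jaro-Winkler), but that is a bit harder to implement because it isn't built into the engine. –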

+0

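Yes, this works with the sample names I gave, but how do I use it if the names have to be selected from a column called "Name"? So far we have used INSERT INTO @CompanyNames VALUES ('Fun Food - Fun Food'); and I don't understand how to apply it to a column. – user3769697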
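
A minimal sketch of how the same expressions might be pointed at a real column instead of the table variable. The table name dbo.Company is an assumption (only the column name "Name" comes from the comment above), and the MatchLeft/MatchRight aliases follow the earlier example; substitute your own names.

```sql
-- dbo.Company is a hypothetical table name; replace it with your own table
SELECT
    c.Name,
    s.MatchLeft,
    s.MatchRight,
    DIFFERENCE(s.MatchLeft, s.MatchRight) AS Similarity
FROM
    dbo.Company AS c
CROSS APPLY (
    -- Same split as before, reading from the column rather than inserted values
    SELECT
        RTRIM(SUBSTRING(c.Name, 0, CHARINDEX('-', c.Name))) AS MatchLeft,
        LTRIM(SUBSTRING(c.Name, CHARINDEX('-', c.Name) + 1, LEN(c.Name))) AS MatchRight
) AS s
WHERE
    CHARINDEX('-', c.Name) > 0                      -- only rows that contain a separator
    AND s.MatchLeft <> s.MatchRight                 -- leave exact matches to the other report
    AND DIFFERENCE(s.MatchLeft, s.MatchRight) >= 3; -- keep only the similar ones
```

The table variables in the answer only stand in for real data, so nothing needs to be inserted; the FROM clause simply reads from the existing table.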