Compare multiple numeric columns to determine record similarity

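I have a table with an ID column and several data columns holding integer values from -5 to 5, including 0. I want to compare those numeric columns to determine which records are similar across all of them.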

╔════╦══════╦══════╦══════╦══════╗ 
║ ID ║ COL1 ║ COL2 ║ COL3 ║ COL4 ║ 
╠════╬══════╬══════╬══════╬══════╣ 
║ A ║ -5 ║ -2 ║ 0 ║ -2 ║ 
║ B ║ 0 ║ 1 ║ -1 ║ 3 ║ 
║ C ║ 1 ║ -2 ║ -3 ║ 1 ║ 
║ D ║ -1 ║ -1 ║ 5 ║ 0 ║ 
║ E ║ 2 ║ -3 ║ 1 ║ -2 ║ 
║ F ║ -3 ║ 1 ║ -2 ║ -1 ║ 
║ G ║ -4 ║ -1 ║ -1 ║ -3 ║ 
╚════╩══════╩══════╩══════╩══════╝

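I would like to group the IDs by similarity. For example, IDs A and G above are similar because their values in each column are very close.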

╔════╦══════╦══════╦══════╦══════╗ 
║ ID ║ COL1 ║ COL2 ║ COL3 ║ COL4 ║ 
╠════╬══════╬══════╬══════╬══════╣ 
║ A ║ -5 ║ -2 ║ 0 ║ -2 ║ 
║ G ║ -4 ║ -1 ║ -1 ║ -3 ║ 
╚════╩══════╩══════╩══════╩══════╝

On the other hand, A and B are dissimilar:

╔════╦══════╦══════╦══════╦══════╗ 
║ ID ║ COL1 ║ COL2 ║ COL3 ║ COL4 ║ 
╠════╬══════╬══════╬══════╬══════╣ 
║ A ║ -5 ║ -2 ║ 0 ║ -2 ║ 
║ B ║ 0 ║ 1 ║ -1 ║ 3 ║ 
╚════╩══════╩══════╩══════╩══════╝

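For a given pair of IDs, I am thinking of computing the difference in each column (using the absolute value of the difference between the column values) and then summing those differences to get a similarity score, where a larger number means less similar. That is the best idea I have at the moment, but I am open to more accurate or more efficient approaches.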

Source

2014-10-03 Brian Badge

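Be careful to use the absolute value of the distances, otherwise some differences might cancel each other out, e.g. ((5-0) + (0-5)). By your definition the two rows would be dissimilar, but a naive implementation would mark them as identical. – Sirko 2014-10-03 18:16:03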

Why not sum the absolute differences: 'score = ABS(5-0) + ABS(0-5) + ...' – Rimas 2014-10-03 18:33:03

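The trouble here is what determines "similar". If every column is off by 1, is that similar? What about 2? What if 3 out of 4 are the same and one is off by 2? So the whole ROW and all 4 columns have to be compared... there is too much fuzzy logic here to define "similar". – xQbert 2014-10-03 19:08:25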
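
The cancellation issue raised in the comments above can be checked with a few lines of Python. This is only a sketch: the two-column rows `row_x` and `row_y` are hypothetical values taken from the (5-0) + (0-5) example, not rows from the question's table.

```python
# Hypothetical two-column rows based on the (5-0) + (0-5) example in the comments.
row_x = (5, 0)
row_y = (0, 5)

# Naive score: signed differences cancel each other out.
signed_score = sum(x - y for x, y in zip(row_x, row_y))

# Safer score: absolute differences cannot cancel.
abs_score = sum(abs(x - y) for x, y in zip(row_x, row_y))

print(signed_score)  # 0, which would wrongly suggest identical rows
print(abs_score)     # 10
```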

One approach is as follows:

with all_compared as (
    -- pair every row with every other row (self-join, Cartesian product)
    select a.id as ID,
           b.id as CompID,
           abs(a.col1 - b.col1) + abs(a.col2 - b.col2)
             + abs(a.col3 - b.col3) + abs(a.col4 - b.col4) as TotalDiff
    from stuff a,
         stuff b
    where a.id != b.id
),
ranked_data as (
    -- rank the candidates for each ID, smallest total difference first
    select ID,
           CompID,
           TotalDiff,
           rank() over (partition by ID order by TotalDiff) Rnk
    from all_compared
)
select *
from ranked_data
where rnk = 1;

I have made a SQL Fiddle that shows step by step how I got to this point: http://sqlfiddle.com/#!4/fef06/14

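You will then need to decide how to deal with ties, because this query returns every row that shares rank 1. With the sample data above, B and E each have two equally close matches.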

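This uses a Cartesian product (all rows of one table joined to all rows of another), here as a self-join, so that each row is compared with every other row, and then sums the absolute differences between col1, col2 and so on. We then rank by that total difference and select the top-ranked match, i.e. the one with the smallest total.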
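
To see what this produces without a database, here is a minimal Python sketch of the same logic applied to the sample rows from the question. The names `stuff`, `total_diff` and `closest_matches` are made up for this illustration; only the data comes from the question.

```python
# Sample rows from the question: ID -> (COL1, COL2, COL3, COL4).
stuff = {
    "A": (-5, -2, 0, -2),
    "B": (0, 1, -1, 3),
    "C": (1, -2, -3, 1),
    "D": (-1, -1, 5, 0),
    "E": (2, -3, 1, -2),
    "F": (-3, 1, -2, -1),
    "G": (-4, -1, -1, -3),
}


def total_diff(p, q):
    """Sum of absolute column differences between two rows."""
    return sum(abs(x - y) for x, y in zip(p, q))


def closest_matches(rows, distance):
    """For each ID, return (smallest distance, IDs tied at that distance).

    Keeping every tied ID mirrors what rank() = 1 does in the query.
    """
    result = {}
    for id_, row in rows.items():
        scores = {o: distance(row, rows[o]) for o in rows if o != id_}
        best = min(scores.values())
        result[id_] = (best, sorted(o for o, s in scores.items() if s == best))
    return result


matches = closest_matches(stuff, total_diff)
for id_, (best, others) in matches.items():
    print(id_, best, others)
# A 4 ['G']
# B 8 ['C', 'F']
# C 8 ['B']
# D 11 ['E']
# E 9 ['A', 'C']
# F 6 ['G']
# G 4 ['A']
```

B and E each come back with two matches, which is where the ties come from.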

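Another approach would be to use squared distance rather than absolute difference. This amplifies larger differences, so you need to consider whether that is what you want.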

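For example, a pair of rows that differ by 5 in one column and match in the other would score 25, as (0-5)^2 is 25. That would count as less similar than a pair that differ by 3 in each of two columns, which would score 18 (3^2 + 3^2). With absolute differences the first pair would be treated as more similar (5 against 6), because all differences are given the same weight.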
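
The arithmetic can be checked with a short Python sketch. The two-column rows `base`, `one_big_gap` and `two_medium_gaps` are hypothetical, chosen only so that one pair differs by 5 in a single column and the other by 3 in both columns.

```python
def total_diff(p, q):
    """Sum of absolute column differences."""
    return sum(abs(x - y) for x, y in zip(p, q))


def sq_dist(p, q):
    """Sum of squared column differences."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


# Hypothetical rows, not taken from the question's table.
base = (0, 0)
one_big_gap = (0, 5)       # differs by 5 in one column
two_medium_gaps = (3, 3)   # differs by 3 in both columns

print(total_diff(base, one_big_gap), total_diff(base, two_medium_gaps))  # 5 6
print(sq_dist(base, one_big_gap), sq_dist(base, two_medium_gaps))        # 25 18
```

The two measures order the pairs differently: absolute difference prefers the single large gap, squared distance prefers the two medium gaps.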

The squared-distance version is:

with all_compared as (
    -- pair every row with every other row and sum the squared differences
    select a.id as ID,
           b.id as CompID,
           power(a.col1 - b.col1, 2) +
             power(a.col2 - b.col2, 2) +
             power(a.col3 - b.col3, 2) +
             power(a.col4 - b.col4, 2) as SqDist
    from stuff a,
         stuff b
    where a.id != b.id
),
ranked_data as (
    -- rank the candidates for each ID, smallest squared distance first
    select ID,
           CompID,
           SqDist,
           rank() over (partition by ID order by SqDist) Rnk
    from all_compared
)
select *
from ranked_data
where rnk = 1;


Alternatively, you could use both, with the squared distance used only to break ties:

with all_compared as (
    -- compute both measures for every pair of distinct rows
    select a.id as ID,
           b.id as CompID,
           abs(a.col1 - b.col1) + abs(a.col2 - b.col2)
             + abs(a.col3 - b.col3) + abs(a.col4 - b.col4) as TotalDiff,
           power(a.col1 - b.col1, 2) +
             power(a.col2 - b.col2, 2) +
             power(a.col3 - b.col3, 2) +
             power(a.col4 - b.col4, 2) as SqDist
    from stuff a,
         stuff b
    where a.id != b.id
),
ranked_data as (
    -- order by TotalDiff first; SqDist only decides between equal TotalDiff values
    select ID,
           CompID,
           TotalDiff,
           SqDist,
           rank() over (partition by ID order by TotalDiff, SqDist) Rnk
    from all_compared
)
select *
from ranked_data
where rnk = 1;
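
As a rough cross-check of the tie-breaking, the same ordering can be sketched in Python with a tuple sort key. The helper names are invented for this example, and `min` returns only the first candidate if both measures tie, whereas `rank()` would still return all of them.

```python
# Sample rows from the question: ID -> (COL1, COL2, COL3, COL4).
stuff = {
    "A": (-5, -2, 0, -2),
    "B": (0, 1, -1, 3),
    "C": (1, -2, -3, 1),
    "D": (-1, -1, 5, 0),
    "E": (2, -3, 1, -2),
    "F": (-3, 1, -2, -1),
    "G": (-4, -1, -1, -3),
}


def total_diff(p, q):
    """Sum of absolute column differences."""
    return sum(abs(x - y) for x, y in zip(p, q))


def sq_dist(p, q):
    """Sum of squared column differences."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def best_match(rows, id_):
    """Closest other ID, ordered by (TotalDiff, SqDist) like the rank() above."""
    row = rows[id_]
    others = [o for o in rows if o != id_]
    return min(others, key=lambda o: (total_diff(row, rows[o]), sq_dist(row, rows[o])))


best = {id_: best_match(stuff, id_) for id_ in stuff}
print(best)
# {'A': 'G', 'B': 'C', 'C': 'B', 'D': 'E', 'E': 'C', 'F': 'G', 'G': 'A'}
```

With the sample data, the ties for B (C or F) and E (A or C) are both resolved in favour of C, because C has the smaller squared distance in each case.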


Source

2014-10-03 19:56:58 ChrisProsser
