Simple matching similarity matrix for continuous, non-binary data?

Given the matrix:

test <- structure(list(X1 = c(1L, 2L, 3L, 4L, 2L, 5L), X2 = c(2L, 3L, 
4L, 5L, 3L, 6L), X3 = c(3L, 4L, 4L, 5L, 3L, 2L), X4 = c(2L, 4L, 
6L, 5L, 3L, 8L), X5 = c(1L, 3L, 2L, 4L, 6L, 4L)), .Names = c("X1", 
"X2", "X3", "X4", "X5"), class = "data.frame", row.names = c(NA, 
-6L))

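I want to create a 5 x 5 distance matrix holding, for every pair of columns, the ratio of matching rows to the total number of rows. For example, the distance between X4 and X3 should be 0.5, because the two columns match in 3 out of 6 rows.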
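As a quick check of that target value, the ratio for a single pair of columns can be computed directly. This is only a minimal sketch: the name `test` is taken from the `dist(test, ...)` call in the question, and `match_ratio` is a hypothetical helper name introduced here for illustration.

```r
# Data from the question; `test` is the name used in the dist() call there
test <- data.frame(
  X1 = c(1L, 2L, 3L, 4L, 2L, 5L),
  X2 = c(2L, 3L, 4L, 5L, 3L, 6L),
  X3 = c(3L, 4L, 4L, 5L, 3L, 2L),
  X4 = c(2L, 4L, 6L, 5L, 3L, 8L),
  X5 = c(1L, 3L, 2L, 4L, 6L, 4L)
)

# Hypothetical helper: share of rows in which two columns hold the same value
match_ratio <- function(a, b) sum(a == b) / length(a)

match_ratio(test$X3, test$X4)
# [1] 0.5
```

X3 and X4 agree in rows 2, 4 and 5, which gives 3/6.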

I have tried dist(test, method="simple matching") from the package "proxy", but that method only works for binary data.

2012-05-24 Werner

Using outer (again :-)

my.dist <- function(x) {
    n <- nrow(x)
    d <- outer(seq.int(ncol(x)), seq.int(ncol(x)),
               Vectorize(function(i,j) sum(x[[i]] == x[[j]])/n))
    rownames(d) <- names(x)
    colnames(d) <- names(x)
    return(d)
}

my.dist(x) 
#           X1        X2  X3  X4        X5
# X1 1.0000000 0.0000000 0.0 0.0 0.3333333
# X2 0.0000000 1.0000000 0.5 0.5 0.1666667
# X3 0.0000000 0.5000000 1.0 0.5 0.0000000
# X4 0.0000000 0.5000000 0.5 1.0 0.0000000
# X5 0.3333333 0.1666667 0.0 0.0 1.0000000
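The matrix above holds similarities (1 on the diagonal). If a dissimilarity is wanted instead, for example to feed into clustering, one possible conversion is 1 minus the similarity. The sketch below is an assumption on my part rather than part of the answer: it rebuilds the same ratios with nested `sapply` calls on a data frame named `test` and wraps the result with `as.dist`.

```r
# Data from the question
test <- data.frame(
  X1 = c(1L, 2L, 3L, 4L, 2L, 5L),
  X2 = c(2L, 3L, 4L, 5L, 3L, 6L),
  X3 = c(3L, 4L, 4L, 5L, 3L, 2L),
  X4 = c(2L, 4L, 6L, 5L, 3L, 8L),
  X5 = c(1L, 3L, 2L, 4L, 6L, 4L)
)

# Similarity: share of matching rows for every pair of columns
sim <- sapply(test, function(a) sapply(test, function(b) mean(a == b)))

# Assumed conversion: dissimilarity = 1 - similarity, stored as a "dist" object
d <- as.dist(1 - sim)
d
```

A "dist" object like `d` can then be passed to functions such as `hclust(d)`.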

2012-05-24 04:16:42 flodel

Thanks again! This works well. – Werner

Here's a shot at it (dt is your matrix):

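library(reshape)
# all ordered pairs of column names
df = expand.grid(names(dt), names(dt))
# share of matching rows for each pair
df$val = apply(df, 1, function(x) mean(dt[x[1]] == dt[x[2]]))
# reshape the long table into a square matrix
cast(df, Var2 ~ Var1)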

2012-05-24 04:11:12 blindjesse

This works well! Thank you very much. Just one mistake: on line 3, df2 should be df. – Werner

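Here's a solution that is faster than the other two, albeit a bit ugly. I assume the speed bump comes from not using mean(), since it can be slower than sum(), and from computing only half of the output matrix and then filling in the lower triangle manually. The function currently leaves NA on the diagonal, but you can easily set those to one, to match the other answers exactly, with diag(out) <- 1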

FUN <- function(m) { 
    #compute all the combinations of columns pairs 
    combos <- t(combn(ncol(m),2)) 
    #compute the similarity index based on the criteria defined 
    sim <- apply(combos, 1, function(x) sum(m[, x[1]] - m[, x[2]] == 0)/nrow(m)) 
    combos <- cbind(combos, sim) 
    #dimensions of output matrix 
    out <- matrix(NA, ncol = ncol(m), nrow = ncol(m)) 

    for (i in 1:nrow(combos)){
        #upper tri
        out[combos[i, 1], combos[i, 2]] <- combos[i,3]
        #lower tri
        out[combos[i, 2], combos[i, 1]] <- combos[i,3]
    }
    return(out) 
}
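The same half-matrix idea can also be written without the explicit loop. The following is only a sketch under my own assumptions, not the benchmarked function: `sim_half` is a hypothetical name, the input is assumed to have at least two columns, and the diagonal is set to 1 up front instead of being left as NA.

```r
# Hypothetical vectorised variant: compute each column pair once, then mirror
sim_half <- function(m) {
  m <- as.matrix(m)
  p <- ncol(m)
  # start from the identity so every column matches itself with ratio 1
  out <- diag(1, p)
  dimnames(out) <- list(colnames(m), colnames(m))
  # one row per unordered column pair (i < j)
  pairs <- t(combn(p, 2))
  # compare all pairs at once; colSums counts the matching rows per pair
  sim <- colSums(m[, pairs[, 1], drop = FALSE] ==
                 m[, pairs[, 2], drop = FALSE]) / nrow(m)
  out[pairs] <- sim                        # upper triangle
  out[pairs[, 2:1, drop = FALSE]] <- sim   # lower triangle
  out
}

# Data from the question
test <- data.frame(
  X1 = c(1L, 2L, 3L, 4L, 2L, 5L),
  X2 = c(2L, 3L, 4L, 5L, 3L, 6L),
  X3 = c(3L, 4L, 4L, 5L, 3L, 2L),
  X4 = c(2L, 4L, 6L, 5L, 3L, 8L),
  X5 = c(1L, 3L, 2L, 4L, 6L, 4L)
)
s <- sim_half(test)
s
```

Indexing `out` with a two-column matrix assigns one cell per row of `pairs`, which replaces the `for` loop; whether this is actually faster would need to be timed.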

I took the other two answers, turned them into functions, and did some benchmarking:

library(rbenchmark) 
benchmark(chase(m), flodel(m), blindJessie(m), 
      replications = 1000, 
      order = "elapsed", 
      columns = c("test", "elapsed", "relative")) 
#----- 
            test elapsed  relative
1       chase(m)   1.217  1.000000
2      flodel(m)   1.306  1.073131
3 blindJessie(m)  17.691 14.548520

2012-05-24 04:35:54 Chase

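Chase, there is a bug in your code: you can't use `combos` after `transform(combos, ...)`, because the name will be evaluated inside `combos`. I suspect you have another copy of `combos` in your global environment, which is why it works for you. It should be an easy fix: make a copy of combos before calling `transform`. – flodel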

@flodel - good catch, thanks. Made the appropriate adjustments and re-timed. Sticking with a matrix and cbind also sped up the function. – Chase

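Then you may want to run them again, since I have also improved the speed of my answer. On my machine, my version is a little slower than yours, but not by much: the ratio is down to 1.07. – flodel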

Thank you all for your suggestions. Based on your answers, I worked out a three-line solution ("test" is the name of the dataset).

require(proxy) 
ff <- function(x,y) sum(x == y)/NROW(x) 
dist(t(test), ff, upper=TRUE)

Output:

          X1        X2        X3        X4        X5
X1           0.0000000 0.0000000 0.0000000 0.3333333
X2 0.0000000           0.5000000 0.5000000 0.1666667
X3 0.0000000 0.5000000           0.5000000 0.0000000
X4 0.0000000 0.5000000 0.5000000           0.0000000
X5 0.3333333 0.1666667 0.0000000 0.0000000

2012-05-25 02:49:35 Werner

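I can't get this to work: 'ff' is not defined... and even when I change it to 'f', it fails with 'Error in as.character(x) : cannot coerce type 'closure' to vector of type 'character''. – Chase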

I think that is because the "dist" function I am using is the one from the package proxy. I will add "require(proxy)" to the code. – Werner

I got the answer as follows. First, I modified the data so that each variable is a row:

X1 = c(1L, 2L, 3L, 4L, 2L, 5L) 
X2 = c(2L, 3L, 4L, 5L, 3L, 6L) 
X3 = c(3L, 4L, 4L, 5L, 3L, 2L) 
X4 = c(2L, 4L, 6L, 5L, 3L, 8L) 
X5 = c(1L, 3L, 2L, 4L, 6L, 4L) 
matrix_cor=rbind(X1,X2,X3,X4,X5) 
matrix_cor 

   [,1] [,2] [,3] [,4] [,5] [,6]
X1    1    2    3    4    2    5
X2    2    3    4    5    3    6
X3    3    4    4    5    3    2
X4    2    4    6    5    3    8
X5    1    3    2    4    6    4

Then:

dist(matrix_cor) 

         X1       X2       X3       X4
X2 2.449490
X3 4.472136 4.242641
X4 5.000000 3.000000 6.403124
X5 4.358899 4.358899 4.795832 6.633250
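Note that `dist()` uses Euclidean distance by default, so the values above are not the matching ratios the question asks for. A quick check for the X1/X2 pair, as a minimal sketch using only base R:

```r
X1 <- c(1L, 2L, 3L, 4L, 2L, 5L)
X2 <- c(2L, 3L, 4L, 5L, 3L, 6L)

# Euclidean distance, which is what dist() computes by default
euclid <- sqrt(sum((X1 - X2)^2))
euclid
# [1] 2.44949

# Same value from dist() applied to the two rows
from_dist <- as.numeric(dist(rbind(X1, X2)))

# Matching ratio asked for in the question: no row of X1 equals X2
ratio <- mean(X1 == X2)
ratio
# [1] 0
```

The 2.449490 in the table is therefore sqrt(6), whereas the matching ratio for the same pair is 0.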

2017-02-18 14:38:16

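Hi. Thanks for your answer: I edited it so that the code is readable. In the future, please format your answers so they are easy to read (http://stackoverflow.com/editing-help) – lbusett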
