2012-11-21 66 views
2

可能重复:
Find cosine similarity in R计算余弦相似度,如果数据包含NA值

我有类似这样的在河,我想找到之间的余弦相似一张大桌子每个项目,例如对(91,93),(91,99),(91,100)...(101,125)。最终输出应该是

No_1 No_2 Similarity 
... 
6518 6763 0.974 
… 

表看起来像这样。

 No_ Product.Group.Code R1 R2 R3 R4 S1 S2 S3 U1 U2 U3 U4 U6 
91 65418    164 0.68 0.70 0.50 0.59 NA NA 0.96 NA 0.68 NA NA NA 
93 57142    164 NA 0.94 NA NA 0.83 NA NA 0.54 NA NA NA NA 
99 66740    164 0.68 0.68 0.74 NA 0.63 0.68 0.72 NA NA NA NA NA 
100 76712    164 0.54 0.54 0.40 NA 0.39 0.39 0.39 0.50 NA 0.50 NA NA 
101 56463    164 0.67 0.67 0.76 NA NA 0.76 0.76 0.54 NA NA NA NA 
125 11713    164 NA NA NA NA NA 0.88 NA NA NA NA NA NA 

因为有些行有NA,我写了一些辅助功能,只比较其中两排的不是NA列。

compareNA <- function(v1,v2) { 
    same <- (!is.na(v1) & !is.na(v2)) 
    same[is.na(same)] <- FALSE 
    return(same) 
} 

selectTRUE <- function(v1, truth) { 
    # This function selects only the variables which correspond to the truth vector 
    # being true. 
    for (colname in colnames(v1)) { 
     if(!truth[ ,colname]) { 
      v1[colname] <- NULL 
     } 
    } 
    return(v1) 
} 

trimAndTuck <- function(v1){ 
    # Turns list into vector and removes first two columns 
    return (unlist(v1, use.names = FALSE)[-(1:2)]) 
} 

cosineSimilarity <- function(v1, v2) { 
    truth <- compareNA(v1, v2) 
    return (cosine(
       trimAndTuck(selectTRUE(v1, truth)), 
       trimAndTuck(selectTRUE(v2, truth)) 
       )) 
} 

allPairs <- function(df){ 
    for (i in 1:length(df)) { 
     for (j in 1:length(df)) { 
      print(cosineSimilarity(df[i,], df[j,])) 
     } 
    } 
} 

运行allpairs不给我正确的答案,但它在一系列的1x1载体这样做。我很清楚,我写的东西可能是对功能性神的侮辱,但我不知道要如何写它。

如何重写(矢量化?)以便它以正确的格式返回数据?

编辑:我使用的余弦函数是LSA包的一部分。这是关于用余弦函数处理NA值,而不是如何计算标准余弦相似度。

+1

函数R包中的余弦函数? –

+1

也许这[so-question](http://stackoverflow.com/questions/2535234/find-cosine-similarity-in-r)(可能重复)会有所帮助。按照接受的答案的指示。 – sgibb

+0

这不是重复的,因为它是关于如何在使用余弦函数时处理NA值。 – Roland

回答

3
#data 
df <- read.table(text="No_ Product.Group.Code R1 R2 R3 R4 S1 S2 S3 U1 U2 U3 U4 U6 
91 65418    164 0.68 0.70 0.50 0.59 NA NA 0.96 NA 0.68 NA NA NA 
93 57142    164 NA 0.94 NA NA 0.83 NA NA 0.54 NA NA NA NA 
99 66740    164 0.68 0.68 0.74 NA 0.63 0.68 0.72 NA NA NA NA NA 
100 76712    164 0.54 0.54 0.40 NA 0.39 0.39 0.39 0.50 NA 0.50 NA NA 
101 56463    164 0.67 0.67 0.76 NA NA 0.76 0.76 0.54 NA NA NA NA 
125 11713    164 NA NA NA NA NA 0.88 NA NA NA NA NA NA",header=TRUE) 

#remove second column 
df <- df[,-2] 

#transform to long format 
library(reshape2) 
df <- melt(df,id.vars="No_") 

#cosine similarity taken from package lsa 
#I could not load package lsa, because I lack Java on my system 
cosine <- function(x, y=NULL) { 

    if (is.matrix(x) && is.null(y)) { 

    co = array(0,c(ncol(x),ncol(x))) 
    f = colnames(x) 
    dimnames(co) = list(f,f) 

    for (i in 2:ncol(x)) { 
     for (j in 1:(i-1)) { 
     co[i,j] = cosine(x[,i], x[,j]) 
     } 
    } 
    co = co + t(co) 
    diag(co) = 1 

    return (as.matrix(co)) 

    } else if (is.vector(x) && is.vector(y)) { 
    return (crossprod(x,y)/sqrt(crossprod(x)*crossprod(y))) 
    } else { 
    stop("argument mismatch. Either one matrix or two vectors needed as input.") 
    } 

} 

#define function that removes NA before calculating the similarity 
cosine2 <- function(x,y) cosine(na.omit(cbind(x,y))) 

#pairwise comparisons 
i <- outer(unique(df$No_),unique(df$No_),FUN=function(i,j) i) 
j <- outer(unique(df$No_),unique(df$No_),FUN=function(i,j) j) 

i <- i[!lower.tri(i)] 
j <- j[!lower.tri(j)] 

comp <- function(ind) { 
    res <- cosine2(df$value[df$No_==i[ind]],df$value[df$No_==j[ind]])[1,2] 
    list(No1=as.character(i[ind]),No2=as.character(j[ind]),CosSim=res) 
} 

res <- as.data.frame(t(sapply(seq_along(i),FUN="comp"))) 

    No1 No2 CosSim 
1 65418 65418   1 
2 65418 57142   1 
3 57142 57142   1 
4 65418 66740 0.9724159 
5 57142 66740 0.999714 
6 66740 66740   1 
7 65418 76712 0.9569313 
8 57142 76712 0.9684678 
9 66740 76712 0.9854669 
10 76712 76712   1 
11 65418 56463 0.9741412 
12 57142 56463 0.9877108 
13 66740 56463 0.9989167 
14 76712 56463 0.9708716 
15 56463 56463   1 
16 65418 11713  NaN 
17 57142 11713  NaN 
18 66740 11713   1 
19 76712 11713   1 
20 56463 11713   1 
21 11713 11713   1