2017-04-26

Calculate the mean of the pairwise covariances between elements in a list

I have the following data frames:

# df1 
id cg_v 
1  a 
2  b 
3  a b 
4  b c 
5 b c d 
6  d 

# df2 
id cg 
1 a 
2 b 
3 a 
3 b 
4 b 
4 c 
5 b 
5 c 
5 d 
6 d 

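I need to add a column to df1 that contains the mean of the covariances computed across every pair of elements in cg_v. If cg_v contains only one element, I want the new column to contain that element's variance.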

I can obtain the covariance matrix with cov(crossprod(table(df2))):

#   a   b   c   d 
a 0.9166667 0.0000000 -0.5833333 -0.6666667 
b 0.0000000 2.0000000 1.0000000 0.0000000 
c -0.5833333 1.0000000 0.9166667 0.3333333 
d -0.6666667 0.0000000 0.3333333 0.6666667 
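The one-liner above can be unpacked into three steps, sketched below. The intermediate names `incidence`, `cooc` and `cov_mat` are illustrative assumptions and do not appear in the original post.

```r
df2 <- data.frame(
  id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
  cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d")
)

# Step 1: id-by-cg incidence table (1 if the id contains that element)
incidence <- table(df2)

# Step 2: co-occurrence counts, i.e. t(incidence) %*% incidence;
# entry [i, j] counts the ids that contain both element i and element j
cooc <- crossprod(incidence)

# Step 3: covariance between the columns of the co-occurrence matrix
cov_mat <- cov(cooc)

cov_mat["b", "c"]  # covariance of b and c
diag(cov_mat)      # variances of a, b, c and d
```

Note that the covariance is taken over the columns of the co-occurrence matrix, not over the raw incidence table, so each element is described by how often it appears together with every other element.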

How do I proceed from this covariance matrix?

The final result should look like this:

# df1 
id cg_v  cg_cov 
1  a 0.9166667 
2  b 2.0000000 
3  a b 0.0000000 
4  b c 1.0000000 
5 b c d 0.4444444 # This is equal to (1.0000000 + 0.3333333 + 0.0000000)/3 
6  d 0.6666667 

Code to generate df1 and df2:

df1 <- structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L), 
         cg_v = c("a", "b", "a b", "b c", "b c d", "d")), 
       .Names = c("id", "cg_v"), 
       class = "data.frame", row.names = c(NA, -6L)) 

df2 <- structure(list(id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L), 
         cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d")), 
       .Names = c("id", "cg"), 
       class = "data.frame", row.names = c(NA, -10L)) 

Answers


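I think I found a solution to this problem using data.table and reshape. What do you want to do when there are three letters, as in b c d? I assumed you want the covariance of the first two letters: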

require(reshape)
require(data.table)
dt1 <- data.table(id = c(1L, 2L, 3L, 4L, 5L, 6L),
                  cg_v = c("a", "b", "a b", "b c", "b c d", "d"))
dt2 <- data.table(id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
                  cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d"))
# Long format: one row per (X1, X2) pair with its covariance.
# reshape::melt is named explicitly because data.table also exports melt()
cov_dt <- data.table(reshape::melt(cov(crossprod(table(dt2)))))
# Keep the first two elements of each cg_v as columns V1 and V2
dt1 <- cbind(dt1, t(sapply(strsplit(as.character(dt1$cg_v), " "), function(x) x[1:2])))
# Replace NA in the second column with the first column
dt1[is.na(V2), V2 := V1]

# Merge on both columns
setkey(dt1, "V1", "V2")
setkey(cov_dt, "X1", "X2")
result <- cov_dt[dt1]
> result[, .(id, cg_v, value)]
   id  cg_v     value
1:  1     a 0.9166667
2:  3   a b 0.0000000
3:  2     b 2.0000000
4:  4   b c 1.0000000
5:  5 b c d 1.0000000
6:  6     d 0.6666667
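The long-format join used above can also be sketched in base R, without reshape or data.table. This is an illustrative alternative rather than the answerer's code: the names `cov_long`, `first` and `second` are assumptions, and like the answer it only looks at the first two elements of each `cg_v`.

```r
df1 <- data.frame(
  id = 1:6,
  cg_v = c("a", "b", "a b", "b c", "b c d", "d"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
  cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d")
)
cov_mat <- cov(crossprod(table(df2)))

# Long format: one row per (first, second) pair with its covariance.
# as.vector() walks the matrix column by column, so the row label
# cycles fastest and the column label changes slowest.
elems <- rownames(cov_mat)
cov_long <- data.frame(
  first  = rep(elems, times = length(elems)),
  second = rep(elems, each = length(elems)),
  value  = as.vector(cov_mat),
  stringsAsFactors = FALSE
)

# First two elements of each cg_v; a lone element is paired with itself
parts <- strsplit(df1$cg_v, " ")
df1$first  <- vapply(parts, function(x) x[1], character(1))
df1$second <- vapply(parts, function(x) if (length(x) > 1) x[2] else x[1],
                     character(1))

# Join on both key columns, then restore the original row order
result <- merge(df1, cov_long, by = c("first", "second"))
result <- result[order(result$id), c("id", "cg_v", "value")]
result
```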

A variant that also works when there are more than two letters (not the most efficient code):

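# combn() comes from the utils package that ships with R; no extra package is needed
df1 <- data.frame(id = c(1L, 2L, 3L, 4L, 5L, 6L),
                  cg_v = c("a", "b", "a b", "b c", "b c d", "d"))
df2 <- data.frame(id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
                  cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d"))
cov_dt <- cov(crossprod(table(df2)))
# Split each cg_v into its elements; duplicate a lone element so that the
# pair (x, x) picks up its variance from the diagonal.
# lapply is used so the result is always a list, never simplified to a matrix
mat <- lapply(strsplit(as.character(df1$cg_v), " "), function(x) if (length(x) == 1) c(x, x) else x)
# Every entry should now have at least 2 elements
sapply(mat, length) > 1
# All pairs of elements per row, one pair per column
mat <- lapply(mat, function(x) matrix(combn(x, 2), nrow = 2))
# Mean covariance over the pairs of each row
df1$cg_cov <- sapply(mat, function(x) mean(apply(x, 2, function(p) cov_dt[p[1], p[2]])))
> df1
  id  cg_v    cg_cov
1  1     a 0.9166667
2  2     b 2.0000000
3  3   a b 0.0000000
4  4   b c 1.0000000
5  5 b c d 0.4444444
6  6     d 0.6666667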
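The pair enumeration can also be avoided by subsetting the covariance matrix to the elements of a row and averaging the entries above the diagonal. The sketch below is one possible way to do this, not the answerer's method. The helper name `mean_pairwise_cov` is hypothetical, and it assumes that the elements within each `cg_v` are unique and all appear in `df2`.

```r
df1 <- data.frame(
  id = 1:6,
  cg_v = c("a", "b", "a b", "b c", "b c d", "d"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
  cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d")
)
cov_mat <- cov(crossprod(table(df2)))

# Hypothetical helper: variance for a single element, otherwise the mean
# of the covariances of all unordered pairs of elements
mean_pairwise_cov <- function(elems, cov_mat) {
  if (length(elems) == 1) {
    return(cov_mat[elems, elems])
  }
  sub <- cov_mat[elems, elems, drop = FALSE]
  # The strict upper triangle holds each unordered pair exactly once
  mean(sub[upper.tri(sub)])
}

df1$cg_cov <- vapply(strsplit(df1$cg_v, " "), mean_pairwise_cov,
                     numeric(1), cov_mat = cov_mat)
df1
```

Using `vapply` with `numeric(1)` makes the function fail loudly if any row does not yield a single number, instead of silently returning a list.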

No, I would need the mean of cov(b,c), cov(c,d) and cov(b,d). That is (1.0000000 + 0.3333333 + 0.0000000) / 3 = 0.4444444. – Michele

I edited my solution so that it also works when more letters are used –

It works for the example, but not for my data. I get an error when running 'mat <- sapply(mat, function(x) matrix(combn(x, 2), nrow = 2))'. The error is: 'Error in combn(x, 2): n Michele
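The commenter's data is not shown, so the cause of this error cannot be confirmed. One plausible cause, offered purely as an assumption: when no row has more than two elements, `sapply` simplifies its result to a character matrix instead of a list, and a following `sapply` then hands single strings to `combn`, which cannot pick 2 items from 1. The sketch below reproduces that situation with made-up data and shows that `lapply` avoids it. The names `cg_v`, `pad`, `simplified` and `as_list` are illustrative.

```r
# Hypothetical data in which no row has more than two elements
cg_v <- c("a", "b", "a b", "b c")

# Duplicate a lone element so that every row has at least two entries
pad <- function(x) if (length(x) == 1) c(x, x) else x

# All results have length 2, so sapply collapses them into a 2 x 4 matrix
simplified <- sapply(strsplit(cg_v, " "), pad)
is.matrix(simplified)

# Iterating over that matrix visits single strings, so combn() fails
attempt <- try(sapply(simplified, function(x) combn(x, 2)), silent = TRUE)
failed <- inherits(attempt, "try-error")

# lapply always returns a list with one character vector per row
as_list <- lapply(strsplit(cg_v, " "), pad)
pairs <- lapply(as_list, function(x) combn(x, 2))
```

Empty or missing `cg_v` values would be another candidate worth checking, since `strsplit` then yields fewer than two elements for that row.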