2013-04-25

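How do I "flatten" or "collapse" a 2D data frame into a 1D data frame in R?

I have a two-dimensional table of distances in a data.frame in R (imported from CSV):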

          CP000036  CP001063  CP001368
CP000036  0         a         b
CP001063  a         0         c
CP001368  b         c         0

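I would like to "flatten" it, so that the values of one axis are in the first column, the values of the other axis are in the second column, and the distance is in the third column: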

Genome1  Genome2  Dist 
CP000036  CP001063  a 
CP000036  CP001368  b 
CP001063  CP001368  c 

The above is ideal, but it would be perfectly fine to have repetition, so that every cell of the input matrix gets its own row:

Genome1  Genome2  Dist 
CP000036  CP000036  0 
CP000036  CP001063  a 
CP000036  CP001368  b 
CP001063  CP000036  a 
CP001063  CP001063  0 
CP001063  CP001368  c 
CP001368  CP000036  b 
CP001368  CP001063  c 
CP001368  CP001368  0 

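The example is a 3x3 matrix, but my actual dataset is much larger (about 2000x2000). I would do this in Excel, but I need about 3 million rows of output, while Excel's maximum is about 1 million.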
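To make the size estimate concrete, here is a small sketch of the row counts for an n x n distance matrix. The value n = 2000 is taken from the question; the Excel limit of 1,048,576 rows per worksheet is an assumption about the Excel version in use (it holds for .xlsx workbooks).

```r
n <- 2000                       # matrix dimension from the question

full_rows  <- n^2               # one row per cell, repetition included
upper_rows <- choose(n, 2)      # unique pairs only, diagonal excluded
excel_max  <- 1048576           # assumed per-sheet row limit of Excel (.xlsx)

full_rows  > excel_max          # TRUE: the full table does not fit
upper_rows > excel_max          # TRUE: even the deduplicated table does not fit
```

Either output form exceeds the worksheet limit, so doing the reshaping in R is the practical route.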

This question is very similar to "How to 'flatten' or 'collapse' a 2D Excel table into 1D?"

as.data.frame.table? – 2013-04-25 17:36:43
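The comment's suggestion can be sketched as follows. This is a minimal example, assuming the distances are numeric and stored in a matrix; the values 1, 2 and 3 are hypothetical stand-ins for a, b and c, and the names `ids`, `m` and `long` are illustrative only. It produces the second form from the question, with one row per cell.

```r
ids <- c("CP000036", "CP001063", "CP001368")

# Hypothetical numeric distances standing in for a, b, c
m <- matrix(c(0, 1, 2,
              1, 0, 3,
              2, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(Genome1 = ids, Genome2 = ids))

# as.table() marks the matrix as a contingency-style table;
# as.data.frame() then dispatches to as.data.frame.table and
# returns one row per cell, named after the dimnames
long <- as.data.frame(as.table(m), responseName = "Dist",
                      stringsAsFactors = FALSE)
print(long)
```

If the data was read from CSV into a data.frame, it would first need `as.matrix()` so that the row and column names become dimnames.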

Answers

Here is a solution using melt from the reshape2 package:

dm <- data.frame(
    CP000036 = c("0", "a", "b"),
    CP001063 = c("a", "0", "c"),
    CP001368 = c("b", "c", "0"),
    stringsAsFactors = FALSE,
    row.names = c("CP000036", "CP001063", "CP001368")
)

# assuming the distance is a metric (symmetric, zero on the diagonal),
# blank out everything on and below the diagonal
dm[ lower.tri(dm, diag = TRUE) ] <- NA 
dm$Genome1 <- rownames(dm) 

# finally melt and avoid the entries below the diagonal with na.rm = TRUE 
library(reshape2) 
dm.molten <- melt(dm, na.rm = TRUE, id.vars = "Genome1",
                  value.name = "Dist", variable.name = "Genome2")

print(dm.molten) 
    Genome1 Genome2 Dist 
4 CP000036 CP001063 a 
7 CP000036 CP001368 b 
8 CP001063 CP001368 c 

There are probably better-performing solutions, but I like this one because it is plain and simple.
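As one possible alternative that needs no extra package, the upper triangle can be indexed directly in base R. This is only a sketch, not a benchmarked replacement for the melt approach: it assumes the distances are held in a numeric matrix with row and column names, and the values 1, 2 and 3 and the names `ids`, `m`, `idx` and `flat` are hypothetical.

```r
ids <- c("CP000036", "CP001063", "CP001368")

# Hypothetical numeric distances standing in for a, b, c
m <- matrix(c(0, 1, 2,
              1, 0, 3,
              2, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(ids, ids))

# Row/column positions of the cells strictly above the diagonal
idx <- which(upper.tri(m), arr.ind = TRUE)

# Look up the labels and the distances for those positions
flat <- data.frame(Genome1 = rownames(m)[idx[, "row"]],
                   Genome2 = colnames(m)[idx[, "col"]],
                   Dist    = m[idx],
                   stringsAsFactors = FALSE)
print(flat)
```

Replacing `upper.tri(m)` with `upper.tri(m, diag = TRUE)` would also keep the zero-distance rows on the diagonal.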