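Large distance matrix in clustering

I am running R 3.2.3 on a machine with 16 GB of RAM. I have a large matrix of 300,000 rows × 12 columns. I want to run a hierarchical clustering algorithm in R, so I am first trying to build a distance matrix. Because the data is of mixed types, I use a different distance measure for each group of columns. I get a memory allocation error: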
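library(geosphere)  # distm()
library(e1071)      # hamming.distance()
library(magrittr)   # %>%
df <- as.data.frame(matrix(rnorm(36 * 10^5), nrow = 3 * 10^5))
d1 <- as.dist(distm(df[, 1:2]) / 10^5)                      # geographic distance
d2 <- dist(df[, 3:8], method = "euclidean")                 # Euclidean distance
d3 <- hamming.distance(as.matrix(df[, 9:12])) %>% as.dist() # Hamming distance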
I get the following errors:
> d1=as.dist(distm(df1[,c(1:2)])/10^5)
Error: cannot allocate vector of size 670.6 Gb
In addition: Warning messages:
1: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
2: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
3: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
4: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
> d2=dist(df1[,c(3:8)], method = "euclidean")
Error: cannot allocate vector of size 335.3 Gb
In addition: Warning messages:
1: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
2: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
3: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
4: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
> d3= hamming.distance(df1[,c(9:12)]%>%as.matrix(.))%>%as.dist(.)
Error: cannot allocate vector of size 670.6 Gb
In addition: Warning messages:
1: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
2: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
3: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
4: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
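The sizes in these errors follow directly from the number of rows. As a rough check, assuming 8-byte doubles and that R reports "Gb" in units of 1024^3 bytes, the full n × n matrix allocated by `matrix(0, ncol = n, nrow = n)` and the lower triangle stored by `dist()` work out as follows (the variable names here are only illustrative):

```r
n <- 3e5                 # number of rows in the data
bytes_per_double <- 8    # assumption: numeric values stored as 8-byte doubles

# Full n x n matrix, as allocated inside distm() and hamming.distance()
full_gib <- n^2 * bytes_per_double / 1024^3

# Lower triangle only, n * (n - 1) / 2 values, as stored by dist()
dist_gib <- n * (n - 1) / 2 * bytes_per_double / 1024^3

round(c(full = full_gib, dist = dist_gib), 1)
# full 670.6, dist 335.3
```

Both figures are far beyond 16 GB, so no memory setting on this machine will make the complete distance matrix fit.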
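You don't need to process all the data at once; that consumes all the memory and fails. Consider processing it in batches, for example 10,000 vectors at a time. – Patric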
But in clustering we need to compute the distance from each row to all the other rows. So how does batch computation help here? –
Yes, but you can do a final reduction by selecting the minimum/maximum. Does that make sense? For computing the distances efficiently, you can refer to [here](http://stackoverflow.com/questions/27847196/distance-calculation-on-large-vectors-performance/33409695#33409695). – Patric
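One way to read the batch-and-reduce suggestion is sketched below for the Euclidean columns only. It computes distances from one block of rows to all rows, keeps just the minimum per row, and discards the block. The helper `nearest_dist_batched` and its `batch_size` argument are hypothetical names for illustration, not something from the comments or from any package, and the sketch yields nearest-neighbour distances rather than a full hierarchical clustering.

```r
# Hypothetical helper: for each row of x, the Euclidean distance to its
# nearest other row, computed one block of rows at a time.
nearest_dist_batched <- function(x, batch_size = 1000) {
  x <- as.matrix(x)
  n <- nrow(x)
  sq_norms <- rowSums(x^2)
  result <- numeric(n)
  for (start in seq(1, n, by = batch_size)) {
    idx <- start:min(start + batch_size - 1, n)
    block <- x[idx, , drop = FALSE]
    # Squared distances from the block to all rows: |a|^2 + |b|^2 - 2 * a.b
    d2 <- outer(sq_norms[idx], sq_norms, "+") - 2 * tcrossprod(block, x)
    d2[cbind(seq_along(idx), idx)] <- Inf            # ignore distance to self
    # Reduction step: keep only the minimum per row (clamp rounding error at 0)
    result[idx] <- sqrt(pmax(apply(d2, 1, min), 0))
  }
  result
}

# Small demonstration on random data (sizes chosen only for illustration)
set.seed(1)
x <- matrix(rnorm(2000 * 6), ncol = 6)
nn <- nearest_dist_batched(x, batch_size = 500)
```

Only a `batch_size` × n block is held at any time. With 300,000 rows and a batch of 1,000, each temporary block is still roughly 2.2 GiB and several exist at once, so a smaller batch would probably be needed on a 16 GB machine.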