Large distance matrix in clustering

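I am running R 3.2.3 on a machine with 16 GB of RAM. I have a large matrix of 300,000 rows × 12 columns. I want to use a hierarchical clustering algorithm in R, so before doing that I am trying to create a distance matrix. Since the data is of mixed types, I use a different distance measure for each type. I get a memory allocation error: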

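library(geosphere)  # provides distm()
library(e1071)      # provides hamming.distance()
library(magrittr)   # provides %>%

df1 <- as.data.frame(matrix(rnorm(36*10^5), nrow = 3*10^5))
d1=as.dist(distm(df1[,c(1:2)])/10^5)
d2=dist(df1[,c(3:8)], method = "euclidean")
d3= hamming.distance(df1[,c(9:12)]%>%as.matrix(.))%>%as.dist(.)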

I get the following errors:

> d1=as.dist(distm(df1[,c(1:2)])/10^5) 
Error: cannot allocate vector of size 670.6 Gb 
In addition: Warning messages: 
1: In matrix(0, ncol = n, nrow = n) : 
Reached total allocation of 16070Mb: see help(memory.size) 
2: In matrix(0, ncol = n, nrow = n) : 
Reached total allocation of 16070Mb: see help(memory.size) 
3: In matrix(0, ncol = n, nrow = n) : 
Reached total allocation of 16070Mb: see help(memory.size) 
4: In matrix(0, ncol = n, nrow = n) : 
Reached total allocation of 16070Mb: see help(memory.size) 
> d2=dist(df1[,c(3:8)], method = "euclidean") 
Error: cannot allocate vector of size 335.3 Gb 
In addition: Warning messages: 
1: In dist(df1[, c(3:8)], method = "euclidean") : 
Reached total allocation of 16070Mb: see help(memory.size) 
2: In dist(df1[, c(3:8)], method = "euclidean") : 
Reached total allocation of 16070Mb: see help(memory.size) 
3: In dist(df1[, c(3:8)], method = "euclidean") : 
Reached total allocation of 16070Mb: see help(memory.size) 
4: In dist(df1[, c(3:8)], method = "euclidean") : 
Reached total allocation of 16070Mb: see help(memory.size) 
> d3= hamming.distance(df1[,c(9:12)]%>%as.matrix(.))%>%as.dist(.) 
Error: cannot allocate vector of size 670.6 Gb 
In addition: Warning messages: 
1: In matrix(0, nrow = nrow(x), ncol = nrow(x)) : 
Reached total allocation of 16070Mb: see help(memory.size) 
2: In matrix(0, nrow = nrow(x), ncol = nrow(x)) : 
Reached total allocation of 16070Mb: see help(memory.size) 
3: In matrix(0, nrow = nrow(x), ncol = nrow(x)) : 
Reached total allocation of 16070Mb: see help(memory.size) 
4: In matrix(0, nrow = nrow(x), ncol = nrow(x)) : 
Reached total allocation of 16070Mb: see help(memory.size)
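
The allocation sizes in these errors can be checked by hand. This is a rough sketch, assuming 8-byte doubles and that R reports sizes in binary gigabytes (1024^3 bytes): the traces show `distm` and `hamming.distance` allocating a full n × n matrix, while `dist` stores only the n(n − 1)/2 entries below the diagonal.

```r
n <- 3e5                            # number of rows in the data

full_bytes <- n^2 * 8               # full n x n matrix of doubles
dist_bytes <- n * (n - 1) / 2 * 8   # lower triangle only, as in a dist object

full_gib <- full_bytes / 1024^3
dist_gib <- dist_bytes / 1024^3

round(full_gib, 1)   # 670.6
round(dist_gib, 1)   # 335.3
```

Either figure is far beyond 16 GB, so no memory setting on this machine will make the full distance matrix fit.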

Source

2015-12-15 Kanika Singhal

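You don't need to process all the data at once; that uses up all the memory and causes the error. Consider processing it in batches, for example 10,000 vectors at a time. – Patric

But in clustering we need to compute the distance from each row to all the other rows. So how does batch computation help here? –

Yes, but you can do a final reduction by selecting the smallest/largest one. Does that make sense? For computing distances efficiently, you can refer to [here](http://stackoverflow.com/questions/27847196/distance-calculation-on-large-vectors-performance/33409695#33409695). – Patric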

For simplicity, let's assume you have 1 row (A) to cluster against a matrix (B) of 3*10^8 rows by minimum distance.

The original approach is:

1. Load A and B
2. Compute the distance from A to each row of B
3. Select the smallest of the results (reduction)

But because B is really large, you can't load it into memory, or the computation fails partway through.

The batched approach looks like this:

1. Load A (suppose it is small)
2. Load B.partial with rows 1 to 10^5 of B
3. Compute the distance from A to each row of B.partial
4. Select the minimum of the partial results and save it as res[i]
5. Go back to step 2 and load the next 10^5 rows of B
6. At the end you have 3000 partial results saved in res[1:3000]
7. Reduction: select the minimum from res[1:3000]
    Note: if you need all the distances, as the `dist` function returns them, you don't need the reduction; just keep this array.
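
The batched steps above could be sketched in R roughly as follows. This is a minimal illustration, not the answerer's code: `batched_min_dist` is a hypothetical helper, Euclidean distance is assumed, and `B` is held in memory here only so the example runs. In practice each batch would be read from disk or a database.

```r
# Hypothetical helper: minimum Euclidean distance from one row `a`
# to the rows of matrix `B`, computed one batch of rows at a time.
batched_min_dist <- function(a, B, batch_size = 1e5) {
  starts <- seq(1, nrow(B), by = batch_size)   # first row of each batch
  res <- numeric(length(starts))               # one partial minimum per batch
  for (i in seq_along(starts)) {
    rows <- starts[i]:min(starts[i] + batch_size - 1, nrow(B))
    B_partial <- B[rows, , drop = FALSE]       # step 2: load one batch
    diffs <- sweep(B_partial, 2, a)            # subtract `a` from every row
    d <- sqrt(rowSums(diffs^2))                # step 3: distances for this batch
    res[i] <- min(d)                           # step 4: partial reduction
  }
  min(res)                                     # step 7: final reduction
}

# Small demonstration with made-up data
set.seed(1)
B <- matrix(rnorm(1000 * 3), ncol = 3)
a <- c(0, 0, 0)
batched_result <- batched_min_dist(a, B, batch_size = 100)
direct_result  <- min(sqrt(rowSums(sweep(B, 2, a)^2)))
all.equal(batched_result, direct_result)
```

Only one batch of distances exists at any time, so peak memory is governed by `batch_size` rather than by the number of rows in `B`.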

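The code will be slightly more complex than the original. But this is a very common trick when dealing with big-data problems. For the computation part, you can refer to one of my previous answers here.

I would really appreciate it if you could post your final batch-mode code here, so that others can learn from it too.

Another interesting thing about dist is that it is one of the few functions in R's packages that supports OpenMP. See the source code here and how to compile with OpenMP here.

So, if you set OMP_NUM_THREADS to 4 or 8, depending on your machine, and run it again, you may see a large performance improvement!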

void R_distance(double *x, int *nr, int *nc, double *d, int *diag, 
    int *method, double *p) 
{ 
    int dc, i, j; 
    size_t ij; /* can exceed 2^31 - 1 */ 
    double (*distfun)(double*, int, int, int, int) = NULL; 
    #ifdef _OPENMP 
     int nthreads; 
    #endif 
    ..... 
}
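
To try the OMP_NUM_THREADS suggestion, something like the following could be used. This is a sketch under assumptions: it only has an effect if your R build was compiled with OpenMP and `dist` actually uses more than one thread in that build, and setting the variable in the shell before starting R may be more reliable than setting it from within a session. The data size is made up for illustration.

```r
# Ask the OpenMP runtime for 4 threads (no effect without OpenMP support)
Sys.setenv(OMP_NUM_THREADS = "4")

# Small made-up data set: 2000 rows, 6 numeric columns
set.seed(42)
x <- matrix(rnorm(2000 * 6), ncol = 6)

# Time the distance computation; rerun with a different thread count to compare
timing <- system.time(d <- dist(x, method = "euclidean"))
print(timing)
```

Comparing the elapsed time across thread counts shows whether your build benefits at all.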

Also, if you want to accelerate dist on a GPU, you can refer to the talks section of ParallelR.

Source

2015-12-15 05:46:09 Patric
