2015-09-27 66 views
3

我有一个大的稀疏矩阵 - 使用scipy的sparse.csr_matrix。这些值是二进制的。对于每一行,我需要计算同一矩阵中每行的Jaccard距离。什么是最有效的方式来做到这一点?即使对于10.000 x 10.000矩阵,我的运行时也需要几分钟才能完成。在稀疏矩阵上计算Jaccard距离

目前的解决方案:

def jaccard(a, b): 
    intersection = float(len(set(a) & set(b))) 
    union = float(len(set(a) | set(b))) 
    return 1.0 - (intersection/union) 

def regions(csr, p, epsilon): 
    neighbors = [] 
    for index in range(len(csr.indptr)-1): 
     if jaccard(p, csr.indices[csr.indptr[index]:csr.indptr[index+1]]) <= epsilon: 
      neighbors.append(index) 
    return neighbors 
csr = scipy.sparse.csr_matrix("file") 
regions(csr, 0.51) #this is called for every row 
+0

您能详细说明导致运行时的实现细节吗? – Ryan

+0

啊,对不起。已添加代码。 –

+1

我会首先使用(推测)优化的jaccard函数,例如http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html – Ryan

回答

7

矢量化是比较容易,如果你使用的矩阵乘法计算交点集内,然后将规则|union(a, b)| == |a| + |b| - |intersection(a, b)|确定工会:

# Not actually necessary for sparse matrices, but it is for 
# dense matrices and ndarrays, if X.dtype is integer. 
from __future__ import division 

def pairwise_jaccard(X): 
    """Computes the Jaccard distance between the rows of `X`. 
    """ 
    X = X.astype(bool).astype(int) 

    intrsct = X.dot(X.T) 
    row_sums = intrsct.diagonal() 
    unions = row_sums[:,None] + row_sums - intrsct 
    dist = 1.0 - intrsct/unions 
    return dist 

注投为bool然后int,因为X的dtype必须足够大以累积最大行总和的两倍,并且X的条目必须为零或一。这个代码的缺点是RAM很重,因为unionsdists是密集矩阵。

如果你只关心的距离小于一些截止epsilon,可以将代码调整为稀疏矩阵:

from scipy.sparse import csr_matrix 

def pairwise_jaccard_sparse(csr, epsilon): 
    """Computes the Jaccard distance between the rows of `csr`, 
    smaller than the cut-off distance `epsilon`. 
    """ 
    assert(0 < epsilon < 1) 
    csr = csr_matrix(csr).astype(bool).astype(int) 

    csr_rownnz = csr.getnnz(axis=1) 
    intrsct = csr.dot(csr.T) 

    nnz_i = np.repeat(csr_rownnz, intrsct.getnnz(axis=1)) 
    unions = nnz_i + csr_rownnz[intrsct.indices] - intrsct.data 
    dists = 1.0 - intrsct.data/unions 

    mask = (dists > 0) & (dists <= epsilon) 
    data = dists[mask] 
    indices = intrsct.indices[mask] 

    rownnz = np.add.reduceat(mask, intrsct.indptr[:-1]) 
    indptr = np.r_[0, np.cumsum(rownnz)] 

    out = csr_matrix((data, indices, indptr), intrsct.shape) 
    return out 

如果这仍然需要以多少RAM你可以尝试向量化了一个维和Python循环。

+0

对于稀疏方法,因为你正在铸造CSR矩阵作为整数,当你做分部时,你只会得到0或1分数。我想你可以把它作为float来代替。 – Laggel