2013-03-04 184 views
2

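Modifying a scipy sparse matrix in place

Basically, I am just trying to do a simple matrix operation: extract each column of the matrix and normalize it by dividing it by its length.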

# csc sparse matrix
self.__WeightMatrix__ = self.__WeightMatrix__.tocsc()
# iterate through columns
for Col in xrange(self.__WeightMatrix__.shape[1]):
    Column = self.__WeightMatrix__[:,Col].data
    List = [x**2 for x in Column]
    # get the column length
    Len = math.sqrt(sum(List))
    # here I assumed dot(number,Column) would do a basic scalar product
    dot((1/Len),Column)
    # now what? how do I update the original column of the matrix, everything that have been returned are copies, which drove me nuts and missed pointers so much

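I have searched through the scipy sparse matrix documentation and found nothing useful. I was hoping for a function that returns a pointer/reference into the matrix so that I could modify its values directly. Thanks.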

+0

Have you tried 'self.__WeightMatrix__[:,Col] = ...'? – Blender 2013-03-04 07:09:12

+1

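I think so. The original values did not change, which led me to believe that [:,Col] returns a copy. Also, as far as I know, csc sparse matrices do not seem to support direct index assignment; doing so raises an error. – 2013-03-04 07:10:10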

Answer

5

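In CSC format you have two writable attributes, data and indices, which hold the non-zero entries of your matrix and the corresponding row indices. You can use these to your advantage as follows:

import numpy as np

def sparse_row_normalize(sps_mat):
    if sps_mat.format != 'csc':
        msg = 'Can only row-normalize in place with csc format, not {0}.'
        msg = msg.format(sps_mat.format)
        raise ValueError(msg)
    # Sum of squares per row: each stored value is binned by its row index.
    row_norm = np.sqrt(np.bincount(sps_mat.indices, weights=sps_mat.data * sps_mat.data))
    # Divide each stored value by the norm of its row, writing into data in place.
    sps_mat.data /= np.take(row_norm, sps_mat.indices)

To see that it actually works: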

>>> mat = scipy.sparse.rand(4, 4, density=0.5, format='csc')
>>> mat.toarray()
array([[ 0.        ,  0.        ,  0.58931687,  0.31070526],
       [ 0.24024639,  0.02767106,  0.22635696,  0.85971295],
       [ 0.        ,  0.        ,  0.13613897,  0.        ],
       [ 0.        ,  0.13766507,  0.        ,  0.        ]])
>>> mat.toarray()/np.sqrt(np.sum(mat.toarray()**2, axis=1))[:, None]
array([[ 0.        ,  0.        ,  0.88458487,  0.46637926],
       [ 0.26076366,  0.03003419,  0.24568806,  0.93313324],
       [ 0.        ,  0.        ,  1.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ]])
>>> sparse_row_normalize(mat)
>>> mat.toarray()
array([[ 0.        ,  0.        ,  0.88458487,  0.46637926],
       [ 0.26076366,  0.03003419,  0.24568806,  0.93313324],
       [ 0.        ,  0.        ,  1.        ,  0.        ],
       [ 0.        ,  1.        ,  0.        ,  0.        ]])
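
The in-place update works because the data attribute is the array that backs the matrix, whereas slicing out a column builds a new matrix with its own storage. A minimal sketch of that difference, assuming a `csc_matrix` and using made-up values that are not from the question:

```python
import numpy as np
import scipy.sparse as sp

# Small CSC matrix with known entries (illustrative values only).
dense = np.array([[3.0, 0.0],
                  [4.0, 2.0]])
mat = sp.csc_matrix(dense)

# Slicing a column returns a new sparse matrix with its own data array,
# so scaling that array leaves `mat` untouched.
col = mat[:, 0]
col.data *= 10.0
after_slice_write = mat.toarray()

# `mat.data` is the array holding the stored values of `mat` itself,
# so this write is visible in the matrix.
mat.data *= 10.0
after_data_write = mat.toarray()
```

This is why the loop in the question never changed the original matrix: every write went to a temporary column object.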

And sparse_row_normalize is also numpy fast, with no Python loops spoiling the fun:

In [2]: mat = scipy.sparse.rand(10000, 10000, density=0.005, format='csc') 

In [3]: mat 
Out[3]: 
<10000x10000 sparse matrix of type '<type 'numpy.float64'>' 
    with 500000 stored elements in Compressed Sparse Column format> 

In [4]: %timeit sparse_row_normalize(mat) 
100 loops, best of 3: 14.1 ms per loop
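
The function above normalizes rows, while the question asks about columns. In CSC format the column of each stored value is not stored explicitly but can be recovered from the indptr attribute. The following is only a sketch under that assumption: `sparse_col_normalize` is a hypothetical helper name, it expects a floating-point matrix, and it does not guard against columns whose stored values are all explicit zeros.

```python
import numpy as np
import scipy.sparse as sp

def sparse_col_normalize(sps_mat):
    """Hypothetical helper: scale each column of a CSC matrix to unit L2 norm, in place."""
    if sps_mat.format != 'csc':
        msg = 'Can only column-normalize in place with csc format, not {0}.'
        raise ValueError(msg.format(sps_mat.format))
    n_cols = sps_mat.shape[1]
    # indptr[j]:indptr[j+1] delimits column j, so the differences are the
    # number of stored entries per column.
    counts = np.diff(sps_mat.indptr)
    # Column index of every stored entry, in the same order as `data`.
    col_ids = np.repeat(np.arange(n_cols), counts)
    # Sum of squares per column; empty columns get 0 but are never indexed below.
    sq_sums = np.bincount(col_ids, weights=sps_mat.data ** 2, minlength=n_cols)
    col_norm = np.sqrt(sq_sums)
    # Divide each stored value by the norm of its column, writing into data in place.
    sps_mat.data /= col_norm[col_ids]
```

Usage mirrors the row version: call `sparse_col_normalize(mat)` on a CSC matrix and inspect `mat.toarray()` afterwards. Converting to CSR and reusing the row-normalizing idea on the transposed layout would be another option, at the cost of a format conversion.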