What is a fast way to compute the dot product of arbitrary rows from two sparse matrices?

import numpy as np 
from scipy.sparse import csr_matrix 

X = csr_matrix([[1,2,3], [4,5,6], [7,8,9]]) 
Y = csr_matrix([[1,2,3], [4,5,6], [7,8,9], [11,12,13]]) 

# Print matrices 
X.toarray() 
[[1, 2, 3], 
[4, 5, 6], 
[7, 8, 9]] 

Y.toarray() 
[[ 1, 2, 3], 
[ 4, 5, 6], 
[ 7, 8, 9], 
[11, 12, 13]]

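I have a set of index pairs (x, y), each naming a row from X and a row from Y. I want to take the dot product of the corresponding rows, but I can't figure out how to do this efficiently.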

Here is what I have tried:

# build arbitrary combinations of row from X and row from Y. Need to calculate dot product of each pair 
x_idxs = np.array([2,2,1,0]) 
y_idxs = np.arange(Y.shape[0]) 

# current method (slow) 
def get_dot_product(x_idx, y_idx): 
    return np.dot(X[x_idx].toarray()[0], Y[y_idx].toarray()[0]) 

func_args = np.transpose(np.array([x_idxs, y_idxs])) 
np.apply_along_axis(func1d=lambda x: get_dot_product(x[0], x[1]), axis=1, arr=func_args)

This works, but it gets slow as X and Y get large. Is there a more efficient way?

Update

Following up on Warren's elegant but slow solution, here is a better example for testing (along with a benchmark):

X = csr_matrix(np.tile(np.repeat(1, 50000),(10000,1))) 
Y = X 
y_idxs = np.arange(Y.shape[0]) 
x_idxs = y_idxs 

import time 
start_time = time.time() 
func_args = np.transpose(np.array([x_idxs, y_idxs])) 
bg = np.apply_along_axis(func1d=lambda x: get_dot_product(x[0], x[1]), axis=1, arr=func_args) 
print("--- %s seconds ---" % (time.time() - start_time)) # 15.48 seconds 

start_time = time.time() 
ww = X[x_idxs].multiply(Y[y_idxs]).sum(axis=1) 
print("--- %s seconds ---" % (time.time() - start_time)) # 38.29 seconds

Source

2016-03-11 Ben

Have you tried sum(imap(operator.mul, vector1, vector2)) for Python 2.7 [link](https://docs.python.org/2/library/itertools.html), or sum(map(operator.mul, vector1, vector2)) for Python 3.x [link](https://docs.python.org/3/library/itertools.html)? – Yunhe

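Is 10000x50000 a typical size for the matrices you are working with? How many rows do you typically compute these dot products for? (Your updated example uses 'y_idxs = np.arange(Y.shape[0])', in other words, *all* the rows.) –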

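The matrices I am currently working with have dimensions X: 50K x 120K and Y: 250K x 120K, and I need to compute one dot product for every row in Y (each paired with some random row in X). My function takes 6 or 7 minutes to run, and I suspect it can be sped up a lot. – Ben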

With your X, Y, x_idxs and y_idxs, you can do:

In [160]: X[x_idxs].multiply(Y[y_idxs]).sum(axis=1) 
Out[160]: 
matrix([[ 50], 
     [122], 
     [122], 
     [ 74]])

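This uses "fancy" indexing (i.e. indexing with an arbitrary sequence to pull out the desired set of rows), followed by pointwise multiplication and a sum along axis 1 to compute the dot products.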
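To make the three steps explicit, here is a minimal sketch that splits the one-liner apart and checks it against the pair-by-pair dense computation from the question. The intermediate names (`X_rows`, `Y_rows`, `products`, `row_dots`, `reference`) are only illustrative and do not appear in the original answer.

```python
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
Y = csr_matrix([[1, 2, 3], [4, 5, 6], [7, 8, 9], [11, 12, 13]])
x_idxs = np.array([2, 2, 1, 0])
y_idxs = np.arange(Y.shape[0])

# Step 1: fancy indexing builds two sparse matrices with one row per pair.
X_rows = X[x_idxs]  # shape (4, 3)
Y_rows = Y[y_idxs]  # shape (4, 3)

# Step 2: elementwise (not matrix) product; the result is still sparse.
products = X_rows.multiply(Y_rows)

# Step 3: summing along axis 1 leaves one dot product per pair.
row_dots = np.asarray(products.sum(axis=1)).ravel()

# Reference: the dense, pair-by-pair computation used in the question.
reference = np.array([
    np.dot(X[i].toarray()[0], Y[j].toarray()[0])
    for i, j in zip(x_idxs, y_idxs)
])

assert np.array_equal(row_dots, reference)
print(row_dots)  # values 50, 122, 122, 74
```

Note that steps 1 and 2 each allocate a new sparse matrix with as many rows as there are index pairs, which is the cost being traded for avoiding a Python-level loop.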

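The result is a numpy matrix, which you can convert to a regular numpy array and flatten as needed. You can even use the somewhat cryptic A1 attribute (a shortcut for the getA1 method):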

In [178]: p = X[x_idxs].multiply(Y[y_idxs]).sum(axis=1) 

In [179]: p 
Out[179]: 
matrix([[ 50], 
     [122], 
     [122], 
     [ 74]]) 

In [180]: p.A1 
Out[180]: array([ 50, 122, 122, 74])
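If the flat array is what you want every time, the computation and the flattening can be wrapped together. The sketch below is one way to do that; `rowwise_dot` is a hypothetical helper name, not something provided by scipy, and it uses `np.asarray(...).ravel()` instead of `.A1` on the assumption that you would rather not depend on the result being an `np.matrix`.

```python
import numpy as np
from scipy.sparse import csr_matrix


def rowwise_dot(X, Y, x_idxs, y_idxs):
    """Dot product of X[x_idxs[k]] with Y[y_idxs[k]] for every k (hypothetical helper)."""
    x_idxs = np.asarray(x_idxs)
    y_idxs = np.asarray(y_idxs)
    if x_idxs.shape != y_idxs.shape:
        raise ValueError("x_idxs and y_idxs must have the same length")
    # Same expression as above; sum(axis=1) yields a column-shaped result.
    p = X[x_idxs].multiply(Y[y_idxs]).sum(axis=1)
    # Convert to a plain 1-D ndarray.
    return np.asarray(p).ravel()


X = csr_matrix([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
Y = csr_matrix([[1, 2, 3], [4, 5, 6], [7, 8, 9], [11, 12, 13]])

result = rowwise_dot(X, Y, [2, 2, 1, 0], np.arange(Y.shape[0]))
print(result)  # values 50, 122, 122, 74
```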

Update, with timing...

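Here is a complete script that compares the performance of my version with the original, using arrays X and Y that are actually sparse (density of about 0.001, i.e. about 0.1% nonzero elements).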

import numpy as np 
from scipy import sparse 


def get_dot_product(x_idx, y_idx): 
    return np.dot(X[x_idx].toarray()[0], Y[y_idx].toarray()[0]) 

print("Generating random sparse integer matrix X...") 
X = (100000*sparse.rand(50000, 120000, density=0.001, format='csr')).astype(np.int64) 
X.eliminate_zeros() 
print("X has shape %s with %s nonzero elements." % (X.shape, X.nnz)) 
Y = X 
y_idxs = np.arange(Y.shape[0]) 
x_idxs = y_idxs 

import time 
start_time = time.time() 
func_args = np.transpose(np.array([x_idxs, y_idxs])) 
bg = np.apply_along_axis(func1d=lambda x: get_dot_product(x[0], x[1]), axis=1, arr=func_args) 
print("--- %8.5f seconds ---" % (time.time() - start_time)) 

start_time = time.time() 
ww = X[x_idxs].multiply(Y[y_idxs]).sum(axis=1) 
print("--- %8.5f seconds ---" % (time.time() - start_time))

Output:

Generating random sparse integer matrix X... 
X has shape (50000, 120000) with 5999934 nonzero elements. 
--- 18.29916 seconds --- 
--- 0.32749 seconds ---

For less sparse matrices, the speed difference is not as large, and for sufficiently dense matrices, the original version is faster.
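To see where that crossover falls on your own machine, something like the following sketch can be used. It times both approaches at a few densities on a deliberately small matrix; `time_both`, the matrix size, and the chosen densities are assumptions made for illustration, and the timings it prints will vary with hardware and library versions.

```python
import time

import numpy as np
from scipy import sparse


def time_both(density, n_rows=200, n_cols=2000, seed=0):
    """Return (loop_seconds, vectorized_seconds, results_match) for one density."""
    X = sparse.random(n_rows, n_cols, density=density, format='csr',
                      random_state=seed)
    Y = X
    idxs = np.arange(n_rows)

    # Original approach: densify one pair of rows at a time.
    start = time.perf_counter()
    loop = np.array([
        np.dot(X[i].toarray()[0], Y[j].toarray()[0])
        for i, j in zip(idxs, idxs)
    ])
    loop_seconds = time.perf_counter() - start

    # Sparse approach: fancy indexing, elementwise multiply, sum along axis 1.
    start = time.perf_counter()
    vec = np.asarray(X[idxs].multiply(Y[idxs]).sum(axis=1)).ravel()
    vec_seconds = time.perf_counter() - start

    return loop_seconds, vec_seconds, bool(np.allclose(loop, vec))


for density in (0.001, 0.1, 1.0):
    loop_s, vec_s, same = time_both(density)
    print("density %.3f: loop %.5f s, sparse %.5f s, match=%s"
          % (density, loop_s, vec_s, same))
```

The matrices here are far smaller than the 50K x 120K case discussed above, so treat the numbers as a rough indication of the trend rather than a prediction for the full-size problem.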

Source

2016-03-11 05:52:04

This method is much slower than mine for large 'X' and 'Y', so I can't use it. (Still, it is very elegant for small matrices.) – Ben

Ah, sorry, I should have checked the timing myself. Thanks for the detailed update in the question! –

Ben, I added an example with a timing comparison that uses matrices that are actually sparse. –
