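Scipy sparse - distance matrix (Scikit or Scipy)
I want to compute nearest-neighbor clusters on a Scipy sparse matrix returned by scikit-learn's DictVectorizer. However, when I try to compute a distance matrix with scikit-learn, I get an error message with the 'euclidean' distance from both pairwise.euclidean_distances and pairwise.pairwise_distances. I was under the impression that scikit-learn could compute these distance matrices.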
My matrix is highly sparse, with shape: <364402x223209 sparse matrix of type <class 'numpy.float64'> with 728804 stored elements in Compressed Sparse Row format>.
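For scale, here is a back-of-the-envelope sketch of what a fully materialized distance matrix for this many rows would need. It assumes float64 entries and nothing else; the variable names are illustrative only.

```python
# Rough memory estimate for a dense pairwise distance matrix.
# Assumption: 8-byte float64 entries; n_rows is taken from the shape above.
n_rows = 364402
bytes_per_float64 = 8

# Full square matrix, as returned by pairwise_distances.
full_bytes = n_rows * n_rows * bytes_per_float64

# Condensed upper triangle, as returned by pdist.
condensed_bytes = n_rows * (n_rows - 1) // 2 * bytes_per_float64

print("full matrix:      %.2f TB" % (full_bytes / 1e12))
print("condensed matrix: %.2f TB" % (condensed_bytes / 1e12))
```

Both figures are far beyond typical RAM, which suggests that any workable approach has to keep only the k nearest neighbors per row rather than every pairwise distance.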
I have also tried methods from Scipy, such as pdist and kdtree, but received other errors and could not get usable results.
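Can anyone please point me to a solution that would let me efficiently compute the distance matrix and/or the nearest-neighbor results?
Some example code: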
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise
import scipy.spatial
file = 'FileLocation'
data = []
FILE = open(file,'r')
for line in FILE:
    templine = line.strip().split(',')
    data.append({'user':str(int(templine[0])),str(int(templine[1])):int(templine[2])})
FILE.close()
vec = DictVectorizer()
X = vec.fit_transform(data)
result = scipy.spatial.KDTree(X)
Error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/kdtree.py", line 227, in __init__
    self.n, self.m = np.shape(self.data)
ValueError: need more than 0 values to unpack
Similarly, if I run:
scipy.spatial.distance.pdist(X,'euclidean')
I get the following:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 1169, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 113, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.
Finally, running NearestNeighbors from scikit-learn results in a memory error:
nbrs = NearestNeighbors(n_neighbors=10, algorithm='brute')
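For reference, a minimal sketch of one way to keep a brute-force neighbor search on sparse input from building all distances at once: fit on the CSR matrix directly and query it in row batches. The small random matrix is a hypothetical stand-in for X, and `batched_kneighbors` and `batch_size` are illustrative names, not part of scikit-learn. Whether this fits in memory at the full 364402-row size is untested here.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for X: a small random CSR matrix (assumption).
X = sp.random(500, 2000, density=0.001, format='csr', random_state=0)

# Brute-force search accepts scipy sparse input without densifying it.
nbrs = NearestNeighbors(n_neighbors=10, algorithm='brute', metric='euclidean')
nbrs.fit(X)

def batched_kneighbors(model, X, batch_size=100):
    """Query neighbors a slice of rows at a time (illustrative helper)."""
    dist_parts, idx_parts = [], []
    for start in range(0, X.shape[0], batch_size):
        # Each call only handles distances from one batch of query rows.
        dist, idx = model.kneighbors(X[start:start + batch_size])
        dist_parts.append(dist)
        idx_parts.append(idx)
    return np.vstack(dist_parts), np.vstack(idx_parts)

distances, indices = batched_kneighbors(nbrs, X)
print(distances.shape, indices.shape)
```

The result is two arrays of shape (n_rows, n_neighbors), so storage grows with the number of neighbors kept per row instead of with the square of the row count.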
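What kind of error do you get? What code are you running? – jorgeca
Please edit that information into your question: a minimal example showing what you are doing, and the actual error you get. – jorgeca
Thanks! It looks much better now. – jorgeca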