
I'm trying to cluster a collection of documents using scikit-learn's DBSCAN implementation. First, I create a TF-IDF matrix using scikit-learn's TfidfVectorizer (it's a 163405x13029 sparse matrix of type numpy.float64). Then I try to cluster specific subsets of this matrix. When the subset is small (say, a few thousand rows), everything works fine. But for large subsets (tens of thousands of rows), I get ValueError: could not convert integer scalar.
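Roughly, the setup looks like this (a minimal sketch; documents and idxs are placeholders standing in for the actual corpus and index list):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# documents: list of raw text strings; idxs: list of row indices (both placeholders)
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)  # 163405x13029 sparse matrix

ncm_clusterizer = DBSCAN()
ncm_clusterizer.fit_predict(tfidf[idxs])  # works for small idxs, fails for large ones
```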

Here's the full traceback (idxs is a list of indices):


ValueError      Traceback (most recent call last) 
<ipython-input-1-73ee366d8de5> in <module>() 
    193  # use descriptions to clusterize items 
    194  ncm_clusterizer = DBSCAN() 
--> 195  ncm_clusterizer.fit_predict(tfidf[idxs]) 
    196  idxs_clusters = list(zip(idxs, ncm_clusterizer.labels_)) 
    197  for e in idxs_clusters: 

/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in fit_predict(self, X, y, sample_weight) 
    294    cluster labels 
    295   """ 
--> 296   self.fit(X, sample_weight=sample_weight) 
    297   return self.labels_ 

/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in fit(self, X, y, sample_weight) 
    264   X = check_array(X, accept_sparse='csr') 
    265   clust = dbscan(X, sample_weight=sample_weight, 
--> 266      **self.get_params()) 
    267   self.core_sample_indices_, self.labels_ = clust 
    268   if len(self.core_sample_indices_): 

/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in dbscan(X, eps, min_samples, metric, algorithm, leaf_size, p, sample_weight, n_jobs) 
    136   # This has worst case O(n^2) memory complexity 
    137   neighborhoods = neighbors_model.radius_neighbors(X, eps, 
--> 138               return_distance=False) 
    139 
    140  if sample_weight is None: 

/usr/local/lib/python3.4/site-packages/sklearn/neighbors/base.py in radius_neighbors(self, X, radius, return_distance) 
    584    if self.effective_metric_ == 'euclidean': 
    585     dist = pairwise_distances(X, self._fit_X, 'euclidean', 
--> 586           n_jobs=self.n_jobs, squared=True) 
    587     radius *= radius 
    588    else: 

/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds) 
    1238   func = partial(distance.cdist, metric=metric, **kwds) 
    1239 
-> 1240  return _parallel_pairwise(X, Y, func, n_jobs, **kwds) 
    1241 
    1242 

/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds) 
    1081  if n_jobs == 1: 
    1082   # Special case to avoid picklability checks in delayed 
-> 1083   return func(X, Y, **kwds) 
    1084 
    1085  # TODO: in some cases, backend='threading' may be appropriate 

/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared) 
    243   YY = row_norms(Y, squared=True)[np.newaxis, :] 
    244 
--> 245  distances = safe_sparse_dot(X, Y.T, dense_output=True) 
    246  distances *= -2 
    247  distances += XX 

/usr/local/lib/python3.4/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output) 
    184   ret = a * b 
    185   if dense_output and hasattr(ret, "toarray"): 
--> 186    ret = ret.toarray() 
    187   return ret 
    188  else: 

/usr/local/lib/python3.4/site-packages/scipy/sparse/compressed.py in toarray(self, order, out) 
    918  def toarray(self, order=None, out=None): 
    919   """See the docstring for `spmatrix.toarray`.""" 
--> 920   return self.tocoo(copy=False).toarray(order=order, out=out) 
    921 
    922  ############################################################## 

/usr/local/lib/python3.4/site-packages/scipy/sparse/coo.py in toarray(self, order, out) 
    256   M,N = self.shape 
    257   coo_todense(M, N, self.nnz, self.row, self.col, self.data, 
--> 258      B.ravel('A'), fortran) 
    259   return B 
    260 

ValueError: could not convert integer scalar 

I'm using Python 3.4.3 (Red Hat), SciPy 0.18.1, and scikit-learn 0.18.1.

I tried the monkey patch suggested here, but it didn't work.

Googling around, I found a bugfix that apparently solved the same problem for other types of sparse matrices (such as csr), but not for coo.

I tried feeding DBSCAN a sparse radius-neighborhood graph (instead of a feature matrix), as suggested here, but the same error occurred. That attempt looked roughly like the sketch below.
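(A sketch of the precomputed-graph attempt; the eps value is an illustrative placeholder.)

```python
from sklearn.neighbors import radius_neighbors_graph
from sklearn.cluster import DBSCAN

eps = 0.5  # placeholder; must match the eps given to DBSCAN
# mode='distance' stores actual distances, which DBSCAN needs to threshold on eps
graph = radius_neighbors_graph(tfidf[idxs], radius=eps, mode='distance')
labels = DBSCAN(eps=eps, metric='precomputed').fit_predict(graph)
```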

I tried HDBSCAN, but the same error occurred.
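(For the record, the HDBSCAN attempt was roughly this sketch; min_cluster_size is a placeholder, and the same sparse TF-IDF subset was used as input.)

```python
import hdbscan

# placeholder parameter; fed the same sparse TF-IDF subset as above
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(tfidf[idxs])
```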

How can I fix this or work around it?


What is 'idxs' in 'fit_predict(tfidf[idxs])'? Are you only using some of tfidf's values? –


'idxs' is a list of indices (and yes, I'm only using some of tfidf's values: it has ~163k documents in total, but I'm only using ~107k) – Parzival


Have you tried updating your scipy and scikit-learn versions? –

Answer


Even if the implementation allowed it, DBSCAN would probably produce poor results on such very high-dimensional data (statistically speaking, because of the curse of dimensionality).

Instead, I'd suggest using the TruncatedSVD class to reduce the dimensionality of your TF-IDF feature vectors down to 50 or 100 components, and then applying DBSCAN to the result.
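Something along these lines (a sketch; n_components and the DBSCAN parameters are illustrative values to tune):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.cluster import DBSCAN

# project the sparse TF-IDF matrix onto 100 latent components (dense output)
svd = TruncatedSVD(n_components=100, random_state=42)
reduced = svd.fit_transform(tfidf[idxs])

# optionally re-normalize rows, since TruncatedSVD output is not unit-length
reduced = Normalizer(copy=False).fit_transform(reduced)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(reduced)
```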
