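Python memory error with kmeans in scikit-learn

I am running the scikit-learn example "Selecting the number of clusters" in Python. The example takes a number of samples with 2 features and finds the best value of k for clustering.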
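In my case, the samples have 3 features: they are 3-dimensional coordinates. So in the code I only changed the input to my own samples and left everything else unchanged. My number of sample points is large, possibly more than 10,000 points.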
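When I feed in all of my data, I get a memory error (I have 16 GB of RAM, and it fills up completely). When I feed in half of the data, there is no error. Although the IPython notebook reports the error in the silhouette function, I am fairly sure it happens in kmeans: it does not perform the clustering and suddenly jumps to this error.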
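For the same amount of data, I ran k-means clustering in C++, and it was fast and worked without any problems. Any ideas on how to solve this? Here is the error I get: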
MemoryError                               Traceback (most recent call last)
<ipython-input-4-ed4b060ccea1> in <module>()
     41     # This gives a perspective into the density and separation of the formed
     42     # clusters
---> 43     silhouette_avg = silhouette_score(X, cluster_labels)
     44     print("For n_clusters =", n_clusters,
     45           "The average silhouette_score is :", silhouette_avg)

/usr/lib64/python2.7/site-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
     82         else:
     83             X, labels = X[indices], labels[indices]
---> 84     return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
     85
     86

/usr/lib64/python2.7/site-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_samples(X, labels, metric, **kwds)
    141
    142     """
--> 143     distances = pairwise_distances(X, metric=metric, **kwds)
    144     n = labels.shape[0]
    145     A = np.array([_intra_cluster_distance(distances[i], labels, i)

/usr/lib64/python2.7/site-packages/sklearn/metrics/pairwise.pyc in pairwise_distances(X, Y, metric, n_jobs, **kwds)
    649         func = pairwise_distance_functions[metric]
    650         if n_jobs == 1:
--> 651             return func(X, Y, **kwds)
    652         else:
    653             return _parallel_pairwise(X, Y, func, n_jobs, **kwds)

/usr/lib64/python2.7/site-packages/sklearn/metrics/pairwise.pyc in euclidean_distances(X, Y, Y_norm_squared, squared)
    181         distances.flat[::distances.shape[0] + 1] = 0.0
    182
--> 183     return distances if squared else np.sqrt(distances)
    184
    185

MemoryError:
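
Every frame in the traceback above sits under `silhouette_score`, and the last one is `euclidean_distances`, which returns a dense matrix holding one distance per pair of samples. As a rough sketch of why that can exhaust RAM, the snippet below estimates the size of one such matrix. The helper name `dense_distance_matrix_bytes` and the sample counts are illustrative assumptions, not values from the question, and it assumes 8-byte float64 entries and Python 3.

```python
def dense_distance_matrix_bytes(n_samples, bytes_per_value=8):
    """Bytes needed for one dense n_samples x n_samples matrix.

    bytes_per_value=8 assumes float64 entries (an assumption, not
    something stated in the question).
    """
    return n_samples * n_samples * bytes_per_value


# Hypothetical sample counts, chosen only to show the quadratic growth.
for n in (10_000, 20_000, 50_000):
    gb = dense_distance_matrix_bytes(n) / 1e9
    print(f"{n:>6} samples -> {gb:.1f} GB per distance matrix")
```

Under these assumptions, 10,000 samples need about 0.8 GB per matrix and 50,000 samples need about 20 GB. The failing line, `np.sqrt(distances)`, allocates a second array of the same size, so peak usage is roughly double the figure for a single matrix. Halving the number of samples cuts the matrix to a quarter, which would be consistent with half of the data fitting in memory.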
How are you feeding in the data? Maybe it could be generated lazily. – chris
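Like this:

    from os import listdir
    from os.path import isfile, join
    import numpy as np

    mypath = '/Desktop/trainingFiles/'
    onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
    resulted_trajectories = []
    # Note: the loop starts at index 6, so the first six files are skipped.
    for i in range(6, len(onlyfiles)):
        fname = onlyfiles[i]
        filepath = mypath + fname
        with open(filepath, 'r') as f:
            t = f.read().split('\n')
        for line in t:
            if line:  # skip empty lines
                ll = [float(x) for x in line.split(' ')]
                resulted_trajectories.append(ll)
    all_Trajectories = np.array(resulted_trajectories)
    print(all_Trajectories)
    X = all_Trajectories
    range_n_clusters = [4, 5, 6, 7, 8, 9, 10]

– user667222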
Then I use X as the input. – user667222
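
The signature in the traceback, `silhouette_score(X, labels, metric, sample_size, random_state, **kwds)`, shows a `sample_size` parameter. One possible workaround, sketched here under assumptions, is to score a random subsample so the pairwise-distance matrix stays small. The random `X`, the sample size of 1000, and the shortened `range_n_clusters` are placeholders rather than values from the question, and the resulting score is an estimate that may differ from the full-data value.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the real data: 5,000 random 3-D points.
rng = np.random.RandomState(0)
X = rng.rand(5000, 3)

range_n_clusters = [4, 5, 6]
scores = {}
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=10)
    cluster_labels = clusterer.fit_predict(X)
    # sample_size makes silhouette_score work on a random subset of 1,000
    # points, so the distance matrix is 1000 x 1000 instead of n x n.
    scores[n_clusters] = silhouette_score(
        X, cluster_labels, sample_size=1000, random_state=0
    )
    print("For n_clusters =", n_clusters,
          "the sampled silhouette_score is :", scores[n_clusters])
```

KMeans still runs on all points here; only the silhouette computation is subsampled. Fixing `random_state` keeps the subsample reproducible between runs.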