2017-04-21

I have the following code, which uses scikit-learn on some sample text. How can I plot the KMeans text clustering results with matplotlib?

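import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["is this good?", "this is bad", "some other text here", "i am hero", "blue jeans", "red carpet", "red dog", "blue sweater", "red hat", "kitty blue"]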

vect = TfidfVectorizer() 
X = vect.fit_transform(train) 
clf = KMeans(n_clusters=3) 
clf.fit(X) 
centroids = clf.cluster_centers_ 

plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=80, linewidths=5) 
plt.show() 

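What I can't figure out is how to plot the clustering results. X is a csr_matrix. What I want is an (x, y) coordinate for each result so that I can plot it.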

Answer

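Your centroids live in the full TF-IDF feature space: clf.cluster_centers_ has one row per cluster (3) and one column per vocabulary term, so centroids[:, 0] and centroids[:, 1] are just the weights of the first two terms, not plot coordinates. You need some kind of projection or dimensionality reduction to get the centroids into two dimensions. You have a few options; here is an example with t-SNE: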

import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.manifold import TSNE 

train = ["is this good?", "this is bad", "some other text here", "i am hero", "blue jeans", "red carpet", "red dog", 
    "blue sweater", "red hat", "kitty blue"] 

vect = TfidfVectorizer() 
X = vect.fit_transform(train) 
clf = KMeans(n_clusters=3) 
data = clf.fit(X) 
centroids = clf.cluster_centers_ 

tsne_init = 'pca' # could also be 'random' 
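# Recent scikit-learn releases require perplexity < number of points being embedded
# (3 centroids here), so the original value of 20.0 is rejected there.
tsne_perplexity = 2.0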
tsne_early_exaggeration = 4.0 
tsne_learning_rate = 1000 
random_state = 1 
model = TSNE(n_components=2, random_state=random_state, init=tsne_init, perplexity=tsne_perplexity, 
     early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate) 

transformed_centroids = model.fit_transform(centroids) 
print(transformed_centroids)  # one (x, y) row per centroid
plt.scatter(transformed_centroids[:, 0], transformed_centroids[:, 1], marker='x') 
plt.show() 

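In your example, if you initialize t-SNE with PCA you get widely spaced centroids; if you use random initialization, you get tiny centroid coordinates and an uninteresting picture. The exact layout may differ between scikit-learn versions and parameter settings.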
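Since the question asks for an (x, y) coordinate for every text and not only for the centroids, here is a minimal sketch of one of the other options mentioned above. It swaps t-SNE for a linear projection with TruncatedSVD, which accepts the sparse csr_matrix directly and lets the documents and the centroids be projected with the same fitted model. The helper names project_to_2d and plot_clusters, and the n_init and random_state settings, are my own assumptions for illustration and are not part of the original code.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["is this good?", "this is bad", "some other text here", "i am hero",
         "blue jeans", "red carpet", "red dog", "blue sweater", "red hat",
         "kitty blue"]


def project_to_2d(texts, n_clusters=3, random_state=0):
    """Cluster texts in TF-IDF space, then project documents and centroids to 2-D."""
    X = TfidfVectorizer().fit_transform(texts)  # sparse, one row per text
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    km.fit(X)
    # Fit the projection on the documents, then reuse it for the centroids
    # so both end up in the same 2-D coordinate system.
    svd = TruncatedSVD(n_components=2, random_state=random_state)
    doc_xy = svd.fit_transform(X)
    centroid_xy = svd.transform(km.cluster_centers_)
    return doc_xy, centroid_xy, km.labels_


def plot_clusters(texts, doc_xy, centroid_xy, labels):
    """Scatter the documents colored by cluster, with centroids marked by 'x'."""
    import matplotlib.pyplot as plt

    plt.scatter(doc_xy[:, 0], doc_xy[:, 1], c=labels, cmap="viridis")
    plt.scatter(centroid_xy[:, 0], centroid_xy[:, 1],
                marker="x", s=80, linewidths=3, color="red")
    for text, (x, y) in zip(texts, doc_xy):
        plt.annotate(text, (x, y), fontsize=8)
    plt.show()


doc_xy, centroid_xy, labels = project_to_2d(train)
print(doc_xy)       # one (x, y) row per text
print(centroid_xy)  # one (x, y) row per cluster centroid
```

Calling plot_clusters(train, doc_xy, centroid_xy, labels) draws the picture. Two components of a linear projection will not necessarily separate the clusters as clearly as they are separated in the full feature space, so treat the plot as a rough view and not as a faithful map of the clustering.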