2014-10-30 118 views
2

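Plotting multi-dimensional clusters on a 2D plot in Python

I am clustering a large amount of data, which contains two different kinds of clusters.

The first kind is a 6-dimensional cluster, while the second is a 12-dimensional cluster. For now I have decided to use kmeans (it seems the most intuitive clustering algorithm to start with).

The question is how to map these clusters onto a 2D plot so that I can judge whether kmeans is working. I would like to use matplotlib, but any other Python package is fine.

Cluster 1 is made up of these data types (int, float, float, int, float, int).

Cluster 2 is made up of 12 float types.

I am trying to get output similar to this [image]. Any tips would be useful.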

+0

Am I the only one who doesn't know what the clusters are? – farenorth 2014-10-30 05:57:57

+3

@farenorth Yes, you are. – 2014-10-30 10:36:15

+0

To answer the question: search for "[matplotlib kmeans example](https://duckduckgo.com/?q=matplotlib+kmeans+example)" – 2014-10-30 10:38:16

Answers

1

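Well, after searching the internet and finding a lot of strange comments and few solutions, I was able to figure out how to do it. If you are trying to do something similar, here is the code. It contains code from various sources, and a lot of it was written or edited by me. I hope it is easier to understand than the other examples out there.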

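The plotting function is built around kmeans2 from scipy, which returns centroid_list and label_list. kmeansdata is the numpy array that was passed to kmeans2 for clustering, and num_cluster is the number of clusters that was passed to kmeans2.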
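For readers who have not called kmeans2 before, here is a minimal sketch of how the three inputs described above might be produced. The synthetic data, the choice of three clusters, and `minit='points'` are assumptions made for illustration only; they are not part of the original answer.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Synthetic stand-in for the real data (an assumption): 300 points in
# 6 dimensions, scattered around three well-separated centres.
rng = np.random.RandomState(0)
centres = np.array([[0.0] * 6, [10.0] * 6, [-10.0] * 6])
kmeansdata = np.vstack([c + rng.randn(100, 6) for c in centres])

num_cluster = 3

# kmeans2 returns the centroids and, for every row of the input,
# the index of the centroid it was assigned to.
# minit='points' chooses the initial centroids from the data itself.
centroid_list, label_list = kmeans2(kmeansdata, num_cluster, minit='points')

print(centroid_list.shape)  # one row per cluster, one column per dimension
print(label_list.shape)     # one label per data point
```

These are the values that plot_cluster expects as kmeansdata, centroid_list, label_list and num_cluster.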

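The code writes its output to a new PNG file, making sure it does not overwrite an existing one. It also plots only 50 clusters (if you have thousands of clusters, don't try to plot every one of them).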
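The "never overwrite an existing file" idea can be isolated into a small helper. This is only a sketch: the function name `next_free_filename` and its parameters are hypothetical and do not appear in the original code, which does the same search inline.

```python
import os


def next_free_filename(prefix="name", extension=".png", directory="."):
    """Return the first '<prefix><i><extension>' path that does not exist yet.

    Hypothetical helper: tries name0.png, name1.png, ... in `directory`
    and returns the first candidate for which no file is present.
    """
    i = 0
    while True:
        candidate = os.path.join(directory, prefix + str(i) + extension)
        if not os.path.exists(candidate):
            return candidate
        i += 1
```

A call such as `plt.savefig(next_free_filename())` would then write to the first unused index. Note that the check and the write are separate steps, so two processes saving at the same moment could still pick the same name.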

(The code below was written for Python 2.7; it should work on other versions too, I guess.)

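import numpy
import colorsys
import random
import os
# Note: matplotlib.mlab.PCA was removed in matplotlib 3.1,
# so this import needs an older matplotlib release
from matplotlib.mlab import PCA as mlabPCA
from matplotlib import pyplot as plt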


def get_colors(num_colors):
    """
    Generate a list of randomly chosen colors.
    The function first builds a palette of 256 colors with evenly
    spaced hues, then draws num_colors of them at random
    (with replacement, so repeats are possible).
    num_colors    -> Number of colors to return
    colors        -> Palette of 256 RGB tuples
    random_colors -> The num_colors randomly drawn RGB tuples
    """
    colors = []
    random_colors = []
    # Generate 256 different colors and choose num_colors randomly
    for i in numpy.arange(0., 360., 360./256.):
        hue = i/360.
        lightness = (50 + numpy.random.rand() * 10)/100.
        saturation = (90 + numpy.random.rand() * 10)/100.
        colors.append(colorsys.hls_to_rgb(hue, lightness, saturation))

    for i in range(0, num_colors):
        random_colors.append(colors[random.randint(0, len(colors) - 1)])
    return random_colors


def random_centroid_selector(total_clusters, clusters_plotted):
    """
    Pick a random subset of cluster indices to plot on the output PNG.
    total_clusters   -> Total number of clusters
    clusters_plotted -> Maximum number of clusters to plot
    random_list      -> Indices of the clusters to be plotted
    """
    # Sample without replacement so that no cluster is picked twice,
    # and cap the sample size when there are fewer clusters than requested
    count = min(clusters_plotted, total_clusters)
    random_list = random.sample(range(total_clusters), count)
    return random_list

def plot_cluster(kmeansdata, centroid_list, label_list, num_cluster):
    """
    Project the n-dimensional clusters onto 2 dimensions with PCA
    and plot up to 50 randomly chosen clusters.
    name%d.png -> file where the output is stored, indexed
                  by the first available file index
                  e.g. name0.png, name1.png ...
    """
    mlab_pca = mlabPCA(kmeansdata)
    # Keep only the components that explain at least as much variance
    # as the second one, i.e. the first two principal components
    cutoff = mlab_pca.fracs[1]
    users_2d = mlab_pca.project(kmeansdata, minfrac=cutoff)
    centroids_2d = mlab_pca.project(centroid_list, minfrac=cutoff)

    colors = get_colors(num_cluster)
    plt.figure()
    plt.xlim([users_2d[:, 0].min() - 3, users_2d[:, 0].max() + 3])
    plt.ylim([users_2d[:, 1].min() - 3, users_2d[:, 1].max() + 3])

    # Plotting 50 clusters only for now; a set makes the
    # membership tests in the loops below fast
    random_list = set(random_centroid_selector(num_cluster, 50))

    # Plot only the centroids that were randomly selected
    # Centroids are drawn as a large 'o' marker
    for i, position in enumerate(centroids_2d):
        if i in random_list:
            plt.scatter(centroids_2d[i, 0], centroids_2d[i, 1], marker='o', color=colors[i], s=100)

    # Plot only the points whose centroids were plotted
    # Points are drawn as a small '+' marker
    for i, position in enumerate(label_list):
        if position in random_list:
            plt.scatter(users_2d[i, 0], users_2d[i, 1], marker='+', color=colors[position])

    filename = "name"
    i = 0
    while True:
        if not os.path.isfile(filename + str(i) + ".png"):
            # New index found: write the file and stop
            plt.savefig(filename + str(i) + ".png")
            break
        else:
            # Index already taken, try the next one
            i = i + 1
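Because matplotlib.mlab.PCA is no longer available in recent matplotlib releases (it was removed in 3.1), the projection step can be reproduced with plain numpy. The sketch below is a swapped-in technique, not the answer's original method: `project_to_2d` is a hypothetical helper, and standardizing each column before the SVD is an assumption meant to roughly mirror what mlab.PCA did by default.

```python
import numpy as np


def project_to_2d(data, *others):
    """Project `data`, and any extra arrays, onto the first two
    principal axes computed from `data`.

    Hypothetical helper: returns a list of 2-column arrays, the first
    for `data` and one more for each array passed in `others`.
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    standardized = (data - mean) / std

    # Rows of vt are the principal axes, ordered by decreasing variance
    _, _, vt = np.linalg.svd(standardized, full_matrices=False)
    axes = vt[:2].T

    projected = [standardized.dot(axes)]
    for other in others:
        # Apply the same centring and scaling before projecting
        scaled = (np.asarray(other, dtype=float) - mean) / std
        projected.append(scaled.dot(axes))
    return projected


# Example with made-up data: 50 points and 3 centroids in 6 dimensions
rng = np.random.RandomState(0)
example_data = rng.randn(50, 6)
example_centroids = rng.randn(3, 6)
users_2d, centroids_2d = project_to_2d(example_data, example_centroids)
print(users_2d.shape, centroids_2d.shape)
```

The two returned arrays play the same role as users_2d and centroids_2d in plot_cluster, so the rest of the plotting code should work unchanged with them. The sign of each axis may differ from what mlab.PCA produced, which only mirrors the picture.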