2014-01-14

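Silhouette Index for choosing the appropriate number of clusters in KMeans clustering

I am using the Silhouette Index to choose an appropriate number of clusters in KMeans clustering. The code for the Silhouette Index is given here. Based on that code, I wrote my own version (see below). The problem is that, for any dataset, the preferred number of clusters always equals the maximum, which is 15 in this case. Is there an error in my code?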

private double getSilhouetteIndex(double[][] distanceMatrix, ClusterEvaluation ceval)
{
    double si_index = 0;
    double[] ca = ceval.getClusterAssignments();
    double[] d_arr = new double[ca.length];
    List<Double> si_indexes = new ArrayList<Double>();

    for (int i = 0; i < ca.length; i++)
    {
        // STEP 1. Compute the average distance between the i-th point and all other points of a given cluster
        double a = averageDist(distanceMatrix, ca, i, 1);

        // STEP 2. Compute the average distance between the i-th point and all points of other clusters
        for (int j = 0; j < ca.length; j++)
        {
            double d = averageDist(distanceMatrix, ca, j, 2);
            d_arr[j] = d;
        }

        // STEP 3. Compute the distance from the i-th point to the nearest cluster to which it does not belong
        double b = d_arr[0];
        for (Double _d : d_arr)
        {
            if (_d < b)
                b = _d;
        }

        // STEP 4. Compute the Silhouette index for the i-th point
        double si = (b - a) / Math.max(a, b);

        si_indexes.add(si);
    }

    // STEP 5. Compute the average index over all observations
    double sum = 0;
    for (Double _si : si_indexes)
    {
        sum += _si;
    }
    si_index = sum / si_indexes.size();

    return si_index;
}

private double averageDist(double[][] distanceMatrix, double[] ca, int id, int calc)
{
    double avgDist = 0;
    double sum = 0;
    int len = 0;

    // Distances inside the cluster
    if (calc == 1)
    {
        for (int i = 0; i < ca.length; i++)
        {
            if (ca[i] == ca[id] && i != id)
            {
                sum += distanceMatrix[id][i];
                len++;
            }
        }
    }
    // Distances outside the cluster
    else
    {
        for (int i = 0; i < ca.length; i++)
        {
            if (ca[i] != ca[id] && i != id)
            {
                sum += distanceMatrix[id][i];
                len++;
            }
        }
    }

    avgDist = sum / len;

    return avgDist;
}

Answer


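As far as I know, for the Silhouette Index, the average distance to points outside the cluster should be computed over the points from the nearest neighbor cluster, not over all points outside the cluster.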
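
To make this concrete: in STEP 2 of the question's code, `averageDist(distanceMatrix, ca, j, 2)` is called with `j`, not `i`, so `d_arr` does not depend on the point being scored, and `b` ends up as the same global minimum for every point. Below is a minimal sketch of a per-point calculation that averages distances cluster by cluster and takes the smallest foreign-cluster mean as `b`. The class name `SilhouetteSketch`, the method signature, and the choice to score singleton clusters as 0 are assumptions made for illustration, not part of the original code, and I have not checked whether this alone explains the "always 15" result.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; the name and signature are illustrative only.
final class SilhouetteSketch {

    /**
     * Mean silhouette value over all points.
     * distanceMatrix[i][j] is the distance between points i and j;
     * ca[i] is the cluster label of point i (double, as in the question).
     */
    static double silhouetteIndex(double[][] distanceMatrix, double[] ca) {
        int n = ca.length;
        double total = 0;

        for (int i = 0; i < n; i++) {
            // Sum and count of distances from point i, grouped by cluster label.
            Map<Double, Double> sums = new HashMap<>();
            Map<Double, Integer> counts = new HashMap<>();
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                sums.merge(ca[j], distanceMatrix[i][j], Double::sum);
                counts.merge(ca[j], 1, Integer::sum);
            }

            // a: mean distance to the other members of point i's own cluster.
            Integer ownCount = counts.get(ca[i]);
            if (ownCount == null) {
                // Singleton cluster: assumed convention s(i) = 0.
                continue;
            }
            double a = sums.get(ca[i]) / ownCount;

            // b: smallest mean distance from point i to any other cluster.
            double b = Double.POSITIVE_INFINITY;
            for (Map.Entry<Double, Double> e : sums.entrySet()) {
                if (e.getKey().doubleValue() == ca[i]) continue;
                b = Math.min(b, e.getValue() / counts.get(e.getKey()));
            }
            if (Double.isInfinite(b)) {
                // Only one cluster exists: contribute 0 rather than divide by infinity.
                continue;
            }

            double denom = Math.max(a, b);
            total += (denom == 0) ? 0 : (b - a) / denom;
        }
        return total / n;
    }
}
```

With the question's data this would be called as `SilhouetteSketch.silhouetteIndex(distanceMatrix, ceval.getClusterAssignments())`. Grouping by label inside the loop keeps `b` specific to point `i`, which is the property the original STEP 2 loses.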


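What defines the "nearest neighbor cluster"? Is it the cluster whose centroid is closest to the data vector, or the cluster that contains the boundary vector closest to it? The original code searches all clusters for the closest boundary vectors, so the logic looks correct.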
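
For reference, in the usual definition of the silhouette, the nearest neighbor cluster is chosen by neither the centroid nor the single closest vector, but by the smallest mean distance from the point to all members of another cluster. The notation below ($C_i$ for the cluster containing point $i$, $d(i,j)$ for the entry of the distance matrix) is my own shorthand, not taken from the question:

```latex
a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\; j \neq i} d(i, j)
\qquad
b(i) = \min_{C \neq C_i} \; \frac{1}{|C|} \sum_{j \in C} d(i, j)
\qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```

The index reported for a clustering is then the mean of $s(i)$ over all points, which matches STEP 4 and STEP 5 of the question's code once $b(i)$ is computed per point.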