2014-01-09 69 views

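Getting a database attribute from KMeans clustering in WEKA

I have a function that runs the k-means algorithm using WEKA.jar. The function works, and it prints the list of instances to my console. However, I want to display a specific attribute for the instances in the k-means clustering.

Here is my code: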

// Import the required dependencies
import java.io.File;

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.EuclideanDistance;
import weka.core.Instance;
import weka.core.Instances;
import weka.experiment.InstanceQuery;

public class KMeans {

    /* Get the connection credentials from the database manager */
    private DatabaseManager datman = new DatabaseManager();

    private String username = datman.getUsername(); // database username
    private String password = datman.getPassword(); // database password

    public void doProcess() {
        int n = 3; // number of clusters
        String queries = "SELECT idms_kodebarang, aksesoris, bahan, `QTY-SA-1`,`QTY-SA-2`,`QTY-SA-3`,`QTY-SA-4`,`harga` FROM mt_karakterproduk";

        try {
            // Load the data set from the database
            InstanceQuery query = new InstanceQuery();
            File reader = new File("DatabaseUtils.props");
            query.setUsername(username);
            query.setPassword(password);
            query.setQuery(queries);
            query.initialize(reader);
            query.setSparseData(true);
            Instances data = query.retrieveInstances();

            String[] options = weka.core.Utils.splitOptions("-I 100");

            SimpleKMeans kmeans = new SimpleKMeans();
            kmeans.setSeed(10);
            kmeans.setOptions(options);
            // This is the important parameter to set
            kmeans.setNumClusters(n);
            // Needed so that getAssignments() can be called later
            kmeans.setPreserveInstancesOrder(true);
            kmeans.buildClusterer(data);

            EuclideanDistance dist = (EuclideanDistance) kmeans.getDistanceFunction();
            Instances centroids = kmeans.getClusterCentroids();

            // Evaluate the clusterer so that clusterResultsToString() has content
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(kmeans);
            eval.evaluateClusterer(data);

            for (int i = 0; i < centroids.numInstances(); i++) {
                // For each cluster center
                Instance inst = centroids.instance(i);
                // Distance from the first centroid to data instance i
                double dist1 = dist.distance(centroids.firstInstance(), data.instance(i));
                // Internal value of the first attribute only; loop up to
                // inst.numAttributes() to read the other attributes
                double value = inst.value(0);
                System.out.println("Value for centroid " + i + ": " + value + " ::: " + dist1);
            }

            // Use println rather than printf: the results contain '%' characters,
            // which printf would try to interpret as format specifiers
            System.out.println("Cluster Results \n =================== \n " + eval.clusterResultsToString());

            // This array holds the cluster number of each instance;
            // it has as many elements as there are instances
            int[] assignments = kmeans.getAssignments();

            int i = 0;
            for (int clusternum : assignments) {
                System.out.printf("Instance %d - > cluster %d \n", i, clusternum);
                i++;
            }

        } catch (Exception e) {
            System.out.println("Error On KMeans Analysis Exception : " + e.toString());
        }
    }

}

The output is only a list like this:

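  • INFO: Instance 0 - > cluster 2
  • INFO: Instance 2 - > cluster 2
  • INFO: Instance 4 - > cluster 1
  • INFO: Instance 6 - > cluster 2
  • INFO: Instance 8 - > cluster 2
  • INFO: Instance 10 - > cluster 1
  • INFO: Instance 12 - > cluster 2
  • INFO: Instance 14 - > cluster 0
  • INFO: Instance 16 - > cluster 1
  • INFO: Instance 18 - > cluster 1
  • INFO: Instance 20 - > cluster 1
  • INFO: Instance 22 - > cluster 1
  • INFO: Instance 24 - > cluster 0
  • INFO: Instance 26 - > cluster 0
  • INFO: Instance 28 - > cluster 1
  • INFO: Instance 30 - > cluster 1 ... etc.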

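What I need is not just the instance number but a specific attribute from the database, so that the result looks like this (as shown in my Weka application):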

Cluster centroids:
                              Cluster#
Attribute         Full Data   0           1           2
                  (32)        (8)         (15)        (9)
=============================================================================
idms_kodebarang   E501245FF3  E613104F    E501247FF3  E501245FF3
  E501245FF3      1 ( 3%)     0 ( 0%)     0 ( 0%)     1 (11%)
  E501247FF3      1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E820707F$KB     1 ( 3%)     0 ( 0%)     0 ( 0%)     1 (11%)
  E820705F$KB     1 ( 3%)     0 ( 0%)     0 ( 0%)     1 (11%)
  E5016B57FF      1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E5016B59FF      1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E820701F$KB     1 ( 3%)     0 ( 0%)     0 ( 0%)     1 (11%)
  E613104F        1 ( 3%)     1 (12%)     0 ( 0%)     0 ( 0%)
  E820708F$KB     1 ( 3%)     0 ( 0%)     0 ( 0%)     1 (11%)
  E521210F6       1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E5216B10F6      1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E501245C$3KB    1 ( 3%)     0 ( 0%)     0 ( 0%)     1 (11%)
  E501247C$3KB    1 ( 3%)     0 ( 0%)     0 ( 0%)     1 (11%)
  E5FF3           1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E701601F        1 ( 3%)     1 (12%)     0 ( 0%)     0 ( 0%)
  E613105F        1 ( 3%)     1 (12%)     0 ( 0%)     0 ( 0%)
  E600201FC       1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E600105C        1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E620201C        1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E5016B57C$KB    1 ( 3%)     0 ( 0%)     0 ( 0%)     1 (11%)
  E620501H        1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E5016B59C$KB    1 ( 3%)     0 ( 0%)     0 ( 0%)     1 (11%)
  E800601F        1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E880201H        1 ( 3%)     1 (12%)     0 ( 0%)     0 ( 0%)
  E931301F        1 ( 3%)     1 (12%)     0 ( 0%)     0 ( 0%)
  G932201F$       1 ( 3%)     1 (12%)     0 ( 0%)     0 ( 0%)
  E840104FC       1 ( 3%)     1 (12%)     0 ( 0%)     0 ( 0%)
  E600300F        1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E701104F        1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E5016B50FF      1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E702201F        1 ( 3%)     0 ( 0%)     1 ( 6%)     0 ( 0%)
  E502415H6       1 ( 3%)     1 (12%)     0 ( 0%)     0 ( 0%)

How can I achieve this?

Thanks in advance.

Answers


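I am not sure whether this is still relevant, but I hope it helps someone with a similar problem. I am also using the Weka k-means clustering API, and the ClusterEvaluation class should give you output in the form you want. I tried it on the iris dataset and got the following results.

K-means clustering in the Weka tool (with numOfClusters set to 2):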

=== Run information === 

Scheme:  weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10 
Relation:  iris 
Instances: 150 
Attributes: 5 
       sepallength 
       sepalwidth 
       petallength 
       petalwidth 
       class 
Test mode: evaluate on training data 


=== Clustering model (full training set) === 


kMeans 
====== 

Number of iterations: 7 
Within cluster sum of squared errors: 62.1436882815797 

Initial starting points (random): 

Cluster 0: 6.1,2.9,4.7,1.4,Iris-versicolor 
Cluster 1: 6.2,2.9,4.3,1.3,Iris-versicolor 

Missing values globally replaced with mean/mode 

Final cluster centroids: 
              Cluster# 
Attribute    Full Data    0    1 
          (150.0)   (100.0)   (50.0) 
================================================================== 
sepallength     5.8433   6.262   5.006 
sepalwidth     3.054   2.872   3.418 
petallength     3.7587   4.906   1.464 
petalwidth     1.1987   1.676   0.244 
class     Iris-setosa Iris-versicolor  Iris-setosa 




Time taken to build model (full training data) : 0.02 seconds 

=== Model and evaluation on training set === 

Clustered Instances 

0  100 (67%) 
1  50 (33%) 

Clustering the same dataset myself through the Weka API, using the ClusterEvaluation class, produced this result:

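import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// load the dataset (Instances has no constructor that takes a file name)
Instances instances = DataSource.read("iris.arff");
SimpleKMeans simpleKMeans = new SimpleKMeans();

// build clusterer
simpleKMeans.setPreserveInstancesOrder(true);
simpleKMeans.setNumClusters(2);
simpleKMeans.buildClusterer(instances);

// evaluate the clusterer on the same data
ClusterEvaluation eval = new ClusterEvaluation();
eval.setClusterer(simpleKMeans);
eval.evaluateClusterer(instances);

System.out.println("Cluster Evaluation: " + eval.clusterResultsToString());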

Cluster Evaluation results: 
kMeans 
====== 

Number of iterations: 7 
Within cluster sum of squared errors: 62.14368828157972 

Initial starting points (random): 

Cluster 0: 6.1,2.9,4.7,1.4,Iris-versicolor 
Cluster 1: 6.2,2.9,4.3,1.3,Iris-versicolor 

Missing values globally replaced with mean/mode 

Final cluster centroids: 
              Cluster# 
Attribute    Full Data    0    1 
          (150.0)   (100.0)   (50.0) 
================================================================== 
sepallength     5.8433   6.262   5.006 
sepalwidth     3.054   2.872   3.418 
petallength     3.7587   4.906   1.464 
petalwidth     1.1987   1.676   0.244 
class     Iris-setosa Iris-versicolor  Iris-setosa 


Clustered Instances 

0  100 (67%) 
1  50 (33%) 

I arrived at the code above by following these steps

The final print line prints the desired output. Hope this helps someone.
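For anyone who wants the "Clustered Instances" summary without going through ClusterEvaluation, here is a minimal sketch in plain Java. It assumes you already have the assignments array, which with Weka would come from getAssignments() after calling setPreserveInstancesOrder(true). The class name ClusterSizes and its methods are hypothetical helpers, not part of Weka, and the sample data only mimics the 100/50 split from the iris run above.

```java
import java.util.Locale;

public class ClusterSizes {

    /** Counts how many instances were assigned to each of the k clusters. */
    public static int[] countPerCluster(int[] assignments, int k) {
        int[] counts = new int[k];
        for (int cluster : assignments) {
            counts[cluster]++;
        }
        return counts;
    }

    /** Formats one summary line: cluster number, size, rounded percentage. */
    public static String summaryLine(int cluster, int count, int total) {
        long percent = Math.round(100.0 * count / total);
        return String.format(Locale.ROOT, "%d  %d (%d%%)", cluster, count, percent);
    }

    public static void main(String[] args) {
        // Stand-in for the array returned by getAssignments():
        // 100 instances in cluster 0 and 50 instances in cluster 1.
        int[] assignments = new int[150];
        for (int i = 100; i < 150; i++) {
            assignments[i] = 1;
        }

        int[] counts = countPerCluster(assignments, 2);
        for (int c = 0; c < counts.length; c++) {
            System.out.println(summaryLine(c, counts[c], assignments.length));
        }
    }
}
```

With this sample input, the two printed lines are "0  100 (67%)" and "1  50 (33%)".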


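Thanks so much, your answer really helped me ^_^ ... but may I ask one more question: can I print the contents of each cluster by a specific attribute, for example the name? Or is this the only way to display the data?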
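One possible way to list the contents of each cluster by an attribute is to pair the assignments array with the attribute value of each instance and group them yourself. The sketch below is plain Java, so it runs without a database. The class ClusterContents and the method groupByCluster are hypothetical names, not Weka API. With Weka, the assignments would presumably come from kmeans.getAssignments(), and each label from something like data.instance(i).stringValue(0) for a nominal or string attribute. The four product codes and their clusters are taken from the centroid table in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusterContents {

    /**
     * Groups labels by cluster number. labels[i] is the attribute value of
     * instance i, and assignments[i] is the cluster that instance i fell into.
     */
    public static Map<Integer, List<String>> groupByCluster(int[] assignments, String[] labels) {
        if (assignments.length != labels.length) {
            throw new IllegalArgumentException("assignments and labels must have the same length");
        }
        // TreeMap keeps the clusters sorted by cluster number
        Map<Integer, List<String>> clusters = new TreeMap<>();
        for (int i = 0; i < assignments.length; i++) {
            clusters.computeIfAbsent(assignments[i], c -> new ArrayList<>()).add(labels[i]);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Stand-ins for getAssignments() and the idms_kodebarang column
        int[] assignments = {2, 1, 2, 0};
        String[] labels = {"E501245FF3", "E501247FF3", "E820707F$KB", "E613104F"};

        groupByCluster(assignments, labels).forEach((cluster, members) ->
                System.out.println("Cluster " + cluster + ": " + members));
    }
}
```

Keeping the grouping separate from the Weka calls means the same helper works for any attribute you choose to read out, as long as the labels are in the same order as the instances that were clustered.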