Kmeans - group by

I want to compute k-means labels with numClusters = 6 so that I can group by label later.

How do I select the columns that k-means should run on?

import org.apache.spark.mllib.clustering.KMeans
import sqlContext.implicits._

val clusterThis = scaledDF.select($"id", $"setting1", $"setting2", $"setting3")

// dataset description lists six operation modes
val operatingModes = 6

// Cluster the data into six classes using KMeans
val numClusters = operatingModes
val numIterations = 20

val clusters = KMeans.train(clusterThis.rdd, numClusters, numIterations)
clusters.predict(clusterThis)

//... join back on id

2016-04-09 oluies

Are you using 'ML' or 'MLlib'? –

I can use either, whichever is available; I think the code above uses RDD/MLlib – oluies

Ah, ML has a nice example, check https://spark.apache.org/docs/latest/ml-clustering.html – oluies

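As you can see, the KMeans example object uses only one column, features. In that example the column happens to be called features, which is also the default value of the featuresCol parameter. The name is up to you; what matters is that the column must be a Vector (dense or sparse).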

So you need to combine your features (the separate columns) into a single column. You can use VectorAssembler for this task.
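A minimal sketch of that approach, assuming Spark 2.x or later (SparkSession and the DataFrame-based spark.ml API). The object name KMeansGroupBy, the helper cluster, the output column name label, and the generated sample rows are hypothetical stand-ins for the scaledDF in the question; only the column names id, setting1, setting2, and setting3 come from the original post.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

object KMeansGroupBy {

  // Assemble the setting columns into one Vector column, then fit k-means
  // and append the cluster index as a "label" column (hypothetical name).
  def cluster(df: DataFrame, k: Int): DataFrame = {
    val assembler = new VectorAssembler()
      .setInputCols(Array("setting1", "setting2", "setting3"))
      .setOutputCol("features")
    val assembled = assembler.transform(df)

    val kmeans = new KMeans()
      .setK(k)
      .setMaxIter(20)
      .setFeaturesCol("features")
      .setPredictionCol("label")
      .setSeed(1L)

    // transform keeps every input column, including id
    kmeans.fit(assembled).transform(assembled)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kmeans-group-by")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up data: 12 rows spread around six "operating modes"
    val scaledDF = (0 until 12).map { i =>
      val mode = i % 6
      (i, mode * 10.0 + i * 0.01, mode * 5.0, mode.toDouble)
    }.toDF("id", "setting1", "setting2", "setting3")

    val labelled = cluster(scaledDF, 6)
    labelled.groupBy("label").count().show()

    spark.stop()
  }
}
```

Because transform keeps the id column next to the cluster index, no separate join back on id should be needed in this variant; the group by can run directly on the labelled DataFrame.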

By the way, k-means does not work with categorical features. See the post "K-means clustering for mixed numeric and categorical data" for the reasons.

2016-04-09 20:14:57

Nice explanation, and thanks for the reference! :) – eliasah
