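How to revert one-hot encoding in Spark (Scala)
After running k-means (mllib, Spark, Scala), I want to make sense of the cluster centers, which were computed from data preprocessed with, among other transformers, the mllib OneHotEncoder.
The centers look like this: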
Cluster center 0: [0.3496378699559276,0.05482645034473324,111.6962521358467,1.770525792286651,0.0,0.8561916265130964,0.014382183950365071,0.0,0.0,0.0,0.47699722692567864,0.0,0.0,0.0,0.04988557988346689,0.0, 0.0,0.0,0.8981811028926263,0.9695107580117296,0.0,0.0,1.7505886931570156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.771620072281845,0.0,0.0,0.0,0.0]
This is obviously not very human-friendly. Any ideas on how to revert the one-hot encoding and retrieve the original categorical features?
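Thanks! I understand your answer. What if I look for the data point closest to the centroid (using the same distance metric that k-means uses, which I assume is Euclidean distance) and then revert the encoding of that particular data point?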
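@JoãoMoura In that case, I think the simplest thing is to keep an ID on every data point and, once a point has been assigned to its cluster, retrieve the original values by ID. That way you don't need to revert the encoding at all; you just do a simple select/join between the original dataset and the encoded dataset.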
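For what it's worth, here is a minimal sketch of that ID-and-join idea, combined with the "closest point to the centroid" suggestion. It assumes the RDD-based `org.apache.spark.mllib` API and a local Spark context. The `Raw` case class, its fields, the toy rows, and the hand-rolled encoding are all hypothetical stand-ins for the real preprocessing pipeline, so treat it as an illustration rather than the way this must be done.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object CentroidLookup {
  // Hypothetical raw record: holds the categorical value before encoding.
  case class Raw(id: Long, color: String, size: Double)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("centroid-lookup").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Toy data (made up for illustration).
    val raw = sc.parallelize(Seq(
      Raw(1L, "red", 1.0), Raw(2L, "red", 1.2),
      Raw(3L, "blue", 9.0), Raw(4L, "blue", 9.5)))

    // Stand-in for the real transformers: a hand-rolled one-hot encoding.
    // The important part is that every encoded vector keeps its ID.
    val encoded = raw.map { r =>
      val oneHot = if (r.color == "red") Array(1.0, 0.0) else Array(0.0, 1.0)
      (r.id, Vectors.dense(oneHot :+ r.size))
    }.cache()

    val model = KMeans.train(encoded.values, 2, 20)

    // Per cluster, keep the ID of the point with the smallest squared
    // Euclidean distance to that cluster's center.
    val closest = encoded.map { case (id, v) =>
      val cluster = model.predict(v)
      (cluster, (id, Vectors.sqdist(v, model.clusterCenters(cluster))))
    }.reduceByKey((a, b) => if (a._2 <= b._2) a else b)

    // Join back to the original records by ID; no decoding is needed.
    val representatives = closest
      .map { case (cluster, (id, _)) => (id, cluster) }
      .join(raw.keyBy(_.id))
      .map { case (_, (cluster, record)) => (cluster, record) }

    representatives.collect().sortBy(_._1).foreach(println)
    sc.stop()
  }
}
```

Each printed pair is a cluster index together with the original, human-readable record that sits nearest to that cluster's center. Dropping the `reduceByKey` step and joining every `(id, cluster)` pair instead would give the full cluster assignment of the original dataset.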