-2

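After running k-means (MLlib, Spark, Scala), I want to make sense of the cluster centers I obtained from data that I preprocessed with (among other transformers) MLlib's OneHotEncoder. How can I reverse one-hot encoding in Spark (Scala)?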

The centers look like this:

Cluster center 0 [0.3496378699559276,0.05482645034473324,111.6962521358467,1.770525792286651,0.0,0.8561916265130964,0.014382183950365071,0.0,0.0,0.0,0.47699722692567864,0.0,0.0,0.0,0.04988557988346689,0.0, 0.0,0.0,0.8981811028926263,0.9695107580117296,0.0,0.0,1.7505886931570156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.771620072281845,0.0,0.0,0.0,0.0]

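This is obviously not very human-friendly. Any ideas on how to reverse the one-hot encoding and retrieve the original categorical features? What if I look up the data point closest to the centroid (using the same distance metric that k-means uses, which I assume is Euclidean distance) and then reverse the encoding of that particular data point?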

Answer

1

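For the cluster centroids it is not possible (and strongly discouraged) to reverse the encoding. Imagine you have the original feature "3" out of 6, encoded as [0.0,0.0,1.0,0.0,0.0,0.0]. In this case it is easy to extract 3 as the correct feature from the encoding.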
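As a minimal sketch of that easy case, in plain Scala without Spark: the object and function names below are hypothetical, and the function only accepts vectors with exactly one entry equal to 1.0 and zeros elsewhere. Note that Spark's OneHotEncoder drops the last category by default (`dropLast`), so an all-zero vector can be a legitimate encoding there; this sketch does not handle that case.

```scala
object OneHotDecode {
  // Returns the zero-based index of the single 1.0 entry, or None when the
  // vector is not a clean one-hot vector (for example a centroid with
  // fractional entries, or an all-zero vector).
  def decodeOneHot(v: Seq[Double]): Option[Int] = {
    val nonZero = v.zipWithIndex.filter { case (x, _) => x != 0.0 }
    nonZero match {
      case Seq((1.0, i)) => Some(i)
      case _             => None
    }
  }
}
```

With the vector above, `OneHotDecode.decodeOneHot(Seq(0.0, 0.0, 1.0, 0.0, 0.0, 0.0))` gives `Some(2)`, the third slot, which is feature "3" when categories are counted from 1.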

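But after applying k-means you may get a cluster centroid whose entries for this feature look like [0.0,0.13,0.0,0.77,0.1,0.0]. If you decode this back to the previous representation, for example as "4" out of 6 because slot 4 has the largest value, you lose information and the model may be corrupted.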
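To illustrate what gets lost, here is a small sketch in plain Scala (hypothetical names, no Spark). It assumes the one-hot columns were not rescaled before clustering; under that assumption a centroid is the component-wise mean of the member vectors, so each slot holds the fraction of cluster members in that category.

```scala
object CentroidBlock {
  // Component-wise mean of equally sized vectors.
  def mean(points: Seq[Seq[Double]]): Seq[Double] = {
    require(points.nonEmpty, "need at least one point")
    points.transpose.map(col => col.sum / points.size)
  }

  // Zero-based index of the largest entry: what a naive "decode" would pick.
  def argmax(v: Seq[Double]): Int = {
    require(v.nonEmpty, "need at least one entry")
    v.indices.maxBy(i => v(i))
  }
}
```

For the centroid above, `CentroidBlock.argmax` picks index 3 (slot "4"), but the remaining 0.13 and 0.1 are silently dropped. Reading the block as category proportions keeps that information.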

Edit: adding a workable way to recover the encoding for data points, taken from the comments on this answer

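If you have IDs on the data points, you can run a select/join on the ID after assigning each data point to a cluster, and so get back its state from before the encoding.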
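A rough sketch of that idea in plain Scala collections, combined with picking the point closest to a centroid. The `Encoded` case class, the function names, and the `rawById` map are all hypothetical stand-ins; in Spark the lookup would correspond to a join on the ID column between the raw and the encoded datasets.

```scala
object NearestById {
  // Hypothetical record type: `id` ties an encoded row back to its raw row.
  final case class Encoded(id: Long, features: Seq[Double])

  // Squared Euclidean distance; it has the same minimizer as the plain distance.
  def squaredDistance(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.size == b.size, "vectors must have the same length")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // ID of the encoded point closest to the centroid.
  def nearestId(centroid: Seq[Double], points: Seq[Encoded]): Long = {
    require(points.nonEmpty, "need at least one point")
    points.minBy(p => squaredDistance(centroid, p.features)).id
  }

  // Look up the raw (pre-encoding) value of the closest point by its ID.
  def nearestRaw[A](centroid: Seq[Double],
                    points: Seq[Encoded],
                    rawById: Map[Long, A]): Option[A] =
    rawById.get(nearestId(centroid, points))
}
```

The result is a real data point with its original categorical values, so nothing has to be decoded. It is a representative of the cluster, though, not the centroid itself.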

+0

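Thanks! I understand your answer. What if I look up the data point closest to the centroid (using the same distance metric that k-means uses, which I assume is Euclidean distance) and then reverse the encoding of that particular data point?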

+1

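@JoãoMoura Then I think the easiest thing is to have an ID on every data point and, once a point has been assigned to its cluster, retrieve the original values by ID. That way you don't need to reverse the encoding at all; you just run a simple select/join between the original dataset and the encoded dataset.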