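How can I get the vocabArray from an LDA model (org.apache.spark.ml.clustering.LDA)? I only get vocabSize, which returns the number of words scanned. How do I convert the topic indices into topic words in LDA?
Ideally, I need the array of actual words from the model, and then, based on termIndices, I want to see the words in each bucket.
I need to do this in Scala. Any suggestions would be helpful.
What I have tried so far: my topicIndices is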
topicIndices: org.apache.spark.sql.DataFrame = [topic: int, termIndices: array<int>, termWeights: array<double>]
I tried to map it like this
val topics = topicIndices.map { case (terms, termWeights) =>
terms.zip(termWeights).map { case (term, weight) => (vocabArray(term.toInt), weight) }
}
to get a DataFrame containing the topic words, but it throws the following error:
<console>:96: error: constructor cannot be instantiated to expected type;
 found   : (T1, T2)
 required: org.apache.spark.sql.Row
       val topics = topicIndices.map { case (terms, termWeights) =>
                                            ^
<console>:97: error: not found: value terms
       terms.zip(termWeights).map { case (term, weight) => (vocabArray(term.toInt), weight) }
       ^
Are you using spark-shell? – eliasah
I am using a Databricks notebook for this experiment. – Nabs
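The problem is that in the old mllib LDA, describeTopics returned an array of topics, where each topic was a pair (term indices, term weights in the topic). In ml LDA, describeTopics returns a DataFrame [topic: int, termIndices: array<int>, termWeights: array<double>]. Earlier it was easy to map over the pairs; how should we map this new structure? – Nabs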
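One possible way to do the mapping asked about above, as a rough sketch rather than a definitive answer. The error occurs because the rows of a DataFrame are org.apache.spark.sql.Row values, so a tuple pattern such as `case (terms, termWeights)` cannot match them; the fields have to be read from the Row instead. The ml LDAModel exposes only vocabSize, so this sketch assumes the features were produced by a fitted CountVectorizerModel, whose `vocabulary` array supplies the words. The names `cvModel`, `ldaModel`, `resolveTerms` and `topicWords` are hypothetical and do not come from the question.

```scala
import org.apache.spark.ml.clustering.LDAModel
import org.apache.spark.ml.feature.CountVectorizerModel

// Hypothetical helper: pair each term index with its word and weight.
def resolveTerms(
    vocab: Array[String],
    termIndices: scala.collection.Seq[Int],
    termWeights: scala.collection.Seq[Double]): Seq[(String, Double)] =
  termIndices.zip(termWeights).map { case (i, w) => (vocab(i), w) }.toSeq

// Assumption: cvModel is the CountVectorizerModel that built the LDA input
// column, so its vocabulary indices line up with termIndices.
def topicWords(
    ldaModel: LDAModel,
    cvModel: CountVectorizerModel,
    maxTerms: Int = 10): Array[(Int, Seq[(String, Double)])] = {
  val vocabArray: Array[String] = cvModel.vocabulary
  // describeTopics returns one row per topic, so collecting it is cheap.
  ldaModel.describeTopics(maxTerms).collect().map { row =>
    val topic   = row.getAs[Int]("topic")
    val indices = row.getAs[scala.collection.Seq[Int]]("termIndices")
    val weights = row.getAs[scala.collection.Seq[Double]]("termWeights")
    (topic, resolveTerms(vocabArray, indices, weights))
  }
}
```

In a notebook this could be called as `topicWords(ldaModel, cvModel).foreach(println)`. If the vocabulary was built some other way (for example with HashingTF, which keeps no vocabulary), this approach does not apply as written.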