
How do I get the vocabArray from an LDA model (org.apache.spark.ml.clustering.LDA)? I can only get vocabSize, which returns the number of words scanned. How do I map the topic indices back to the actual topic words in LDA?

Ideally, I need the array of actual words in the model, and then, based on the termIndices, I want to see the words inside each topic bucket.

I need to do this in Scala. Any suggestions would be helpful.

What I have tried so far: my topicIndices is

topicIndices: org.apache.spark.sql.DataFrame = [topic: int, termIndices: array<int>, termWeights: array<double>] 

and I tried to build a DataFrame containing the topics like this:

val topics = topicIndices.map { case (terms, termWeights) =>
  terms.zip(termWeights).map { case (term, weight) => (vocabArray(term.toInt), weight) }
}

but it throws the following error:


val topics = topicIndices.map { case (terms, termWeights) =>
  terms.zip(termWeights).map { case (term, weight) => (vocabArray(term.toInt), weight) }
}

<console>:96: error: constructor cannot be instantiated to expected type;
 found   : (T1, T2)
 required: org.apache.spark.sql.Row
    val topics = topicIndices.map { case (terms, termWeights) =>
                                        ^
<console>:97: error: not found: value terms
    terms.zip(termWeights).map { case (term, weight) => (vocabArray(term.toInt), weight) }
    ^
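
The error occurs because the elements of a DataFrame are org.apache.spark.sql.Row objects, not tuples, so the case (terms, termWeights) pattern cannot match. A minimal sketch of one way around this, assuming a vocabulary array vocabArray: Array[String] is already in scope (for example taken from a fitted CountVectorizerModel's vocabulary), is:

val topics = topicIndices.rdd.map { row =>
  // Each row has the schema [topic: int, termIndices: array<int>, termWeights: array<double>].
  val termIndices = row.getAs[Seq[Int]]("termIndices")
  val termWeights = row.getAs[Seq[Double]]("termWeights")
  // Look up each term index in the vocabulary and pair it with its weight.
  termIndices.zip(termWeights).map { case (idx, weight) => (vocabArray(idx), weight) }
}
topics.collect().foreach(println)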

Are you using spark-shell? – eliasah


I am using a Databricks notebook for this experiment. – Nabs


The problem is that in the old mllib, LDA's describeTopics used to return an array indexed by topic, where each topic is (term indices, term weights within the topic). In ml LDA, describeTopics returns [topic: int, termIndices: array<int>, termWeights: array<double>]. It was easy to map over the key-value pairs before; how should we map over this new structure? – Nabs
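
For reference, a sketch of how the old mllib mapping looked, contrasted with the new ml return type. The names oldModel (an org.apache.spark.mllib.clustering.LDAModel) and vocabArray (the vocabulary array) are hypothetical:

// Old mllib API: describeTopics returns Array[(Array[Int], Array[Double])],
// one (termIndices, termWeights) pair per topic, so a direct zip was enough.
val oldTopics: Array[(Array[Int], Array[Double])] = oldModel.describeTopics(maxTermsPerTopic = 10)
val oldDescribed = oldTopics.map { case (termIndices, termWeights) =>
  termIndices.zip(termWeights).map { case (i, w) => (vocabArray(i), w) }
}
// The ml API's describeTopics instead returns a DataFrame with schema
// [topic: int, termIndices: array<int>, termWeights: array<double>],
// so the rows have to be unpacked as shown in the answer below.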

Answer


Solved the problem. This was the missing piece. Once you have the DataFrame from describeTopics, the following code helps fetch the corresponding words. (Note: this code works with the ml library's LDA.)

val topicDF = model.describeTopics(maxTermsPerTopic = 10)

// Collect the (small) topics DataFrame to the driver so the printlns show up locally.
for (row <- topicDF.collect()) {
  val topicNumber = row.get(0)
  val topicTerms  = row.get(1)
  println("Topic: " + topicNumber)
}

import scala.collection.mutable.WrappedArray 

// `vectorizer` is the CountVectorizerModel that was fit on the corpus.
val vocab = vectorizer.vocabulary

for (row <- topicDF.collect()) {
  val topicNumber = row.get(0)
  val terms: WrappedArray[Int] = row.get(1).asInstanceOf[WrappedArray[Int]]
  // Print the first four terms of each topic, resolved through the vocabulary.
  for (termIdx <- 0 until 4) {
    println("Topic:" + topicNumber + " Word:" + vocab(terms(termIdx)))
  }
}

topicDF.printSchema
import org.apache.spark.sql.Row

topicDF.collect().foreach { r =>
  r match {
    case row: Row => println("Topic:" + row)
    case unknown  => println("Something Else")
  }
}

topicDF.collect().foreach { r =>
  println("Topic:" + r(0))
  val terms: WrappedArray[Int] = r(1).asInstanceOf[WrappedArray[Int]]
  terms.foreach { t =>
    println("Term:" + vocab(t))
  }
}
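
If the term weights are wanted alongside the words (which is what the original question asked for), a compact variant along the same lines is sketched below. It still assumes vocab is the vocabulary of the fitted CountVectorizerModel:

import org.apache.spark.sql.Row

topicDF.collect().foreach {
  case Row(topic: Int, termIndices: Seq[Int @unchecked], termWeights: Seq[Double @unchecked]) =>
    // Pair each vocabulary word with its weight within the topic.
    val wordsWithWeights = termIndices.zip(termWeights).map { case (i, w) => (vocab(i), w) }
    println(s"Topic $topic: " + wordsWithWeights.mkString(", "))
}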