
How do I get the vocabArray from an LDA model (org.apache.spark.ml.clustering.LDA)? I can only get vocabSize, which returns the number of words scanned. How do I map the topic indices back to the actual topic words in LDA?

Ideally, I need the array of actual words in the model, and then, based on the termIndices, I want to see the words inside each topic bucket.

I need to do this in Scala. Any suggestions would be helpful.

What I have tried so far: my topicIndices is

topicIndices: org.apache.spark.sql.DataFrame = [topic: int, termIndices: array<int>, termWeights: array<double>] 

and I tried to build a DataFrame containing the topics like this:

val topics = topicIndices.map { case (terms, termWeights) =>
  terms.zip(termWeights).map { case (term, weight) => (vocabArray(term.toInt), weight) }
}

but it throws the following error:


val topics = topicIndices.map { case (terms, termWeights) =>
  terms.zip(termWeights).map { case (term, weight) => (vocabArray(term.toInt), weight) }
}

<console>:96: error: constructor cannot be instantiated to expected type;
 found   : (T1, T2)
 required: org.apache.spark.sql.Row
    val topics = topicIndices.map { case (terms, termWeights) =>
                                        ^
<console>:97: error: not found: value terms
    terms.zip(termWeights).map { case (term, weight) => (vocabArray(term.toInt), weight) }
    ^
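
The error occurs because the elements of a DataFrame are org.apache.spark.sql.Row objects, not tuples, so the case (terms, termWeights) pattern cannot match. A minimal sketch of one way around this, assuming a vocabulary array vocabArray: Array[String] is already in scope (for example taken from a fitted CountVectorizerModel's vocabulary), is:

val topics = topicIndices.rdd.map { row =>
  // Each row has the schema [topic: int, termIndices: array<int>, termWeights: array<double>].
  val termIndices = row.getAs[Seq[Int]]("termIndices")
  val termWeights = row.getAs[Seq[Double]]("termWeights")
  // Look up each term index in the vocabulary and pair it with its weight.
  termIndices.zip(termWeights).map { case (idx, weight) => (vocabArray(idx), weight) }
}
topics.collect().foreach(println)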

Are you using spark-shell? – eliasah


I am using a Databricks notebook for this experiment. – Nabs


The problem is that in the old mllib, LDA's describeTopics used to return an array indexed by topic, where each topic is (term indices, term weights within the topic). In ml LDA, describeTopics returns [topic: int, termIndices: array<int>, termWeights: array<double>]. It was easy to map over the key-value pairs before; how should we map over this new structure? – Nabs
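
For reference, a sketch of how the old mllib mapping looked, contrasted with the new ml return type. The names oldModel (an org.apache.spark.mllib.clustering.LDAModel) and vocabArray (the vocabulary array) are hypothetical:

// Old mllib API: describeTopics returns Array[(Array[Int], Array[Double])],
// one (termIndices, termWeights) pair per topic, so a direct zip was enough.
val oldTopics: Array[(Array[Int], Array[Double])] = oldModel.describeTopics(maxTermsPerTopic = 10)
val oldDescribed = oldTopics.map { case (termIndices, termWeights) =>
  termIndices.zip(termWeights).map { case (i, w) => (vocabArray(i), w) }
}
// The ml API's describeTopics instead returns a DataFrame with schema
// [topic: int, termIndices: array<int>, termWeights: array<double>],
// so the rows have to be unpacked as shown in the answer below.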

Answer


Solved the problem. This was the missing piece. Once you have the DataFrame from describeTopics, the following code helps fetch the corresponding words. (Note: this code works with the ml library's LDA.)

val topicDF = model.describeTopics(maxTermsPerTopic = 10)

// Collect the (small) topics DataFrame to the driver so the printlns show up locally.
for (row <- topicDF.collect()) {
  val topicNumber = row.get(0)
  val topicTerms  = row.get(1)
  println("Topic: " + topicNumber)
}

import scala.collection.mutable.WrappedArray 

// `vectorizer` is the CountVectorizerModel that was fit on the corpus.
val vocab = vectorizer.vocabulary

for (row <- topicDF.collect()) {
  val topicNumber = row.get(0)
  val terms: WrappedArray[Int] = row.get(1).asInstanceOf[WrappedArray[Int]]
  // Print the first four terms of each topic, resolved through the vocabulary.
  for (termIdx <- 0 until 4) {
    println("Topic:" + topicNumber + " Word:" + vocab(terms(termIdx)))
  }
}

topicDF.printSchema
import org.apache.spark.sql.Row

topicDF.collect().foreach { r =>
  r match {
    case row: Row => println("Topic:" + row)
    case unknown  => println("Something Else")
  }
}

topicDF.collect().foreach { r =>
  println("Topic:" + r(0))
  val terms: WrappedArray[Int] = r(1).asInstanceOf[WrappedArray[Int]]
  terms.foreach { t =>
    println("Term:" + vocab(t))
  }
}
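
If the term weights are wanted alongside the words (which is what the original question asked for), a compact variant along the same lines is sketched below. It still assumes vocab is the vocabulary of the fitted CountVectorizerModel:

import org.apache.spark.sql.Row

topicDF.collect().foreach {
  case Row(topic: Int, termIndices: Seq[Int @unchecked], termWeights: Seq[Double @unchecked]) =>
    // Pair each vocabulary word with its weight within the topic.
    val wordsWithWeights = termIndices.zip(termWeights).map { case (i, w) => (vocab(i), w) }
    println(s"Topic $topic: " + wordsWithWeights.mkString(", "))
}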