2016-11-01 29 views

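gensim LDA module: always getting a uniform topic distribution when predicting

I have a set of documents, and I want to know the topic distribution of each document (for different values of the number of topics). I took a toy program from this question. I first trained the LDA model provided by gensim, and then fed the training data back in as test data to get the topic distribution of each training document. But I always get a uniform topic distribution.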

Here is the code I used:

import gensim 
import logging 
logging.basicConfig(filename="logfile",format='%(message)s', level=logging.INFO) 


def get_doc_topics(lda, bow): 
    gamma, _ = lda.inference([bow]) 
    topic_dist = gamma[0]/sum(gamma[0]) # normalize distribution 
    return topic_dist 

documents = ['Human machine interface for lab abc computer applications', 
      'A survey of user opinion of computer system response time', 
      'The EPS user interface management system', 
      'System and human system engineering testing of EPS', 
      'Relation of user perceived response time to error measurement', 
      'The generation of random binary unordered trees', 
      'The intersection graph of paths in trees', 
      'Graph minors IV Widths of trees and well quasi ordering', 
      'Graph minors A survey'] 

texts = [[word for word in document.lower().split()] for document in documents] 
dictionary = gensim.corpora.Dictionary(texts) 
id2word = {} 
for word in dictionary.token2id:  
    id2word[dictionary.token2id[word]] = word 
mm = [dictionary.doc2bow(text) for text in texts] 
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=2, update_every=1, chunksize=10000, passes=1,minimum_probability=0.0) 

newdocs=["human system"] 
print lda[dictionary.doc2bow(newdocs)] 

newdocs=["Human machine interface for lab abc computer applications"] #same as 1st doc in training 
print lda[dictionary.doc2bow(newdocs)] 

The toy code above outputs:

[(0, 0.5), (1, 0.5)] 
[(0, 0.5), (1, 0.5)] 

I checked several more examples, but they all end up giving the same equal-probability result.

Here is the log file that was generated (i.e. the logger's output):

adding document #0 to Dictionary(0 unique tokens: []) 
built Dictionary(42 unique tokens: [u'and', u'minors', u'generation', u'testing', u'iv']...) from 9 documents (total 69 corpus positions) 
using symmetric alpha at 0.5 
using symmetric eta at 0.5 
using serial LDA version on this node 
running online LDA training, 2 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 50x with a convergence threshold of 0.001000 
too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy 
-5.796 per-word bound, 55.6 perplexity estimate based on a held-out corpus of 9 documents with 69 words 
PROGRESS: pass 0, at document #9/9 
topic #0 (0.500): 0.057*"of" + 0.043*"user" + 0.041*"the" + 0.040*"trees" + 0.039*"interface" + 0.036*"graph" + 0.030*"system" + 0.027*"time" + 0.027*"response" + 0.026*"eps" 
topic #1 (0.500): 0.088*"of" + 0.061*"system" + 0.043*"survey" + 0.040*"a" + 0.036*"graph" + 0.032*"trees" + 0.032*"and" + 0.032*"minors" + 0.031*"the" + 0.029*"computer" 
topic diff=0.539396, rho=1.000000 

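It says "too few updates, training might not converge", so I increased the number of passes to 1000, but the output is still the same. (I also tried increasing the number of topics, although that has nothing to do with convergence.)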

Answer


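The problem lies in how the variable newdocs is converted into a gensim document. dictionary.doc2bow() does expect a list, but a list of words. You passed it a list of documents, so "human system" was interpreted as a single word. No such word exists in the training vocabulary, so it was ignored. To make the point clearer, look at the output of the following code: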

import gensim 
documents = ['Human machine interface for lab abc computer applications', 
      'A survey of user opinion of computer system response time', 
      'The EPS user interface management system', 
      'System and human system engineering testing of EPS', 
      'Relation of user perceived response time to error measurement', 
      'The generation of random binary unordered trees', 
      'The intersection graph of paths in trees', 
      'Graph minors IV Widths of trees and well quasi ordering', 
      'Graph minors A survey'] 

texts = [[word for word in document.lower().split()] for document in documents] 
dictionary = gensim.corpora.Dictionary(texts) 

print dictionary.doc2bow("human system".split()) 
print dictionary.doc2bow(["human system"]) 
print dictionary.doc2bow(["human"]) 
print dictionary.doc2bow(["foo"]) 
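
For readers without gensim at hand, the same effect can be reproduced with a simplified stand-in. This is only a sketch: simple_doc2bow and token2id below are hypothetical names, not gensim's implementation, and the sketch assumes nothing more than "count the tokens that are in the vocabulary and silently drop the rest".

```python
from collections import Counter

# Two of the training documents are enough to build a small vocabulary.
training_docs = [
    "Human machine interface for lab abc computer applications",
    "System and human system engineering testing of EPS",
]

# Map each lowercased, whitespace-split token to an integer id.
token2id = {}
for doc in training_docs:
    for token in doc.lower().split():
        token2id.setdefault(token, len(token2id))


def simple_doc2bow(tokens):
    """Hypothetical helper: count known tokens, silently drop unknown ones."""
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())


# The whole string is treated as ONE token, which is not in the vocabulary.
whole_string = simple_doc2bow(["human system"])
# Splitting first yields two tokens that are both in the vocabulary.
split_tokens = simple_doc2bow("human system".split())

print(whole_string)  # []
print(split_tokens)  # [(0, 1), (8, 1)]
```

The unsplit string produces an empty bag-of-words, and that empty input is what the model was asked to score in the question.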

So to correct the code above, all you have to do is change newdocs as follows:

newdocs = "human system".lower().split() 
newdocs = "Human machine interface for lab abc computer applications".lower().split() 

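By the way, the behaviour you observed, getting the same probability for every topic, is simply the topic distribution of an empty document, i.e. a uniform distribution.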
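
The arithmetic behind that uniform result can be sketched in a few lines. The sketch assumes, as I understand the variational update, that the per-document parameter gamma is the prior alpha plus the expected word counts assigned to each topic. The value 0.5 for alpha is taken from the log line "using symmetric alpha at 0.5"; the variable names are illustrative only.

```python
num_topics = 2

# Symmetric prior, as reported in the log: "using symmetric alpha at 0.5".
alpha = [0.5] * num_topics

# An empty bag-of-words contributes no word counts to any topic.
expected_counts = [0.0] * num_topics

# Assumption: gamma = alpha + expected per-topic word counts.
gamma = [a + c for a, c in zip(alpha, expected_counts)]

# Normalizing gamma gives the document's topic distribution.
total = sum(gamma)
topic_dist = [g / total for g in gamma]

print(topic_dist)  # [0.5, 0.5]
```

With no words in the document, only the symmetric prior remains, so every topic receives the same share regardless of how long the model was trained.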


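Perfect! Thanks! There is one more thing I need to understand. The main goal of all this work, as mentioned in the question, is to get the topic distribution of each document. Is there a better way to get it after running LDA than the small hack I used in my code (feeding the training set back in as the test set)? – MysticForce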