2017-07-17 35 views

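I am running LDA on my text data, using the example here. My question is:
How can I find out which documents correspond to which topics? In other words, which documents talk about topic 1, for example?
Python scikit-learn: get the documents for each LDA topic

Here are my steps: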

n_features = 1000 
n_topics = 8 
n_top_words = 20 

I read my text file line by line:

with open('dataset.txt', 'r') as data_file:
    # One document per line, with surrounding whitespace removed
    mydata = [line.strip() for line in data_file]

A function to print the topics:

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("TopiC#%d:" % topic_idx)
        # argsort is ascending, so slice from the end to get the highest-weighted words
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
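
The slice `[:-n_top_words - 1:-1]` in the function above can be hard to read at a glance. As a minimal sketch of what it does, here is the same idea in plain Python, with `sorted` standing in for NumPy's `argsort`. The helper `top_indices`, the toy weights, and the toy vocabulary are all made up for illustration.

```python
def top_indices(weights, n_top):
    # Indices in ascending order of weight, like numpy's argsort
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the end and stop after n_top items
    return order[:-n_top - 1:-1]

# Hypothetical topic-word weights and vocabulary
weights = [0.1, 0.7, 0.05, 0.4]
vocab = ["oil", "skin", "lamp", "hair"]

best = top_indices(weights, 2)
top_words = [vocab[i] for i in best]
print(top_words)
```

The result lists indices from the highest weight downward, which is why the printed topic words appear in order of importance.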

Vectorize the data:

from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                token_pattern='\\b\\w{2,}\\w+\\b',
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(mydata)
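
To see which tokens the `token_pattern` above lets through, it can be tried on its own with the standard `re` module. This is only a rough sketch: the helper `tokenize` and the sample sentence are made up, and it ignores the stop-word and `min_df`/`max_df` filtering that `CountVectorizer` applies as well.

```python
import re

# Same pattern as in the vectorizer, written as a raw string
TOKEN_PATTERN = r'\b\w{2,}\w+\b'

def tokenize(text):
    # Lowercase first, then keep every match of the pattern
    return re.findall(TOKEN_PATTERN, text.lower())

# Hypothetical sample line
tokens = tokenize("An oil in water emulsion, pH 7")
print(tokens)
```

Because `\w{2,}\w+` needs at least three word characters, one- and two-character tokens are dropped.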

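Initialize LDA:

from sklearn.decomposition import LatentDirichletAllocation

# n_topics was renamed to n_components in scikit-learn 0.19;
# use n_topics=3 on older releases
lda = LatentDirichletAllocation(n_components=3, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)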

Run LDA on the TF data:

lda.fit(tf) 

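Print the results using the function above:

print("\nTopics in LDA model:")
# On scikit-learn 1.0 and later, use get_feature_names_out() instead
tf_feature_names = tf_vectorizer.get_feature_names()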

print_top_words(lda, tf_feature_names, n_top_words) 

The printed output is:

Topics in LDA model: 
TopiC#0: 
solar road body lamp power battery energy beacon 
TopiC#1: 
skin cosmetic hair extract dermatological aging production active 
TopiC#2: 
cosmetic oil water agent block emulsion ingredients mixture 

Answer


You need to transform the data:

doc_topic = lda.transform(tf) 

Then list each document along with its highest-scoring topic, like this:

for n in range(doc_topic.shape[0]): 
    topic_most_pr = doc_topic[n].argmax() 
    print("doc: {} topic: {}\n".format(n,topic_most_pr))
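
To go the other way and answer "which documents talk about topic 1?", the document-topic matrix can be grouped by topic. Below is a minimal sketch, assuming `doc_topic` has one row per document and one column per topic, as returned by `lda.transform(tf)`. The helper names, the 0.3 threshold, and the toy matrix are made up for illustration; the functions only rely on indexing and iteration, so they should also accept the NumPy array from `transform`.

```python
from collections import defaultdict

def docs_by_top_topic(doc_topic):
    """Map each topic index to the documents whose highest weight is that topic."""
    groups = defaultdict(list)
    for doc_idx, row in enumerate(doc_topic):
        weights = list(row)
        best = max(range(len(weights)), key=lambda t: weights[t])
        groups[best].append(doc_idx)
    return dict(groups)

def docs_above_threshold(doc_topic, topic_idx, threshold=0.3):
    """Documents giving topic_idx at least `threshold` weight, strongest first."""
    hits = [(row[topic_idx], i) for i, row in enumerate(doc_topic)
            if row[topic_idx] >= threshold]
    return [i for _, i in sorted(hits, reverse=True)]

# Hypothetical document-topic weights (4 documents, 3 topics)
toy_doc_topic = [
    [0.80, 0.10, 0.10],
    [0.05, 0.60, 0.35],
    [0.10, 0.30, 0.60],
    [0.70, 0.20, 0.10],
]

groups = docs_by_top_topic(toy_doc_topic)
topic1_docs = docs_above_threshold(toy_doc_topic, 1)
print(groups)
print(topic1_docs)
```

The first helper assigns every document to exactly one topic. The second lets a document count toward several topics, which may suit LDA better since each document is modeled as a mixture of topics.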