Python scikit-learn: get the documents per topic for LDA

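I am running LDA on some text data, using the example here. My question is:
How can I find out which documents correspond to which topic? In other words, which documents talk about topic 1, for example?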

Here are my steps:

n_features = 1000 
n_topics = 8 
n_top_words = 20

I read my text file line by line:

# Each line of the file is treated as one document.
with open('dataset.txt', 'r') as data_file:
    mydata = [line.strip() for line in data_file]

A function to print the topics:

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("TopiC#%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
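The slice `[:-n_top_words - 1:-1]` in the function above can be hard to read at first. As a minimal sketch, assuming a made-up five-word vocabulary and made-up topic weights (neither comes from the real model), this is how it picks the highest-weighted words:

```python
import numpy as np

# Hypothetical vocabulary and topic weights, for illustration only.
feature_names = ["solar", "skin", "oil", "lamp", "water"]
topic = np.array([0.9, 0.1, 0.3, 0.7, 0.2])
n_top_words = 3

# argsort() returns indices ordered by ascending weight; the slice
# walks backwards from the end and keeps the n_top_words largest.
top_indices = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]
print(top_words)
```

The words come out ordered from the largest weight to the smallest.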

Vectorize the data:

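from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words counts; the token pattern keeps words of three or more characters.
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                token_pattern=r'\b\w{2,}\w+\b',
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(mydata)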
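To see what the token pattern above lets through, it can be tried on its own with the standard `re` module. This is only a sketch: the sample sentence is invented, and the stop-word removal and `min_df`/`max_df` filtering that CountVectorizer also applies are not reproduced here.

```python
import re

# Same pattern as passed to CountVectorizer: at least three word characters.
token_pattern = r"\b\w{2,}\w+\b"

# Invented sample text; lower-cased here to mimic the vectorizer's default.
sample = "An oil in water emulsion of 5 ml"
tokens = re.findall(token_pattern, sample.lower())
print(tokens)
```

One- and two-character tokens such as "an", "5" and "ml" are dropped before any counting happens.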

Initialize LDA:

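from sklearn.decomposition import LatentDirichletAllocation

# scikit-learn 0.19 renamed n_topics to n_components.
lda = LatentDirichletAllocation(n_components=3, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)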

Run LDA on the TF data:

lda.fit(tf)

Print the results using the function above:

print("\nTopics in LDA model:") 
# get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0.
tf_feature_names = tf_vectorizer.get_feature_names_out()

print_top_words(lda, tf_feature_names, n_top_words)

The printed output is:

Topics in LDA model: 
TopiC#0: 
solar road body lamp power battery energy beacon 
TopiC#1: 
skin cosmetic hair extract dermatological aging production active 
TopiC#2: 
cosmetic oil water agent block emulsion ingredients mixture

Source: 2017-07-17 passion

You need to transform the data:

doc_topic = lda.transform(tf)

and then list each document with its highest-scoring topic, like this:

for n in range(doc_topic.shape[0]): 
    topic_most_pr = doc_topic[n].argmax() 
    print("doc: {} topic: {}\n".format(n,topic_most_pr))
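Since the question asks for the documents belonging to each topic, the per-document loop above can be inverted. The following is a sketch rather than part of the original answer: the `doc_topic` array is a made-up stand-in for the result of `lda.transform(tf)`, and `docs_per_topic` is a hypothetical name.

```python
from collections import defaultdict

import numpy as np

# Made-up stand-in for doc_topic = lda.transform(tf):
# one row per document, one column per topic.
doc_topic = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.20, 0.70],
    [0.60, 0.30, 0.10],
    [0.05, 0.90, 0.05],
])

# Highest-scoring topic of every document in one call.
dominant_topic = doc_topic.argmax(axis=1)

# Invert the mapping: topic index -> list of document indices.
docs_per_topic = defaultdict(list)
for doc_idx, topic_idx in enumerate(dominant_topic):
    docs_per_topic[int(topic_idx)].append(doc_idx)

for topic_idx in sorted(docs_per_topic):
    print("topic {}: docs {}".format(topic_idx, docs_per_topic[topic_idx]))
```

With the real matrix, the document indices refer to positions in `mydata`, so `mydata[doc_idx]` gives back the text of each document.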

Source: 2017-07-17 14:56:01 AHC

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation.transform

The transform method takes the document-word matrix X as input and returns the document-topic distribution.

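So, if you pass every document to transform, you can then look for the documents that have a high (enough for your purposes) proportion of the topic you are interested in.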
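A possible way to do that filtering is sketched below. The `doc_topic` values are invented in place of the real `lda.transform(tf)` output, and both `topic_of_interest` and the 0.5 `threshold` are assumptions to adjust for your own data.

```python
import numpy as np

# Made-up stand-in for doc_topic = lda.transform(tf).
doc_topic = np.array([
    [0.55, 0.40, 0.05],
    [0.10, 0.20, 0.70],
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
])

topic_of_interest = 0  # hypothetical choice
threshold = 0.5        # hypothetical cut-off

# Documents whose share of the chosen topic reaches the threshold.
scores = doc_topic[:, topic_of_interest]
matching_docs = np.flatnonzero(scores >= threshold)

# Order the matches from the strongest topic share to the weakest.
ranked = matching_docs[np.argsort(scores[matching_docs])[::-1]]
print(ranked.tolist())
```

Unlike taking the argmax, a threshold lets one document count towards several topics, or towards none.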

Source: 2017-07-17 14:54:57
