2017-07-19

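Creating an n-gram word cloud with Python

I am trying to generate a word cloud from bigrams. I can extract the top 30 discriminative bigrams for each category, but the two words of a bigram are not displayed together in the plot. My word cloud image still looks like a unigram cloud. I used the following script together with the scikit-learn package.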

import numpy
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def create_wordcloud(pipeline):
    """
    Create word cloud with top 30 discriminative words for each category
    """
    class_labels = numpy.array(['Arts', 'Music', 'News', 'Politics', 'Science', 'Sports', 'Technology'])

    feature_names = pipeline.named_steps['vectorizer'].get_feature_names()
    word_text = []

    for i, class_label in enumerate(class_labels):
        # Indices of the 30 features with the largest coefficients for this class
        top30 = numpy.argsort(pipeline.named_steps['clf'].coef_[i])[-30:]

        print("%s: %s" % (class_label, " ".join(feature_names[j] + "," for j in top30)))

        for j in top30:
            word_text.append(feature_names[j])
        #print(word_text)
        wordcloud1 = WordCloud(width=800, height=500, margin=10, random_state=3, collocations=True).generate(' '.join(word_text))

        # Save word cloud as .png file
        # Image files are saved to the folder "classification_model"
        wordcloud1.to_file(class_label + "_wordcloud.png")

        # Plot wordcloud on console
        plt.figure(figsize=(15, 8))
        plt.imshow(wordcloud1, interpolation="bilinear")
        plt.axis("off")
        plt.show()
        word_text = []

This is my pipeline code:

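from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# stop_words1 is my own stop-word list (definition not shown here)
pipeline = Pipeline([
    # SVM using TfidfVectorizer
    ('vectorizer', TfidfVectorizer(max_features=25000, ngram_range=(2, 2), sublinear_tf=True, max_df=0.95, min_df=2, stop_words=stop_words1)),
    ('clf', LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3))
])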

This is the output for my "Arts" category:

Arts: cosmetics businesspeople, television personality, reality television, television presenters, actors london, film producers, actresses television, indian film, set index, actresses actresses, television actors, century actors, births actors, television series, century actresses, actors television, stand comedian, television personalities, television actresses, comedian actor, stand comedians, film actresses, film actors, film directors 

Answers

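I think you need to join the words of each n-gram in feature_names with some symbol other than a space. I would suggest an underscore, for example. As it stands, I think this part turns your n-grams back into separate words: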

' '.join(word_text) 
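
To make the effect concrete, here is a minimal sketch in plain Python that compares the two ways of joining. It assumes that whitespace splitting is a reasonable stand-in for how the word cloud tokenizes its input text; the `count_tokens` helper is hypothetical, and the sample bigrams are copied from the "Arts" output in the question.

```python
from collections import Counter

# A few bigram features copied from the "Arts" output in the question.
bigrams = ["television series", "film directors", "television actors"]

def count_tokens(text):
    # Assumption: splitting on whitespace approximates the word cloud tokenizer.
    return Counter(text.split())

# Joining the raw bigrams with spaces erases the boundaries between them,
# so every bigram falls apart into single words.
space_joined = count_tokens(" ".join(bigrams))

# Replacing the inner space first keeps each bigram as one token.
underscore_joined = count_tokens(" ".join(b.replace(" ", "_") for b in bigrams))

print(space_joined)
print(underscore_joined)
```

In the first count "television" appears as a word of its own, while in the second each bigram is counted once as a single unit.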

I think you have to replace the spaces with underscores here:

word_text.append(feature_names[j]) 

Change it to this:

word_text.append(feature_names[j].replace(' ', '_')) 
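
If the underscores in the image are unwanted, a different route (not what this answer proposes) is to skip the joined text entirely and hand the word cloud a mapping from phrase to weight. The sketch below only builds that mapping; `bigram_weights` is a hypothetical helper, the toy lists stand in for `get_feature_names()` and `coef_[i]`, and dropping non-positive weights is an assumption about what the word cloud expects. As far as I know, the wordcloud package accepts such a mapping through `WordCloud.generate_from_frequencies`, shown only as a comment here.

```python
def bigram_weights(feature_names, coefficients, top_n=30):
    """Map the top_n highest-weighted n-grams to their coefficients."""
    ranked = sorted(zip(coefficients, feature_names), reverse=True)[:top_n]
    # Assumption: keep only positive weights, since they act as frequencies.
    return {name: float(weight) for weight, name in ranked if weight > 0}

# Hypothetical toy data standing in for get_feature_names() and coef_[i].
feature_names = ["television series", "film directors", "set index", "indian film"]
coefficients = [0.9, 0.7, -0.2, 0.4]

weights = bigram_weights(feature_names, coefficients, top_n=3)
print(weights)

# With the wordcloud package installed, the mapping could be drawn like this,
# each key being placed as one phrase with its space intact:
# from wordcloud import WordCloud
# cloud = WordCloud(width=800, height=500).generate_from_frequencies(weights)
```

Using the classifier coefficients as sizes also makes the more discriminative bigrams appear larger, which the joined-text approach cannot do because every feature occurs exactly once.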
It did not work. It replaced everything with (_), joining all the words without any break. – VKB

I edited my answer. Have you tried something like this? – CrazyElf

Thanks, it works. – VKB