List the words in a vocabulary according to occurrence in a text corpus, Scikit-Learn

I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequencies in the text corpus, in order to select stop words. For example

'and' 123 times, 'to' 100 times, 'for' 90 times, ... and so on

Is there a built-in function for this?

Source

2013-04-18 user1506145

If cv is your CountVectorizer and X is the vectorized corpus, then

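import numpy as np
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
list(zip(cv.get_feature_names_out(),
         np.asarray(X.sum(axis=0)).ravel()))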

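returns a list of (term, frequency) pairs, one for each distinct term that the CountVectorizer extracted from the corpus.

(The little asarray + ravel dance is needed to work around a quirk in scipy.sparse: summing a sparse matrix along an axis yields a 1 x n np.matrix rather than a flat array.)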

Source

2013-04-18 09:01:36

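Thanks! They are not sorted, though, but I managed to do it like this: for tuple in sorted(occ_list, key=lambda idx: idx[1]): print tuple[0] + ' ' + str(tuple[1]). The problem is that the characters are not printed correctly, even though I have set the encoding to utf8. – user1506145

Are you sure that get_feature_names() orders the terms according to their index in the term-frequency matrix? I found that cv.get_feature_names() and cv.vocabulary_.keys() do not have the same order. – user1506145

@user1506145: 'dict.keys' doesn't guarantee any order; that is exactly why 'get_feature_names' exists. –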

There is no built-in. I found a faster way to do this, based on Ando Saabas's answer:

from sklearn.feature_extraction.text import CountVectorizer 
texts = ["Hello world", "Python makes a better world"] 
vec = CountVectorizer().fit(texts) 
bag_of_words = vec.transform(texts) 
sum_words = bag_of_words.sum(axis=0) 
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()] 
sorted(words_freq, key = lambda x: x[1], reverse=True)

Output (the order of terms with equal counts may differ between versions)

[('world', 2), ('python', 1), ('hello', 1), ('better', 1), ('makes', 1)]

Source

2018-01-28 18:45:16
