
Term frequency and document frequency of words

I have written some code to find the term frequency and document frequency of the words contained in the files stored at a path location. Each file is passed through the function cleanDoc(), which extracts the words from the text file, and I want the term frequencies filed in a tabular manner so that all the words from all the documents are considered when computing the counts. Can anybody tell me how I should implement this? I am using only NLTK.

import collections 
import os.path 
import glob 
import nltk 

wdict = set()  # distinct cleaned words seen across all corpus files

path = "C://Python27//Corpus Files//*.*" 

#this function cleans up a doc (removes stopwords etc) 
def cleanDoc(doc): 
    stopset = set(nltk.corpus.stopwords.words('english')) 
    stemmer = nltk.PorterStemmer() 
    tokens = nltk.WordPunctTokenizer().tokenize(doc) 
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 3 and token.isalpha()] 
    final = [stemmer.stem(word) for word in clean] 
    return final 

for text in glob.glob(path):
    # read each file and add its cleaned words to the set
    with open(text) as f:
        data = f.read()
    words = cleanDoc(data)
    wdict.update(words)
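
For reference, cleanDoc() behaves roughly like this (a minimal illustration; the sentence is made up, and NLTK's stopword corpus must already be downloaded):

print cleanDoc("These documents were filled with common words")
# -> ['document', 'fill', 'common', 'word']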

Answer


You can use a FreqDist object from nltk.probability to count these words. You can then navigate it through a dictionary-like key/value interface and methods such as freq.items() and freq['word'], or even plot the results with matplotlib.
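
For example (a minimal sketch with made-up sample tokens, just to show the interface):

from nltk.probability import FreqDist

# build a distribution from a toy list of tokens
freq = FreqDist(['apple', 'banana', 'apple', 'cherry', 'apple'])

print freq['apple']   # 3 -- dict-like lookup of a single count
print freq.items()    # [('apple', 3), ('banana', 1), ('cherry', 1)] in older NLTK (sorted by count)
print freq.max()      # 'apple', the most frequent sample
# freq.plot() draws the distribution if matplotlib is installed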

import collections 
import os.path 
import glob 
import nltk 
from nltk.probability import FreqDist 


term_frequency = {} 

path = "C://Python27//Corpus Files//*.*" 

#this function cleans up a doc (removes stopwords etc) 
def cleanDoc(doc): 
    stopset = set(nltk.corpus.stopwords.words('english')) 
    stemmer = nltk.PorterStemmer() 
    tokens = nltk.WordPunctTokenizer().tokenize(doc) 
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 3 and token.isalpha()] 
    final = [stemmer.stem(word) for word in clean] 
    return final 

for text in glob.glob(path):
    # read and clean each file in the corpus
    with open(text) as f:
        data = f.read()
    words = cleanDoc(data)
    number_of_words = len(words)
    freq = FreqDist(words)  # count each stem in this document
    # term_frequency is a dict whose structure is like:
    # {
    #  'path_to_file':
    #   {'term': 0.134, 'another_term': 0.15},
    #  'another_file':
    #   {'term2': 0.12, 'foo': 0.15}
    # }
    for term in freq.keys():
        # relative frequency of the term, keyed by file path;
        # float() avoids Python 2 integer division truncating to 0
        if text in term_frequency:
            term_frequency[text][term] = float(freq[term]) / number_of_words
        else:
            term_frequency[text] = {term: float(freq[term]) / number_of_words}

Reference: https://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html


It does not take into account all the words from all the documents saved at the path. I am only getting counts for a few terms. – DummyGuy


My bad, fixing it now. –


Sorry, but it doesn't work. I want the count of words with reference to the document. – DummyGuy
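
For the document-frequency half of the question, which the answer above does not cover, one possible sketch (an assumption, reusing cleanDoc() and path from the question) counts each stem at most once per file:

import glob
import collections

# document frequency: in how many files does each stem appear?
document_frequency = collections.Counter()

for text in glob.glob(path):
    with open(text) as f:
        words = cleanDoc(f.read())
    # passing a set so each stem is counted once per document
    document_frequency.update(set(words))

# document_frequency['some_stem'] is now the number of files containing that stem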