2012-05-21 123 views
2

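Separate nltk.FreqDist words into two lists?

I have a series of texts that are instances of a custom WebText class. Each text is an object with a rating (-10 to +10) and a word count (an nltk.FreqDist) associated with it: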

>>> trainingTexts = [WebText('train1.txt'), WebText('train2.txt'), WebText('train3.txt'), WebText('train4.txt')]
>>> trainingTexts[1].rating
10
>>> trainingTexts[1].freq_dist
<FreqDist: 'the': 60, ',': 49, 'to': 38, 'is': 34,...>

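How can I get two lists (or dictionaries): one containing every word used exclusively in the positive texts (trainingTexts[].rating > 0), and another containing every word used exclusively in the negative texts (trainingTexts[].rating < 0)? Each list should hold the total word counts across all the positive or all the negative texts, so that the result looks something like this: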

>>> only_positive_words
[('sky', 10), ('good', 9), ('great', 2)...]
>>> only_negative_words
[('earth', 10), ('ski', 9), ('food', 2)...]

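I have considered using sets, since a set contains only unique items, but I can't see how to do this with nltk.FreqDist, and on top of that a set would not be ordered by word frequency. Any ideas?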

+0

What happens to texts with rating == 0? – dhg

+0

@dhg They are simply ignored. – Zach

Answers

2

OK, let's say you start with this for testing purposes:

import nltk

class Rated(object):
    def __init__(self, rating, freq_dist):
        self.rating = rating
        self.freq_dist = freq_dist

a = Rated(5, nltk.FreqDist('the boy sees the dog'.split())) 
b = Rated(8, nltk.FreqDist('the cat sees the mouse'.split())) 
c = Rated(-3, nltk.FreqDist('some boy likes nothing'.split())) 

trainingTexts = [a,b,c] 

Then your code would look like this:

from collections import defaultdict 
from operator import itemgetter 

# dictionaries for keeping track of the counts 
pos_dict = defaultdict(int) 
neg_dict = defaultdict(int) 

for r in trainingTexts:
    rating = r.rating
    freq = r.freq_dist

    # choose the appropriate counts dict
    if rating > 0:
        partition = pos_dict
    elif rating < 0:
        partition = neg_dict
    else:
        continue  # texts rated 0 are ignored

    # add the information to the correct counts dict
    for word, count in freq.items():
        partition[word] += count

# Turn the counts dictionaries into lists of descending-frequency words,
# keeping only the words that never occur in the other dictionary
def only_list(counts, filtered):
    return sorted(((w, c) for w, c in counts.items() if w not in filtered),
                  key=itemgetter(1),
                  reverse=True)

only_positive_words = only_list(pos_dict, neg_dict) 
only_negative_words = only_list(neg_dict, pos_dict) 

The result is:

>>> only_positive_words 
[('the', 4), ('sees', 2), ('dog', 1), ('cat', 1), ('mouse', 1)] 
>>> only_negative_words 
[('nothing', 1), ('some', 1), ('likes', 1)] 
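
The set idea from the question can also be made to work. The following is only a sketch under stated assumptions, not the method used above: it relies on nltk.FreqDist being a subclass of collections.Counter (true in NLTK 3), and it uses plain Counter objects and (rating, freq_dist) tuples as hypothetical stand-ins for the WebText instances so that it runs without NLTK. The function name exclusive_words is made up for this illustration.

```python
from collections import Counter

def exclusive_words(texts):
    """Return (only_positive, only_negative) lists of (word, count) pairs.

    `texts` is an iterable of (rating, freq_dist) pairs; this tuple layout
    is a hypothetical stand-in for objects with .rating and .freq_dist.
    """
    pos_counts, neg_counts = Counter(), Counter()
    for rating, freq_dist in texts:
        if rating > 0:
            pos_counts.update(freq_dist)   # adds the counts word by word
        elif rating < 0:
            neg_counts.update(freq_dist)
        # texts rated exactly 0 are ignored

    # words seen on both sides are excluded from both results
    shared = pos_counts.keys() & neg_counts.keys()
    only_pos = Counter({w: c for w, c in pos_counts.items() if w not in shared})
    only_neg = Counter({w: c for w, c in neg_counts.items() if w not in shared})

    # most_common() returns (word, count) pairs in descending count order
    return only_pos.most_common(), only_neg.most_common()

# Counter stands in for nltk.FreqDist here (assumption: same counting behaviour)
texts = [
    (5, Counter('the boy sees the dog'.split())),
    (8, Counter('the cat sees the mouse'.split())),
    (-3, Counter('some boy likes nothing'.split())),
]
only_positive_words, only_negative_words = exclusive_words(texts)
```

The set intersection is used only to find the shared words; the ordering by frequency comes from most_common(), which covers the concern that a set by itself is unordered. The relative order of words with equal counts may differ between Python versions.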
+0

But the words that appear in only_positive_words and only_negative_words are not unique. There is some overlap between the two. – Zach

+0

For example, if "boy" appears in both positive and negative texts, it should not appear in either list: neither in only_positive_words nor in only_negative_words. – Zach

+0

@Zach, updated to address your comment. – dhg