Splitting nltk.FreqDist words into two lists?

I have a series of texts that are instances of a custom WebText class. Each text is an object with a rating (-10 to +10) and a word count (an nltk.FreqDist) associated with it:

>>> trainingTexts = [WebText('train1.txt'), WebText('train2.txt'), WebText('train3.txt'), WebText('train4.txt')]
>>> trainingTexts[1].rating
10
>>> trainingTexts[1].freq_dist
<FreqDist: 'the': 60, ',': 49, 'to': 38, 'is': 34,...>

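How can I now get two lists (or dictionaries): one containing every word that is used exclusively in the positively rated texts (trainingTexts[i].rating > 0), and another containing every word that is used exclusively in the negatively rated texts (trainingTexts[i].rating < 0)? Each list should hold the total word counts across all the positive or all the negative texts, so that the result looks something like this: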

>>> only_positive_words
[('sky', 10), ('good', 9), ('great', 2)...]
>>> only_negative_words
[('earth', 10), ('ski', 9), ('food', 2)...]

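I have considered using sets, since sets contain only unique items, but I can't see how this could be done with nltk.FreqDist, and on top of that, a set would not be ordered by word frequency. Any ideas?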

Asked by Zach, 2012-05-21

What happens to texts with rating == 0? – dhg

@dhg They are simply ignored. – Zach

OK, suppose you start with this for testing purposes:

import nltk

# Minimal stand-in for WebText: just a rating and a frequency distribution
class Rated:
    def __init__(self, rating, freq_dist):
        self.rating = rating
        self.freq_dist = freq_dist

a = Rated(5, nltk.FreqDist('the boy sees the dog'.split()))
b = Rated(8, nltk.FreqDist('the cat sees the mouse'.split()))
c = Rated(-3, nltk.FreqDist('some boy likes nothing'.split()))

trainingTexts = [a, b, c]

Then your code would look like:

from collections import defaultdict
from operator import itemgetter

# dictionaries for keeping track of the counts
pos_dict = defaultdict(int)
neg_dict = defaultdict(int)

for r in trainingTexts:
    rating = r.rating
    freq = r.freq_dist

    # choose the appropriate counts dict; texts rated 0 are skipped
    if rating > 0:
        partition = pos_dict
    elif rating < 0:
        partition = neg_dict
    else:
        continue

    # add the information to the correct counts dict
    for word, count in freq.items():
        partition[word] += count

# Turn the counts dictionaries into lists of descending-frequency words,
# dropping every word that also occurs in the other (filtered) dictionary
def only_list(counts, filtered):
    exclusive = [(w, c) for w, c in counts.items() if w not in filtered]
    return sorted(exclusive, key=itemgetter(1), reverse=True)

only_positive_words = only_list(pos_dict, neg_dict)
only_negative_words = only_list(neg_dict, pos_dict)

The result is (words with equal counts may come out in a different order):

>>> only_positive_words 
[('the', 4), ('sees', 2), ('dog', 1), ('cat', 1), ('mouse', 1)] 
>>> only_negative_words 
[('nothing', 1), ('some', 1), ('likes', 1)]
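
As a variation on the answer above, the same split can be sketched with collections.Counter, which sums counts through update() and sorts through most_common(). This is only a sketch under assumptions: the Text namedtuple and the split_exclusive function are hypothetical names introduced here, and plain Counter objects stand in for nltk.FreqDist so that the example runs without NLTK installed.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the WebText/Rated objects from the thread
Text = namedtuple("Text", ["rating", "freq_dist"])

def split_exclusive(texts):
    """Return (only_positive, only_negative) as lists of (word, count),
    sorted by descending count. Texts rated 0 are skipped."""
    pos, neg = Counter(), Counter()
    for t in texts:
        if t.rating > 0:
            pos.update(t.freq_dist)  # update() adds counts, it does not overwrite
        elif t.rating < 0:
            neg.update(t.freq_dist)
    # Keep only the words that never occur on the other side
    only_pos = Counter({w: c for w, c in pos.items() if w not in neg})
    only_neg = Counter({w: c for w, c in neg.items() if w not in pos})
    return only_pos.most_common(), only_neg.most_common()

texts = [
    Text(5, Counter("the boy sees the dog".split())),
    Text(8, Counter("the cat sees the mouse".split())),
    Text(-3, Counter("some boy likes nothing".split())),
]
only_positive_words, only_negative_words = split_exclusive(texts)
print(only_positive_words[:2])  # [('the', 4), ('sees', 2)]
```

Because 'boy' occurs in both a positive and a negative text, it is left out of both result lists.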

Answered by dhg, 2012-05-21 15:39:49

However, the words in only_positive_words and only_negative_words are not unique. There is some overlap between the two. – Zach

For example, if "boy" appears in both positive and negative texts, it should not appear in either list, neither in only_positive_words nor in only_negative_words. – Zach

@Zach, updated for your comment. – dhg
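
The set idea from the question can still be used for the exclusivity rule discussed in these comments, because dictionary key views support set operations. The sketch below assumes the summed counts are already available; the two dictionaries are typed in by hand from the three example texts, and the variable names are illustrative only.

```python
# Summed counts for the three example texts (entered by hand for illustration)
pos_counts = {"the": 4, "boy": 1, "sees": 2, "dog": 1, "cat": 1, "mouse": 1}
neg_counts = {"some": 1, "boy": 1, "likes": 1, "nothing": 1}

# Key views behave like sets: intersection finds the overlap,
# difference finds the words exclusive to the positive side
shared = pos_counts.keys() & neg_counts.keys()
positive_only = pos_counts.keys() - neg_counts.keys()

# A set has no order, so rank the surviving words by their summed count
ranked = sorted(positive_only, key=pos_counts.get, reverse=True)

print(shared)  # {'boy'}
```

The set only decides which words survive; the frequency ordering comes from sorting against the counts dictionary afterwards, which answers the concern that a set is not ordered by word frequency.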
