2014-09-25 39 views
Filtering NLTK bigram frequencies (Python 3, NLTK)

I have code like this:

df1 = df[['term']] 
df2 = df1.to_string() 
words = nltk.word_tokenize(df2) 
bgs = nltk.bigrams(words) 
fdist = nltk.FreqDist(bgs) 

How can I now filter fdist to find only the bigrams that occur more than 2 times?

Answers

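This is what I did for my purposes (not the most direct way, but I thought I'd add my two cents): put the data into a new DataFrame, then filter on the DataFrame.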

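import pandas as pd
frequencies = [[" ".join(k), v] for k, v in fdist.items()]
frame = pd.DataFrame(frequencies, columns=['Bigrams', 'Frequency'])
# I kept counts of 10 or more; use frame['Frequency'] > 2 for the question's case
removal = frame[frame['Frequency'] >= 10]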

Try filtering...

for obj in fdist.most_common():
    if obj[1] > 2:
        print(obj)

OR

for obj in fdist:
    if fdist[obj] > 2:
        print(obj, fdist[obj])
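
If you want the filtered result as a dict rather than printed output, a comprehension over fdist.items() should work too. Below is a minimal sketch that runs end to end without NLTK installed. It assumes nltk.FreqDist behaves like collections.Counter (it subclasses Counter in NLTK 3), so Counter stands in for it here, and zip stands in for nltk.bigrams. The sample sentence and the names min_count and frequent are made up for illustration.

```python
from collections import Counter

# Hypothetical sample text; the question builds `words` with nltk.word_tokenize
words = "the cat sat on the mat and the cat sat on the rug while the cat sat".split()

# Adjacent word pairs, standing in for nltk.bigrams(words)
bgs = zip(words, words[1:])

# Counter stands in for nltk.FreqDist (assumption: same dict-like interface)
fdist = Counter(bgs)

# Keep only the bigrams that occur more than min_count times
min_count = 2
frequent = {bg: n for bg, n in fdist.items() if n > min_count}

for bg, n in frequent.items():
    print(" ".join(bg), n)
```

With a real FreqDist the last two steps should carry over unchanged, since only items() is used.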