2014-01-16 56 views
8

NLTK collocations for specific words

I know how to use NLTK to get bigram and trigram collocations, and I apply them to my own corpus. The code is below.

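However, I'm not sure (1) how to get the collocations for a specific word, and (2) whether NLTK has a collocation measure based on the log-likelihood ratio.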

import nltk 
from nltk.collocations import * 
from nltk.tokenize import word_tokenize 

text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence" 

trigram_measures = nltk.collocations.TrigramAssocMeasures() 
finder = TrigramCollocationFinder.from_words(word_tokenize(text)) 

# score_ngrams returns (ngram, score) pairs sorted by score, highest first
for i in finder.score_ngrams(trigram_measures.pmi):
    print(i)

Answers

9

Try this code:

import nltk
from nltk.collocations import *
# the Genesis corpus must be available, e.g. via nltk.download('genesis')
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# Ngrams with 'creature' as a member: apply_ngram_filter removes every
# n-gram for which the function returns True, so return True when
# 'creature' is absent
creature_filter = lambda *w: 'creature' not in w


## Bigrams
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# only bigrams that contain 'creature'
finder.apply_ngram_filter(creature_filter)
# return the 10 n-grams with the highest likelihood ratio
print(finder.nbest(bigram_measures.likelihood_ratio, 10))


## Trigrams
finder = TrigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))
# only trigrams that appear 3+ times
finder.apply_freq_filter(3)
# only trigrams that contain 'creature'
finder.apply_ngram_filter(creature_filter)
# return the 10 n-grams with the highest likelihood ratio
print(finder.nbest(trigram_measures.likelihood_ratio, 10))

It uses the likelihood ratio measure and filters out the n-grams that do not contain the word 'creature'.
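
If you'd rather check the filtering logic without downloading a corpus, here is a minimal sketch on the toy sentence from the question. It assumes a plain whitespace split instead of word_tokenize (so no tokenizer data is needed), and the target word "sheep" is just my pick for illustration.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
words = text.split()  # assumption: whitespace split instead of word_tokenize

target_word = "sheep"  # hypothetical choice, swap in your own word

finder = BigramCollocationFinder.from_words(words)
# apply_ngram_filter drops every n-gram for which the function returns True,
# so returning True when the target is absent keeps only n-grams containing it.
finder.apply_ngram_filter(lambda *w: target_word not in w)

# (ngram, score) pairs sorted by log-likelihood ratio, highest first
scored = finder.score_ngrams(BigramAssocMeasures.likelihood_ratio)
for ngram, score in scored:
    print(ngram, round(score, 2))
```

The word frequencies used for scoring still come from the whole text; the filter only limits which bigrams get reported.
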
2

For question 1, try:

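target_word = "electronic"  # your choice of word
# the filter removes trigrams for which the function returns True,
# i.e. those that do not contain target_word
finder.apply_ngram_filter(lambda w1, w2, w3: target_word not in (w1, w2, w3))
for i in finder.score_ngrams(trigram_measures.likelihood_ratio):
    print(i)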

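The idea is to filter out whatever you don't want. This method is normally used to filter on words in specific positions of the n-gram, and you can adapt it to your needs.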
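
As a rough illustration of the position-specific variant, the sketch below keeps only trigrams whose first word is the target. It reuses the toy sentence from the question with a plain whitespace split, and the target word "black" is an assumption made purely for the example.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
words = text.split()  # assumption: whitespace split instead of word_tokenize

target_word = "black"  # hypothetical choice, swap in your own word

finder = TrigramCollocationFinder.from_words(words)
# The filter removes a trigram when the function returns True, so this
# discards every trigram whose FIRST word is not the target.
finder.apply_ngram_filter(lambda w1, w2, w3: w1 != target_word)

# (ngram, score) pairs sorted by log-likelihood ratio, highest first
scored = finder.score_ngrams(TrigramAssocMeasures.likelihood_ratio)
for ngram, score in scored:
    print(ngram, round(score, 2))
```

Testing w2 or w3 instead of w1 pins the word to the middle or last position in the same way.
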