
NLTK: Find contexts of size 2k

I have a corpus and a word. For each occurrence of the word in the corpus, I want to get a list containing the k words before and the k words after the occurrence. I am doing this algorithmically (see below), but I wondered whether NLTK provides some functionality for this that I have missed?

def sized_context(word_index, window_radius, corpus):
    """Returns a list containing the window_radius words to the left
    and to the right of word_index, not including the word at word_index.
    """
    max_length = len(corpus)

    left_border = word_index - window_radius
    left_border = 0 if left_border < 0 else left_border

    right_border = word_index + 1 + window_radius
    right_border = max_length if right_border > max_length else right_border

    return corpus[left_border:word_index] + corpus[word_index+1:right_border]
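
For illustration, here is how the function behaves on a small, made-up token list (the window is clipped at the corpus borders):

corpus = "the quick brown fox jumps over the lazy dog".split()

# "fox" is at index 3; radius 2 gives two words on each side.
print(sized_context(3, 2, corpus))  # ['quick', 'brown', 'jumps', 'over']

# Near the start of the corpus, the left window is clipped at index 0.
print(sized_context(1, 3, corpus))  # ['the', 'brown', 'fox', 'jumps']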

Answers


The simplest way is to use nltk.ngrams():

import nltk

words = nltk.corpus.brown.words()
k = 5
# Recent NLTK versions take left_pad_symbol/right_pad_symbol rather than pad_symbol.
for ngram in nltk.ngrams(words, 2*k+1, pad_left=True, pad_right=True,
                         left_pad_symbol=" ", right_pad_symbol=" "):
    # The target word sits at the centre of a (2k+1)-gram, i.e. at index k.
    if ngram[k].lower() == "settle":
        print(" ".join(ngram))

pad_left and pad_right ensure that all words get looked at. This is important if you don't let your context windows span across sentences (hence: lots of boundary cases).
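
A minimal sketch of what the padding buys you (the three-word list is made up for the example): with pad_left and pad_right, every word of the sequence appears exactly once at the centre position of some n-gram:

import nltk

for ngram in nltk.ngrams(["a", "b", "c"], 3, pad_left=True, pad_right=True,
                         left_pad_symbol=" ", right_pad_symbol=" "):
    print(ngram)
# (' ', ' ', 'a')
# (' ', 'a', 'b')
# ('a', 'b', 'c')
# ('b', 'c', ' ')
# ('c', ' ', ' ')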

If you want the window size to ignore punctuation, you can strip it out before scanning:

import re

words = (w for w in nltk.corpus.brown.words() if re.search(r"\w", w))
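
A quick check of what this filter drops (the toy token list is invented for the example):

import re

tokens = ["The", "valley", ",", "is", "--", "deep", "."]
print([w for w in tokens if re.search(r"\w", w)])
# ['The', 'valley', 'is', 'deep'] -- tokens made of pure punctuation are gone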

If you want to use NLTK's facilities, you can use NLTK's ConcordanceIndex. To base the width of the display on the number of words rather than the number of characters (the latter being the default for ConcordanceIndex.print_concordance), you can simply create a subclass of ConcordanceIndex with something like this:

from nltk import ConcordanceIndex

class ConcordanceIndex2(ConcordanceIndex):
    def create_concordance(self, word, token_width=13):
        "Returns a list of contexts for @word with a context <= @token_width"
        half_width = token_width // 2
        contexts = []
        for i, token in enumerate(self._tokens):
            if token == word:
                # Clip the left edge of the window at the start of the corpus.
                start = i - half_width if i >= half_width else 0
                context = self._tokens[start:i + half_width + 1]
                contexts.append(context)
        return contexts

Then you can get results like this:

>>> from nltk.tokenize import wordpunct_tokenize 
>>> my_corpus = 'The gerenuk fled frantically across the vast valley, whereas the giraffe merely turned indignantly and clumsily loped away from the valley into the nearby ravine.' # my corpus 
>>> tokens = wordpunct_tokenize(my_corpus) 
>>> c = ConcordanceIndex2(tokens) 
>>> c.create_concordance('valley') # returns a list of lists, since words may occur more than once in a corpus 
[['gerenuk', 'fled', 'frantically', 'across', 'the', 'vast', 'valley', ',', 'whereas', 'the', 'giraffe', 'merely', 'turned'], ['and', 'clumsily', 'loped', 'away', 'from', 'the', 'valley', 'into', 'the', 'nearby', 'ravine', '.']] 

The create_concordance method I created above is based on NLTK's ConcordanceIndex.print_concordance method, which works like this:

>>> c = ConcordanceIndex(tokens) 
>>> c.print_concordance('valley') 
Displaying 2 of 2 matches: 
            valley , whereas the giraffe merely turn 
and clumsily loped away from the valley into the nearby ravine . 
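
If you only need the token positions rather than formatted output, ConcordanceIndex also exposes an offsets() method; here is a sketch combining it with the sized_context() function from the question (the radius of 3 is chosen arbitrarily):

>>> for i in c.offsets('valley'):
...     print(sized_context(i, 3, tokens))
...
['across', 'the', 'vast', ',', 'whereas', 'the']
['away', 'from', 'the', 'into', 'the', 'nearby']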

Thanks, this looks more NLTK-esque, but the logic is still handcrafted. I was hoping for something that is implemented, tested and, most importantly, optimized within the framework. – Zakum