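If you want to stay within NLTK, you can use its ConcordanceIndex. To base the width of the display on the number of words rather than the number of characters (the latter is the default for ConcordanceIndex.print_concordance), you can simply create a subclass of ConcordanceIndex, like this: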
from nltk import ConcordanceIndex

class ConcordanceIndex2(ConcordanceIndex):
    def create_concordance(self, word, token_width=13):
        "Returns a list of contexts for @word with a context <= @token_width"
        half_width = token_width // 2
        contexts = []
        for i, token in enumerate(self._tokens):
            if token == word:
                # Clamp the start at 0 so a match near the beginning
                # does not produce a negative slice index.
                start = i - half_width if i >= half_width else 0
                context = self._tokens[start:i + half_width + 1]
                contexts.append(context)
        return contexts
Then you can get results like this:
>>> from nltk.tokenize import wordpunct_tokenize
>>> my_corpus = 'The gerenuk fled frantically across the vast valley, whereas the giraffe merely turned indignantly and clumsily loped away from the valley into the nearby ravine.' # my corpus
>>> tokens = wordpunct_tokenize(my_corpus)
>>> c = ConcordanceIndex2(tokens)
>>> c.create_concordance('valley') # returns a list of lists, since words may occur more than once in a corpus
[['gerenuk', 'fled', 'frantically', 'across', 'the', 'vast', 'valley', ',', 'whereas', 'the', 'giraffe', 'merely', 'turned'], ['and', 'clumsily', 'loped', 'away', 'from', 'the', 'valley', 'into', 'the', 'nearby', 'ravine', '.']]
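The loop above scans every token on each call. A minimal, dependency-free sketch of the same windowing driven by a precomputed offset table is shown here; `build_offsets` and `token_windows` are hypothetical names used only for illustration, not part of NLTK. ConcordanceIndex keeps a comparable table internally and exposes it through its offsets() method, so the same idea should carry over to the subclass.

```python
from collections import defaultdict


def build_offsets(tokens):
    """Map each token to the list of positions where it occurs."""
    offsets = defaultdict(list)
    for i, token in enumerate(tokens):
        offsets[token].append(i)
    return offsets


def token_windows(tokens, offsets, word, token_width=13):
    """Return up to token_width tokens centred on each occurrence of word."""
    half_width = token_width // 2
    # max(0, ...) clamps the window start for matches near the beginning.
    return [tokens[max(0, i - half_width): i + half_width + 1]
            for i in offsets.get(word, [])]


# Toy input (an assumption for illustration, not the corpus used above).
tokens = "the cat sat on the mat near the door".split()
offsets = build_offsets(tokens)
windows = token_windows(tokens, offsets, "the", token_width=3)
```

Building the table costs one pass over the tokens; after that, each lookup only touches the positions where the word actually occurs.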
The create_concordance method I wrote above is based on NLTK's ConcordanceIndex.print_concordance method, which works like this:
>>> c = ConcordanceIndex(tokens)
>>> c.print_concordance('valley')
Displaying 2 of 2 matches:
valley , whereas the giraffe merely turn
and clumsily loped away from the valley into the nearby ravine .
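To make the contrast with a character-based width concrete, here is a rough sketch of how a display line of a fixed character width can be cut around a match. This is not NLTK's actual implementation; `char_window` is a hypothetical helper, and the exact padding and truncation rules of print_concordance may differ.

```python
def char_window(tokens, i, width=75):
    """Build one display line of about `width` characters around tokens[i]."""
    word = tokens[i]
    # Characters available on each side, after the word and two spaces.
    half = max(0, (width - len(word) - 2) // 2)
    left_text = " ".join(tokens[:i])
    right_text = " ".join(tokens[i + 1:])
    # Slicing by characters can cut a neighbouring word in the middle.
    left = left_text[-half:] if half else ""
    right = right_text[:half]
    return f"{left} {word} {right}"


# Toy input (an assumption for illustration, not the corpus used above).
tokens = "the cat sat on the mat near the door".split()
line = char_window(tokens, 4, width=13)
```

Here the left context comes out as "t on": the character limit falls in the middle of "sat", which is exactly what the token-based subclass avoids.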
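Thanks, this looks more NLTK-esque, but the logic is still hand-crafted. I was hoping for something already implemented, tested and, most importantly, optimized within the framework. – Zakum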