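For examples of using NLTK for collocation extraction, take a look at the following guide: A how-to guide by nltk on collocations extraction
As for the BNC corpus reader, all the information you need is in the documentation.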
from nltk.corpus.reader.bnc import BNCCorpusReader
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
# Instantiate the reader like this
bnc_reader = BNCCorpusReader(root="/path/to/BNC/Texts", fileids=r'[A-K]/\w*/\w*\.xml')
# Say you want to extract all bigram collocations and then
# rank them just by their frequency; this is what you would do.
# Again, take a look at the nltk guide on collocations for more examples.
list_of_fileids = ['A/A0/A00.xml', 'A/A0/A01.xml']
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(bnc_reader.words(fileids=list_of_fileids))
scored = finder.score_ngrams(bigram_measures.raw_freq)
print(scored)
The output will look something like this:
[(('of', 'the'), 0.004902261167963723), (('in', 'the'),0.003554139346773699),
(('.', 'The'), 0.0034315828175746064), (('Gift', 'Aid'), 0.0019609044671854894),
((',', 'and'), 0.0018996262025859428), (('for', 'the'), 0.0018383479379863962), ... ]
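To make the scores in that output less opaque, here is a minimal plain-Python sketch of what a raw-frequency score amounts to. It assumes the score is the bigram's count divided by the total number of tokens, which is my understanding of `raw_freq` in this setting. The function name `raw_freq_bigrams` and the toy token list are hypothetical, used only for illustration; this is not NLTK's implementation.

```python
from collections import Counter


def raw_freq_bigrams(tokens):
    """Score each adjacent word pair as count / total number of tokens.

    Hypothetical helper for illustration only (not part of NLTK).
    """
    tokens = list(tokens)
    if not tokens:
        return []
    # Pair every token with its right-hand neighbour and count the pairs.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scored = [(bigram, count / total) for bigram, count in bigram_counts.items()]
    # Highest score first; ties are broken alphabetically by the bigram.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy input standing in for bnc_reader.words(fileids=...)
toy_tokens = "the cat sat on the mat and the cat slept".split()
toy_scored = raw_freq_bigrams(toy_tokens)
print(toy_scored[:3])
```

On a real corpus you would still use the finder from the snippet above; the sketch is only meant to show where numbers like these come from.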
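As the output above shows, scored already comes back ordered from highest to lowest score. If you want the bigrams sorted alphabetically instead (without their scores), you could try something like this: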
sorted_bigrams = sorted(bigram for bigram, score in scored)
print(sorted_bigrams)
Which results in:
[('!', 'If'), ('!', 'Of'), ('!', 'Once'), ('!', 'Particularly'), ('!', 'Raising'),
('!', 'YOU'), ('!', '‘'), ('&', 'Ealing'), ('&', 'Public'), ('&', 'Surrey'),
('&', 'TRAINING'), ("'", 'SPONSORED'), ("'S", 'HOME'), ("'S", 'SERVICE'), ... ]
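Note that the sorted(...) call above orders the bigrams themselves and discards the scores. If you need an explicit ordering by score, a minimal sketch could look like the following. It works on a few (bigram, score) pairs copied by hand from the output shown earlier. The names `scored_sample` and `is_word_pair` are hypothetical, and the punctuation filter is just one possible choice.

```python
# A few (bigram, score) pairs copied from the output shown earlier.
scored_sample = [
    (('of', 'the'), 0.004902261167963723),
    (('in', 'the'), 0.003554139346773699),
    (('.', 'The'), 0.0034315828175746064),
    (('Gift', 'Aid'), 0.0019609044671854894),
]

# Sort on the second element of each pair (the score).
by_score_ascending = sorted(scored_sample, key=lambda pair: pair[1])
by_score_descending = sorted(scored_sample, key=lambda pair: pair[1], reverse=True)


def is_word_pair(bigram):
    """Return True if every token in the bigram is purely alphabetic."""
    return all(token.isalpha() for token in bigram)


# Optionally drop pairs that contain punctuation tokens such as '.' or ','.
words_only = [pair for pair in by_score_descending if is_word_pair(pair[0])]

print(by_score_ascending[0])
print(words_only)
```

The same `key=` approach should work on the full scored list, since it is a plain list of tuples.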
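What is your goal? Do you have to use NLTK? I'm not very familiar with Python and have never used NLTK, but I have processed the BNC in Java with Stanford Core NLP. My goal was to build a proper corpus to parse in order to get the dependency relations between word pairs. So, starting from the BNC XML files, I re-created each sentence with an XML parser and then processed each sentence with Core NLP. If your goal is just to import the corpus, I honestly can't answer that, but as a last resort you could still convert the XML texts to txt format, pass them to Python, and process them as strings. –
@s.dallapalma Hi. I don't have to use NLTK, but I do need to be able to use some library to find the "collocations" of a word. I looked at Stanford Core NLP but was told it doesn't have a collocations feature. – jason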