How to find common phrases in a text document

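I have a text file containing a large number of comments/sentences, and I want to find the most common phrases that recur in the document. I tried playing around with NLTK and found this thread: How to extract common/significant phrases from a series of text entries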

However, after trying it, I get strange results like this:

>>> finder.apply_freq_filter(3) 
>>> finder.nbest(bigram_measures.pmi, 10) 
[('m', 'e'), ('t', 's')]

And in another file, where the phrase "this is funny" is very common, I get an empty list, [].

How should I go about doing this?

Here is my full code:

import nltk 
from nltk.collocations import * 
bigram_measures = nltk.collocations.BigramAssocMeasures() 
trigram_measures = nltk.collocations.TrigramAssocMeasures() 

# change this to read in your data 
finder = BigramCollocationFinder.from_words('MkXVM6ad9nI.txt') 

# only bigrams that appear 3+ times 
finder.apply_freq_filter(3) 

# return the 10 n-grams with the highest PMI 
print(finder.nbest(bigram_measures.pmi, 10))

Source

2014-04-22 Stupid.Fat.Cat

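I haven't used NLTK, but I suspect the problem is that from_words expects a sequence of tokens, not a filename or a raw string. A plain string is iterated one character at a time, which would explain single-letter pairs like ('m', 'e').

Something like this, in place of the from_words line: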

# read the whole file into a single string
with open('MkXVM6ad9nI.txt') as wordfile:
    text = wordfile.read()

# split the text into word and punctuation tokens
tokens = nltk.wordpunct_tokenize(text)
finder = BigramCollocationFinder.from_words(tokens)

might work, although there may already be a dedicated API for reading from files.
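The difference between passing a string and passing tokens can be checked without NLTK. The following is only a rough standard-library sketch, not NLTK's implementation: bigram_counts is a hypothetical helper, the regex tokenizer stands in for nltk.wordpunct_tokenize, and the sample text is made up. It assumes a bigram finder simply walks over whatever iterable it is given.

```python
import re
from collections import Counter


def bigram_counts(items):
    """Count adjacent pairs in any iterable (hypothetical helper)."""
    seq = list(items)
    return Counter(zip(seq, seq[1:]))


# Made-up sample text, for illustration only.
text = "this is funny. this is funny! yes, this is funny"

# Passing the raw string: iterating a str yields single characters,
# so every "bigram" is a pair of characters.
char_pairs = bigram_counts(text)

# Passing tokens: a simple regex tokenizer (an assumption, standing in
# for nltk.wordpunct_tokenize) gives word-level pairs instead.
tokens = re.findall(r"\w+", text.lower())
word_pairs = bigram_counts(tokens)

# Keep only pairs seen at least 3 times, most frequent first.
common = [(pair, n) for pair, n in word_pairs.most_common() if n >= 3]
print(common)
```

This ranks pairs by raw frequency rather than by PMI, so it is a different scoring from the nbest call in the question; it is meant only to show that word-level tokens have to go in before word-level phrases can come out.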

Source

2014-04-22 20:17:56 Veedrac

Thanks! That was my problem.
