Find the most frequent term in each document of a corpus

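I've been using R's tm package with a lot of success on classification problems. I know how to find the most frequent terms across the entire corpus (using findFreqTerms()), but I don't see anything in the documentation that finds the most frequent term in each individual document of the corpus (after I've removed stopwords, but before I remove sparse terms). I've tried using apply() with max, but that gives me the maximum number of times a term occurs in each document, not the name of the term itself.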

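library(tm)

data("crude")
corpus <- tm_map(crude, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
# content_transformer() keeps the documents as tm text documents after tolower
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)
# largest term count per document (row); this returns the counts, not the terms
maxterms <- apply(dtm, 1, max)
maxterms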
127 144 191 194 211 236 237 242 246 248 273 349 352 
5 13 2 3 3 10 8 3 7 9 9 4 5 
353 368 489 502 543 704 708 
4 4 4 5 5 9 4

Any ideas?

Source

2013-11-04 Bryan

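Ben's answer gives you what you asked for, but I'm not sure that what you asked for is wise: it doesn't account for ties. Here is one approach, plus a second option that uses the qdap package. Both give you a list with the words (in qdap's case, a list of data frames with word and frequency). You can use unlist to get the rest of the way with the first option, and lapply, indexing and unlist with qdap. The qdap option works on the original Corpus:

Option #1: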

apply(dtm, 1, function(x) unlist(dtm[["dimnames"]][2], 
    use.names = FALSE)[x == max(x)])
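To see what Option #1 returns without needing tm installed, here is a minimal base R sketch on a made-up count matrix. The matrix toy and its document and term names are assumptions for illustration only (they are not the crude data); a dense matrix stands in for as.matrix(dtm).

```r
# Toy document-term counts (hypothetical data): rows are documents, columns are terms.
toy <- matrix(
  c(3, 3, 1,
    0, 2, 5,
    4, 1, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("d1", "d2", "d3"),
                  Terms = c("oil", "price", "opec"))
)

# For each document keep every term whose count equals the row maximum,
# so tied terms all survive. lapply() always returns a list.
top_terms <- lapply(rownames(toy), function(doc) {
  counts <- toy[doc, ]
  names(counts)[counts == max(counts)]
})
names(top_terms) <- rownames(toy)

top_terms
```

One design note: apply() simplifies its result when every document yields the same number of terms (a vector if each has exactly one, a matrix if each has the same number greater than one), so code that consumes the result may be easier to write against lapply(), which keeps the list shape regardless.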

Option #2, qdap:

library(qdap) 
dat <- tm_corpus2df(crude) 
tapply(stemmer(dat$text), dat$docs, freq_terms, top = 1, 
    stopwords = tm::stopwords("English"))

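Wrapping the tapply call in lapply(WRAP_HERE, "[", 1) makes the two answers identical in content and nearly identical in format.

EDIT: Added an example that is a leaner use of qdap: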

FUN <- function(x) freq_terms(x, top = 1, stopwords = stopwords("English"))[, 1] 
lapply(stemmer(crude), FUN) 

## [[1]] 
## [1] "oil" "price" 
## 
## [[2]] 
## [1] "opec" 
## 
## [[3]] 
## [1] "canada" "canadian" "crude" "oil"  "post"  "price" "texaco" 
## 
## [[4]] 
## [1] "crude" 
## 
## [[5]] 
## [1] "estim" "reserv" "said" "trust" 
## 
## [[6]] 
## [1] "kuwait" "said" 
## 
## [[7]] 
## [1] "report" "say" 
## 
## [[8]] 
## [1] "yesterday" 
## 
## [[9]] 
## [1] "billion" 
## 
## [[10]] 
## [1] "market" "price" 
## 
## [[11]] 
## [1] "mln" 
## 
## [[12]] 
## [1] "oil" 
## 
## [[13]] 
## [1] "oil" "price" 
## 
## [[14]] 
## [1] "oil" "opec" 
## 
## [[15]] 
## [1] "power" 
## 
## [[16]] 
## [1] "oil" 
## 
## [[17]] 
## [1] "oil" 
## 
## [[18]] 
## [1] "dlrs" 
## 
## [[19]] 
## [1] "futur" 
## 
## [[20]] 
## [1] "januari"
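A list like the one printed above is awkward to tabulate, so here is a small base R sketch of the "unlist to get the rest of the way" step. The list top_terms is hypothetical hand-typed data shaped like that output, not a result computed from crude.

```r
# Hypothetical per-document result with ties, shaped like the list output above.
top_terms <- list(d1 = c("oil", "price"), d2 = "opec", d3 = c("oil", "opec"))

# Long format: one row per (document, term) pair.
flat <- data.frame(
  doc  = rep(names(top_terms), lengths(top_terms)),
  term = unlist(top_terms, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Alternative: one string per document, with tied terms joined by commas.
collapsed <- vapply(top_terms, paste, character(1), collapse = ", ")

flat
collapsed
```

The long format keeps every tied term as its own row, which is convenient for merging with other per-document data; the collapsed form is easier to read at a glance.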

Source

2013-11-04 04:15:26

Good point about ties, quite right. – Ben

Agreed. Ben, if you don't mind, I'm going to make this the accepted answer. – Bryan

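You're almost there: replace max with which.max to get, for each document (i.e. each row), the column index of the term with the highest frequency. Then use that vector of column indices to subset the terms (the column names, more or less) of the document-term matrix. This returns the actual term with the maximum frequency for each document, rather than just the frequency value you get with max. Note that which.max returns only the first maximum, so tied terms are not reported. Following on from your example: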

maxterms<-apply(dtm, 1, which.max) 
dtm$dimnames$Terms[maxterms] 
[1] "oil"  "opec" "canada" "crude" "said" "said" "report" "oil"  
[9] "billion" "oil"  "mln"  "oil"  "oil"  "oil"  "power" "oil"  
[17] "oil"  "dlrs" "futures" "january"
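To make the behaviour of which.max on ties concrete, here is a minimal base R sketch on a made-up count matrix. The matrix toy and its names are assumptions for illustration; with tm you would apply the same steps to as.matrix(dtm).

```r
# Toy document-term counts (hypothetical data); d1 has a tie between "oil" and "price".
toy <- matrix(
  c(3, 3, 1,
    0, 2, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(Docs = c("d1", "d2"),
                  Terms = c("oil", "price", "opec"))
)

# Column index of the first maximum in each row.
idx <- apply(toy, 1, which.max)
first_max <- colnames(toy)[idx]

# Pair each document with its reported term and that term's count.
result <- data.frame(
  doc   = rownames(toy),
  term  = first_max,
  count = toy[cbind(seq_len(nrow(toy)), idx)],
  stringsAsFactors = FALSE
)

# How many terms share the maximum in each document; values above 1 flag ties
# that which.max silently drops.
n_tied <- rowSums(toy == apply(toy, 1, max))

result
n_tied
```

Here d1 reports only "oil" even though "price" has the same count, and n_tied is 2 for that document, which is a cheap way to check whether the single-term answer is hiding anything.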

Source

2013-11-04 02:55:58 Ben

Awesome! Thanks so much. – Bryan
