2016-07-29

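How to reduce the sparsity of a document-term matrix built from a corpus (R)

I have a corpus of more than 15,000 text documents. The removeSparseTerms function does not work: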

dtm 

<<DocumentTermMatrix (documents: 15095, terms: 12811)>> 
Non-/sparse entries: 140286/193241759 
Sparsity   : 100% 
Maximal term length: 37 
Weighting   : term frequency (tf) 

dtms <- removeSparseTerms(dtm, 0.1) 
dtms 

<<DocumentTermMatrix (documents: 15095, terms: 0)>> 
Non-/sparse entries: 0/0 
Sparsity   : 100% 
Maximal term length: 0 
Weighting   : term frequency (tf) 
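
A note on why the result above has 0 terms: as far as I understand the tm documentation, removeSparseTerms(x, sparse) keeps only the terms whose share of empty documents is below sparse. With sparse = 0.1, a term would have to occur in more than 90% of the 15095 documents to survive. The sketch below imitates that rule in base R on a toy matrix; the data and the helper keep_terms are hypothetical and are not part of tm.

```r
# Toy document-term counts: 10 documents (rows) x 3 terms (columns).
# Hypothetical data, not taken from the corpus in the question.
m <- cbind(
  common = c(1, 2, 1, 1, 3, 1, 1, 2, 0, 0),  # present in 8 of 10 documents
  medium = c(1, 0, 0, 2, 0, 0, 1, 0, 0, 0),  # present in 3 of 10 documents
  rare   = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0)   # present in 1 of 10 documents
)

# Fraction of documents in which each term does NOT occur.
term_sparsity <- colMeans(m == 0)

# Assumed rule: a term is kept when its sparsity is strictly below `sparse`.
keep_terms <- function(sparse) names(term_sparsity)[term_sparsity < sparse]

keep_terms(0.1)   # no term is present in more than 90% of documents
keep_terms(0.5)   # only "common" survives
keep_terms(0.8)   # "common" and "medium" survive
```

If that reading is right, a value close to 1 (for example 0.99, which would require a term to appear in more than roughly 1% of documents) is the more usual choice for a matrix this sparse.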

I also tried the following, and it did not work either:

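library(slam)                   # col_sums() comes from the slam package
colTotals <- col_sums(dtm)      # total count of each term across all documents
dtm2 <- dtm[, colTotals > 20]   # keep terms that occur more than 20 times in total
dtm2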

<<DocumentTermMatrix (documents: 15095, terms: 1387)>> 
Non-/sparse entries: 100867/20835898 
Sparsity   : 100% 
Maximal term length: 26 
Weighting   : term frequency (tf) 
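
The filter did remove terms (12811 down to 1387), but the printed sparsity barely moves. A quick check with the entry counts printed above shows why; that the printed value is rounded to a whole percent is my assumption about how tm prints it.

```r
# Entry counts copied from the printed summary of dtm2.
nonzero <- 100867
zero    <- 20835898

# Share of cells that are empty.
sparsity <- zero / (nonzero + zero)

# Still above 99%, so a whole-percent display shows 100%.
round(100 * sparsity, 2)
round(100 * sparsity)
```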

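Is there anything else I can do to reduce the sparsity? I want to be able to open the frequency table in Excel, but at the moment it needs too much memory to open (which is why I want to reduce the sparsity).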
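
One option that may sidestep the memory problem is to export only the non-zero cells as a long table (one row per document/term pair), or just the per-term totals, instead of the full dense matrix. The sketch below uses a small dense toy matrix in base R; the data and the names long, term_freq and out are hypothetical. For a real DocumentTermMatrix the same table could presumably be built from its triplet fields (dtm$i, dtm$j, dtm$v), which I believe it stores as a slam simple_triplet_matrix.

```r
# Toy dense document-term counts (hypothetical data) standing in for a DTM.
m <- matrix(
  c(2, 0, 0,
    0, 0, 1,
    0, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("apple", "pear", "plum"))
)

# Row/column positions of the non-zero cells only.
nz <- which(m != 0, arr.ind = TRUE)

# One row per (document, term) pair that actually occurs.
long <- data.frame(
  doc   = rownames(m)[nz[, 1]],
  term  = colnames(m)[nz[, 2]],
  count = m[nz],
  stringsAsFactors = FALSE
)

# Overall term frequencies, most frequent first.
term_freq <- sort(colSums(m), decreasing = TRUE)

# Write the long table as CSV; tempfile() keeps the sketch free of side effects.
out <- tempfile(fileext = ".csv")
write.csv(long, out, row.names = FALSE)
```

With the counts from the question, such a long table would have about 140,000 rows rather than about 193 million cells, which a spreadsheet can usually handle.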

Answers