2015-10-17

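Term frequency table to DocumentTermMatrix in the tm R package

I am using the tm package in R for some text mining. I have a term frequency matrix in which each row is a document, each column is a word, and each cell holds that word's frequency in that document. I am trying to convert it into a DocumentTermMatrix object, but I cannot find a function that handles this. The sources tm expects usually seem to be documents rather than an existing matrix.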

I tried as.DocumentTermMatrix(), but it requires a "weighting" argument and fails with the following error:

Error in .TermDocumentMatrix(t(x), weighting) :
argument "weighting" is missing, with no default

Here is a simple reproducible example:

docs = matrix(sample(1:10, 50, replace = TRUE), byrow = TRUE, ncol = 5, nrow = 10)
rownames(docs) = paste0("doc", 1:10)
colnames(docs) = c("grad", "school", "is", "sleep", "deprivation")

So I need to coerce the matrix docs into a DocumentTermMatrix.

Answer

Using your code example, you can pass the weighting argument explicitly:

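library(tm)

# Term frequency matrix: one row per document, one column per term
docs = matrix(sample(1:10, 50, replace = TRUE), byrow = TRUE, ncol = 5, nrow = 10)
rownames(docs) = paste0("doc", 1:10)
colnames(docs) = c("grad", "school", "is", "sleep", "deprivation")

# Coerce the matrix; here the tf-idf weighting function is applied during coercion
dtm <- as.DocumentTermMatrix(docs, weighting = weightTfIdf)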

If you read the help page for DocumentTermMatrix, you will see the following argument:

weighting: A weighting function capable of handling a TermDocumentMatrix. It defaults to weightTf for term frequency weighting. Available weighting functions shipped with the tm package are weightTf, weightTfIdf, weightBin, and weightSMART.

Depending on your needs, you must specify which weighting function to use for the document-term matrix, or create your own.
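If the goal is only to wrap the existing counts without changing them, weightTf is probably the closer fit, since it labels the matrix as term frequency and leaves the values alone. The following is a minimal sketch under that assumption. The set.seed() call and the name dtm_tf are additions for illustration and are not part of the original example. Note also that in this random example every term occurs in every document, so the idf factor would be log2(10/10) = 0 and I would expect weightTfIdf to produce all-zero weights here.

```r
library(tm)

set.seed(1)  # assumption: added only so the sketch is reproducible
docs <- matrix(sample(1:10, 50, replace = TRUE), byrow = TRUE, ncol = 5, nrow = 10)
rownames(docs) <- paste0("doc", 1:10)
colnames(docs) <- c("grad", "school", "is", "sleep", "deprivation")

# weightTf keeps the raw counts and marks the matrix as term-frequency weighted
dtm_tf <- as.DocumentTermMatrix(docs, weighting = weightTf)

# Sanity checks: class of the result and that the counts survived the coercion
inherits(dtm_tf, "DocumentTermMatrix")
all(as.matrix(dtm_tf) == docs)
```

Calling inspect(dtm_tf) afterwards should show the same counts together with the weighting label.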