Converting a document-term matrix to a matrix with a lot of data causes an overflow

I have a term-document matrix (from the tm package):

dtm <- TermDocumentMatrix(
    myCorpus, 
    control = list(
     weight = weightTfIdf, 
     tolower=TRUE, 
     removeNumbers = TRUE, 
     minWordLength = 2, 
     removePunctuation = TRUE, 
     stopwords=stopwords("german") 
    ))

When I run

typeof(dtm)

I see that it is a "list", and the structure looks like this:

Docs 
Terms  1 2 ... 
    lorem  0 0 ... 
    ipsum  0 0 ... 
    ...  .......

So I tried

wordMatrix = as.data.frame(t(as.matrix( dtm)))

This works for 1000 documents.

But when I try it with 40000 documents, it no longer works.

I get this error:

Fehler in vector(typeof(x$v), nr * nc) : Vektorgröße kann nicht NA sein 
Zusätzlich: Warnmeldung: 
In nr * nc : NAs durch Ganzzahlüberlauf erzeugt

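Translation: Error in ... : vector size cannot be NA. In addition: in nr * nc, NAs produced by integer overflow.

So nr * nc overflows. I looked at as.matrix, and it turns out that the function somehow converts the object to a vector with as.vector and then to a matrix. The conversion to a vector works, but the conversion from the vector to a matrix doesn't.

Do you have any suggestions as to what the problem might be?

Thanks, The Captain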

Source

2011-07-28 Captain Cook

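A simple way to get your DTM under the memory limit might be to remove sparse terms with the 'tm::removeSparseTerms' function. – Ben

A simple way to avoid including very rare or singly occurring terms in the first place is to use 'DocumentTermMatrix(..., control = list(..., bounds = list(global = c(N, Inf))))' and set N to e.g. 2, 3, 4, ... until the size is small enough. – smci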
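The two comments above can be tried out on the small 'crude' corpus that ships with tm. This is only a sketch: the 0.9 sparsity threshold and N = 2 are arbitrary values chosen for illustration, and the object names are made up.

```r
library(tm)
data("crude")   # small example corpus of Reuters articles shipped with tm

tdm <- TermDocumentMatrix(crude)

# Ben's suggestion: drop terms that are absent from more than 90% of the
# documents (0.9 is an arbitrary threshold for illustration)
tdm_sparse_removed <- removeSparseTerms(tdm, sparse = 0.9)

# smci's suggestion: only keep terms that occur in at least N documents
N <- 2
tdm_bounded <- TermDocumentMatrix(
  crude,
  control = list(bounds = list(global = c(N, Inf)))
)

# Compare the number of terms (rows) before and after
c(all = nrow(tdm),
  sparse_removed = nrow(tdm_sparse_removed),
  bounded = nrow(tdm_bounded))
```

Fewer terms means fewer rows, so nr * nc shrinks accordingly. Whether that is enough to get under the limit depends on the corpus.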

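The integer overflow tells you exactly what the problem is: with 40000 documents, you have too much data. The problem starts in the conversion to a matrix, by the way, which you can see if you look at the code of the underlying function: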

class(dtm) 
[1] "TermDocumentMatrix" "simple_triplet_matrix" 

getAnywhere(as.matrix.simple_triplet_matrix) 

A single object matching ‘as.matrix.simple_triplet_matrix’ was found 
... 
function (x, ...) 
{ 
    nr <- x$nrow 
    nc <- x$ncol 
    y <- matrix(vector(typeof(x$v), nr * nc), nr, nc) 
    ... 
}

This is the line referenced in the error message. What is going on can easily be simulated:

as.integer(40000 * 60000) # 40000 documents is 40000 rows in the resulting frame 
[1] NA 
Warning message: 
NAs introduced by coercion
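The simulation multiplies two doubles and then coerces the result, which is why it reports a coercion warning. Inside as.matrix the two factors are integers, so the multiplication itself overflows. A minimal base-R sketch of the difference, assuming the same illustrative dimensions of 40000 by 60000:

```r
nr <- 40000L   # integer, like x$nrow (dimension assumed for illustration)
nc <- 60000L   # integer, like x$ncol (dimension assumed for illustration)

# integer * integer overflows: the result is NA (plus a warning,
# suppressed here so the script runs quietly)
cells_int <- suppressWarnings(nr * nc)
is.na(cells_int)

# promoting one operand to double avoids the overflow
cells_dbl <- as.numeric(nr) * nc
cells_dbl > .Machine$integer.max

# rough size of a dense double matrix with that many cells, in GiB
size_gib <- cells_dbl * 8 / 2^30
size_gib
```

With these dimensions the dense matrix would hold 2.4e9 cells, which is close to 18 GiB of doubles.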

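The function vector() takes an argument, length, which in this case is nr*nc. If this is larger than approx. 2e9 (.Machine$integer.max), it is replaced by NA, and NA is not a valid argument for vector(). Bottom line: you are running into the limits of R. As of now, working in 64-bit won't help you. You'll have to resort to different methods. One possibility is to keep working with the list you have (dtm is a list), select the data you need using list operations, and go from there.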
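Working on the list directly might look like the sketch below. It uses a small hand-built list with the same i, j, v, nrow, ncol and dimnames fields that the as.matrix.simple_triplet_matrix code above reads. The data and the helper name term_totals are made up for illustration.

```r
# Toy stand-in for a term-document matrix in triplet form (made-up data):
# entry k says that term i[k] has weight v[k] in document j[k].
tdm <- list(
  i = c(1L, 1L, 2L, 3L),
  j = c(1L, 3L, 2L, 3L),
  v = c(2, 1, 5, 4),
  nrow = 3L, ncol = 3L,
  dimnames = list(Terms = c("lorem", "ipsum", "dolor"),
                  Docs  = c("1", "2", "3"))
)

# Hypothetical helper: total weight per term, computed from the non-zero
# entries only, so no vector of length nrow * ncol is ever allocated.
term_totals <- function(x) {
  totals <- numeric(x$nrow)
  sums <- tapply(x$v, x$i, sum)            # one sum per term index present
  totals[as.integer(names(sums))] <- sums
  names(totals) <- x$dimnames[[1]]
  totals
}

totals <- term_totals(tdm)
totals
```

The slam package, which defines simple_triplet_matrix, also provides helpers such as row_sums() and col_sums() that operate on the sparse form.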

PS: I created a dtm object with

require(tm) 
data("crude") 
dtm <- TermDocumentMatrix(crude, 
          control = list(weighting = weightTfIdf, 
             stopwords = TRUE))

Source

2011-07-28 14:53:26

Thanks for the clarification. I'll try thinning out the dtm and hope to be able to perform the conversion then. –

Here is a very, very simple solution that I found recently:

library(bigmemory)                       # provides as.big.matrix()
DTM <- t(TDM)                            # transpose of the term-document matrix; not necessary, but I prefer DTM over TDM
M <- as.big.matrix(x = as.matrix(DTM))   # convert the DTM into a bigmemory object
M <- as.matrix(M)                        # convert the bigmemory object back to a regular matrix
M <- t(M)                                # transpose again to get the TDM

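Note that transposing the TDM to get a DTM is entirely optional; it is just my personal preference for working with the matrix.

PS: I could not answer this question 4 years ago because I had only just started college.

Source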

2015-10-08 11:59:58

Based on Joris Meys's answer, I found a solution. The "vector()" documentation says this about the "length" argument:

... For a long vector, i.e., length > .Machine$integer.max, it has to be of type "double"...

So we can make a tiny fix to as.matrix():

as.big.matrix <- function(x) { 
    nr <- x$nrow 
    nc <- x$ncol 
    # nr and nc are integers. 1 is double. Double * integer -> double 
    y <- matrix(vector(typeof(x$v), 1 * nr * nc), nr, nc) 
    y[cbind(x$i, x$j)] <- x$v 
    dimnames(y) <- x$dimnames 
    y 
}
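To see the modified conversion at work, here is a small sketch that applies the same logic to a hand-built triplet list and checks the estimated size before allocating. The names dense_size_gib and dense_from_triplet, the toy data, and the 4 GiB limit are assumptions for illustration and are not part of tm or slam. The fix removes the overflow, but the dense result still has to fit in memory.

```r
# Estimated size in GiB of the dense matrix, computed in double arithmetic
dense_size_gib <- function(x, bytes_per_cell = 8) {
  as.numeric(x$nrow) * as.numeric(x$ncol) * bytes_per_cell / 2^30
}

# Same logic as the modified as.matrix above, with a size guard in front
# (the 4 GiB default is an arbitrary limit for illustration)
dense_from_triplet <- function(x, max_gib = 4) {
  size <- dense_size_gib(x)
  if (size > max_gib) {
    stop(sprintf("dense matrix would need about %.1f GiB", size))
  }
  # 1 * integer * integer is evaluated as double, so it cannot overflow
  y <- matrix(vector(typeof(x$v), 1 * x$nrow * x$ncol), x$nrow, x$ncol)
  y[cbind(x$i, x$j)] <- x$v
  dimnames(y) <- x$dimnames
  y
}

# Toy triplet list (made-up data)
toy <- list(i = c(1L, 2L), j = c(2L, 1L), v = c(0.5, 1.5),
            nrow = 2L, ncol = 2L,
            dimnames = list(Terms = c("lorem", "ipsum"), Docs = c("1", "2")))
m <- dense_from_triplet(toy)
m
```

For an object with 40000 by 60000 cells the guard stops with an error instead of attempting an allocation of roughly 18 GiB.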

Source

2016-08-19 10:35:28
