2015-07-13

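Matrix control in text mining

I am working with a large dataset of conference papers, and I plan to perform text mining and topic modeling on it. The dataset contains 7 columns of information (including the abstract) for 35,451 papers.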

names(compen) 
[1] "Year.the.Paper.was.Presented" "Paper.Title"     
[3] "Paper.Abstract"    "Author.Name"     
[5] "Author.s.Organization"  "Reviewing.Committee.s.Code" 
[7] "Reviewing.Committee.s.Name" 
dim(compen) 
[1] 35451  7 

Here is my text mining code, which works as expected.

library(tm)
# Build a corpus with one document per abstract
mydata.corpus <- Corpus(VectorSource(compen$Paper.Abstract))
# content_transformer() keeps the documents as tm text objects after tolower
mydata.corpus <- tm_map(mydata.corpus, content_transformer(tolower))
mydata.corpus <- tm_map(mydata.corpus, removePunctuation, preserve_intra_word_dashes=TRUE)
# German stopwords plus a custom list of common English and domain words
my_stopwords <- c(stopwords('german'),"the", "due", "are", "not", "for", "this", "and", "that", "there", "beyond", "time", "from", "been", "both", "than", "has","now", "until", "all", "use", "two", "based", "between", "can", "different", "each", "have", "however", "its", "level", "more", "most","new", "number","one","other", "paper", "pavement", "such", "their", "these", "used", "using", "were", "when", "which", "with")
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)
mydata.corpus <- tm_map(mydata.corpus, removeNumbers)
# Terms are rows and documents are columns
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
mydata.dtm
dim(mydata.dtm)
# Terms that occur at least 5000 times
findFreqTerms(mydata.dtm, lowfreq=5000)

The problem starts here.

term.freq <- rowSums(as.matrix(mydata.dtm)) 
Error: cannot allocate vector of size 7.7 Gb 
In addition: Warning messages: 
1: In vector(typeof(x$v), nr * nc) : 
    Reached total allocation of 8139Mb: see help(memory.size) 
2: In vector(typeof(x$v), nr * nc) : 
    Reached total allocation of 8139Mb: see help(memory.size) 
3: In vector(typeof(x$v), nr * nc) : 
    Reached total allocation of 8139Mb: see help(memory.size) 
4: In vector(typeof(x$v), nr * nc) : 
    Reached total allocation of 8139Mb: see help(memory.size) 

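This definitely looks like a memory problem. I would like to know whether there is a way to control the matrix so that the memory problem does not arise.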
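One possible reading of the error, offered as a sketch and not a definitive diagnosis: the `vector(typeof(x$v), nr * nc)` call in the warnings allocates one cell for every term-document pair, which is what `as.matrix()` does when it expands the sparse matrix into a dense one. If the cells are 8-byte doubles, 7.7 Gb corresponds to roughly 1e9 cells, or on the order of 29,000 terms across the 35,451 abstracts. The example below uses `row_sums()` from the slam package (a dependency of tm) to total the counts on the sparse form instead. The three toy abstracts are hypothetical stand-ins for `compen$Paper.Abstract`.

```r
library(tm)    # tm stores term-document matrices in slam's sparse triplet format
library(slam)  # provides row_sums() for that format

# Hypothetical stand-in for compen$Paper.Abstract
abstracts <- c("asphalt pavement cracking model",
               "pavement friction model",
               "bridge deck cracking")
toy.tdm <- TermDocumentMatrix(VCorpus(VectorSource(abstracts)))

# Size of the dense copy that as.matrix() would allocate,
# assuming 8 bytes per cell (counts stored as doubles)
dense.bytes <- prod(dim(toy.tdm)) * 8

# Row sums taken directly on the sparse matrix; no dense copy is created
term.freq <- row_sums(toy.tdm)
sort(term.freq, decreasing = TRUE)
```

On the real data the same call would be `row_sums(mydata.dtm)`, since `mydata.dtm` is a TermDocumentMatrix with terms in rows.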


Are you running 32-bit or 64-bit R? Use 'Sys.getenv("R_ARCH")' to find out. – Borealis


@Borealis Your code generates an empty string for me. – SabDeM


The cross-platform version is '.Machine$sizeof.pointer'. An output value of 8 means you are running 64-bit. – Borealis
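The check from the comments can be run as a short snippet. This is only a sketch; the variable names `pointer.bytes` and `r.bits` are made up for illustration.

```r
# 8 bytes per pointer means a 64-bit build of R; 4 means 32-bit
pointer.bytes <- .Machine$sizeof.pointer
r.bits <- pointer.bytes * 8
cat("This R session is", r.bits, "bit\n")

# The architecture string reported by R itself, for comparison
R.version$arch
```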

Answer


This is not a lot of data, but it sounds like the way you are loading it runs out of memory on an 8 GB system. Try this instead:

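require(quanteda)
# Keep every column except the abstract as document-level metadata
mydata.corpus <- corpus(compen$Paper.Abstract,
         docvars = compen[-which(names(compen)=="Paper.Abstract")])
# ignoredFeatures is the argument name in the quanteda release this was written for;
# later releases drop stopwords with dfm_remove() instead
mydata.dtm <- dfm(mydata.corpus, ignoredFeatures = my_stopwords)
mydata.dtm
topfeatures(mydata.dtm, 5000)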

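It does not currently preserve intra-word hyphens, but we will most likely add that as an option soon. If you want to use quanteda for your problem, I am happy to help further. It works with document-level metadata ("docvars"), and you can pass a "dfm" directly to all of the major topic modeling packages; see help(convert, package = "quanteda").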
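As a rough illustration of the docvars and convert() workflow described above, here is a minimal sketch on a two-row toy data frame. It uses function names from more recent quanteda releases (`tokens()`, `dfm_remove()`, `text_field`), which differ from the arguments shown in the answer. The data frame `toy` and its contents are hypothetical, and the conversion step is assumed to need the tm package installed.

```r
library(quanteda)

# Hypothetical stand-in for the compen data frame
toy <- data.frame(
  Paper.Abstract = c("The pavement model and the pavement test",
                     "Bridge deck cracking and pavement friction"),
  Year.the.Paper.was.Presented = c(2014, 2015),
  stringsAsFactors = FALSE
)

# Columns other than the text field are kept as docvars
toy.corpus <- corpus(toy, text_field = "Paper.Abstract")

# Tokenize, build the sparse dfm, then drop unwanted features
toy.dfm <- dfm(tokens(toy.corpus, remove_punct = TRUE))
toy.dfm <- dfm_remove(toy.dfm, pattern = c("the", "and"))

# Feature totals are computed on the sparse matrix
topfeatures(toy.dfm, 3)
colSums(toy.dfm)

# Hand the counts to a topic modeling package
# (assumption: this conversion needs the tm package installed)
if (requireNamespace("tm", quietly = TRUE)) {
  toy.dtm <- convert(toy.dfm, to = "topicmodels")
}
```

Because the dfm stays sparse throughout, totals such as `colSums(toy.dfm)` avoid the dense allocation that `as.matrix()` triggered in the question.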