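Matrix control in text mining

I am working with a large dataset of conference papers, and I plan to perform text mining and topic modeling on it. The dataset contains 7 columns of information (including the abstract) for 35,451 papers.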
names(compen)
[1] "Year.the.Paper.was.Presented" "Paper.Title"
[3] "Paper.Abstract" "Author.Name"
[5] "Author.s.Organization" "Reviewing.Committee.s.Code"
[7] "Reviewing.Committee.s.Name"
dim(compen)
[1] 35451 7
Here is my text mining code (it works perfectly).
library(tm)
mydata.corpus <- Corpus(VectorSource(compen$Paper.Abstract))
mydata.corpus <- tm_map(mydata.corpus, tolower)
mydata.corpus <- tm_map(mydata.corpus, removePunctuation, preserve_intra_word_dashes=TRUE)
my_stopwords <- c(stopwords('german'),"the", "due", "are", "not", "for", "this", "and", "that", "there", "beyond", "time", "from", "been", "both", "than", "has","now", "until", "all", "use", "two", "based", "between", "can", "different", "each", "have", "however", "its", "level", "more", "most","new", "number","one","other", "paper", "pavement", "such", "their", "these", "used", "using", "were", "when", "which", "with")
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)
mydata.corpus <- tm_map(mydata.corpus, removeNumbers)
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
mydata.dtm
dim(mydata.dtm)
findFreqTerms(mydata.dtm, lowfreq=5000)
The problem starts here.
term.freq <- rowSums(as.matrix(mydata.dtm))
Error: cannot allocate vector of size 7.7 Gb
In addition: Warning messages:
1: In vector(typeof(x$v), nr * nc) :
Reached total allocation of 8139Mb: see help(memory.size)
2: In vector(typeof(x$v), nr * nc) :
Reached total allocation of 8139Mb: see help(memory.size)
3: In vector(typeof(x$v), nr * nc) :
Reached total allocation of 8139Mb: see help(memory.size)
4: In vector(typeof(x$v), nr * nc) :
Reached total allocation of 8139Mb: see help(memory.size)
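It certainly looks like a memory problem. I would like to know whether there is a way to control the matrix so that the memory problem does not arise.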
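The error is raised inside `as.matrix()`, which tries to allocate one cell for every term-document pair (`nr * nc` in the warning). One possible way around it, sketched below on a small stand-in corpus rather than the real `compen` data, is to sum the rows on the sparse matrix itself with `slam::row_sums()`. This assumes the `slam` package is available (it is normally installed as a dependency of `tm`); the three toy documents and the 8-bytes-per-cell estimate are illustrative assumptions, not taken from the question.

```r
library(tm)
library(slam)  # sparse-matrix helpers; normally installed alongside tm

# Toy stand-in for compen$Paper.Abstract (hypothetical data for illustration)
docs <- c("asphalt pavement cracking model",
          "pavement design model",
          "traffic flow model")
tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)))

# Rough size of the dense matrix that as.matrix() would have to allocate,
# assuming 8 bytes per cell (counts stored as doubles)
dense.gb <- prod(dim(tdm)) * 8 / 2^30

# Term frequencies computed on the sparse representation,
# without ever building the dense matrix
term.freq <- slam::row_sums(tdm)
sort(term.freq, decreasing = TRUE)

# Optionally drop terms that are absent from more than 99% of documents
small.tdm <- removeSparseTerms(tdm, sparse = 0.99)
```

On the real data, the same call would be `slam::row_sums(mydata.dtm)`. Whether `removeSparseTerms()` shrinks the matrix enough for a later `as.matrix()` depends on the vocabulary, so the 0.99 threshold is only a starting point.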
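Are you running 32-bit or 64-bit R? Use 'Sys.getenv("R_ARCH")' to find out. – Borealis
@Borealis Your code gives me an empty string. – SabDeM
The cross-platform version is '.Machine$sizeof.pointer'. An output value of 8 means you are running 64-bit. – Borealis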
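The check from the comments can be run as a short script. This is a minimal sketch using only base R; the variable name `bits` is an illustrative choice.

```r
# Pointer size in bytes: 8 on a 64-bit build of R, 4 on a 32-bit build
ptr.bytes <- .Machine$sizeof.pointer
bits <- ptr.bytes * 8
cat("R is running as a", bits, "bit process\n")

# The architecture string reported by R itself, as a cross-check
R.version$arch
```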