Speeding up text mining (and loops) in R


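I am text mining thousands of documents (basically doing frequency counts) and would like to know whether there are other ways to speed up the following process. Running the whole analysis currently takes more than 10 hours. Thanks (from an R beginner).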

sessionInfo() 
#R version 3.2.3 (2015-12-10) 

library(bitops) 
library(RCurl) 
library(XML) 
library(stringr) 
library(tm) 

setwd("F:/testing_folder") 
path = "F:/testing_folder" 

file.names <- dir(path, pattern =".txt") 
filename <- vector() 
totalword <- vector() 

system.time(
    for(i in 1:length(file.names)){ 
    text.v <- scan(file.names[i], what="character", sep="\n",encoding = "UTF-8") 
    report.v <- paste(text.v, collapse=" ") 

    #Count total number of words 
    words.l <- strsplit(report.v, "\\W") 
    word.v <- unlist(words.l) 
    not.blanks.v <- which(word.v!="") 
    word.v <- word.v[not.blanks.v] 
    totalword <- append(totalword,length(word.v)) 

    filename <- append(filename,print(file.names[i])) 
    x <- data.frame(filename,totalword) 
    write.csv(x, file= "results.csv") #export results 
    } 
)

Source

2016-02-14 kxiang

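Instead of 'filename <- vector(); totalword <- vector()' you should preallocate them to the correct size. That will give you a noticeable speedup. Also, don't run 'write.csv' on every iteration of the loop: it simply overwrites the results each time, which takes time and serves little purpose.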

Thanks, but I'm not sure I fully understand what you mean. Could you be more specific, say, if I have 10,300 files in total? What should I do? – kxiang

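Your question is not reproducible, so it is hard to know exactly what you are doing. Mine is a general comment: you should not grow an object inside a loop (preallocate it instead; see 'vector'), and you are overwriting the CSV file on every iteration, so you should simply move that call out of the loop and write the file once afterwards.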
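The advice in these comments could be sketched roughly as follows. This is only an illustration under assumptions, not the commenter's own code: the function names 'count_words' and 'count_words_in_dir' are hypothetical, and the pattern "\\.txt$" is a stricter regular expression than the ".txt" used in the question. The result vector is allocated once at its final length, and 'write.csv' runs a single time after the loop.

```r
# Hypothetical helper: count the words in a single text file,
# using the same splitting rule as the question ("\\W").
count_words <- function(path) {
  text.v <- scan(path, what = "character", sep = "\n",
                 encoding = "UTF-8", quiet = TRUE)
  word.v <- unlist(strsplit(paste(text.v, collapse = " "), "\\W"))
  sum(word.v != "")  # number of non-blank tokens
}

# Hypothetical wrapper: count words for every .txt file in a folder.
count_words_in_dir <- function(dir_path, out_file = NULL) {
  file.names <- dir(dir_path, pattern = "\\.txt$")
  n <- length(file.names)

  # Preallocate the result to its final size instead of growing it.
  totalword <- integer(n)

  for (i in seq_len(n)) {
    totalword[i] <- count_words(file.path(dir_path, file.names[i]))
  }

  x <- data.frame(filename = file.names, totalword = totalword,
                  stringsAsFactors = FALSE)

  # Write the results once, after the loop has finished.
  if (!is.null(out_file)) write.csv(x, file = out_file)
  x
}
```

With the folder from the question, a call might look like 'count_words_in_dir("F:/testing_folder", "results.csv")'. Using 'seq_len(n)' rather than '1:length(file.names)' also avoids a spurious iteration when the folder contains no matching files.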

What do you get from the following?

Rprof("profile1.out", line.profiling=TRUE) 
source("http://pastebin.com/raw/kFGCse5s") 
Rprof(NULL) 
proftable("profile1.out", lines=10)

Source

2016-02-14 19:30:55 geotheory

I tested my code on a random sample of 500 files (the original sample is too large and takes too long to run). Here is the output of 'summaryRprof("profile1.out")': http://pastebin.com/WnsTUYgr – kxiang

Just run it on 1.. – geotheory

What do you mean by "run it on 1"? – kxiang
