
Vectorizing a for loop in R

I'm looking for a simple way to vectorize and speed up a for loop in my R program. I have the following data frame with sentences, plus two dictionaries of positive and negative words:

# Create data.frame with sentences 
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop", 
         "wouldnt bad notebook", "very good quality", "orgtop", 
         "great improvement for that bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7), 
       stringsAsFactors=F) 

# Create pos/negWords 
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits", 
      "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great", 
      "wouldnt bad") 
negWords <- c("hate","bad","not good","horrible") 

Now I replicate the original data frame to simulate a large data set:

# Replicate original data.frame - big data simulation (700,000 rows of sentences) 
df.expanded <- as.data.frame(replicate(100000, sent$words)) 
library(zoo)  # needed for coredata() 
sent <- coredata(sent)[rep(seq(nrow(sent)), 100000), ] 
rownames(sent) <- NULL 

As the next step, I sort the dictionary words in descending order by length (so that longer phrases are matched before the single words they contain) and assign scores: positive word = 1, negative word = -1.

# Order words in pos/negWords by descending length 
wordsDF <- data.frame(words = posWords, value = 1, stringsAsFactors = F) 
wordsDF <- rbind(wordsDF, data.frame(words = negWords, value = -1)) 
wordsDF$lengths <- nchar(wordsDF$words)  # nchar() is vectorized; no lapply needed 
wordsDF <- wordsDF[order(-wordsDF$lengths), ]  # longest entries first 
rownames(wordsDF) <- NULL 

Then I define the following function with a for loop:

# Sentiment score function 
library(qdapRegex)  # for rm_white() 
scoreSentence2 <- function(sentence){ 
    score <- 0 
    for(x in 1:nrow(wordsDF)){ 
        matchWords <- paste("\\<", wordsDF[x,1], '\\>', sep="")  # matching exact words 
        count <- length(grep(matchWords, sentence))  # count them 
        if(count){ 
            score <- score + (count * wordsDF[x,2])  # compute score (count * sentValue) 
            # remove the matched dictionary entry from the sentence 
            sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*'), ' ', sentence) 
            sentence <- rm_white(sentence)  # collapse leftover whitespace 
        } 
    } 
    score 
} 

I then call this function on the sentences in my data frame:

# Apply the scoreSentence2 function to the sentences 
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2)) 
# Time consumption for 700,000 sentences in the sent data.frame: 
#    user  system elapsed 
# 1054.19    0.09 1056.17 
# Add the sentiment score to the original sent data.frame 
sent <- cbind(sent, SentimentScore2) 
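
A timing like the one quoted above can be obtained with system.time(); a minimal sketch (the numbers are, of course, machine-dependent):

# Measure CPU and wall-clock time of the scoring step 
timing <- system.time( 
    SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2)) 
) 
print(timing)  # user/system/elapsed, as quoted above 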

The desired output is:

Words                                        user   SentimentScore2 
just right size and i love this notebook    1      2 
benefits great laptop                       2      1 
wouldnt bad notebook                        3      1 
very good quality                           4      1 
orgtop                                      5      0 
... 

and so forth.

Could anyone please help me reduce the computation time of my original approach? My R programming skills are at beginner level, so I'm stuck :-) Any help or advice will be greatly appreciated. Thank you very much in advance.


As I understand the code, you want to remove the detected words, but the desired output still contains them. So which part is incorrect, or am I reading it wrong? – LauriK 2015-02-23 09:52:09


Please explain in more detail what you are trying to achieve with the SentimentScore2 function – StrikeR 2015-02-23 09:57:22


Removing the words is part of my approach. After sorting the pos/neg words in descending order, I match them against the words in a sentence and then remove them, so that they are not matched again in a later loop iteration. The desired output must still contain them. But it takes too long, and that is the problem... – martinkabe 2015-02-23 10:00:13

Answers


In the spirit of teaching you to fish rather than handing you a fish, I'll walk you through it:

  1. Make a copy of your code: you are going to mess it up!

  2. Find the bottleneck:

    2A: Make the problem smaller:

    nRep <- 100 
    df.expanded <- as.data.frame(replicate(nRep, sent$words)) 
    library(zoo) 
    sent <- coredata(sent)[rep(seq(nrow(sent)), nRep), ] 
    

    2B: Keep a reference solution: you will be changing your code, and few activities are as amazingly effective at introducing bugs as optimizing code!

    sentRef <- sent 
    

    Then add the same line, commented out, at the end of your code so you remember where your reference is. To make it even easier to check that you are not breaking your code, you can add an automatic test at the end:

    library("testthat") 
    expect_equal(sent,sentRef) 
    

    2C: Wrap the profiler around the code under investigation:

    Rprof() 
    SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2)) 
    Rprof(NULL) 
    

    2D: View the results, with base R:

    summaryRprof() 
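
    summaryRprof() returns a list; its $by.self component ranks functions by the time spent in their own code, which is usually the quickest way to spot the hot spot. A minimal sketch:

    prof <- summaryRprof() 
    head(prof$by.self)  # functions ranked by time spent in their own code 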
    

    There are also better tools; check out the profr or lineprof packages.

    lineprof is my preferred tool; its real added value here is narrowing the problem down to these two lines:

    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words 
    count <- length(grep(matchWords,sentence)) # count them 
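
    For the record, a minimal lineprof session could look like the sketch below. lineprof needs source references, so the scoring function should be sourced from a file; the file name score.R here is just a placeholder:

    # devtools::install_github("hadley/lineprof")  # lineprof lives on GitHub 
    library(lineprof) 
    source("score.R")  # placeholder file containing scoreSentence2; sourcing keeps srcrefs 
    l <- lineprof(unlist(lapply(sent$words, scoreSentence2))) 
    l  # prints per-line timings and allocations 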
    
  3. Fix it.

    3.1 Luckily, the main problem is fairly simple: you don't need the first line of the loop body to be inside the function, so move it out beforehand. The same goes for your paste0() call, by the way. Your code becomes:

    matchWords <- paste("\\<", wordsDF[,1], '\\>', sep="")  # exact-word patterns, precomputed 
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')  # removal patterns, precomputed 
    
    # Sentiment score function 
    scoreSentence2 <- function(sentence){ 
        score <- 0 
        for(x in 1:nrow(wordsDF)){ 
            count <- length(grep(matchWords[x], sentence))  # count them 
            if(count){ 
                score <- score + (count * wordsDF[x,2])  # compute score (count * sentValue) 
                sentence <- gsub(matchedWords[x], ' ', sentence)  # remove matched words 
                # sentence <- rm_white(sentence)  # qdapRegex; not needed once gsub inserts single spaces 
            } 
        } 
        score 
    } 
    

    This changes the execution time for 1,000 reps from 5.64 s to 2.32 s. Not a bad investment!
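
    Per step 2B, it is worth confirming that the optimized function still reproduces the reference result before trusting the speedup. A minimal sketch, where SentimentScoreRef is a hypothetical name for the scores saved from the original slow version:

    # Re-check the optimized scores against the saved reference (step 2B) 
    library(testthat) 
    SentimentScoreNew <- unlist(lapply(sent$words, scoreSentence2)) 
    expect_equal(SentimentScoreNew, SentimentScoreRef) 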

    3.2 The next bottleneck is the 'count <-' line, but I think Shadow has just given the right answer for that :-) Combining the two, we get:

    matchWords <- paste("\\<", wordsDF[,1], '\\>', sep="")  # exact-word patterns, precomputed 
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')  # removal patterns, precomputed 
    
    # Sentiment score function 
    scoreSentence2 <- function(sentence){ 
        score <- 0 
        for(x in 1:nrow(wordsDF)){ 
            count <- grepl(matchWords[x], sentence)  # TRUE/FALSE: does the entry occur? 
            score <- score + (count * wordsDF[x,2])  # compute score (count * sentValue) 
            sentence <- gsub(matchedWords[x], ' ', sentence)  # remove matched words 
            # sentence <- rm_white(sentence)  # qdapRegex; not needed once gsub inserts single spaces 
        } 
        score 
    } 
    

This brings it down to 0.18 s, a 31x speedup overall...


Awesome! Thank you very much, sir, you helped me a lot. Your approach is the best solution for my task. – martinkabe 2015-02-23 13:51:40


You can easily vectorize your scoreSentence2 function, since grep, grepl, and gsub are already vectorized over the sentence vector: the loop then runs once per dictionary entry over all sentences at once, instead of once per entry per sentence:

library(qdapRegex)  # for rm_white() 

scoreSentence <- function(sentence){ 
    score <- rep(0, length(sentence)) 
    for(x in 1:nrow(wordsDF)){ 
        matchWords <- paste("\\<", wordsDF[x,1], '\\>', sep="")  # matching exact words 
        count <- grepl(matchWords, sentence)  # TRUE/FALSE per sentence 
        score <- score + (count * wordsDF[x,2])  # compute score (count * sentValue) 
        sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*'), ' ', sentence)  # remove matched words 
        sentence <- rm_white(sentence)  # collapse leftover whitespace 
    } 
    return(score) 
} 
scoreSentence(sent$words) 

Note that count does not actually count how many times an expression appears in a sentence (neither in your version nor in mine); it only tells you whether the expression appears at all. If you want to actually count the occurrences, you could use the following code.

count <- sapply(gregexpr(matchWords, sentence), function(x) length(x[x>0])) 
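
For example, a quick check of the counting idea (gregexpr() returns -1 when there is no match, hence the x > 0 filter):

# "good" occurs twice here, and gregexpr() reports both match positions 
sapply(gregexpr("\\<good\\>", "good good bad"), function(x) length(x[x > 0])) 
# [1] 2 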

Very nice, thank you very much, it is much faster now (3x faster). – martinkabe 2015-02-23 12:38:03