
Vectorizing a for loop in R

I'm looking for a simple way to vectorize and speed up a for loop in my R program. I have the following data frame with sentences, plus two dictionaries of positive and negative words:

# Create data.frame with sentences 
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop", 
         "wouldnt bad notebook", "very good quality", "orgtop", 
         "great improvement for that bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7), 
       stringsAsFactors=F) 

# Create pos/negWords 
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits", 
      "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great", 
      "wouldnt bad") 
negWords <- c("hate","bad","not good","horrible") 

Now I replicate the original data frame to simulate a large data set:

# Replicate original data.frame - big data simulation (700,000 rows of sentences) 
df.expanded <- as.data.frame(replicate(100000, sent$words)) 
library(zoo)  # needed for coredata() 
sent <- coredata(sent)[rep(seq(nrow(sent)), 100000), ] 
rownames(sent) <- NULL 

As the next step, I sort the dictionary words in descending order by length (so that longer phrases are matched before the single words they contain) and assign scores: positive word = 1, negative word = -1.

# Order words in pos/negWords by descending length 
wordsDF <- data.frame(words = posWords, value = 1, stringsAsFactors = F) 
wordsDF <- rbind(wordsDF, data.frame(words = negWords, value = -1)) 
wordsDF$lengths <- nchar(wordsDF$words)  # nchar() is vectorized; no lapply needed 
wordsDF <- wordsDF[order(-wordsDF$lengths), ]  # longest entries first 
rownames(wordsDF) <- NULL 

Then I define the following function with a for loop:

# Sentiment score function 
library(qdapRegex)  # for rm_white() 
scoreSentence2 <- function(sentence){ 
    score <- 0 
    for(x in 1:nrow(wordsDF)){ 
        matchWords <- paste("\\<", wordsDF[x,1], '\\>', sep="")  # matching exact words 
        count <- length(grep(matchWords, sentence))  # count them 
        if(count){ 
            score <- score + (count * wordsDF[x,2])  # compute score (count * sentValue) 
            # remove the matched dictionary entry from the sentence 
            sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*'), ' ', sentence) 
            sentence <- rm_white(sentence)  # collapse leftover whitespace 
        } 
    } 
    score 
} 

I then call this function on the sentences in my data frame:

# Apply the scoreSentence2 function to the sentences 
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2)) 
# Time consumption for 700,000 sentences in the sent data.frame: 
#    user  system elapsed 
# 1054.19    0.09 1056.17 
# Add the sentiment score to the original sent data.frame 
sent <- cbind(sent, SentimentScore2) 
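
A timing like the one quoted above can be obtained with system.time(); a minimal sketch (the numbers are, of course, machine-dependent):

# Measure CPU and wall-clock time of the scoring step 
timing <- system.time( 
    SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2)) 
) 
print(timing)  # user/system/elapsed, as quoted above 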

The desired output is:

Words                                        user   SentimentScore2 
just right size and i love this notebook    1      2 
benefits great laptop                       2      1 
wouldnt bad notebook                        3      1 
very good quality                           4      1 
orgtop                                      5      0 
... 

and so forth.

Could anyone please help me reduce the computation time of my original approach? My R programming skills are at beginner level, so I'm stuck :-) Any help or advice will be greatly appreciated. Thank you very much in advance.


As I understand the code, you want to remove the detected words, but the desired output still contains them. So which part is incorrect, or am I reading it wrong? – LauriK 2015-02-23 09:52:09


Please explain in more detail what you are trying to achieve with the SentimentScore2 function – StrikeR 2015-02-23 09:57:22


Removing the words is part of my approach. After sorting the pos/neg words in descending order, I match them against the words in a sentence and then remove them, so that they are not matched again in a later loop iteration. The desired output must still contain them. But it takes too long, and that is the problem... – martinkabe 2015-02-23 10:00:13

Answers


In the spirit of teaching you to fish rather than handing you a fish, I'll walk you through it:

  1. Make a copy of your code: you are going to mess it up!

  2. Find the bottleneck:

    2A: Make the problem smaller:

    nRep <- 100 
    df.expanded <- as.data.frame(replicate(nRep, sent$words)) 
    library(zoo) 
    sent <- coredata(sent)[rep(seq(nrow(sent)), nRep), ] 
    

    2B: Keep a reference solution: you will be changing your code, and few activities are as amazingly effective at introducing bugs as optimizing code!

    sentRef <- sent 
    

    Then add the same line, commented out, at the end of your code so you remember where your reference is. To make it even easier to check that you are not breaking your code, you can add an automatic test at the end:

    library("testthat") 
    expect_equal(sent,sentRef) 
    

    2C: Wrap the profiler around the code under investigation:

    Rprof() 
    SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2)) 
    Rprof(NULL) 
    

    2D: View the results, with base R:

    summaryRprof() 
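
    summaryRprof() returns a list; its $by.self component ranks functions by the time spent in their own code, which is usually the quickest way to spot the hot spot. A minimal sketch:

    prof <- summaryRprof() 
    head(prof$by.self)  # functions ranked by time spent in their own code 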
    

    There are also better tools; check out the profr or lineprof packages.

    lineprof is my preferred tool; its real added value here is narrowing the problem down to these two lines:

    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words 
    count <- length(grep(matchWords,sentence)) # count them 
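
    For the record, a minimal lineprof session could look like the sketch below. lineprof needs source references, so the scoring function should be sourced from a file; the file name score.R here is just a placeholder:

    # devtools::install_github("hadley/lineprof")  # lineprof lives on GitHub 
    library(lineprof) 
    source("score.R")  # placeholder file containing scoreSentence2; sourcing keeps srcrefs 
    l <- lineprof(unlist(lapply(sent$words, scoreSentence2))) 
    l  # prints per-line timings and allocations 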
    
  3. Fix it.

    3.1 Luckily, the main problem is fairly simple: you don't need the first line of the loop body to be inside the function, so move it out beforehand. The same goes for your paste0() call, by the way. Your code becomes:

    matchWords <- paste("\\<", wordsDF[,1], '\\>', sep="")  # exact-word patterns, precomputed 
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')  # removal patterns, precomputed 
    
    # Sentiment score function 
    scoreSentence2 <- function(sentence){ 
        score <- 0 
        for(x in 1:nrow(wordsDF)){ 
            count <- length(grep(matchWords[x], sentence))  # count them 
            if(count){ 
                score <- score + (count * wordsDF[x,2])  # compute score (count * sentValue) 
                sentence <- gsub(matchedWords[x], ' ', sentence)  # remove matched words 
                # sentence <- rm_white(sentence)  # qdapRegex; not needed once gsub inserts single spaces 
            } 
        } 
        score 
    } 
    

    This changes the execution time for 1,000 reps from 5.64 s to 2.32 s. Not a bad investment!
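
    Per step 2B, it is worth confirming that the optimized function still reproduces the reference result before trusting the speedup. A minimal sketch, where SentimentScoreRef is a hypothetical name for the scores saved from the original slow version:

    # Re-check the optimized scores against the saved reference (step 2B) 
    library(testthat) 
    SentimentScoreNew <- unlist(lapply(sent$words, scoreSentence2)) 
    expect_equal(SentimentScoreNew, SentimentScoreRef) 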

    3.2 The next bottleneck is the 'count <-' line, but I think Shadow has just given the right answer for that :-) Combining the two, we get:

    matchWords <- paste("\\<", wordsDF[,1], '\\>', sep="")  # exact-word patterns, precomputed 
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')  # removal patterns, precomputed 
    
    # Sentiment score function 
    scoreSentence2 <- function(sentence){ 
        score <- 0 
        for(x in 1:nrow(wordsDF)){ 
            count <- grepl(matchWords[x], sentence)  # TRUE/FALSE: does the entry occur? 
            score <- score + (count * wordsDF[x,2])  # compute score (count * sentValue) 
            sentence <- gsub(matchedWords[x], ' ', sentence)  # remove matched words 
            # sentence <- rm_white(sentence)  # qdapRegex; not needed once gsub inserts single spaces 
        } 
        score 
    } 
    

This brings it down to 0.18 s, a 31x speedup overall...


Awesome! Thank you very much, sir, you helped me a lot. Your approach is the best solution for my task. – martinkabe 2015-02-23 13:51:40


You can easily vectorize your scoreSentence2 function, since grep, grepl, and gsub are already vectorized over the sentence vector: the loop then runs once per dictionary entry over all sentences at once, instead of once per entry per sentence:

library(qdapRegex)  # for rm_white() 

scoreSentence <- function(sentence){ 
    score <- rep(0, length(sentence)) 
    for(x in 1:nrow(wordsDF)){ 
        matchWords <- paste("\\<", wordsDF[x,1], '\\>', sep="")  # matching exact words 
        count <- grepl(matchWords, sentence)  # TRUE/FALSE per sentence 
        score <- score + (count * wordsDF[x,2])  # compute score (count * sentValue) 
        sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*'), ' ', sentence)  # remove matched words 
        sentence <- rm_white(sentence)  # collapse leftover whitespace 
    } 
    return(score) 
} 
scoreSentence(sent$words) 

Note that count does not actually count how many times an expression appears in a sentence (neither in your version nor in mine); it only tells you whether the expression appears at all. If you want to actually count the occurrences, you could use the following code.

count <- sapply(gregexpr(matchWords, sentence), function(x) length(x[x>0])) 
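
For example, a quick check of the counting idea (gregexpr() returns -1 when there is no match, hence the x > 0 filter):

# "good" occurs twice here, and gregexpr() reports both match positions 
sapply(gregexpr("\\<good\\>", "good good bad"), function(x) length(x[x > 0])) 
# [1] 2 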

Very nice, thank you very much, it is much faster now (3x faster). – martinkabe 2015-02-23 12:38:03