2012-09-30

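Aggregate a value for each level of a factor over all levels except the current one

For each level of a factor, I need to extract a value aggregated over the subset of the data that excludes the current level. For example, several subjects perform a reaction time task over several days, and I need to compute the mean reaction time over all subjects and all days, but excluding the subject for whom the mean is being computed. Currently, I do it this way: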

library(lme4)   # provides the sleepstudy data
library(plyr)   # provides ddply() and summarise()
ddply(sleepstudy, .(Subject, Days), summarise, 
     avg_rt = mean(sleepstudy[sleepstudy$Subject != Subject & 
        sleepstudy$Days == Days,"Reaction"]), .progress="text") 

It works for small datasets, but it can be slow for large ones. Is there a way to do this faster?

Answers

#create big dataset: 4 subjects, n/4 days, one row per Subject x Days 
n <- 1e4 
set.seed(1) 
sleepstudy <- data.frame(Reaction=rnorm(n),Subject=1:4,Days=rep(1:(n/4),each=4)) 


library(plyr) 
system.time(
    res <- ddply(sleepstudy, .(Subject, Days), summarise, 
       avg_rt = mean(sleepstudy[sleepstudy$Subject != Subject & 
       sleepstudy$Days == Days,"Reaction"])) 
) 
#User  System  elapsed 
#6.532  0.013  6.556 

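#use data.table for big datasets 
library(data.table) 

dt <- as.data.table(sleepstudy) 
system.time({
    # per-day mean and per-day row count, added by reference
    dt[, avg_rt := mean(Reaction), by = Days]
    dt[, n := .N, by = Days]
    # leave-one-out mean: take the row's own Reaction out of the day total
    # (assumes exactly one row per Subject x Days combination)
    dt[, avg_rt := (avg_rt * n - Reaction) / (n - 1)]
})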
#User  System  elapsed 
#0.005  0.001  0.005 


#test if results are equal 
dt2 <- as.data.table(res) 
setkey(dt2,Subject,Days) 
setkey(dt,Subject,Days) 
all.equal(dt[,avg_rt],dt2[,avg_rt]) 
#[1] TRUE 

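The speed gain should be even more pronounced for really big datasets. I could not run the comparison on a larger dataset because ddply is too slow.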
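The arithmetic behind the speed-up can be checked without any packages. The following is a minimal base R sketch, not part of the original answer: it uses a small made-up data frame `d` and `ave()` (both chosen here for illustration) to compute the per-day sum and count, removes each row's own value, and compares the result with a brute-force filter. It assumes exactly one row per Subject x Days combination.

```r
# Toy data (hypothetical values): 3 subjects x 2 days, one row per cell
d <- data.frame(
  Subject  = rep(1:3, times = 2),
  Days     = rep(1:2, each = 3),
  Reaction = c(10, 20, 30, 40, 50, 90)
)

# Per-day sum and count, broadcast back to every row
day_sum <- ave(d$Reaction, d$Days, FUN = sum)
day_n   <- ave(d$Reaction, d$Days, FUN = length)

# Leave-one-out mean: drop the row's own value from the day total
d$avg_rt <- (day_sum - d$Reaction) / (day_n - 1)

# Brute-force version of the same quantity, one filter per row
brute <- sapply(seq_len(nrow(d)), function(i) {
  mean(d$Reaction[d$Days == d$Days[i] & d$Subject != d$Subject[i]])
})

stopifnot(isTRUE(all.equal(d$avg_rt, brute)))
```

The grouped version touches each row a constant number of times, while the brute-force version rescans the whole data frame for every row, which is why the gap widens as the data grows.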


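Thanks, it works well. Even when ddply uses the same algorithm, data.table is still faster. It is also possible to handle multiple observations per Subject x Days combination in a similar way: 'dt[,sn:=.N,by=c("Subject","Days")]; dt[,s_avg_rt:=ifelse(sn==1,Reaction,.SD[,sum(Reaction)]),by=c("Subject","Days")]; dt[,avg_rt1:=(avg_rt*n-s_avg_rt)/(n-sn)];' –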
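One way to spell out the multiple-observations case is sketched below. This is an illustrative variant rather than the commenter's exact code: the toy data and the column names `day_sum`, `day_n`, `subj_sum`, `subj_n` and `avg_rt_others` are made up for this example. The idea is to subtract the subject's whole contribution (sum and count) from the day totals instead of a single row.

```r
library(data.table)

# Toy data (hypothetical values): subject 1 has two observations on day 1
dt <- data.table(
  Subject  = c(1, 1, 2, 3, 1, 2, 3),
  Days     = c(1, 1, 1, 1, 2, 2, 2),
  Reaction = c(10, 30, 20, 40, 50, 60, 70)
)

# Totals per day
dt[, `:=`(day_sum = sum(Reaction), day_n = .N), by = Days]

# Totals per subject within each day
dt[, `:=`(subj_sum = sum(Reaction), subj_n = .N), by = .(Subject, Days)]

# Mean over all other subjects on the same day
dt[, avg_rt_others := (day_sum - subj_sum) / (day_n - subj_n)]
```

With a single observation per Subject x Days cell, `subj_sum` equals `Reaction` and `subj_n` equals 1, so this reduces to the formula used in the answer. If a day contains only one subject, the denominator is zero and the result is `NaN`, which may need separate handling.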


@Andrey Chetverikov I made a small change that greatly improves performance. – Roland


@Roland Nice! –


Maybe this is faster with lapply and aggregate:

do.call("rbind", (lapply(unique(sleepstudy$Subject), 
         function(x) 
          cbind(Subject = x, 
           aggregate(Reaction ~ Days, 
              subset(sleepstudy, Subject != x), 
              mean))))) 

Update:

I compared both with system.time, and the original turns out to be slower.

library(lme4) 
library(plyr) 

system.time(
ddply(sleepstudy, .(Subject, Days), summarise, 
     avg_rt = mean(sleepstudy[sleepstudy$Subject != Subject & 
        sleepstudy$Days == Days,"Reaction"])) 
) 

    # user system elapsed 
    # 0.17 0.00 0.22 

system.time(
do.call("rbind", (lapply(unique(sleepstudy$Subject), 
         function(x) 
          cbind(Subject = x, 
           aggregate(Reaction ~ Days, 
              subset(sleepstudy, Subject != x), 
              mean))))) 
) 


    # user system elapsed 
    # 0.12 0.00 0.12 

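For small datasets this works better than the original, but for large datasets the original is still better. http://pastebin.com/Zb4CaJrN For 184320 rows, it takes 6.041s for the original and 10.96s for lapply and aggregate. –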