2013-11-24 28 views
1

我想重新编写下面稍微复杂的plyr命令,以便它更快,并使用聚合,tapply或data.table。用集合重写复杂的ddply

该功能允许您输入多个ID变量并测量变量,然后返回几个计算结果。但是,在更大的数据集上,它可能不是最有效的。


这里是代码...

require(ggplot2) # to get the diamonds data set 
require(plyr) 

mean_sd_for_several_variables <- function(df, lvls, measures) { 
    res <- ddply(df, lvls, function(x) { 
       ret <- vector() 
       for(measure in measures) { 
        mean_sd <- c(mean(x[,measure]), sd(x[,measure])) 
        names(mean_sd) <- c(paste0("mean_", measure), paste0("sd_", measure)) 
        ret <- c(ret, mean_sd) 
       } 
       return(ret) 
       } 
) 

    print(res) 
} 

...返回:

mean_sd_for_several_variables(diamonds, c("color", "cut"), c("price","depth")) 

    color  cut mean_price sd_price mean_depth sd_depth 
1  D  Fair  4291.1 3286.1  64.048 3.29220 
2  D  Good  3405.4 3175.1  62.366 2.22240 
3  D Very Good  3470.5 3523.8  61.750 1.46223 
4  D Premium  3631.3 3711.6  61.169 1.15806 
5  D  Ideal  2629.1 3001.1  61.678 0.71201 
6  E  Fair  3682.3 2976.7  63.320 4.42103 
7  E  Good  3423.6 3330.7  62.204 2.23059 
8  E Very Good  3214.7 3408.0  61.730 1.42377 
9  E Premium  3538.9 3795.0  61.176 1.16454 
10  E  Ideal  2597.6 2956.0  61.687 0.70718 
11  F  Fair  3827.0 3223.3  63.508 3.70209 
12  F  Good  3495.8 3202.4  62.202 2.23976 
13  F Very Good  3778.8 3786.1  61.722 1.38939 
14  F Premium  4324.9 4012.0  61.260 1.16775 
15  F  Ideal  3374.9 3766.6  61.676 0.69398 
16  G  Fair  4239.3 3609.6  64.340 3.57340 
17  G  Good  4123.5 3702.5  62.527 2.03893 
18  G Very Good  3872.8 3861.4  61.841 1.33169 
19  G Premium  4500.7 4356.6  61.279 1.15341 
20  G  Ideal  3720.7 4006.3  61.700 0.68714 
21  H  Fair  5135.7 3886.5  64.585 3.14173 
22  H  Good  4276.3 4020.7  62.500 2.09212 
23  H Very Good  4535.4 4185.8  61.968 1.31895 
24  H Premium  5216.7 4466.2  61.322 1.15164 
25  H  Ideal  3889.3 4013.4  61.733 0.72939 
26  I  Fair  4685.4 3730.3  64.221 3.68771 
27  I  Good  5078.5 4631.7  62.475 2.17958 
28  I Very Good  5255.9 4687.1  61.935 1.32890 
29  I Premium  5946.2 5053.7  61.329 1.15338 
30  I  Ideal  4452.0 4505.2  61.794 0.72334 
31  J  Fair  4975.7 4050.5  64.357 3.31595 
32  J  Good  4574.2 3707.8  62.396 2.12091 
33  J Very Good  5103.5 4135.7  61.902 1.33679 
34  J Premium  6294.6 4788.9  61.390 1.13989 
35  J  Ideal  4918.2 4476.2  61.822 0.94669 
+0

嗯,没有太多的时间正确的认识,只是觉得我做了我的gateveys包类似的东西。 'calcShares'函数在这里:https://github.com/mbannert/gateveys/blob/master/R/gateveys.R以类似的方式使用data.table。不知道这是否对你更直观,但至少是data.table。 –

回答

4

使用aggregate

> result <- aggregate(cbind(price, depth) ~ color+cut, 
        FUN=function(x) c(mean=mean(x), sd=sd(x)), 
        data=diamonds) 
> do.call(data.frame, result) 
    color  cut price.mean price.sd depth.mean depth.sd 
1  D  Fair 4291.061 3286.114 64.0484663 3.2921972 
2  E  Fair 3682.312 2976.652 63.3196429 4.4210329 
3  F  Fair 3827.003 3223.303 63.5080128 3.7020938 
4  G  Fair 4239.255 3609.644 64.3398089 3.5733985 
5  H  Fair 5135.683 3886.482 64.5851485 3.1417311 
+0

嗯..我得到以下错误:错误在FUN(X [[1L]],...):数组元素不匹配 –

+0

再试一次,我没有得到任何错误。 –

+0

抱歉。这是一个名为memisc的软件包。哎呀! –

4

这里是data.table解决

mean_and_sd <- function(.SD){ 
    x1 = lapply(.SD, mean) 
    x2 = lapply(.SD, sd) 
    cbind(x1, x2) 
} 

library(data.table) 
DT = data.table(diamonds) 
DT[, mean_and_sd(.SD), by = c("cut", "color"), .SDcols = c("price", "carat")] 

你可以抛出此变成一个接受d的函数预期的输入并返回适当的数据帧。

+0

请注意,'.SD'针对'j'中的lapply(.SD,...)进行了优化,因此随着'by'中组的数量的增加,这将会变慢。我可以想到的另一种方式是:DT [,unlist(lapply(.SD,function(x)list(mean = mean(x),sd = sd(x))),rec = FALSE)by = list (剪切,颜色),.SDcols = c(“价格”,“克拉”)](随着组数的增加,时间差异将更加显着)。 – Arun

+1

谢谢@阿伦。我在'lapply'里面如何添加'mean'和'sd'。这个选择听起来很棒。 – Ramnath