2011-02-17 55 views
17

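Joining aggregated values back to the original data frame

One of the design patterns I use over and over is performing a "group by" or "split, apply, combine (SAC)" on a data frame and then joining the aggregated data back to the original data. This is useful, for example, when calculating each county's deviation from the state mean in a data frame with many states and counties. My aggregate calculation is rarely just a simple mean, but it makes a good example. I often solve this problem the following way: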

require(plyr) 
set.seed(1) 

## set up some data 
group1 <- rep(1:3, 4) 
group2 <- sample(c("A","B","C"), 12, replace=TRUE) 
values <- rnorm(12) 
df <- data.frame(group1, group2, values) 

## got some data, so let's aggregate 

group1Mean <- ddply(df, "group1", function(x) 
        data.frame(meanValue = mean(x$values))) 
df <- merge(df, group1Mean) 
df 

This produces nicely aggregated data like the following:

> df 
   group1 group2   values meanValue
1       1      A  0.48743 -0.121033
2       1      A -0.04493 -0.121033
3       1      C -0.62124 -0.121033
4       1      C -0.30539 -0.121033
5       2      A  1.51178  0.004804
6       2      B  0.73832  0.004804
7       2      A -0.01619  0.004804
8       2      B -2.21470  0.004804
9       3      B  1.12493  0.758598
10      3      C  0.38984  0.758598
11      3      B  0.57578  0.758598
12      3      A  0.94384  0.758598

This works, but are there alternative ways of doing this that improve readability, performance, and so on?

See http://stackoverflow.com/questions/4998846/applying-an-aggregate-function-over-multiple-different-slices/5000040#5000040 – 2011-02-17 15:46:40

Answers

18

One line of code does the trick:

new <- ddply(df, "group1", transform, numcolwise(mean)) 
new 

   group1 group2      values    meanValue
1       1      A  0.48742905 -0.121033381
2       1      A -0.04493361 -0.121033381
3       1      C -0.62124058 -0.121033381
4       1      C -0.30538839 -0.121033381
5       2      A  1.51178117  0.004803931
6       2      B  0.73832471  0.004803931
7       2      A -0.01619026  0.004803931
8       2      B -2.21469989  0.004803931
9       3      B  1.12493092  0.758597929
10      3      C  0.38984324  0.758597929
11      3      B  0.57578135  0.758597929
12      3      A  0.94383621  0.758597929

identical(df, new) 
[1] TRUE 
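
For comparison, here is a minimal sketch of the same `ddply` plus `transform` idea with the new column named explicitly, so the result does not depend on `df` already containing `meanValue`. It rebuilds the example data from scratch. The name `withMean` is just an illustrative choice, and the exact random values will depend on your R version.

```r
library(plyr)

set.seed(1)
df <- data.frame(
  group1 = rep(1:3, 4),
  group2 = sample(c("A", "B", "C"), 12, replace = TRUE),
  values = rnorm(12)
)

# transform() is applied to each group1 piece; the named argument becomes
# the new column and the single group mean is recycled over the piece's rows.
withMean <- ddply(df, "group1", transform, meanValue = mean(values))
head(withMean)
```

Like the `merge` version, the result comes back ordered by `group1`.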

I had forgotten all about `transform`. Hindsight is 20/20. But thank you for illustrating `numcolwise`, which I was not familiar with. – 2011-02-17 15:57:22

This is a nice idiom, but it gets tricky when some variables should be sums and others means. – richiemorrisroe 2012-08-14 17:49:17

@richiemorrisroe Is it any trickier than with any other idiom? – Andrie 2012-08-14 20:05:09
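
On the sums-versus-means point, one possible way to mix aggregates is to give each new column its own `ave()` call inside a single `transform()`. This is only a sketch in base R: the `counts` column and the names `meanValue` and `sumCounts` are hypothetical additions for illustration, not part of the original data.

```r
set.seed(1)
df <- data.frame(
  group1 = rep(1:3, 4),
  values = rnorm(12),
  counts = rpois(12, lambda = 5)  # hypothetical column that should be summed
)

# Each aggregate is computed per group and recycled back onto the rows,
# so the original row order is kept.
df2 <- transform(df,
  meanValue = ave(values, group1, FUN = mean),
  sumCounts = ave(counts, group1, FUN = sum)
)
head(df2)
```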

9

Can't you just add x to the data frame returned by the function you pass to ddply?

df <- ddply(df, "group1", function(x) 
      data.frame(x, meanValue = mean(x$values))) 
13

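I think ave() is more useful here than the plyr call you show (I'm not familiar enough with plyr to know whether you can do what you want with plyr directly, though I would be surprised if you can't) or the other base R alternatives (aggregate(), tapply()):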

> with(df, ave(values, group1, FUN = mean)) 
[1] -0.121033381 0.004803931 0.758597929 -0.121033381 0.004803931 
[6] 0.758597929 -0.121033381 0.004803931 0.758597929 -0.121033381 
[11] 0.004803931 0.758597929 

You can use within() or transform() to embed this result directly into df:

> df2 <- within(df, meanValue <- ave(values, group1, FUN = mean)) 
> head(df2) 
  group1 group2     values    meanValue
1      1      A  0.4874291 -0.121033381
2      2      B  0.7383247  0.004803931
3      3      B  0.5757814  0.758597929
4      1      C -0.3053884 -0.121033381
5      2      A  1.5117812  0.004803931
6      3      C  0.3898432  0.758597929
> df3 <- transform(df, meanValue = ave(values, group1, FUN = mean)) 
> all.equal(df2,df3) 
[1] TRUE 

And if the ordering is important:

> head(df2[order(df2$group1, df2$group2), ]) 
   group1 group2      values    meanValue
1       1      A  0.48742905 -0.121033381
10      1      A -0.04493361 -0.121033381
4       1      C -0.30538839 -0.121033381
7       1      C -0.62124058 -0.121033381
5       2      A  1.51178117  0.004803931
11      2      A -0.01619026  0.004803931
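
Building on the ave() approach, the "deviation from the group mean" use case mentioned in the question could be sketched in one step, without storing the mean at all. The column name `deviation` is an illustrative assumption.

```r
set.seed(1)
df <- data.frame(group1 = rep(1:3, 4), values = rnorm(12))

# Subtract each row's group mean directly; ave() returns a vector
# aligned with the original rows, so no merge is needed.
df$deviation <- with(df, values - ave(values, group1, FUN = mean))

# Sanity check: deviations within each group should sum to (numerically) zero.
tapply(df$deviation, df$group1, sum)
```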
13

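In terms of performance, you can do the same kind of operation with the data.table package, which has built-in aggregation and is very fast thanks to indices and a C-based implementation. For example, given that df already exists from your example: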

 
library("data.table") 
dt<-as.data.table(df) 
setkey(dt,group1) 
dt<-dt[,list(group2,values,meanValue=mean(values)),by=group1] 
dt 
      group1 group2      values   meanValue
 [1,]      1      A  0.82122120  0.18810771
 [2,]      1      C  0.78213630  0.18810771
 [3,]      1      C  0.61982575  0.18810771
 [4,]      1      A -1.47075238  0.18810771
 [5,]      2      B  0.59390132  0.03354688
 [6,]      2      A  0.07456498  0.03354688
 [7,]      2      B -0.05612874  0.03354688
 [8,]      2      A -0.47815006  0.03354688
 [9,]      3      B  0.91897737 -0.20205707
[10,]      3      C -1.98935170 -0.20205707
[11,]      3      B -0.15579551 -0.20205707
[12,]      3      A  0.41794156 -0.20205707

I have not benchmarked it, but in my experience it is a lot faster.

If you decide to go down the data.table road, which I think is worth exploring if you work with large data sets, you really need to read the docs, because there are some differences from data frames that can bite you if you are unaware of them. Notably, however, a data.table generally does work with any function expecting a data frame, because a data.table reports its class as data.frame as well (data.table inherits from data.frame).
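
The inheritance point can be checked directly. A small sketch, using made-up data, showing that a data.table also carries the data.frame class and can be passed to a base function written for data frames:

```r
library(data.table)

df <- data.frame(group1 = rep(1:3, 4), values = seq_len(12))
dt <- as.data.table(df)

class(dt)          # "data.table" "data.frame"
is.data.frame(dt)  # TRUE

# A base R function that expects a data frame accepts the data.table as-is.
colMeans(dt)
```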

[ Feb 2011 ]


[ Aug 2012 ] Update from Matthew :

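New in v1.8.2, released to CRAN in July 2012, is := by group. This is very similar to the answer above, but it adds the new column by reference to dt, so there is no copy and no need for a merge step or for relisting the existing columns to return them alongside the aggregate. There is no need to setkey first, and it copes with non-contiguous groups (i.e. groups whose rows are not stored together).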

This is significantly faster on large datasets, and has a simple and short syntax:

dt <- as.data.table(df) 
dt[, meanValue := mean(values), by = group1] 
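
If more than one grouped column is wanted, the functional form of `:=` can add several at once. A sketch with freshly built data; the `deviation` column is an illustrative assumption, not part of the original answer.

```r
library(data.table)

set.seed(1)
dt <- data.table(group1 = rep(1:3, 4), values = rnorm(12))

# `:=`(...) adds both columns by reference, computed within each group1;
# the rows stay in their original order.
dt[, `:=`(meanValue = mean(values),
          deviation = values - mean(values)),
   by = group1]
dt[]
```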
1

A dplyr possibility:

library(dplyr) 
df %>% 
    group_by(group1) %>% 
    mutate(meanValue = mean(values)) 

This returns the data frame in its original row order. If you want it ordered by "group1", add arrange(group1) to the pipe.
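
A sketch of the full pipe with the arrange(group1) step included, on freshly built data. The `ungroup()` call is an addition not in the answer above; it drops the grouping so that later operations on `result` are not silently applied per group.

```r
library(dplyr)

set.seed(1)
df <- data.frame(group1 = rep(1:3, 4), values = rnorm(12))

result <- df %>%
  group_by(group1) %>%
  mutate(meanValue = mean(values)) %>%  # group mean repeated on every row
  ungroup() %>%                         # drop grouping before further work
  arrange(group1)                       # order rows by group
result
```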
