2012-10-27 71 views
1

Possible duplicate:
How to speed up cummulative sum within group?

Find the maximum value for each ID, grouped by date

In the following data frame:

id<-c(1,1,1,1,1,3,3,3,3) 
spent<-c(10,20,30,40,50,60,70,80,90) 
date<-c("11-11-07","11-11-07","23-11-07","12-12-08","17-12-08","11-11-07","23-11-07","23-11-07","16-01-08") 
df<-data.frame(id,date,spent) 
df$date2<-as.Date(as.character(df$date), format = "%d-%m-%y") 

    id  date spent  date2 
1 1 11-11-07 10 2007-11-11 
2 1 11-11-07 20 2007-11-11 
3 1 23-11-07 30 2007-11-23 
4 1 12-12-08 40 2008-12-12 
5 1 17-12-08 50 2008-12-17 
6 3 11-11-07 60 2007-11-11 
7 3 23-11-07 70 2007-11-23 
8 3 23-11-07 80 2007-11-23 
9 3 16-01-08 90 2008-01-16 

I need to find the maximum spent for each id on each day and record it in a separate column, as follows:

id  date spent  date2 sum.spent 
1 1 11-11-07 10 2007-11-11 20 
2 1 11-11-07 20 2007-11-11 20 
3 1 23-11-07 30 2007-11-23 30 
4 1 12-12-08 40 2008-12-12 40 
5 1 17-12-08 50 2008-12-17 50 
6 3 11-11-07 60 2007-11-11 60 
7 3 23-11-07 70 2007-11-23 80 
8 3 23-11-07 80 2007-11-23 80 
9 3 16-01-08 90 2008-01-16 90 

Can anyone help me?

+0

You can just take the answer to [your other question](http://stackoverflow.com/q/13081821/567015) and replace 'cumsum' with 'max'. The two questions are essentially identical. –

+0

@SachaEpskamp, I hadn't seen that. Since it is conceptually the same as the earlier question, I am voting to close this one. – A5C1D2H2I1M1N2O1R2T1

+0

There is no need. This is a clearly asked question that already has an accepted answer. –

Answers

4

Here is a plyr answer:

library(plyr) 
ddply(df, .(id, date), transform, sum.spent = max(spent)) 

Here is a data.table answer (better for larger datasets):

library(data.table) 
df <- data.table(df) 
df[, sum.spent:=max(spent), by = list(id, date)] 

5

Here is a simple approach using ave():

df$sum.spent <- ave(df$spent, df$id, df$date2, FUN = max) 
df 
# id  date spent  date2 sum.spent 
# 1 1 11-11-07 10 2007-11-11  20 
# 2 1 11-11-07 20 2007-11-11  20 
# 3 1 23-11-07 30 2007-11-23  30 
# 4 1 12-12-08 40 2008-12-12  40 
# 5 1 17-12-08 50 2008-12-17  50 
# 6 3 11-11-07 60 2007-11-11  60 
# 7 3 23-11-07 70 2007-11-23  80 
# 8 3 23-11-07 80 2007-11-23  80 
# 9 3 16-01-08 90 2008-01-16  90 
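
For readers wondering what ave() does here, the following is a rough base R sketch of the same split/apply/recombine idea written out by hand. It is only an illustration, not how ave() is necessarily implemented internally. The names g, pieces and maxima are made up for this example, and it groups on the original date strings rather than date2, which gives the same groups for this data.

```r
# Rebuild the data frame from the question
id    <- c(1, 1, 1, 1, 1, 3, 3, 3, 3)
spent <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)
date  <- c("11-11-07", "11-11-07", "23-11-07", "12-12-08", "17-12-08",
           "11-11-07", "23-11-07", "23-11-07", "16-01-08")
df <- data.frame(id, date, spent)

# One grouping factor per (id, date) pair; drop = TRUE discards
# combinations that never occur in the data
g <- interaction(df$id, df$date, drop = TRUE)

# Split spent into one vector per group
pieces <- split(df$spent, g)

# Replace every value in a group with that group's maximum
maxima <- lapply(pieces, function(v) rep(max(v), length(v)))

# Put the values back in the original row order
df$sum.spent <- unsplit(maxima, g)
df
```

The resulting sum.spent column should match the ave() output above.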

It is just as simple using data.table():

library(data.table) 
# data.table 1.8.2 For help type: help("data.table") 
dfDT <- data.table(df, key="id,date2") 
dfDT[, sum.spent:=max(spent), by=key(dfDT)] 
# id  date spent  date2 sum.spent 
# 1: 1 11-11-07 10 2007-11-11  20 
# 2: 1 11-11-07 20 2007-11-11  20 
# 3: 1 23-11-07 30 2007-11-23  30 
# 4: 1 12-12-08 40 2008-12-12  40 
# 5: 1 17-12-08 50 2008-12-17  50 
# 6: 3 11-11-07 60 2007-11-11  60 
# 7: 3 23-11-07 70 2007-11-23  80 
# 8: 3 23-11-07 80 2007-11-23  80 
# 9: 3 16-01-08 90 2008-01-16  90 
+0

I got a memory error, because I have nearly 1.5 million rows. – AliCivil

+1

@AliTamaddoni, then try the 'data.table()' solution I just posted. – A5C1D2H2I1M1N2O1R2T1