2016-12-16

R: Rolling/sliding window and distinct count in R over a sliding number of days

I have the dataset below:

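# Note: R >= 3.6.0 changed sample()'s default algorithm, so the same seed
# may not reproduce the exact rows printed below.
set.seed(1)
transaction_date <- sample(seq(as.Date('2016/01/01'), as.Date('2016/02/01'), by = "day"), 24)
set.seed(1)
df <- data.frame(categ = paste0("Categ", rep(1:2, 12)),
                 prod = sample(paste0("Prod", rep(1:3, 8))),
                 customer_id = paste0("customer ", 1:24),
                 transaction_date = transaction_date)
# Order by category, product, date, then customer (full column name, no partial matching)
df_ordered <- df[order(df$categ, df$prod, df$transaction_date, df$customer_id), ]
df_ordered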

categ prod customer_id transaction_date 
1 Categ1 Prod1 customer 1  2016-01-09 
3 Categ1 Prod1 customer 3  2016-01-18 
19 Categ1 Prod1 customer 19  2016-01-28 
7 Categ1 Prod1 customer 7  2016-01-29 
5 Categ1 Prod2 customer 5  2016-01-06 
23 Categ1 Prod2 customer 23  2016-01-07 
13 Categ1 Prod2 customer 13  2016-01-14 
9 Categ1 Prod2 customer 9  2016-01-16 
15 Categ1 Prod2 customer 15  2016-01-20 
21 Categ1 Prod2 customer 21  2016-01-24 
11 Categ1 Prod3 customer 11  2016-01-05 
17 Categ1 Prod3 customer 17  2016-01-31 
10 Categ2 Prod1 customer 10  2016-01-02 
20 Categ2 Prod1 customer 20  2016-01-11 
24 Categ2 Prod1 customer 24  2016-01-23 
16 Categ2 Prod1 customer 16  2016-02-01 
12 Categ2 Prod2 customer 12  2016-01-04 
4 Categ2 Prod2 customer 4  2016-01-27 
22 Categ2 Prod3 customer 22  2016-01-03 
14 Categ2 Prod3 customer 14  2016-01-08 
2 Categ2 Prod3 customer 2  2016-01-12 
18 Categ2 Prod3 customer 18  2016-01-15 
8 Categ2 Prod3 customer 8  2016-01-17 
6 Categ2 Prod3 customer 6  2016-01-25 

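I need the count of unique customers over a 12-day window, starting from the first (minimum) transaction_date observed for each categ and prod combination.

The window should slide over the 12 days before the current transaction date and count all transactions that fall in that bucket. Below is the output I am trying to create. I want to avoid loops for this task.

[Image: desired output table (not available)]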

Possible duplicate of [Relative windowed running sum through data.table non-equi join](http://stackoverflow.com/questions/41007099/relative-windowed-running-sum-through-data-table-non-equi-join) – ExperimenteR

Answer

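This can be done with rollapply from zoo together with dplyr. First, we fill in all missing dates for every group using expand.grid and merge, so that each group has a continuous daily series. Then we group by category and product, arrange by date, and apply a rolling window to the values in customer_id. The function applied at each step returns the length of the vector of unique values, with NAs removed. Finally, we filter out the added dates again, that is, the rows where customer_id is not available.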

library(dplyr)
library(zoo)

# Recreate the data from the question
set.seed(1)
transaction_date <- sample(seq(as.Date('2016/01/01'), as.Date('2016/02/01'), by = "day"), 24)
set.seed(1)
df <- data.frame(categ = paste0("Categ", rep(1:2, 12)),
                 prod = sample(paste0("Prod", rep(1:3, 8))),
                 customer_id = paste0("customer ", 1:24),
                 transaction_date = transaction_date)

# Every categ/prod/date combination, so each group gets a continuous daily series
all_combinations <- expand.grid(categ = unique(df$categ),
                                prod = unique(df$prod),
                                transaction_date = seq(min(df$transaction_date), max(df$transaction_date), by = "day"))

# Full outer join: dates without a transaction get customer_id = NA
df <- merge(df, all_combinations, by = c('categ', 'prod', 'transaction_date'), all = TRUE)

res <- df %>%
  group_by(categ, prod) %>%
  arrange(transaction_date) %>%
  # 12-row (= 12-day) window; align = 'left' starts each window at the current date
  mutate(ucust = rollapply(customer_id, width = 12,
                           FUN = function(x) length(unique(x[!is.na(x)])),
                           partial = TRUE, align = 'left')) %>%
  filter(!is.na(customer_id))
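To sanity-check the rolling result on a small case, a base R sketch like the following may help. It is not part of the approach above: the toy data, the helper name distinct_in_window, and the choice of a look-back window (the 12 days ending at each row's date, which is how I read the question, whereas align = 'left' looks forward) are all assumptions for illustration. It loops implicitly via vapply and compares every row with every other row, so it is meant as a cross-check on small data rather than a loop-free solution.

```r
# Toy data (hypothetical, not the dataset from the question)
toy <- data.frame(
  grp  = c("A", "A", "A", "A", "B", "B"),
  cust = c("c1", "c2", "c1", "c3", "c4", "c4"),
  date = as.Date(c("2016-01-01", "2016-01-05", "2016-01-12",
                   "2016-01-20", "2016-01-02", "2016-01-10")),
  stringsAsFactors = FALSE
)

# For each row, count distinct customers in the same group whose date falls
# in the `days`-day window ending at (and including) that row's date.
# Assumption: look-back window; flip the comparisons for a forward window.
distinct_in_window <- function(grp, cust, date, days = 12) {
  vapply(seq_along(date), function(i) {
    in_window <- grp == grp[i] &
      date <= date[i] &
      date > date[i] - days
    length(unique(cust[in_window]))
  }, integer(1))
}

toy$ucust <- distinct_in_window(toy$grp, toy$cust, toy$date)
toy$ucust
# 1 2 2 2 1 1
```

For example, the third row (group A, 2016-01-12) sees c1, c2 and c1 again within its 12 days, so it counts 2 distinct customers.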
Sorry, I was too quick; existing dates were duplicated. I have corrected it now. – mpjdem