R: Aggregating history by ID and date

I have a large data set with a unique ID for each individual along with a date, and each individual can have multiple encounters.

Below is code that builds an example of what this data might look like:

strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16", "11/8/16",  
"6/8/16", "5/8/16","2/3/16","1/1/16") 
Date<-as.Date(strDates, "%m/%d/%y") 
ID <- c("A", "A", "A", "A","A","B","B","B","B","B") 
Event <- c(1,0,1,0,1,0,1,1,1,0) 
sample_df <- data.frame(Date,ID,Event) 

sample_df 

     Date ID Event 
1 2016-09-09 A  1 
2 2016-06-07 A  0 
3 2016-05-06 A  1 
4 2016-02-03 A  0 
5 2016-02-01 A  1 
6 2016-11-08 B  0 
7 2016-06-08 B  1 
8 2016-05-08 B  1 
9 2016-02-03 B  1 
10 2016-01-01 B  0

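I want to keep all the ancillary information for each encounter, but then aggregate the following historical information by ID:

Number of previous encounters
Number of previous events

As an example, let's look at row 2.

Row 2 is ID A, so I would reference rows 3-5 (which occurred before the row 2 encounter). Within this group of rows, we see that rows 3 and 5 both had events.

Number of previous encounters for row 2 = 3

Number of previous events for row 2 = 2

Ideally, I would get the following output: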

  Date ID Event PrevEnc PrevEvent 
1 2016-09-09 A  1  4   2 
2 2016-06-07 A  0  3   2 
3 2016-05-06 A  1  2   1 
4 2016-02-03 A  0  1   1 
5 2016-02-01 A  1  0   0 
6 2016-11-08 B  0  4   3 
7 2016-06-08 B  1  3   2 
8 2016-05-08 B  1  2   1 
9 2016-02-03 B  1  1   0 
10 2016-01-01 B  0  0   0

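So far I have tried to solve this in dplyr with mutate and summarise, and neither has let me restrict the aggregation to events that occurred earlier for a specific ID. I have also tried some messy for loops with if-then statements, but I am really just wondering whether there is a package or technique that simplifies this process.

Thanks!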

Source

2016-11-11 EntryLevelR

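The biggest hurdle is the current sort order. Here, I store the original row index, use it later to restore the original order, and then drop it. Beyond that, the basic idea is to count encounters starting from 0 and to use cumsum to count the events that have occurred. lag is used so that the current event is not counted.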

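library(dplyr)

sample_df %>%
    mutate(origIndex = 1:n()) %>%            # remember the original row order
    group_by(ID) %>%
    arrange(ID, Date) %>%                    # earliest encounter first within each ID
    mutate(PrevEncounters = 0:(n() - 1)
         , PrevEvents = cumsum(lag(Event, default = 0))) %>%
    arrange(origIndex) %>%                   # restore the original order
    select(-origIndex)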

gives

  Date  ID Event PrevEncounters PrevEvents 
     <date> <fctr> <dbl>   <int>  <dbl> 
1 2016-09-09  A  1    4   2 
2 2016-06-07  A  0    3   2 
3 2016-05-06  A  1    2   1 
4 2016-02-03  A  0    1   1 
5 2016-02-01  A  1    0   0 
6 2016-11-08  B  0    4   3 
7 2016-06-08  B  1    3   2 
8 2016-05-08  B  1    2   1 
9 2016-02-03  B  1    1   0 
10 2016-01-01  B  0    0   0
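
To see why lag is needed, it can help to trace the two steps on a single ID. The sketch below uses the Event values for ID A from the question, already sorted from earliest to latest date; the names event_a and shifted are only for illustration.

```r
library(dplyr)

# Event values for ID A, ordered from earliest to latest date
event_a <- c(1, 0, 1, 0, 1)

# Shift every value down one position so each row only sees earlier rows
shifted <- lag(event_a, default = 0)
shifted
# [1] 0 1 0 1 0

# Running total of the earlier events
cumsum(shifted)
# [1] 0 1 1 2 2
```

Without lag, cumsum(event_a) would include each row's own event and overcount by one on every row where Event is 1.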

Source

2016-11-11 16:19:30

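Isn't `0:(n() - 1)` just `row_number() - 1L`? Also, I guess the original index could be `row_number()`. – Frank

Yes, @Frank - those should be equivalent. I don't know why I don't use `row_number()` more often. Probably just a lazy habit. –

Thank you, this is a very helpful way of looking at it! lag is def. something I didn't know about, and I'm glad to have picked it up now! – EntryLevelR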
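
Following the suggestion in the comments, the same pipeline might be written with row_number() in place of 1:n() and 0:(n() - 1). This is only a sketch, rebuilt on the sample data from the question; the result name res_rn is made up for illustration.

```r
library(dplyr)

# Rebuild the sample data from the question
strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16",
              "11/8/16", "6/8/16", "5/8/16", "2/3/16", "1/1/16")
sample_df <- data.frame(
  Date  = as.Date(strDates, "%m/%d/%y"),
  ID    = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  Event = c(1, 0, 1, 0, 1, 0, 1, 1, 1, 0)
)

res_rn <- sample_df %>%
  mutate(origIndex = row_number()) %>%   # remember the original row order
  arrange(ID, Date) %>%                  # earliest encounter first within each ID
  group_by(ID) %>%
  mutate(PrevEncounters = row_number() - 1L,
         PrevEvents = cumsum(lag(Event, default = 0))) %>%
  ungroup() %>%
  arrange(origIndex) %>%                 # restore the original order
  select(-origIndex)

res_rn
```

The ungroup() call is an addition here so that the final arrange and select run on an ungrouped data frame.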

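As @Frank and @MarkPeterson pointed out, the biggest hurdle here is that the Date column is in descending order. Here is another approach that does not require re-sorting by Date: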

library(dplyr) 
res <- sample_df %>% group_by(ID) %>% 
        mutate(PrevEnc=n()-row_number(), 
          PrevEvent=rev(cumsum(lag(rev(Event), default=0))))

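Here, we use row_number() to get the row index and n() to get the number of rows, both within each ID group. Since Date is sorted in descending order, the number of previous encounters is simply n()-row_number(). To compute the number of previous events, we again rely on the Date column being sorted in descending order: rev reverses the Event column before lag and cumsum are applied, and a second rev puts the result back in the original order.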
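
The reverse, lag, cumulate, reverse sequence is easier to follow on a small vector. The sketch below uses a made-up Event vector for a single ID, listed from latest to earliest date. It is not taken from the question's data, whose Event values for each ID happen to read the same in both directions and so would not show the effect of rev.

```r
library(dplyr)

# Hypothetical events for one ID, ordered latest first
event_desc <- c(1, 1, 0, 0)

step1 <- rev(event_desc)          # earliest first:                  0 0 1 1
step2 <- lag(step1, default = 0)  # exclude each row's own event:    0 0 0 1
step3 <- cumsum(step2)            # running count of earlier events: 0 0 0 1
rev(step3)                        # back to latest first
# [1] 1 0 0 0
```

Only the latest row has an earlier event behind it, which is what the final vector reports.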

Using your data:

print(res) 
##Source: local data frame [10 x 5] 
##Groups: ID [2] 
## 
##   Date  ID Event PrevEnc PrevEvent 
##  <date> <fctr> <dbl> <int>  <dbl> 
##1 2016-09-09  A  1  4   2 
##2 2016-06-07  A  0  3   2 
##3 2016-05-06  A  1  2   1 
##4 2016-02-03  A  0  1   1 
##5 2016-02-01  A  1  0   0 
##6 2016-11-08  B  0  4   3 
##7 2016-06-08  B  1  3   2 
##8 2016-05-08  B  1  2   1 
##9 2016-02-03  B  1  1   0 
##10 2016-01-01  B  0  0   0

Source

2016-11-11 16:28:19 aichao

Alternatively, if you want to try data.table, you can use this:

library(data.table) 

# Convert to data.table and sort 
sample_dt <- as.data.table(sample_df) 
sample_dt <- sample_dt[order(Date)] 

# Count only the previous Events with 1 
sample_dt[, prevEvent := ifelse(Event == 1, cumsum(Event) - 1, cumsum(Event)), by = "ID"] 

# .I gives the row number, and .SD contains the Subset of the Data for each group 
sample_dt[, prevEnc := .SD[,.I - 1], by = "ID"] 

print(sample_dt) 
      Date ID Event prevEvent prevEnc 
1: 2016-01-01 B  0   0  0 
2: 2016-02-01 A  1   0  0 
3: 2016-02-03 A  0   1  1 
4: 2016-02-03 B  1   0  1 
5: 2016-05-06 A  1   1  2 
6: 2016-05-08 B  1   1  2 
7: 2016-06-07 A  0   2  3 
8: 2016-06-08 B  1   2  3 
9: 2016-09-09 A  1   2  4 
10: 2016-11-08 B  0   3  4

If you are not familiar with this package, there is a nice cheat sheet that covers most of its operations.

Source

2016-11-11 17:07:36

Instead of calculating `cumsum(Event)` twice, why not just `cumsum(Event) - (Event == 1)`? – MichaelChirico

@MichaelChirico Good point. I hadn't thought of that. –
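
The simplification from the comment above could be sketched as follows, rebuilt on the sample data from the question. Using setorder() and seq_len(.N) - 1L for the encounter count is a substitution made here, not part of the original answer.

```r
library(data.table)

# Rebuild the sample data from the question as a data.table
strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16",
              "11/8/16", "6/8/16", "5/8/16", "2/3/16", "1/1/16")
sample_dt <- data.table(
  Date  = as.Date(strDates, "%m/%d/%y"),
  ID    = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  Event = c(1, 0, 1, 0, 1, 0, 1, 1, 1, 0)
)

# Sort by reference, earliest date first
setorder(sample_dt, Date)

# One cumsum per group: subtract the current row's own event (Event is 0/1)
sample_dt[, `:=`(
  prevEvent = cumsum(Event) - (Event == 1),
  prevEnc   = seq_len(.N) - 1L          # 0, 1, 2, ... within each ID
), by = ID]

print(sample_dt)
```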
