2017-04-11 27 views

Mutate a column with dplyr using cumsum grouped by specific column information

I have a data frame as shown below:

WC  ASN TS Date  Loss 
7101 3 R 13-07-12 156.930422 
7101 3 R 02-08-12 168.401876 
7101 4 R 28-12-13 120.492081 
7101 4 R 16-10-15 46.012085 
7101 4 R 04-01-16 48.262409 
7101 21 L 01-12-12 30.750564 
7101 21 L 01-05-13 49.421243 
7101 21 L 04-06-13 87.294821 
7101 21 L 01-10-13 164.013138 

What I would like to do is use code like the following:

df %>% 
    select(WC, ASN, Date, Loss) %>% 
    group_by(WC) %>% 
    arrange(WC, ASN, Date, Loss) %>% 
    mutate(Days = Date - lag(Date)) 

to generate a new table like this:

WC  ASN TS Date  Loss  Days Loss_A 
7101 3 R 13-07-12 156.930422 0 156.930422 
7101 3 R 02-08-12 168.401876 20 325.332298 
7101 4 R 28-12-13 120.492081 0 120.492081 
7101 4 R 16-10-15 46.012085 657 166.504166 
7101 4 R 04-01-16 48.262409 80 214.766575 
7101 21 L 01-12-12 30.750564 0 30.750564 
7101 21 L 01-05-13 49.421243 151 80.171807 
7101 21 L 04-06-13 87.294821 34 167.466628 
7101 21 L 01-10-13 164.013138 119 331.479766 

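Here,

  1. For each WC, ASN and TS (an ordered combination, e.g. 7101, 3, 13-07-2012), Days should be 0 for the first row; after that it should be = recent_date - lagged_date.
  2. Loss_A is computed as a cumsum that runs until at least one of WC, ASN and TS (the ordered combination, e.g. 7101, 3, 13-07-2012) changes.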
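To make the two rules concrete, here is a minimal base R sketch of the arithmetic for a single group (WC 7101, ASN 3, TS R). The variable names `d`, `loss`, `days` and `loss_a` are only for illustration, and the `%d-%m-%y` date format is an assumption based on how the dates look in the table.

```r
# One group only (WC 7101, ASN 3, TS R), to show the intended arithmetic.
# Assumption: the Date column is day-month-two-digit-year text.
d    <- as.Date(c("13-07-12", "02-08-12"), format = "%d-%m-%y")
loss <- c(156.930422, 168.401876)

# Rule 1: 0 for the first row, then the gap in days to the previous row.
days <- c(0, as.numeric(diff(d)))

# Rule 2: running total of Loss within the group.
loss_a <- cumsum(loss)

print(data.frame(Date = d, Loss = loss, Days = days, Loss_A = loss_a))
```

This gives Days of 0 and 20 and a Loss_A of 156.930422 and 325.332298, matching the first two rows of the desired table.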

How can I modify the dplyr code to get the final table shown above? mutate() does not work properly when I try to use lag(). Is there a better way to do this?

Answer


This works:

df %>% 
    # start by making a date column that's a recognized date class so you can perform 
    # operations on it. 
    mutate(date = as.Date(Date, format = "%d-%m-%y")) %>% 
    # then group by all of the columns you want to use to id groups 
    group_by(WC, ASN, TS) %>% 
    # then compute the time intervals between rows using ifelse to deal with 1st rows, 
    # and compute the cumulative total loss within each group. 
    mutate(Days = ifelse(is.na(lag(date)), 0, date - lag(date)), 
           Loss_A = cumsum(Loss)) %>% 
    # drop the date column we created if you don't need it 
    select(-date) 

Result:

Source: local data frame [9 x 7] 
Groups: WC, ASN, TS [3] 

    WC ASN TS  Date  Loss Days Loss_A 
    <int> <int> <chr> <chr>  <dbl> <dbl>  <dbl> 
1 7101  3  R 13-07-12 156.93042  0 156.93042 
2 7101  3  R 02-08-12 168.40188 20 325.33230 
3 7101  4  R 28-12-13 120.49208  0 120.49208 
4 7101  4  R 16-10-15 46.01208 657 166.50417 
5 7101  4  R 04-01-16 48.26241 80 214.76657 
6 7101 21  L 01-12-12 30.75056  0 30.75056 
7 7101 21  L 01-05-13 49.42124 151 80.17181 
8 7101 21  L 04-06-13 87.29482 34 167.46663 
9 7101 21  L 01-10-13 164.01314 119 331.47977 
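
For anyone who wants to reproduce this from scratch, the following is a self-contained sketch of the same approach. The data frame is rebuilt by hand from the question, so the column types are an assumption, and `result` is just an illustrative name. It also adds two things that are not in the pipeline above: an `arrange()` within groups, in case the rows are not already in date order, and `coalesce()` in place of `ifelse()` to fill in the first-row 0.

```r
library(dplyr)

# Example data retyped from the question (column types are an assumption).
df <- data.frame(
  WC   = rep(7101L, 9),
  ASN  = c(3L, 3L, 4L, 4L, 4L, 21L, 21L, 21L, 21L),
  TS   = c("R", "R", "R", "R", "R", "L", "L", "L", "L"),
  Date = c("13-07-12", "02-08-12", "28-12-13", "16-10-15", "04-01-16",
           "01-12-12", "01-05-13", "04-06-13", "01-10-13"),
  Loss = c(156.930422, 168.401876, 120.492081, 46.012085, 48.262409,
           30.750564, 49.421243, 87.294821, 164.013138),
  stringsAsFactors = FALSE
)

result <- df %>%
  # parse the text dates so that subtraction is possible
  mutate(date = as.Date(Date, format = "%d-%m-%y")) %>%
  group_by(WC, ASN, TS) %>%
  # sort within each group so lag() always refers to the previous date
  arrange(date, .by_group = TRUE) %>%
  mutate(
    # NA on the first row of each group, day gaps afterwards
    Days   = as.numeric(date - dplyr::lag(date)),
    # replace the first-row NA with 0
    Days   = coalesce(Days, 0),
    # running total of Loss within each group
    Loss_A = cumsum(Loss)
  ) %>%
  ungroup() %>%
  select(-date)

print(result)
```

With the example data, the Days and Loss_A columns come out the same as in the result shown above.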

Thank you very much for your effort, but it does not return the table I wanted (the final table).


Where does the table this produces differ from your desired output? – ulfelder


It works! Thank you.