2016-05-25 60 views
4

R: dplyr group by date ranges

I am trying to group a data frame into 3 date ranges, split at "2016-04-10" and "2016-04-24".

df <- structure(list(date = structure(c(16803, 16810, 16817, 16824, 
16831, 16838, 16845, 16852, 16859, 16866, 16873, 16880, 16887, 
16894, 16901, 16908, 16915, 16922, 16929, 16936, 16943), class = "Date"), 
    new = c(1507L, 2851L, 3550L, 5329L, 7557L, 5546L, 6264L, 
    7160L, 9468L, 5789L, 5928L, 4642L, 8145L, 4867L, 4846L, 5231L, 
    7137L, 3938L, 3741L, 2937L, 194L), resolved = c(21, 27, 15, 
    16, 56, 2773, 8490, 8748, 9325, 7734, 10264, 6739, 6110, 
    9613, 10314, 10349, 7200, 9637, 10831, 11170, 5666), ost = c(1486, 
    2824, 3535, 5313, 7501, 2773, -2226, -1588, 143, -1945, -4336, 
    -2097, 2035, -4746, -5468, -5118, -63, -5699, -7090, -8233, 
    -5472)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-21L), .Names = c("date", "new", "resolved", "ost")) 

I tried the following:

library(dplyr)

df1 <- df %>% group_by(dr = cut(date, breaks = as.Date(c("2016-04-10", "2016-04-24")))) %>% 
       summarise(ost = sum(ost)) 

which gives the wrong result shown below.

 dr ost 
2016-04-10 -10586 
     NA -17885 

Any help is appreciated!

+0

If you look at the 'cut' output, only some observations fall into that category; all the others are NAs – akrun

+0

'df %>% group_by(dr = cut(date, breaks = c(min(date), as.Date(c("2016-04-10", "2016-04-24")), max(date) + 1))) %>% summarize(ost = sum(ost))'? – alistaire

Answers

5

You can create a grouping variable first:

df %>% 
mutate(group = cumsum(grepl('2016-04-10|2016-04-24', date))) %>% 
group_by(group) %>% 
summarise(ost = sum(ost)) 

#Source: local data frame [3 x 2] 

# group ost 
# (int) (dbl) 
#1  0 8672 
#2  1 -10586 
#3  2 -26557 
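
To show how cumsum builds the groups, here is a minimal sketch on a made-up vector of five weekly dates (the vector and the names dates, hit and grp are only for illustration). grepl coerces the dates to character and returns TRUE on the rows that match a boundary date, and cumsum adds 1 at each TRUE, so every row from a boundary onward gets the next group number. This assumes the rows are sorted by date and that both boundary dates actually occur in the data; if a boundary date is missing, the counter never increments for it.

```r
# Hypothetical weekly dates; the two boundary dates are present in the data
dates <- as.Date(c("2016-04-03", "2016-04-10", "2016-04-17",
                   "2016-04-24", "2016-05-01"))

# TRUE exactly on the rows that equal one of the boundary dates
hit <- grepl("2016-04-10|2016-04-24", dates)
hit
# [1] FALSE  TRUE FALSE  TRUE FALSE

# The running total increases by 1 at each boundary row
grp <- cumsum(hit)
grp
# [1] 0 1 1 2 2
```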
+1

You can add the 'group' column with 'mutate(group = cumsum(grepl('2016-04-10|2016-04-24', df$date)))' – alistaire

+0

@Sotos It works! Do you mind explaining how cumsum creates the groups? – woshishui

+0

Oh cheez... right... too early @alistaire :) – Sotos

4

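We create the grouping variable 'dr' with cut. The breaks are the range of 'date' (i.e. its min and max values) together with the dates the OP specified, concatenated with c. We use the option include.lowest so that the last date is not dropped, and then get the sum of 'ost' for each level of this grouping variable.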

df %>% 
    group_by(dr = cut(date, breaks = c(range(date), 
      as.Date(c("2016-04-10", "2016-04-24"))), include.lowest=TRUE)) %>% 
    summarise(ost =sum(ost)) 
#   dr ost 
#  <fctr> <dbl> 
#1 2016-01-03 8672 
#2 2016-04-10 -10586 
#3 2016-04-24 -26557 
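
As a rough illustration of why the outer breaks matter, the sketch below runs cut on a made-up vector of five dates (dates, narrow and full are names chosen for this example). With only the two inner breaks there is a single interval, which is closed on the left and open on the right by default for dates, so everything outside it, including 2016-04-24 itself, becomes NA. Adding range(dates) creates three intervals, and include.lowest = TRUE keeps the maximum date in the last one. cut sorts Date breaks, so the order inside c() does not matter.

```r
# Hypothetical dates spanning both boundaries
dates <- as.Date(c("2016-04-03", "2016-04-10", "2016-04-17",
                   "2016-04-24", "2016-05-01"))
inner <- as.Date(c("2016-04-10", "2016-04-24"))

# Two breaks define one interval, [2016-04-10, 2016-04-24); the rest is NA
narrow <- cut(dates, breaks = inner)
is.na(narrow)
# [1]  TRUE FALSE FALSE  TRUE  TRUE

# Adding the min and max dates gives three intervals covering every row
full <- cut(dates, breaks = c(range(dates), inner), include.lowest = TRUE)
as.integer(full)
# [1] 1 2 2 3 3
levels(full)
# [1] "2016-04-03" "2016-04-10" "2016-04-24"
```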

Another option is findInterval, which may be faster than cut:

df %>% 
    group_by(dr = findInterval(date, as.Date(c("2016-04-10", "2016-04-24")))) %>% 
    summarise(ost = sum(ost)) 
#  dr ost 
# <int> <dbl> 
#1  0 8672 
#2  1 -10586 
#3  2 -26557 

Note: the OP asked about cut, and this solution addresses that.
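
For the findInterval variant, a small sketch on made-up dates may help (dates, idx and the label strings are assumptions for illustration only). findInterval returns, for each date, how many breaks are less than or equal to it, so a date equal to a break starts the next group, and the breaks do not need to be present in the data. The integer codes can be mapped to readable labels by indexing a character vector with idx + 1.

```r
breaks <- as.Date(c("2016-04-10", "2016-04-24"))

# Hypothetical dates just before, on, and after the two breaks
dates <- as.Date(c("2016-04-09", "2016-04-10", "2016-04-23",
                   "2016-04-24", "2016-04-25"))

# Number of breaks that are <= each date
idx <- findInterval(dates, breaks)
idx
# [1] 0 1 1 2 2

# Optional: turn the codes into descriptive labels (label text is made up)
labels <- c("before 2016-04-10", "2016-04-10 to 2016-04-23", "2016-04-24 onward")
named <- labels[idx + 1]
named
```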

+0

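Could you explain the first one? Here is how I understand it: df is passed to group_by using %>%, and group_by then takes the arguments that define the groups. Inside group_by, cut turns the values into a factor via cut(x, breaks, include.lowest = TRUE). x is date (because we want to group the data by date), and breaks gives the intervals the dates will be cut into. I don't know why you use as.Date, and include.lowest = TRUE means if the date – Learner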

+0

Then this output is passed on again with %>% to the next function, and summarise(ost = sum(ost)) shows the sum of the ost column. Do I understand it correctly? – Learner

+0

Sure, I will add a description – akrun