2015-10-09 73 views
0

Removing the loop from R code that computes overlapping time intervals

I have data (a fragment) in this format:

 SW_Release deviceType  configStartDate  configEndDate 
1: 04.05.00   21 2005-11-03 19:12:36 2006-02-28 10:19:27 
2: 04.05.00   16 2005-11-04 03:59:05 2006-02-28 10:19:27 
3: 04.05.00   20 2005-11-04 03:59:06 2006-02-28 10:19:27 
4: 04.05.00   15 2005-11-04 03:59:06 2006-02-28 10:19:27 
5: 04.05.00   19 2005-11-04 03:59:06 2006-02-28 10:19:27 
6: 04.05.00   17 2005-11-04 03:59:06 2006-02-28 10:19:27 
7: 04.07.03   16 2006-02-28 10:19:27 2006-03-29 01:00:39 
8: 04.07.03   20 2006-02-28 10:19:27 2006-03-29 01:00:41 
9: 04.07.01   15 2006-02-28 10:19:27 2006-03-29 01:00:41 
10: 04.07.01   19 2006-02-28 10:19:27 2006-03-29 01:00:41 
11: 04.07.01   17 2006-02-28 10:19:27 2006-03-29 01:00:42 
12: 04.07.01   21 2006-02-28 10:19:27 2006-03-29 01:00:42 
13: 04.07.01   18 2006-02-28 10:19:27 2006-03-29 01:00:42 
14: 04.07.04   16 2006-03-29 01:00:40 2006-05-01 16:07:49 
15: 04.07.04   20 2006-03-29 01:00:41 2006-05-01 16:07:50 
16: 04.07.02   15 2006-03-29 01:00:41 2006-05-01 16:07:50 
17: 04.07.02   19 2006-03-29 01:00:41 2006-05-01 16:07:51 
18: 04.07.02   17 2006-03-29 01:00:42 2006-05-01 16:07:51 
19: 04.07.02   21 2006-03-29 01:00:42 2006-05-01 16:07:51 
20: 04.07.02   18 2006-03-29 01:00:42 2006-06-01 09:45:36 
21: 04.07.04   16 2006-05-02 09:47:57 2006-06-01 09:45:25 
22: 04.07.04   20 2006-05-02 09:47:57 2006-06-01 09:45:28 
23: 04.07.02   15 2006-05-02 09:47:58 2006-06-01 09:45:31 
24: 04.07.02   19 2006-05-02 09:47:58 2006-06-01 09:45:32 
25: 04.07.02   17 2006-05-02 09:47:58 2006-06-01 09:45:34 
26: 04.07.02   21 2006-05-02 09:47:58 2006-06-01 09:45:35 
27: 04.07.05   16 2006-06-01 09:45:27 2006-08-14 17:54:15 
28: 04.07.05   20 2006-06-01 09:45:29 2006-08-14 17:54:15 
29: 04.07.06   15 2006-06-01 09:45:31 2007-12-12 11:03:00 
30: 04.07.06   19 2006-06-01 09:45:33 2007-12-12 11:03:00 
31: 04.07.03   17 2006-06-01 09:45:35 2006-08-14 17:54:16 
32: 04.07.03   21 2006-06-01 09:45:35 2006-08-14 17:54:16 
33: 04.07.04   18 2006-06-01 09:45:37 2007-12-12 11:03:00 
34: 04.07.06   16 2006-08-14 17:54:15 2007-12-12 11:02:59 
35: 04.07.06   20 2006-08-14 17:54:15 2007-12-12 11:02:59 
36: 04.07.04   17 2006-08-14 17:54:16 2007-12-12 11:03:00 
37: 04.07.04   21 2006-08-14 17:54:16 2007-12-12 11:03:00 
38: 04.05.12   14 2011-06-17 15:40:13 2012-05-24 11:43:24 

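I need to add up all the intervals (the time between the second-to-last and the last column), but as you can see, some rows have overlapping or partially overlapping intervals.

Before I add up all the days, I need to convert the full data set (the one shown above) into something like: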

accumulated data: 
     configStartDate  configEndDate 
1: 2005-11-03 19:12:36 2007-12-12 11:03:00 
2: 2011-06-17 15:40:13 2012-05-24 11:43:24 
total days: 934.296 

Here is my R code that does this (it has to be R, although I am considering rewriting it in C++ and using Rcpp):

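merge_intervals <- function(interval_dt){ 
    # Sort by start time and keep only the two interval columns 
    interval_dt <- interval_dt[order(configStartDate), list(configStartDate, configEndDate)] 

    # Start the result with the earliest interval 
    new_dt <- interval_dt[1, list(configStartDate, configEndDate)] 

    # seq_len(...)[-1] is empty for a one-row table, unlike 2:dim(...)[1] 
    for (i in seq_len(nrow(interval_dt))[-1]) { 
        buff <- interval_dt[i, list(configStartDate, configEndDate)] 

        if (new_dt[nrow(new_dt), configEndDate] >= buff[, configStartDate]) { 
            if (new_dt[nrow(new_dt), configEndDate] >= buff[, configEndDate]) { 
                # Fully contained in the last merged interval: nothing to do 
                next 
            } 
            else { 
                # Partial overlap: extend the last merged interval 
                new_dt[nrow(new_dt), configEndDate := buff[, configEndDate]] 
            } 
        } 
        else { 
            # No overlap: start a new merged interval 
            new_dt <- rbind(new_dt, buff) 
        } 
    } 

    return(new_dt) 
} 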

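Right now the whole thing takes about 0.16 seconds to run (together with the other calculations), but for 3000 unique assets that adds up to roughly 8 minutes of computation time.

How can I convert the for loop into something faster to reduce the computation time? Thanks!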

+0

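It should be possible to vectorize this. How do you want to handle the overlapping intervals? Ignore the overlaps, or merge the intervals into a new interval and consider only the new one? – Thierry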

+1

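Sorry, but your example doesn't make it clear to me what you are trying to do. How do you get from the 10 entries shown in your first block (all in 2006) to the two entries in the second block (spanning 2005-2012)? Can you describe exactly how to get from the sample input to the expected output? – josliber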

+0

I edited the sample to include all the rows to make it clearer. –

Answers

0

Something like this?

library(dplyr) 

now <- Sys.time() 
df <- data.frame(
    id = 1:3, 
    start = now + c(0, 1000, 3000), 
    end = now + c(1500, 2000, 4000) 
) 
df %>% 
    arrange(start) %>% 
    mutate(
    # a row opens a new interval when it starts after every earlier row has ended 
    new_interval = as.numeric(start) > lag(cummax(as.numeric(end)), default = -Inf), 
    interval = cumsum(new_interval) 
) %>% 
    group_by(interval) %>% 
    summarise(start = min(start), end = max(end)) %>% 
    mutate(delta = end - start) %>% 
    summarise(total = sum(delta))
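
Since the question uses data.table, the same idea can be sketched there as well. This is only a minimal sketch and not a tested replacement for the original function: the name `merge_intervals_vec` is made up for illustration, the example rows are rows 1, 8, 14 and 38 of the question's sample, and the `UTC` time zone is an assumption. Like the loop in the question, it treats intervals that merely touch as overlapping.

```r
library(data.table)

# Hypothetical vectorized counterpart of merge_intervals(): expects a
# data.table with POSIXct columns configStartDate and configEndDate.
merge_intervals_vec <- function(interval_dt) {
  dt <- interval_dt[order(configStartDate), list(configStartDate, configEndDate)]
  # Latest end time among all earlier rows (-Inf for the first row)
  prev_end <- shift(cummax(as.numeric(dt$configEndDate)), fill = -Inf)
  # A new merged interval begins wherever a row starts after everything before it has ended
  grp <- cumsum(as.numeric(dt$configStartDate) > prev_end)
  merged <- dt[, list(configStartDate = min(configStartDate),
                      configEndDate   = max(configEndDate)),
               by = list(grp = grp)]
  merged[, list(configStartDate, configEndDate)]
}

# Rows 1, 8, 14 and 38 of the sample; the time zone is assumed
example_dt <- data.table(
  configStartDate = as.POSIXct(c("2005-11-03 19:12:36", "2006-02-28 10:19:27",
                                 "2006-03-29 01:00:40", "2011-06-17 15:40:13"), tz = "UTC"),
  configEndDate   = as.POSIXct(c("2006-02-28 10:19:27", "2006-03-29 01:00:41",
                                 "2006-05-01 16:07:49", "2012-05-24 11:43:24"), tz = "UTC")
)

merged <- merge_intervals_vec(example_dt)
total_days <- sum(as.numeric(difftime(merged$configEndDate,
                                      merged$configStartDate, units = "days")))
print(merged)
print(total_days)
```

If the full table had an asset identifier (say, a hypothetical `assetId` column), something along the lines of `full_dt[, merge_intervals_vec(.SD), by = assetId]` might handle all assets in one call, though I have not timed it against the loop.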