2016-06-09 94 views
1

我有数据,看起来像日期间隔除去

ID CLM_ID Date1 Date2 
1 718182 1/1/2014 1/17/2014 
1 718184 1/2/2014 1/8/2014 
1 885236 1/15/2014 1/17/2014 
1 885362 3/20/2014 3/21/2014 
2 589963 3/18/2015 3/22/2015 
2 589999 2/27/2015 5/9/2015 
2 594226 4/11/2015 4/17/2015 
2 689959 5/10/2015 6/10/2015 
3 656696 5/1/2016 5/5/2016 
3 669625 5/6/2016 5/22/2016 
4 777777 2/21/2015 3/4/2015 
4 778952 2/1/2015 2/28/2015 
4 778965 3/1/2015 3/22/2015 

我与这两个不同的问题的工作。第一个回答是在以前的文章中如何滚动日期(Date roll-up in R),第二个现在是我有间隔的时间间隔,我试图摆脱它们。所以,最后的结果应该像

ID CLM_ID Date1 Date2 
1 718182 1/1/2014 1/17/2014 
1 885362 3/20/2014 3/21/2014 
2 589999 2/27/2015 5/9/2015 
3 656696 5/1/2016 5/22/2016 
4 778952 2/1/2015 3/22/2015 

现在我知道我必须首先创建通过日期汇总扩展区间,但随后我如何摆脱这些子区间(一个学期我做了间隔内的间隔)?我也在寻找一种高效的解决方案,因为我实际上有75,000条记录要通过(即我试图避免迭代解决方案)。

回答

2

使用non-equicurrent development version of data.table, v1.9.7加入,

require(data.table) # v1.9.7+ 
dt[dt, .(CLM_IDs = CLM_IDs[.N==1L]), on=.(ID, Date1<=Date1, Date2>=Date2), by=.EACHI] 
# ID  Date1  Date2 CLM_ID 
# 1: 1 2014-01-01 2014-01-17 718182 
# 2: 1 2014-03-20 2014-03-21 885362 
# 3: 2 2015-02-27 2015-05-09 589999 
# 4: 2 2015-05-10 2015-06-10 689959 
# 5: 3 2016-05-01 2016-05-05 656696 
# 6: 3 2016-05-06 2016-05-22 669625 
# 7: 4 2015-02-21 2015-03-04 777777 
# 8: 4 2015-02-01 2015-02-28 778952 
# 9: 4 2015-03-01 2015-03-22 778965 

这样做是什么,在dt(括号内的)每一行,它查找该行匹配dt(在外面)基于提供给on参数的条件。

如果唯一的匹配是自匹配(因为该条件还包括相等性),则返回匹配的行索引。这通过CLM_IDs[.N == 1L]来完成,其中.N保存每个组的观察数目。

+0

有没有什么办法可以在不安装'Rtools'的情况下安装v1.9.7。我问,因为我需要管理员权限才能安装'Rtools',目前安装v1.9.7是不可能的。如果不是,在v1.9.6中有没有不同的方式来做到这一点?谢谢。 –

+0

@JosephWood,遗憾的是还没有:-(..我们打算在某些时候直接使用windows二进制文件..我会问Jan(另一个data.table撰稿人)关于这个..因为他在这些方面做了很多工作 – Arun

+1

@JosephWood它在TODO列表中,一般等待[r-appveyor#29](https://github.com/krlmlr/r-appveyor/issues/29)。你可以从[appveyor build artifacts ](https://ci.appveyor.com/project/Rdatatable/data-table/history)并从zip安装。不幸的是由于[r-appveyor#69](https://github.com/krlmlr/r -appveyor/issues/69)二进制文件现在已经7天了,所以不包括非Equi加入和'.EACHI'。 – jangorecki

1

这是一个不太漂亮的解决方案,将每一行与所有其他行的日期进行比较。我将3015年的一年更正为2015年。但结果与您所期望的不同。要么我误解了你的问题,要么你误读了这些数据。

数据:

dta <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), 
         CLM_ID = c(718182L, 718184L, 885236L, 885362L, 589963L, 589999L, 594226L, 689959L, 656696L, 669625L, 777777L, 778952L, 778965L), 
         Date1 = structure(c(1L, 3L, 2L, 9L, 8L, 6L, 10L, 12L, 11L, 13L, 5L, 4L, 7L), .Label = c("1/1/2014", "1/15/2014", "1/2/2014", "2/1/2015", "2/21/2015", "2/27/2015", "3/1/2015", "3/18/2015", "3/20/2014", "4/11/2015", "5/1/2016", "5/10/2015", "5/6/2016"), class = "factor"), 
         Date2 = structure(c(1L, 2L, 1L, 4L, 5L, 10L, 7L, 11L, 9L, 8L, 6L, 3L, 5L), .Label = c("1/17/2014", "1/8/2014", "2/28/2015", "3/21/2014", "3/22/2015", "3/4/2015", "4/17/2015", "5/22/2016", "5/5/2016", "5/9/2015", "6/10/2015"), class = "factor")), 
       .Names = c("ID", "CLM_ID", "Date1", "Date2"), class = "data.frame", 
       row.names = c(NA, -13L)) 

代码:

dta$Date1 <- as.Date(dta$Date1, format = "%m/%d/%Y") 
dta$Date2 <- as.Date(dta$Date2, format = "%m/%d/%Y") 

# Boolean vector to memorize results 
keep <- logical(length = nrow(dta)) 
for(i in 1:nrow(dta)) { 
    match <- dta[dta$Date1 <= dta$Date1[i] & dta$Date2 >= dta$Date2[i], ] 
    if(nrow(match) == 1) keep[i] <- TRUE 
} 

# Result 
dta[keep, ] 
+0

我很感谢你的解决方案,但我试图避免使用'for'循环,因为实际的数据集长度为75,000行。 – akash87

+0

然后你应该添加到帖子。 – sebastianmm

2

“我也在寻找一个解决方案,是高效 ...(即我试图避免重复的解决方案)。 “

“你的假设是你的世界之窗,每隔一段时间擦洗一次,否则光线不会进入。” - 艾萨克阿西莫夫

下面是一个超快的base R迭代解决方案。它几乎立即为非常大的数据帧返回正确的结果。 (它也称为“辊式”中的数据,所以没有必要进行两个算法):

MakeDFSubInt <- function(df, includeCost = FALSE) { 
    ## Sorting the data frame to allow for fast 
    ## creation of the "Contained" logical vector below 
    tempDF <- df[order(df$ID, df$Date1, df$Date2), ] 
    UniIDs <- unique(tempDF$ID) 
    Len <- length(UniIDs) 

    ## Determine starting (i.e. "s") and ending (i.e. "e") 
    ## points of the respective groups of IDs 
    e <- which(diff(tempDF$ID)==1) 
    s <- c(1L, e + 1L) 
    dfLen <- nrow(tempDF) 
    e <- c(e, dfLen) 

    ## Converting dates to integers so that comparison 
    ## will be faster. Internally dates are stored as 
    ## integers, so this isn't a problem 
    dte1 <- as.integer(tempDF$Date1) 
    dte2 <- as.integer(tempDF$Date2) 

    ## Building logical vector in order to quickly create sub-intervals 
    Contained <- rep(FALSE, dfLen) 

    BegTime <- Sys.time() ## Included to measure time of for loop execution 

    for (j in 1:Len) { 
     Compare <- ifelse(dte2[s[j]] >= (dte1[s[j]+1L]+1L), max(dte2[s[j]], dte2[s[j]+1L]), dte2[s[j]+1L]) 
     for (x in (s[j]+1L):e[j]) { 
      if (!Contained[x-1L]) { 
       Contained[x] <- dte2[x-1L] >= (dte1[x]-1L) 
      } else { 
       Contained[x] <- Compare >= (dte1[x]-1L) 
      } 

      ## could use ifelse, but this construct is faster 
      if (Contained[x]) { 
       Compare <- max(Compare, dte2[x]) 
      } else { 
       Compare <- dte2[x] 
      } 
     } 
    } 

    EndTime <- Sys.time() 
    TotTime <- EndTime - BegTime 
    if (printTime) {print(paste(c("for loop execution time was: ", format(TotTime)), collapse = ""))} 

    ## identify sub-intervals 
    nGrps <- which(!Contained) 

    ## Create New fields for our new DF 
    ID <- tempDF$ID[nGrps] 
    CLM_ID <- tempDF$CLM_ID[nGrps] 
    Date1 <- tempDF$Date1[nGrps] 
    nGrps <- c(nGrps, dfLen+1L) 

    ## as.Date is converting numbers to dates. 
    ## N.B. This only works if origin is supplied 
    Date2 <- as.Date(vapply(1L:(length(nGrps) - 1L), function(x) { 
        max(dte2[nGrps[x]:(nGrps[x+1L]-1L)])}, 1L), origin = "1970-01-01") 

    ## in a related question the OP had, "Cost" was 
    ## included to show how the algorithm would handle 
    ## generic summary information 
    if (includeCost) { 
     myCost <- tempDF$Cost 
     Cost <- vapply(1L:(length(nGrps) - 1L), function(x) sum(myCost[nGrps[x]:(nGrps[x+1L]-1L)]), 100.01) 
     NewDf <- data.frame(ID,CLM_ID,Date1,Date2,Cost) 
    } else { 
     NewDf <- data.frame(ID,CLM_ID,Date1,Date2) 
    } 

    NewDf 
} 

对于在问题中给出的例子中,我们有:

ID <- c(rep(1,4),rep(2,4),rep(3,2),rep(4,3)) 
CLM_ID <- c(718182, 718184, 885236, 885362, 589963, 589999, 594226, 689959, 656696, 669625, 777777, 778952, 778965) 
Date1 <- c("1/1/2014","1/2/2014","1/15/2014","3/20/2014","3/18/2015","2/27/2015","4/11/2015","5/10/2015","5/1/2016","5/6/2016","2/21/2015","2/1/2015","3/1/2015") 
Date2 <- c("1/17/2014","1/8/2014","1/17/2014","3/21/2014","3/22/2015","5/9/2015","4/17/2015","6/10/2015","5/5/2016","5/22/2016","3/4/2015","2/28/2015","3/22/2015") 
myDF <- data.frame(ID, CLM_ID, Date1, Date2) 
myDF$Date1 <- as.Date(myDF$Date1, format = "%m/%d/%Y") 
myDF$Date2 <- as.Date(myDF$Date2, format = "%m/%d/%Y") 

MakeDFSubInt(myDF) 
ID CLM_ID  Date1  Date2 
1 1 718182 2014-01-01 2014-01-17 
2 1 885362 2014-03-20 2014-03-21 
3 2 589999 2015-02-27 2015-06-10 
4 3 656696 2016-05-01 2016-05-22 
5 4 778952 2015-02-01 2015-03-22 

similar question OP张贴,我们可以添加一个Cost字段,以显示我们将如何进行此设置的计算。

set.seed(7777) 
myDF$Cost <- round(rnorm(13, 450, sd = 100),2) 

MakeDFSubInt(myDF, includeCost = TRUE) 
ID CLM_ID  Date1  Date2 Cost 
1 1 718182 2014-01-01 2014-01-17 1164.66 
2 1 885362 2014-03-20 2014-03-21 568.16 
3 2 589999 2015-02-27 2015-06-10 2019.16 
4 3 656696 2016-05-01 2016-05-22 990.14 
5 4 778952 2015-02-01 2015-03-22 1578.68 

该算法可以很好地扩展。对于OP所需查找的数据帧,几乎瞬间返回请求的DF返回,对于非常大的数据帧,返回的时间仅为几秒。

首先我们构建一个函数,它将生成一个随机数据帧,其中有n行。

MakeRandomDF <- function(n) { 
    set.seed(109) 

    CLM_Size <- ifelse(n < 10^6, 10^6, 10^(ceiling(log10(n)))) 
    numYears <- trunc((6/425000)*n + 5) 
    StrtYear <- ifelse(numYears > 16, 2000, 2016 - numYears) 
    numYears <- ifelse(numYears > 16, 16, numYears) 

    IDs <- sort(sample(trunc(n/100), n, replace = TRUE)) 
    CLM_IDs <- sample(CLM_Size, n) 
    StrtDate <- as.Date(paste(c(as.character(StrtYear),"-01-01"), collapse = "")) 
    myPossibleDates <- StrtDate+(0:(numYears*365)) ## "numYears" years of data 
    Date1 <- sample(myPossibleDates, n, replace = TRUE) 
    Date2 <- Date1 + sample(1:100, n, replace = TRUE) 
    Cost <- round(rnorm(n, 850, 100), 2) 

    tempDF <- data.frame(IDs,CLM_IDs,Date1,Date2,Cost) 
    tempDF$Date1 <- as.Date(tempDF$Date1, format = "%m/%d/%Y") 
    tempDF$Date2 <- as.Date(tempDF$Date2, format = "%m/%d/%Y") 

    tempDF 
} 

对于中等尺寸的DF(即,75000行)

TestDF <- MakeRandomDF(75000) 
system.time(test1 <- MakeDFSubInt(TestDF, includeCost = TRUE, printTime = TRUE)) 
[1] "for loop execution time was: 0.06500006 secs" 
    user system elapsed 
    0.14 0.00 0.14 

nrow(test1) 
[1] 7618 

head(test1) 
    ID CLM_ID  Date1  Date2  Cost 
1 1 116944 2010-01-29 2010-01-30 799.90 ## The range of dates for 
2 1 515993 2010-02-15 2011-10-12 20836.83 ## each row are disjoint 
3 1 408037 2011-12-13 2013-07-21 28149.26 ## as requested by the OP 
4 1 20591 2013-07-25 2014-03-11 10449.51 
5 1 338609 2014-04-24 2014-07-31 4219.48 
6 1 628983 2014-08-03 2014-09-11 2170.93 


对于非常大的DF(即> 50万行)

TestDF2 <- MakeRandomDF(500000) 
system.time(test2 <- MakeDFSubInt(TestDF2, includeCost = TRUE, printTime = TRUE)) 
[1] "for loop execution time was: 0.3679998 secs" 
    user system elapsed 
    1.19 0.03 1.21 

nrow(test2) 
[1] 154839 

head(test2) 
    ID CLM_ID  Date1  Date2 Cost 
1 1 71251 2004-04-19 2004-06-29 2715.69 ## The range of dates for 
2 1 601676 2004-07-05 2004-09-23 2675.04 ## each row are disjoint 
3 1 794409 2004-12-28 2005-04-05 1760.63 ## as requested by the OP 
4 1 424671 2005-06-03 2005-08-20 1973.67 
5 1 390353 2005-09-16 2005-11-06 785.81 
6 1 496611 2005-11-21 2005-11-24 904.09 

system.time(test3 <- MakeDFSubInt(TestDF3, includeCost = TRUE, printTime = TRUE)) 
[1] "for loop execution time was: 0.6930001 secs" 
    user system elapsed 
    2.68 0.08 2.79  ## 1 million rows in under 3 seconds!!! 

nrow(test3) 
[1] 413668 


说明

该算法的主要部分产生Contained逻辑向量,用于确定连续日期的子区间。这个矢量的生成依赖于这样一个事实,即数据帧首先由ID,第二个为Date1,最后为Date2。我们首先查找每组ID的起始和结束行。例如,与例如通过OP提供有:

myDF 
    ID CLM_ID  Date1  Date2 
1 1 718182 2014-01-01 2014-01-17 ## <- 1 s[1] 
2 1 718184 2014-01-02 2014-01-08 
3 1 885236 2014-01-15 2014-01-17 
4 1 885362 2014-03-20 2014-03-21 ## <- 4 e[1] 
5 2 589963 2015-03-18 2015-03-22 ## <- 5 s[2] 
6 2 589999 2015-02-27 2015-05-09 
7 2 594226 2015-04-11 2015-04-17 
8 2 689959 2015-05-10 2015-06-10 ## <- 8 e[2] 
9 3 656696 2016-05-01 2016-05-05 ## <- 9 s[3] 
10 3 669625 2016-05-06 2016-05-22 ## <- 10 e[3] 
11 4 777777 2015-02-21 2015-03-04 ## <- 11 s[4] 
12 4 778952 2015-02-01 2015-02-28 
13 4 778965 2015-03-01 2015-03-22 ## <- 13 e[4] 

下面是产生se的代码。

## Determine starting (i.e. "s") and ending (i.e. "e") 
## points of the respective groups of IDs 
e <- which(diff(tempDF$ID)==1) 
s <- c(1L, e + 1L) 
dfLen <- nrow(tempDF) 
e <- c(e, dfLen) 

s 
1 5 9 11 

e 
4 8 10 13 

现在,我们遍历每个组并开始填充逻辑向量Contained。如果特定行的日期范围与其上方的日期范围重叠(或连续),我们将该特定索引Contained设置为TRUE。这就是为什么每个组中的第一行被设置为FALSE,因为上面没有任何内容可以比较它。在我们这样做时,我们正在更新最大日期,以便与前进比较,因此变量为Compare。应该指出的是,Date2[n] < Date2[n+1L]并不一定是这样,这就是为什么Compare <- max(Compare, dte2[x])连续TRUEs。我们的例子的结果如下。

ID CLM_ID  Date1  Date2 Contained 
1 1 718182 2014-01-01 2014-01-17  FALSE 
2 1 718184 2014-01-02 2014-01-08  TRUE ## These two rows are contained 
3 1 885236 2014-01-15 2014-01-17  TRUE ## in the date range 1/1 - 1/17 
4 1 885362 2014-03-20 2014-03-21  FALSE ## This row isn't 
6 2 589999 2015-02-27 2015-05-09  FALSE 
5 2 589963 2015-03-18 2015-03-22  TRUE 
7 2 594226 2015-04-11 2015-04-17  TRUE 
8 2 689959 2015-05-10 2015-06-10  TRUE ## N.B. 5/10 is a continuance of 5/09 
9 3 656696 2016-05-01 2016-05-05  FALSE 
10 3 669625 2016-05-06 2016-05-22  TRUE 
12 4 778952 2015-02-01 2015-02-28  FALSE 
11 4 777777 2015-02-21 2015-03-04  TRUE 
13 4 778965 2015-03-01 2015-03-22  TRUE 

现在我们可以通过识别与相应FALSE所有行轻松识别“开始”行。在此之后,通过简单计算您在TRUEs和Voila的每个继承中感兴趣的任何内容(例如max(Date2),sum(Cost)),即可轻松查找摘要信息!

+0

感谢您的解释。除了'system.time'之外,还有一种编程方式可以知道循环的效率吗? – akash87

+0

@ akash87,R中有很多关于'for loops'的误解。是的,他们可能会很慢,但正如Richie Cotton指出的[这里](http://stackoverflow.com/a/6466415/4408538)(请查看Gavin Simpson在该页面上接受的答案),如果您有一个非平凡的任务(就像我们上面所做的那样),'for'循环是要走的路(如果正确实施)。我强烈建议阅读[本文](http://stackoverflow.com/a/2276001/4408538)和[本文](http://stackoverflow.com/q/28983292/4408538)以更好地了解R的循环结构。 –

+0

@ akash87,我修改了我的算法以返回执行'for循环'所需的时间。 –