Take a deep breath. Here is my solution.
Initialising the data
tmpdf <- data.frame(licensePlate = c("Y80901", "Y80901", "Y80901", "AMG-999", "AMG-999", "W3188", "W3188"),
                    starttime = c("2015-09-18 09:55", "2015-09-18 23:00", "2015-09-20 15:00", "2015-09-17 15:42", "2015-09-21 09:22", "2015-09-17 09:00", "2015-09-21 14:00"),
                    endtime = c("2015-09-18 17:55", "2015-09-20 11:00", "2015-09-21 12:00", "2015-09-18 13:00", "2015-09-21 14:22", "2015-09-21 12:00", "2015-09-21 16:00"))
# convert to POSIXct for better date/time handling
tmpdf$starttime <- as.POSIXct(tmpdf$starttime, tz = "GMT")
tmpdf$endtime <- as.POSIXct(tmpdf$endtime, tz = "GMT")
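To illustrate what the conversion buys us, here is a minimal sketch using the first record above. Subtracting two POSIXct values gives a difftime, and as.Date() drops the time of day, which is what the per-day grouping later relies on. The names start, end and elapsed are only for illustration.

```r
# First record of tmpdf, typed out by hand for illustration
start <- as.POSIXct("2015-09-18 09:55", tz = "GMT")
end   <- as.POSIXct("2015-09-18 17:55", tz = "GMT")

# Asking difftime() for explicit units avoids depending on R's automatic choice
elapsed <- difftime(end, start, units = "hours")
as.numeric(elapsed)   # 8

# as.Date() keeps only the calendar day
as.Date(start)        # "2015-09-18"
```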
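Data preparation
To perform the required operation, your complete usage data must first be converted into daily usage data. So I wrote the following functions to prepare the data in the required format.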
library(lubridate)  # provides hours(), minutes() and seconds()

# splits a single usage record into two at midnight
splitToTwo <- function(list){
  newList <- NULL
  # seq_len() avoids the 1:0 trap when there is nothing to split
  for (i in seq_len(nrow(list))){
    tmp <- list[i,]
    # set the end time of the first split as 23:59:59
    list[i,]$endtime <- as.Date(list[i,]$starttime) + hours(23) + minutes(59) + seconds(59)
    # set the start time of the second split as 00:00:01 (two seconds after 23:59:59)
    tmp$starttime <- list[i,]$endtime + seconds(2)
    # append both halves to the result
    tmp <- rbind(tmp,list[i,])
    newList <- rbind(newList,tmp)
  }
  return(newList)
}
# recursive function: keep splitting the usage data in two until it is completely normalised to daily usage data
setDailyUsage <- function(tmpdf){
  # subset of the records whose usage spans more than one day
  list <- tmpdf[as.Date(tmpdf$endtime) - as.Date(tmpdf$starttime) > 0, ]
  # keep in tmpdf only the usage that started and ended on the same day
  tmpdf <- tmpdf[ as.Date(tmpdf$endtime) - as.Date(tmpdf$starttime) == 0,]
  # split the records that span more than one day
  split <- splitToTwo(list)
  # add the now-split records back to the same-day records
  tmpdf <- rbind(tmpdf,split)
  # recurse while any record still spans more than one day
  if (nrow(tmpdf[as.Date(tmpdf$endtime) - as.Date(tmpdf$starttime) > 0, ])>0){
    tmpdf <- setDailyUsage(tmpdf)
  }
  return(tmpdf)
}
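As a side note, the same per-day breakdown can probably be computed without recursion. The sketch below is not the method used above: hoursPerDay is a hypothetical helper that cuts one interval at exact midnights (so it does not drop the two seconds per boundary that the functions above do) and returns hours per calendar day. It assumes the times are in GMT, as in tmpdf.

```r
# Hypothetical helper: hours of use per calendar day for one [start, end] interval
hoursPerDay <- function(start, end) {
  # every calendar day touched by the interval
  days <- seq(as.Date(start), as.Date(end), by = "day")
  # midnight at the start of each of those days
  dayStart <- as.POSIXct(format(days), tz = "GMT")
  # each piece starts at midnight, except the first, which starts at `start`
  pieceStart <- dayStart
  pieceStart[1] <- start
  # each piece ends at the next midnight, except the last, which ends at `end`
  pieceEnd <- dayStart + 24 * 60 * 60
  pieceEnd[length(pieceEnd)] <- end
  data.frame(day = days,
             hours = as.numeric(difftime(pieceEnd, pieceStart, units = "hours")))
}

# Second record of tmpdf: 2015-09-18 23:00 to 2015-09-20 11:00
res <- hoursPerDay(as.POSIXct("2015-09-18 23:00", tz = "GMT"),
                   as.POSIXct("2015-09-20 11:00", tz = "GMT"))
res$hours   # 1 24 11
```

Because the pieces end exactly at midnight, this output would not plug into the start-and-end-on-the-same-day filter used later in this answer; it is only meant to show the idea.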
Prepared data
Our prepared data:
preparedData <- setDailyUsage(tmpdf)
    licensePlate           starttime             endtime
1         Y80901 2015-09-18 09:55:00 2015-09-18 17:55:00
5        AMG-999 2015-09-21 09:22:00 2015-09-21 14:22:00
7          W3188 2015-09-21 14:00:00 2015-09-21 16:00:00
21        Y80901 2015-09-18 23:00:00 2015-09-18 23:59:59
3         Y80901 2015-09-21 00:00:01 2015-09-21 12:00:00
31        Y80901 2015-09-20 15:00:00 2015-09-20 23:59:59
4        AMG-999 2015-09-18 00:00:01 2015-09-18 13:00:00
41       AMG-999 2015-09-17 15:42:00 2015-09-17 23:59:59
61         W3188 2015-09-17 09:00:00 2015-09-17 23:59:59
2         Y80901 2015-09-20 00:00:01 2015-09-20 11:00:00
211       Y80901 2015-09-19 00:00:01 2015-09-19 23:59:59
611        W3188 2015-09-18 00:00:01 2015-09-18 23:59:59
612        W3188 2015-09-19 00:00:01 2015-09-19 23:59:59
6          W3188 2015-09-21 00:00:01 2015-09-21 12:00:00
613        W3188 2015-09-20 00:00:01 2015-09-20 23:59:59
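Data manipulation
Now we create a new data frame that represents the data in the required format, with one row per day and license plate. Its UsedTime column is initialised to 0.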
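# fix the units to minutes so that the later division by 60 reliably yields hours
preparedData$duration <- difftime(preparedData$endtime, preparedData$starttime, units = "mins")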
noOfUniquePlates <- length(unique(preparedData$licensePlate))
Period <- rep(seq(from=(min(as.Date(preparedData$starttime))),to=(max(as.Date(preparedData$starttime))), by="day"),noOfUniquePlates)
noOfUniqueDays <- length(unique(Period))
LicensePlate <- rep(unique(preparedData$licensePlate),each=noOfUniqueDays)
UsedTime <- 0
newDF <- data.frame(Period,LicensePlate,UsedTime)
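For what it's worth, the same day-by-plate grid can likely be built in one call with base R's expand.grid(). This is only a sketch: the plates and the date range are typed in by hand from the data above, and grid is a name I made up for the illustration.

```r
# Plates and date range copied by hand from the data above
plates <- c("Y80901", "AMG-999", "W3188")
days   <- seq(as.Date("2015-09-17"), as.Date("2015-09-21"), by = "day")

# One row per (day, plate) combination; the first argument varies fastest,
# so the days cycle within each plate, as in newDF
grid <- expand.grid(Period = days, LicensePlate = plates, stringsAsFactors = FALSE)
grid$UsedTime <- 0

nrow(grid)   # 15
```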
Now a simple mapply call over each row of newDF looks up the matching usage records in the preparedData data frame.
findUsage <- function(p,l){
  # total duration of the records for plate l that start and end on day p
  sum(preparedData[as.Date(preparedData$starttime) == p & as.Date(preparedData$endtime) == p & preparedData$licensePlate == l , ]$duration)
}
newDF$UsedTime <- mapply(findUsage, newDF$Period, newDF$LicensePlate)
# convert minutes to hours
newDF$UsedTime <- newDF$UsedTime/60
> newDF[with(newDF,order(Period)),]
       Period LicensePlate  UsedTime
1  2015-09-17       Y80901  0.000000
6  2015-09-17      AMG-999  8.299722
11 2015-09-17        W3188 14.999722
2  2015-09-18       Y80901  8.999722
7  2015-09-18      AMG-999 12.999722
12 2015-09-18        W3188 23.999444
3  2015-09-19       Y80901 23.999444
8  2015-09-19      AMG-999  0.000000
13 2015-09-19        W3188 23.999444
4  2015-09-20       Y80901 19.999444
9  2015-09-20      AMG-999  0.000000
14 2015-09-20        W3188 23.999444
5  2015-09-21       Y80901 11.999722
10 2015-09-21      AMG-999  5.000000
15 2015-09-21        W3188 13.999722
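If the zero rows for unused days are not needed, the summing step could also be sketched with base R's aggregate() instead of mapply. The example below uses three of the prepared rows for 2015-09-21, typed in by hand; prepared, hours and usage are illustrative names, not part of the code above.

```r
# Three prepared rows for 2015-09-21, copied by hand from the prepared data
prepared <- data.frame(
  licensePlate = c("W3188", "W3188", "AMG-999"),
  starttime = as.POSIXct(c("2015-09-21 00:00:01", "2015-09-21 14:00:00",
                           "2015-09-21 09:22:00"), tz = "GMT"),
  endtime   = as.POSIXct(c("2015-09-21 12:00:00", "2015-09-21 16:00:00",
                           "2015-09-21 14:22:00"), tz = "GMT"))

# Duration in hours and the calendar day of each record
prepared$hours  <- as.numeric(difftime(prepared$endtime, prepared$starttime, units = "hours"))
prepared$Period <- as.Date(prepared$starttime)

# Sum the hours per day and plate
usage <- aggregate(hours ~ Period + licensePlate, data = prepared, FUN = sum)
usage
```

The two totals agree with the 2015-09-21 rows for AMG-999 and W3188 in the table above. Note that aggregate() only returns combinations that occur in the data, which is why the answer builds the full grid first.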
I had to limit the explanation to keep the answer short. If you need any clarification, let me know in the comments.
Thanks, a great data.table example; just a few lines of code solved my problem.