2013-06-18 21 views
5

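How to calculate the number of occurrences per minute for a large dataset

I have a dataset of roughly 500k appointments, each lasting between 5 and 60 minutes.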

tdata <- structure(list(Start = structure(c(1325493000, 1325493600, 1325494200, 1325494800, 1325494800, 1325495400, 1325495400, 1325496000, 1325496000, 1325496600, 1325496600, 1325497500, 1325497500, 1325498100, 1325498100, 1325498400, 1325498700, 1325498700, 1325499000, 1325499300), class = c("POSIXct", "POSIXt"), tzone = "GMT"), End = structure(c(1325493600, 1325494200, 1325494500, 1325495400, 1325495400, 1325496000, 1325496000, 1325496600, 1325496600, 1325496900, 1325496900, 1325498100, 1325498100, 1325498400, 1325498700, 1325498700, 1325499000, 1325499300, 1325499600, 1325499600), class = c("POSIXct", "POSIXt"), tzone = "GMT"), Location = c("LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB"), Room = c("RoomA", "RoomA", "RoomA", "RoomA", "RoomB", "RoomB", "RoomB", "RoomB", "RoomB", "RoomB", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA")), .Names = c("Start", "End", "Location", "Room"), row.names = c(NA, 20L), class = "data.frame") 
> head(tdata) 
       Start     End Location Room 
1 2012-01-02 08:30:00 2012-01-02 08:40:00 LocationA RoomA 
2 2012-01-02 08:40:00 2012-01-02 08:50:00 LocationA RoomA 
3 2012-01-02 08:50:00 2012-01-02 08:55:00 LocationA RoomA 
4 2012-01-02 09:00:00 2012-01-02 09:10:00 LocationA RoomA 
5 2012-01-02 09:00:00 2012-01-02 09:10:00 LocationA RoomB 
6 2012-01-02 09:10:00 2012-01-02 09:20:00 LocationA RoomB 

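I would like to calculate the number of concurrent appointments in total, per location, and per room (and per several other factors in the original dataset).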

I have tried performing a left join in SQL with the sqldf package. This works for a small dataset, but takes forever on the entire dataset:

# SQL Join. 
start.min <- min(tdata$Start, na.rm=T) 
end.max <- max(tdata$End, na.rm=T) 
tinterval <- seq.POSIXt(start.min, end.max, by = "mins") 
tinterval <- as.data.frame(tinterval) 

library(sqldf) 
system.time(
    output <- sqldf("SELECT * 
       FROM tinterval 
       LEFT JOIN tdata 
       ON tinterval.tinterval >= tdata.Start 
       AND tinterval.tinterval < tdata.End ")) 

head(output) 
      tinterval    Start     End Location Room 
1 2012-01-02 09:30:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 
2 2012-01-02 09:31:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 
3 2012-01-02 09:32:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 
4 2012-01-02 09:33:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 
5 2012-01-02 09:34:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 
6 2012-01-02 09:35:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 

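This creates a data frame in which all "active" appointments are listed for every minute. The large dataset covers a full year (~525,600 minutes). With an average appointment duration of 18 minutes, I expect the SQL join to produce a dataset of about 5 million rows, which I can use to create occupancy plots for different factors (Location, Room, etc.).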

Building on the sapply solution suggested in "How to count number of concurrent users", I tried using data.table and snowfall as follows:

require(snowfall) 
require(data.table) 
sfInit(par=T, cpu=4) 
sfLibrary(data.table) 

tdata <- data.table(tdata) 
tinterval <- seq.POSIXt(start.min, end.max, by = "mins") 
setkey(tdata, Start, End) 
sfExport("tdata") # "Transport" data to cores 

system.time(output <- data.frame(tinterval,sfSapply(tinterval, function(i) length(tdata[Start <= i & i < End,Start])))) 

> head(output) 
      tinterval sfSapply.tinterval..function.i..length.tdata.Start....i...i... 
1 2012-01-02 08:30:00                1 
2 2012-01-02 08:31:00                1 
3 2012-01-02 08:32:00                1 
4 2012-01-02 08:33:00                1 
5 2012-01-02 08:34:00                1 
6 2012-01-02 08:35:00                1 

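This solution is fast: it takes about 18 seconds to calculate 1 day (roughly 2 hours for a full year). The drawback is that I cannot break the number of concurrent appointments down by factors such as Location or Room. I have the feeling there must be a better way to do this. Any suggestions?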

UPDATE: Based on Geoffrey's answer, the final solution looks like this. The example shows how the occupancy per location is determined.

setkey(tdata, Location, Start, End) 
# Use min/max: after keying by Location, the first and last rows no longer bound the time range 
vecTime <- seq(from=min(tdata$Start),to=max(tdata$End),by=60) 
res <- data.frame(time=vecTime) 

for(i in 1:length(unique(tdata$Location))) { 
    addz <- array(0,length(vecTime)) 
    remz <- array(0,length(vecTime)) 

    tdata2 <- tdata[J(unique(tdata$Location)[i]),] # Subset a certain location. 

    startAgg <- aggregate(tdata2$Start,by=list(tdata2$Start),length) 
    endAgg <- aggregate(tdata2$End,by=list(tdata2$End),length) 
    addz[which(vecTime %in% startAgg$Group.1)] <- startAgg$x 
    remz[which(vecTime %in% endAgg$Group.1)] <- -endAgg$x 

    res[,c(unique(tdata$Location)[i])] <- cumsum(addz + remz) 
} 

> head(res) 
       time LocationA LocationB 
1 2012-01-01 03:30:00   1   0 
2 2012-01-01 03:31:00   1   0 
3 2012-01-01 03:32:00   1   0 
4 2012-01-01 03:33:00   1   0 
5 2012-01-01 03:34:00   1   0 
6 2012-01-01 03:35:00   1   0 
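
The per-location loop can also be written as a grouped operation. The following is a minimal sketch of that idea, not the code used above: it assumes data.table is installed, uses hypothetical toy data (`appts`, with times given as minutes since midnight for brevity), and relies on a rolling join (`roll = TRUE`) to carry the last known occupancy forward to every minute.

```r
library(data.table)

# Hypothetical toy data; times are minutes since midnight
appts <- data.table(
  Location = c("LocationA", "LocationA", "LocationA", "LocationB"),
  Room     = c("RoomA", "RoomA", "RoomB", "RoomA"),
  Start    = c(510L, 515L, 520L, 512L),
  End      = c(520L, 525L, 530L, 518L)
)

# One +1 event per start and one -1 event per end
events <- rbind(
  appts[, .(Location, Room, minute = Start, change =  1L)],
  appts[, .(Location, Room, minute = End,   change = -1L)]
)

# Net change per group and minute, then a running total within each group
net <- events[, .(change = sum(change)), keyby = .(Location, Room, minute)]
net[, occupancy := cumsum(change), by = .(Location, Room)]

# Expand to one row per minute and group; roll = TRUE carries the
# most recent occupancy forward to minutes without any event
grid <- net[, .(minute = seq(min(minute), max(minute))), by = .(Location, Room)]
per_minute <- net[grid, on = .(Location, Room, minute), roll = TRUE]
```

Grouping by `Location` alone (dropping `Room` from every `by`, `keyby` and `on`) should give the per-location totals that the loop computes.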
+0

Glad the answer proved useful. Just a pointer. – Arun

Answers

3

Does this work better?

Create a blank time vector and blank count vectors.

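# Minute grid spanning the whole data set (by = 60 seconds) 
vecTime <- seq(from=min(tdata$Start),to=max(tdata$End),by=60) 
addz <- array(0,length(vecTime)) # appointments starting in each minute 
remz <- array(0,length(vecTime)) # appointments ending in each minute (stored as negative) 

# Count how many appointments start and end at each distinct time 
startAgg <- aggregate(tdata$Start,by=list(tdata$Start),length) 
endAgg <- aggregate(tdata$End,by=list(tdata$End),length) 
addz[which(vecTime %in% startAgg$Group.1)] <- startAgg$x 
remz[which(vecTime %in% endAgg$Group.1)] <- -endAgg$x 
# Running total of starts minus ends = appointments active in each minute 
res <- data.frame(time=vecTime,occupancy=cumsum(addz + remz)) 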
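
To see why the cumulative sum yields the number of active appointments, here is a small worked sketch in base R. The data are hypothetical (times in minutes since midnight), and `tabulate` is swapped in for `aggregate` purely for brevity. Like the code above, it assumes every start and end falls exactly on the minute grid.

```r
# Hypothetical toy appointments, times in minutes since midnight
start <- c(510, 520, 520, 540)   # 08:30, 08:40, 08:40, 09:00
end   <- c(520, 535, 530, 550)   # 08:40, 08:55, 08:50, 09:10

minutes <- seq(min(start), max(end))       # one slot per minute
idx <- function(t) t - minutes[1] + 1      # minute -> position in the vector

# +1 for every start and -1 for every end falling in each minute slot
delta <- tabulate(idx(start), nbins = length(minutes)) -
         tabulate(idx(end),   nbins = length(minutes))

# Running total of the changes = appointments active during each minute
occupancy <- cumsum(delta)

occupancy[idx(c(510, 520, 532, 537, 545, 550))]
# 1 2 1 0 1 0
```

An appointment counts as active from its start minute up to, but not including, its end minute, which matches the `Start <= i & i < End` condition used in the question.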
+0

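Thanks a lot Geoffrey, but this does not cover the number of appointments that are active during a given period. It tells me that two appointments start at 9:00, but it does not account for active appointments (already started but not yet finished). I need the occupancy per minute to study the peaks during really busy periods. – TimV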

+0

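Hey Geoffrey, your solution took 9 seconds to process my entire dataset. I had been struggling with this for hours. Thanks a lot for your input. I had been looking in the wrong direction: aggregating the start and end times of all appointments and deriving the occupancy from them is really great. Given the speed of the calculation, I can build occupancy plots per location or per room with a few for loops, so I consider my question answered. – TimV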

0

I'm not entirely sure I understand your goal. Nevertheless, this might be useful:

#I changed the example to actually have concurrent appointments 
DF <- read.table(text="    Start,     End, Location, Room 
1, 2012-01-02 08:30:00, 2012-01-02 08:40:00, LocationA, RoomA 
2, 2012-01-02 08:40:00, 2012-01-02 08:50:00, LocationA, RoomA 
3, 2012-01-02 08:50:00, 2012-01-02 09:55:00, LocationA, RoomA 
4, 2012-01-02 09:00:00, 2012-01-02 09:10:00, LocationA, RoomA 
5, 2012-01-02 09:00:00, 2012-01-02 09:10:00, LocationA, RoomB 
6, 2012-01-02 09:10:00, 2012-01-02 09:20:00, LocationA, RoomB",header=TRUE,sep=",",stringsAsFactors=FALSE) 

# Dates are year-month-day (2012-01-02 is 2 January, as in the question's data) 
DF$Start <- as.POSIXct(DF$Start,format="%Y-%m-%d %H:%M:%S",tz="GMT") 
DF$End <- as.POSIXct(DF$End,format="%Y-%m-%d %H:%M:%S",tz="GMT") 

library(data.table) 
DT <- data.table(DF) 
DT[,c("Start_num","End_num"):=lapply(.SD,as.numeric),.SDcols=1:2] 

fun <- function(s,e) { 
    require(intervals) 
    mat <- cbind(s,e) 
    inter <- Intervals(mat,closed=c(FALSE,FALSE),type="R") 
    io <- interval_overlap(inter, inter) 
    tablengths <- table(sapply(io,length))[-1] 
    sum(c(0,as.vector(tablengths/as.integer(names(tablengths))))) 
} 

#number of overlapping events per room and location 
DT[,fun(Start_num,End_num),by=list(Location,Room)] 
#  Location Room V1 
#1: LocationA RoomA 1 
#2: LocationA RoomB 0 

I haven't tested this, in particular not for speed.

+0

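Thanks Roland. Interesting approach, but I was looking for the total occupancy per minute, with the ability to break it down by location and room. – TimV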

0

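Here's a strategy: order by start time, then flatten the data into a single vector going start, end, start, end, ... and check whether that vector needs to be reordered. If it doesn't, there are no conflicts. If it does, you can see how many appointments (and which ones, if you like) conflict with each other.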

# Using Roland's example: 
DF <- read.table(text="    Start,     End, Location, Room 
1,2012-01-02 08:30:00,2012-01-02 08:40:00,LocationA,RoomA 
2,2012-01-02 08:40:00,2012-01-02 08:50:00,LocationA,RoomA 
3,2012-01-02 08:50:00,2012-01-02 09:55:00,LocationA,RoomA 
4,2012-01-02 09:00:00,2012-01-02 09:10:00,LocationA,RoomA 
5,2012-01-02 09:00:00,2012-01-02 09:10:00,LocationA,RoomB 
6,2012-01-02 09:10:00,2012-01-02 09:20:00,LocationA,RoomB",header=TRUE,sep=",",stringsAsFactors=FALSE) 

library(data.table) 
dt = data.table(DF) 

# the conflicting appointments 
dt[order(Start), 
    .SD[unique((which(order(c(rbind(Start, End))) != 1:(2*.N)) - 1) %/% 2 + 1)], 
    by = list(Location, Room)] 
# Location Room    Start     End 
#1: LocationA RoomA 2012-01-02 08:50:00 2012-01-02 09:55:00 
#2: LocationA RoomA 2012-01-02 09:00:00 2012-01-02 09:10:00 

# and a speedier version of the above, that avoids constructing the full .SD: 
dt[dt[order(Start), 
     .I[unique((which(order(c(rbind(Start, End))) != 1:(2*.N)) - 1) %/% 2 + 1)], 
     by = list(Location, Room)]$V1] 

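The formula above for going from the out-of-order positions to the appointment indices can perhaps be simplified. I didn't spend much time thinking about it and just used the first thing that got the job done.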
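
The index arithmetic is easier to follow on a toy vector. The sketch below uses hypothetical data (times in minutes since midnight, already sorted by start) and unpacks the same expression step by step in base R.

```r
# Hypothetical appointments, already sorted by start time
start <- c(510, 520, 530, 540)
end   <- c(520, 530, 595, 550)

# Interleave as start1, end1, start2, end2, ...
timeline <- c(rbind(start, end))
# 510 520 520 530 530 595 540 550

# Positions that sorting the interleaved vector would change
moved <- which(order(timeline) != seq_along(timeline))

# Positions 2k-1 and 2k both belong to appointment k
conflicting <- unique((moved - 1) %/% 2 + 1)
conflicting
# 3 4
```

Appointment 3 (530 to 595) is still running when appointment 4 starts at 540, so sorting has to move elements belonging to those two appointments. Appointments that merely touch, such as 1 and 2, stay in place because `order` leaves ties in their original order.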