Faster way to count events occurring in 5-minute intervals?

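I have a matrix, events, which contains the times of occurrence of 5 million events. Each of these 5 million events has a "type" ranging from 1 to 2000. A very simplified version of the matrix is below. The units of "times" are seconds since 1970. All of the events happened after January 1, 2012.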

>events 
     type   times 
     1   1352861760 
     1   1362377700 
     2   1365491820 
     2   1368216180 
     2   1362088800 
     2   1362377700

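I am trying to divide the time since 1/1/2012 into 5-minute buckets, and then fill each bucket with the number of events of each type i that occurred in it. My code is below. Note that types is a vector containing every possible type from 1 to 2000, and by is set to 300 because that is the number of seconds in 5 minutes.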

for(i in 1:length(types)){ 
    local <- events[events$type==types[i],c("type", "times")] 
    assign(sprintf("a%d", i),table(cut(local$times, breaks=seq(range(events$times)[1],range(events$times)[2], by=300)))) 
}

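This produces the variables a1 through a2000, each of which is a row vector holding the number of occurrences of type i in every 5-minute bucket.

I then go on to find all pairwise correlations among a1 through a2000.

Is there a way to optimize the block of code above? It runs very slowly, but I cannot think of a faster way. Perhaps there are too many buckets and too few times.

Any insight would be much appreciated.

Reproducible example: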

>head(events) 
    type   times 
     12   1308575460 
     12   1308676680 
     12   1308825420 
     12   1309152660 
     12   1309879140 
     25   1309946460 

xevents <- xts(events[,"type"],.POSIXct(events[,"times"])) 
ep <- endpoints(xevents, "minutes", 5) 
counts <- period.apply(xevents, ep, tabulate, nbins=length(types)) 

>head(counts) 
         1 2 3 4 5 6 7 8 9 10 11 12 13 14 
2011-06-20 09:11:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0 
2011-06-21 13:18:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0 
2011-06-23 06:37:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0 
2011-06-27 01:31:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0 
2011-07-05 11:19:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0 
2011-07-06 06:01:00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

> ep[1:20]
[1] 0 1 2 3 4 5 6 7 8 9 10 12 20 21 22 23 24 25 26 27

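The above is the code I have been using, but the problem is that it does not advance in 5-minute steps: it only advances where actual events occur.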


2013-07-24 user2588829

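Your "reproducible example" is not [reproducible](http://stackoverflow.com/q/5963269/271616), and you don't show your desired output, but I think you want an observation for every 5 minutes, regardless of whether you actually have data in that period.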

I would use the xts package for this. Running a function over non-overlapping 5-minute intervals is easy with the period.apply and endpoints functions.

# create sample data 
library(xts) 
set.seed(21) 
N <- 1e6 
events <- cbind(sample(2000, N, replace=TRUE), 
    as.POSIXct("2012-01-01")+sample(1e7,N)) 
colnames(events) <- c("type","times") 
# create xts object 
xevents <- xts(events[,"type"], .POSIXct(events[,"times"])) 
# find the last row of each non-overlapping 5-minute interval 
ep <- endpoints(xevents, "minutes", 5) 
# count the number of occurrences of each "type" 
counts <- period.apply(xevents, ep, tabulate, nbins=2000) 
# set colnames 
colnames(counts) <- paste0("a",1:ncol(counts)) 
# calculate correlation 
#cc <- cor(counts)
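
To make the per-interval step concrete, here is a minimal base R sketch of what tabulate does with the events of a single interval. The type values are made up for illustration, and nbins is set to 30 rather than 2000 only to keep the example small.

```r
# Made-up event types observed in one hypothetical 5-minute interval
types_in_interval <- c(12L, 12L, 25L, 3L)

# One slot per possible type (1..nbins); types that did not occur stay at 0
counts_row <- tabulate(types_in_interval, nbins = 30)

counts_row[c(3, 12, 25)]  # counts for types 3, 12 and 25
length(counts_row)        # always nbins, so every interval yields a row of equal width
```

Because every interval produces a vector of the same length, period.apply can stack the results into one matrix with a column per type.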

Update in response to the OP's comments/edit:

# Create a sequence of 5-minute steps, from the actual start of the data 
m5 <- seq(round(start(xevents),'mins'), end(xevents), by='5 mins') 
# Create a sequence of 5-minute steps, from the start of 2012-01-01 
m5 <- seq(as.POSIXct("2012-01-01"), end(xevents), by='5 mins') 
# merge xevents with empty 5-minute xts object, and 
# subtract 1 second, so endpoints are at end of each 5-minute interval 
xevents5 <- merge(xevents, xts(,m5-1)) 
ep5 <- endpoints(xevents5, "minutes", 5) 
counts5 <- period.apply(xevents5, ep5, tabulate, nbins=2000) 
colnames(counts5) <- paste0("a",1:ncol(counts5)) 
# align to the beginning of each 5-minute interval, if you want 
counts5 <- align.time(counts5,60*5)
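
The same idea, keeping every 5-minute interval even when it contains no events, can also be sketched in base R without xts. This is only an illustration under assumptions: the events data frame, the three types and the origin value (2012-01-01 00:00:00 UTC in epoch seconds) are made up for the example.

```r
# Made-up events: type in 1..3, times in seconds since 1970
origin <- 1325376000   # assumed bucket origin: 2012-01-01 00:00:00 UTC
width  <- 300          # 5 minutes in seconds
events <- data.frame(
  type  = c(1L, 1L, 2L, 3L, 3L),
  times = origin + c(0, 10, 20, 900, 1100)
)

# Integer bucket index: 1 for [origin, origin + 300), 2 for the next interval, ...
bucket    <- (events$times - origin) %/% width + 1
n_buckets <- max(bucket)

# Fixing the factor levels keeps empty buckets (and unused types) as zero cells
counts <- table(
  bucket = factor(bucket, levels = seq_len(n_buckets)),
  type   = factor(events$type, levels = 1:3)
)
counts
```

Buckets 2 and 3 contain no events here, yet they still appear as all-zero rows, which is what a correlation across fixed 5-minute steps needs.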


2013-07-24 21:46:56

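This code is awesome! I had never heard of the xts library until now. However, the .POSIXct step converts my dates incorrectly, which leads to wrong counts... any idea how to fix this? – user2588829

@user2588829: If you weren't so vague, I might have an idea how to fix it... "converts my dates incorrectly" doesn't tell me anything.

Well, converting it with the .POSIXct function (the exact function I used is: 'as.POSIXct(strptime(x, format="%m/%d/%y %H:%M:%S"), tz="GMT"), origin="1970-01-01")') turns what was originally 11/14/2012 02:56 into 1970-01-07 14:28:44. – user2588829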

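cut it within the range of times, just as you did. After that, you can tabulate with table or xtabs, but do it for the whole data set at once so that you get a single matrix. Something like the following: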

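# round the observed time range down to multiples of 300 seconds
r <- trunc(range(events$times)/300) * 300 
# right=FALSE gives intervals [a, b), so a time exactly equal to r[1] is not turned into NA
events$times.bin <- cut(events$times, seq(r[1], r[2] + 300, by=300), right=FALSE) 
xtabs(~type+times.bin, events, drop.unused.levels=TRUE)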

Decide whether or not you want drop.unused.levels. With this kind of data, you may also want to create a sparse matrix.
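
A minimal sketch of the sparse variant, using the six rows from the question. It assumes the Matrix package is installed (xtabs needs it when sparse = TRUE); the right = FALSE argument is my own choice so that the intervals are closed on the left.

```r
library(Matrix)  # xtabs(sparse = TRUE) returns a Matrix-package sparse matrix

# The simplified sample data shown in the question
events <- data.frame(
  type  = c(1L, 1L, 2L, 2L, 2L, 2L),
  times = c(1352861760, 1362377700, 1365491820,
            1368216180, 1362088800, 1362377700)
)

# Round the observed range down to multiples of 300 s and cut into [a, b) bins
r <- trunc(range(events$times) / 300) * 300
events$times.bin <- cut(events$times, seq(r[1], r[2] + 300, by = 300),
                        right = FALSE)

# Two-factor formula; only the non-zero cells are stored
sp <- xtabs(~ type + times.bin, events, sparse = TRUE)
dim(sp)   # one row per type, one column per 5-minute bin
sum(sp)   # total number of events counted
```

With 2000 types and many thousands of mostly empty bins, storing only the non-zero cells should use far less memory than a dense table, although I have not measured this on the full 5 million rows.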


2013-07-24 21:29:08 krlmlr

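Did you try running this on 5 million rows? I ask because my computer locked up when I tried to run it on 1 million...

@JoshuaUlrich: No, I haven't. Did you use 'sparse=T'? – krlmlr

With 5 million records, I would probably use data.table. You could do it like this: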

# df is assumed to be the events data.frame from the question (columns: type, times)
# First we make a sequence of the buckets from initial time to at least the end time + 5 minutes 
buckets <- seq(from = min(df$times) , by = 300 , to = max(df$times)+300) 

library(data.table) 
DT <- data.table(df) 

# Work out what bucket each time is in, and store it as a new column by reference 
DT[ , bucket := which.max(times <= buckets) , by = "times" ] 

# Aggregate events by type and time bucket 
DT[ , list(Count = length(type)) , by = list(type, bucket) ] 
    type bucket Count 
1: 1  1  1 
2: 1 31721  1 
3: 2 42102  1 
4: 2 51183  1 
5: 2 30758  1 
6: 2 31721  1
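
The which.max lookup above scans the bucket vector once per distinct timestamp, which may become slow with millions of rows. As a hedged alternative, the same index can be computed with plain arithmetic. This sketch uses the six sample times from the question and checks that both approaches agree; the names slow and fast are mine.

```r
# Sample times from the question (seconds since 1970)
times <- c(1352861760, 1362377700, 1365491820,
           1368216180, 1362088800, 1362377700)
buckets <- seq(from = min(times), by = 300, to = max(times) + 300)

# Lookup as in the answer: first bucket boundary that is >= the timestamp
slow <- vapply(times, function(t) which.max(t <= buckets), integer(1))

# Vectorised equivalent: buckets[i] = min(times) + (i - 1) * 300,
# so the smallest matching i is ceiling((t - min) / 300) + 1
fast <- as.integer(ceiling((times - min(times)) / 300)) + 1L

identical(slow, fast)
fast
```

If this holds for your data, the arithmetic expression could replace the which.max call inside the data.table assignment, with no grouping by times needed.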


2013-07-24 21:59:38
