2014-01-19

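Creating time intervals using the sqldf package

This is what my data frame looks like (sample at the end of this question).

I want to create 15-minute or 30-minute time intervals and, for each interval, sum No_Words over all the timestamps that fall within it. I need this to plot the average number of words per time interval.

How should I go about this?

Also, I would really like to know whether a solution using the sqldf package is possible.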

   Time     No_Words 
1 2013-11-17 13:37:00     6  
2 2013-11-17 13:37:00     16  
3 2013-11-17 13:37:00     18  
4 2013-11-17 13:37:00     12  
5 2013-11-17 14:03:00     5  
6 2013-11-17 14:03:00     20  
7 2013-11-17 14:04:00     4  
8 2013-11-17 17:21:00     39  
9 2013-11-17 22:48:00     19  
10 2013-11-17 22:48:00     12  

Answers

# generate example data: 500 random timestamps within one day
set.seed(1)
dateseq <- seq(as.POSIXct("2013-11-17"), as.POSIXct("2013-11-18"), by="min")
df <- data.frame(Time=dateseq[sample(seq_along(dateseq), 500)],
                 No_Words=sample(1:100, 500, replace=TRUE))
# assign each timestamp to a 30-minute bin
groups <- cut(df$Time, breaks="30 min")

The hard way, using sqldf:

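library(sqldf)
df$groups <- groups
# average No_Words per 30-minute bin
agg <- sqldf("select groups, avg(No_Words) from df group by groups", row.names=TRUE)
row.names(agg) <- agg[,1]
# coerce to a numeric matrix; the label column turns into NA (typically with
# a coercion warning), but the bin labels are kept as row names
agg <- as.matrix(agg)
class(agg) <- "numeric"
par(mar=c(2,10,0,0)); barplot(agg[,2], horiz=TRUE, las=1)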

The easy way, using e.g. tapply:

agg <- tapply(df$No_Words, list(groups), mean) 
par(mar=c(2,10,0,0)); barplot(agg, horiz=TRUE, las=1) 

This answer does not use sqldf, but rather the base R functions aggregate and cut.

## If your "Time" column is not an actual time object, 
## convert it to one before proceeding. 
mydf$Time <- as.POSIXct(mydf$Time) 

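cut can create the time bins. We will use it to do the aggregation. You could use the formula notation, but I have used the list method here because it makes it easy to specify the resulting column names: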

## Aggregate data in 30 minute chunks 
aggregate(list(No_Words = mydf$No_Words), 
      list(Time = cut(mydf$Time, "30 min")), FUN = mean) 
#     Time No_Words 
# 1 2013-11-17 13:37:00 11.57143 
# 2 2013-11-17 17:07:00 39.00000 
# 3 2013-11-17 22:37:00 15.50000 

## Aggregate data into 15 minute chunks 
aggregate(list(No_Words = mydf$No_Words), 
      list(Time = cut(mydf$Time, "15 min")), FUN = mean) 
#     Time No_Words 
# 1 2013-11-17 13:37:00 13.000000 
# 2 2013-11-17 13:52:00 9.666667 
# 3 2013-11-17 17:07:00 39.000000 
# 4 2013-11-17 22:37:00 15.500000 
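
To make the binning step more concrete, here is a minimal sketch of what cut does with a "15 min" break specification. The three timestamps are hypothetical values chosen only for illustration, and the time zone is fixed to UTC as an assumption so the result does not depend on the local setting.

```r
# Hypothetical timestamps, for illustration only
times <- as.POSIXct(c("2013-11-17 13:37:00",
                      "2013-11-17 13:50:00",
                      "2013-11-17 14:03:00"), tz = "UTC")

# cut() returns a factor; each level is one 15-minute bin
bins <- cut(times, breaks = "15 min")

# Integer codes show which bin each timestamp landed in
bin_codes <- as.integer(bins)
print(data.frame(Time = times, bin = bins, code = bin_codes))
```

Note that the bins start at the earliest timestamp (13:37 here) rather than at a round quarter hour, which is why the aggregate output above is labelled 13:37, 13:52, and so on.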

sqldf: Here is an sqldf solution, where the input data frame is DF (defined in the note at the end):

library(sqldf) 

min15 <- 15 * 60 # in seconds 
ans <- fn$sqldf("select 
     t.Time - t.Time % $min15 as Time, 
     sum(t.No_Words) as No_Words 
    from DF t 
    group by Time") 
plot(No_Words ~ Time, ans, type = "o") 

giving:

> ans 
       Time No_Words 
1 2013-11-17 13:30:00  52 
2 2013-11-17 14:00:00  29 
3 2013-11-17 17:15:00  39 
4 2013-11-17 22:45:00  31 
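
The SQL expression `t.Time - t.Time % $min15` floors each timestamp to the start of its 15-minute interval by working on seconds since the epoch. The same arithmetic can be sketched in base R; the variable names `secs` and `floored` are illustrative, and UTC is assumed so that the epoch-based floor lines up with clock quarter hours.

```r
min15 <- 15 * 60  # interval width in seconds

# One timestamp from the question's data, interpreted as UTC (an assumption)
t <- as.POSIXct("2013-11-17 13:37:00", tz = "UTC")

# Seconds since 1970-01-01; subtracting the remainder floors to the interval start
secs <- as.numeric(t)
floored <- as.POSIXct(secs - secs %% min15, origin = "1970-01-01", tz = "UTC")

print(format(floored, "%Y-%m-%d %H:%M:%S"))
# "2013-11-17 13:30:00"
```

Unlike cut, this approach anchors the intervals to the clock (13:30, 13:45, ...) rather than to the earliest observation.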

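With a dense grid: If a dense grid is needed, then we need a grid data frame, G, which we left join with the existing ans (note that sqldf pulls in the chron package, so we use its trunc function):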

# create grid G
library(chron) # loaded explicitly in case sqldf has not attached it
rng <- range(as.POSIXct(trunc(as.chron(DF$Time), 15/(24 * 60))))
G <- data.frame(Time = seq(rng[1], rng[2], by = min15))

ans2 <- sqldf("select Time, coalesce(No_Words, 0) as No_Words 
     from (select * from G left join ans using(Time))") 
plot(No_Words ~ Time, ans2, type = "o") 

The first few rows of ans2 are:

> head(ans2) 

       Time No_Words 
1 2013-11-17 13:30:00  52 
2 2013-11-17 13:45:00  0 
3 2013-11-17 14:00:00  29 
4 2013-11-17 14:15:00  0 
5 2013-11-17 14:30:00  0 
6 2013-11-17 14:45:00  0 
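
For comparison, the same left-join-and-fill idea can be sketched in base R with merge. This is not part of the answer's sqldf approach. It uses only the first two rows of ans shown above, and the name `dense` and the UTC time zone are assumptions made for the illustration.

```r
# First two aggregated rows from the output above (UTC assumed)
ans <- data.frame(
  Time = as.POSIXct(c("2013-11-17 13:30:00", "2013-11-17 14:00:00"), tz = "UTC"),
  No_Words = c(52, 29)
)

# Dense 15-minute grid spanning the observed range
G <- data.frame(Time = seq(min(ans$Time), max(ans$Time), by = 15 * 60))

# Left join: keep every grid row, then fill intervals without data with 0
dense <- merge(G, ans, by = "Time", all.x = TRUE)
dense$No_Words[is.na(dense$No_Words)] <- 0

print(dense)
```

The `all.x = TRUE` argument plays the role of the SQL left join, and the NA replacement corresponds to coalesce(No_Words, 0).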

zoo: We also show a zoo solution:

library(zoo) 
library(chron) 
FUN <- function(x) as.POSIXct(trunc(as.chron(x), 15/(24 * 60))) 
z <- read.zoo(DF, FUN = FUN, aggregate = sum) 
plot(z) 

which gives z:

> z 
2013-11-17 13:30:00 2013-11-17 14:00:00 2013-11-17 17:15:00 2013-11-17 22:45:00 
      52     29     39     31 

Note: We used this data; in particular, Time is of class "POSIXct":

Lines<- " Time   No_Words 
1 2013-11-17 13:37:00     6  
2 2013-11-17 13:37:00     16  
3 2013-11-17 13:37:00     18  
4 2013-11-17 13:37:00     12  
5 2013-11-17 14:03:00     5  
6 2013-11-17 14:03:00     20  
7 2013-11-17 14:04:00     4  
8 2013-11-17 17:21:00     39  
9 2013-11-17 22:48:00     19  
10 2013-11-17 22:48:00     12 
" 

raw <- read.table(text = Lines, skip = 1) 
DF <- data.frame(Time = as.POSIXct(paste(raw$V2, raw$V3)), No_Words = raw$V4) 

+1 for all the variations! – A5C1D2H2I1M1N2O1R2T1