的R - 在折线图的项目数：比赛日期时间来算项目

的我有以下结构的数据帧：的R - 在折线图的项目数：比赛日期时间来算项目

df <- structure(list(Name = structure(1:9, .Label = c("task 1", "task 2", 
"task 3", "task 4", "task 5", "task 6", "task 7", "task 8", "task 9" 
), class = "factor"), Start = structure(c(1479799800, 1479800100, 
1479800400, 1479800700, 1479801000, 1479801300, 1479801600, 1479801900, 
1479802200), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1479801072, 
1479800892, 1479801492, 1479802092, 1479802692, 1479803292, 1479803892, 
1479804492, 1479805092), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("Name", 
"Start", "End"), row.names = c(NA, -9L), class = "data.frame")

现在我想在列“名称”来算的项目随着时间的推移。他们都有一个开始和结束的日期时间，这是形成POSIXct。

随着this solution here on SO帮助，我是能够做到这一点（或至少我认为我是）与下面的代码：

library(data.table) 
setDT(df) 
dates = seq(min(df$Start), max(df$End), by = "min") 
lookup = data.table(Start = dates, End = dates, key = c("Start", "End")) 
ans = foverlaps(df, lookup, type = "any", which = TRUE) 
library(ggplot2) 
ggplot(ans[, .N, by = yid], aes(x = yid, y = N)) + geom_line()

现在问题：

如何符合我的日期时间缩放到x轴上的整数值？或者有更快更好的解决方案来解决我的问题吗？

我试图在ggplot()的aes内使用x = as.POSIXct(yid, format = "%Y-%m-%dT%H:%M:%S", origin = min(df$Start))。但那并不奏效。

编辑：

当使用此问题的解决方案，我面对另一个。在没有计数的情况下，图中显示最新可计数项目的计数。这就是为什么我们必须再次将表与计数（ants）合并（左加入），并将所有日期时间的完整序列合并，并且每NA加上0。所以我们为每个必要的数据点获得明确的值。像这样：

# The part we use to count and match the right times 
df1 <- ans[, .N, by = yid] %>% 
mutate(time = min(df$Start) + minutes(yid)) 
# The part where we use the sequence from the beginning for a LEFT JOIN with the counting dataframe 
df2 <- data.frame(time = dates) 
dt <- merge(x = df2, y = df1, by = "time", all.x = TRUE) 
dt[is.na(dt)] <- 0

来源

2016-11-28 Talvien

在tidyverse框架，这是一个稍微不同的任务 -

生成萨姆斯dates变量，你必须。
建设与所有日期的数据帧和所有时间（笛卡儿连接）
过滤掉不在区间为每个任务
添加了任务，为保持
地块每分钟行。

这看起来是这样的 - 如果你需要的是更快

library(tidyverse) 
library(lubridate) 

dates = seq(min(df$Start), max(df$End), by = "min") 
df %>% 
    mutate(key = 1) %>% 
    left_join(data_frame(key = 1, times = dates)) %>% 
    mutate(include = times %within% interval(Start, End)) %>% 
    filter(include) %>% 
    group_by(times) %>% 
    summarise(count = n()) %>% 
    ggplot(aes(times, count)) + 
    geom_line() 
#> Joining, by = "key"

，它几乎肯定会更快使用原来的data.table代码。

请考虑这一点。

library(data.table) 
setDT(df) 
dates = seq(min(df$Start), max(df$End), by = "min") 
lookup = data.table(Start = dates, End = dates, key = c("Start", "End")) 
ans = foverlaps(df, lookup, type = "any", which = TRUE) 

ans[, .N, by = yid] %>% 
    mutate(time = min(df$Start) + minutes(yid)) %>% 
    ggplot(aes(time, N)) + 
    geom_line()

现在我们使用data.table来计算重叠，然后索引时间关闭起始分钟。一旦我们添加一个新的列与时间，我们可以绘制。

来源

2016-11-28 15:25:38

似乎工作，谢谢！唯一的问题是，当我插入原始数据集时，它会杀死我的笔记本：D我想我必须将时间范围从几分钟扩大到30分钟，以确保连接不会分配太多的RAM。 – Talvien

那么，这是最昂贵的方式来做到这一点。技术上有更快的方法....让我看看。 –

@Talvien我用'data.table'添加了一个例子。 –

的R - 在折线图的项目数：比赛日期时间来算项目

回答

相关问题