2017-04-18

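BigQuery SQL: Identify time ranges of a minimum length that meet a condition

A simplified version of my problem: I have a table with the fields id, timestamp, and a numeric variable (speed). I need to identify time periods (start and end timestamps) where the average speed is below a threshold (e.g. 2) and the period (end timestamp minus start timestamp) is at least a minimum duration (e.g. 5 hours or more). Basically, I need to compute the average over an initial 5-hour window; if the average is below the threshold, keep the start timestamp, advance the end_timestamp by one row, and recompute the average. If the new average is still below the threshold, advance again, expanding the window. If the new average is above the threshold, report the previous end_timestamp as the end_timestamp of this window, start a new start_timestamp, and compute a new average over another 5 hours. The final product is a table of start_timestamps and end_timestamps (plus the computed duration) where the average speed is below 2 and at least 5 hours separate start and end.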
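The procedure described above can be sketched outside SQL to make the intended behavior concrete. This is a minimal JavaScript sketch, not BigQuery code. The function name findSlowPeriods, the row shape { ts, speed } (ts in seconds, rows sorted by ts), and the choice to move the start forward by one row when an initial window fails are all assumptions made for illustration.

```javascript
// Hypothetical helper: finds periods of at least minSeconds whose average
// speed is below threshold, growing each window one row at a time.
function findSlowPeriods(rows, minSeconds, threshold) {
  const periods = [];
  let start = 0;
  while (start < rows.length) {
    // Build the initial window: extend until it spans at least minSeconds.
    let end = start;
    let sum = rows[start].speed;
    while (end + 1 < rows.length && rows[end].ts - rows[start].ts < minSeconds) {
      end++;
      sum += rows[end].speed;
    }
    if (rows[end].ts - rows[start].ts < minSeconds) break; // not enough data left
    let count = end - start + 1;
    if (sum / count >= threshold) {
      start++; // assumption: retry with the next row as the start
      continue;
    }
    // Expand the window while the average stays below the threshold.
    while (end + 1 < rows.length &&
           (sum + rows[end + 1].speed) / (count + 1) < threshold) {
      end++;
      sum += rows[end].speed;
      count++;
    }
    periods.push({
      startTs: rows[start].ts,
      endTs: rows[end].ts,
      averageSpeed: sum / count,
      durationHrs: (rows[end].ts - rows[start].ts) / 3600,
    });
    start = end + 1; // the next window starts after the reported one
  }
  return periods;
}

// Made-up hourly readings: a slow stretch, one fast reading, then too little data.
const sampleSpeeds = [1, 1, 1, 1, 1, 1, 1, 10, 1, 1];
const sampleRows = sampleSpeeds.map((speed, hour) => ({ ts: hour * 3600, speed }));
const slowPeriods = findSlowPeriods(sampleRows, 5 * 3600, 2);
console.log(slowPeriods);
```

With these made-up readings the sketch reports a single period covering hours 0 through 6. Rebuilding the initial window from scratch is quadratic in the worst case, which is acceptable for a sketch but not for large inputs.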

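I am using Google BigQuery. Below is the general structure I have so far, but it does not seem to work the way I intend. First, it only tests and reports the speed threshold for the initial 5-hour window, even when the window grows. Second, it does not seem to grow the window properly: windows are rarely longer than 5 hours, even though looking at my data they should in some cases be twice that long. I am hoping someone has attempted a similar analysis and can shed light on where I am going wrong.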

SELECT
  *,
  LEAD(start_timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS next_start_timestamp,
  LEAD(end_timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS next_end_timestamp
FROM (
  SELECT
    *,
    # 1000000*60*60*5 = 5 hours in microseconds
    IF(last_timestamp IS NULL
      OR timestamp - last_timestamp > 1000000*60*60*5, TRUE, FALSE) AS start_timestamp,
    IF(next_timestamp IS NULL
      OR next_timestamp - timestamp > 1000000*60*60*5, TRUE, FALSE) AS end_timestamp
  FROM (
    SELECT
      *,
      LAG(timestamp, 1) OVER (PARTITION BY id ORDER BY timestamp) last_timestamp,
      LEAD(timestamp, 1) OVER (PARTITION BY id ORDER BY timestamp) next_timestamp,
    FROM (
      SELECT
        *,
        AVG(speed) OVER (PARTITION BY id ORDER BY timestamp RANGE BETWEEN 5 * 60 * 60 * 1000000 PRECEDING AND CURRENT ROW) AS avg_speed_last_period,
      FROM (
        SELECT
          id,
          timestamp,
          speed
        FROM
          [dataset.table1]))
    WHERE
      avg_speed_last_period < 2
    ORDER BY
      id,
      timestamp)
  HAVING
    start_timestamp
    OR end_timestamp)
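For reference, the AVG(speed) OVER (... RANGE BETWEEN ... PRECEDING AND CURRENT ROW) expression in the query above computes, for every row, the average over a trailing window of fixed width. The window never becomes longer than 5 hours, which may explain why only the initial 5-hour window appears to be tested. The JavaScript sketch below mimics that calculation on made-up hourly data. The function name trailingAverages and the use of seconds rather than microseconds are assumptions for illustration.

```javascript
// Trailing fixed-width average: for each row, average the speeds of all rows
// whose ts lies in [row.ts - windowSeconds, row.ts] (both bounds inclusive).
// Rows are assumed to be sorted by ts.
function trailingAverages(rows, windowSeconds) {
  return rows.map((row, i) => {
    let sum = 0;
    let count = 0;
    for (let j = i; j >= 0 && row.ts - rows[j].ts <= windowSeconds; j--) {
      sum += rows[j].speed;
      count++;
    }
    return sum / count;
  });
}

// Made-up hourly readings with a single fast reading at hour 7.
const hourlySpeeds = [1, 1, 1, 1, 1, 1, 1, 10, 1, 1];
const hourlyRows = hourlySpeeds.map((speed, hour) => ({ ts: hour * 3600, speed }));
const averages = trailingAverages(hourlyRows, 5 * 3600);
console.log(averages);
```

In this made-up series the value at hour 5 is 1 and the value at hour 7 is 2.5. Each value describes only the preceding 5 hours, so filtering on it cannot tell whether a longer stretch starting earlier still averages below the threshold.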

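Edit: Here is a link to some sample_data. Given this data and the requirement of an average speed below 2 for at least 5 hours, the first rows of the output table would hopefully be: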

ID   start_event              end_event                average_speed  duration_hrs
203  2015-01-08 17:40:06 UTC  2015-01-09 07:09:35 UTC  0.7802         13.491
203  2015-01-10 03:43:56 UTC  2015-01-10 08:48:57 UTC  1.452          5.083
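As a quick check of the duration_hrs column, the duration is the difference between end_event and start_event expressed in hours. The JavaScript sketch below reproduces the value for the first row. The helper name durationHours is an assumption, and the timestamps are rewritten in ISO 8601 form so that Date.parse accepts them.

```javascript
// Hours between two ISO 8601 UTC timestamps.
function durationHours(startIso, endIso) {
  return (Date.parse(endIso) - Date.parse(startIso)) / (1000 * 60 * 60);
}

// First expected row: 2015-01-08 17:40:06 UTC to 2015-01-09 07:09:35 UTC.
const firstRowHours = durationHours("2015-01-08T17:40:06Z", "2015-01-09T07:09:35Z");
console.log(firstRowHours.toFixed(3)); // "13.491"
```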
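Sample data and the expected result would help a lot in explaining this.

Thanks... added sample data and sample output.

You still leave some open "gaps". Please add a second row to the expected output; at least for me, that would close some of them.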

Answer

From your CSV, I assume the schema below:

[image: table schema]

with the following data in it:

[image: sample rows]

With this in mind, below is working code for BigQuery Standard SQL. It produces the output you are expecting:

id   start_event              end_event                average_speed  duration_hrs
203  2015-01-08 17:40:00 UTC  2015-01-09 07:09:00 UTC  0.78           13.48
203  2015-01-10 03:43:00 UTC  2015-01-10 08:48:00 UTC  1.45           5.08
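#standardSQL
CREATE TEMPORARY FUNCTION IdentifyTimeRanges(
  items ARRAY<STRUCT<ts INT64, speed FLOAT64, datetime TIMESTAMP>>,
  min_length INT64, threshold FLOAT64, max_speed FLOAT64
)
RETURNS ARRAY<STRUCT<start_event TIMESTAMP, end_event TIMESTAMP, average_speed FLOAT64, duration_hrs FLOAT64>>
LANGUAGE js AS """
  // items: points for one id, sorted by ts (Unix seconds).
  // min_length: minimum window length in seconds.
  // threshold: maximum allowed average speed of a window.
  // max_speed: a single reading above this value ends the current window.
  var result = [];
  var initial = 0;                    // index of the first point of the current window
  var candidate = items[initial].ts;  // start time of the current window
  var len = 0;                        // number of points in the current window
  var sum = 0;                        // sum of their speeds
  for (var i = 0; i < items.length; i++) {
    len++;
    sum += items[i].speed;

    // The window is still shorter than min_length: keep growing it,
    // unless a reading above max_speed forces a restart at the next point.
    if (items[i].ts - candidate < min_length) {
      if (items[i].speed > max_speed) {
        initial = i + 1;
        if (initial < items.length) {  // guard: i may be the last point
          candidate = items[initial].ts;
        }
        len = 0;
        sum = 0;
      }
      continue;
    }

    // The current point breaks the window: close it at the previous point
    // and report it if it still qualifies without the current point.
    if (sum / len > threshold || items[i].speed > max_speed) {
      var avg_speed = (sum - items[i].speed) / (len - 1);
      if (avg_speed <= threshold && items[i - 1].ts - items[initial].ts >= min_length) {
        var o = {};
        o.start_event = items[initial].datetime;
        o.average_speed = Number(avg_speed.toFixed(3));
        o.end_event = items[i - 1].datetime;
        o.duration_hrs = Number(((items[i - 1].ts - items[initial].ts) / 60 / 60).toFixed(3));
        result.push(o);
      }
      // Start a new window at the current point.
      initial = i;
      candidate = items[initial].ts;
      len = 1;
      sum = items[initial].speed;
    }
  }

  // Note: a window that is still open when the data ends is not reported.
  return result;
""";

WITH data AS (
  SELECT id, PARSE_TIMESTAMP('%m/%d/%y %H:%M', datetime) AS datetime, speed
  FROM `yourTable`
), compact_data AS (
  SELECT
    id,
    ARRAY_AGG(
      STRUCT<ts INT64, speed FLOAT64, datetime TIMESTAMP>(UNIX_SECONDS(datetime), speed, datetime)
      ORDER BY UNIX_SECONDS(datetime)
    ) AS points
  FROM data
  GROUP BY id
)
SELECT
  id, start_event, end_event, average_speed, duration_hrs
-- Arguments: minimum length of 5 hours in seconds, average threshold 2, max_speed 3.1
FROM compact_data, UNNEST(IdentifyTimeRanges(points, 5*60*60, 2, 3.1)) AS segment
ORDER BY id, start_event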
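To sanity-check the windowing logic without running a query, the loop from the UDF can be ported to a plain JavaScript function and run locally (for example with Node.js). The sketch below is such a port under stated assumptions: the name identifyTimeRanges, the empty-input check, the bounds guard, and the made-up hourly test data are additions for illustration and are not part of the answer's code.

```javascript
// Local port of the UDF loop. items must be sorted by ts (seconds).
function identifyTimeRanges(items, minLength, threshold, maxSpeed) {
  const result = [];
  if (items.length === 0) return result; // assumption: tolerate empty input
  let initial = 0;
  let candidate = items[initial].ts;
  let len = 0;
  let sum = 0;
  for (let i = 0; i < items.length; i++) {
    len++;
    sum += items[i].speed;
    if (items[i].ts - candidate < minLength) {
      if (items[i].speed > maxSpeed) {
        // Restart at the next point, if there is one.
        initial = i + 1;
        if (initial < items.length) candidate = items[initial].ts;
        len = 0;
        sum = 0;
      }
      continue;
    }
    if (sum / len > threshold || items[i].speed > maxSpeed) {
      // Average of the window without the point that broke it.
      const avgSpeed = (sum - items[i].speed) / (len - 1);
      if (avgSpeed <= threshold &&
          items[i - 1].ts - items[initial].ts >= minLength) {
        result.push({
          start_event: items[initial].datetime,
          end_event: items[i - 1].datetime,
          average_speed: Number(avgSpeed.toFixed(3)),
          duration_hrs: Number(((items[i - 1].ts - items[initial].ts) / 3600).toFixed(3)),
        });
      }
      initial = i;
      candidate = items[initial].ts;
      len = 1;
      sum = items[initial].speed;
    }
  }
  return result;
}

// Made-up hourly readings starting at the Unix epoch.
const testSpeeds = [1, 1, 1, 1, 1, 1, 1, 10, 1, 1];
const testItems = testSpeeds.map((speed, hour) => ({
  ts: hour * 3600,
  speed,
  datetime: new Date(hour * 3600 * 1000).toISOString(),
}));
const ranges = identifyTimeRanges(testItems, 5 * 60 * 60, 2, 3.1);
console.log(ranges);
```

On this made-up series the function reports one range, from hour 0 to hour 6, with an average speed of 1. The two trailing readings after the fast one are not reported because they never span 5 hours.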

Please note: this code uses User-Defined Functions, which means you may hit some limits, quotas, and costs depending on the size of your data.

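Also keep in mind: if the datetime field is not of STRING data type, you only need to slightly adjust the data subquery; the rest should stay as is!

For example, if datetime is of TIMESTAMP data type, you just need to replace the first SELECT below with the second one: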

SELECT id, PARSE_TIMESTAMP('%m/%d/%y %H:%M', datetime) AS datetime, speed
    FROM `yourTable`

with

SELECT id, datetime, speed
    FROM `yourTable`

Hope you like it :o)