2017-05-03 53 views
1

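SQL (BigQuery) - Finding unusual sequences of values over time

I have a list of dollar values by day of week and hour of day (these come from timestamps, so I only have dayOfWeek and hourOfDay for 1 week).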

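Id | dayOfWeek | hourOfDay | dollars
1  | 1         | 1         | 0
1  | 1         | 2         | 0
1  | 1         | 3         | 0
1  | 1         | 4         | 0
1  | 1         | 5         | 6
1  | 1         | 6         | 5
1  | 1         | 7         | 7
1  | 1         | 8         | 18
1  | 1         | 9         | 13
1  | 1         | 10        | 19
1  | 1         | 11        | 18
1  | 1         | 12        | 13
1  | 1         | 13        | 19
1  | 1         | 14        | 10
1  | 1         | 15        | 16
1  | 1         | 16        | 15
1  | 1         | 17        | 17
1  | 1         | 18        | 18
1  | 1         | 19        | 13
1  | 1         | 20        | 0
1  | 1         | 21        | 0
1  | 1         | 22        | 0
1  | 1         | 23        | 0
1  | 2         | 1         | 0
1  | 2         | 2         | 0
1  | 2         | 3         | 0
1  | 2         | 4         | 0
1  | 2         | 5         | 16
1  | 2         | 6         | 15
1  | 2         | 7         | 27
1  | 2         | 8         | 11
1  | 2         | 9         | 13
1  | 2         | 10        | 11
1  | 2         | 11        | 18
1  | 2         | 12        | 14
1  | 2         | 13        | 14
1  | 2         | 14        | 10
1  | 2         | 15        | 16
1  | 2         | 16        | 15
1  | 2         | 17        | 17
1  | 2         | 18        | 18
1  | 2         | 19        | 13
1  | 2         | 20        | 10
1  | 2         | 21        | 22
1  | 2         | 22        | 0
1  | 2         | 23        | 0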

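I want to find the Ids that have an above-average number of consecutive 0s toward the end of the day. I was thinking of using something like percent_rank() to find the "above average" cases, but I can't work out how to group the consecutive runs of 0s for each Id.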

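Any help would be greatly appreciated, but please also let me know if I'm not thinking about this the right way or should consider a different direction. Many thanks.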

+0

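What does 'above average consecutive 0s towards the end of day' mean? By the way, it would be great if you could edit your question to show a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) of the code you are having trouble with, so that we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask). –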

+0

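For example, there are usually 1-2 consecutive 0s (e.g. hours 22, 23 = 0), but I want to catch instances like the one above (dayOfWeek = 1), where there are 4 consecutive 0s (hours 20, 21, 22, 23). Does that make sense? –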

+0

- Makes sense now. Hopefully it also makes sense from a business perspective :o) –

Answers

3

Below is for BigQuery Standard SQL:

#standardSQL
WITH outages AS (
  -- One row per run of consecutive zeros: where it starts and how long it is
  SELECT
    id,
    MIN(dayOfWeek) AS dayOfWeek,
    MIN(hourOfDay) AS hourOfDay,
    COUNT(1) AS len
  FROM (
    SELECT
      id, seq,
      -- First (day, hour) of the run, repeated on every row of that run
      FIRST_VALUE(dayOfWeek) OVER(win) AS dayOfWeek,
      FIRST_VALUE(hourOfDay) OVER(win) AS hourOfDay
    FROM (
      SELECT
        id, dayOfWeek, hourOfDay, dollars,
        -- Running count of non-zero rows; it stays constant across a run of zeros
        COUNTIF(dollars <> 0) OVER(PARTITION BY id ORDER BY dayOfWeek, hourOfDay) AS seq
      FROM `yourTable`
    )
    WHERE dollars = 0
    WINDOW win AS (PARTITION BY id, seq ORDER BY dayOfWeek, hourOfDay)
  )
  GROUP BY id, seq
),
averages AS (
  -- Average run length per id
  SELECT id, AVG(len) AS len
  FROM outages
  GROUP BY id
)
-- Keep only the runs that are longer than the average for their id
SELECT o.*
FROM outages AS o JOIN averages AS a
ON o.id = a.id AND o.len > a.len

You can test / play with it using the dummy data from your question, as below:

#standardSQL
WITH yourTable AS (
  -- Dummy data; note the (1, 2, 0, 0) row for day 2, hour 0, which the question's listing does not have
  SELECT * FROM UNNEST([STRUCT<id INT64, dayOfWeek INT64, hourOfDay INT64, dollars INT64>
    (1, 1, 1, 0),(1, 1, 2, 0),(1, 1, 3, 0),(1, 1, 4, 0),(1, 1, 5, 6),(1, 1, 6, 5),
    (1, 1, 7, 7),(1, 1, 8, 18),(1, 1, 9, 13),(1, 1, 10, 19),(1, 1, 11, 18),(1, 1, 12, 13),
    (1, 1, 13, 19),(1, 1, 14, 10),(1, 1, 15, 16),(1, 1, 16, 15),(1, 1, 17, 17),(1, 1, 18, 18),
    (1, 1, 19, 13),(1, 1, 20, 0),(1, 1, 21, 0),(1, 1, 22, 0),(1, 1, 23, 0),(1, 2, 0, 0),
    (1, 2, 1, 0),(1, 2, 2, 0),(1, 2, 3, 0),(1, 2, 4, 0),(1, 2, 5, 16),(1, 2, 6, 15),
    (1, 2, 7, 27),(1, 2, 8, 11),(1, 2, 9, 13),(1, 2, 10, 11),(1, 2, 11, 18),(1, 2, 12, 14),
    (1, 2, 13, 14),(1, 2, 14, 10),(1, 2, 15, 16),(1, 2, 16, 15),(1, 2, 17, 17),(1, 2, 18, 18),
    (1, 2, 19, 13),(1, 2, 20, 10),(1, 2, 21, 22),(1, 2, 22, 0),(1, 2, 23, 0)])
),
outages AS (
  SELECT
    id,
    MIN(dayOfWeek) AS dayOfWeek,
    MIN(hourOfDay) AS hourOfDay,
    COUNT(1) AS len
  FROM (
    SELECT
      id, seq,
      FIRST_VALUE(dayOfWeek) OVER(win) AS dayOfWeek,
      FIRST_VALUE(hourOfDay) OVER(win) AS hourOfDay
    FROM (
      SELECT
        id, dayOfWeek, hourOfDay, dollars,
        COUNTIF(dollars <> 0) OVER(PARTITION BY id ORDER BY dayOfWeek, hourOfDay) AS seq
      FROM `yourTable`
    )
    WHERE dollars = 0
    WINDOW win AS (PARTITION BY id, seq ORDER BY dayOfWeek, hourOfDay)
  )
  GROUP BY id, seq
),
averages AS (
  SELECT id, AVG(len) AS len
  FROM outages
  GROUP BY id
)
SELECT o.*
FROM outages AS o JOIN averages AS a
ON o.id = a.id AND o.len > a.len

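As you can see by playing with it, the outages sub-select finds every sequence of zeros, together with its length and the day and hour where it starts, and outputs the rows below. Because the dummy data includes a zero row for dayOfWeek 2, hourOfDay 0, the run that starts at day 1, hour 20 continues across midnight through day 2, hour 4, which is why its length is 9.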

id | dayOfWeek | hourOfDay | len
1  | 1         | 1         | 4
1  | 1         | 20        | 9
1  | 2         | 22        | 2
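
To see why the seq column groups consecutive zeros, here is a minimal sketch that computes only that column. The six-row inline sample and the name sample_rows are made up for illustration; the column names follow the question.

```sql
#standardSQL
-- Hypothetical six-row sample, not taken from the question's data
WITH sample_rows AS (
  SELECT * FROM UNNEST([STRUCT<id INT64, dayOfWeek INT64, hourOfDay INT64, dollars INT64>
    (1, 1, 18, 18),(1, 1, 19, 13),(1, 1, 20, 0),
    (1, 1, 21, 0),(1, 1, 22, 5),(1, 1, 23, 0)])
)
SELECT
  id, dayOfWeek, hourOfDay, dollars,
  -- Running count of non-zero rows seen so far for this id.
  -- It only increases on a non-zero row, so it stays flat across a run of zeros.
  COUNTIF(dollars <> 0) OVER(PARTITION BY id ORDER BY dayOfWeek, hourOfDay) AS seq
FROM sample_rows
ORDER BY id, dayOfWeek, hourOfDay
```

In this sample the zero rows at hours 20 and 21 both get seq = 2, and the zero row at hour 23 gets seq = 3, because the non-zero row at hour 22 bumps the counter. The non-zero rows carry a seq value too, which is why the full query filters with WHERE dollars = 0 before grouping by id and seq.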

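The final SELECT outputs only those rows from outages whose length is greater than the average length for the same id (taken from the averages sub-select):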

id | dayOfWeek | hourOfDay | len
1  | 1         | 20        | 9
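
The query above lets a run of zeros continue across midnight. If only the zeros at the very end of each day matter, as the comments under the question suggest, one possible variant is sketched below. It is not part of the original answer: the names day_tails, lastNonZeroHour and avgLen are made up, and it assumes every hour of the day has a row in `yourTable`.

```sql
#standardSQL
WITH day_tails AS (
  SELECT
    id, dayOfWeek,
    -- Zero rows that come after the last non-zero hour of the same day
    COUNTIF(dollars = 0 AND hourOfDay > lastNonZeroHour) AS len
  FROM (
    SELECT
      id, dayOfWeek, hourOfDay, dollars,
      -- Last hour of the day with non-zero dollars; -1 when the whole day is zero
      IFNULL(MAX(IF(dollars <> 0, hourOfDay, NULL)) OVER(PARTITION BY id, dayOfWeek), -1) AS lastNonZeroHour
    FROM `yourTable`
  )
  GROUP BY id, dayOfWeek
)
SELECT id, dayOfWeek, len
FROM (
  SELECT
    id, dayOfWeek, len,
    -- Average end-of-day run length for this id across all days
    AVG(len) OVER(PARTITION BY id) AS avgLen
  FROM day_tails
)
WHERE len > avgLen
```

On the data from the question the end-of-day runs are 4 zeros for dayOfWeek 1 and 2 zeros for dayOfWeek 2, so the average is 3 and only dayOfWeek 1 with len = 4 should be returned. Replacing the AVG comparison with PERCENT_RANK() OVER(PARTITION BY id ORDER BY len) would be one way to apply a percentile threshold instead, as the question originally considered.
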
+0

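There might be a ")" missing after the 'window' operation. Just wondering, is there anywhere in the BQ docs that discusses this _window_ technique? It's the first time I've seen it and I really like it. And great answer, btw –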

+0

Thanks @Will - consider voting it up if you like it :o) - I'll check the ")" thing and will also follow up with a link about the window –

+0

@Will - you are right - somehow I lost the ")" when formatting the answer. Thanks! Looking for the link now... –
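
For readers with the same question about the _window_ technique: the WINDOW clause gives a window specification a name so that several analytic functions can share it. A minimal sketch against the same `yourTable` follows; the window name day_win and the output column names are made up for illustration.

```sql
#standardSQL
SELECT
  id, dayOfWeek, hourOfDay, dollars,
  -- Both functions reuse the window defined once in the WINDOW clause below
  SUM(dollars) OVER(day_win) AS runningDollars,
  FIRST_VALUE(hourOfDay) OVER(day_win) AS firstHour
FROM `yourTable`
WINDOW day_win AS (PARTITION BY id, dayOfWeek ORDER BY hourOfDay)
```

This is equivalent to repeating the full PARTITION BY ... ORDER BY ... specification inside each OVER(), which is what the answer avoids for its two FIRST_VALUE calls.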