2017-05-03 61 views

Problem with Hive ROW_NUMBER() OVER() syntax

I have a Hive table (details below):

hive> select * from abcd ; 
OK 
a 1 1 
b 2 2 
a 3 3 
Time taken: 0.261 seconds, Fetched: 3 row(s) 
hive> desc abcd; 
OK 
val001     string          
val002     int           
val003     int           
Time taken: 0.084 seconds, Fetched: 3 row(s) 

I wrote the query below, but it fails with the following error:

select max(rnk) rnk, max(val) val, sum(cnt) cnt from (select val, count(*) cnt, row_number() over (order by case val when null then 0 else count(*) end desc, val) rnk from (select VAL001 val from abcd) group by val) group by case when rnk <= 100 or val is null then rnk else 100 + 1 end; 

FAILED: ParseException line 3:55 missing) at 'by' near 'by' 
line 3:58 missing EOF at 'val' near 'by' 

The result I am looking for from the query above is:

RNK VAL    CNT 
--- ------------------------------ --- 
1 a     2 
2 b     1 

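I can get this same result in Oracle. The only difference is the ORDER BY expression: the Oracle query uses decode there, but decode is not supported in Hive, so in the Hive query I replaced it with a CASE expression.

Here is the working Oracle SQL query: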

SQL> select max(rnk) rnk, max(val) val, sum(cnt) cnt from 
    (select val, count(*) cnt, row_number() over (order by 
    decode(val,null,0,count(*)) desc, val) rnk from (select VAL001 val from 
    table_name) group by val) 
    group by case when rnk <= 100 or val is null then rnk else 100 + 1 end; 

RNK VAL    CNT 
--- ------------------------------ --- 
1 a      2 
2 b      1 

Can anyone help me fix the Hive query? Let me know if you need more details.

Answers

1

This is your query. I suspect there is also a simpler way to get what you want:

select max(rnk) as rnk, max(val) as val, sum(cnt) as cnt 
from (select val, count(*) as cnt, 
      row_number() over (order by case val when null then 0 else count(*) end desc, val) as rnk 
     from (select VAL001 val from abcd) 
     group by val 
    ) 
group by case when rnk <= 100 or val is null then rnk else 100 + 1 end; 

I think you just need table aliases for the subqueries in the FROM clause:

select max(rnk) as rnk, max(val) as val, sum(cnt) as cnt 
from (select val, count(*) as cnt, 
      row_number() over (order by case val when null then 0 else count(*) end desc, val) as rnk 
     from (select VAL001 val from abcd 
      ) x 
     group by val 
    ) x 
group by case when rnk <= 100 or val is null then rnk else 100 + 1 end; 
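
One further point, separate from the parse error: the ORDER BY uses `case val when null then 0 else count(*) end` as a stand-in for Oracle's `decode(val, null, 0, count(*))`. The two are not equivalent. A simple CASE compares with `=`, and `val = NULL` is never true, whereas `decode` treats two NULLs as a match. The sketch below shows the inner query with a searched CASE instead. It assumes the `abcd` table from the question; the alias `t` is only illustrative, and the query has not been run against a Hive instance.

```sql
-- Simple CASE: "case val when null ..." tests val = NULL, which never matches,
-- so the NULL group would be ordered by its count like any other group.
-- Searched CASE: "val is null" is true for the NULL group, so it gets 0
-- and sorts last under DESC, as decode(val, null, 0, count(*)) does in Oracle.
select val,
       count(*) as cnt,
       row_number() over (
           order by case when val is null then 0 else count(*) end desc, val
       ) as rnk
from (select VAL001 as val from abcd) t   -- Hive requires the subquery alias
group by val;
```

With the sample data in the question there are no NULL values, so both forms give the same ranking; the difference only appears once `VAL001` contains NULLs.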

This helped, thank you very much :) Do you also have a suggestion for a simpler version of this query? – HiveRLearner

Do you mean a single query that achieves the result? That would earn extra kudos :) –

0

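This is not technically a simpler solution, but it may be easier to read:

The first subquery does the counting and ranking,

the second subquery assigns the categories top 1 to top 100 plus the special categories other and unknown,

and the final query does the grouping.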

with cnt as (
select VAL001 val, 
    count(*) as cnt, 
    row_number() over (order by decode(VAL001,null,0,count(*)) desc, VAL001) as rnk 
from abcd 
group by VAL001), 
ctg as (
select 
    val, cnt, rnk, 
    case when val is NULL then 'unknown' 
     when rnk <= 100 then 'top '||rnk 
     else 'other' end as category_code 
from cnt) 
select 
    max(rnk) as rnk, max(val) as val, sum(cnt) as cnt 
from ctg 
group by category_code 
order by 1
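
The query above is written in Oracle syntax: it relies on `decode`, which the question says is unavailable in Hive, and on `||` for string concatenation, which older Hive releases may not accept. A possible Hive-flavored rewrite of the same three steps is sketched below. It assumes a Hive version with CTE (`with`) support, substitutes a searched CASE for `decode` and `concat` for `||`, and has not been run against a Hive instance.

```sql
with cnt as (
    -- step 1: count each value and rank by count, NULL group last
    select VAL001 as val,
           count(*) as cnt,
           row_number() over (
               order by case when VAL001 is null then 0 else count(*) end desc,
                        VAL001
           ) as rnk
    from abcd
    group by VAL001
),
ctg as (
    -- step 2: label each row as 'top N', 'other', or 'unknown'
    select val, cnt, rnk,
           case when val is null then 'unknown'
                when rnk <= 100 then concat('top ', cast(rnk as string))
                else 'other'
           end as category_code
    from cnt
)
-- step 3: collapse each category into one output row
select max(rnk) as rnk, max(val) as val, sum(cnt) as cnt
from ctg
group by category_code
order by rnk;
```

The explicit `cast(rnk as string)` is a cautious choice; `concat` may well convert the integer implicitly, but the cast makes the intent clear.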