2017-03-27

Suppose I have two Hive tables: timeperiod1 and timeperiod2.

timeperiod1 has a schema like this:

cluster characteristic 
A  1 
A  2 
A  3 
B  2 
B  3 

timeperiod2 has a schema like this:

cluster characteristic 
A  1 
A  2 
B  2 
B  3 
B  4 

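I want to compute, by cluster, the set difference between the two time periods (i.e., tables). My plan (please let me know of any better approach) is to 1) apply collect_set (I know how to do this), and then 2) compute the set difference (I don't know how to do this).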

1) What I did:

CREATE TABLE collect_char_wk1 STORED AS ORC AS 
SELECT cluster, COLLECT_SET(characteristic) AS characteristic 
FROM timeperiod1 
GROUP BY cluster; 

CREATE TABLE collect_char_wk2 STORED AS ORC AS 
SELECT cluster, COLLECT_SET(characteristic) AS characteristic 
FROM timeperiod2 
GROUP BY cluster; 

This gives collect_char_wk1:

cluster characteristic 
A  [1,2,3] 
B  [2,3] 

and collect_char_wk2:

cluster characteristic 
A  [1,2] 
B  [2,3,4] 

2) Is there a Hive function I can use to compute the set difference? I'm not familiar enough with Java to write my own set_diff() Hive UDF/UDAF.

The result should be a table, set_diff_wk1_to_wk2:

cluster set_diff 
A  1 
B  0 
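
To pin down what the expected result means, here is a minimal pure-Python sketch on the toy data. The helper names `collect_sets` and `set_diff_counts` are made up for illustration; this is not a Hive function, and it only models the semantics (per cluster, the number of characteristics in timeperiod1 that are missing from timeperiod2), not something that would scale to the real data.

```python
from collections import defaultdict

# Toy rows copied from the tables above: (cluster, characteristic)
timeperiod1 = [("A", 1), ("A", 2), ("A", 3), ("B", 2), ("B", 3)]
timeperiod2 = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("B", 4)]


def collect_sets(rows):
    """Group characteristics by cluster, mimicking GROUP BY cluster + COLLECT_SET."""
    sets = defaultdict(set)
    for cluster, characteristic in rows:
        sets[cluster].add(characteristic)
    return sets


def set_diff_counts(rows1, rows2):
    """For each cluster in rows1, count the characteristics absent from rows2."""
    s1, s2 = collect_sets(rows1), collect_sets(rows2)
    return {c: len(chars - s2.get(c, set())) for c, chars in s1.items()}


print(set_diff_counts(timeperiod1, timeperiod2))  # {'A': 1, 'B': 0}
```

Swapping the arguments gives the difference in the other direction (timeperiod2 minus timeperiod1).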

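The above is a toy example; my actual data is on the scale of tens of billions of rows with multiple columns, so a computationally efficient solution is needed. My data is stored in HDFS, and I'm using HiveQL + Python.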

Answers

1

If you are trying to get, for each cluster, the number of characteristics that are in period1 but not in period2, you can simply use a left join and group by:

select t1.cluster, count(case when t2.characteristic is null then 1 end) as set_diff 
from timeperiod1 t1 
left join timeperiod2 t2 on t1.cluster=t2.cluster and t1.characteristic=t2.characteristic 
group by t1.cluster 
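
The query above can be checked against the toy data without a Hive cluster. The sketch below is an assumption-laden stand-in: it uses Python's built-in `sqlite3` module in place of Hive (the query only uses standard SQL, so it should behave the same way), and it adds an `order by` purely to make the printed output deterministic.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption for testing only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE timeperiod1 (cluster TEXT, characteristic INTEGER);
    CREATE TABLE timeperiod2 (cluster TEXT, characteristic INTEGER);
    INSERT INTO timeperiod1 VALUES ('A',1),('A',2),('A',3),('B',2),('B',3);
    INSERT INTO timeperiod2 VALUES ('A',1),('A',2),('B',2),('B',3),('B',4);
""")

# Same left join + group by as above; unmatched rows have a NULL t2.characteristic.
query = """
    select t1.cluster,
           count(case when t2.characteristic is null then 1 end) as set_diff
    from timeperiod1 t1
    left join timeperiod2 t2
      on t1.cluster = t2.cluster and t1.characteristic = t2.characteristic
    group by t1.cluster
    order by t1.cluster
"""
left_join_result = conn.execute(query).fetchall()
print(left_join_result)  # [('A', 1), ('B', 0)]
```

Note that this counts rows, not distinct values: if timeperiod1 can repeat a (cluster, characteristic) pair, the repeats would be counted too. One possible way around that is `count(distinct case when t2.characteristic is null then t1.characteristic end)`.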

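Out of curiosity, is this faster than using collect_set()? It seems like the LEFT JOIN would take a long time without reducing the number of rows, whereas the collect_set() approach would reduce the number of rows significantly. I added a note above explaining that I'm working with billions of rows (about 30 billion), so minimizing query time is ideal. – user2205916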


@user2205916 Try it on your data and check the run time. It's hard to say which approach would be faster. –

1

select  cluster

       ,count(*)                                          as count_total_characteristic
       ,count(case when in_1 = 1 and in_2 = 1 then 1 end) as count_both_1_and_2
       ,count(case when in_1 = 1 and in_2 = 0 then 1 end) as count_only_in_1
       ,count(case when in_1 = 0 and in_2 = 1 then 1 end) as count_only_in_2

       ,sort_array(collect_list(case when in_1 = 1 and in_2 = 1 then characteristic end)) as both_1_and_2
       ,sort_array(collect_list(case when in_1 = 1 and in_2 = 0 then characteristic end)) as only_in_1
       ,sort_array(collect_list(case when in_1 = 0 and in_2 = 1 then characteristic end)) as only_in_2

from   (select  cluster
               ,characteristic
               ,max(case when tab = 1 then 1 else 0 end) as in_1
               ,max(case when tab = 2 then 1 else 0 end) as in_2

        from   (          select 1 as tab, cluster, characteristic from timeperiod1
                union all select 2 as tab, cluster, characteristic from timeperiod2
               ) t

        group by cluster
                ,characteristic
       ) t

group by cluster

order by cluster
;

+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
| cluster | count_total_characteristic | count_both_1_and_2 | count_only_in_1 | count_only_in_2 | both_1_and_2 | only_in_1 | only_in_2 |
+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
| A       |                          3 |                  2 |               1 |               0 | [1,2]        | [3]       | []        |
| B       |                          3 |                  2 |               0 |               1 | [2,3]        | []        | [4]       |
+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
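
The count columns of this result can be reproduced on the toy data with a small sketch. As an assumption for testing only, it uses Python's built-in `sqlite3` module in place of Hive, and it leaves out the `sort_array(collect_list(...))` columns because those functions are Hive-specific and have no direct SQLite equivalent.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption for testing only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE timeperiod1 (cluster TEXT, characteristic INTEGER);
    CREATE TABLE timeperiod2 (cluster TEXT, characteristic INTEGER);
    INSERT INTO timeperiod1 VALUES ('A',1),('A',2),('A',3),('B',2),('B',3);
    INSERT INTO timeperiod2 VALUES ('A',1),('A',2),('B',2),('B',3),('B',4);
""")

# Inner query: one row per (cluster, characteristic) with membership flags.
# Outer query: per-cluster counts derived from those flags.
query = """
    select cluster,
           count(*)                                          as count_total_characteristic,
           count(case when in_1 = 1 and in_2 = 1 then 1 end) as count_both_1_and_2,
           count(case when in_1 = 1 and in_2 = 0 then 1 end) as count_only_in_1,
           count(case when in_1 = 0 and in_2 = 1 then 1 end) as count_only_in_2
    from (select cluster,
                 characteristic,
                 max(case when tab = 1 then 1 else 0 end) as in_1,
                 max(case when tab = 2 then 1 else 0 end) as in_2
          from (select 1 as tab, cluster, characteristic from timeperiod1
                union all
                select 2 as tab, cluster, characteristic from timeperiod2) t
          group by cluster, characteristic) t
    group by cluster
    order by cluster
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# ('A', 3, 2, 1, 0)
# ('B', 3, 2, 0, 1)
```

Because the inner `group by cluster, characteristic` collapses duplicates before counting, this variant counts distinct characteristics even if a table repeats a (cluster, characteristic) pair.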