Hive collect_set()

Suppose I have two tables: timeperiod1 and timeperiod2.

timeperiod1 looks like this:

cluster characteristic 
A  1 
A  2 
A  3 
B  2 
B  3

timeperiod2 looks like this:

cluster characteristic 
A  1 
A  2 
B  2 
B  3 
B  4

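I want to compute, for each cluster, the set difference between the two time periods (i.e. the two tables). My plan (please let me know if there is a better approach) is to 1) run collect_set (I know how to do this), and then 2) compute the set difference (I do not know how to do this).

1) What I did: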

CREATE TABLE collect_char_wk1 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic) AS characteristic
FROM timeperiod1
GROUP BY cluster;

CREATE TABLE collect_char_wk2 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic) AS characteristic
FROM timeperiod2
GROUP BY cluster;

This gives collect_char_wk1:

cluster characteristic 
A  [1,2,3] 
B  [2,3]

and collect_char_wk2:

cluster characteristic 
A  [1,2] 
B  [2,3,4]

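2) Is there a Hive function I can use to compute the set difference? I am not familiar enough with Java to write my own set_diff() Hive UDF/UDAF.

The result should be a table, set_diff_wk1_to_wk2: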

cluster set_diff 
A  1 
B  0
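
For reference, the intended semantics on the toy data can be sketched in plain Python. The two dictionaries stand in for collect_char_wk1 and collect_char_wk2; the variable names are illustrative assumptions, and this only checks the expected output rather than offering something that would scale to the real data:

```python
# Toy stand-ins for the collect_set() output of each time period.
collect_char_wk1 = {"A": {1, 2, 3}, "B": {2, 3}}
collect_char_wk2 = {"A": {1, 2}, "B": {2, 3, 4}}

# For each cluster, count the characteristics present in week 1 but not in week 2.
set_diff_wk1_to_wk2 = {
    cluster: len(chars - collect_char_wk2.get(cluster, set()))
    for cluster, chars in collect_char_wk1.items()
}

print(set_diff_wk1_to_wk2)  # {'A': 1, 'B': 0}
```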

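The above is a toy example. My actual data is on the order of tens of billions of rows with multiple columns, so I need a computationally efficient solution. My data is stored in HDFS, and I am using HiveQL + Python.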


2017-03-27 user2205916

If you are trying to get, for each cluster, the number of characteristics in period1 that are not in period2, you can simply use a left join and group by.

select t1.cluster, count(case when t2.characteristic is null then 1 end) as set_diff 
from timeperiod1 t1 
left join timeperiod2 t2 on t1.cluster=t2.cluster and t1.characteristic=t2.characteristic 
group by t1.cluster
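
One way to sanity-check this query on the toy data before running it at scale is to execute it against an in-memory SQLite database from Python. This is only a sketch: SQLite is an assumed stand-in for Hive here (the query happens to be plain SQL that both accept), and the added ORDER BY is only there to make the output deterministic.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE timeperiod1 (cluster TEXT, characteristic INTEGER);
    CREATE TABLE timeperiod2 (cluster TEXT, characteristic INTEGER);
    INSERT INTO timeperiod1 VALUES ('A',1),('A',2),('A',3),('B',2),('B',3);
    INSERT INTO timeperiod2 VALUES ('A',1),('A',2),('B',2),('B',3),('B',4);
""")

# Same left join as above: unmatched timeperiod1 rows have NULL on the t2 side.
query = """
    SELECT t1.cluster,
           COUNT(CASE WHEN t2.characteristic IS NULL THEN 1 END) AS set_diff
    FROM timeperiod1 t1
    LEFT JOIN timeperiod2 t2
      ON t1.cluster = t2.cluster AND t1.characteristic = t2.characteristic
    GROUP BY t1.cluster
    ORDER BY t1.cluster
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)  # [('A', 1), ('B', 0)]
```

Note that the query counts rows, so it equals the size of the set difference only when each (cluster, characteristic) pair appears at most once in timeperiod1.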


2017-03-27 19:50:12

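Out of curiosity, is that faster than using collect_set()? It looks like the LEFT JOIN takes a long time, whereas the collect_set() approach can significantly reduce the number of rows. I added a note above explaining that I am working with billions of rows (about 30 billion), so minimizing query time is ideal. – user2205916

@user2205916 Try it on your data and check the run time. It is hard to say which approach will be faster. –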

select  cluster 

      ,count(*)           as count_total_characteristic 
      ,count(case when in_1 = 1 and in_2 = 1 then 1 end) as count_both_1_and_2 
      ,count(case when in_1 = 1 and in_2 = 0 then 1 end) as count_only_in_1 
      ,count(case when in_1 = 0 and in_2 = 1 then 1 end) as count_only_in_2 

      ,sort_array(collect_list(case when in_1 = 1 and in_2 = 1 then characteristic end)) as both_1_and_2 
      ,sort_array(collect_list(case when in_1 = 1 and in_2 = 0 then characteristic end)) as only_in_1 
      ,sort_array(collect_list(case when in_1 = 0 and in_2 = 1 then characteristic end)) as only_in_2 

from  (select  cluster 
         ,characteristic 
         ,max(case when tab = 1 then 1 else 0 end) as in_1 
         ,max(case when tab = 2 then 1 else 0 end) as in_2 

      from  (   select 1 as tab,cluster,characteristic from timeperiod1 
         union all select 2 as tab,cluster,characteristic from timeperiod2 
         ) t 

      group by cluster 
         ,characteristic 
      ) t 

group by cluster 

order by cluster 
;

+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
| cluster | count_total_characteristic | count_both_1_and_2 | count_only_in_1 | count_only_in_2 | both_1_and_2 | only_in_1 | only_in_2 |
+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
| A       |                          3 |                  2 |               1 |               0 | [1,2]        | [3]       | []        |
| B       |                          3 |                  2 |               0 |               1 | [2,3]        | []        | [4]       |
+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
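
The logic of the query above (tag each row with its source table, reduce to one entry per cluster and characteristic, then route each characteristic by its in_1 / in_2 flags) can be sketched in plain Python on the toy data. The names tabs_seen and summary are illustrative assumptions, and the sketch only mirrors the logic, not Hive's execution:

```python
from collections import defaultdict

timeperiod1 = [("A", 1), ("A", 2), ("A", 3), ("B", 2), ("B", 3)]
timeperiod2 = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("B", 4)]

# Inner query: UNION ALL with a table tag, then one entry per
# (cluster, characteristic) recording which tables it was seen in.
tabs_seen = defaultdict(set)
for tab, rows in ((1, timeperiod1), (2, timeperiod2)):
    for cluster, characteristic in rows:
        tabs_seen[(cluster, characteristic)].add(tab)

# Outer query: group by cluster and route each characteristic by its flags.
grouped = defaultdict(lambda: {"both_1_and_2": [], "only_in_1": [], "only_in_2": []})
for (cluster, characteristic), tabs in tabs_seen.items():
    if tabs == {1, 2}:
        key = "both_1_and_2"
    elif tabs == {1}:
        key = "only_in_1"
    else:
        key = "only_in_2"
    grouped[cluster][key].append(characteristic)

# Sort each list, mirroring sort_array(collect_list(...)).
summary = {
    cluster: {name: sorted(values) for name, values in lists.items()}
    for cluster, lists in grouped.items()
}

for cluster in sorted(summary):
    print(cluster, summary[cluster])
```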


2017-03-27 20:56:50

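You can use the Brickhouse UDFs, which include many functions that do what you describe. More specifically, you can use set_diff, which is explained in the wiki. The README file walks you through building the jar file.

You can add the jar file in your query: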

ADD jar /PATH/TO/JARFILE/brickhouse-<VERSIONS>-SNAPSHOT.jar

Then use this guide to access the functions: https://github.com/klout/brickhouse/blob/master/src/main/resources/brickhouse.hql

Hope this helps.


2017-04-20 00:54:14 DrV
