2014-09-06

Get percentages from counts in Hive

I have a table that looks like this:

COL1 COL2 DATETIMESTAMP CATEGORY1 CATEGORY2 
e-12 1101 201408110525 Arts and Entertainment Television 
e-12 1101 201408110525 Arts and Entertainment Television 
e-12 1101 201408110525 Arts and Entertainment Television 
e-12 1101 201408110620 Technology and Computing Internet Technology 
e-12 1101 201408110705 Technology and Computing Antivirus Software 
e-12 1107 201408110510 Business Advertising 
e-12 1107 201408110520 Business Marketing 
e-12 1107 201408110520 Business Marketing 
e-12 1107 201408110520 Business Marketing 
e-12 1107 201408110520 Business Marketing 
e-12 1107 201408110520 Business Marketing 
e-12 1107 201408110520 Business Marketing 
e-12 1107 201408110520 Business Marketing 
e-12 1109 201408110505 Technology and Computing Web Search 

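Ignoring COL1 (its values are all the same), each COL2 value has several combinations of the remaining fields. I managed to count how often each combination repeats, which produces the following: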

COL1 COL2 DATETIMESTAMP CATEGORY1 CATEGORY2 COUNT 
e-12 1101 201408110525 Arts and Entertainment Television 3 
e-12 1101 201408110620 Technology and Computing Internet Technology 1 
e-12 1101 201408110705 Technology and Computing Antivirus Software 1 
e-12 1107 201408110510 Business Advertising 1 
e-12 1107 201408110520 Business Marketing 7 
e-12 1109 201408110505 Technology and Computing Web Search 1 

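How do I convert each count into a percentage of all combinations for the same COL2?

Sorry that I can't word this better, but the output should look like this: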

COL1 COL2 DATETIMESTAMP CATEGORY1 CATEGORY2 COUNT PERCENTAGE 
e-12 1101 201408110525 Arts and Entertainment Television 3 60% 
e-12 1101 201408110620 Technology and Computing Internet Technology 1 20% 
e-12 1101 201408110705 Technology and Computing Antivirus Software 1 20% 
e-12 1107 201408110510 Business Advertising 1 12.5% 
e-12 1107 201408110520 Business Marketing 7 87.5% 
e-12 1109 201408110505 Technology and Computing Web Search 1 100% 

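Note: at this point, the COUNT column itself is not required.

Is this even possible in Hive? How can I modify my count query (below) to output the last table?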

SELECT COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2, count(*) FROM temp_table GROUP BY CATEGORY1, CATEGORY2, DATETIMESTAMP, COL2, COL1 SORT BY COL2; 

Thanks.

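You could count by COL2 and by the category group separately using two SELECT statements, and then combine the two counts in the main SELECT statement – 2014-09-06 18:30:17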

Answers

1

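I can think of a couple of ways to do this. You can compute the denominator of your percentage (the total number of rows per col2), join it back to the original data, and then divide each combination's count by that total. Alternatively, if you have access to windowing functions in Hive (I believe they shipped in 0.13), you can use an OVER (PARTITION BY ...) clause in the SELECT to avoid the join described in the first approach.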

#1:

select col2, cat1, cat2, datetimestamp 
    ,(COUNT(cat2)/MAX(total_)) as perc 
from (
    select n.col2, cat1, cat2, datetimestamp, x.total_ 
    from some_table as n 
    JOIN (
     select col2, COUNT(col2) as total_ 
     from some_table 
     group by col2 
     ) x 
    ON x.col2 = n.col2 
    ) y 
group by cat1, cat2, col2, datetimestamp 

#2:

select col2, cat1, cat2, datetimestamp 
    ,(COUNT(col2)/MAX(total)) as perc 
from (
    -- datetimestamp must be selected here so the outer query can group by it
    select col2, cat1, cat2, datetimestamp 
     ,COUNT(cat1) OVER (PARTITION BY col2) as total 
    from some_table 
    ) x 
group by cat1, cat2, col2, datetimestamp 
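
A variation on the windowing approach, in case it helps: aggregate the counts first, then divide each count by the per-COL2 sum of counts. This is only a sketch. It assumes the table and column names from the question (temp_table, COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2) and a Hive version with windowing functions; the aliases cnt, percentage and counts are made up for illustration.

```sql
-- Sketch only: table and column names are taken from the question;
-- cnt, percentage and counts are illustrative aliases.
SELECT col1, col2, datetimestamp, category1, category2,
       cnt,
       -- each combination's count as a share of all rows with the same col2
       100.0 * cnt / SUM(cnt) OVER (PARTITION BY col2) AS percentage
FROM (
    -- the count query from the question, without the SORT BY
    SELECT col1, col2, datetimestamp, category1, category2,
           COUNT(*) AS cnt
    FROM temp_table
    GROUP BY col1, col2, datetimestamp, category1, category2
) counts;
```

On the sample data in the question, col2 = 1101 has counts 3, 1 and 1 out of 5 rows, so the percentages work out to 60, 20 and 20, which matches the desired output table. Keeping cnt in the result is optional, since the question says the count is no longer needed.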

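I used sample #2. I ran into a problem with 'datetimestamp', so I added it to the inner select statement. I also multiplied 'perc' by 100 so that it looks more like a percentage. Will my edits affect accuracy? I tested your code on the sample data above, and so far so good. – 2014-09-07 06:37:16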
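
For reference, the edits described in this comment might look roughly like the query below. It is a sketch under assumptions, not the commenter's actual query: some_table, cat1, cat2 and the other column names come from the answer, and the ROUND/CONCAT formatting that appends a percent sign is an optional extra that neither the answer nor the comment includes.

```sql
-- Sketch only: names follow the answer's sample #2; the rounding and the
-- '%' suffix are illustrative additions.
SELECT col2, cat1, cat2, datetimestamp,
       -- multiplying by 100 rescales the ratio; it does not change which
       -- rows are counted or how the total is computed
       CONCAT(CAST(ROUND(100.0 * COUNT(col2) / MAX(total), 1) AS STRING), '%') AS perc
FROM (
    SELECT col2, cat1, cat2, datetimestamp,
           -- total number of rows sharing the same col2
           COUNT(col2) OVER (PARTITION BY col2) AS total
    FROM some_table
) x
GROUP BY col2, cat1, cat2, datetimestamp;
```

Adding datetimestamp to the inner select only exposes the column to the outer GROUP BY, so neither edit should change the underlying ratio. The string formatting is purely cosmetic; leave it out if the value is needed as a number downstream.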