2016-08-07

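PIG: How to create a table based on percentages (%)?

I want to create a table that shows the percentage of occurrences of each value. For example, I have a table containing the following data: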

class, value 
------ ------- 
1  , abc 
1  , abc 
1  , xyz 
1  , abc 
2  , xyz 
2  , abc 

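Here, for class value 1, "abc" occurs 3 times and "xyz" occurs only once, out of 4 occurrences in total. For class value 2, "abc" and "xyz" each occur once (2 occurrences in total).

So, the output is: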

class, %_of_abc, %_of_xyz 
------ -------- -------- 
1  , 75  , 25 
2  , 50  , 50 

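Any idea how to do this when the values in both columns vary? I was thinking of using GROUP, but I am not sure how grouping by the class value would help me.

Answer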


It is a bit involved, but here is a solution:

grunt> Dump A; 
(1,abc) 
(1,abc) 
(1,xyz) 
(1,abc) 
(2,xyz) 
(2,abc) 
grunt> B = Group A by class; 
grunt> C = foreach B generate group as class:int, COUNT(A) as cnt; 
grunt> D = Group A by (class,value);   
grunt> E = foreach D generate FLATTEN(group), COUNT(A) as tot_cnt; 
grunt> F = foreach E generate $0 as class:int, $1 as value:chararray, tot_cnt; 
grunt> G = JOIN F BY class,C BY class; 
grunt> H = foreach G generate $0 as class,$1 as value,($2*100/$4) as perc; 
grunt> Dump H; 
(1,xyz,25) 
(1,abc,75) 
(2,xyz,50) 
(2,abc,50) 
grunt> I = group H by class; 
grunt> J = FOREACH I generate group as class, FLATTEN(BagToTuple(H.perc)); 
grunt> Dump J; 
(1,75,25) 
(2,50,50) 
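
One caveat with the last step: a bag in Pig is an unordered collection, so BagToTuple(H.perc) may not always put the "abc" percentage before the "xyz" one. If the column order matters, a possible variation is to compute each percentage explicitly inside a nested FOREACH. This is only a sketch under some assumptions: the file name data.csv and the aliases pct_abc and pct_xyz are hypothetical, the fields are assumed to have no surrounding spaces, and the set of values ('abc', 'xyz') is assumed to be known in advance.

```pig
-- Hypothetical input file; assumes clean "class,value" lines such as "1,abc".
A = LOAD 'data.csv' USING PigStorage(',') AS (class:int, value:chararray);

-- One group per class; the bag A holds every row of that class.
B = GROUP A BY class;

-- Inside each group, filter the bag once per value and divide by the group size.
-- Multiplying by 100.0 forces floating-point division instead of integer division.
C = FOREACH B {
    abc = FILTER A BY value == 'abc';
    xyz = FILTER A BY value == 'xyz';
    GENERATE group AS class,
             100.0 * COUNT(abc) / COUNT(A) AS pct_abc,
             100.0 * COUNT(xyz) / COUNT(A) AS pct_xyz;
};

DUMP C;
```

This avoids the JOIN and fixes the column positions, at the cost of hard-coding the values; if the set of values is not known beforehand, the GROUP/JOIN approach above is the more general one.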

Thanks! Works perfectly! – Tanvir