2014-06-24 160 views

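Optimizing a Pig script

I am trying to generate aggregated output. The problem is that all of the data ends up in a single reducer (the FILTER and COUNT steps are what cause the problem). How can I optimize the script below?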

Expected output: group,10,2,12,34 ...

data = LOAD '/input/useragents' USING PigStorage('\t') AS (Col1:chararray,Col2:chararray,Col3:chararray,col4:chararray,col5:chararray); 

grp1 = GROUP data BY UA PARALLEL 50; 
fr1 = FOREACH grp1 { 
     fltrCol1 = FILTER data BY Col1 == 'Other'; 
     fltrCol2 = FILTER data BY Col2 == 'Other'; 
     fltrCol3 = FILTER data BY Col3 == 'Other'; 
     fltrCol4 = FILTER data BY col4 == 'Other'; 
     fltrCol5 = FILTER data BY col5 == 'Other'; 
     cnt_fltrCol1 = COUNT(fltrCol1); 
     cnt_fltrCol2 = COUNT(fltrCol2); 
     cnt_fltrCol3 = COUNT(fltrCol3); 
     cnt_fltrCol4 = COUNT(fltrCol4); 
     cnt_fltrCol5 = COUNT(fltrCol5); 
     GENERATE group,cnt_fltrCol1,cnt_fltrCol2,cnt_fltrCol3,cnt_fltrCol4,cnt_fltrCol5; 
} 

Answer


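You can move the filter logic ahead of the GROUP by adding fltrCol{1,2,3,4,5} columns as integers (1 when the column equals 'Other', 0 otherwise) and then summing them per group. Off the top of my head, here is the script: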

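-- Note: UA must also be declared in this schema for the GROUP below to work;
-- the listing as posted omits it, and its position in the file is not shown.
data = LOAD '/input/useragents' USING PigStorage('\t') AS (Col1:chararray,Col2:chararray,Col3:chararray,col4:chararray,col5:chararray);

-- Turn each 'Other' test into a 0/1 flag before grouping.
-- The relation is called flags because FILTER is a reserved word in Pig.
flags = FOREACH data GENERATE UA,
     ((Col1 == 'Other') ? 1 : 0) AS fltrCol1,
     ((Col2 == 'Other') ? 1 : 0) AS fltrCol2,
     ((Col3 == 'Other') ? 1 : 0) AS fltrCol3,
     ((col4 == 'Other') ? 1 : 0) AS fltrCol4,
     ((col5 == 'Other') ? 1 : 0) AS fltrCol5;

-- Group the flagged relation, not the raw data.
grp1 = GROUP flags BY UA PARALLEL 50;

-- Summing the flags gives the per-group counts of 'Other'.
fr1 = FOREACH grp1 {
      cnt_fltrCol1 = SUM(flags.fltrCol1);
      cnt_fltrCol2 = SUM(flags.fltrCol2);
      cnt_fltrCol3 = SUM(flags.fltrCol3);
      cnt_fltrCol4 = SUM(flags.fltrCol4);
      cnt_fltrCol5 = SUM(flags.fltrCol5);
      GENERATE group,cnt_fltrCol1,cnt_fltrCol2,cnt_fltrCol3,cnt_fltrCol4,cnt_fltrCol5;
}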
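The flag-and-sum idea above can also be written with a plain (non-nested) FOREACH, sketched below. As far as I know, Pig is most likely to apply its combiner when the FOREACH after a GROUP projects only the group key and algebraic functions such as SUM, so this form should let partial sums be computed on the map side. Treat that as something to confirm with EXPLAIN on your own Pig version. The relation name flags and the position of UA as the first field are assumptions for illustration, since the posted schema does not show where UA sits.

```pig
-- Assumption: UA is the first tab-separated field; adjust the schema to the real file layout.
data = LOAD '/input/useragents' USING PigStorage('\t')
       AS (UA:chararray, Col1:chararray, Col2:chararray, Col3:chararray, col4:chararray, col5:chararray);

-- One 0/1 flag per column, computed row by row in the map phase.
flags = FOREACH data GENERATE UA,
        ((Col1 == 'Other') ? 1 : 0) AS fltrCol1,
        ((Col2 == 'Other') ? 1 : 0) AS fltrCol2,
        ((Col3 == 'Other') ? 1 : 0) AS fltrCol3,
        ((col4 == 'Other') ? 1 : 0) AS fltrCol4,
        ((col5 == 'Other') ? 1 : 0) AS fltrCol5;

grp1 = GROUP flags BY UA PARALLEL 50;

-- Only the group key and SUM calls are projected; there is no nested FILTER.
fr1 = FOREACH grp1 GENERATE group,
      SUM(flags.fltrCol1) AS cnt_fltrCol1,
      SUM(flags.fltrCol2) AS cnt_fltrCol2,
      SUM(flags.fltrCol3) AS cnt_fltrCol3,
      SUM(flags.fltrCol4) AS cnt_fltrCol4,
      SUM(flags.fltrCol5) AS cnt_fltrCol5;
```

Running EXPLAIN fr1; should show whether a combine plan is present. One behavioral difference from the FILTER and COUNT version: if a column is null in every row of a group, SUM returns null for that group rather than 0.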

Thanks, Alex. There were some data problems; it works fine now. I will implement your idea to optimize it. – Arun