2014-04-29
How can I merge two Pig statements into one? This two-stage Pig processing works:

my_out = foreach (group my_in by id) { 
    grouped = BagGroup(my_in.(keyword,weight),my_in.keyword); 
    generate 
    group as id, 
    CountEach(my_in.domain) as domains, 
    grouped as grouped; 
}; 
my_out1 = foreach my_out { 
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight; 
    generate id, domains, keywords; 
}; 

However, when I merge them:

my_out = foreach (foreach (group my_in by id) { 
    grouped = BagGroup(my_in.(keyword,weight),my_in.keyword); 
    generate 
    group as id, 
    CountEach(my_in.domain) as domains, 
    grouped as grouped; 
    }) { 
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight; 
    generate id, domains, keywords; 
    }; 

I get an error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " <IDENTIFIER> "generate "" at line 1, column 5. 

My questions are:

  1. How do I avoid this error?
  2. Does what I am trying to do even make sense? Even if I manage to do it, will it save me an MR pass?

Answer

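In general, Pig's ability to parse complex nested expressions is unreliable. Another common error when too much processing is nested is ERROR 1000: Error during parsing. Lexical error at line XXXX, column 0. Encountered: <EOF> after : ""

I often try to do this to avoid having to come up with a bunch of alias names that mean nothing except as intermediate steps in a computation. But sometimes it is not possible, as you have discovered. My guess is that nesting one nested foreach inside another is a no-go. In your case, though, it looks like the first nested foreach is unnecessary. Try this: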

my_out = foreach (foreach (group my_in by id) 
    generate 
    group as id, 
    CountEach(my_in.domain) as domains, 
    BagGroup(my_in.(keyword,weight),my_in.keyword) as grouped 
) { 
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight; 
    generate id, domains, keywords; 
    }; 

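As for your second question: no, this will make no difference to the final MR plan. It is purely a matter of how Pig parses the script; grouping the commands this way does not change the map-reduce logic.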
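
One way to check this is Pig's EXPLAIN command, which prints the logical, physical, and MapReduce plans for an alias without running the job. The sketch below reuses the two-statement version from the question; it assumes my_in and the BagGroup and CountEach UDFs are already defined in the session, since the thread does not show those definitions.

```pig
-- Assumption: my_in, BagGroup and CountEach are already defined in this
-- session, exactly as in the question (their definitions are not shown there).
my_out = foreach (group my_in by id) {
    grouped = BagGroup(my_in.(keyword,weight), my_in.keyword);
    generate
        group as id,
        CountEach(my_in.domain) as domains,
        grouped as grouped;
};
my_out1 = foreach my_out {
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
    generate id, domains, keywords;
};

-- Print the plans for the final alias; the MapReduce plan section lists the jobs.
EXPLAIN my_out1;
```

If a merged form ever does parse, running EXPLAIN on it and comparing the MapReduce plan sections should show whether the number of jobs differs.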

I get 'ERROR 1000: Error during parsing. Lexical error at line 25, column 0. Encountered: <EOF> after : ""' with your code – sds

Darn. Then you may be out of luck. But rest assured, splitting the statements up will not add any map-reduce jobs.