2015-05-05

Apache Pig: join followed by projection results in NULLs

The following code works as expected:

a = load 'data_a' using PigStorage('\t') as (a1, a2, a3); 
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
a_b = join a by a1, b by b1; --inner join 

When I inspect the fields, everything is correct.

However, as soon as I add a projection to the mix, it stops working.

a = load 'data_a' using PigStorage('\t') as (a1, a2, a3); 
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
a_b = join a by a1, b by b1; --inner join 
ab = foreach a_b generate a1 as a1, a2 as a2, b2 as b2; 

In ab, every value in the fields that come from b is NULL.

The same thing happens if I do this:

a = load 'data_a' using PigStorage('\t') as (a1, a2, a3); 
a2 = foreach a generate a1, a2; 
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
b2 = foreach b generate b1, b2; 
ab = join a2 by a1, b2 by b1; 

I use the following workaround, but I hate being stuck with the store/load round trip:

a = load 'data_a' using PigStorage('\t') as (a1, a2, a3); 
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
a_b = join a by a1, b by b1; --inner join 
store a_b into 'hdfs:///a_b_temp' using PigStorage('\t','-schema'); 
a_b2 = load 'hdfs:///a_b_temp' using PigStorage('\t'); 
ab = foreach a_b2 generate a1 as a1, a2 as a2, b2 as b2; 

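With this workaround, the fields in ab do not become NULL. However, if I then group and aggregate, I usually get this error:

ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Long

That error goes away if I skip the last projection.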

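I am new to Pig. Are there any known bugs or issues that could cause this? I have seen it happen several times with different datasets.

I am using Pig 0.12 on Amazon AWS EMR.

Thanks for your help!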

Answers

I tried your second approach; here is the code.

a = load '/user/root/pig/file1.txt' using PigStorage('\t') as (a1:int, a2:chararray, a3:chararray); 
b = load '/user/root/pig/file2.txt' using PigStorage('\t') as (b1:int, b2:chararray, b3:chararray); 

--inner join 
a_b = join a by a1, b by b1; 

--If your goal is to get selected fields from relation b based on the join condition.
--a::a1 means "field a1 of the records that came from relation a".
ab = foreach a_b generate a::a1, a2, b2; 

--If your goal is to get all fields of the matching records from both relations.
--ab = foreach a_b generate $0..; 

DUMP ab; 

Hope this helps.

Thanks for your reply. My understanding is that :: is only required when the two relations have duplicate field names. Is that not true? – TaylerJones

It is not required. After a JOIN, referring to fields by their plain names is still supported. For more details, see: http://pig.apache.org/docs/r0.9.1/basic.html#disambiguate –
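
To make the disambiguation point concrete, here is a minimal sketch of how the joined schema can be inspected and how fields can be referenced with the :: prefix. The declared field types (long, chararray) are assumptions for illustration only; the question does not state them, although the cast error it quotes suggests that a numeric type is expected somewhere downstream. The file names are the ones used in the question.

```pig
-- Sketch only: the field types below are assumptions, not taken from the question.
a = load 'data_a' using PigStorage('\t') as (a1:long, a2:chararray, a3:chararray);
b = load 'data_b' using PigStorage('\t') as (b1:long, b2:chararray, b3:chararray);

a_b = join a by a1, b by b1; -- inner join

-- Print the joined schema; fields should be listed with prefixed names such as a::a1 and b::b2.
describe a_b;

-- Fully qualified references, renamed back to short names for later steps.
ab = foreach a_b generate a::a1 as a1, a::a2 as a2, b::b2 as b2;

-- Check that the projected relation has the expected names and types.
describe ab;
```

Running describe after each step is a cheap way to see which names and types Pig has actually assigned before grouping or aggregating. Untyped fields default to bytearray, so declaring types at load time may also help with cast errors like the one quoted in the question, though that is not confirmed for this case.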