2012-09-04

How do I generate a limited number of tuples in Pig?

I have the following datasets:

A:

x1 y z1 
x2 y z2 
x3 y z3 
x43 y z33 
x4 y2 z4 
x5 y2 z5 
x6 y2 z6 
x7 y2 z7 

B:

y 12 
y2 25 

Load A: LOAD '$input' USING PigStorage() AS (k:chararray, m:chararray, n:chararray);
Load B: LOAD '$input2' USING PigStorage() AS (o:chararray, p:int);

I join A and B on m = o. What I want is to select only x tuples for each o. For example, if x is 2, the result is:

x1 y z1 
x2 y z2 
x4 y2 z4 
x5 y2 z5 

Answer

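To do this, you need GROUP BY and FOREACH with a nested LIMIT, and then a JOIN or COGROUP. Here is an implementation for Pig 0.10; I ran it on your input data and got the specified output: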

A = load '~/pig/data/subset_join_A.dat' as (k:chararray, m:chararray, n:chararray); 
B = load '~/pig/data/subset_join_B.dat' as (o:chararray, p:int); 
-- since the join is on m, keep only 2 rows per distinct value of m
group_A = group A by m; 
top_A_x = foreach group_A { 
    top = limit A 2; -- where x = 2 
    generate flatten(top); 
}; 

-- COGROUP is another way to do the join; it also allows left/right-join style checks
co_join = cogroup top_A_x by (m), B by (o); 
-- filter out records from A that are not in B 
filter_join = filter co_join by IsEmpty(B) == false; 
result = foreach filter_join generate flatten(top_A_x); 
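
Note that a nested LIMIT without an ordering does not promise which tuples of each group are kept. The following is a minimal sketch of one way to make the choice deterministic, assuming that "the first x by k" is what is wanted and that x is passed as a script parameter. The parameter name `x` and the sort on `k` are assumptions for illustration and are not part of the answer above; `input` and `input2` are the parameter names used in the question.

```pig
-- Sketch: keep the x smallest tuples (by k) for each value of m, then keep only keys present in B.
-- Example invocation: pig -param input=A.dat -param input2=B.dat -param x=2 script.pig
A = LOAD '$input' USING PigStorage() AS (k:chararray, m:chararray, n:chararray);
B = LOAD '$input2' USING PigStorage() AS (o:chararray, p:int);

group_A = GROUP A BY m;
top_A_x = FOREACH group_A {
    sorted = ORDER A BY k ASC;   -- k is a chararray, so the order is lexicographic
    top = LIMIT sorted $x;       -- parameter substitution replaces this with the literal value
    GENERATE FLATTEN(top);
};

co_join = COGROUP top_A_x BY m, B BY o;
filter_join = FILTER co_join BY NOT IsEmpty(B);   -- drop keys that have no match in B
result = FOREACH filter_join GENERATE FLATTEN(top_A_x);
DUMP result;
```

Because k is compared as a string, `x43` sorts before `x5`; if a numeric order is needed, the sort key would have to be derived differently.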

Or you can implement it with just a single COGROUP and a FOREACH with a nested LIMIT:

A = load '~/pig/data/subset_join_A.dat' as (k:chararray, m:chararray, n:chararray); 
B = load '~/pig/data/subset_join_B.dat' as (o:chararray, p:int); 

co_join = cogroup A by (m), B by (o); 
filter_join = filter co_join by IsEmpty(B) == false; 
result = foreach filter_join { 
    top = limit A 2; 
    -- you can limit B as well
    generate flatten(top); 
};
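
The answer mentions JOIN as an alternative, but both listings use COGROUP. As a rough sketch of the JOIN variant, assuming a plain inner join is acceptable (the aliases `joined` and `result` and the use of positional references are choices made here for illustration):

```pig
-- Sketch: limit A to 2 rows per value of m, then inner-join with B on m = o.
A = LOAD '~/pig/data/subset_join_A.dat' AS (k:chararray, m:chararray, n:chararray);
B = LOAD '~/pig/data/subset_join_B.dat' AS (o:chararray, p:int);

group_A = GROUP A BY m;
top_A_x = FOREACH group_A {
    top = LIMIT A 2;
    GENERATE FLATTEN(top);
};

-- An inner join drops rows of top_A_x whose m has no matching o in B.
joined = JOIN top_A_x BY m, B BY o;
-- Positional references avoid depending on the prefixed field names produced by FLATTEN and JOIN.
result = FOREACH joined GENERATE $0, $1, $2;
DUMP result;
```

One difference to keep in mind: if B contained the same o more than once, the inner join would repeat the matching rows of A, whereas the COGROUP with the IsEmpty filter emits each of them once.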