2012-10-12 58 views

Data normalization in a Pig script

I have the following dataset:

1,11,ab;cd;200 

2,22,pq;rs 

I want this as the output:

1,11,ab 

1,11,cd 

1,11,200 

2,22,pq 

2,22,rs 

How can this be done in Pig without using any UDF?

Answers


You can do something like this:

A = load '....' using PigStorage(',') as (x,y,data : chararray); 
SPLT = foreach A generate x, y, FLATTEN(STRSPLIT(data,';')); 
X_tmp = foreach SPLT generate $0 as x, $1 as y, FLATTEN(TOBAG($2..$20)) as term; -- pivots the row 
X = filter X_tmp by term is not null; -- removes the extra null rows left over when data was split into fewer than 20 terms

The assumption is that no data string splits into more than 20 elements. If yours can, increase the upper bound of the range.
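To make the data flow of this answer easier to follow, here is a minimal Python sketch of the same three steps: split on `;`, pad each row to a fixed width, then pivot and drop the null padding. It is not Pig code, only a model of the idea; the `normalize` function and the `MAX_TERMS` constant are hypothetical names introduced for illustration.

```python
from typing import Iterator, List, Optional, Tuple

# Hypothetical cap, standing in for the fixed-width column range in the Pig script.
MAX_TERMS = 20


def normalize(lines: List[str], max_terms: int = MAX_TERMS) -> Iterator[Tuple[str, str, str]]:
    """Yield one (x, y, term) row per ';'-separated term in the third field."""
    for line in lines:
        # Split into x, y and the raw data string (at most two commas consumed).
        x, y, data = line.strip().split(",", 2)
        terms: List[Optional[str]] = list(data.split(";"))
        # Pad with None up to a fixed width; anything beyond max_terms is
        # silently cut off, which is the limitation the answer warns about.
        padded = (terms + [None] * max_terms)[:max_terms]
        # Pivot: one output row per slot, skipping the None padding.
        for term in padded:
            if term is not None:
                yield (x, y, term)


rows = list(normalize(["1,11,ab;cd;200 ", "2,22,pq;rs "]))
for row in rows:
    print(",".join(row))
```

With the two sample rows from the question, this prints the five lines listed as the desired output. Note how a row with more terms than the cap would lose its tail without any error, which is why the cap has to be chosen above the longest expected string.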


Try this:

A = load 'data' using PigStorage(',') as (x,y,data:chararray);
SPLT = foreach A generate x, y, FLATTEN(STRSPLIT(data,';',3)) as (a,b,c);
grp = group SPLT by (x,y);
res = foreach grp generate group, FLATTEN(SPLT);
out = foreach res generate FLATTEN(group), FLATTEN(TOBAG(SPLT::a, SPLT::b, SPLT::c)) as val;
dump out;