2017-05-16 58 views

Movie dataset analysis using Pig

I have the following datasets from a movie database:

Ratings: UserID, MovieID, Rating
Movies: MovieID, Title
Users: UserID, Gender, Age

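Now I have to join the three datasets above and find which movie is rated highest among women and lowest among men, and vice versa. I have already done the JOINs: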

myusers = LOAD '/user/cloudera/movies/input/users.dat' 
    USING PigStorage(':') 
    AS (user:int, n1, gender:chararray, n2, age:int); 

ratings = LOAD '/user/cloudera/movies/input/ratings.dat' 
    USING PigStorage(':') 
    AS (user:int, n1, movie:int, n2, rating:int); 

movies = LOAD '/user/cloudera/movies/input/movies.dat' 
    USING PigStorage(':') 
    AS (movie:int,n1,title:chararray); 

data = JOIN ratings BY user, myusers BY user; 
data2= JOIN data BY ratings::movie, movies BY movie; 

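But after all this I ran into several problems, such as "ERROR 0: Scalar has more than one row in the output", when I try to print columns from data2. Any ideas to help me complete this task?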

Answer


After the following step

data = JOIN ratings BY user, myusers BY user; 

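use gender as a filter to build two datasets, one for men and one for women. Then order each dataset by rating and take the top row in each direction to get the maximum and minimum for each: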

male = FILTER data by gender == 'M'; -- Use the gender value for male 
female = FILTER data by gender == 'F'; 
m_max = LIMIT (ORDER male by rating DESC) 1; 
f_max = LIMIT (ORDER female by rating DESC) 1; 
m_min = LIMIT (ORDER male by rating ASC) 1; 
f_min = LIMIT (ORDER female by rating ASC) 1;
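
Note that the statements above return a single highest or lowest rating row, not a per-movie result. If the goal is a per-movie comparison, one possible approach is to average the ratings per movie and gender and compare the two averages. The sketch below is untested and assumes `data2` from the question's two JOINs already exists; the aliases `slim`, `grouped`, `avgs`, `f_avg`, `m_avg`, `both`, `gap`, `ordered` and `top_f` are made up for illustration, and reading "highest among women, lowest among men" as the largest difference between the two averages is an assumption. As for the error: the question does not show the failing statement, but one common cause of "Scalar has more than one row in the output" is writing `data2.title` inside a FOREACH over a different relation, which Pig treats as a scalar lookup. Fields of a joined relation are referenced by name, with the `::` prefix only where a name is ambiguous.

```pig
-- Sketch only: assumes data2 from the question's two JOINs is already defined.
-- title, gender and rating each occur in only one input, so no prefix is needed;
-- an ambiguous field such as movie would be written movies::movie.
slim = FOREACH data2 GENERATE title, gender, rating;

-- Average rating per (title, gender).
grouped = GROUP slim BY (title, gender);
avgs = FOREACH grouped GENERATE
           FLATTEN(group) AS (title, gender),
           AVG(slim.rating) AS avg_rating;

-- Put the female and male averages for each title side by side.
f_avg = FILTER avgs BY gender == 'F';
m_avg = FILTER avgs BY gender == 'M';
both = JOIN f_avg BY title, m_avg BY title;
gap = FOREACH both GENERATE
          f_avg::title AS title,
          f_avg::avg_rating - m_avg::avg_rating AS f_minus_m;

-- Largest positive gap: rated high by women relative to men.
-- Ordering ASC instead gives the reverse case.
ordered = ORDER gap BY f_minus_m DESC;
top_f = LIMIT ordered 1;
DUMP top_f;
```

This does not filter out movies with only a handful of ratings, so a title rated once by each gender can end up on top; adding a COUNT per group and a minimum threshold would be a reasonable refinement.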