2017-03-06 59 views

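Impala analytic function in WHERE clause

The basic premise of the question is that I have some huge tables in Hadoop and I need to take some samples from each month. I've mocked up the sort of thing I'm after below, but obviously it isn't real data...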

--Create the table 
CREATE TABLE exp_dqss_team.testranking (
    Name STRING, 
    Age INT, 
    Favourite_Cheese STRING 
) STORED AS PARQUET; 

--Put some data in 
INSERT INTO TABLE exp_dqss_team.testranking 
VALUES (
    ('Tim', 33, 'Cheddar'), 
    ('Martin', 49, 'Gorgonzola'), 
    ('Will', 39, 'Brie'), 
    ('Bob', 63, 'Cheddar'), 
    ('Bill', 35, 'Brie'), 
    ('Ben', 42, 'Gorgonzola'), 
    ('Duncan', 55, 'Brie'), 
    ('Dudley', 28, 'Cheddar'), 
    ('Edmund', 27, 'Brie'), 
    ('Baldrick', 29, 'Gorgonzola')); 

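What I want to get is something like the 2 youngest people in each category of cheese. The query below ranks people by age within each cheese category, but does not limit the result to the top two: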

SELECT RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) AS rank_my_cheese, favourite_cheese, name, age 
FROM exp_dqss_team.testranking; 

If I add a WHERE clause, it gives me the following error:

WHERE clause must not contain analytic expressions

SELECT RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) AS rank_my_cheese, favourite_cheese, name, age 
FROM exp_dqss_team.testranking 
WHERE RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) <3; 

Is there a better way to do this than creating a table of all the rankings and then selecting from it with a WHERE clause on the rank?

Answer


Can you try this? It computes the rank in an inline view, so the outer query can filter on the alias:

select * from (
SELECT RANK() OVER(PARTITION BY favourite_cheese ORDER BY age asc) AS rank_my_cheese, favourite_cheese, name, age 
FROM exp_dqss_team.testranking 
) as temp 
where rank_my_cheese <= 2; 
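
One thing to watch with the query above: RANK() gives tied ages the same rank, so `rank_my_cheese <= 2` can return more than two rows for a cheese if ages tie. If you need at most two rows per cheese, a possible variant is sketched below. It is only a sketch: it swaps in ROW_NUMBER() for RANK(), writes the inline view as a WITH clause purely for readability, and breaking ties on `name` is my assumption, not something the question asks for. The names `ranked` and `row_in_cheese` are made up for the example.

```sql
-- Sketch: at most two rows per cheese, even when ages tie.
-- ROW_NUMBER() numbers rows 1, 2, 3, ... within each partition with no repeats,
-- unlike RANK(), which repeats a rank for tied rows.
WITH ranked AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY favourite_cheese
               ORDER BY age ASC, name ASC  -- "name" as tie-breaker is an assumption
           ) AS row_in_cheese,
           favourite_cheese,
           name,
           age
    FROM exp_dqss_team.testranking
)
SELECT favourite_cheese, name, age
FROM ranked
WHERE row_in_cheese <= 2;
```

With the sample data from the question there are no tied ages, so this should pick the same people as the RANK() version: Edmund and Bill for Brie, Dudley and Tim for Cheddar, Baldrick and Ben for Gorgonzola. Row order is not guaranteed unless you add an ORDER BY to the outer query.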

Thanks, yes, that works. I think I may have been overthinking it!