2016-11-29 22 views
0
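  • I am using Hive 1.2.1000.2.4.2.0-258.
  • The table has 4850000+ rows and 3 columns: group_id, A and B. 14511 of the values in column A fall between 73 and 74.
  • group_id is in fact always equal to 0.
  • Almost all values of A and B are integers.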

Hive's percentile_approx function is broken, isn't it? I used the following script to compute summary statistics for a table:

select group_id, --group_id=0 a constant 
    percentile_approx(A , 0.5) as A_mdn, 
    percentile_approx(A , 0.25) as A_Q1, 
    percentile_approx(A , 0.75) as A_Q3, 
    percentile_approx(A , array(0.2,0.15, 0.1,0.05,0.025,0.001)) as A_i, 
    min(A) as min_A, 
    percentile_approx(B , 0.5) as B_mdn, 
    percentile_approx(B , 0.25) as B_Q1, 
    percentile_approx(B , 0.75) as B_Q3, 
    percentile_approx(B , array(0.8,0.85, 0.9, 0.95,0.975)) as B_i 
    from table 
    group by group_id; 

The result I got was:

0 
73.21058033222496 
73.21058033222496 
462.16968382794516 
[73.21058033222496,73.21058033222496,73.21058033222496,73.21058033222496,73.21058033222496,73.21058033222496] 
0.0 
1.0 
1.0 
2.0 
[2.0,3.0,4.0,8.11278644563614,17.0] 

Then I changed the code as follows:

select group_id, --group_id=0 a constant 
    percentile(cast(A as bigint), 0.5) as A_mdn, 
    percentile(cast(A as bigint), 0.25) as A_Q1, 
    percentile(cast(A as bigint), 0.75) as A_Q3, 
    percentile(cast(A as bigint), array(0.2,0.15, 0.1,0.05,0.025,0.001)) as A_i, 
    min(A) as min_A, 
    percentile(cast(B as bigint), 0.5) as B_mdn, 
    percentile(cast(B as bigint), 0.25) as B_Q1, 
    percentile(cast(B as bigint), 0.75) as B_Q3, 
    percentile(cast(B as bigint), array(0.8,0.85, 0.9, 0.95,0.975)) as B_i 
    from table 
    group by group_id 

The new result is:

0 
72.0  
6.0 
762.0  
[3.0,1.0,1.0,0.0,0.0,0.0] 
0.0 
1.0 
1.0 
2.0 
[2.0,3.0,4.0,9.0,17.0] 

To double-check the ground truth, I also loaded this table into R. Here are the R results:

A:
Min: 0
Q1: 6
Median: 72
Q3: 762
0.2 quantile: 3
0.15 quantile: 1.5
0.1 quantile: 1
0.05 quantile: 0
0.025 quantile: 0
0.001 quantile: 0

B:
Q1: 1
Median: 1
Q3: 2
0.8 quantile: 2
0.85 quantile: 3
0.9 quantile: 4
0.95 quantile: 9
0.975 quantile: 17

Clearly, the R results agree with the percentile function, but percentile_approx gives me wrong answers.

Answer

0

This function returns an exact value if "all" of the values are integers. You said that almost all of A and B are integers.

Try casting the whole of column A to int and see whether you get closer to the answer.
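A minimal sketch of that check, reusing the query shape from the question. `my_table` is a placeholder table name, and the `_approx`/`_exact` aliases are made up for illustration:

```sql
-- Sketch only: my_table is a placeholder for the real table name.
-- Cast A to an integer type before approximating, and put the exact
-- percentile() result next to it for comparison.
select group_id,
    percentile_approx(cast(A as bigint), 0.5) as A_mdn_approx,
    percentile(cast(A as bigint), 0.5) as A_mdn_exact
from my_table
group by group_id;
```

If the two columns still differ by much, the non-integer values are probably not the cause.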

I don't think you will get exactly the same answer as R, because R's percentile function most likely handles non-integer values as well.

One way to get the exact answer is to write your own UDF and use it.

Hope this helps!

+0

My point is that the result of percentile_approx is wrong, not even "approximately" right. –

+0

Try this: percentile_approx(A,0.5,)as A_mdn – AkashNegi

+0

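Ignore the comment above. Try this: percentile_approx(A, 0.5, 5000000) as A_mdn. The third parameter, 5000000, is optional and should be greater than the number of distinct records in the dataset. You said you have 4.85m+ records, so try running the command above and see whether the result is close. – AkashNegi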
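A sketch of how that suggestion could look in the original query. According to the Hive documentation, percentile_approx takes an optional last argument B (default 10,000) that trades memory for accuracy, and the result is exact when the number of distinct values in the column is smaller than B. `my_table` is a placeholder table name, and 5000000 is simply the value from the comment, not a tuned setting:

```sql
-- Sketch only: my_table is a placeholder for the real table name.
-- The last argument (B) sets the approximation accuracy; the default is 10000.
select group_id,
    percentile_approx(A, 0.5, 5000000) as A_mdn,
    percentile_approx(A, 0.25, 5000000) as A_Q1,
    percentile_approx(A, 0.75, 5000000) as A_Q3,
    percentile_approx(A, array(0.2, 0.15, 0.1, 0.05, 0.025, 0.001), 5000000) as A_i
from my_table
group by group_id;
```

A larger B uses more memory per group, so a smaller value may be enough as long as it still exceeds the number of distinct values in A.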