- I am using Hive 1.2.1000.2.4.2.0-258.
- The table has 4,850,000+ rows and 3 columns: GROUP_ID, A and B. 14511 rows have a value of A between 73 and 74.
- GROUP_ID is actually equal to 0 in every row.
- Nearly all values of A and B are integers.
Hive's percentile_approx function appears to be broken, doesn't it? I used the following script to compute summary statistics for the table:
select group_id, --group_id=0 a constant
percentile_approx(A , 0.5) as A_mdn,
percentile_approx(A , 0.25) as A_Q1,
percentile_approx(A , 0.75) as A_Q3,
percentile_approx(A , array(0.2,0.15, 0.1,0.05,0.025,0.001)) as A_i,
min(A) as min_A,
percentile_approx(B , 0.5) as B_mdn,
percentile_approx(B , 0.25) as B_Q1,
percentile_approx(B , 0.75) as B_Q3,
percentile_approx(B , array(0.8,0.85, 0.9, 0.95,0.975)) as B_i
from table
group by group_id;
The result I got is:
0
73.21058033222496
73.21058033222496
462.16968382794516
[73.21058033222496,73.21058033222496,73.21058033222496,73.21058033222496,73.21058033222496,73.21058033222496]
0.0
1.0
1.0
2.0
[2.0,3.0,4.0,8.11278644563614,17.0]
Then I changed the code as follows:
select group_id, --group_id=0 a constant
percentile(cast(A as bigint), 0.5) as A_mdn,
percentile(cast(A as bigint), 0.25) as A_Q1,
percentile(cast(A as bigint), 0.75) as A_Q3,
percentile(cast(A as bigint), array(0.2,0.15, 0.1,0.05,0.025,0.001)) as A_i,
min(A) as min_A,
percentile(cast(B as bigint), 0.5) as B_mdn,
percentile(cast(B as bigint), 0.25) as B_Q1,
percentile(cast(B as bigint), 0.75) as B_Q3,
percentile(cast(B as bigint), array(0.8,0.85, 0.9, 0.95,0.975)) as B_i
from table
group by group_id
The new result is:
0
72.0
6.0
762.0
[3.0,1.0,1.0,0.0,0.0,0.0]
0.0
1.0
1.0
2.0
[2.0,3.0,4.0,9.0,17.0]
To double-check the ground truth, I also loaded this table into R. Here are the R results:
A:
Min: 0
Q1: 6
Median: 72
Q3: 762
0.2 quantile: 3
0.15 quantile: 1.5
0.1 quantile: 1
0.05 quantile: 0
0.025 quantile: 0
0.001 quantile: 0
B:
Q1: 1
Median: 1
Q3: 2
0.8 quantile: 2
0.85 quantile: 3
0.9 quantile: 4
0.95 quantile: 9
0.975 quantile: 17
Clearly, the R results agree with the percentile function, but percentile_approx gives me wrong answers.
My point is that the percentile_approx result is simply wrong, not even "approximately" right. –
Try this: percentile_approx(A, 0.5, ) as A_mdn – AkashNegi
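Ignore the above comment. Try this: percentile_approx(A, 0.5, 5000000) as A_mdn. The '5000000' parameter should be greater than the number of distinct records in the dataset. You said you have 4.85m+ records, so try running the command above and see whether the results come out close. – AkashNegi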
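For reference, here is a minimal sketch of what that suggestion might look like applied to the full query. It assumes the optional third argument B of percentile_approx(col, p [, B]) as described in the Hive documentation: B controls approximation accuracy at the cost of memory, defaults to 10,000, and yields an exact percentile when the number of distinct values in the column is smaller than B. The table name my_table is a hypothetical stand-in for the real table, and 5000000 is simply the value proposed in the comment, not a tuned setting.

```sql
-- Sketch only: my_table is a placeholder name, 5000000 is the value from the comment above.
-- A larger B uses more memory per group but tightens the approximation.
select group_id,                                       -- group_id = 0, a constant
       percentile_approx(A, 0.5,  5000000) as A_mdn,
       percentile_approx(A, 0.25, 5000000) as A_Q1,
       percentile_approx(A, 0.75, 5000000) as A_Q3,
       percentile_approx(A, array(0.2, 0.15, 0.1, 0.05, 0.025, 0.001), 5000000) as A_i,
       min(A) as min_A,
       percentile_approx(B, 0.5,  5000000) as B_mdn,
       percentile_approx(B, 0.25, 5000000) as B_Q1,
       percentile_approx(B, 0.75, 5000000) as B_Q3,
       percentile_approx(B, array(0.8, 0.85, 0.9, 0.95, 0.975), 5000000) as B_i
from my_table
group by group_id;
```

If the results then line up with the percentile and R figures above, that would point to the default accuracy setting rather than a bug in the function itself.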