Hive: merging small ORC files

My input consists of a large number of small ORC files. At the end of each day I want to consolidate them and split the data into 100 MB blocks.

My input and output are both on S3, and the environment I am using is EMR.

These are the Hive parameters I am setting:

set hive.msck.path.validation=ignore; 
set hive.exec.reducers.bytes.per.reducer=256000000; 
SET hive.exec.dynamic.partition = true; 
SET hive.exec.dynamic.partition.mode = nonstrict; 
SET hive.mapred.mode = nonstrict; 

set hive.merge.mapredfiles=true; 
set hive.merge.mapfile=true ; 

set hive.exec.parallel = true; 
set hive.exec.parallel.thread.number = 8; 

SET hive.exec.stagingdir=/tmp/hive/  ; 
SET hive.exec.scratchdir=/tmp/hive/ ; 

set mapred.max.split.size=68157440; 
set mapred.min.split.size=68157440; 
set hive.merge.smallfiles.avgsize=104857600; 
set hive.merge.size.per.task=104857600; 
set mapred.reduce.tasks=10;

My insert statement:

insert into table dev.orc_convert_zzz_18 partition(event_type) select * from dev.events_part_input_18 where event_type = 'ScreenLoad' distribute by event_type;

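The problem is that I have about 80 input files totalling 500 MB, and after this insert statement I was expecting 4 files in S3. Instead, all of the files are merged into a single file, which is not the desired output.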

Can someone please let me know what is going wrong?

Source

2017-10-28 Rajiv Chodisetti

The `mapred` properties are all deprecated.

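@cricket_007 Oh OK, thanks, I will check. I just figured out the answer: we can use clustering to further split a partition into multiple parts. I am exploring Hive here because my Spark output has too many small files, and if I expose those small files to end users through Presto, queries over them will be slower. https://community.hortonworks.com/content/supportkb/49637/hive-bucketing-and-partitioning.html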
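A minimal sketch of the clustering (bucketing) approach mentioned in the comment above. The table `dev.events_bucketed` and the columns `user_id` and `payload` are hypothetical and only illustrate the shape of the DDL; they are not taken from the original tables.

```sql
-- Hypothetical table and columns, for illustration only.
-- Each event_type partition directory is split into 4 bucket files,
-- with rows assigned to a bucket by the hash of user_id.
create table dev.events_bucketed (
  user_id string,
  payload string
)
partitioned by (event_type string)
clustered by (user_id) into 4 buckets
stored as orc;

-- On Hive 1.x, run "set hive.enforce.bucketing=true;" before the insert.
-- From Hive 2.0 on, bucketing is always enforced and that property is gone.
insert into table dev.events_bucketed partition(event_type)
select user_id, payload, event_type
from dev.events_part_input_18
where event_type = 'ScreenLoad';
```

With a fixed bucket count the number of files per partition stays the same from day to day, but individual file sizes then depend on how much data each partition receives.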

You should use `coalesce` or `repartition` in Spark to fix your small-files problem.

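You are mixing two different concepts that control the output files: `partition`, which sets the directories, and `distribute by`, which sets the files within each directory.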

If you just want 4 files in each directory, you can distribute by a random number, for example:

insert into table dev.orc_convert_zzz_18 partition(event_type) 
select * from dev.events_part_input_18 
where event_type = 'ScreenLoad' distribute by Cast((FLOOR(RAND()*4.0)) as INT);

However, I would recommend distributing by a column in your data that you are likely to query on. It can improve your query times.
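As a rough sketch of that suggestion, the statement below distributes by a hash of a column instead of a random number and then counts rows per output file. The column `user_id` is an assumption made for illustration; substitute a column that actually exists in `dev.events_part_input_18`. `hash`, `pmod`, and the `INPUT__FILE__NAME` virtual column are Hive built-ins.

```sql
-- Assumption: user_id is a hypothetical column, not taken from the question.
-- pmod(hash(...), 4) yields a value in 0..3, so at most 4 reducers receive rows.
insert into table dev.orc_convert_zzz_18 partition(event_type)
select * from dev.events_part_input_18
where event_type = 'ScreenLoad'
distribute by pmod(hash(user_id), 4);

-- Check how many files the partition ended up with and how rows are spread.
select INPUT__FILE__NAME, count(*) as row_count
from dev.orc_convert_zzz_18
where event_type = 'ScreenLoad'
group by INPUT__FILE__NAME;
```

This should give at most four files only if at least four reducers run. The `hive.merge.*` settings from the question may still combine the outputs afterwards if their average size falls below `hive.merge.smallfiles.avgsize`.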

You can read more about it here.

Source

2017-10-29 03:24:44 lev

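Hi @lev, I tried this but I am getting 30 partitions. Any idea how to control that? I tried setting the number of reducers to 10, thinking it would result in 10 files, but I am still getting 30.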

You are right, `rand` returns a double between 0 and 1. I fixed the answer. – lev

I tried this as well, but it did not work. Please find the screenshot here; I am not sure what I am doing wrong: https://ibb.co/eFqorR
