How to determine file size in HDFS using Hive

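The workspace I am using is set up with Hive 1.1.0 and CDH 5.5.4. I run a query that writes its results into 22 partitions. There is always a single file in each partition directory, and its size can range from 20 MB to 700 MB.

From what I understand, this is related to the number of reducers used while the query runs. Let's suppose I want 5 files per partition instead of 1. I use this command: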

set mapreduce.job.reduces=5;

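This makes the system use 5 reduce tasks in stage 1, but it automatically switches back to 1 reducer in stage 2 (determined automatically at compile time). From what I have read, this is because the compiler takes precedence over the configuration when choosing the number of reducers. It seems that some tasks cannot be parallelized and can only be done by a single process or reduce task, so the system sets the number on its own.

Code: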

insert into table core.pae_ind1 partition (project,ut,year,month)
select ts, date_time,
  -- m1
  -- open: gap from a code 0 row to the next code 1 row (within 1000 rows), kept only if <= 15
  if(code_ac_dcu_m1_d1=0
     and (min(case when code_ac_dcu_m1_d1=1 then ts end) over (partition by ut order by ts rows between 1 following and 1000 following)-ts) <= 15,
     min(case when code_ac_dcu_m1_d1=1 then ts end) over (partition by ut order by ts rows between 1 following and 1000 following)-ts,
     NULL) as t_open_dcu_m1_d1,

  -- close: gap from a code 2 row to the next code 3 row (within 1000 rows), kept only if <= 15
  if(code_ac_dcu_m1_d1=2
     and (min(case when code_ac_dcu_m1_d1=3 then ts end) over (partition by ut order by ts rows between 1 following and 1000 following)-ts) <= 15,
     min(case when code_ac_dcu_m1_d1=3 then ts end) over (partition by ut order by ts rows between 1 following and 1000 following)-ts,
     NULL) as t_close_dcu_m1_d1,
  project,ut,year,month

from core.pae_open_close
where ut='902'
order by ut,ts

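This leaves me with huge files at the end. I would like to know whether there is a way to split these result files into smaller ones (preferably by limiting their size).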

2017-07-27 LSG

'order by ut,ts'? –

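As @DuduMarkovitz pointed out, your code contains an instruction to sort the dataset globally. That runs on a single reducer. You are better off ordering when you select from the table. Even if the files are in order after such an insert and they are splittable, they will be read by many mappers, so the result will come back out of order because of parallelism and you will need to sort it anyway. Just get rid of the order by ut,ts in the insert and use these configuration settings to control the number of reducers:

set hive.exec.reducers.bytes.per.reducer=67108864; 
set hive.exec.reducers.max = 2000; --default 1009

The number of reducers is determined according to:

mapred.reduce.tasks (mapreduce.job.reduces in newer Hadoop releases) - The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what the number of reducers should be.

hive.exec.reducers.bytes.per.reducer - The size of input per reducer. The default is 1 GB before Hive 0.14.0 and 256 MB from 0.14.0 onward.

Also hive.exec.reducers.max - The maximum number of reducers that will be used. If mapred.reduce.tasks is negative, Hive uses this as the upper bound when it determines the number of reducers automatically.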

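So, if you want to increase reducer parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer. Each reducer will create one file for each partition it receives data for (roughly no bigger than hive.exec.reducers.bytes.per.reducer). One reducer may receive data for many partitions and, as a result, create many small files, one in each of those partitions. This is because during the shuffle phase the data of a partition gets distributed among many reducers.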
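To see how many files each partition actually ended up with, and how large they are, one option is to inspect the table directory from inside the Hive session. This is a minimal sketch: the warehouse path is an assumption (the usual default layout for a managed table in database core), so substitute the Location that describe formatted reports on your cluster.

```sql
-- Find the table's HDFS directory (see the "Location:" row in the output).
describe formatted core.pae_ind1;

-- List the partitions that exist after the insert.
show partitions core.pae_ind1;

-- ASSUMPTION: the path below is a placeholder for the Location reported above.
-- Total size per top-level partition directory, human readable.
dfs -du -h /user/hive/warehouse/core.db/pae_ind1;

-- Every file under the table directory, with its size in bytes.
dfs -ls -R /user/hive/warehouse/core.db/pae_ind1;
```

Running this before and after changing the reducer settings shows directly whether the file count and sizes moved in the expected direction.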

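If you do not want each reducer to write to every partition (or to too many of them), then distribute by the partition key (instead of order by). In that case the number of files in a partition will be more like partition_size/hive.exec.reducers.bytes.per.reducer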
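The insert could then look roughly like the sketch below. It is not the original query: to keep it short it assumes the windowed columns have already been written to a hypothetical staging table, core.pae_ind1_stage, and the bucket count of 5 is only an illustration of the "5 files per partition" goal from the question. Since rows with the same distribute by key go to the same reducer, the extra bucket expression is one way to spread a partition over several reducers; note that rand() is not deterministic, which can matter if tasks are retried.

```sql
-- Dynamic partitioning is needed because all four partition columns come from the select.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Let Hive estimate the reducer count from the input size.
set mapred.reduce.tasks=-1;
set hive.exec.reducers.bytes.per.reducer=67108864;  -- 64 MB of input per reducer
set hive.exec.reducers.max=2000;

insert into table core.pae_ind1 partition (project, ut, year, month)
select ts, date_time, t_open_dcu_m1_d1, t_close_dcu_m1_d1,
       project, ut, year, month
from core.pae_ind1_stage   -- ASSUMPTION: hypothetical staging table holding the computed columns
-- No order by here. Rows of one partition are split into at most 5 groups,
-- so each partition should end up with at most 5 files.
distribute by project, ut, year, month, floor(rand() * 5);
```

Dropping the floor(rand() * 5) term sends all rows of a partition to one reducer, which is the simplest way to avoid many small files per partition.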

2017-07-27 12:32:42 leftjoin

See the update about 'distribute by' – leftjoin
