
Question - Why is the hive_staging file missing on AWS EMR?

I am running one query on AWS EMR. It fails with the following exception -

java.io.FileNotFoundException: File s3://xxx/yyy/internal_test_automation/2016/09/17/17156/data/feed/commerce_feed_redshift_dedup/.hive-staging_hive_2016-09-17_10-24-20_998_2833938482542362802-639 does not exist. 

I have included all the relevant information about this problem below. Please take a look.

Query -

INSERT OVERWRITE TABLE base_performance_order_dedup_20160917 
SELECT 
* 
FROM 
(
select 
commerce_feed_redshift_dedup.sku AS sku, 
commerce_feed_redshift_dedup.revenue AS revenue, 
commerce_feed_redshift_dedup.orders AS orders, 
commerce_feed_redshift_dedup.units AS units, 
commerce_feed_redshift_dedup.feed_date AS feed_date 
from commerce_feed_redshift_dedup 
) tb 

Exception -

ERROR Error while executing queries 
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1474097800415_0311_2_00, diagnostics=[Vertex vertex_1474097800415_0311_2_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: commerce_feed_redshift_dedup initializer failed, vertex=vertex_1474097800415_0311_2_00 [Map 1], java.io.FileNotFoundException: File s3://xxx/yyy/internal_test_automation/2016/09/17/17156/data/feed/commerce_feed_redshift_dedup/.hive-staging_hive_2016-09-17_10-24-20_998_2833938482542362802-639 does not exist. 
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:987) 
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:929) 
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:339) 
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1530) 
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537) 
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1556) 
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1601) 
    at org.apache.hadoop.fs.FileSystem$4.(FileSystem.java:1778) 
    at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1777) 
    at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1755) 
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:239) 
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) 
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:281) 
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:363) 
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:486) 
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:200) 
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278) 
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:422) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) 
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269) 
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253) 
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
]Vertex killed, vertexName=Reducer 2, vertexId=vertex_1474097800415_0311_2_01, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1474097800415_0311_2_01 [Reducer 2] killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1 
    at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:348) 
    at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:251) 
    at com.XXX.YYY.executors.HiveQueryExecutor.executeQueriesInternal(HiveQueryExecutor.java:234) 
    at com.XXX.YYY.executors.HiveQueryExecutor.executeQueriesMetricsEnabled(HiveQueryExecutor.java:184) 
    at com.XXX.YYY.azkaban.jobexecutors.impl.AzkabanHiveQueryExecutor.run(AzkabanHiveQueryExecutor.java:68) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:606) 
    at azkaban.jobtype.JavaJobRunnerMain.runMethod(JavaJobRunnerMain.java:192) 
    at azkaban.jobtype.JavaJobRunnerMain.(JavaJobRunnerMain.java:132) 
    at azkaban.jobtype.JavaJobRunnerMain.main(JavaJobRunnerMain.java:76) 

Hive configuration properties that I set before executing the above query -

set hivevar:hive.mapjoin.smalltable.filesize=2000000000 
set hivevar:mapreduce.map.speculative=false 
set hivevar:mapreduce.output.fileoutputformat.compress=true 
set hivevar:hive.exec.compress.output=true 
set hivevar:mapreduce.task.timeout=6000000 
set hivevar:hive.optimize.bucketmapjoin.sortedmerge=true 
set hivevar:io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec 
set hivevar:hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat 
set hivevar:hive.auto.convert.sortmerge.join.noconditionaltask=false 
set hivevar:FEED_DATE=20160917 
set hivevar:hive.optimize.bucketmapjoin=true 
set hivevar:hive.exec.compress.intermediate=true 
set hivevar:hive.enforce.bucketmapjoin=true 
set hivevar:mapred.output.compress=true 
set hivevar:mapreduce.map.output.compress=true 
set hivevar:hive.auto.convert.sortmerge.join=false 
set hivevar:hive.auto.convert.join=false 
set hivevar:mapreduce.reduce.speculative=false 
set hivevar:[email protected] 
set hivevar:mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec 
set hive.mapjoin.smalltable.filesize=2000000000 
set mapreduce.map.speculative=false 
set mapreduce.output.fileoutputformat.compress=true 
set hive.exec.compress.output=true 
set mapreduce.task.timeout=6000000 
set hive.optimize.bucketmapjoin.sortedmerge=true 
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec 
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat 
set hive.auto.convert.sortmerge.join.noconditionaltask=false 
set FEED_DATE=20160917 
set hive.optimize.bucketmapjoin=true 
set hive.exec.compress.intermediate=true 
set hive.enforce.bucketmapjoin=true 
set mapred.output.compress=true 
set mapreduce.map.output.compress=true 
set hive.auto.convert.sortmerge.join=false 
set hive.auto.convert.join=false 
set mapreduce.reduce.speculative=false 
set [email protected] 
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec 

/etc/hive/conf/hive-site.xml

<configuration> 

<!-- Hive Configuration can either be stored in this file or in the hadoop configuration files --> 
<!-- that are implied by Hadoop setup variables.            --> 
<!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive --> 
<!-- users do not have to edit hadoop configuration files (that may be managed as a centralized --> 
<!-- resource).                     --> 

<!-- Hive Execution Parameters --> 


<property> 
    <name>hbase.zookeeper.quorum</name> 
    <value>ip-172-30-2-16.us-west-2.compute.internal</value> 
    <description>http://wiki.apache.org/hadoop/Hive/HBaseIntegration</description> 
</property> 

<property> 
    <name>hive.execution.engine</name> 
    <value>tez</value> 
</property> 

    <property> 
    <name>fs.defaultFS</name> 
    <value>hdfs://ip-172-30-2-16.us-west-2.compute.internal:8020</value> 
    </property> 


    <property> 
    <name>hive.metastore.uris</name> 
    <value>thrift://ip-172-30-2-16.us-west-2.compute.internal:9083</value> 
    <description>JDBC connect string for a JDBC metastore</description> 
    </property> 

    <property> 
    <name>javax.jdo.option.ConnectionURL</name> 
    <value>jdbc:mysql://ip-172-30-2-16.us-west-2.compute.internal:3306/hive?createDatabaseIfNotExist=true</value> 
    <description>username to use against metastore database</description> 
    </property> 

    <property> 
    <name>javax.jdo.option.ConnectionDriverName</name> 
    <value>org.mariadb.jdbc.Driver</value> 
    <description>username to use against metastore database</description> 
    </property> 

    <property> 
    <name>javax.jdo.option.ConnectionUserName</name> 
    <value>hive</value> 
    <description>username to use against metastore database</description> 
    </property> 

    <property> 
    <name>javax.jdo.option.ConnectionPassword</name> 
    <value>mrN949zY9P2riCeY</value> 
    <description>password to use against metastore database</description> 
    </property> 

    <property> 
    <name>datanucleus.fixedDatastore</name> 
    <value>true</value> 
    </property> 

    <property> 
    <name>mapred.reduce.tasks</name> 
    <value>-1</value> 
    </property> 

    <property> 
    <name>mapred.max.split.size</name> 
    <value>256000000</value> 
    </property> 

    <property> 
    <name>hive.metastore.connect.retries</name> 
    <value>15</value> 
    </property> 

    <property> 
    <name>hive.optimize.sort.dynamic.partition</name> 
    <value>true</value> 
    </property> 

    <property> 
    <name>hive.async.log.enabled</name> 
    <value>false</value> 
    </property> 

</configuration> 

/etc/tez/conf/tez-site.xml

<configuration> 
    <property> 
    <name>tez.lib.uris</name> 
    <value>hdfs:///apps/tez/tez.tar.gz</value> 
    </property> 

    <property> 
    <name>tez.use.cluster.hadoop-libs</name> 
    <value>true</value> 
    </property> 

    <property> 
    <name>tez.am.grouping.max-size</name> 
    <value>134217728</value> 
    </property> 

    <property> 
    <name>tez.runtime.intermediate-output.should-compress</name> 
    <value>true</value> 
    </property> 

    <property> 
    <name>tez.runtime.intermediate-input.is-compressed</name> 
    <value>true</value> 
    </property> 

    <property> 
    <name>tez.runtime.intermediate-output.compress.codec</name> 
    <value>org.apache.hadoop.io.compress.LzoCodec</value> 
    </property> 

    <property> 
    <name>tez.runtime.intermediate-input.compress.codec</name> 
    <value>org.apache.hadoop.io.compress.LzoCodec</value> 
    </property> 

    <property> 
    <name>tez.history.logging.service.class</name> 
    <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value> 
    </property> 

    <property> 
    <name>tez.tez-ui.history-url.base</name> 
    <value>http://ip-172-30-2-16.us-west-2.compute.internal:8080/tez-ui/</value> 
    </property> 
</configuration> 

Questions -

  1. Which process deleted this file? This file should only be there for Hive's own use. (Also, this file is not created by the application code.)
  2. When I re-run the failed query a few times, it passes. Why this ambiguous behavior?
  3. I have only just upgraded the hive-exec and hive-jdbc versions to 2.1.0, so it looks like some configuration property is set incorrectly or some property is missing. Can you help me find the wrongly set / missing Hive properties?

Note - I upgraded the hive-exec version from 0.13.0 to 2.1.0. With the previous version, all queries worked fine.

Update-1

When I launched another cluster, it worked fine. I tested the same ETL three times on it.

When I did the same thing again on a new cluster, it showed the same exception. I cannot understand why this ambiguity is happening.

Help me understand this ambiguous behavior.

I am fairly new to Hive, so I have little conceptual insight into this.

Update-2

HDFS logs from the cluster public DNS name:50070 -

2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy (IPC Server handler 11 on 8020): Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy 
2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy (IPC Server handler 11 on 8020): Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=1, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}) 
2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy (IPC Server handler 11 on 8020): Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]} 
2016-09-20 11:31:55,155 INFO org.apache.hadoop.ipc.Server (IPC Server handler 11 on 8020): IPC Server handler 11 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 172.30.2.207:56462 Call#7497 Retry#0 
java.io.IOException: File /user/hive/warehouse/bc_kmart_3813.db/dp_internal_temp_full_load_offer_flexibility_20160920/.hive-staging_hive_2016-09-20_11-17-51_558_1222354063413369813-58/_task_tmp.-ext-10000/_tmp.000079_0 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation. 
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1547) 
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107) 
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031) 
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:724) 
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492) 
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) 
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) 
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) 
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) 
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:422) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) 
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

When I searched for this exception, I found this page - https://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo

In my cluster, there is one datanode with 32 GB of disk space.

/etc/hive/conf/hive-default.xml.template -

<property> 
    <name>hive.exec.stagingdir</name> 
    <value>.hive-staging</value> 
    <description>Directory name that will be created inside table locations in order to support HDFS encryption. This is replaces ${hive.exec.scratchdir} for query results with the exception of read-only tables. In all cases ${hive.exec.scratchdir} is still used for other temporary files, such as job plans.</description> 
    </property> 

Question -

  1. As per the logs (/var/log/hadoop-hdfs/hadoop-hdfs-datanode-ip-172-30-2-189.log), the hive-staging folder is created on the cluster machine, so why is the same folder also created in S3?

Update-3

Another type of exception - LeaseExpiredException -

2016-09-21 08:53:17,995 INFO org.apache.hadoop.ipc.Server (IPC Server handler 13 on 8020): IPC Server handler 13 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from 172.30.2.189:42958 Call#726 Retry#0: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive/hadoop/_tez_session_dir/6ebd2d18-f5b9-4176-ab8f-d6c78124b636/.tez/application_1474442135017_0022/recovery/1/summary (inode 20326): File does not exist. Holder DFSClient_NONMAPREDUCE_1375788009_1 does not have any open files.

+0

I have explained the LeaseExpiredException problem in detail here - stackoverflow.com/questions/39619130/... Please help if anyone has faced a similar problem. I have spent many days trying to find the root cause, but could not find the answer. – devsda

Answer


I solved the problem. Let me explain in detail.

The exceptions encountered -

  1. LeaseExpiredException - from the HDFS side.
  2. FileNotFoundException - from the Hive side (when the Tez execution engine executes the DAG).

Problem scenario -

  1. We had just upgraded the Hive version from 0.13.0 to 2.1.0. With the previous version, everything worked fine - zero runtime exceptions.

Different ideas considered to solve the problem -

  1. The first thought was that two threads were working on the same piece of data because of NN (NameNode) intelligence, i.e. speculative execution. But with the settings below -

    set mapreduce.map.speculative=false
    set mapreduce.reduce.speculative=false

    that is not possible.

  2. Then I increased the count from 1000 to 100000 with the settings below -

    SET hive.exec.max.dynamic.partitions=100000;
    SET hive.exec.max.dynamic.partitions.pernode=100000;

    That did not work either.

  3. The third idea was that, within the same process, whatever mapper-1 created was being deleted by another mapper/reducer. But we did not find any such evidence in the HiveServer2 or Tez logs.

  4. Finally, the root cause lies in the application-layer code itself. In hive-exec version 2.1.0, a new configuration property was introduced:

    "hive.exec.stagingdir" : ".hive-staging"

    Description of the above property -

    Directory name that will be created inside table locations in order to support HDFS encryption. This replaces ${hive.exec.scratchdir} for query results with the exception of read-only tables. In all cases ${hive.exec.scratchdir} is still used for other temporary files, such as job plans.

    So, if there is any concurrency in the application-layer code (ETL), with jobs performing operations (rename/delete/move) on the same table, it can lead to this problem.

    In our case, two concurrent jobs were running an "INSERT OVERWRITE" on the same table, which caused one mapper's metadata (staging) file to be deleted, and that is what caused this issue (a hypothetical sketch of this race follows).
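
    To make the failure mode concrete, here is a hypothetical sketch of the race; the table and path names below are placeholders for illustration, not our exact ETL jobs.

    -- Job A (one ETL run):
    INSERT OVERWRITE TABLE some_feed_table
    SELECT sku, revenue, orders, units, feed_date FROM some_source_table;
    -- With hive.exec.stagingdir=.hive-staging, Hive stages its output inside the table location, e.g.
    --   s3://bucket/path/some_feed_table/.hive-staging_hive_<timestamp-A>_<id>/

    -- Job B (a second ETL run on the same target table, started while Job A is still running):
    INSERT OVERWRITE TABLE some_feed_table
    SELECT sku, revenue, orders, units, feed_date FROM some_source_table;
    -- It creates its own .hive-staging_hive_<timestamp-B>_<id>/ directory in the same location.
    -- When one job's OVERWRITE/cleanup step rewrites the table directory, it can remove the other
    -- job's staging directory; anything still expecting that directory then fails with
    -- java.io.FileNotFoundException, as in the stack trace above.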

Resolution -

  1. Move the staging/metadata file location outside the table location (the table resides on S3); a sketch of the property override follows this list.
  2. Disable HDFS encryption (as mentioned in the hive.exec.stagingdir property description).
  3. Change your application-layer code to avoid the concurrency issue.
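
For resolution (1), the change boils down to overriding hive.exec.stagingdir so that the per-query staging directories are created outside the table location. A minimal sketch, assuming a dedicated prefix of your own (the path below is a placeholder; per the comments, we kept ours on S3):

    -- Point Hive's per-query staging directory outside the table location
    -- (placeholder path - adjust to your own bucket/prefix, or an HDFS path)
    set hive.exec.stagingdir=s3://your-bucket/tmp/.hive-staging;

The same key can also be set cluster-wide in /etc/hive/conf/hive-site.xml.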
+0

For solution (1) - did you end up keeping the hive.exec.stagingdir location on S3? Or did you end up redirecting it to HDFS? – etliens

+0

Hey! So what changes did you make to the hive.exec.stagingdir property? – jackStinger

+1

@etliens - Yes, the location is on S3 only. – devsda