JobTracker - high memory and native thread usage

We are running Hadoop on GCE, with HDFS as the default file system and data input/output from/to GCS.

Hadoop version: 1.2.1; connector version: com.google.cloud.bigdataoss:gcs-connector:1.3.0-hadoop1

Observed behavior: the JT accumulates threads in a waiting state, eventually leading to an OOM:

2015-02-06 14:15:51,206 ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed: 
java.lang.OutOfMemoryError: unable to create new native thread 
     at java.lang.Thread.start0(Native Method) 
     at java.lang.Thread.start(Thread.java:714) 
     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949) 
     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371) 
     at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.initialize(AbstractGoogleAsyncWriteChannel.java:318) 
     at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:275) 
     at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.create(CacheSupplementedGoogleCloudStorage.java:145) 
     at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:184) 
     at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:168) 
     at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:77) 
     at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:655) 
     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564) 
     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545) 
     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452) 
     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:444) 
     at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:1860) 
     at org.apache.hadoop.mapred.JobInProgress$3.run(JobInProgress.java:709) 
     at java.security.AccessController.doPrivileged(Native Method) 
     at javax.security.auth.Subject.doAs(Subject.java:415) 
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190) 
     at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:706) 
     at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:3890) 
     at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
     at java.lang.Thread.run(Thread.java:745) 

Digging through the JT logs, I found these warnings:

2015-02-06 14:30:17,442 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #0 from primary datanode xx.xxx.xxx.xxx:50010 
java.io.IOException: Call to /xx.xxx.xxx.xxx:50020 failed on local exception: java.io.IOException: Couldn't set up IO streams 
     at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150) 
     at org.apache.hadoop.ipc.Client.call(Client.java:1118) 
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229) 
     at com.sun.proxy.$Proxy10.getProtocolVersion(Unknown Source) 
     at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422) 
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:414) 
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:392) 
     at org.apache.hadoop.hdfs.DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:201) 
     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3317) 
     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2783) 
     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987) 
Caused by: java.io.IOException: Couldn't set up IO streams 
     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642) 
     at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:205) 
     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1249) 
     at org.apache.hadoop.ipc.Client.call(Client.java:1093) 
     ... 9 more 
Caused by: java.lang.OutOfMemoryError: unable to create new native thread 
     at java.lang.Thread.start0(Native Method) 
     at java.lang.Thread.start(Thread.java:714) 
     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:635) 
     ... 12 more 

This looks similar to the Hadoop bug reported here: https://issues.apache.org/jira/browse/MAPREDUCE-5606

I tried the solution proposed there, disabling the saving of job logs to the output path; it solved the problem at the cost of the missing logs :)

I also ran jstack against the JT, and it showed hundreds of WAITING or TIMED_WAITING threads like this:

"pool-52-thread-1" prio=10 tid=0x00007feaec581000 nid=0x524f in Object.wait() [0x00007fead39b3000] 
    java.lang.Thread.State: TIMED_WAITING (on object monitor) 
     at java.lang.Object.wait(Native Method) 
     - waiting on <0x000000074d86ba60> (a java.io.PipedInputStream) 
     at java.io.PipedInputStream.read(PipedInputStream.java:327) 
     - locked <0x000000074d86ba60> (a java.io.PipedInputStream) 
     at java.io.PipedInputStream.read(PipedInputStream.java:378) 
     - locked <0x000000074d86ba60> (a java.io.PipedInputStream) 
     at com.google.api.client.util.ByteStreams.read(ByteStreams.java:181) 
     at com.google.api.client.googleapis.media.MediaHttpUploader.setContentAndHeadersOnCurrentRequest(MediaHttpUploader.java:629) 
     at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:409) 
     at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336) 
     at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419) 
     at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343) 
     at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:460) 
     at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.run(AbstractGoogleAsyncWriteChannel.java:354) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
     at java.lang.Thread.run(Thread.java:745) 
    Locked ownable synchronizers: 
     - <0x000000074d864918> (a java.util.concurrent.ThreadPoolExecutor$Worker) 

The JT seems to be struggling to keep up its communication with GCS through the GCS connector.

Please advise,

Thanks


Do you happen to know where you got this gcs-connector-1.3.0-hadoop1.jar? Can you verify your gcs-connector version with "hadoop fs -stat gs://foo"? It should print out something like "15/02/10 18:16:13 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.3.0-hadoop1". – 2015-02-10 18:18:28


> hadoop fs -stat gs://zulily 2014-07-01 17:19:42 – ichekrygin 2015-02-10 18:26:14


Also, we are using the gcs-connector installed by bdutil: '-rw-r--r-- 1 root root 4451217 Jun 6 2014 gcs-connector-1.2.6-hadoop1.jar' – ichekrygin 2015-02-10 18:27:42

Answer


Currently, each open FSDataOutputStream in the GCS connector for Hadoop consumes a thread until it is closed, because a separate thread needs to run the "resumable" HttpRequests while the user of the OutputStream writes bytes intermittently. In most cases (such as within an individual Hadoop task) there is only a single long-lived output stream, plus possibly a few short-lived ones for writing small metadata/marker files.
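
As a minimal sketch of that implication (the bucket and object names here are hypothetical), each stream opened against a gs:// path ties up one connector upload thread until close() is called, so closing streams promptly, e.g. with try-with-resources, is what releases the thread:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GcsStreamThreadDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Resolves to the connector's GoogleHadoopFileSystem when gcs-connector
        // is on the classpath.
        FileSystem fs = FileSystem.get(URI.create("gs://my-bucket/"), conf);
        // While this stream is open, the connector keeps one background thread
        // running the resumable-upload loop seen in the jstack output above.
        try (FSDataOutputStream out = fs.create(new Path("gs://my-bucket/tmp/marker.txt"))) {
          out.writeBytes("marker\n");
        } // close() completes the upload and lets the worker thread exit
      }
    }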

Generally speaking, there are two possible causes for the OOM you are running into:

  1. You have lots of queued jobs; each submitted job holds an unclosed OutputStream and thus consumes a "waiting" thread. However, since you mention that you typically only have ~10 jobs queued, this shouldn't be the root cause.
  2. Something is causing a "leak" of the PrintWriter objects created inside logSubmitted and added to fileManager. Normally, terminal events such as logFinished will correctly close() all the PrintWriters before removing them from the map via markCompleted, but in theory there may be bugs here or there which could cause one of the OutputStreams to leak without being close()'d. For example, although I haven't had a chance to verify this assertion, it looks like an IOException while trying to do something like logMetaInfo will "removeWriter" without closing it (see the sketch after this list).
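
To make the suspected leak concrete, here is a simplified sketch (not the actual JobHistory source; the map and method names only loosely mirror it) of the difference between dropping a writer on an error path and closing it first:

    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.Map;

    class JobHistoryLeakSketch {
      // fileManager stands in loosely for JobHistory's map of
      // jobId -> open history writers.

      // Suspected buggy pattern: on an IOException path, the writers are removed
      // from the map without close(), so each underlying GCS OutputStream keeps
      // its upload thread alive forever.
      static void leakyRemove(Map<String, ArrayList<PrintWriter>> fileManager, String jobId) {
        fileManager.remove(jobId); // writers discarded, never closed
      }

      // Safe pattern (what logFinished/markCompleted normally do): close() every
      // writer before discarding it, which lets the upload threads exit.
      static void safeRemove(Map<String, ArrayList<PrintWriter>> fileManager, String jobId) {
        ArrayList<PrintWriter> writers = fileManager.remove(jobId);
        if (writers != null) {
          for (PrintWriter w : writers) {
            w.close();
          }
        }
      }
    }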

I have verified that, at least under normal circumstances, the OutputStreams do appear to be closed correctly, and my sample JobTracker shows a clean jstack after having successfully run a lot of jobs.

TL;DR: there are some working theories as to why some resource may leak and ultimately prevent the necessary threads from being created; in the meantime, you should consider changing hadoop.job.history.user.location to some HDFS location in order to preserve the job logs without placing them on GCS.
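
For instance, a minimal sketch of setting this per job via JobConf (the HDFS path is hypothetical; the same property can also be set cluster-wide in mapred-site.xml):

    import org.apache.hadoop.mapred.JobConf;

    public class HistoryLocationExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Write per-job history files to an HDFS path instead of the (GCS) job
        // output directory, so JobHistory opens no extra gs:// output streams.
        conf.set("hadoop.job.history.user.location", "hdfs:///user/hadoop/job-history");
        // Alternatively, "none" disables user-location history entirely, which
        // is the workaround applied above, at the cost of losing the logs:
        // conf.set("hadoop.job.history.user.location", "none");
      }
    }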