2012-06-01

Some datanodes in my HDFS cluster (Hadoop 0.21, 8 slave machines and 1 master) suddenly disconnect while jobs slow down

While I was running a MapReduce job over about 10 GB of data, one or more datanodes were dropped from the network at random, after all mappers had finished and roughly 80% of the reducers had completed. After that, other datanodes started disappearing from the network as well, even though I killed the MapReduce job as soon as I noticed datanodes disconnecting.

I have tried changing dfs.datanode.max.xcievers to 4096, turning off the firewall on all compute nodes, disabling SELinux, and raising the open-file limit to 20000, but none of it helped...
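One thing worth ruling out is that these changes actually took effect on every slave; in particular, a raised open-file limit only applies to processes started after the change. A rough per-node check (the conf directory below is an assumption, adjust it to your install):

```shell
# Per-node sanity check that the earlier tuning actually took effect.
# HADOOP_CONF is an assumed path -- point it at your real conf directory.
HADOOP_CONF=${HADOOP_CONF:-/home/hadoop/hadoop/conf}
limit=$(ulimit -n)
echo "open-file limit: $limit"    # expected: 20000 after the change
grep -A1 dfs.datanode.max.xcievers "$HADOOP_CONF/hdfs-site.xml" 2>/dev/null \
  || echo "dfs.datanode.max.xcievers not set in $HADOOP_CONF/hdfs-site.xml"
```

Note that a limit raised in /etc/security/limits.conf only affects new sessions: datanodes started before the change keep the old limit until they are restarted.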

Does anyone have an idea how to solve this problem?

Here is the error log from the MapReduce job:

12/06/01 12:31:29 INFO mapreduce.Job: Task Id : attempt_201206011227_0001_r_000006_0, Status : FAILED 
java.io.IOException: Bad connect ack with firstBadLink as ***.***.***.148:20010 
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889) 
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) 
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) 

And the following is the log from a datanode:

2012-06-01 13:01:01,118 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-5549263231281364844_3453 src: /*.*.*.147:56205 dest: /*.*.*.142:20010 
2012-06-01 13:01:01,136 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020) Starting thread to transfer block blk_-3849519151985279385_5906 to *.*.*.147:20010 
2012-06-01 13:01:19,135 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020):Failed to transfer blk_-5797481564121417802_3453 to *.*.*.146:20010 got java.net.ConnectException: > Connection timed out 
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) 
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701) 
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) 
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:373) 
    at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1257) 
    at java.lang.Thread.run(Thread.java:722) 

2012-06-01 13:06:20,342 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_6674438989226364081_3453 
2012-06-01 13:09:01,781 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020):Failed to transfer blk_-3849519151985279385_5906 to *.*.*.147:20010 got java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/*.*.*.142:60057 remote=/*.*.*.147:20010] 
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246) 
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:164) 
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:203) 
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:388) 
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:476) 
    at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1284) 
    at java.lang.Thread.run(Thread.java:722) 
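The `480000 millis timeout` in the second trace matches the default of `dfs.datanode.socket.write.timeout` (8 minutes), i.e. the receiving datanode stopped reading for a full 8 minutes before the sender gave up. If the network or disks are merely slow rather than down, raising the socket timeouts can serve as a stopgap while investigating; a sketch (property names as used on the 0.20/1.x line, values illustrative):

```xml
<!-- hdfs-site.xml on every node; values are illustrative, not tuned -->
<property>
  <name>dfs.socket.timeout</name>
  <value>600000</value> <!-- read timeout, default 60000 ms -->
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>600000</value> <!-- default 480000 ms, matching the log above -->
</property>
```

This only masks the symptom, though: the `Connection timed out` in the first trace still suggests nodes becoming genuinely unreachable (NIC, switch, or long JVM pauses on the datanode).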

hdfs-site.xml:

<configuration> 
<property> 
<name>dfs.name.dir</name> 
<value>/home/hadoop/data/name</value> 
</property> 
<property> 
    <name>dfs.data.dir</name> 
       <value>/home/hadoop/data/hdfs1,/home/hadoop/data/hdfs2,/home/hadoop/data/hdfs3,/home/hadoop/data/hdfs4,/home/hadoop/data/hdfs5</value> 
    </property> 
    <property> 
     <name>dfs.replication</name> 
     <value>3</value> 
    </property> 

    <property> 
       <name>dfs.datanode.max.xcievers</name> 
       <value>4096</value> 
    </property> 

    <property> 
      <name>dfs.http.address</name> 
      <value>0.0.0.0:20070</value> 
      <description>50070 
     The address and the base port where the dfs namenode web ui will listen on. 
     If the port is 0 then the server will start on a free port. 
      </description> 
    </property> 

    <property> 
      <name>dfs.datanode.http.address</name> 
      <value>0.0.0.0:20075</value> 
      <description>50075 
     The datanode http server address and port. 
     If the port is 0 then the server will start on a free port. 
      </description> 
    </property> 

    <property> 
     <name>dfs.secondary.http.address</name> 
     <value>0.0.0.0:20090</value> 
     <description>50090 
     The secondary namenode http server address and port. 
     If the port is 0 then the server will start on a free port. 
     </description> 
    </property> 

    <property>
     <name>dfs.datanode.address</name>
     <value>0.0.0.0:20010</value>
     <description>50010
     The address where the datanode server will listen to.
     If the port is 0 then the server will start on a free port.
     </description>
    </property>

<property> 
     <name>dfs.datanode.ipc.address</name> 
     <value>0.0.0.0:20020</value> 
     <description>50020 
     The datanode ipc server address and port. 
     If the port is 0 then the server will start on a free port. 
     </description> 
    </property> 

    <property> 
     <name>dfs.datanode.https.address</name> 
     <value>0.0.0.0:20475</value> 
    </property> 

     <property> 
     <name>dfs.https.address</name> 
      <value>0.0.0.0:20470</value> 
     </property> 
</configuration> 
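Incidentally, the file above is missing a `</property>` after the dfs.datanode.address block, and a stray unclosed tag can make the daemon misread everything after it. A quick well-formedness check before restarting daemons can rule this out; the helper below is a sketch using Python's stdlib parser, which is present on most nodes:

```shell
# check_xml FILE... : reports whether each Hadoop config file is
# well-formed XML, using Python's stdlib parser.
check_xml() {
  for f in "$@"; do
    if python3 -c 'import sys, xml.dom.minidom as m; m.parse(sys.argv[1])' "$f" 2>/dev/null; then
      echo "$f: well-formed"
    else
      echo "$f: BROKEN"
    fi
  done
}
# usage: check_xml hdfs-site.xml mapred-site.xml core-site.xml
```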

mapred-site.xml:

<configuration> 
    <property> 
      <name>mapred.job.tracker</name> 
      <value>masternode:29001</value> 
    </property> 
    <property> 
      <name>mapred.system.dir</name> 
      <value>/home/hadoop/data/mapreduce/system</value> 
    </property> 
    <property> 
      <name>mapred.local.dir</name> 
      <value>/home/hadoop/data/mapreduce/local</value> 
    </property> 
    <property> 
      <name>mapred.map.tasks</name> 
      <value>32</value> 
      <description> default number of map tasks per job.</description> 
    </property> 
    <property> 
      <name>mapred.tasktracker.map.tasks.maximum</name> 
      <value>4</value> 
    </property> 
    <property> 
      <name>mapred.reduce.tasks</name> 
      <value>8</value> 
      <description> default number of reduce tasks per job.</description> 
    </property> 
    <property> 
      <name>mapred.map.child.java.opts</name> 
      <value>-Xmx2048M</value> 
    </property> 
    <property> 
      <name>io.sort.mb</name> 
      <value>500</value> 
    </property> 
    <property> 
      <name>mapred.task.timeout</name> 
      <value>1800000</value> <!-- 30 minutes --> 
    </property> 


    <property> 
      <name>mapred.job.tracker.http.address</name> 
      <value>0.0.0.0:20030</value> 
      <description> 50030 
      The job tracker http server address and port the server will listen on. 
      If the port is 0 then the server will start on a free port. 
      </description> 
     </property> 

     <property>
       <name>mapred.task.tracker.http.address</name>
       <value>0.0.0.0:20060</value>
       <description> 50060
       The task tracker http server address and port.
       If the port is 0 then the server will start on a free port.
       </description>
     </property>

</configuration> 
0.21 was never very stable, so this isn't really surprising to me. Try switching to 0.20.2xx. Edit: those versions have since been renamed to 1.0.x –

Thanks for the comment. However, a colleague of mine running Hadoop 1.0.3 on the same cluster ran into the same problem. – user1429825

If your colleague is using 1.0.3 on the same cluster, are you mixing and matching Hadoop versions? Can you check that all running services and your job code use the same version of Hadoop? –
