2012-09-04 64 views
20

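Confused about the Hadoop JobTracker API

I am trying to collect some information from the JobTracker. As a first step I want to get details of the running jobs, such as the job ID or job name, but I am already stuck. Here is what I have so far (it prints the job IDs of the currently running jobs):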

public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "zk1.myhost,zk2.myhost,zk3.myhost");
    conf.set("hbase.zookeeper.property.clientPort", "2181");

    InetSocketAddress jobtracker = new InetSocketAddress("jobtracker.mapredhost.myhost", 8021);
    JobClient jobClient = new JobClient(jobtracker, conf);
    JobStatus[] jobs = jobClient.jobsToComplete();

    for (int i = 0; i < jobs.length; i++) {
        JobStatus js = jobs[i];
        if (js.getRunState() == JobStatus.RUNNING) {
            JobID jobId = js.getJobID();
            System.out.println(jobId);
        }
    }
}

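The code above works like a charm for printing the job ID, but now I also want to print the job name. So I added this line right after printing the job ID: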

System.out.println(jobClient.getJob(jobId).getJobName()); 

I get this exception:

Exception in thread "main" java.lang.NullPointerException 
    at org.apache.hadoop.mapred.JobClient$NetworkedJob.<init>(JobClient.java:226) 
    at org.apache.hadoop.mapred.JobClient.getJob(JobClient.java:1080) 
    at org.apache.test.JobTracker.main(JobTracker.java:28) 

jobClient itself is not null (I verified this with a null check), but the call jobClient.getJob(jobId) fails with the null pointer. What am I doing wrong here?

According to the API I should be fine:

http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/JobClient.html#getJob(org.apache.hadoop.mapred.JobID)

First get a RunningJob from the jobClient and then, once you have the running job, ask it for its name: http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/RunningJob.html#getJobName()

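Has anyone done something like this before? I could scrape this information with jsoup via a GET request, but I think the API is the better way to get it.

Question update: here are my Hadoop/HBase dependencies: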

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>0.23.1-mr1-cdh4.0.0b2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.23.1-mr1-cdh4.0.0b2</version>
    <exclusions>
        <exclusion>
            <groupId>org.mortbay.jetty</groupId>
            <artifactId>jetty</artifactId>
        </exclusion>
        <exclusion>
            <groupId>javax.servlet</groupId>
            <artifactId>servlet-api</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase</artifactId>
    <version>0.92.1-cdh4b2-SNAPSHOT</version>
</dependency>

Bounty update:

Here are my imports:

import java.io.IOException; 
import java.net.InetSocketAddress; 

import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.hbase.HBaseConfiguration; 
import org.apache.hadoop.mapred.JobClient; 
import org.apache.hadoop.mapred.JobID; 
import org.apache.hadoop.mapred.JobStatus; 

Here is the output of System.out.println(jobId):

job_201207031810_1603 

Only one job is running at the moment.

+1

Which version are you using? 0.21, as in your documentation links?

+0

Hi Thomas, good observation. I will update my question.

+0

So your cluster is running CDH4 0.23.1, like your dependencies?

Answers

17

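Take a look at NetworkedJob, an inner class of JobClient
(source: /home/user/hadoop/src/mapred/org/apache/hadoop/mapred/JobClient.java)

Its constructor tries to fetch the Configuration object from the enclosing JobClient on line 225, but that object is null, because the constructor new JobClient(InetSocketAddress jobTrackAddr, Configuration conf) never sets it: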

// Set the completion poll interval from the configuration.
// Default is 5 seconds.
Configuration conf = JobClient.this.getConf();
this.completionPollIntervalMillis = conf.getInt(COMPLETION_POLL_INTERVAL_KEY,
    DEFAULT_COMPLETION_POLL_INTERVAL); // NPE occurs here!

As a workaround, set the configuration manually after creating the JobClient object. This will solve your problem:

.. 
JobClient jobClient = new JobClient(jobtracker, conf); 
jobClient.setConf(conf); 
.... 
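
The failure mode and the workaround can be reproduced without a cluster. The following is only a minimal sketch using simplified stand-ins: MiniConf, MiniClient and Handle are hypothetical names invented for illustration, not Hadoop classes, and the key "poll.interval" is made up. It assumes, as described above, that the two-argument constructor never stores the configuration, so the inner class hits a null reference until setConf is called.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Configuration; not the real class.
class MiniConf {
    private final Map<String, Integer> values = new HashMap<>();

    void setInt(String key, int value) { values.put(key, value); }

    int getInt(String key, int defaultValue) {
        return values.getOrDefault(key, defaultValue);
    }
}

// Hypothetical stand-in for JobClient.
class MiniClient {
    private MiniConf conf; // stays null unless setConf is called

    // Assumption mirrored from the answer: the conf argument is used for
    // connection setup only and is never stored in the field.
    MiniClient(String trackerAddress, MiniConf conf) {
        // (connection setup would go here)
    }

    MiniConf getConf() { return conf; }

    void setConf(MiniConf conf) { this.conf = conf; }

    // Stand-in for the inner class NetworkedJob.
    class Handle {
        final int pollIntervalMillis;

        Handle() {
            MiniConf c = MiniClient.this.getConf();
            // Throws NullPointerException when c is null.
            this.pollIntervalMillis = c.getInt("poll.interval", 5000);
        }
    }

    Handle getJob() { return new Handle(); }
}

public class ConfNpeDemo {
    // Returns true if getJob() fails with a NullPointerException.
    static boolean failsWithNpe(MiniClient client) {
        try {
            client.getJob();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        MiniConf conf = new MiniConf();
        MiniClient client = new MiniClient("tracker:8021", conf);
        System.out.println("NPE before setConf: " + failsWithNpe(client));
        client.setConf(conf); // the workaround
        System.out.println("NPE after setConf: " + failsWithNpe(client));
    }
}
```

In this sketch the first call fails and the second succeeds, which matches the behaviour reported in the question: the client object itself is non-null, yet anything that reads its configuration breaks until the configuration is set explicitly.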

Side note:

I instantiate the Configuration object like this:

Configuration conf = new Configuration(); 
conf.addResource(new Path("/path_to/core-site.xml")); 
conf.addResource(new Path("/path_to/hdfs-site.xml"));

+0

Excellent observation, sir! It works if you manually call setConf on the jobClient. I can't award the bounty yet, though.

+0

@GandalfStormCrow You can always click the small +250 button next to Lorand's answer to award the bounty – HypnoticSheep