如何知道amazon mapreduce任务何时完成？

我想在amazon ec2上运行mapreduce任务。我设置了所有的配置参数，然后调用AmazonElasticMapReduce服务的runFlowJob方法。我想知道有没有什么办法可以知道这项工作是否已经完成，状态如何。（我需要它知道什么时候我可以从s3拿起mapreduce结果进行进一步处理）如何知道amazon mapreduce任务何时完成？

当前代码只是继续执行，因为runJobFlow的调用是非阻塞的。

 public void startMapReduceTask(String accessKey, String secretKey 
     ,String eC2KeyPairName, String endPointURL, String jobName 
     ,int numInstances, String instanceType, String placement 
     ,String logDirName, String bucketName, String pigScriptName) { 
    log.info("Start running MapReduce"); 

    // config.set 
    ClientConfiguration config = new ClientConfiguration(); 
    AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey); 

    AmazonElasticMapReduce service = new AmazonElasticMapReduceClient(credentials, config); 
    service.setEndpoint(endPointURL); 

    JobFlowInstancesConfig conf = new JobFlowInstancesConfig(); 

    conf.setEc2KeyName(eC2KeyPairName); 
    conf.setInstanceCount(numInstances); 
    conf.setKeepJobFlowAliveWhenNoSteps(true); 
    conf.setMasterInstanceType(instanceType); 
    conf.setPlacement(new PlacementType(placement)); 
    conf.setSlaveInstanceType(instanceType); 

    StepFactory stepFactory = new StepFactory(); 

    StepConfig enableDebugging = new StepConfig() 
    .withName("Enable Debugging") 
    .withActionOnFailure("TERMINATE_JOB_FLOW") 
    .withHadoopJarStep(stepFactory.newEnableDebuggingStep()); 

    StepConfig installPig = new StepConfig() 
    .withName("Install Pig") 
    .withActionOnFailure("TERMINATE_JOB_FLOW") 
    .withHadoopJarStep(stepFactory.newInstallPigStep()); 

    StepConfig runPigScript = new StepConfig() 
    .withName("Run Pig Script") 
    .withActionOnFailure("TERMINATE_JOB_FLOW") 
    .withHadoopJarStep(stepFactory.newRunPigScriptStep("s3://" + bucketName + "/" + pigScriptName, "")); 

    RunJobFlowRequest request = new RunJobFlowRequest(jobName, conf) 
    .withSteps(enableDebugging, installPig, runPigScript) 
    .withLogUri("s3n://" + bucketName + "/" + logDirName); 

    try { 
     RunJobFlowResult res = service.runJobFlow(request); 
     log.info("Mapreduce job with id[" + res.getJobFlowId() + "] completed successfully"); 
    } catch (Exception e) { 
     log.error("Caught Exception: ", e); 
    } 
    log.info("End running MapReduce");  
}

感谢，

阿维亚德

来源

2011-02-24 aviad

从AWS文档：

一旦作业流程完成后，集群停止，HDFS分区丢失。 为防止数据丢失，请配置作业流程的最后一步以将结果存储在Amazon S3中。

它接着说：

如果JobFlowInstancesDetail : KeepJobFlowAliveWhenNoSteps参数设置为TRUE，作业流程将过渡到WAITING状态，而不是关机一次的步骤已经完成。

每个作业流程最多允许256个步骤。

对于长时间运行的工作流程，我们建议您定期存储结果。

所以它看起来好像没有办法知道它什么时候完成。相反，您需要将数据保存为工作的一部分。

来源

2011-03-15 05:37:30

如何知道amazon mapreduce任务何时完成？

回答

相关问题