I have successfully completed a Mahout vectorization job on Amazon EMR (using "Mahout on Elastic MapReduce" as a reference). Now I want to copy the results from HDFS to S3 (to reuse them in a future cluster).

For that I've used hadoop distcp: 

[email protected]:~$ elastic-mapreduce --jar s3://elasticmapreduce/samples/distcp/distcp.jar \ 
> --arg hdfs://my.bucket/prj1/seqfiles \ 
> --arg s3n://ACCESS_KEY:[email protected]/prj1/seqfiles \ 
> -j $JOBID 

It failed. I found a suggestion to use s3distcp, so I tried that as well:

elastic-mapreduce --jobflow $JOBID \ 
> --jar --arg s3://eu-west-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \ 
> --arg --s3Endpoint --arg 's3-eu-west-1.amazonaws.com' \ 
> --arg --src --arg 'hdfs://my.bucket/prj1/seqfiles' \ 
> --arg --dest --arg 's3://my.bucket/prj1/seqfiles' 

In both cases I got the same error: java.net.UnknownHostException: unknown host: my.bucket
Below is the full error output for the second case.

2012-09-06 13:25:08,209 FATAL com.amazon.external.elasticmapreduce.s3distcp.S3DistCp (main): Failed to get source file system 
java.net.UnknownHostException: unknown host: my.bucket 
    at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:214) 
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1193) 
    at org.apache.hadoop.ipc.Client.call(Client.java:1047) 
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) 
    at $Proxy1.getProtocolVersion(Unknown Source) 
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:401) 
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384) 
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:127) 
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:249) 
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:214) 
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89) 
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1413) 
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:68) 
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1431) 
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:256) 
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:431) 
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) 
    at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 
    at java.lang.reflect.Method.invoke(Method.java:597) 
    at org.apache.hadoop.util.RunJar.main(RunJar.java:187) 

Answer

I found the mistake:

  1. The main problem is not:

    java.net.UnknownHostException: unknown host: my.bucket

    but:

2012-09-06 13:27:33,909 FATAL com.amazon.external.elasticmapreduce.s3distcp.S3DistCp (main): Failed to get source file system 

So: after adding one more slash in the source path, the job started without problems. The correct command is:

elastic-mapreduce --jobflow $JOBID \ 
> --jar --arg s3://eu-west-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \ 
> --arg --s3Endpoint --arg 's3-eu-west-1.amazonaws.com' \ 
> --arg --src --arg 'hdfs:///my.bucket/prj1/seqfiles' \ 
> --arg --dest --arg 's3://my.bucket/prj1/seqfiles' 
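
The extra slash matters because of how Hadoop parses the URI: in hdfs://my.bucket/prj1/seqfiles the authority part ("my.bucket") is treated as the NameNode host, which is exactly the host the stack trace fails to resolve, while hdfs:///my.bucket/prj1/seqfiles has an empty authority and therefore falls back to the cluster's default filesystem, with /my.bucket/prj1/seqfiles taken as a plain path. A quick way to see the difference from the master node (a sketch; the paths are the ones from the question):

# Authority "my.bucket" is taken as the NameNode host -> unknown host: my.bucket
hadoop fs -ls hdfs://my.bucket/prj1/seqfiles

# Empty authority: the default NameNode is used and /my.bucket/... is just a path
hadoop fs -ls hdfs:///my.bucket/prj1/seqfiles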

P.S. So it works. The job completed correctly, and I have successfully copied a directory with 30 GB of files.
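
To check that the copy actually landed in S3, the same hadoop fs -ls can be pointed at the s3n:// destination (a sketch; ACCESS_KEY and SECRET_KEY stand for your credentials, or they can be configured once as fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey in core-site.xml instead of being embedded in the URI):

# List the copied sequence files directly in the destination bucket
hadoop fs -ls 's3n://ACCESS_KEY:SECRET_KEY@my.bucket/prj1/seqfiles'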