2016-09-23 64 views
2

My EMR cluster is in us-west-1, but my S3 bucket is in us-east-1, and I get the error below. Cross-region S3 access from AWS EMR with Spark.

I tried `s3://{bucketname}.s3.amazon.com`, but that just creates a new bucket named `s3.amazon.com`.

How can I access an S3 bucket across regions?

com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Moved Permanently (Service: Amazon S3; Status Code: 301; Error Code: 301 Moved Permanently; Request ID: FB1139D9BD8F409B), S3 Extended Request ID: pWK3X9BBRp8BLlXEHOx008RCdlZC64YFTounDYGtnwsAneR0IDP1Z/gmDudRoqWhDArfYLNRxk4= 
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1389) 
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:902) 
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607) 
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376) 
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338) 
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287) 
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826) 
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015) 
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991) 
    at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) 
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) 
    at com.sun.proxy.$Proxy38.retrieveMetadata(Unknown Source) 
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780) 
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428) 
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313) 
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:85) 
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) 
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) 
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) 
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) 
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) 
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) 
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) 
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) 
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) 
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487) 
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) 
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) 
    at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:520) 
+0

This used to be supported, but something changed recently in EMR. It seems that accessing S3 buckets in a different region is no longer allowed. It also appears to affect historical AMIs, so this is a change in EMR itself, not something specific to emr-5.0. –

+0

Yes, we were doing cross-region S3 access with EMR 4.6, and this issue appeared when we upgraded to Spark 2.0 on EMR 5.0. I'm hoping there is an explicit way to set a different region, perhaps via `class InstanceProfileCredentialsProvider` or something similar... – codingtwinky

+1

@JohnRotenstein this is problematic. I haven't run into this issue myself, but what are we supposed to do in this situation? Please don't tell me we have to use the S3 API to copy data from one region to another just so we can access it. Even more absurd is that historical AMIs are affected. This is a huge regression. – eliasah

Answers

3

This solution worked for me on emr-5.0.0/emr-5.0.3:

Add the following property to the core-site configuration:

"fs.s3n.endpoint":"s3.amazonaws.com" 
+1

Finally got some time to test this. It appears to be working for s3n, s3a, and s3. It may have shipped recently with EMR 5.1.0, but the release notes don't specify. http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-whatsnew.html – codingtwinky

0

As suggested by @codingtwinky in the comments, EMR 4.6.0 does not have this problem in the emr.hadoop.fs layer. My Hadoop jobs now work on EMR 4.6.0, but not on 5.0.0 or 4.7.0.

1

I contacted the AWS support team; the TL;DR is that they are aware of the issue, are currently working on it, and hope to fix it in the next EMR release, but there is no ETA.

With "s3a" you can use custom S3 endpoints within Spark at runtime, but this does not work for "s3" or "s3n".
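For example, the s3a endpoint can be supplied as a Hadoop property at submit time via Spark's `spark.hadoop.*` passthrough (a sketch, not from the answer; the endpoint and script name are placeholders for your bucket's region and job):

```
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-east-1.amazonaws.com \
  my_job.py
```

Spark copies any `spark.hadoop.*` property into the Hadoop configuration, so this is equivalent to setting `fs.s3a.endpoint` in core-site for that one job.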

Additionally, you can configure EMR to point to another S3 region at creation time, but once it is configured that way, you are stuck with that region.

According to the support team, this region binding of EMRFS was introduced after EMR 4.7.2.