
How do I submit an application to a YARN cluster so that the jars pulled in by a package are also copied? I'm trying to submit a Spark job with the spark-csv package specified as a dependency:

spark/bin/spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 --deploy-mode cluster --master yarn-cluster script.py 

but I get the following exception (snippet):

15/05/05 22:23:46 INFO yarn.Client: Source and destination file systems are the same. Not copying /home/hadoop/.ivy2/jars/spark-csv_2.10.jar 
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://172.31.13.205:9000/home/hadoop/.ivy2/jars/spark-csv_2.10.jar 

The Spark cluster was installed and configured with the following script:

aws emr create-cluster --name sandbox --ami-version 3.6 --instance-type m3.xlarge --instance-count 3 \ 
    --ec2-attributes KeyName=sandbox \ 
    --applications Name=Hive \ 
    --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark \ 
    --log-uri s3://mybucket/spark-logs \ 
    --steps \ 
    Name=SparkHistoryServer,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=s3://support.elasticmapreduce/spark/start-history-server \ 
    Name=SparkConfigure,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://support.elasticmapreduce/spark/configure-spark.bash,spark.default.parallelism=100,spark.locality.wait.rack=0] 

This should be broadly relevant to Spark developers, since I imagine using EMR with Spark is not an uncommon workflow, and I'm not doing anything particularly complicated.

Here is the extended stack trace:

Spark assembly has been built with Hive, including Datanucleus jars on classpath 
Ivy Default Cache set to: /home/hadoop/.ivy2/cache 
The jars for the packages stored in: /home/hadoop/.ivy2/jars 
:: loading settings :: url = jar:file:/home/hadoop/.versions/spark-1.3.0.d/lib/spark-assembly-1.3.0-hadoop2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml 
com.databricks#spark-csv_2.10 added as a dependency 
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 
    confs: [default] 
    found com.databricks#spark-csv_2.10;1.0.3 in central 
    found org.apache.commons#commons-csv;1.1 in central 
:: resolution report :: resolve 238ms :: artifacts dl 8ms 
    :: modules in use: 
    com.databricks#spark-csv_2.10;1.0.3 from central in [default] 
    org.apache.commons#commons-csv;1.1 from central in [default] 
    --------------------------------------------------------------------- 
    |     |   modules   || artifacts | 
    |  conf  | number| search|dwnlded|evicted|| number|dwnlded| 
    --------------------------------------------------------------------- 
    |  default  | 2 | 0 | 0 | 0 || 2 | 0 | 
    --------------------------------------------------------------------- 
:: retrieving :: org.apache.spark#spark-submit-parent 
    confs: [default] 
    0 artifacts copied, 2 already retrieved (0kB/10ms) 
15/05/05 22:07:23 INFO client.RMProxy: Connecting to ResourceManager at /172.31.13.205:9022 
15/05/05 22:07:23 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers 
15/05/05 22:07:23 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container) 
15/05/05 22:07:23 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead 
15/05/05 22:07:23 INFO yarn.Client: Setting up container launch context for our AM 
15/05/05 22:07:23 INFO yarn.Client: Preparing resources for our AM container 
15/05/05 22:07:24 INFO yarn.Client: Uploading resource file:/home/hadoop/.versions/spark-1.3.0.d/lib/spark-assembly-1.3.0-hadoop2.4.0.jar -> hdfs://172.31.13.205:9000/user/hadoop/.sparkStaging/application_1430862769169_0005/spark-assembly-1.3.0-hadoop2.4.0.jar 
15/05/05 22:07:24 INFO metrics.MetricsSaver: MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: false 
15/05/05 22:07:24 INFO metrics.MetricsSaver: Created MetricsSaver j-3C91V87M8TXWD:i-e4bd8f2d:SparkSubmit:05979 period:60 /mnt/var/em/raw/i-e4bd8f2d_20150505_SparkSubmit_05979_raw.bin 
15/05/05 22:07:25 INFO yarn.Client: Source and destination file systems are the same. Not copying /home/hadoop/.ivy2/jars/spark-csv_2.10.jar 
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://172.31.13.205:9000/home/hadoop/.ivy2/jars/spark-csv_2.10.jar 
    at org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:129) 
    at org.apache.hadoop.fs.AbstractFileSystem.resolvePath(AbstractFileSystem.java:460) 
    at org.apache.hadoop.fs.FileContext$23.next(FileContext.java:2120) 
    at org.apache.hadoop.fs.FileContext$23.next(FileContext.java:2116) 
    at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) 
    at org.apache.hadoop.fs.FileContext.resolve(FileContext.java:2116) 
    at org.apache.hadoop.fs.FileContext.resolvePath(FileContext.java:591) 
    at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:203) 
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$4$$anonfun$apply$1.apply(Client.scala:285) 
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$4$$anonfun$apply$1.apply(Client.scala:280) 
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) 
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) 
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$4.apply(Client.scala:280) 
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$4.apply(Client.scala:278) 
    at scala.collection.immutable.List.foreach(List.scala:318) 
    at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:278) 
    at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:384) 
    at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:102) 
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:619) 
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647) 
    at org.apache.spark.deploy.yarn.Client.main(Client.scala) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:606) 
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) 
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) 
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) 
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) 
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
15/05/05 22:07:25 INFO metrics.MetricsSaver: Saved 3:3 records to /mnt/var/em/raw/i-e4bd8f2d_20150505_SparkSubmit_05979_raw.bin 
Command exiting with ret '1' 

Why are you using EMR? What is the advantage over plain EC2? There is an [official script](https://spark.apache.org/docs/1.3.1/ec2-scripts.html) for running Spark on EC2. Isn't EMR just complicating things and costing more? –


@DanielDarabos I switched to the 'spark-ec2' script that ships with Spark and I haven't had any problems. –


@DanielDarabos There are actually quite a few differences. The main one is the time it takes for the cluster to come up. If you set up a cluster of roughly 50+ machines with the ec2 scripts, it takes more than 45 minutes before they are ready to use; EMR does the job in less than half that time. Plus, EMR lets you automate batch Spark jobs very conveniently, which is painful to do with the spark-ec2 scripts, especially the logging when a task fails. – Sohaib

Answer


I think this may be an Apache Spark bug, although I haven't seen it reported in the Spark JIRA. However, http://apache-spark-user-list.1001560.n3.nabble.com/Resources-not-uploaded-when-submitting-job-in-yarn-client-mode-td21516.html seems to describe the same situation. According to that thread, the problem is that in your deployment setup Spark wrongly concludes that the destination filesystem is the same as the client's, so it skips the copy:

15/05/05 22:07:25 INFO yarn.Client: Source and destination file systems are the same. Not copying /home/hadoop/.ivy2/jars/spark-csv_2.10.jar

I suggest you try --jars instead of --packages (see Submitting Applications). If that works, please file a bug about this issue!
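
For example, a minimal sketch of the --jars variant (assuming you first fetch the two jars that --packages would otherwise resolve via Ivy; the Maven Central URLs, file names, and versions below are illustrative, so adjust them to match your setup):

# Download the package jar and its commons-csv dependency from Maven Central 
wget http://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.0.3/spark-csv_2.10-1.0.3.jar 
wget http://repo1.maven.org/maven2/org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar 

# Pass the local jars explicitly so spark-submit uploads them to the YARN staging directory 
spark/bin/spark-submit --jars spark-csv_2.10-1.0.3.jar,commons-csv-1.1.jar --deploy-mode cluster --master yarn-cluster script.py 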


What setting fixes this issue? – nish1013


Not sure. Did '--jars' instead of '--packages' help? A recent post at https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%[email protected].com%3E suggests that perhaps you just need a 'core-site.xml' file. –
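
For reference, a minimal core-site.xml of the kind that post refers to might look like the sketch below; the NameNode address is taken from the logs above and is purely illustrative, not the exact file from this cluster:

<?xml version="1.0"?> 
<configuration> 
  <!-- fs.defaultFS tells Hadoop clients, including spark-submit, which filesystem is the default --> 
  <property> 
    <name>fs.defaultFS</name> 
    <value>hdfs://172.31.13.205:9000</value> 
  </property> 
</configuration> 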


I already have a core-site.xml. I downloaded it for the YARN service via Ambari's 'Download Client Configs' option, and that is the version I copied into the Hadoop configuration on my development machine. – nish1013
