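Compile a Spark Scala program into a jar file using the installed Spark and Maven

I am still trying to get familiar with Maven and to compile my source code into a jar file for spark-submit. I know how to do this with IntelliJ, but I want to understand how it actually works. I have an EC2 server with all the latest software installed, such as Spark and Scala, and I have the SparkPi.scala example source that I now want to compile with Maven. My silly questions are, first, can I build the code against my installed software instead of retrieving dependencies from the Maven repository, and how do I start from a basic pom.xml template and add the appropriate requirements? I don't fully understand what Maven is actually doing. How can I test the compilation of my source code? As far as I understand, I only need the standard directory structure src/main/scala and then run mvn package. Also, I want to test with Maven rather than sbt.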

Source

2016-06-20 horatio1701d

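It depends on what you want to achieve: running the example on a local machine or running it on a cluster.

I am trying to run the example on a Spark cluster on EC2. I know how to compile locally with IntelliJ, but what is the proper way to compile the source code on the server? – horatio1701d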

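Adding to @Krishna's answer: if you have a Maven project, run mvn clean package in the directory that contains pom.xml. Make sure your pom.xml has the following build section so that it produces a fat jar. (This is my setup, and how I build the jar.)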

<build>
  <sourceDirectory>src</sourceDirectory>
  <plugins>
    <plugin>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.0</version>
      <configuration>
        <source>1.7</source>
        <target>1.7</target>
      </configuration>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>2.4</version>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>assemble-all</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
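
The build section above configures only the Java compiler plugin, so a Scala source such as SparkPi.scala would likely need a Scala compiler plugin as well. A minimal sketch, assuming the scala-maven-plugin (groupId net.alchim31.maven) at version 3.2.2; neither the plugin nor the version appears in the original answer, so treat both as assumptions and adjust them to your setup. The fragment would go inside the existing <plugins> element:

```xml
<!-- Assumption: scala-maven-plugin is used to compile the Scala sources;
     the original answer does not name a Scala compiler plugin. -->
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <!-- Assumed version; pick one that suits your Maven and Scala versions. -->
  <version>3.2.2</version>
  <executions>
    <execution>
      <goals>
        <!-- Compile main Scala sources during the compile phase. -->
        <goal>compile</goal>
        <!-- Compile test Scala sources during the test-compile phase. -->
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

With a plugin like this in place, mvn clean package should compile the Scala code before the assembly plugin builds the jar.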

More details: link. If you have an sbt project, use sbt clean assembly to build the fat jar. For that you need the following configuration; as an example, build.sbt:

assemblyJarName := "WordCountSimple.jar" 
// 
val meta = """META.INF(.)*""".r 

assemblyMergeStrategy in assembly := { 
    case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
    case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
    case n if n.startsWith("reference.conf") => MergeStrategy.concat 
    case n if n.endsWith(".conf") => MergeStrategy.concat 
    case meta(_) => MergeStrategy.discard 
    case x => MergeStrategy.first 
}

And plugin.sbt looks like:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

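For more, see this and this.

Up to this point, the main goal is to get a fat jar with all dependencies in the target folder. Use that jar to run on the cluster like this: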

/usr/local/spark$ ./bin/spark-submit --class com.hastimal.wordcount --master yarn-cluster \
    --num-executors 15 --executor-memory 52g --executor-cores 7 \
    --driver-memory 52g --driver-cores 7 \
    --conf spark.default.parallelism=105 \
    --conf spark.driver.maxResultSize=4g \
    --conf spark.network.timeout=300 \
    --conf spark.yarn.executor.memoryOverhead=4608 \
    --conf spark.yarn.driver.memoryOverhead=4608 \
    --conf spark.akka.frameSize=1200 \
    --conf spark.io.compression.codec=lz4 \
    --conf spark.rdd.compress=true \
    --conf spark.broadcast.compress=true \
    --conf spark.shuffle.spill.compress=true \
    --conf spark.shuffle.compress=true \
    --conf spark.shuffle.manager=sort \
    /users/hastimal/wordcount.jar inputRDF/data_all.txt /output

Here inputRDF/data_all.txt and /output are the two application arguments. As for tooling, I build the project in the IntelliJ IDE.

Source

2016-06-21 16:10:31 ChikuMiku

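Thanks. So basically I understand it. You don't need to include the Spark or Scala dependencies in your pom.xml? I would like to understand why I sometimes see all the software dependencies in a pom.xml rather than left out as in what you showed. When you write code in IntelliJ, do you simply add Spark and Scala as modules and then run a basic Maven build to create the fat jar for spark-submit? – horatio1701d

@prometheus2305 Short answers to your questions: 1. Yes, we need every dependency the app requires, in 'build.sbt' or 'pom.xml'. 2. I create a Scala sbt project and then add things to 'build.sbt' and 'plugin.sbt'. That is the simplest way I know. Use the links I mentioned above. – ChikuMiku

Still a bit confused. If I only need to package the project into a jar to submit to Spark, and I submit it on a separate remote cluster that already has Spark, do I need to explicitly add Spark and Scala to pom.xml so that those dependencies are included in the jar, or do I just need a minimal Maven setup to compile and create a jar file? – horatio1701d

Follow these steps: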
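
One common way to reconcile the two points raised in these comments, sketched here under assumptions rather than taken from the answers: declare Spark and Scala in pom.xml so the code compiles, but give them provided scope so they are left out of the fat jar, since the cluster already supplies them. The coordinates below (spark-core_2.10 at 1.6.1, scala-library at 2.10.6) are illustrative assumptions, chosen only to line up with the scala-2.10 path that appears elsewhere in this thread; match them to whatever is installed on your cluster.

```xml
<!-- Assumed versions; align them with the Spark and Scala installed on the cluster. -->
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.1</version>
    <!-- provided: needed to compile, but expected to be supplied at run time -->
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.6</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```

Maven still resolves these artifacts from a repository at build time; the provided scope only affects what ends up in the packaged jar. Third-party libraries that the cluster does not supply would keep the default compile scope so that they are bundled.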

# create assembly jar upon code change 
sbt assembly 

# transfer the jar to a cluster 
scp target/scala-2.10/myproject-version-assembly.jar <some location in your cluster> 

# fire spark-submit on your cluster 
$SPARK_HOME/bin/spark-submit --class not.memorable.package.application.class --master yarn --num-executors 10 \
    --conf some.crazy.config=xyz --executor-memory=lotsG \
    myproject-version-assembly.jar \
    <glorious-application-arguments...>

Source

2016-06-21 14:20:18
