Best way to get custom JARs onto the Spark worker classpath

2017-05-01

I'm working on an ETL pipeline in Spark, and I've found that pushing out a release is bandwidth-intensive. My release script (pseudo):

sbt assembly 
openstack object create spark target/scala-2.11/etl-$VERSION-super.jar 
spark-submit \ 
    --class comapplications.WindowsETLElastic \ 
    --master spark://spark-submit.cloud \ 
    --deploy-mode cluster \ 
    --verbose \ 
    --conf "spark.executor.memory=16g" \ 
    "$JAR_URL" 

It works, but the assembly can take four minutes and the push another minute. My build.sbt:

name := "secmon_etl" 

version := "1.2" 

scalaVersion := "2.11.8" 

exportJars := true 

assemblyJarName in assembly := s"${name.value}-${version.value}-super.jar" 

libraryDependencies ++= Seq (
    "org.apache.spark" %% "spark-core" % "2.1.0" % "provided", 
    "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided", 
    "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0", 
    "io.spray" %% "spray-json" % "1.3.3", 
// "commons-net" % "commons-net" % "3.5", 
// "org.apache.httpcomponents" % "httpclient" % "4.5.2", 
    "org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.3.1" 
) 

assemblyMergeStrategy in assembly <<= (assemblyMergeStrategy in assembly) { 
    (old) => { 
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard 
    case x => MergeStrategy.first 
    } 
} 

The problem seems to be the sheer size of elasticsearch-spark-20_2.11: it adds roughly 90 MB to my uberjar. I'd happily turn it into a provided dependency that lives on the Spark hosts so it doesn't have to be packaged at all. The question is, what's the best way to do that? Should I copy the jar over manually? Or is there a foolproof way to declare the dependency and have a tool resolve all of its transitive dependencies?
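spark-submit's --packages flag, for example, takes Maven coordinates and resolves their transitive dependencies via Ivy at submit time; a rough sketch of what that would look like with the submit command above (untested here, and I'm not sure how it interacts with cluster deploy mode), with elasticsearch-spark marked provided in build.sbt:

spark-submit \
    --class comapplications.WindowsETLElastic \
    --master spark://spark-submit.cloud \
    --deploy-mode cluster \
    --conf "spark.executor.memory=16g" \
    --packages org.elasticsearch:elasticsearch-spark-20_2.11:5.3.1 \
    "$JAR_URL"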

Answer


I got my Spark job running, and it's much faster now. I ran

sbt assemblyPackageDependency 

This produced a huge jar of the dependencies alone (110 MB!), which slots straight into the "jars" folder of the Spark install, so the Dockerfile for my Spark cluster now looks like this:

FROM openjdk:8-jre 

ENV SPARK_VERSION 2.1.0 
ENV HADOOP_VERSION hadoop2.7 
ENV SPARK_MASTER_OPTS="-Djava.net.preferIPv4Stack=true" 

RUN apt-get update && apt-get install -y python 

RUN curl -sSLO http://mirrors.ocf.berkeley.edu/apache/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-$HADOOP_VERSION.tgz && tar xzfC /spark-$SPARK_VERSION-bin-$HADOOP_VERSION.tgz /usr/share && rm /spark-$SPARK_VERSION-bin-$HADOOP_VERSION.tgz 

# master's / worker's web UI port
EXPOSE 8080
# master's port for spark:// submissions (cluster manager RPC)
EXPOSE 7077

ADD deps.jar /usr/share/spark-$SPARK_VERSION-bin-$HADOOP_VERSION/jars/ 

WORKDIR /usr/share/spark-$SPARK_VERSION-bin-$HADOOP_VERSION 
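
Building and starting a cluster from that image looks roughly like this (the image tag and the master hostname are placeholders for my setup):

docker build -t spark-cluster .
# master
docker run -d --net host spark-cluster ./bin/spark-class org.apache.spark.deploy.master.Master
# worker, pointed at the master above
docker run -d --net host spark-cluster ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-submit.cloud:7077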

With that deployed, I changed my build.sbt so the kafka-streaming and elasticsearch-spark jars and their dependencies are marked provided:

name := "secmon_etl" 

version := "1.2" 

scalaVersion := "2.11.8" 

exportJars := true 

assemblyJarName in assembly := s"${name.value}-${version.value}-super.jar" 

libraryDependencies ++= Seq (
    "org.apache.spark" %% "spark-core" % "2.1.0" % "provided", 
    "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided", 

    "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0" % "provided", 
    "io.spray" %% "spray-json" % "1.3.3" % "provided", 
    "org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.3.1" % "provided" 
) 

assemblyMergeStrategy in assembly <<= (assemblyMergeStrategy in assembly) { 
    (old) => { 
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard 
    case x => MergeStrategy.first 
    } 
} 

Now a deploy finishes in under 20 seconds!
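
The release loop is the same pseudo-script as in the question; only the artifact it moves shrank (a sketch):

sbt assembly                                                             # now packages only my own classes
openstack object create spark target/scala-2.11/etl-$VERSION-super.jar   # a much smaller upload
# spark-submit invocation is unchanged from the question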


I ran into a gotcha where the master couldn't restart: the uberjar can't be on the master's classpath, or it automatically runs some code and breaks the ZooKeeper connection code. – xrl
