Best way to get a custom JAR onto the Spark worker classpath

I'm working on an ETL pipeline in Spark, and I've found that pushing releases is bandwidth-intensive. My release script (pseudocode):
sbt assembly
openstack object create spark target/scala-2.11/etl-$VERSION-super.jar
spark-submit \
--class comapplications.WindowsETLElastic \
--master spark://spark-submit.cloud \
--deploy-mode cluster \
--verbose \
--conf "spark.executor.memory=16g" \
"$JAR_URL"
It works, but it can take four minutes to assemble and another minute to push. My build.sbt:
name := "secmon_etl"
version := "1.2"
scalaVersion := "2.11.8"
exportJars := true
assemblyJarName in assembly := s"${name.value}-${version.value}-super.jar"
libraryDependencies ++= Seq (
"org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided",
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0",
"io.spray" %% "spray-json" % "1.3.3",
// "commons-net" % "commons-net" % "3.5",
// "org.apache.httpcomponents" % "httpclient" % "4.5.2",
"org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.3.1"
)
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
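One way this could look in the build (a sketch, not from the original post): if the Elasticsearch connector jar were already present on every worker, the dependency could be marked `provided` so sbt-assembly leaves it and its transitive dependencies out of the uberjar. This assumes the jar is distributed to the workers by some other means, e.g. via `spark.executor.extraClassPath`.

```scala
// Hypothetical build.sbt change: scope the connector as "provided" so
// sbt-assembly excludes it from the uberjar, the same way the Spark
// artifacts above are excluded. The jar must then already exist on the
// executor classpath at runtime.
"org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.3.1" % "provided"
```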
The problem seems to be the sheer size of elasticsearch-spark-20_2.11: it adds roughly 90MB to my uberjar. I'd love to turn it into a provided dependency on the Spark hosts so it doesn't have to be packaged. The question is, what's the best way to do that? Should I copy the jars over manually? Or is there a foolproof way to specify the dependency and have a tool resolve all the transitive dependencies?
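For the "have a tool resolve transitive dependencies" route, one candidate (a sketch under the assumption that the cluster has outbound access to Maven Central) is spark-submit's built-in `--packages` flag, which resolves a Maven coordinate and its transitive dependencies via Ivy at submit time:

```shell
# Sketch: resolve the connector (and its transitive deps) at submit time
# instead of bundling it. Combined with a "provided" scope in build.sbt,
# the pushed application jar shrinks by roughly 90MB.
spark-submit \
  --class comapplications.WindowsETLElastic \
  --master spark://spark-submit.cloud \
  --deploy-mode cluster \
  --packages org.elasticsearch:elasticsearch-spark-20_2.11:5.3.1 \
  --conf "spark.executor.memory=16g" \
  "$JAR_URL"
```

Resolved jars are cached under `~/.ivy2` on the submitting host, so the download cost is paid once per version rather than on every release.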
I ran into an issue where the master couldn't restart: the uberjar can't be on the master's classpath, or it automatically runs some code and breaks the ZooKeeper connection logic. – xrl