Reducing the run time of an Apache Spark job/application

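We are trying to implement a simple Spark job that reads a CSV file (1 row of data) and makes a prediction using a pre-built random forest model object. The job does not include any data preprocessing or data manipulation.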

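We are running Spark in standalone mode, and the application runs locally. The configuration is as follows:

RAM: 8 GB
Memory: 40 GB
Number of cores: 2
Spark version: 1.5.2
Scala version: 2.10.5
Input file size: 1 KB (1 row of data)
Model file size: 1,595 KB (random forest with 400 trees)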

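Currently the implementation takes about 13 seconds when launched with spark-submit. Run time is a major concern for this application, hence:

1. Is there a way to optimize the code to bring the run time down to 1 or 2 seconds? (high priority)
2. We noticed that executing the actual code takes about 7-8 seconds, while booting up and setting up the contexts takes about 5-6 seconds. Is there a way to keep the Spark context running between spark-submit runs?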

Here is the application code:

import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ 
import org.apache.spark.SparkConf 

object RF_model_App { 
    def main(args: Array[String]) { 

val conf = new SparkConf().setAppName("Simple Application") 
val sc = new SparkContext(conf) 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.feature.VectorAssembler
import sqlContext.implicits._
// Read the single-row input; every column arrives as a string.
val Test = sqlContext.read.format("com.databricks.spark.csv").option("header","true").load("/home/ubuntu/Test.csv")
Test.registerTempTable("Test")
// Load the pre-built random forest model that was saved as an object file.
val model_L1 = sc.objectFile[RandomForestClassificationModel]("/home/ubuntu/RF_L1.model").first()

val toInt = udf[Int, String](_.toInt) 
val toDouble = udf[Double, String](_.toDouble) 
val featureDf = Test
  .withColumn("id1", toInt(Test("id1")))
  .withColumn("id2", toInt(Test("id2")))
  .withColumn("id3", toInt(Test("id3")))
  .withColumn("id4", toInt(Test("id4")))
  .withColumn("feature3", toInt(Test("feature3")))
  .withColumn("feature9", toInt(Test("feature9")))
  .withColumn("feature10", toInt(Test("feature10")))
  .withColumn("feature12", toInt(Test("feature12")))
  .withColumn("feature14", toDouble(Test("feature14")))
  .withColumn("feature15", toDouble(Test("feature15")))
  .withColumn("feature16", toInt(Test("feature16")))
  .withColumn("feature17", toDouble(Test("feature17")))
  .withColumn("feature18", toInt(Test("feature18")))

val feature4_index = new StringIndexer() .setInputCol("feature4") .setOutputCol("feature4_index") 
val feature6_index = new StringIndexer() .setInputCol("feature6") .setOutputCol("feature6_index") 
val feature11_index = new StringIndexer() .setInputCol("feature11") .setOutputCol("feature11_index") 
val feature8_index = new StringIndexer() .setInputCol("feature8") .setOutputCol("feature8_index") 
val feature13_index = new StringIndexer() .setInputCol("feature13") .setOutputCol("feature13_index") 
val feature2_index = new StringIndexer() .setInputCol("feature2") .setOutputCol("feature2_index") 
val feature5_index = new StringIndexer() .setInputCol("feature5") .setOutputCol("feature5_index") 
val feature7_index = new StringIndexer() .setInputCol("feature7") .setOutputCol("feature7_index") 
val vectorizer_L1 = new VectorAssembler() .setInputCols(Array("feature3", "feature2_index", "feature6_index", "feature4_index", "feature8_index", "feature7_index", "feature5_index", "feature10", "feature9", "feature12", "feature11_index", "feature13_index", "feature14", "feature15", "feature18", "feature17", "feature16")).setOutputCol("features_L1") 
val feature_pipeline_L1 = new Pipeline() .setStages(Array(feature4_index, feature6_index, feature11_index,feature8_index, feature13_index, feature2_index, feature5_index, feature7_index,vectorizer_L1)) 
val testPredict= feature_pipeline_L1.fit(featureDf).transform(featureDf) 
val getPOne = udf((v: org.apache.spark.mllib.linalg.Vector) => v(1)) 
val getid2 = udf((v: Int) => v) 
val L1_output = model_L1.transform(testPredict).select(getid2($"id2") as "id2",getid2($"prediction") as "L1_prediction",getPOne($"probability") as "probability") 

L1_output.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").mode("overwrite").save("/home/L1_output") 

    } 
};

Source

2016-02-26 Gauthaam M

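Let's start with the things that are simply wrong:

The feature mechanism you use is simply incorrect. StringIndexer assigns indices based on the distribution of the data, so the same record will get a different encoding depending on the other records. You should use the same StringIndexerModel(s) for training, testing, and prediction.
val getid2 = udf((v: Int) => v) is just an expensive identity function.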
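
To make the StringIndexer point concrete, here is a minimal sketch. It assumes a local master and uses made-up data in a column named feature4; the object and method names are hypothetical. It fits one indexer on multi-row "training" data and compares reusing that fitted model against refitting on the single input row, which is what the question's pipeline does.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SQLContext

object IndexerReuseSketch {
  // Returns (index of "b" when the indexer is refit on the one-row input,
  //          index of "b" when the indexer fitted on training data is reused).
  def encodeB(sqlContext: SQLContext): (Double, Double) = {
    // Hypothetical training data: "a" is the most frequent label.
    val train = sqlContext.createDataFrame(
      Seq(Tuple1("a"), Tuple1("a"), Tuple1("a"), Tuple1("b"))).toDF("feature4")
    // Hypothetical one-row input, as in the question.
    val test = sqlContext.createDataFrame(Seq(Tuple1("b"))).toDF("feature4")

    val indexer = new StringIndexer()
      .setInputCol("feature4")
      .setOutputCol("feature4_index")

    // Fit once on the training data; this StringIndexerModel is what should be reused.
    val fittedOnTrain = indexer.fit(train)

    // Refitting on the input row: "b" is the only label, so it gets index 0.0.
    val refit = indexer.fit(test).transform(test)
      .select("feature4_index").first().getDouble(0)
    // Reusing the training-time model: "b" keeps its training index 1.0.
    val reused = fittedOnTrain.transform(test)
      .select("feature4_index").first().getDouble(0)
    (refit, reused)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for this sketch; spark-submit can supply the master instead.
    val conf = new SparkConf().setAppName("Indexer reuse sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val (refit, reused) = encodeB(new SQLContext(sc))
    println(s"refit on input row: $refit, reused training model: $reused")
    sc.stop()
  }
}
```

The two encodings differ, so a model trained on the second would be fed the wrong feature value by the first. As for the identity UDF, selecting the column directly (for example $"id2") gives the same result without the UDF call.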

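Persistent SparkContext

There are multiple tools that keep a persistent context, including job-server and Livy.

Finally, you can simply use Spark Streaming and just process the data as it arrives.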
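
As a rough illustration of the Spark Streaming option, the sketch below keeps one context alive for the lifetime of the process and handles input files as they appear. The input directory, the 1-second batch interval, the object name, and the parseRow helper are all assumptions made for the example, and the scoring step is left as a placeholder rather than copied from the question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingScoringSketch {
  // Hypothetical helper: split one CSV line into trimmed fields.
  def parseRow(line: String): Array[String] = line.split(",", -1).map(_.trim)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Streaming scoring sketch")
    // The context is created once; start-up cost is paid a single time.
    val ssc = new StreamingContext(conf, Seconds(1)) // assumed batch interval

    // Anything expensive (such as loading the model) would also happen once, here.

    // Hypothetical directory: files newly moved into it are picked up per batch.
    val lines = ssc.textFileStream("/home/ubuntu/incoming")

    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Placeholder for: build features, apply the loaded model, write the output.
        rdd.map(parseRow).collect().foreach { fields =>
          println(s"would score a row with ${fields.length} fields")
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this layout each request only pays for the work inside foreachRDD, not for context start-up, at the cost of latency bounded below by the batch interval.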

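Shuffling

You also use repartition to create a single partition and therefore, I suppose, a single CSV file. This operation is quite expensive, because by definition it will "reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network."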
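
One way to see the shuffle is to compare the lineage of repartition(1) with that of coalesce(1), which merges partitions without a shuffle by default. The sketch below is only an illustration on a small made-up RDD with an assumed local master; the object and method names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  // Returns the lineage descriptions of (repartition(1), coalesce(1)).
  def lineages(sc: SparkContext): (String, String) = {
    val rdd = sc.parallelize(1 to 10, 4) // made-up data in 4 partitions
    val shuffled = rdd.repartition(1)    // same as coalesce(1, shuffle = true)
    val merged = rdd.coalesce(1)         // narrow dependency, no shuffle
    (shuffled.toDebugString, merged.toDebugString)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for this sketch.
    val conf = new SparkConf().setAppName("Coalesce sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val (shuffled, merged) = lineages(sc)
    println(shuffled) // lineage includes a ShuffledRDD
    println(merged)   // lineage has no ShuffledRDD
    sc.stop()
  }
}
```

In the question's code, if a single output file is still wanted, replacing repartition(1) with coalesce(1) on the DataFrame would be the analogous change. Whether it saves measurable time for one row is something only profiling on the actual setup can tell.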

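Other considerations:

If latency is important and you only use a single, low-performance machine, don't use Spark at all. There is nothing to gain here. A good local library can do a much better job in a case like this.

Notes:

We don't have access to your data or your hardware, so any requirement like "reduce the time to 7 seconds" is completely meaningless.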

Source

2016-02-26 15:00:26 zero323
