2016-02-26 37 views
8

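How to get the probability corresponding to a class from a Spark ML random forest

I have been using org.apache.spark.ml.Pipeline for machine learning tasks. Knowing the actual probabilities, not just the predicted label, is particularly important, and I am having trouble getting them. Here I am doing binary classification with a random forest. The class labels are "Yes" and "No". I want to output the probability of the label "Yes". The pipeline outputs the probabilities as a DenseVector, for example [0.69, 0.31], but I don't know which element corresponds to "Yes" (0.69 or 0.31?). I guess there should be some way to retrieve this from labelIndexer?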

Here is my code for training the model:

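import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

val sc = new SparkContext(new SparkConf().setAppName(" ML").setMaster("local"))
val sqlContext = new SQLContext(sc)
val data = .... // load data from file
val df = sqlContext.createDataFrame(data).toDF("label", "features")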
val labelIndexer = new StringIndexer() 
         .setInputCol("label") 
         .setOutputCol("indexedLabel") 
         .fit(df) 

val featureIndexer = new VectorIndexer() 
         .setInputCol("features") 
         .setOutputCol("indexedFeatures") 
         .setMaxCategories(2) 
         .fit(df) 


// Convert indexed labels back to original labels. 
val labelConverter = new IndexToString() 
    .setInputCol("prediction") 
    .setOutputCol("predictedLabel") 
    .setLabels(labelIndexer.labels) 

val Array(trainingData, testData) = df.randomSplit(Array(0.7, 0.3)) 


// Train a RandomForest model. 
val rf = new RandomForestClassifier() 
    .setLabelCol("indexedLabel") 
    .setFeaturesCol("indexedFeatures") 
    .setNumTrees(10) 
    .setFeatureSubsetStrategy("auto") 
    .setImpurity("gini") 
    .setMaxDepth(4) 
    .setMaxBins(32) 

// Create pipeline 
val pipeline = new Pipeline() 
    .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Train model 
val model = pipeline.fit(trainingData) 

// Save model 
sc.parallelize(Seq(model), 1).saveAsObjectFile("/my/path/pipeline") 

Then I load the pipeline and run predictions on new data. Here is that piece of code:

// Ignoring loading data part 

// Create DF 
val testdf = sqlContext.createDataFrame(testData).toDF("features", "line") 
// Load pipeline 
val model = sc.objectFile[org.apache.spark.ml.PipelineModel]("/my/path/pipeline").first 

// My question: how do I extract the probability that corresponds to class label "1"?
// This is my attempt. I would like to output the probability for label "Yes" together with the predicted label.
// The probabilities are stored in a DenseVector, but I don't know which element corresponds to "Yes". Something like this:
// Note: each element of the selected DataFrame is a Row, so read the vector out of column 0 instead of casting the Row.
val predictions = model.transform(testdf).select("probability").map(row => row.getAs[DenseVector](0))

Reference on probabilities and labels for random forests: http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forests

+0

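What do you mean by "I would like to output probability for label '1' and the predicted label. The probabilities are stored in a DenseVector as pipeline output, but I don't know which one corresponds to '1'"? – eliasah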

+0

Hi, I have updated the description. Basically, I want to output the probability corresponding to the label "Yes". – Qing

Answers

-1

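Do you mean that you want to extract the probability of the positive label from the DenseVector? If so, you can create a UDF to pull it out. In the DenseVector for binary classification, the first element is the probability of "0" and the second is the probability of "1".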

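// getOne is not defined in the answer; presumably a UDF like this (needs org.apache.spark.sql.functions.udf and the Vector type of your Spark version)
val getOne = udf((v: Vector) => v(1))
val prediction = pipelineModel.transform(result)
val pre = prediction.select(getOne($"probability").as("probability"))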
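
Which position holds "Yes" depends on how StringIndexer numbered the labels: it orders them by frequency, so the most frequent label gets index 0, and the fitted model exposes that order as `labels`. The probability vector is indexed the same way. Below is a minimal sketch of looking the position up instead of hard-coding it. The helper names `labelIndex` and `withProbabilityOf` and the output column name are made up for illustration, and the `Vector` import assumes Spark 1.x (on Spark 2.x and later the column holds `org.apache.spark.ml.linalg.Vector` instead).

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.StringIndexerModel
import org.apache.spark.mllib.linalg.Vector // assumption: Spark 1.x vector type
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: position of `target` in the StringIndexer label order.
def labelIndex(labels: Array[String], target: String): Int = {
  val idx = labels.indexOf(target)
  require(idx >= 0, s"label '$target' not found in ${labels.mkString(", ")}")
  idx
}

// Hypothetical helper: appends a column with the probability of `target`.
// `scored` is the output of model.transform(...), which has a "probability" column.
def withProbabilityOf(model: PipelineModel, scored: DataFrame, target: String): DataFrame = {
  // Take the label order from the first fitted StringIndexer stage in the pipeline.
  val labels = model.stages
    .collectFirst { case m: StringIndexerModel => m.labels }
    .getOrElse(sys.error("pipeline has no StringIndexerModel stage"))
  val idx = labelIndex(labels, target)
  // Pick the vector element at the position of the target label.
  val pick = udf((v: Vector) => v(idx))
  scored.withColumn("probability_" + target, pick(col("probability")))
}
```

With the names from the question, `withProbabilityOf(model, model.transform(testdf), "Yes").select("predictedLabel", "probability_Yes")` would put the predicted label and the probability of "Yes" side by side. If a pipeline contains more than one StringIndexerModel, the sketch picks the first one, which may not be the label indexer.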