2016-11-10 52 views
2

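Spark MLlib: including categorical features

What is the correct or best way to include categorical variables (both strings and integers) as features for MLlib algorithms?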

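Is it correct to apply OneHotEncoders to the categorical variables and then pass their output columns, together with the other columns, to a VectorAssembler, as in the code below?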

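I ask because I end up with a DataFrame whose rows look like the ones below, where feature3 and feature4 together appear to sit at the same "level" of importance as each of the two categorical features individually.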

+------------------+---------+---------------------------------------+
|prediction        |actualVal|features                               |
+------------------+---------+---------------------------------------+
|355416.44924898935|990000.0 |(17,[0,1,2,3,4,5,10,15],[1.0,206.0])   |
|358917.32988024893|210000.0 |(17,[0,1,2,3,4,5,10,15,16],[1.0,172.0])|
|291313.84175674635|4600000.0|(17,[0,1,2,3,4,5,12,15,16],[1.0,239.0])|
+------------------+---------+---------------------------------------+
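
For readers unfamiliar with the printed form `(size,[indices],[values])`, here is a minimal plain-Scala sketch (no Spark required) of why the assembled vector looks flat. It assumes the default drop-last one-hot convention, and the category counts (3 and 4) as well as the helper names `oneHotDropLast`, `assemble` and `toSparse` are hypothetical, chosen only for illustration. It is not the actual VectorAssembler implementation.

```scala
object AssembledLayoutSketch {
  // One-hot encode a category index into (numCategories - 1) slots, mirroring
  // the drop-last convention: the last category becomes the all-zeros vector.
  def oneHotDropLast(index: Int, numCategories: Int): Vector[Double] = {
    require(index >= 0 && index < numCategories, "index out of range")
    Vector.tabulate(numCategories - 1)(i => if (i == index) 1.0 else 0.0)
  }

  // Concatenate the blocks in order into one flat feature vector.
  def assemble(blocks: Vector[Double]*): Vector[Double] = blocks.toVector.flatten

  // Sparse view: (size, indices of the non-zero entries, the non-zero values).
  def toSparse(dense: Vector[Double]): (Int, Vector[Int], Vector[Double]) = {
    val nonZero = dense.zipWithIndex.filter { case (v, _) => v != 0.0 }
    (dense.length, nonZero.map(_._2), nonZero.map(_._1))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical row: string category 1 of 3, integer category 0 of 4,
    // feature3 = 1.0, feature4 = 206.0.
    val features = assemble(
      oneHotDropLast(1, 3), // slots 0-1: string category
      oneHotDropLast(0, 4), // slots 2-4: integer category
      Vector(1.0),          // slot 5: feature3
      Vector(206.0)         // slot 6: feature4
    )
    println(toSparse(features))
  }
}
```

Under these assumptions every one-hot slot and every numeric column becomes one position in the same flat vector, which is why feature3 and feature4 show up as sibling entries next to the encoded categories.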

Here is my code:

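import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.types.DoubleType

// Map each distinct string code to a numeric index (most frequent label first).
val indexer = new StringIndexer()
    .setInputCol("stringFeatureCode")
    .setOutputCol("stringFeatureCodeIndex")
    .fit(data)
val indexed = indexer.transform(data)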

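// One-hot encode the string index into a sparse vector column.
// Note: on Spark 3.x OneHotEncoder is an Estimator, so call .fit(indexed) before .transform.
val encoder = new OneHotEncoder()
    .setInputCol("stringFeatureCodeIndex")
    .setOutputCol("stringFeatureCodeVec")

var encoded = encoder.transform(indexed)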

// Cast the integer code to Double in place before one-hot encoding it.
encoded = encoded.withColumn("intFeatureCode", encoded.col("intFeatureCode").cast(DoubleType))

val intFeatureCodeEncoder = new OneHotEncoder() 
    .setInputCol("intFeatureCode") 
    .setOutputCol("intFeatureCodeVec") 

encoded = intFeatureCodeEncoder.transform(encoded) 

// Concatenate the encoded vectors and the numeric columns into one feature vector.
val assemblerDeparture = new VectorAssembler()
    .setInputCols(
        Array("stringFeatureCodeVec", "intFeatureCodeVec", "feature3", "feature4"))
    .setOutputCol("features")
val data2 = assemblerDeparture.transform(encoded)

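// Hold out 30% of the rows for testing.
val Array(trainingData, testData) = data2.randomSplit(Array(0.7, 0.3))

// Configure the regressor; it still has to be trained with rf.fit(trainingData).
val rf = new RandomForestRegressor()
    .setLabelCol("actualVal")
    .setFeaturesCol("features")
    .setNumTrees(100)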

Answers

1
  • In general, this is a recommended approach.
  • When working with tree models it is unnecessary and should be avoided. You can use StringIndexer alone.
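
A minimal sketch of what "StringIndexer alone" could look like for a tree model, assuming the column names from the question and a Spark 2.x or later ML API. The object name, the toy rows, and the `setMaxBins(32)` value are assumptions made for illustration, not part of the original answer. The idea is that StringIndexer attaches nominal metadata to its output column, VectorAssembler carries that metadata into the feature vector, and the tree learner then treats those positions as categorical rather than continuous.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.SparkSession

object IndexOnlyForestSketch {
  // Index both categorical columns, assemble, and fit a forest (no OneHotEncoder).
  def buildPipeline(): Pipeline = {
    val stringIndexer = new StringIndexer()
      .setInputCol("stringFeatureCode")
      .setOutputCol("stringFeatureCodeIndex")

    // StringIndexer also accepts numeric input; it casts the values to strings.
    val intIndexer = new StringIndexer()
      .setInputCol("intFeatureCode")
      .setOutputCol("intFeatureCodeIndex")

    // The index columns go into the vector directly and keep the nominal
    // metadata that StringIndexer attached to them.
    val assembler = new VectorAssembler()
      .setInputCols(Array(
        "stringFeatureCodeIndex", "intFeatureCodeIndex", "feature3", "feature4"))
      .setOutputCol("features")

    val rf = new RandomForestRegressor()
      .setLabelCol("actualVal")
      .setFeaturesCol("features")
      .setNumTrees(100)
      .setMaxBins(32) // assumed value; must be >= the largest category count

    new Pipeline().setStages(Array(stringIndexer, intIndexer, assembler, rf))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("index-only-forest-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy rows invented for illustration only.
    val data = Seq(
      ("a", 10, 1.0, 206.0, 990000.0),
      ("b", 20, 1.0, 172.0, 210000.0),
      ("c", 10, 1.0, 239.0, 4600000.0),
      ("a", 20, 0.0, 180.0, 350000.0),
      ("b", 10, 0.0, 150.0, 275000.0),
      ("a", 10, 1.0, 220.0, 810000.0)
    ).toDF("stringFeatureCode", "intFeatureCode", "feature3", "feature4", "actualVal")

    val model = buildPipeline().fit(data)
    model.transform(data).select("prediction", "actualVal", "features").show(false)
    spark.stop()
  }
}
```

With this layout each categorical variable occupies a single position in the feature vector, so the trees can split on subsets of categories instead of on one one-hot slot at a time.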
+0

What does this mean? StringIndexer only? How do you feed the indexed columns to a decision tree? They take a single column of feature vectors... – rjurney