2014-06-28

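Apache Spark decision tree implementation issue with Java

I want to implement a simple demo of a decision tree classifier using Java and Apache Spark 1.0.0. I am basing it on http://spark.apache.org/docs/1.0.0/mllib-decision-tree.html. So far I have written the code listed below.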

The following line of code gives me an error:

org.apache.spark.mllib.tree.impurity.Impurity impurity = new org.apache.spark.mllib.tree.impurity.Entropy(); 

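Type mismatch: cannot convert from Entropy to Impurity. This is strange to me, because the class Entropy implements the Impurity interface: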

https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/mllib/tree/impurity/Entropy.html

My question is: why can't I make this assignment?

package decisionTree; 

import java.util.regex.Pattern; 

import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.JavaSparkContext; 
import org.apache.spark.api.java.function.Function; 
import org.apache.spark.mllib.linalg.Vectors; 
import org.apache.spark.mllib.regression.LabeledPoint; 
import org.apache.spark.mllib.tree.DecisionTree; 
import org.apache.spark.mllib.tree.configuration.Algo; 
import org.apache.spark.mllib.tree.configuration.Strategy; 
import org.apache.spark.mllib.tree.impurity.Gini; 
import org.apache.spark.mllib.tree.impurity.Impurity; 

import scala.Enumeration.Value; 

public final class DecisionTreeDemo {

    static class ParsePoint implements Function<String, LabeledPoint> {
        private static final Pattern COMMA = Pattern.compile(",");
        private static final Pattern SPACE = Pattern.compile(" ");

        @Override
        public LabeledPoint call(String line) {
            String[] parts = COMMA.split(line);
            double y = Double.parseDouble(parts[0]);
            String[] tok = SPACE.split(parts[1]);
            double[] x = new double[tok.length];
            for (int i = 0; i < tok.length; ++i) {
                x[i] = Double.parseDouble(tok[i]);
            }
            return new LabeledPoint(y, Vectors.dense(x));
        }
    }

    public static void main(String[] args) throws Exception {

        if (args.length < 1) {
            System.err.println("Usage:DecisionTreeDemo <file>");
            System.exit(1);
        }

        JavaSparkContext ctx = new JavaSparkContext("local[4]", "Log Analizer",
                System.getenv("SPARK_HOME"),
                JavaSparkContext.jarOfClass(DecisionTreeDemo.class));

        JavaRDD<String> lines = ctx.textFile(args[0]);
        JavaRDD<LabeledPoint> points = lines.map(new ParsePoint()).cache();

        int iterations = 100;

        int maxBins = 2;
        int maxMemory = 512;
        int maxDepth = 1;

        org.apache.spark.mllib.tree.impurity.Impurity impurity = new org.apache.spark.mllib.tree.impurity.Entropy();

        Strategy strategy = new Strategy(Algo.Classification(), impurity, maxDepth,
                maxBins, null, null, maxMemory);

        ctx.stop();
    }
}

@samthebest If I remove the impurity variable and inline the call like this:

Strategy strategy = new Strategy(Algo.Classification(), new org.apache.spark.mllib.tree.impurity.Entropy(), maxDepth, maxBins, null, null, maxMemory);

the error changes to: The constructor Entropy() is undefined.

[Edit] I found what I believe is the correct way to call the method (https://issues.apache.org/jira/browse/SPARK-2197):

Strategy strategy = new Strategy(Algo.Classification(), new Impurity() {
    @Override
    public double calculate(double arg0, double arg1, double arg2) {
        return Gini.calculate(arg0, arg1, arg2);
    }

    @Override
    public double calculate(double arg0, double arg1) {
        return Gini.calculate(arg0, arg1);
    }
}, 5, 100, QuantileStrategy.Sort(), null, 256);

Unfortunately, I ran into that bug :(
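One possible way to sidestep the call into the Scala object, sketched here under assumptions: compute the two-class impurity in plain Java and call that from the anonymous Impurity instead of Gini.calculate. This assumes the two-argument calculate receives the counts of the two labels, which is my reading of the 1.0.0 API and should be checked against the Spark source. The class BinaryImpurity and its methods are made up for illustration and are not part of Spark.

```java
// Hypothetical helper, not a Spark class: two-class impurity measures
// computed from raw label counts (assumed meaning of the two arguments).
final class BinaryImpurity {
    private BinaryImpurity() {}

    /** Gini impurity 1 - f0^2 - f1^2, where f0 and f1 are class frequencies. */
    static double gini(double c0, double c1) {
        double total = c0 + c1;
        if (total == 0) {
            return 0.0; // no samples: treat the node as pure
        }
        double f0 = c0 / total;
        double f1 = c1 / total;
        return 1.0 - f0 * f0 - f1 * f1;
    }

    /** Entropy in bits: -f0*log2(f0) - f1*log2(f1); 0 when either class is empty. */
    static double entropy(double c0, double c1) {
        if (c0 == 0 || c1 == 0) {
            return 0.0; // a single-class (or empty) node is pure
        }
        double total = c0 + c1;
        double f0 = c0 / total;
        double f1 = c1 / total;
        return -(f0 * log2(f0)) - (f1 * log2(f1));
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }
}
```

With this in place, the two-argument calculate in the anonymous class could return BinaryImpurity.gini(arg0, arg1). The three-argument overload is left out of the sketch because I am not certain what it should return for classification.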


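Odd. Try inlining it instead of assigning it to a variable; after all, you only use the variable once. I would also really recommend using the Scala API rather than the Java one: you can do the whole thing in a few lines of code, and it will be much easier to read. – samthebest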

Answers

A Java solution for bug 2197 is now available via this pull request:

Other improvements to Decision Trees for easy-of-use with Java:
* impurity classes: Added instance() methods to help with Java interface.
* Strategy: Added Java-friendly constructor
  --> Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently. I suspect we will redo the API before the other options are included.

You can see a complete example that uses the instance() method of the Gini impurity here:

Strategy strategy = new Strategy(Algo.Classification(), Gini.instance(), maxDepth, numClasses, maxBins, categoricalFeaturesInfo);
DecisionTreeModel model = DecisionTree$.MODULE$.train(rdd.rdd(), strategy);
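For readers wondering why `new Entropy()` fails while `Gini.instance()` and `DecisionTree$.MODULE$` work, here is a simplified plain-Java model of how a Scala `object` typically looks from Java. It is a sketch, not Spark code: the types below are hypothetical stand-ins that only borrow the names Impurity and Gini, and the real compiler output has more detail. The singleton lives in a class whose name ends in `$` and is reachable through its static `MODULE$` field, while the class without the `$` only carries static forwarders, so it neither implements the interface nor offers a usable constructor.

```java
// Stand-in for the Impurity trait (hypothetical, reduced to one method).
interface Impurity {
    double calculate(double c0, double c1);
}

// Roughly what Java sees for `object Gini extends Impurity`:
// the actual singleton is an instance of the class named Gini$.
final class Gini$ implements Impurity {
    public static final Gini$ MODULE$ = new Gini$();

    private Gini$() {}

    @Override
    public double calculate(double c0, double c1) {
        double total = c0 + c1;
        if (total == 0) {
            return 0.0;
        }
        double f0 = c0 / total;
        double f1 = c1 / total;
        return 1.0 - f0 * f0 - f1 * f1;
    }

    // Java-friendly accessor, analogous to the instance() methods
    // mentioned in the pull request description.
    public Gini$ instance() {
        return this;
    }
}

// The class without the `$`: static forwarders only. It does not implement
// Impurity and has no public constructor, which matches both errors
// reported in the question.
final class Gini {
    private Gini() {}

    public static double calculate(double c0, double c1) {
        return Gini$.MODULE$.calculate(c0, c1);
    }

    public static Gini$ instance() {
        return Gini$.MODULE$.instance();
    }
}

final class ScalaObjectDemo {
    private ScalaObjectDemo() {}

    // Both routes hand back the same singleton, typed as Impurity.
    static Impurity viaModule() {
        return Gini$.MODULE$;
    }

    static Impurity viaInstance() {
        return Gini.instance();
    }
}
```

In this model, `Impurity i = new Gini();` would not compile, whereas `Gini$.MODULE$` and `Gini.instance()` both yield a value that can be passed wherever an Impurity is expected.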