ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 20) 
scala.MatchError: [0.0,(20,[0,5,9,17],[0.6931471805599453,0.6931471805599453,0.28768207245178085,1.3862943611198906])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema) 

I am seeing this error in my Scala program, in which I am trying to classify movie reviews with a NaiveBayes classifier. The error occurs while training the NaiveBayes classifier. I cannot fix it because I do not know what data type the classifier expects: the NaiveBayes documentation says it expects an RDD of LabeledPoint, which is what I am passing. Any help would be appreciated. Please find below my complete Scala code for this movie review classification program. The error is a scala.MatchError.

PS: Please ignore any indentation errors that may appear in the code; it is correct in my program file. Thanks in advance.

import org.apache.spark.sql.{Dataset, DataFrame, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer, PCA}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg._

// Reading the csv file into a DataFrame
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.option("header", "true").option("delimiter", ",").option("inferSchema", "true").csv("movie-pang02.csv")

// Tokenizing the data by splitting the text into words
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(df)

// Hashing the data by converting the words into rawFeatures
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(200)
val featurizedData = hashingTF.transform(wordsData)

// Applying the IDF estimator, which rescales the raw term frequencies into the "features" column
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)

// Converting the "Pos"/"Neg" class column into a numeric label column
val coder: (String => Int) = (arg: String) => { if (arg == "Pos") 1 else 0 }
val sqlfunc = udf(coder)
val new_set = rescaledData.withColumn("label", sqlfunc(col("class")))

val EntireDataRdd = new_set.select("label", "features").map {
  case Row(label: Int, features: Vector) => LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
}

// Converting the data into RDD[LabeledPoint] format so as to feed it into the built-in Naive Bayes classifier
val labeled = EntireDataRdd.rdd
val Array(trainingData, testData) = labeled.randomSplit(Array(0.7, 0.3), seed = 1234L)
// Error occurs in the following statement
val model = NaiveBayes.train(trainingData, lambda = 1.0, modelType = "multinomial")

val predictionAndLabel = testData.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / testData.count()
val testErr = predictionAndLabel.filter(r => r._1 != r._2).count.toDouble / testData.count()

1 Answer

This is a painful (and not uncommon) pitfall: you are matching against the wrong Vector class. It should be org.apache.spark.ml.linalg.Vector and not org.apache.spark.mllib.linalg.Vector... (yes, frustrating!)

Adding the correct imports before the mapping solves the problem:

import org.apache.spark.ml.linalg.Vector // and not org.apache.spark.mllib.linalg.Vector! 
import org.apache.spark.mllib.linalg.Vectors // and not org.apache.spark.ml.linalg.Vectors! 

val EntireDataRdd = new_set.select("label","features").map { 
    case Row(label: Int, features: Vector) => LabeledPoint(label.toDouble, Vectors.dense(features.toArray)) 
} 
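As a side note, since every feature transformer used here (Tokenizer, HashingTF, IDF) already comes from the ml package, another option is to stay on the DataFrame API end to end and skip the RDD conversion entirely. Here is a minimal sketch, assuming Spark 2.x, where org.apache.spark.ml.classification.NaiveBayes accepts a DataFrame with numeric "label" and "features" columns:

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.sql.functions.col

// spark.ml expects a Double label column; the udf above produced an Int.
val data = new_set.select(col("label").cast("double").as("label"), col("features"))
val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)

// lambda = 1.0 in the mllib API corresponds to setSmoothing(1.0) here.
val model = new NaiveBayes()
  .setSmoothing(1.0)
  .setModelType("multinomial")
  .fit(train)

// transform() appends a "prediction" column that can be compared to "label".
val predictions = model.transform(test)
val accuracy = predictions.filter(col("prediction") === col("label")).count().toDouble / test.count()

This avoids mixing the two linear-algebra packages altogether, which is what triggered the MatchError in the first place.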

"This is a painful (and not uncommon) pitfall." Couldn't have put it better. However, I have tried this, and this is the error I get: found: org.apache.spark.ml.linalg.Vector required: org.apache.spark.mllib.linalg.Vector val EntireDataRdd = new_set.select("label", "features").map { case Row(label: Int, features: Vector) => LabeledPoint(label.toDouble, Vectors.dense(features.toArray)) }


No need to feel indebted forever; accepting the answer is enough :)


I have tried this, and this is the error I get: found: org.apache.spark.ml.linalg.Vector required: org.apache.spark.mllib.linalg.Vector. The error occurs on the following line: val EntireDataRdd = new_set.select("label", "features").map { case Row(label: Int, features: Vector) => LabeledPoint(label.toDouble, Vectors.dense(features.toArray)) }
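That compile error suggests the Vectors name is still resolving to the ml package (so Vectors.dense returns an ml vector, while mllib's LabeledPoint needs an mllib vector). One way to sidestep the import ambiguity entirely is a sketch along these lines, assuming Spark 2.x, where mllib's Vectors.fromML conversion is available; the renamed OldVectors alias is just for disambiguation:

import org.apache.spark.ml.linalg.Vector                      // the type the IDF stage actually produces
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}  // unambiguously the mllib Vectors object
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Row

// Drop to the RDD first (no Dataset encoder needed for LabeledPoint),
// pattern-match on the ml Vector, then convert it to the mllib Vector
// that mllib's LabeledPoint and NaiveBayes.train expect.
val entireDataRdd = new_set.select("label", "features").rdd.map {
  case Row(label: Int, features: Vector) =>
    LabeledPoint(label.toDouble, OldVectors.fromML(features))
}

Calling .rdd before .map also matches what the rest of the program does (it already converts to an RDD before randomSplit), so the downstream training code can stay unchanged.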