2016-12-05

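I am new to Scala and Spark. I am using the regression code below (based on the example on the Spark official site), and the mean squared error (MSE) it returns is a huge number: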

import org.apache.spark.mllib.regression.LabeledPoint 
import org.apache.spark.mllib.regression.LinearRegressionModel 
import org.apache.spark.mllib.regression.LinearRegressionWithSGD 
import org.apache.spark.mllib.linalg.Vectors 

// Load and parse the data 
val data = sc.textFile("Year100") 
val parsedData = data.map { line => 
    val parts = line.split(',') 
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble))) 
}.cache() 

// Building the model 
val numIterations = 100 
val stepSize = 0.00000001 
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate model on training examples and compute training error 
val valuesAndPreds = parsedData.map { point => 
    val prediction = model.predict(point.features) 
    (point.label, prediction) 
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean() 
println("training Mean Squared Error = " + MSE) 

I am using the dataset that can be seen here: Pastebin link

So my question is: why is the MSE equal to 889717.74 (which is a huge number)?

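Edit: as the commenters suggested, I tried the following:

1) I changed the step size to the default, and the MSE is now returned as NaN.

2) If I try this call: LinearRegressionWithSGD.train(parsedData, numIterations, stepSize, intercept = True), spark-shell returns an error (error: not found: value True).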
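
A note on the error in 2): Scala's boolean literals are lowercase (`true` / `false`), which is why `True` is reported as not found. As far as I know, the static `LinearRegressionWithSGD.train` overloads do not take an intercept argument either; the intercept is enabled with `setIntercept` on an instance. Below is a minimal sketch, assuming Spark 1.x/2.x MLlib and a spark-shell session. The helper name `trainWithIntercept` is hypothetical and not part of Spark.

```scala
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of Spark): trains a linear model with an intercept term.
// Assumes Spark 1.x/2.x, where the no-argument constructor is public
// (it is reported as deprecated in Spark 2.x).
def trainWithIntercept(
    data: RDD[LabeledPoint],
    numIterations: Int,
    stepSize: Double): LinearRegressionModel = {
  val algorithm = new LinearRegressionWithSGD()
  algorithm.setIntercept(true) // lowercase `true`, not `True`
  algorithm.optimizer
    .setNumIterations(numIterations)
    .setStepSize(stepSize)
  algorithm.run(data)
}
```

In the code above this would be called as `trainWithIntercept(parsedData, numIterations, stepSize)` in place of `LinearRegressionWithSGD.train(...)`.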

Possible duplicate of [pyspark Linear Regression Example from official documentation - Bad results?](http://stackoverflow.com/questions/33842982/pyspark-linear-regression-example-from-official-documentation-bad-results) – 2016-12-05 22:24:25

Answer

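You passed a tiny step size and limited the number of iterations to 100. Roughly speaking, the most your parameters can change is on the order of 0.00000001 * 100 = 0.000001 (times the size of the gradient). Try using the default step size; I think that will fix it.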
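
To illustrate the effect of the step size, here is a small sketch in plain Scala (no Spark needed). It is not MLlib's implementation: it uses made-up toy data (y = 2x with large, unscaled feature values), a single weight without intercept, and full-batch gradient descent with a constant step, whereas MLlib's SGD, as far as I know, shrinks the step as the iterations progress. All names (`StepSizeSketch`, `fit`, `mse`) are hypothetical.

```scala
object StepSizeSketch {
  // Toy data (an assumption for illustration): y = 2 * x with large, unscaled features.
  val xs: Vector[Double] = Vector(100.0, 200.0, 300.0)
  val ys: Vector[Double] = xs.map(_ * 2.0)

  // Mean squared error of the one-weight model y ≈ w * x.
  def mse(w: Double): Double =
    xs.zip(ys).map { case (x, y) => math.pow(y - w * x, 2) }.sum / xs.size

  // Full-batch gradient descent on the MSE with a constant step size, starting from w = 0.
  def fit(stepSize: Double, numIterations: Int): Double = {
    var w = 0.0
    for (_ <- 1 to numIterations) {
      // d(MSE)/dw = (2 / n) * sum((w * x - y) * x)
      val gradient = 2.0 * xs.zip(ys).map { case (x, y) => (w * x - y) * x }.sum / xs.size
      w -= stepSize * gradient
    }
    w
  }

  def main(args: Array[String]): Unit = {
    // Tiny step, moderate step, and a step of 1.0.
    for (step <- Seq(1e-8, 1e-5, 1.0)) {
      val w = fit(step, 100)
      println(s"stepSize=$step weight=$w MSE=${mse(w)}")
    }
  }
}
```

On this toy data, the step of 1e-8 leaves the weight far from 2 after 100 iterations (so the MSE stays large), 1e-5 converges, and 1.0 overshoots until the weight overflows and becomes NaN. That is one plausible reason why a larger step size can produce NaN on unscaled data; the right step size depends on the scale of the features.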