
We are using Apache Spark 1.6, Scala 2.10.5, and SBT 0.13.9, and we get an error when executing an Apache Spark ML pipeline.

When executing a simple pipeline:

def buildPipeline(): Pipeline = {
    // Tokenize the "Summary" column into a list of words
    val tokenizer = new Tokenizer()
    tokenizer.setInputCol("Summary")
    tokenizer.setOutputCol("LemmatizedWords")

    // Hash the tokenized words into term-frequency feature vectors
    val hashingTF = new HashingTF()
    hashingTF.setInputCol(tokenizer.getOutputCol)
    hashingTF.setOutputCol("RawFeatures")

    val pipeline = new Pipeline()
    pipeline.setStages(Array(tokenizer, hashingTF))
    pipeline
}
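
For reference, a minimal sketch of how such a pipeline might be fit; the SQLContext, the sample data, and every column name other than "Summary" are assumptions, not the asker's code:

// Hypothetical usage: fit the pipeline on a small DataFrame with a "Summary" string column
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)        // sc: an existing SparkContext
import sqlContext.implicits._

val df = sqlContext.createDataFrame(Seq(
    (1, "spark ml pipelines chain transformers"),
    (2, "tokenize then hash the words")
)).toDF("id", "Summary")

val model = buildPipeline().fit(df)        // returns a PipelineModel
model.transform(df).select("LemmatizedWords", "RawFeatures").show()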

When the ML pipeline's fit method is executed, we get the error below. Any input on what might be going on would be helpful.

**java.lang.RuntimeException: error reading Scala signature of org.apache.spark.mllib.linalg.Vector: value linalg is not a package** 

[error] org.apache.spark.ml.feature.HashingTF$$typecreator1$1.apply(HashingTF.scala:66) 
[error] org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:642) 

[error] org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30) 
[error] org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:630) 
[error] org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) 
[error] org.apache.spark.sql.functions$.udf(functions.scala:2576) 
[error] org.apache.spark.ml.feature.HashingTF.transform(HashingTF.scala:66) 
[error] org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:297) 
[error] org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:297) 
[error] org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:297) 
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) 
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) 

build.sbt

scalaVersion in ThisBuild := "2.10.5" 
scalacOptions := Seq("-unchecked", "-deprecation", "-encoding", "utf8") 

val sparkV = "1.6.0" 
val sprayV = "1.3.2" 
val specs2V = "2.3.11" 
val slf4jV = "1.7.5" 
val grizzledslf4jV = "1.0.2" 
val akkaV = "2.3.14" 

libraryDependencies in ThisBuild ++= { Seq(
    ("org.apache.spark" %% "spark-mllib" % sparkV) % Provided, 
    ("org.apache.spark" %% "spark-core" % sparkV) % Provided, 
    "com.typesafe.akka" %% "akka-actor" % akkaV, 
    "io.spray" %% "spray-can" % sprayV, 
    "io.spray" %% "spray-routing" % sprayV, 
    "io.spray" %% "spray-json" % sprayV, 
    "io.spray" %% "spray-testkit" % "1.3.1" % "test", 
    "org.specs2" %% "specs2-core" % specs2V % "test", 
    "org.specs2" %% "specs2-mock" % specs2V % "test", 
    "org.specs2" %% "specs2-junit" % specs2V % "test", 
    "org.slf4j" % "slf4j-api" % slf4jV, 
    "org.clapper" %% "grizzled-slf4j" % grizzledslf4jV 
) } 
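
Since the stack trace goes through org.apache.spark.sql.catalyst, it is worth making sure spark-sql resolves to the same Spark version as the other modules. A sketch of declaring it explicitly (an assumption, not part of the original build; the comment below notes that adding it alone did not help):

// Hypothetical: pin spark-sql to the same Spark version as spark-core/spark-mllib
libraryDependencies in ThisBuild += ("org.apache.spark" %% "spark-sql" % sparkV) % Provided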

Thanks for taking the time to look into this. Adding spark-sql had no effect. On the other hand, the problem does not seem to occur if the pipeline is run outside of a Futures context. Any idea why that might be the case? – Krys
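
For context, a hypothetical sketch of what running the fit inside a Future might look like; the names and the use of the global execution context are assumptions, not the asker's actual code:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical: fit the pipeline off the calling thread, as the
// scala.concurrent.impl.Future frames in the stack trace suggest is happening here
val fitted: Future[org.apache.spark.ml.PipelineModel] = Future {
    buildPipeline().fit(trainingDf)   // trainingDf: an assumed DataFrame with a "Summary" column
}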


I don't think the Future itself is the problem; it is more likely the execution context. Could you add some explanation of how you use it? An MCVE, perhaps? – zero323


How do you run this example? Could it be inside sbt console? Provided libraries are not included there. –

Answer


You should try using org.apache.spark.ml.linalg.Vector and org.apache.spark.ml.linalg.Vectors instead of the org.apache.spark.mllib.linalg.Vectors you are using now.

Hope this fixes your problem.
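
A sketch of the imports the answer is suggesting; note that the org.apache.spark.ml.linalg package only exists from Spark 2.0 onward, so this assumes a 2.x build on the classpath:

// Suggested (assumes Spark 2.0+, where the ml.linalg package is available)
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.linalg.Vectors

// instead of the currently used
// import org.apache.spark.mllib.linalg.Vectors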