
I want to run Spark locally and read a local Windows file in Apache Spark. My environment is:

  1. Eclipse Luna, pre-built with Scala support.
  2. Created a project, converted it to Maven, and added the Spark Core dependency jar.
  3. Downloaded WinUtils.exe and set the HADOOP_HOME path (see the sketch after this list).
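
If HADOOP_HOME is not picked up from the environment, a common workaround is to point the JVM at the folder containing WinUtils.exe before creating the SparkContext. A minimal sketch, assuming a hypothetical layout with WinUtils.exe under C:\hadoop\bin:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical layout: C:\hadoop\bin\winutils.exe; adjust to the actual location.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val conf = new SparkConf().setAppName("HelloWorld").setMaster("local[2]")
val sc = new SparkContext(conf)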

The code I am trying to run is:

import org.apache.spark.{SparkConf, SparkContext}

object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")

    /*
    val master = args.length match {
      case x: Int if x > 0 => args(0)
      case _ => "local"
    }
    val sc = new SparkContext(master, "BasicMap", System.getenv("SPARK_HOME"))
    */

    // Run locally with two threads and 1g of executor memory.
    val conf = new SparkConf()
      .setAppName("HelloWorld")
      .setMaster("local[2]")
      .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)

    val input = sc.textFile("C://Users//user name//Downloads//error.txt")
    // Split each line into words.
    val words = input.flatMap(line => line.split(" "))
    // Transform into (word, 1) pairs and count.
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    counts.foreach(println)
  }
}

But when I read the file with the SparkContext, I get the following error:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/Downloads/error.txt 
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) 
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) 
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) 
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) 
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) 
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) 
at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65) 
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290) 
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290) 
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) 
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) 
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) 
at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:289) 
at com.examples.HelloWorld$.main(HelloWorld.scala:23) 
at com.examples.HelloWorld.main(HelloWorld.scala) 

Can someone give me some insight into how to overcome this error?

+0

Do you have cygwin on your path? – abalcerek

+0

@user52045 No, I don't have cygwin. – Satya

+0

I'm pretty sure you need it. – abalcerek

Answers

0

The problem was that the user name had a space in it, which caused all the trouble. Once I moved the file to a path without spaces, it worked fine.
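
A minimal sketch of that fix, assuming the file is copied to a hypothetical space-free location such as C:/data/error.txt:

// No spaces in any path component; "file:///C:/data/error.txt" also works.
val input = sc.textFile("C:/data/error.txt")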

0

It worked for me on W10 with Spark 2, using SparkSession.builder() with .config("spark.sql.warehouse.dir", "file:///"),

and writing the path with \ ...

PS: be sure to give the file name with its full extension.
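
A minimal sketch of this approach, assuming Spark 2.x and a hypothetical local file C:\data\error.txt:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HelloWorld")
  .master("local[2]")
  .config("spark.sql.warehouse.dir", "file:///")  // warehouse dir set as this answer suggests
  .getOrCreate()

// Backslash-style Windows path, with the full file name and extension.
val input = spark.sparkContext.textFile("C:\\data\\error.txt")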

[local] [file] [spark2]