I am using Spark 1.1.0 on CDH 5.2.0 and am trying to make sure I can read from and write to HDFS. Writing to HDFS with Spark's saveAsNewAPIHadoopFile method is not working.
I quickly realized that .textFile and .saveAsTextFile call the old Hadoop API, which does not seem to be compatible with our HDFS version.
def testHDFSReadOld(sc: SparkContext, readFile: String) {
  // THIS WILL FAIL WITH
  // (TID 0, dl1rhd416.internal.edmunds.com): java.lang.IllegalStateException: unread block data
  // java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2420)
  sc.textFile(readFile).take(2).foreach(println)
}

def testHDFSWriteOld(sc: SparkContext, writeFile: String) {
  // THIS WILL FAIL WITH
  // (TID 0, dl1rhd416.internal.edmunds.com): java.lang.IllegalStateException: unread block data
  // java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2420)
  sc.parallelize(List("THIS", "ISCOOL")).saveAsTextFile(writeFile)
}
Moving to the new API methods fixed reading from HDFS!
def testHDFSReadNew(sc: SparkContext, readFile: String) {
  // THIS WORKS
  sc.newAPIHadoopFile(readFile, classOf[TextInputFormat], classOf[LongWritable],
    classOf[Text], sc.hadoopConfiguration).map {
      case (x: LongWritable, y: Text) => y.toString
    }.take(2).foreach(println)
}
So it seems I am making progress. The write no longer blows up like the calls above; instead it appears to be working. The only problem is that nothing ends up in the output directory except a lonely SUCCESS flag file. Even more confusing, the logs show the data being written to the _temporary directory. It looks as if the output committer never realizes it needs to move the files from the _temporary directory to the output directory.
def testHDFSWriteNew(sc: SparkContext, writeFile: String) {
  /* This will have an error message of:
     INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(dl1rhd400.internal.edmunds.com,35927)
     14/11/21 02:02:27 INFO ConnectionManager: Key not valid ? [email protected]
     14/11/21 02:02:27 INFO ConnectionManager: key already cancelled ? [email protected]
     java.nio.channels.CancelledKeyException
       at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
       at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
     However, lately it hasn't even shown errors; the symptom is that there are no part files
     in the output directory, only the success flag.
  */
  val conf = sc.hadoopConfiguration
  conf.set("mapreduce.task.files.preserve.failedtasks", "true")
  conf.set("mapred.output.dir", writeFile)
  sc.parallelize(List("THIS", "ISCOOL"))
    .map(x => (NullWritable.get, new Text(x)))
    .saveAsNewAPIHadoopFile(writeFile, classOf[NullWritable], classOf[Text],
      classOf[TextOutputFormat[NullWritable, Text]], conf)
}
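For comparison, here is a stripped-down sketch of the same write without the manual mapreduce.task.files.preserve.failedtasks and mapred.output.dir settings, since saveAsNewAPIHadoopFile already takes the output path as its first argument. The method name testHDFSWriteNewMinimal is just a placeholder of mine, not part of the original job:

import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

def testHDFSWriteNewMinimal(sc: SparkContext, writeFile: String) {
  // Same pairs as above; only the explicit Hadoop configuration keys are dropped.
  sc.parallelize(List("THIS", "ISCOOL"))
    .map(x => (NullWritable.get, new Text(x)))
    .saveAsNewAPIHadoopFile(writeFile, classOf[NullWritable], classOf[Text],
      classOf[TextOutputFormat[NullWritable, Text]])
}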
When I run locally and specify an HDFS path, the files show up in HDFS just fine. This only happens when I run on our Spark standalone cluster.
I am submitting the job like this: spark-submit --deploy-mode client --master spark://sparkmaster --class driverclass driverjar
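For completeness, the driver sets up its SparkContext roughly like this (a minimal sketch; the object name, app name, and argument handling are placeholders rather than the exact code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object DriverClass {
  def main(args: Array[String]) {
    // The master comes from spark-submit (--master spark://sparkmaster),
    // so it is not hard-coded here.
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-read-write-test"))

    val readFile  = args(0)  // an hdfs:// path to an existing file
    val writeFile = args(1)  // an hdfs:// path for the output directory

    testHDFSReadNew(sc, readFile)
    testHDFSWriteNew(sc, writeFile)

    sc.stop()
  }
}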