2017-04-06 29 views

Spark saveAsNewAPIHadoopFile yields java.io.IOException: Could not find a serializer for the Value class

I am trying to store a Java pair RDD as a Hadoop sequence file, as follows:

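// Pair RDD of HBase row keys and Put mutations (construction elided).
JavaPairRDD<ImmutableBytesWritable, Put> putRdd = ...
// Serialization frameworks Hadoop may use for the sequence file's keys and values.
config.set("io.serializations","org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
// Write the RDD as a sequence file; this is the call that fails.
putRdd.saveAsNewAPIHadoopFile(outputPath, ImmutableBytesWritable.class, Put.class, SequenceFileOutputFormat.class, config);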

But I get the following exception even though I set io.serializations:

2017-04-06 14:39:32,623 ERROR [Executor task launch worker-0] executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) 
java.io.IOException: Could not find a serializer for the Value class: 'org.apache.hadoop.hbase.client.Put'. Please ensure that the configuration 'io.serializations' is properly configured, if you're usingcustom serialization. 
    at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:1192) 
    at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1094) 
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:273) 
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:530) 
    at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.getSequenceWriter(SequenceFileOutputFormat.java:64) 
    at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:75) 
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1030) 
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1014) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:88) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
2017-04-06 14:39:32,669 ERROR [task-result-getter-0] scheduler.TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job 

Any ideas on how I can fix this?

What kind of HBase data are you writing? – Vidya

Thanks @Vidya, I have already found the fix and shared it below. – bachr

Answers

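I found the fix: apparently Put (and every other HBase mutation) has its own dedicated serializer, MutationSerialization. Neither JavaSerialization nor WritableSerialization accepts Put, so the serializer has to be registered explicitly.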

The following lines fix the problem:

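import org.apache.hadoop.hbase.mapreduce.MutationSerialization;
import org.apache.hadoop.hbase.mapreduce.ResultSerialization;

// Keep the serializations already configured and append the HBase ones.
config.setStrings("io.serializations",
    config.get("io.serializations"),
    MutationSerialization.class.getName(),
    ResultSerialization.class.getName());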
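
To see why appending MutationSerialization helps: Hadoop picks a serializer by asking each class listed in io.serializations, in order, whether it accepts the value class, and the write fails when none of them does. The sketch below is a simplified, self-contained model of that lookup, not Hadoop's actual code. Writable, Mutation, Put and Text are hypothetical stand-ins for the real types, and the accept rules are reduced to plain type checks.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Simplified model of the serializer lookup; NOT the real Hadoop/HBase classes.
class SerializationLookupSketch {

    // Hypothetical stand-ins for the real marker types.
    interface Writable {}                     // models org.apache.hadoop.io.Writable
    static abstract class Mutation {}         // models org.apache.hadoop.hbase.client.Mutation
    static class Put extends Mutation {}      // neither Writable nor Serializable
    static class Text implements Writable {}  // an ordinary Writable value

    // A named serialization together with its accept(Class) rule.
    static final class Serialization {
        final String name;
        final Predicate<Class<?>> accepts;

        Serialization(String name, Predicate<Class<?>> accepts) {
            this.name = name;
            this.accepts = accepts;
        }
    }

    static final Serialization JAVA = new Serialization(
            "JavaSerialization", c -> Serializable.class.isAssignableFrom(c));
    static final Serialization WRITABLE = new Serialization(
            "WritableSerialization", c -> Writable.class.isAssignableFrom(c));
    static final Serialization MUTATION = new Serialization(
            "MutationSerialization", c -> Mutation.class.isAssignableFrom(c));

    // Returns the name of the first configured serialization that accepts
    // the value class, or null when none of them does.
    static String findSerializer(List<Serialization> configured, Class<?> valueClass) {
        for (Serialization s : configured) {
            if (s.accepts.test(valueClass)) {
                return s.name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Serialization> original = Arrays.asList(JAVA, WRITABLE);
        List<Serialization> fixed = Arrays.asList(JAVA, WRITABLE, MUTATION);

        System.out.println(findSerializer(original, Put.class));  // null
        System.out.println(findSerializer(fixed, Put.class));     // MutationSerialization
        System.out.println(findSerializer(original, Text.class)); // WritableSerialization
    }
}
```

With only JavaSerialization and WritableSerialization in the list, nothing accepts Put, which corresponds to the IOException in the question.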
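
I ran into a very similar situation, but my type is 'JavaPairRDD', and using the classes above did not help. Any idea which one I should use? 'Result' is imported from 'org.apache.hadoop.hbase.client.Result'. – FisherCoder

'ResultSerialization' should be enough, but if I try to run 'putRdd.first()' or 'putRdd.collect()', I still see a Spark serialization exception. In my case I only want to store the data to HDFS or write it back to HBase, and for that the code above is sufficient. – bachr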
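
On the remaining exception with 'first()' and 'collect()': those actions send values back to the driver through Spark's own serializer, which by default is Java serialization and is separate from io.serializations. A rough way to check whether a value type would survive that path is to try writing it with ObjectOutputStream. The sketch below uses hypothetical stand-in classes (PlainValue, SerializableValue), not the HBase types.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Checks whether an object can be written with plain Java serialization.
class JavaSerializationCheckSketch {

    // Hypothetical stand-in for a type that does not implement Serializable.
    static class PlainValue {
        final String row;
        PlainValue(String row) { this.row = row; }
    }

    // Hypothetical stand-in for a type that does implement Serializable.
    static class SerializableValue implements Serializable {
        private static final long serialVersionUID = 1L;
        final String row;
        SerializableValue(String row) { this.row = row; }
    }

    // Returns true if writeObject succeeds, false if the object (or something
    // it references) is not Serializable.
    static boolean isJavaSerializable(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isJavaSerializable(new PlainValue("row1")));        // false
        System.out.println(isJavaSerializable(new SerializableValue("row1"))); // true
    }
}
```

If the values only ever go to HDFS or HBase through the Hadoop output format, this path is not exercised, which is consistent with the fix above being enough for that case.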
