2017-02-18

I am reading data from an RDD whose element type is com.google.gson.JsonObject. I am trying to convert it to a Dataset, but I don't know how. How can Spark convert an RDD[JsonObject] to a Dataset?

import com.google.gson.{JsonParser} 
import org.apache.hadoop.io.LongWritable 
import org.apache.spark.sql.{SparkSession} 

object tmp { 
    class people(name: String, age: Long, phone: String) 
    def main(args: Array[String]): Unit = { 
    val spark = SparkSession.builder().master("local[*]").getOrCreate() 
    val sc = spark.sparkContext 

    val parser = new JsonParser() 
    val jsonObject1 = parser.parse("""{"name":"abc","age":23,"phone":"0208"}""").getAsJsonObject() 
    val jsonObject2 = parser.parse("""{"name":"xyz","age":33}""").getAsJsonObject() 

    val pairRDD = sc.parallelize(List(
      (new LongWritable(1L), jsonObject1), 
      (new LongWritable(2L), jsonObject2) 
    )) 

    val rdd1 = pairRDD.map(element => element._2) 

    import spark.implicits._ 

    //How to create Dataset as schema People from rdd1? 
    } 
} 

Even trying to print the elements of rdd1 throws:

object not serializable (class: org.apache.hadoop.io.LongWritable, value: 1) 
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object) 
- object (class scala.Tuple2, (1,{"name":"abc","age":23,"phone":"0208"})) 

Basically, I get this RDD[(LongWritable, JsonObject)] from a BigQuery table, and I want to convert it to a Dataset so that I can apply SQL transformations.

I have intentionally left the phone field out of the second record; BigQuery does not return anything for elements with null values.
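That behaviour can be reproduced directly with Gson (a small illustrative sketch, not part of the original post): `JsonObject.get` returns null for an absent member, so a missing field maps naturally to an Option when handling records by hand:

```scala
import com.google.gson.JsonParser

val obj = new JsonParser()
  .parse("""{"name":"xyz","age":33}""")
  .getAsJsonObject()

// get() returns null when the key is absent, so wrap it in Option.
val phone: Option[String] = Option(obj.get("phone")).map(_.getAsString)
// phone is None for this record
```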

Answer

Thanks for the clarification. You need to register the classes as serializable with Kryo. The following works; I ran it in the spark-shell, so I had to stop the old context and create a new configuration that registers the Kryo classes:

import com.google.gson.{JsonParser} 
import org.apache.hadoop.io.LongWritable 
import org.apache.spark.SparkContext 

sc.stop() 

val conf = sc.getConf 
conf.registerKryoClasses(Array(classOf[LongWritable], classOf[JsonParser])) 
conf.get("spark.kryo.classesToRegister") 

val sc = new SparkContext(conf) 

val parser = new JsonParser() 
val jsonObject1 = parser.parse("""{"name":"abc","age":23,"phone":"0208"}""").getAsJsonObject() 
val jsonObject2 = parser.parse("""{"name":"xyz","age":33}""").getAsJsonObject() 

val pairRDD = sc.parallelize(List(
    (new LongWritable(1L), jsonObject1), 
    (new LongWritable(2L), jsonObject2) 
)) 


val rdd = pairRDD.map(element => element._2) 

rdd.collect() 
// res9: Array[com.google.gson.JsonObject] = Array({"name":"abc","age":23,"phone":"0208"}, {"name":"xyz","age":33}) 

val jsonstrs = rdd.map(e => e.toString).collect() 
val df = spark.read.json(sc.parallelize(jsonstrs)) 
df.printSchema 
// root 
// |-- age: long (nullable = true) 
// |-- name: string (nullable = true) 
// |-- phone: string (nullable = true) 
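To get from this DataFrame to the typed Dataset the question asks for, one option (a sketch; it assumes the question's people class is redefined as a case class, which Spark needs in order to derive an encoder, and uses Option[String] so the missing phone decodes cleanly) is:

```scala
case class People(name: String, age: Long, phone: Option[String])

import spark.implicits._
// as[People] matches columns by name, so the inferred
// (age, name, phone) column order is fine.
val ds = df.as[People]
ds.show()

// SQL transformations also work once the data is registered as a view.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 25").show()
```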
Since this creates a new Spark context, I have edited my question to give more context, in case that helps. – xstack2000
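Regarding the new context: in a standalone application (as opposed to the spark-shell) the same Kryo registration can be supplied when the session is first built, so there is no need to stop and recreate anything (a sketch using the standard spark.serializer and spark.kryo.classesToRegister settings):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  // Use Kryo and register the classes that Java serialization rejects.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.classesToRegister",
    "org.apache.hadoop.io.LongWritable,com.google.gson.JsonObject")
  .getOrCreate()
```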
