2017-06-01

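How do I print the individual elements of a nested array of objects in Spark/Scala? How do I correctly iterate over and print a Parquet file using Scala/Spark?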

{"id" : "1201", "name" : "satish", "age" : "25", "path":[{"x":1,"y":1},{"x":2,"y":2}]} 
{"id" : "1202", "name" : "krishna", "age" : "28", "path":[{"x":1.23,"y":2.12},{"x":1.23,"y":2.12}]} 

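Specifically, I want to iterate over each record and print its id, name, and age, followed by each element of path, then move on to the next record, and so forth. Assuming I have already read the Parquet file and have a DataFrame, I want to do something like the following (pseudocode):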

val records = dataframe.map {
  row => {
    val id = row.getString("id")
    val name = row.getString("name")
    val age = row.getString("age")
    println(s"${id} ${name} ${age}")
    row.getArray("path").map {
      item => {
        val x = item.getValue("x")
        val y = item.getValue("y")
        println(s"${x} ${y}")
      }
    }
  }
}

I'm not sure whether the above is the right way to go about it, but it should give you an idea of what I'm trying to do.

Answers

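import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ParquetAppendMode")
  .getOrCreate()

import spark.implicits._ // enables the $"column" syntax

// Read the sample records (one JSON object per line)
val data1 = spark.read.json("/home/sakoirala/IdeaProjects/SparkSolutions/src/main/resources/explode.json")

// explode produces one row per element of the path array
val result = data1.withColumn("path", explode($"path"))

// Pull the struct fields x and y out into top-level columns
result.withColumn("x", result("path.x"))
  .withColumn("y", result("path.y"))
  .show()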

Output:

+---+----+-------+-----------+----+----+
|age|  id|   name|       path|   x|   y|
+---+----+-------+-----------+----+----+
| 25|1201| satish|  [1.0,1.0]| 1.0| 1.0|
| 25|1201| satish|  [2.0,2.0]| 2.0| 2.0|
| 28|1202|krishna|[1.23,2.12]|1.23|2.12|
| 28|1202|krishna|[1.23,2.12]|1.23|2.12|
+---+----+-------+-----------+----+----+

You can do this entirely with the DataFrame API; there is no need to use map.

Here is how you can easily flatten your schema by projecting the fields you want to use:

val records = dataframe.select("id", "age", "path.x", "path.y")

You can then print the data with show:

records.show()
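
If you do want to print record by record, with each path element on its own line as the question describes, one option is to collect the rows to the driver and walk the nested array there. The following is only a minimal sketch under some assumptions: the object name PrintNestedRecords and the helper describe are made up for illustration, id, name, and age are assumed to be string columns as in the sample data, and the sample records are inlined through spark.read.json on a Dataset[String] (available from Spark 2.2) instead of being read from a Parquet file.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Hypothetical helper object, not part of either answer above.
object PrintNestedRecords {

  // For each record, build one line with id, name and age,
  // followed by one line per element of the nested path array.
  def describe(df: DataFrame): Seq[String] =
    df.collect().toSeq.flatMap { row =>
      val id   = row.getAs[String]("id")
      val name = row.getAs[String]("name")
      val age  = row.getAs[String]("age")
      // An array of structs comes back as a sequence of Row objects.
      val path = row.getSeq[Row](row.fieldIndex("path"))
      // getAs[Any] avoids assuming the numeric type of x and y.
      val points = path.map(p => s"${p.getAs[Any]("x")} ${p.getAs[Any]("y")}")
      s"$id $name $age" +: points
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("PrintNestedRecords")
      .getOrCreate()
    import spark.implicits._

    // Inline copy of the sample records from the question.
    val json = Seq(
      """{"id":"1201","name":"satish","age":"25","path":[{"x":1,"y":1},{"x":2,"y":2}]}""",
      """{"id":"1202","name":"krishna","age":"28","path":[{"x":1.23,"y":2.12},{"x":1.23,"y":2.12}]}"""
    )
    val dataframe = spark.read.json(json.toDS())

    describe(dataframe).foreach(println)
    spark.stop()
  }
}
```

Collecting is only reasonable when the data fits in driver memory. For large Parquet files, the DataFrame-only approaches above scale better, and printing from inside map or foreach on the executors would send the output to the executor logs in no guaranteed order.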