2016-10-21 111 views

How do I parse a JSON file with Spark?

I have a JSON file that needs to be parsed. Its format looks like this:

{"cv_id":"001","cv_parse": { "educations": [{"major": "English", "degree": "Bachelor" },{"major": "English", "degree": "Master "}],"basic_info": { "birthyear": "1984", "location": {"state": "New York"}}}} 

I need to get every field in the file. How can I get "major" out of the array? And do I have to use df.select("cv_parse.basic_info.location.province") to get the "province" field?

This is the result I want:

cv_id major   degree   birthyear state
001   English Bachelor 1984      New York
001   English Master   1984      New York

Answers


This may not be the best way to do it, but you can give it a shot.

// import the SQL functions (explode) and the implicits (the $"..." column syntax)
import org.apache.spark.sql.functions._ 
import sqlContext.implicits._ 

// read the JSON file (one JSON object per line)
val jsonDf = sqlContext.read.json("sample-data/sample.json") 

jsonDf.printSchema 

Your schema will be:

root 
|-- cv_id: string (nullable = true) 
|-- cv_parse: struct (nullable = true) 
| |-- basic_info: struct (nullable = true) 
| | |-- birthyear: string (nullable = true) 
| | |-- location: struct (nullable = true) 
| | | |-- state: string (nullable = true) 
| |-- educations: array (nullable = true) 
| | |-- element: struct (containsNull = true) 
| | | |-- degree: string (nullable = true) 
| | | |-- major: string (nullable = true) 

Now you need to explode the educations array, which produces one row per array element:

val explodedResult = jsonDf.select(
  $"cv_id",
  explode($"cv_parse.educations"),
  $"cv_parse.basic_info.birthyear",
  $"cv_parse.basic_info.location.state")

explodedResult.printSchema

Now your schema will be:

root 
|-- cv_id: string (nullable = true) 
|-- col: struct (nullable = true) 
| |-- degree: string (nullable = true) 
| |-- major: string (nullable = true) 
|-- birthyear: string (nullable = true) 
|-- state: string (nullable = true) 
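To make the effect of explode concrete, here is a minimal plain-Scala sketch with no Spark dependency. It only models the row-multiplying behaviour; it is not how Spark implements it. The case classes Education, Cv and FlatRow and the helper explodeEducations are hypothetical names introduced for illustration.

```scala
// Hypothetical case classes mirroring the JSON schema; they are not part of Spark.
case class Education(major: String, degree: String)
case class Cv(cvId: String, educations: Seq[Education], birthyear: String, state: String)
case class FlatRow(cvId: String, major: String, degree: String, birthyear: String, state: String)

// Emit one output row per array element, copying the scalar fields each time.
// As with explode, a record whose educations array is empty yields no rows.
def explodeEducations(cvs: Seq[Cv]): Seq[FlatRow] =
  cvs.flatMap { cv =>
    cv.educations.map(e => FlatRow(cv.cvId, e.major, e.degree, cv.birthyear, cv.state))
  }

// The sample record from the question (note the trailing space in "Master ").
val sample = Seq(
  Cv("001",
     Seq(Education("English", "Bachelor"), Education("English", "Master ")),
     "1984",
     "New York")
)

val rows = explodeEducations(sample)
rows.foreach(println)
```

The single input record becomes two rows, one per entry in educations, which is the same shape the exploded DataFrame has before its struct column is split into degree and major.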

Now you can select the columns:

explodedResult.select("cv_id", "birthyear", "state", "col.degree", "col.major").show 

+-----+---------+--------+--------+-------+
|cv_id|birthyear|   state|  degree|  major|
+-----+---------+--------+--------+-------+
|  001|     1984|New York|Bachelor|English|
|  001|     1984|New York| Master |English|
+-----+---------+--------+--------+-------+