
How to read a JSON file in Spark using Scala?

I want to read a JSON file in the following format:

{ 
    "titlename": "periodic", 
    "atom": [ 
     { 
      "usage": "neutron", 
      "dailydata": [ 
    { 
     "utcacquisitiontime": "2017-03-27T22:00:00Z", 
     "datatimezone": "+02:00", 
     "intervalvalue": 28128, 
     "intervaltime": 15   
    }, 
    { 
     "utcacquisitiontime": "2017-03-27T22:15:00Z", 
     "datatimezone": "+02:00", 
     "intervalvalue": 25687, 
     "intervaltime": 15   
    } 
    ] 
    } 
] 
} 

This is the line I wrote to read it:

sqlContext.read.json("user/files_fold/testing-data.json").printSchema 

But I am not getting the desired result:

root                    
    |-- _corrupt_record: string (nullable = true) 

Please help me with this.


What Spark version are you using?


[How to access sub-entities in JSON file?](https://stackoverflow.com/questions/44814926/how-to-access-sub-entities-in-json-file)

Answers


I would suggest using wholeTextFiles to read the file and applying some transformations to convert it to single-line JSON format:

// wholeTextFiles returns (path, content) pairs; drop the newlines so the
// whole document becomes one single-line JSON string
val json = sc.wholeTextFiles("/user/files_fold/testing-data.json").
    map(tuple => tuple._2.replace("\n", "").trim)

val df = sqlContext.read.json(json)

You should end up with a valid DataFrame:

+--------------------------------------------------------------------------------------------------------+---------+ 
|atom                         |titlename| 
+--------------------------------------------------------------------------------------------------------+---------+ 
|[[WrappedArray([+02:00,15,28128,2017-03-27T22:00:00Z], [+02:00,15,25687,2017-03-27T22:15:00Z]),neutron]]|periodic | 
+--------------------------------------------------------------------------------------------------------+---------+ 

and the following schema:

root 
|-- atom: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- dailydata: array (nullable = true) 
| | | |-- element: struct (containsNull = true) 
| | | | |-- datatimezone: string (nullable = true) 
| | | | |-- intervaltime: long (nullable = true) 
| | | | |-- intervalvalue: long (nullable = true) 
| | | | |-- utcacquisitiontime: string (nullable = true) 
| | |-- usage: string (nullable = true) 
|-- titlename: string (nullable = true) 
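
If you are on Spark 2.2 or later, the JSON reader's built-in `multiLine` option avoids the wholeTextFiles detour altogether. A minimal sketch, assuming a `SparkSession` named `spark` and the same file path as above:

val df = spark.read
    .option("multiLine", true)  // Spark 2.2+: accept JSON documents spanning multiple lines
    .json("/user/files_fold/testing-data.json")

df.printSchema()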

Thanks Jacek for the upvote and the corrections :)


It might have something to do with the JSON object stored in your file. Could you print it, or at least make sure it is the one you provided in the question? I am asking because I took yours and it ran just fine:

import org.apache.spark.sql.SparkSession

val json =
    """ 
    |{ 
    | "titlename": "periodic", 
    | "atom": [ 
    | { 
    |  "usage": "neutron", 
    |  "dailydata": [ 
    |  { 
    |   "utcacquisitiontime": "2017-03-27T22:00:00Z", 
    |   "datatimezone": "+02:00", 
    |   "intervalvalue": 28128, 
    |   "intervaltime": 15 
    |  }, 
    |  { 
    |   "utcacquisitiontime": "2017-03-27T22:15:00Z", 
    |   "datatimezone": "+02:00", 
    |   "intervalvalue": 25687, 
    |   "intervaltime": 15 
    |  } 
    |  ] 
    | } 
    | ] 
    |} 
    """.stripMargin 

val spark = SparkSession.builder().master("local[*]").getOrCreate() 
spark.read 
    .json(spark.sparkContext.parallelize(Seq(json))) 
    .printSchema() 
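
As a side note for Spark 2.2 and later, the `read.json(RDD[String])` overload used above is deprecated in favour of the `Dataset[String]` one; an equivalent sketch:

import org.apache.spark.sql.Encoders

// Spark 2.2+: pass the JSON string as a Dataset[String] instead of an RDD
val ds = spark.createDataset(Seq(json))(Encoders.STRING)
spark.read.json(ds).printSchema()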

From the Apache Spark SQL Docs:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object.

Hence, the whole document should go on a single line:

{ "titlename": "periodic","atom": [{ "usage": "neutron", "dailydata": [ {"utcacquisitiontime": "2017-03-27T22:00:00Z","datatimezone": "+02:00","intervalvalue": 28128,"intervaltime":15},{"utcacquisitiontime": "2017-03-27T22:15:00Z","datatimezone": "+02:00", "intervalvalue": 25687,"intervaltime": 15 }]}]} 

Then:

val jsonDF = sqlContext.read.json("file") 
jsonDF: org.apache.spark.sql.DataFrame = 
[atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, 
titlename: string] 
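
From there the nested dailydata entries can be flattened with `explode`. A minimal sketch against the schema above (the `flat` name is purely illustrative):

import org.apache.spark.sql.functions.{col, explode}

// One row per atom element, then one row per dailydata entry
val flat = jsonDF
    .select(col("titlename"), explode(col("atom")).as("a"))
    .select(col("titlename"), col("a.usage"), explode(col("a.dailydata")).as("d"))
    .select(col("titlename"), col("usage"),
        col("d.utcacquisitiontime"), col("d.datatimezone"),
        col("d.intervalvalue"), col("d.intervaltime"))

flat.show(truncate = false)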