2017-08-11 32 views
0

我想一个RDD转换为使用createDataFrame功能和定义schema一个数据帧,并存储所产生的数据帧的JSON创建从RDD一个数据帧:Pyspark - 发行使用定义的架构

df_final = sqlContext.createDataFrame(my_rdd, schema) 
df_final.write.json('/tmp/data') 

的RDD包含下列行:

{"obj1": {"name": "ABC", "dateCreatedUtc": "2017-06-23 00:00:00", "pair1": {"lat": 60.82349395751953, "lon": -8.173828125}, "pair2": {"lat": 49.16015625, "lon": 1.867676019668579}}, "obj2": {"name": "DEF", "pair1": {"lat": "0.00", "lon": "0.00"}, "pair2": {"lat": "0.00", "lon": "0.00"}}} 
{"obj1": {"name": "GHI", "dateCreatedUtc": "2017-06-23 00:00:00", "pair1": {"lat": 10.43567890021344, "lon": -17.34675465}, "pair2": {"lat": 80.36473824, "lon": 4.557957859758945}}, "obj2": {"name": "JKL", "pair1": {"lat": "0.00", "lon": "0.00"}, "pair2": {"lat": "0.00", "lon": "0.00"}}} 
... 
... 
... 

我试图限定schema如下:

schema = StructType([ 
    StructField('obj1', MapType(StringType(), MapType(StringType(), StringType(), True), True), True), 
    StructField('obj2', MapType(StringType(), MapType(StringType(), StringType(), True), True), True) 
]) 

的代码运行正常,但是当我检查我的输出JSON文件,该行看起来如下:

{"obj1": {"name": null, "dateCreatedUtc": null, "pair1": {"lat": 60.82349395751953, "lon": -8.173828125}, "pair2": {"lat": 49.16015625, "lon": 1.867676019668579}}, "obj2": {"name": null, "pair1": {"lat": "0.00", "lon": "0.00"}, "pair2": {"lat": "0.00", "lon": "0.00"}}} 
{"obj1": {"name": null, "dateCreatedUtc": null, "pair1": {"lat": 10.43567890021344, "lon": -17.34675465}, "pair2": {"lat": 80.36473824, "lon": 4.557957859758945}}, "obj2": {"name": null, "pair1": {"lat": "0.00", "lon": "0.00"}, "pair2": {"lat": "0.00", "lon": "0.00"}}} 

TL; DR - 除了纬度和离子吸附所有字段已填入空值。

我可以看到,这意味着RDD行与我定义的模式之间的模式不匹配。然而,我无法弄清楚这个模式的问题,因为我已经照顾/认为我已经满足了嵌套的json结构。

希望有关于此的任何帮助/指针。

谢谢!

回答

0

MapType不是用混合类型表示数据的方法。示例记录的正确模式(我假设这些是Python dicts)将是:

StructType([ 
    StructField("obj1", StructType([ 
     StructField("dateCreatedUtc", StringType(), True), 
     StructField("name", StringType(), True), 
     StructField("pair1", StructType([ 
      StructField("lat", DoubleType(), True), 
      StructField("lon", DoubleType(), True) 
     ]), True), 
     StructField("pair2", StructType([ 
      StructField("lat", DoubleType(), True), 
      StructField("lon", DoubleType(), True) 
     ]), True) 
    ]), True), 
    StructField("obj2", StructType([ 
     StructField("name", StringType(), True), 
     StructField("pair1", StructType([ 
      StructField("lat", StringType(), True), 
      StructField("lon", StringType(), True) 
     ]), True), 
     StructField("pair2", StructType([ 
      StructField("lat", StringType(), True), 
      StructField("lon", StringType(), True) 
     ]), True) 
    ]), True) 
])