2017-06-10 123 views

Accessing WrappedArray elements

I have a Spark DataFrame; here is its schema:

|-- eid: long (nullable = true) 
|-- age: long (nullable = true) 
|-- sex: long (nullable = true) 
|-- father: array (nullable = true) 
| |-- element: array (containsNull = true) 
| | |-- element: long (containsNull = true) 

And a sample of the rows:

df.select(df['father']).show() 
+--------------------+ 
|              father|
+--------------------+ 
|[WrappedArray(-17...| 
|[WrappedArray(-11...| 
|[WrappedArray(13,...| 
+--------------------+ 

And the type is:

DataFrame[father: array<array<bigint>>] 

How can I access each element of the inner array, for example the -17 in the first row? I tried various things such as df.select(df['father'])(0)(0).show(), but had no luck.

Answers


A solution in Scala could look like this:

import org.apache.spark.sql.functions._ 
val data = sparkContext.parallelize("""{"eid":1,"age":30,"sex":1,"father":[[1,2]]}""" :: Nil) 
val dataframe = sqlContext.read.json(data).toDF() 

The DataFrame looks like this:

+---+---+---+--------------------+ 
|eid|age|sex|father              |
+---+---+---+--------------------+
|1  |30 |1  |[WrappedArray(1, 2)]|
+---+---+---+--------------------+ 

The solution is to index into the column twice:

dataframe.select(col("father")(0)(0) as("first"), col("father")(0)(1) as("second")).show(false) 

The output should be:

+-----+------+ 
|first|second| 
+-----+------+ 
|1    |2     |
+-----+------+ 
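
Since the question asks about reaching every element of the inner arrays rather than fixed positions, one possible alternative is to flatten the nested array with `explode`. The following is only a sketch: the local `SparkSession`, the object name `FlattenNested`, the sample row, and the column names `inner_arr` and `value` are assumptions made for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object FlattenNested {
  def main(args: Array[String]): Unit = {
    // Local session, assumed here only so the sketch is self-contained.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("flatten-nested")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample with the same shape as the answer's data:
    // father is array<array<bigint>>.
    val df = Seq((1L, Seq(Seq(1L, 2L)))).toDF("eid", "father")

    val flat = df
      // First explode: one row per inner array.
      .select(col("eid"), explode(col("father")) as "inner_arr")
      // Second explode: one row per element of each inner array.
      .select(col("eid"), explode(col("inner_arr")) as "value")

    flat.show(false)
    spark.stop()
  }
}
```

Note that `explode` produces no output row for a null or empty array, so positional indexing as shown above may be preferable when every input row has to be preserved.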

Why do you wrap your column with the `array` function? `dataframe.select($"father"(0)(0))` or `dataframe.select(col("father")(0)(0))` also works fine.


@RaphaelRoth Yes, you are right. :) Thanks.


Another Scala answer looks like this:

df.select(col("father").getItem(0) as "father_0", col("father").getItem(1) as "father_1")
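
With the schema from the question, a single `getItem` returns an inner array rather than a number. To reach an individual element such as the -17, `getItem` can be chained. The sketch below assumes a local `SparkSession`, a hypothetical object name `NestedGetItem`, and a made-up sample row; only the first value, -17, comes from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object NestedGetItem {
  def main(args: Array[String]): Unit = {
    // Local session, assumed here only so the sketch is self-contained.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nested-getItem")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical row shaped like the question's data: array<array<bigint>>.
    // The second value (4) is invented for illustration.
    val df = Seq((1L, Seq(Seq(-17L, 4L)))).toDF("eid", "father")

    df.select(
      // getItem(0) picks the first inner array; the second getItem picks an element of it.
      col("father").getItem(0).getItem(0) as "first",
      col("father").getItem(0).getItem(1) as "second"
    ).show(false)

    spark.stop()
  }
}
```

On a `Column`, `col("father")(0)(0)` and `col("father").getItem(0).getItem(0)` should select the same value, so the choice between them is mostly a matter of readability.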