2016-04-26 102 views

How to cast an Array[Struct[String,String]] column type in Hive to Array[Map[String,String]]?

I have a column in a Hive table:

Column name: filters

Data type:

|-- filters: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- name: string (nullable = true) 
| | |-- value: string (nullable = true) 

I want to get the value from this column by its corresponding name.

What I have done so far:

val sdf: DataFrame = sqlContext.sql("select * from <tablename> where id='12345'") 

val sdfFilters = sdf.select("filters").rdd.map(r => r(0).asInstanceOf[Seq[(String,String)]]).collect() 

Output: sdfFilters: Array[Seq[(String, String)]] = Array(WrappedArray([filter_RISKFACTOR,OIS.SPD.*], [filter_AGGCODE,IR]), WrappedArray([filter_AGGCODE,IR_])) 

Note: I cast to Seq because a WrappedArray cannot be converted to a Map directly.

What should I do next?

Answer

I want to get the value from this column by its corresponding name.

If you want a simple and reliable way to get all values by name, you can flatten your table using explode and filter:

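import org.apache.spark.sql.functions.explode
import sqlContext.implicits._ // enables the $"..." column syntax

case class Data(name: String, value: String)
case class Filters(filters: Array[Data])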

val df = sqlContext.createDataFrame(Seq(Filters(Array(Data("a", "b"), Data("a", "c"))), Filters(Array(Data("b", "c"))))) 
df.show() 
+--------------+
|       filters|
+--------------+
|[[a,b], [a,c]]|
|       [[b,c]]|
+--------------+

df.withColumn("filter", explode($"filters")) 
    .select($"filter.name" as "name", $"filter.value" as "value") 
    .where($"name" === "a") 
    .show() 
+----+-----+
|name|value|
+----+-----+
|   a|    b|
|   a|    c|
+----+-----+

You can also collect your data in any way you want:

val flatDf = df.withColumn("filter", explode($"filters")).select($"filter.name" as "name", $"filter.value" as "value") 
flatDf.rdd.map(r => Array(r(0), r(1))).collect() 
res0: Array[Array[Any]] = Array(Array(a, b), Array(a, c), Array(b, c)) 
flatDf.rdd.map(r => r(0) -> r(1)).groupByKey().collect() //not the best idea, if you have many values per key 
res1: Array[(Any, Iterable[Any])] = Array((b,CompactBuffer(c)), (a,CompactBuffer(b, c))) 
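
The results above are typed as Any because r(0) returns an untyped field. A minimal sketch of getting the values for one name as plain strings instead, assuming a Spark 1.6-era spark-shell where sqlContext is predefined (multi-line chains may need :paste mode). The name valuesForA is only illustrative:

```scala
import org.apache.spark.sql.functions.explode
import sqlContext.implicits._ // assumes the spark-shell's predefined sqlContext

case class Data(name: String, value: String)
case class Filters(filters: Array[Data])

val df = sqlContext.createDataFrame(Seq(
  Filters(Array(Data("a", "b"), Data("a", "c"))),
  Filters(Array(Data("b", "c")))))

// One row per (name, value) pair, as in the answer above.
val flatDf = df
  .withColumn("filter", explode($"filters"))
  .select($"filter.name" as "name", $"filter.value" as "value")

// Filter inside Spark first, then read the column with a typed getter
// (getString) rather than r(0), which returns Any.
val valuesForA: Array[String] = flatDf
  .where($"name" === "a")
  .select("value")
  .rdd
  .map(_.getString(0))
  .collect()
```

Filtering before collect() keeps the amount of data brought to the driver small. The order of the collected values is not guaranteed in general.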

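If you want to cast array[struct] to map[string, string] in order to save it to some storage later, that is a different story, and that case can be solved with a UDF. In any case, you should avoid collect() for as long as you can to keep your code scalable.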
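
The UDF approach mentioned above could look roughly like the following. This is a sketch under assumptions, not a tested recipe for every Spark version: it assumes a spark-shell with sqlContext predefined, and the names structsToMap and filtersMap are hypothetical. Note that a map keeps a single value per key, so repeated names within one array collapse to the last value.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._ // assumes the spark-shell's predefined sqlContext

case class Data(name: String, value: String)
case class Filters(filters: Array[Data])

val df = sqlContext.createDataFrame(Seq(
  Filters(Array(Data("a", "b"), Data("b", "c"))),
  Filters(Array(Data("b", "c")))))

// Hypothetical helper: each struct in the array reaches the UDF as a Row
// with fields `name` and `value`. A null array becomes an empty map.
// toMap keeps only the last value when a name repeats within one array.
val structsToMap = udf { (filters: Seq[Row]) =>
  Option(filters).getOrElse(Seq.empty[Row])
    .map(r => r.getAs[String]("name") -> r.getAs[String]("value"))
    .toMap
}

// Adds a map<string,string> column next to the original array<struct> column.
val withMap = df.withColumn("filtersMap", structsToMap($"filters"))
withMap.printSchema()
```

Because the conversion runs inside Spark, nothing is collected to the driver, which fits the advice above about avoiding collect().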

I want to get the values b and c as a sequence or array of strings rather than Row objects.

Does this work?

I have updated the answer.