Scala: merge two or more strings into an array as a JSON attribute

I have a large number of files containing JSON strings, one per line, like this:

{ "id":123, "team":"A", "etc":"...", ...} 
{ "id":124, "team":"A", "etc":"...", ...} 
{ "id":124, "team":"B", "etc":"...", ...} 
{ "id":125, "team":"A", "etc":"...", ...}

I can load them into a DataFrame in Scala.
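For reference, loading could look roughly like the sketch below. It assumes Spark 2.2 or later and builds a local session for illustration. The in-memory `lines` stand in for the real files, the `etc` values are placeholders, and the `extra` attribute is a hypothetical addition to show a line with a unique attribute. With files on disk, `spark.read.json` takes a path instead and reads one JSON object per line.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("load-json-lines")
  .getOrCreate()
import spark.implicits._

// Stand-in for the real files: one JSON object per line.
// "extra" is a hypothetical attribute that appears on only one line.
val lines = Seq(
  """{"id":123, "team":"A", "etc":"X"}""",
  """{"id":124, "team":"A", "etc":"Y"}""",
  """{"id":124, "team":"B", "etc":"Z", "extra":"only here"}""",
  """{"id":125, "team":"A", "etc":"X"}"""
).toDS()

// The inferred schema is the union of all attributes seen;
// attributes missing from a line come back as null.
val df = spark.read.json(lines)
df.printSchema()
df.show()
```

Because the schema is inferred from the data, the number of attributes does not need to be known in advance.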

Grouping by id, I would like to get this:

{ "id":123, "team":"A", "etc":"...", ...} 
{ "id":124, "team":["A","B"], "etc":"...", ...} 
{ "id":125, "team":"A", "etc":"...", ...}

How can I do this in Scala?

Note: I don't know how many sub-attributes each JSON object has. Most attributes are common to all JSON lines, but a few lines may have some unique attributes.

Source

2017-02-24 Daebarkee

Do you want to do this with Apache Spark? –

Yes! Apache Spark. – Daebarkee

If I understand correctly, you want to group by id and collect each of the other columns as a list?

Update, using a dynamic list of columns:

df: org.apache.spark.sql.DataFrame = [etc: string, id: bigint ... 1 more field] 

scala> df.show
+---+---+----+
|etc| id|team|
+---+---+----+
|  X|123|   A|
|  Y|124|   A|
|  Z|124|   B|
|  X|125|   A|
+---+---+----+

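// Outside spark-shell this needs: import org.apache.spark.sql.functions.collect_list
val grpCol = "id"
// Collect every column except the grouping column into a list, keeping the column name
val collectCols = df.columns.filter(_ != grpCol).map(c => collect_list(c).as(c)).toSeq

df.groupBy(grpCol).agg(collectCols.head, collectCols.tail: _*).show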

+---+------+------+
| id|   etc|  team|
+---+------+------+
|124|[Y, Z]|[A, B]|
|123|   [X]|   [A]|
|125|   [X]|   [A]|
+---+------+------+
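To get back to JSON lines, one option is sketched below. It is not part of the original answer: it swaps `collect_list` for `collect_set` so repeated values appear once, and it uses `toJSON` to turn each grouped row into a JSON string. The session setup and sample rows are assumptions that mirror the table above. Note that every collected attribute comes out as an array, even when it holds a single value, because a DataFrame column has one type; the mixed `"A"` / `["A","B"]` shape from the question would need extra post-processing.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_set}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("group-to-json")
  .getOrCreate()
import spark.implicits._

// Sample rows mirroring the table shown above.
val df = Seq(
  ("X", 123L, "A"),
  ("Y", 124L, "A"),
  ("Z", 124L, "B"),
  ("X", 125L, "A")
).toDF("etc", "id", "team")

val grpCol = "id"

// collect_set drops duplicates and nulls; element order is not guaranteed.
val aggCols = df.columns.filter(_ != grpCol).map(c => collect_set(col(c)).as(c))

val grouped = df.groupBy(col(grpCol)).agg(aggCols.head, aggCols.tail: _*)

// One JSON string per grouped row.
grouped.toJSON.show(false)
```

Since `collect_set` skips nulls, attributes that are missing from some lines simply contribute nothing to the array for that id.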

Source

2017-02-24 03:35:54 Traian

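Thanks. One more question, though: I don't know how many other columns there will be. Is there a clever way to call collect_list() on an unknown number of columns? – Daebarkee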

Updated to use a dynamic list of columns. – Traian

Thanks @PatRox. This works perfectly! – Daebarkee
