pyspark - merge the sets from 2 columns

I have a Spark dataframe with two columns that were each produced by collect_set. I want to combine the sets from these two columns into a single column. How should I do this? Both are sets of strings.

For example, the 2 columns formed from calling collect_set:

Fruits              | Meat 
[Apple,Orange,Pear] | [Beef, Chicken, Pork] 

How do I turn it into:

Food 
[Apple, Orange, Pear, Beef, Chicken, Pork] 

Thank you very much in advance for your help.


Please provide more information, for example the structure of the dataframe with an example –

Answers


Assuming df is

+--------------------+--------------------+ 
|              Fruits|                Meat| 
+--------------------+--------------------+ 
|[Pear, Orange, Ap...|[Chicken, Pork, B...| 
+--------------------+--------------------+ 

then

import itertools 
# chain Fruits and Meat into one list per row; collect() returns the results to the driver 
df.rdd.map(lambda x: [item for item in itertools.chain(x.Fruits, x.Meat)]).collect() 

This combines Fruits & Meat into a single list per row, i.e.

[[u'Pear', u'Orange', u'Apple', u'Chicken', u'Pork', u'Beef']] 
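
Note that collect() brings the merged lists back to the driver. If you would rather keep the result as a column of a DataFrame, a minimal sketch (assuming the same df and column names as above) could be:

# sketch: map each row to a 1-tuple so toDF can name the single merged column "Food" 
merged_df = df.rdd.map(lambda x: (list(x.Fruits) + list(x.Meat),)).toDF(["Food"]) 
merged_df.show(truncate=False) 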


Hope this helps!


Given that you have a dataframe as

+---------------------+---------------------+ 
|Fruits               |Meat                 | 
+---------------------+---------------------+ 
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]| 
+---------------------+---------------------+ 

you can write a udf function to merge the sets from the two columns into one.

import scala.collection.mutable 
import org.apache.spark.sql.functions._ 

def mergeCols = udf((fruits: mutable.WrappedArray[String], meat: mutable.WrappedArray[String]) => fruits ++ meat) 

and then call the udf function

df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(false) 

and you should have your expected final dataframe

+---------------------+---------------------+------------------------------------------+ 
|Fruits               |Meat                 |Food                                      | 
+---------------------+---------------------+------------------------------------------+ 
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]| 
+---------------------+---------------------+------------------------------------------+ 

Is this in python? I can't seem to find mutable.WrappedArray – soulless


This is all scala :) –
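
For readers looking for a pyspark version, a hypothetical equivalent of the udf above (assuming both columns are arrays of strings; mergeCols is just an illustrative name) could be:

from pyspark.sql.functions import col, udf 
from pyspark.sql.types import ArrayType, StringType 

# sketch: concatenate the two array columns into one array column 
mergeCols = udf(lambda fruits, meat: fruits + meat, ArrayType(StringType())) 
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(truncate=False) 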