2016-11-16 92 views

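I am able to merge and sort the values, but I can't work out the condition for not merging when the two values are equal. How do I merge two columns with a condition in PySpark?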

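from pyspark.sql.functions import col, concat, lit, sort_array, split

df = sqlContext.createDataFrame([("foo", "bar", "too", "aaa"), ("bar", "bar", "aaa", "foo")], ("k", "K", "v", "V"))
columns = df.columns

k = 0
for i in range(len(columns)):
    for j in range(i + 1, len(columns)):
        # Pair up columns whose names differ only by case (k/K, v/V).
        if columns[i].lower() == columns[j].lower():
            k = k + 1
            # Concatenate the pair into a new column named e.g. "k1", "v2".
            df = df.withColumn(columns[i] + str(k), concat(col(columns[i]), lit(","), col(columns[j])))
            # Split the concatenated string back into an array, then sort it.
            newdf = df.select(col("k"), split(col("c1"), r",\s*").alias("c1"))
            sortDf = newdf.select(newdf.k, sort_array(newdf.c1).alias('sorted_c1'))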
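
As a side note, the nested loop above pairs up columns whose names differ only by case. A minimal plain-Python sketch of the same pairing in a single pass could look like the following. The helper name case_insensitive_groups is hypothetical and not part of the original question, and the sketch assumes Python 3.7+ so that dictionary order follows insertion order.

```python
from collections import defaultdict


def case_insensitive_groups(columns):
    """Group column names that are equal when lower-cased (hypothetical helper)."""
    groups = defaultdict(list)
    for name in columns:
        groups[name.lower()].append(name)
    # Keep only names that actually have a case-insensitive partner.
    return [names for names in groups.values() if len(names) > 1]


pairs = case_insensitive_groups(["k", "K", "v", "V"])
print(pairs)  # [['k', 'K'], ['v', 'V']]
```

Each resulting group lists the column names that would be merged together.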

In the table below, columns k and K should be merged only for [foo, bar], not for [bar, bar].

Input:

+---+---+---+---+ 
|  k|  K|  v|  V|
+---+---+---+---+ 
|foo|bar|too|aaa| 
|bar|bar|aaa|foo| 
+---+---+---+---+ 

Output:

+---+---+---------+---------+
|  k|  K| Merged K| Merged V|
+---+---+---------+---------+
|foo|bar|[foo,bar]|[too,aaa]|
|bar|bar|      bar|[aaa,foo]|
+---+---+---------+---------+

Answer


Try:

from pyspark.sql.functions import udf 

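def merge(*c):
    # De-duplicate the values, then sort them.
    merged = sorted(set(c))
    if len(merged) == 1:
        # All values were equal: return the single value unmerged.
        return merged[0]
    else:
        return "[{0}]".format(",".join(merged))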

merge_udf = udf(merge) 

df = sqlContext.createDataFrame([("foo", "bar","too","aaa"), ("bar", "bar","aaa","foo")], ("k1", "k2" ,"v1" ,"v2")) 

df.select(merge_udf("k1", "k2"), merge_udf("v1", "v2"))
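
To see what the function returns before wrapping it in a UDF, it can be called directly on the sample rows. This is only a plain-Python check and needs no Spark session; the merge function is repeated so the snippet runs on its own, and the rows and results names are illustrative.

```python
def merge(*c):
    # Same logic as the function above: de-duplicate, then sort.
    merged = sorted(set(c))
    if len(merged) == 1:
        return merged[0]
    return "[{0}]".format(",".join(merged))


# The two sample rows from the question, as plain tuples.
rows = [("foo", "bar", "too", "aaa"), ("bar", "bar", "aaa", "foo")]
results = [(merge(k1, k2), merge(v1, v2)) for k1, k2, v1, v2 in rows]
for r in results:
    print(r)
# ('[bar,foo]', '[aaa,too]')
# ('bar', '[aaa,foo]')
```

Because the values are sorted, the first row comes out as [bar,foo] and [aaa,too] rather than in the original column order.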