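I am able to merge and sort the values, but I cannot work out the condition for not merging when the two values are equal. How can I merge two columns with a condition in PySpark?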
from pyspark.sql.functions import col, concat, lit, sort_array, split

df = sqlContext.createDataFrame([("foo", "bar", "too", "aaa"), ("bar", "bar", "aaa", "foo")], ("k", "K", "v", "V"))
columns = df.columns
k = 0
# Pair up columns whose names differ only by case and concatenate each pair
for i in range(len(columns)):
    for j in range(i + 1, len(columns)):
        if columns[i].lower() == columns[j].lower():
            k = k + 1
            df = df.withColumn(columns[i] + str(k), concat(col(columns[i]), lit(","), col(columns[j])))
# The loop above names the first merged column "k1"; split it back into an array
newdf = df.select(col("k"), split(col("k1"), r",\s*").alias("c1"))
sortDf = newdf.select(newdf.k, sort_array(newdf.c1).alias("sorted_c1"))
In the table below, columns k and K should be merged only for [foo, bar], but not for [bar, bar].
Input:
+---+---+---+---+
|  k|  K|  v|  V|
+---+---+---+---+
|foo|bar|too|aaa|
|bar|bar|aaa|foo|
+---+---+---+---+
Expected output:
+---+---+----------+----------+
|  k|  K|Merged K  |Merged V  |
+---+---+----------+----------+
|foo|bar|[foo,bar] |[too,aaa] |
|bar|bar|bar       |[aaa,foo] |
+---+---+----------+----------+
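
To make the rule in the expected output concrete, here is a minimal plain-Python sketch of it, with no Spark involved. The helper names `pair_columns` and `merge_pair` are hypothetical and introduced only for illustration. It keeps the values in their original order, as the expected output does; sorting would be a separate step.

```python
# Plain-Python sketch of the merge rule; it does not use Spark.
# pair_columns and merge_pair are hypothetical helper names.

def pair_columns(columns):
    """Return (a, b) pairs of column names that differ only by case."""
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if columns[i].lower() == columns[j].lower():
                pairs.append((columns[i], columns[j]))
    return pairs


def merge_pair(a, b):
    """Keep a single value when both are equal, else a two-item list."""
    if a == b:
        return a
    return [a, b]


columns = ["k", "K", "v", "V"]
rows = [("foo", "bar", "too", "aaa"), ("bar", "bar", "aaa", "foo")]

for row in rows:
    record = dict(zip(columns, row))
    merged = {
        "Merged " + b: merge_pair(record[a], record[b])
        for a, b in pair_columns(columns)
    }
    print(merged)
```

In PySpark the same condition could probably be expressed with `when(col(a) == col(b), ...).otherwise(...)` from `pyspark.sql.functions`. Both branches of `when`/`otherwise` need compatible types, so the equal case may have to be a one-element array (or both branches a string) rather than a bare value next to an array.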