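Avoid performance impact of a single partition mode in Spark window functions

My question is triggered by the use case of calculating the differences between consecutive rows in a Spark dataframe.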

For example, I have:

>>> df.show()
+-----+----------+
|index|      col1|
+-----+----------+
|  0.0|0.58734024|
|  1.0|0.67304325|
|  2.0|0.85154736|
|  3.0| 0.5449719|
+-----+----------+

If I choose to calculate these using the "Window" function, then I can do that like so:

>>> from pyspark.sql import Window
>>> import pyspark.sql.functions as f
>>> winSpec = Window.partitionBy(df.index >= 0).orderBy(df.index.asc())
>>> df.withColumn('diffs_col1', f.lag(df.col1, -1).over(winSpec) - df.col1).show()
+-----+----------+-----------+
|index|      col1| diffs_col1|
+-----+----------+-----------+
|  0.0|0.58734024|0.085703015|
|  1.0|0.67304325| 0.17850411|
|  2.0|0.85154736|-0.30657548|
|  3.0| 0.5449719|       null|
+-----+----------+-----------+

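Question: I explicitly partitioned the dataframe into a single partition. What is the performance impact of this, if any, why is that so, and how can I avoid it? I ask because when I do not specify a partition, I get the following warning: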

16/12/24 13:52:27 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation. 

Answer

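In practice the performance impact will be almost the same as if you omitted the partitionBy clause altogether. All records will be shuffled to a single partition, sorted locally, and iterated sequentially one by one.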

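The only difference is the total number of partitions created. Let's illustrate that with an example using a simple dataset with 10 partitions and 1000 records: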

df = spark.range(0, 1000, 1, 10).toDF("index").withColumn("col1", f.randn(42)) 

If you define a frame without a partition by clause

w_unpart = Window.orderBy(f.col("index").asc())

and use it with lag

df_lag_unpart = df.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1") 
) 

there will be only one partition in total:

df_lag_unpart.rdd.glom().map(len).collect() 
[1000] 

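Compared to that, a frame definition with a dummy index (simplified a bit compared to your code):

w_part = Window.partitionBy(f.lit(0)).orderBy(f.col("index").asc())

will use a number of partitions equal to spark.sql.shuffle.partitions: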

spark.conf.set("spark.sql.shuffle.partitions", 11) 

df_lag_part = df.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_part) - f.col("col1") 
) 

df_lag_part.rdd.glom().count() 
11 

with only one non-empty partition:

df_lag_part.rdd.glom().filter(lambda x: x).count() 
1 

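Unfortunately, there is no universal solution that can be used to address this problem in PySpark. This is just an inherent mechanism of the implementation combined with the distributed processing model.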

Since the index column is sequential, you could generate an artificial partitioning key with a fixed number of records per block:

rec_per_block = df.count() // int(spark.conf.get("spark.sql.shuffle.partitions")) 

df_with_block = df.withColumn(
    "block", (f.col("index")/rec_per_block).cast("int") 
) 

and use it to define the frame specification:

w_with_block = Window.partitionBy("block").orderBy("index") 

df_lag_with_block = df_with_block.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_with_block) - f.col("col1") 
) 

This will use the expected number of partitions:

df_lag_with_block.rdd.glom().count() 
11 

with a roughly uniform data distribution (we cannot avoid hash collisions):

df_lag_with_block.rdd.glom().map(len).collect() 
[0, 180, 0, 90, 90, 0, 90, 90, 100, 90, 270] 

but with a number of gaps on the block boundaries:

df_lag_with_block.where(f.col("diffs_col1").isNull()).count() 
12 
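
The count is 12 rather than 11 because integer division leaves a short trailing block. A minimal plain-Python sketch of that arithmetic follows. It does not use Spark, and `n_records` and `shuffle_partitions` are hypothetical names standing in for `df.count()` and the configured `spark.sql.shuffle.partitions`:

```python
from collections import Counter

n_records = 1000          # assumption: stands in for df.count()
shuffle_partitions = 11   # assumption: stands in for spark.sql.shuffle.partitions

rec_per_block = n_records // shuffle_partitions

# Same key as (index / rec_per_block).cast("int") for non-negative indices
block_sizes = Counter(index // rec_per_block for index in range(n_records))

n_blocks = len(block_sizes)

# lag() returns null for the first row of every window partition,
# so each block contributes exactly one null
expected_nulls = n_blocks
```

Under these assumptions `rec_per_block` is 90, so indices 990 to 999 form a twelfth, shorter block. Twelve block keys cannot map one-to-one onto 11 shuffle partitions, which is consistent with the uneven partition sizes shown earlier.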

Since the boundaries are easy to compute:

from itertools import chain 

boundary_idxs = sorted(chain.from_iterable(
    # Here we depend on sequential identifiers 
    # This could be generalized to any monotonically increasing 
    # id by taking min and max per block 
    (idx - 1, idx) for idx in 
    df_lag_with_block.groupBy("block").min("index") 
     .drop("block").rdd.flatMap(lambda x: x) 
     .collect()))[2:]  # The first boundary doesn't carry useful information.
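
To make the slicing concrete, here is a plain-Python sketch of the same expression. It assumes the per-block minimum indices are 0, 90, ..., 990, which is what the blocking should produce for 1000 sequential records; `block_mins` is a hypothetical stand-in for the collected Spark result:

```python
from itertools import chain

# Assumption: stand-in for the collected min("index") per block
block_mins = list(range(0, 1000, 90))

# Each block start idx contributes the pair
# (last row of the previous block, first row of this block)
pairs = sorted(chain.from_iterable((idx - 1, idx) for idx in block_mins))

# The first pair is (-1, 0): index -1 does not exist and row 0 has
# no predecessor, so both entries are dropped
boundary_idxs = pairs[2:]
```

The sort matters here because a `groupBy` result is not guaranteed to come back in block order, so slicing before sorting could drop the wrong entries.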

you can always select:

missing = df_with_block.where(f.col("index").isin(boundary_idxs)) 

and fill these separately:

# We use window without partitions here. Since number of records 
# will be small this won't be a performance issue 
# but will generate "Moving all data to a single partition" warning 
missing_with_lag = missing.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1") 
).select("index", f.col("diffs_col1").alias("diffs_fill")) 

and join:

combined = (df_lag_with_block 
    .join(missing_with_lag, ["index"], "leftouter") 
    .withColumn("diffs_col1", f.coalesce("diffs_col1", "diffs_fill"))) 

to get the desired result:

mismatched = combined.join(df_lag_unpart, ["index"], "outer").where(
    combined["diffs_col1"] != df_lag_unpart["diffs_col1"] 
) 
assert mismatched.count() == 0
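
For readers who want to check the logic without a cluster, the whole approach can be modelled in plain Python. This is only a sketch of the idea, not Spark code: `values`, `lag_diffs` and the other names are hypothetical, and dictionaries stand in for dataframes.

```python
import random

random.seed(42)
values = [random.gauss(0, 1) for _ in range(1000)]  # stands in for col1
rec_per_block = 90

def lag_diffs(indices):
    """Mimic lag("col1", 1) - col1 over one window partition ordered by index."""
    out = {}
    prev = None
    for i in sorted(indices):
        out[i] = None if prev is None else values[prev] - values[i]
        prev = i
    return out

# Reference: a single unpartitioned window over all rows
reference = lag_diffs(range(len(values)))

# Blocked windows: lag is computed independently inside each block
blocks = {}
for i in range(len(values)):
    blocks.setdefault(i // rec_per_block, []).append(i)
blocked = {}
for idxs in blocks.values():
    blocked.update(lag_diffs(idxs))

# Boundary rows: last row of each block and first row of the next one
starts = [min(idxs) for idxs in blocks.values()]
boundary = sorted(j for s in starts for j in (s - 1, s))[2:]
fill = lag_diffs(boundary)

# Equivalent of coalesce(diffs_col1, diffs_fill)
combined = {
    i: blocked[i] if blocked[i] is not None else fill.get(i)
    for i in blocked
}
```

In this model the fill step also produces values for the last row of each block, computed against the wrong predecessor, but the coalesce ignores them because the blocked result is already non-null there. Only the first row of each block takes its value from the fill.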