2017-05-24 85 views

Answers

0

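The randint function is what you need: it generates a random integer between two numbers, both ends inclusive. Pass its result to Spark's fillna function for the "age" column.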

from random import randint

# randint(14, 46) is evaluated once, so every null age gets the same value
df.fillna(randint(14, 46), subset=['age']).show()
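Because the argument to fillna is computed before the call, a single draw is reused for every missing row. A minimal plain-Python analogy of that behaviour follows; the `ages` list and the variable names are hypothetical stand-ins for the DataFrame column, not part of the original answer.

```python
from random import randint

ages = [23, None, 31, None, None]  # hypothetical "age" column with gaps

# Like fillna(randint(14, 46), 'age'): randint runs once, before the fill,
# so every missing entry receives the same number.
fill_value = randint(14, 46)
same_fill = [fill_value if a is None else a for a in ages]

# Drawing inside the loop instead gives each missing entry its own value.
per_row_fill = [randint(14, 46) if a is None else a for a in ages]

print(same_fill)
print(per_row_fill)
```

If one shared value is acceptable, the fillna approach is enough; otherwise the draw has to happen per row.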
+0

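While this code snippet may solve the question, [including an explanation](//meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. Please also try not to crowd your code with explanatory comments, as this reduces the readability of both the code and the explanations! – kayess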

1

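Mara's answer is correct if you want to replace the null values with the same random number. If you want a different random value for each age, combine coalesce with F.rand(), as shown below: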

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
from random import randint

df = sqlContext.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))

# Add two all-null integer columns to fill in
df = (df
    .withColumn("x4", F.lit(None).cast(IntegerType()))
    .withColumn("x5", F.lit(None).cast(IntegerType()))
    )

# Same random value for every null: randint is evaluated once on the driver
df.na.fill({'x4': randint(0, 100)}).show()
# A different random value per row: F.rand() is evaluated for each row
df.withColumn('x5', F.coalesce(F.col('x5'), F.round(F.rand() * 100))).show()


+---+---+-----+---+----+
| x1| x2|   x3| x4|  x5|
+---+---+-----+---+----+
|  1|  a| 23.0|  9|null|
|  3|  B|-23.0|  9|null|
+---+---+-----+---+----+
+---+---+-----+----+----+
| x1| x2|   x3|  x4|  x5|
+---+---+-----+----+----+
|  1|  a| 23.0|null|44.0|
|  3|  B|-23.0|null| 2.0|
+---+---+-----+----+----+
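The x5 values above are doubles between 0 and 100. For whole-number ages in a fixed range such as 14 to 46 (the range used in the first answer), the uniform draw can be rescaled. The sketch below checks that arithmetic in plain Python; `scale_to_int_range` is a hypothetical helper, and the Spark expression in the comment is an assumed counterpart rather than something taken from the answers.

```python
import math
import random


def scale_to_int_range(u, low, high):
    """Map u in [0, 1) to an integer in [low, high], both ends inclusive."""
    return low + math.floor(u * (high - low + 1))


# Assumed Spark counterpart for an "age" column (33 = 46 - 14 + 1):
#   F.coalesce(F.col("age"), (F.floor(F.rand() * 33) + 14).cast("int"))
samples = [scale_to_int_range(random.random(), 14, 46) for _ in range(10_000)]
print(min(samples), max(samples))
```

Using floor on a draw from [0, 1) gives each integer in the range the same probability, whereas round would make the two endpoints half as likely as the values in between.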