I'm reading a social network's JSON file into Spark. From the resulting DataFrame I explode the followers column to get (follower, account) pairs. That step works fine. Later I want to convert the result to an RDD (for GraphX), but creating the RDD takes a very long time.
val social_network = spark.read.json(my/path) // 200 MB
val exploded_network = social_network
  .withColumn("follower", explode($"followers"))
  .withColumn("id_follower", $"follower".cast("long"))
  .withColumn("id_account", $"account".cast("long"))
  .withColumn("relationship", lit(1))
  .select("id_follower", "id_account", "relationship")
val E1 = exploded_network.as[(VertexId, VertexId, Int)]
val E2 = E1.rdd
To check how each step performs, I ran a count after each one:
scala> exploded_network.count
res0: Long = 18205814 // 3 seconds
scala> E1.count
res1: Long = 18205814 // 3 seconds
scala> E2.count // 5.4 minutes
res2: Long = 18205814
Why is the RDD conversion roughly 100x slower than the DataFrame/Dataset counts?