2017-09-24 123 views
0

我有一个数据框表示图的边;这是模式:Scala-Spark:将数据帧转换为RDD [Edge]

root |-- src: string (nullable = true) 
    |-- dst: string (nullable = true) 
    |-- relationship: struct (nullable = false) 
    | |-- business_id: string (nullable = true) 
    | |-- normalized_influence: double (nullable = true) 

我想将其转换为RDD [边缘]与预凝胶API和我困难的工作是对属性的“关系”。如何转换它?

回答

1

Edge是一个参数化的类。这意味着除了源代码和目标代码之外,您还可以在每个边缘存储您喜欢的任何内容。在你的情况下,它可能是一个Edge[Relationship]。您可以使用案例类来映射数据帧和RDD[Edge[Relationship]]

import scala.util.hashing.MurmurHash3 
case class Relationship(business_id: String, normalized_influence: Double) 
case class MyEdge(src: String, dst: String, relationship: Relationship) 

val edges: RDD[Edge[Relationship]] = df.as[MyEdge].rdd.map { edge => 
    Edge(
     MurmurHash3.stringHash(edge.src).toLong, // VertexId type is a Long, so we need to hash your string 
     MurmurHash3.stringHash(edge.dst).toLong, 
     edge.relationship 
    ) 
}