2016-09-27 38 views

Spark and Scala: best way to map case classes

I am working with Spark and Scala and have a Dataset defined by a case class that looks like this:

case class Shareholders(
    business_id : String, 
    guo_name : String, 
    guo_id : String, 
    duo_name : String, 
    duo_id : String 
) 

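There are many more fields starting with "guo"/"duo". Apart from this prefix, the field names are the same (repeated).

I would like to form a case class structure that looks like this: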

case class NewShareholders(
    business_id : String, 
    repeatedFields : Seq[RepeatedShareholderFields] 
) 

case class RepeatedShareholderFields ( 
    name : String, 
    id : String, 
    `type` : String // "type" is a reserved word in Scala, hence the backticks 
) 

where type = "guo"/"duo" etc., as appropriate.

What is the best way to do this?

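I strongly recommend designing Spark `Dataset`s as if they were SQL relational tables. That is what Spark is currently optimized for. From your example so far, I cannot propose a good solution; the only thing I can say is to consult a DBA on how best to normalize your "shareholders" table. – Yawar

Answer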

import java.util.UUID

case class NewShareholder(
    businessId : String,
    // ShareholderField[_] accepts a field of any value type.
    // (ShareholderField[T forSome {type T}] would mean ShareholderField[Any],
    // which an invariant ShareholderField[String] does not conform to.)
    fields : Seq[ShareholderField[_]]
)

case class ShareholderField[T] (
    prefix : String,
    nameValue : T,
    idValue : T
)

// Now you can create your shareholders as follows:

val sh1 = NewShareholder(
    businessId = "abcde-1234",
    fields = Seq(
        ShareholderField[String]("guo", "guo-name-1", "guo-id-1"),
        ShareholderField[String]("duo", "duo-name-1", "duo-id-1"),
        ShareholderField[UUID]("luo", UUID.randomUUID(), UUID.randomUUID())
    )
)

If you know that all the values are actually `String`, you can simplify it.

case class NewShareholder(
    businessId : String,
    fields : Seq[ShareholderField]
)

case class ShareholderField (
    prefix : String,
    nameValue : String,
    idValue : String
)

// Now you can create your shareholders as follows:

val sh1 = NewShareholder(
    businessId = "abcde-1234",
    fields = Seq(
        ShareholderField("guo", "guo-name-1", "guo-id-1"),
        ShareholderField("duo", "duo-name-1", "duo-id-1"),
        ShareholderField("luo", "luo-name-1", "luo-id-1")
    )
)
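
The question asks how to get from the flat `Shareholders` rows to the nested shape, so here is a minimal sketch of that conversion, assuming all values are `String` and only the "guo" and "duo" prefixes exist. The helper name `toNewShareholder` is hypothetical and not part of the answer above; more prefixes would each need their own entry.

```scala
// Input shape from the question.
case class Shareholders(
    business_id: String,
    guo_name: String,
    guo_id: String,
    duo_name: String,
    duo_id: String
)

// Target shape, using the simplified all-String field.
case class ShareholderField(prefix: String, nameValue: String, idValue: String)
case class NewShareholder(businessId: String, fields: Seq[ShareholderField])

// Hypothetical helper (an assumption, not from the original answer):
// builds one ShareholderField per known prefix.
def toNewShareholder(s: Shareholders): NewShareholder =
  NewShareholder(
    businessId = s.business_id,
    fields = Seq(
      ShareholderField("guo", s.guo_name, s.guo_id),
      ShareholderField("duo", s.duo_name, s.duo_id)
    )
  )

// Plain Scala usage, no Spark session required.
val flat = Shareholders("abcde-1234", "guo-name-1", "guo-id-1", "duo-name-1", "duo-id-1")
val nested = toNewShareholder(flat)
```

On a `Dataset[Shareholders]` the same function could probably be applied with `ds.map(toNewShareholder)`, provided `import spark.implicits._` is in scope so that an encoder for `NewShareholder` is available.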