Spark groupBy operation hangs at 199/200

I have a Spark standalone cluster with a master and two executors. I have an RDD[LevelOneOutput], and the LevelOneOutput class is shown below:

class LevelOneOutput extends Serializable { 

    @BeanProperty 
    var userId: String = _ 

    @BeanProperty 
    var tenantId: String = _ 

    @BeanProperty 
    var rowCreatedMonth: Int = _ 

    @BeanProperty 
    var rowCreatedYear: Int = _ 

    @BeanProperty 
    var listType1: ArrayBuffer[TypeOne] = _ 

    @BeanProperty 
    var listType2: ArrayBuffer[TypeTwo] = _ 

    @BeanProperty 
    var listType3: ArrayBuffer[TypeThree] = _ 

    ... 
    ... 

    @BeanProperty 
    var listType18: ArrayBuffer[TypeEighteen] = _ 

    @BeanProperty 
    var groupbyKey: String = _ 
}

Now I want to group this RDD by userId, tenantId, rowCreatedMonth and rowCreatedYear. To do that, I did this:

val levelOneRDD = inputRDD.map(row => { 
    row.setGroupbyKey(s"${row.getTenantId}_${row.getRowCreatedYear}_${row.getRowCreatedMonth}_${row.getUserId}") 
    row 
}) 

val groupedRDD = levelOneRDD.groupBy(row => row.getGroupbyKey)

This gives me data with the key as a String and the value as an Iterable[LevelOneOutput].

Now I want to generate a single LevelOneOutput object for each group key. To do that, I was doing something like the following:

val rdd = groupedRDD.map(row => { 
    val levelOneOutput = new LevelOneOutput 
    val groupKey = row._1.split("_") 

    levelOneOutput.setTenantId(groupKey(0)) 
    levelOneOutput.setRowCreatedYear(groupKey(1).toInt) 
    levelOneOutput.setRowCreatedMonth(groupKey(2).toInt) 
    levelOneOutput.setUserId(groupKey(3)) 

    var listType1 = new ArrayBuffer[TypeOne] 
    var listType2 = new ArrayBuffer[TypeTwo] 
    var listType3 = new ArrayBuffer[TypeThree] 
    ... 
    ... 
    var listType18 = new ArrayBuffer[TypeEighteen] 

    row._2.foreach(data => {
        if (data.getListType1 != null) listType1 = listType1 ++ data.getListType1
        if (data.getListType2 != null) listType2 = listType2 ++ data.getListType2
        if (data.getListType3 != null) listType3 = listType3 ++ data.getListType3
        ...
        ...
        if (data.getListType18 != null) listType18 = listType18 ++ data.getListType18
    })

    if (listType1.isEmpty) levelOneOutput.setListType1(null) else levelOneOutput.setListType1(listType1) 
    if (listType2.isEmpty) levelOneOutput.setListType2(null) else levelOneOutput.setListType2(listType2) 
    if (listType3.isEmpty) levelOneOutput.setListType3(null) else levelOneOutput.setListType3(listType3) 
    ... 
    ... 
    if (listType18.isEmpty) levelOneOutput.setListType18(null) else levelOneOutput.setListType18(listType18) 

    levelOneOutput 
})

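This works as expected for a small input, but when I run it on a larger set of input data, the groupBy operation hangs at 199/200, and I don't see any specific error or warning in stdout/stderr.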

Can someone point out why the job does not proceed any further...

Source

2017-03-06 Prasad Khode

Instead of using the groupBy operation, I created a paired RDD like below

val levelOnePairedRDD = inputRDD.map(row => { 
    row.setGroupbyKey(s"${row.getTenantId}_${row.getRowCreatedYear}_${row.getRowCreatedMonth}_${row.getUserId}") 
    (row.getGroupbyKey, row)
})

and updated the processing logic, which solved my problem.
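
The answer does not show the updated processing logic. One common way to process a paired RDD without groupBy is to merge values pairwise with reduceByKey, so the following is only a sketch of what such logic might look like, not the author's actual code. Record is a hypothetical, simplified stand-in for LevelOneOutput (two list fields instead of eighteen, with String and Int as placeholder element types), and concat and merge are illustrative names. The example runs on plain Scala collections so it needs no cluster.

import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for LevelOneOutput, reduced to two list fields.
class Record(val tenantId: String, val userId: String) extends Serializable {
  var listType1: ArrayBuffer[String] = _
  var listType2: ArrayBuffer[Int] = _
}

object MergeSketch {
  // Concatenate two possibly-null buffers into a new one.
  // Returns null when the result is empty, mirroring the question's convention.
  def concat[T](a: ArrayBuffer[T], b: ArrayBuffer[T]): ArrayBuffer[T] = {
    val out = new ArrayBuffer[T]
    if (a != null) out ++= a
    if (b != null) out ++= b
    if (out.isEmpty) null else out
  }

  // Merge two records that share the same key into a new record.
  // The function is associative, which reduceByKey requires; the order of
  // elements inside the merged lists is not guaranteed on a cluster.
  def merge(x: Record, y: Record): Record = {
    val r = new Record(x.tenantId, x.userId)
    r.listType1 = concat(x.listType1, y.listType1)
    r.listType2 = concat(x.listType2, y.listType2)
    r
  }

  def main(args: Array[String]): Unit = {
    val a = new Record("t1", "u1"); a.listType1 = ArrayBuffer("x")
    val b = new Record("t1", "u1"); b.listType1 = ArrayBuffer("y"); b.listType2 = ArrayBuffer(1)
    val c = new Record("t1", "u1")

    // Local stand-in for what reduceByKey does to the values of one key.
    val merged = Seq(a, b, c).reduce(merge)
    println(merged.listType1)
    println(merged.listType2)
  }
}

On the paired RDD above, this would presumably be applied as levelOnePairedRDD.reduceByKey(MergeSketch.merge). Unlike groupBy, reduceByKey combines values on the map side before the shuffle and never has to materialize all values of one key as a single Iterable. Note that a key with only one record is never passed through merge, so an empty but non-null list on such a record would stay empty rather than become null.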

Source

2017-08-05 07:39:36
