2017-03-18 160 views

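Keeping person_id when reducing in Spark (Scala)

I have an RDD of (city, person_id, number) tuples, and for each city I want to find the person with the highest number. My first idea was to use reduceByKey with the city as the key (rdd.reduce((num1, num2) => Math.max(num1, num2))), but I don't know how to keep the person_id in the process.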

Answer

You need to convert your RDD into a pair RDD keyed by city; then you can use reduceByKey and keep the person with the largest number.

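// Key each record by city so reduceByKey compares people within one city
rdd.map { case (city, person_id, number) => (city, (person_id, number)) }
  .reduceByKey {
    // Keep the (person_id, number) pair with the larger number
    case ((person_id1, n1), (person_id2, n2)) =>
      if (n1 > n2) (person_id1, n1) else (person_id2, n2)
  }
  // Drop the number and keep only (city, person_id)
  .map { case (city, (person_id, _)) => (city, person_id) }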
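
To check the reduction logic without a cluster, here is a minimal sketch that runs the same steps on plain Scala collections instead of an RDD. The object name `MaxPerCitySketch`, the helpers `keepLarger` and `topPersonPerCity`, the `Int` types for person_id and number, and the sample rows are all assumptions made for illustration; none of them come from the question.

```scala
// Hypothetical names and sample data, for illustration only.
object MaxPerCitySketch {
  type Row = (String, Int, Int) // (city, person_id, number)

  // Reducer: keep whichever (person_id, number) pair has the larger number.
  // On a tie the second argument wins, matching the `n1 > n2` test above.
  def keepLarger(a: (Int, Int), b: (Int, Int)): (Int, Int) =
    if (a._2 > b._2) a else b

  // Plain-collections stand-in for map + reduceByKey + map.
  def topPersonPerCity(rows: Seq[Row]): Map[String, Int] =
    rows
      .map { case (city, personId, number) => (city, (personId, number)) }
      .groupBy { case (city, _) => city }
      .map { case (city, pairs) =>
        val (personId, _) = pairs.map(_._2).reduce(keepLarger)
        (city, personId)
      }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ("Paris", 1, 10),
      ("Paris", 2, 30),
      ("Rome", 3, 5),
      ("Rome", 4, 2)
    )
    println(topPersonPerCity(rows))
  }
}
```

For the sample rows, Paris maps to person 2 and Rome to person 3. A function shaped like `keepLarger` should also be usable as the argument to reduceByKey on the pair RDD. When two people in a city share the same maximum, which one survives may depend on the order in which Spark combines the values, so add an explicit tie-break (for example on person_id) if that matters.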