2014-10-02 52 views

Apache Spark - reduce step output (K,(V1,V2,V3,...))

I have events (T1,K1,V1), (T2,K2,V3), (T3,K1,V2), (T4,K2,V4), (T5,K1,V5) in chronological order, where the keys and values are all strings.

I am trying to produce the following output using Spark:

K1,(V1,V2,V5) 
K2,(V3,V4) 

This is what I have tried:

val inputFile = args(0)
val outputFile = args(1)
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
val rdd1 = sc.textFile(inputFile, 2).cache()
// each input line is "T K V"; keep only the key and the value
val rdd2 = rdd1.map { line =>
  val fields = line.split(" ")
  val key = fields(1)
  val v = fields(2)
  (key, v)
}
// TODO : rdd2.reduce to get the output I want
rdd2.saveAsTextFile(outputFile)

Could someone please point me toward how to get the reducer to produce the output I want? Many thanks in advance.


You can refer to the documentation, in the sections on 'groupByKey' and 'aggregateByKey': http://spark.apache.org/docs/latest/programming-guide.html – 2014-10-02 03:40:23
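As a rough illustration of the 'aggregateByKey' route mentioned in this comment, the sketch below collects the values of each key into a Vector. The sample pairs, the application name and the local master setting are assumptions made for the example, and 'SparkContext.getOrCreate' assumes Spark 1.4 or later. The order of the values inside each Vector is not guaranteed to follow the input order.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local context for experimentation; in spark-shell this reuses the existing one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("AggregateByKeySketch").setMaster("local[2]"))

// Hypothetical (key, value) pairs mirroring the events in the question.
val pairs = sc.parallelize(Seq(
  ("k1", "v1"), ("k2", "v3"), ("k1", "v2"), ("k2", "v4"), ("k1", "v5")))

// Start from an empty Vector per key, append values within a partition,
// then concatenate the partial Vectors across partitions.
val aggregated = pairs.aggregateByKey(Vector.empty[String])(
  (acc, v) => acc :+ v,
  (left, right) => left ++ right)

// 'collect' is used here only to inspect a small result on the driver.
val result = aggregated.collect().toMap
```

For simply gathering every value of a key, 'groupByKey' does the same job with less code; 'aggregateByKey' tends to pay off when the values can be combined into something smaller than the full list.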

Answer


You just need to group your RDD by key to get the desired output: rdd2.groupByKey

This small spark-shell session illustrates the usage:

val events = List(("t1","k1","v1"), ("t2","k2","v3"), ("t3","k1","v2"), ("t4","k2","v4"), ("t5","k1","v5")) 
val rdd = sc.parallelize(events) 
val kv = rdd.map{case (t,k,v) => (k,v)} 
val grouped = kv.groupByKey 
// show the collection ('collect' used here only to show the contents) 
grouped.collect 
res0: Array[(String, Iterable[String])] = Array((k1,ArrayBuffer(v1, v2, v5)), (k2,ArrayBuffer(v3, v4))) 
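One caveat: 'groupByKey' does not promise that the values of a key come back in input order, and the question asks for chronological output. A minimal sketch of one way to handle this is to carry the timestamp along, sort inside each group, and format each line as in the question. The integer timestamps, the application name and the local master are assumptions for the example; real timestamps would need a suitable ordering.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local context for experimentation; in spark-shell this reuses the existing one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("GroupInTimeOrder").setMaster("local[2]"))

// Hypothetical events (timestamp, key, value); Int timestamps keep the ordering unambiguous.
val events = Seq(
  (1, "k1", "v1"), (2, "k2", "v3"), (3, "k1", "v2"), (4, "k2", "v4"), (5, "k1", "v5"))

val ordered = sc.parallelize(events)
  .map { case (t, k, v) => (k, (t, v)) }      // keep the timestamp next to the value
  .groupByKey()                               // gather all (t, v) pairs of a key
  .mapValues(_.toSeq.sortBy(_._1).map(_._2))  // sort by timestamp, then drop it

// Render each group as "key,(v1,v2,...)" as requested in the question.
val lines = ordered.map { case (k, vs) => s"$k,${vs.mkString("(", ",", ")")}" }
```

In the asker's program, 'lines.saveAsTextFile(outputFile)' would then take the place of the TODO step. Sorting inside 'mapValues' holds all values of one key in memory, which is fine for small groups but may need a different approach for very large ones.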