2017-05-11 39 views
2

How to group data by columns and count the number of observations in each group

I have a DataFrame df with 3 columns: id, type and activity.

val myData = (Seq(("aa1", "GROUP_A", "10"),("aa1","GROUP_A", "12"),("aa2","GROUP_A", "hy"),("aa2", "GROUP_B", "14"), 
       ("aa3","GROUP_B", "11"),("aa3","GROUP_B","12"),("aa2", "GROUP_3", "12")) 

val df = sc.parallelize(myData).toDF() 

I need to group the data by type and then count the number of activities for each id. This is the expected result:

type    id  count
GROUP_A aa1 2
GROUP_A aa2 1
GROUP_B aa3 3
GROUP_B aa2 1

This is my attempt:

df.groupBy("type","id").count().sort("count").show() 

But it does not give the correct result.

Answers

1

I made minimal changes to your sample data and it works for me:

//yours 
val myData = (Seq(("aa1", "GROUP_A", "10"),("aa1","GROUP_A", "12"),("aa2","GROUP_A", "hy"),("aa2", "GROUP_B", "14"),("aa3","GROUP_B", "11"),("aa3","GROUP_B","12"),("aa2", "GROUP_3", "12")) 

//mine 
//removed the extra ( at the beginning 
//changed GROUP_3 to GROUP_B 
//other minor changes so that the resultant group by will look like you desired 
val myData = Seq(("aa1", "GROUP_A", "10"),("aa1","GROUP_A", "12"),("aa2","GROUP_A", "12"),("aa3", "GROUP_B", "14"),("aa3","GROUP_B", "11"),("aa3","GROUP_B","12"),("aa2", "GROUP_B", "12")) 


//yours 
val df = sc.parallelize(myData).toDF() 
//mine 
//added in column names 

val df = sc.parallelize(myData).toDF("id","type","count") 

df.groupBy("type","id").count.show 
+-------+---+-----+
|   type| id|count|
+-------+---+-----+
|GROUP_A|aa1|    2|
|GROUP_A|aa2|    1|
|GROUP_B|aa2|    1|
|GROUP_B|aa3|    3|
+-------+---+-----+

Is there something I missed?
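One likely reason the original attempt fails is that toDF() called without arguments gives the columns the default tuple names _1, _2, _3, so groupBy("type","id") has no columns of those names to resolve. The sketch below illustrates the difference. It assumes Spark 2.x or later with a local SparkSession standing in for the shell's sc; the object name ColumnNamesSketch and the shortened sample rows are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of the original question or answer.
object ColumnNamesSketch {
  // Shortened sample rows in the same (id, type, activity) order as the question.
  val rows = Seq(("aa1", "GROUP_A", "10"), ("aa1", "GROUP_A", "12"), ("aa2", "GROUP_B", "14"))

  // toDF() with no arguments: columns get the default tuple names.
  def unnamed(spark: SparkSession): DataFrame = {
    import spark.implicits._
    rows.toDF()
  }

  // toDF with explicit names: groupBy("type", "id") can now resolve its columns.
  def named(spark: SparkSession): DataFrame = {
    import spark.implicits._
    rows.toDF("id", "type", "activity")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this small demonstration.
    val spark = SparkSession.builder().master("local[*]").appName("column-names-sketch").getOrCreate()
    println(unnamed(spark).columns.mkString(", ")) // _1, _2, _3
    named(spark).groupBy("type", "id").count().show()
    spark.stop()
  }
}
```

Calling unnamed(spark).groupBy("type", "id") instead should fail at analysis time, because neither column exists under that name.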

+0

Thank you very much. It should be 'toDF("id","type","count")' because 'aa..' is the 'id'. Let me check. – Dinosaurius

+0

I edited my answer; this is how it should be. –

0

You can define the column names when you create the DataFrame, and then do the count on the grouped data. This should be straightforward:

import sqlContext.implicits._ 

val myData = Seq(("aa1", "GROUP_A", "10"), 
    ("aa1","GROUP_A", "12"), 
    ("aa2","GROUP_A", "hy"), 
    ("aa2", "GROUP_B", "14"), 
    ("aa3","GROUP_B", "11"), 
    ("aa3","GROUP_B","12"), 
    ("aa3", "GROUP_B", "12")) 

val df = sc.parallelize(myData).toDF("id", "type", "activity") 
df.groupBy("type","id").count().sort("count").show() 
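As a quick cross-check of the counts without starting Spark, the same grouping can be sketched with plain Scala collections. This is only an illustration of the group-then-count logic, not how Spark runs it; the object name GroupCountCheck and the function countByTypeAndId are hypothetical.

```scala
// Hypothetical cross-check using only the Scala standard library.
object GroupCountCheck {
  // Same rows as in the answer above, in (id, type, activity) order.
  val myData: Seq[(String, String, String)] = Seq(
    ("aa1", "GROUP_A", "10"),
    ("aa1", "GROUP_A", "12"),
    ("aa2", "GROUP_A", "hy"),
    ("aa2", "GROUP_B", "14"),
    ("aa3", "GROUP_B", "11"),
    ("aa3", "GROUP_B", "12"),
    ("aa3", "GROUP_B", "12"))

  // Group rows by (type, id) and count how many rows fall into each group.
  def countByTypeAndId(rows: Seq[(String, String, String)]): Map[(String, String), Int] =
    rows
      .groupBy { case (id, tpe, _) => (tpe, id) }
      .map { case (key, group) => key -> group.size }

  def main(args: Array[String]): Unit =
    // Sort by count ascending, mirroring sort("count") in the Spark version.
    countByTypeAndId(myData).toSeq.sortBy(_._2).foreach {
      case ((tpe, id), n) => println(s"$tpe $id $n")
    }
}
```

For this data the function yields four groups, with (GROUP_B, aa3) counted 3 times and (GROUP_A, aa1) twice, which agrees with the expected result in the question.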