2015-12-31 218 views

I have a Spark DataFrame with two columns, movieid and the tag applied to the movie, in the following format:

movieid tag                      

1   animation 
1   pixar 
1   animation 
2   comedy                    

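For each movie ID, I want to count how many times each tag was applied. I also want to count the total number of tags applied to each movie. I am new to Spark.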

Answer


Here is how to do it in PySpark:

Create the DataFrame:

from pyspark.sql import SQLContext

# sc is the SparkContext that the PySpark shell provides
sqlContext = SQLContext(sc)
data = [(1,'animation'),(1,'pixar'),(1,'animation'),(2,'comedy')]
RDD = sc.parallelize(data)
orders_df = sqlContext.createDataFrame(RDD,["movieid","tag"])
orders_df.show()

+-------+---------+
|movieid|      tag|
+-------+---------+
|      1|animation|
|      1|    pixar|
|      1|animation|
|      2|   comedy|
+-------+---------+

Compute the counts:

orders_df.groupBy(['movieid','tag']).count().show() # for each movie ID, how many times each tag is applied

+-------+---------+-----+
|movieid|      tag|count|
+-------+---------+-----+
|      1|    pixar|    1|
|      1|animation|    2|
|      2|   comedy|    1|
+-------+---------+-----+

orders_df.groupBy(['movieid']).count().show() # total number of tags applied to each movie (repeated tags included)

+-------+-----+
|movieid|count|
+-------+-----+
|      1|    3|
|      2|    1|
+-------+-----+
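
To sanity-check the two aggregations without a running Spark context, the same counts can be reproduced in plain Python. This is only a sketch of what the two groupBy(...).count() calls compute on the sample rows, not Spark code. The names rows, tag_counts, movie_counts and distinct_tag_counts are made up for illustration, and the distinct-tag count is an extra that the question may or may not need.

```python
from collections import Counter

# Sample rows from the question: (movieid, tag)
rows = [(1, "animation"), (1, "pixar"), (1, "animation"), (2, "comedy")]

# Mirrors groupBy(['movieid', 'tag']).count(): one count per (movieid, tag) pair
tag_counts = Counter(rows)

# Mirrors groupBy(['movieid']).count(): tag applications per movie, repeats included
movie_counts = Counter(movieid for movieid, _ in rows)

# Extra, not in the original answer: number of distinct tags per movie
distinct_tag_counts = Counter(movieid for movieid, _ in set(rows))

print(dict(tag_counts))
print(dict(movie_counts))
print(dict(distinct_tag_counts))
```

If "total number of tags" is meant to ignore repeats, movie 1 has 2 distinct tags rather than 3 tag applications, so it is worth deciding which of the two numbers is wanted before choosing the aggregation.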

Thanks for the help. It works fine. The equivalent Scala code is: orders_df.groupBy("movieid", "tag").count().show() – sasmita