How to use countDistinct in Spark/Scala?

I am trying to aggregate a column of a Spark DataFrame in Scala, like this:

import org.apache.spark.sql._ 

dfNew.agg(countDistinct("filtered"))

but I get the error:

error: value agg is not a member of Unit

Can anyone explain why?

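Edit: to clarify what I am doing: I have a column containing arrays of strings, and I want to count the distinct elements across all rows. I am not interested in the other columns. Data: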

+------+----------------------------------------------------------------------------------------+
|racist|filtered                                                                                |
+------+----------------------------------------------------------------------------------------+
|false |[rt, @dope_promo:, crew, beat, high, scores, fugly, frog, , https://time.com/sxp3onz1w8]|
|false |[rt, @axolrose:, yall, call, kermit, frog, lizard?, , https://time.com/wdaeaer1ay]      |

And I want to count the elements of filtered, giving:

rt:2, @dope_promo:1, crew:1, ...frog:2 etc

Asked 2017-07-03 by schoon

For aggregate functions, you need to apply groupBy first. This may help: https://stackoverflow.com/questions/33500816/how-to-use-countdistinct-in-scala-with-spark

Possible duplicate of [How to use countDistinct in Scala with Spark?](https://stackoverflow.com/questions/33500816/how-to-use-countdistinct-in-scala-with-spark)

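OK, maybe I am trying to use the wrong function. I have a column that is an array of strings, and I want to count the distinct elements across all rows; I am not interested in the other columns. I will edit my question to reflect this. – schoon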

You need to explode your array first before you can count occurrences. To see the count of each element:

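import org.apache.spark.sql.functions.explode
// $"..." requires spark.implicits._ (already imported in spark-shell)

dfNew
  .withColumn("filtered", explode($"filtered")) // one row per array element
  .groupBy($"filtered")
  .count
  .orderBy($"count".desc)
  .show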
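For anyone who wants to try this outside spark-shell, here is a minimal self-contained sketch of the same explode-then-group approach. The local SparkSession, the object name ElementCounts, the helper elementCounts, and the shortened sample rows are all assumptions made for illustration; they are not from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ElementCounts {
  // Hypothetical helper: one output row per distinct array element, with its occurrence count.
  def elementCounts(df: DataFrame, column: String): DataFrame =
    df.select(explode(col(column)).as("element")) // one row per array element
      .groupBy(col("element"))
      .count()                                     // adds a Long column named "count"

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption, not the asker's setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("element-counts")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for dfNew, shortened from the sample rows in the question.
    val dfNew = Seq(
      (false, Seq("rt", "@dope_promo:", "crew", "frog")),
      (false, Seq("rt", "@axolrose:", "kermit", "frog"))
    ).toDF("racist", "filtered")

    elementCounts(dfNew, "filtered")
      .orderBy(col("count").desc)
      .show(truncate = false)

    spark.stop()
  }
}
```

Selecting only the exploded column, rather than using withColumn, drops the columns that are not needed before the shuffle.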

Or, to get only the number of distinct elements:

// Uses explode from org.apache.spark.sql.functions
val count = dfNew
  .withColumn("filtered", explode($"filtered"))
  .select($"filtered")
  .distinct
  .count
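Since the question asks about countDistinct specifically, the same number can probably also be obtained with agg once the array is exploded. Two notes, hedged: countDistinct lives in org.apache.spark.sql.functions, which `import org.apache.spark.sql._` does not bring into scope; and the error "value agg is not a member of Unit" means dfNew itself has type Unit, which typically happens when a val is assigned the result of a call such as show (an assumption here, since the asker's definition of dfNew is not shown). The object and helper names below are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, countDistinct, explode}

object DistinctElements {
  // Hypothetical helper: number of distinct array elements across all rows of `column`.
  def distinctElementCount(df: DataFrame, column: String): Long =
    df.select(explode(col(column)).as("element")) // one row per array element
      .agg(countDistinct(col("element")))         // single row, single Long column
      .first()
      .getLong(0)

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption, not the asker's setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-elements")
      .getOrCreate()
    import spark.implicits._

    // Keep the DataFrame itself in the val; calling show() returns Unit.
    val dfNew = Seq(
      (false, Seq("rt", "@dope_promo:", "crew", "frog")),
      (false, Seq("rt", "@axolrose:", "kermit", "frog"))
    ).toDF("racist", "filtered")

    println(distinctElementCount(dfNew, "filtered"))

    spark.stop()
  }
}
```

One difference worth knowing: countDistinct skips null elements, whereas distinct followed by count treats null as one more value.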

Answered 2017-07-03 19:18:40
