How to calculate a percentage over a DataFrame
I have a situation that can be mimicked with a DataFrame that looks like the one below.
Area Type NrPeople
1 House 200
1 Flat 100
2 House 300
2 Flat 400
3 House 1000
4 Flat 250
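I would like to calculate the number of people per area and return the result sorted in descending order, but most importantly I am struggling to calculate each area's percentage of the overall total.
The result should look like this: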
Area SumPeople %
3 1000 44%
2 700 31%
1 300 13%
4 250 11%
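As a quick sanity check of the expected table, the same arithmetic can be sketched in plain Python (no Spark involved). The names `rows`, `sum_people`, `total` and `result` are only illustrative, and the percentages are assumed to be rounded to whole numbers, as in the table above.

```python
from collections import defaultdict

# Same sample data as in the question, with NrPeople as integers.
rows = [
    ("1", "House", 200),
    ("1", "Flat", 100),
    ("2", "House", 300),
    ("2", "Flat", 400),
    ("3", "House", 1000),
    ("4", "Flat", 250),
]

# Sum the people per area.
sum_people = defaultdict(int)
for area, _type, nr_people in rows:
    sum_people[area] += nr_people

# Grand total over all areas.
total = sum(sum_people.values())

# (Area, SumPeople, rounded percentage of the total), largest area first.
result = sorted(
    ((area, n, round(100 * n / total)) for area, n in sum_people.items()),
    key=lambda r: r[1],
    reverse=True,
)

for area, n, pct in result:
    print(f"{area} {n} {pct}%")
```

Because each share is rounded independently, the percentages in the table add up to 99% rather than 100%.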
See the code example below:
HouseDf = spark.createDataFrame([("1", "House", "200"),
("1", "Flat", "100"),
("2", "House", "300"),
("2", "Flat", "400"),
("3", "House", "1000"),
("4", "Flat", "250")],
["Area", "Type", "NrPeople"])
import pyspark.sql.functions as fn
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total'))
Top = HouseDf\
.groupBy('Area')\
.agg(fn.sum('NrPeople').alias('SumPeople'))\
.orderBy('SumPeople', ascending=False)\
.withColumn('%', fn.lit(HouseDf.agg(fn.sum('NrPeople'))/Total.Total))
Top.show()
This fails with: unsupported operand type(s) for /: 'int' and 'DataFrame'
Any ideas on how to do this are welcome!
Thanks David, this works great :-)