2017-04-18 94 views
1

I have a DataFrame that looks like the one below. How do I calculate percentages over a DataFrame?

Area Type NrPeople  
1  House 200 
1  Flat  100 
2  House 300 
2  Flat  400 
3  House 1000 
4  Flat  250 

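I want to sum the number of people per area and return the result in descending order, but above all I am struggling to calculate each area's percentage of the overall total.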

The result should look like this:

Area SumPeople  %  
3  1000  44% 
2  700  31% 
1  300  13% 
4  250  11% 
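
To make the expected numbers concrete, here is a minimal plain-Python sketch of the same arithmetic (no Spark involved). The variable names are illustrative only, and NrPeople is written as integers here for simplicity. Each percentage is the area's sum divided by the grand total of 2250.

```python
from collections import defaultdict

# Rows copied from the table above: (Area, Type, NrPeople)
rows = [
    ("1", "House", 200),
    ("1", "Flat", 100),
    ("2", "House", 300),
    ("2", "Flat", 400),
    ("3", "House", 1000),
    ("4", "Flat", 250),
]

# Sum the people per area.
sum_people = defaultdict(int)
for area, _type, nr_people in rows:
    sum_people[area] += nr_people

# The denominator is the grand total over all areas.
total = sum(sum_people.values())

# Sort by the per-area sum, largest first, and attach each area's share of the total.
result = [
    (area, people, round(100 * people / total))
    for area, people in sorted(sum_people.items(), key=lambda kv: kv[1], reverse=True)
]

for area, people, percent in result:
    print(area, people, f"{percent}%")
```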

See the code example below:

HouseDf = spark.createDataFrame([("1", "House", "200"), 
           ("1", "Flat", "100"), 
           ("2", "House", "300"), 
           ("2", "Flat", "400"), 
           ("3", "House", "1000"), 
           ("4", "Flat", "250")], 
           ["Area", "Type", "NrPeople"]) 

import pyspark.sql.functions as fn 
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total')) 

Top = HouseDf\ 
    .groupBy('Area')\ 
    .agg(fn.sum('NrPeople').alias('SumPeople'))\ 
    .orderBy('SumPeople', ascending=False)\ 
    .withColumn('%', fn.lit(HouseDf.agg(fn.sum('NrPeople'))/Total.Total)) 
Top.show() 

This fails with: unsupported operand type(s) for /: 'int' and 'DataFrame'

Any ideas on how to do this are welcome!

Answers

1

Well, the error is fairly straightforward: Total is a DataFrame, and you cannot divide an integer by a DataFrame. First, convert it to a plain number using collect:

Total = HouseDf.agg(fn.sum('NrPeople').alias('Total')).collect()[0][0] 

Then, with Total as a plain number and some extra formatting, the following should work:

HouseDf\ 
    .groupBy('Area')\ 
    .agg(fn.sum('NrPeople').alias('SumPeople'))\ 
    .orderBy('SumPeople', ascending = False)\ 
    .withColumn('%', fn.format_string("%2.0f%%", fn.col('SumPeople')/Total * 100))\ 
    .show() 

+----+---------+---+ 
|Area|SumPeople|  %| 
+----+---------+---+ 
|   3|   1000.0|44%| 
|   2|    700.0|31%| 
|   1|    300.0|13%| 
|   4|    250.0|11%| 
+----+---------+---+ 

That said, I am not sure % is a good column name, since it is hard to reuse; you may want to name it something like Percent instead.

+1

Thanks David, this works great :-)

2

You need a window function:

import pyspark.sql.functions as fn 
from pyspark.sql import Window 

# A frame spanning all rows, so the windowed sum below is the grand total 
window = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) 

HouseDf\ 
    .groupBy('Area')\ 
    .agg(fn.sum('NrPeople').alias('SumPeople'))\ 
    .orderBy('SumPeople', ascending=False)\ 
    .withColumn('total', fn.sum(fn.col('SumPeople')).over(window))\ 
    .withColumn('Percent', fn.col('SumPeople') * 100 / fn.col('total'))\ 
    .drop('total').show() 

Output:

+----+---------+------------------+ 
|Area|SumPeople|           Percent| 
+----+---------+------------------+ 
|   3|   1000.0| 44.44444444444444| 
|   2|    700.0| 31.11111111111111| 
|   1|    300.0|13.333333333333334| 
|   4|    250.0| 11.11111111111111| 
+----+---------+------------------+