from pyspark.sql import Row, functions as F

row = Row("UK_1", "UK_2", "Date", "Cat")
agg = 'Cat'  # set to '' when no grouping column should be used

tdf = sc.parallelize([
    row(1, 1, '12/10/2016', 'A'),
    row(1, 2, None, 'A'),
    row(2, 1, '14/10/2016', 'B'),
    row(3, 3, '!~2016/2/276', 'B'),
    row(None, 1, '26/09/2016', 'A'),
    row(1, 1, '12/10/2016', 'A'),
    row(1, 2, None, 'A'),
    row(2, 1, '14/10/2016', 'B'),
    row(None, None, '!~2016/2/276', 'B'),
    row(None, 1, '26/09/2016', 'A')
]).toDF()

# Pseudocode: there is no `iff` function -- this is the conditional grouping I want:
tdf.groupBy(iff(len(agg.strip()) > 0, F.col(agg), )).agg(F.count('*').alias('row_count')).show()

Is there a way to group by a column, or by no column at all, depending on some condition? Conditional groupBy on a PySpark DataFrame.

Answer


You can pass an empty list to groupBy when the condition you are testing isn't met. groupBy accepts a list of columns, so an empty list means grouping by no columns at all, i.e. a single global aggregate:

tdf.groupBy(agg if len(agg) > 0 else []).agg(...) 

agg = '' 
tdf.groupBy(agg if len(agg) > 0 else []).agg(F.count('*').alias('row_count')).show() 
+---------+
|row_count|
+---------+
|       10|
+---------+

agg = 'Cat' 
tdf.groupBy(agg if len(agg) > 0 else []).agg(F.count('*').alias('row_count')).show() 
+---+---------+
|Cat|row_count|
+---+---------+
|  B|        4|
|  A|        6|
+---+---------+
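
The same idea generalizes to several optional grouping columns: build the list of columns first and pass it to groupBy, which handles an empty list gracefully. A minimal sketch reusing tdf, agg, and F from above (the candidates list and its conditions are hypothetical, just to illustrate the pattern):

# Hypothetical (name, condition) pairs; only columns whose condition
# holds are included in the grouping.
candidates = [('Cat', len(agg) > 0), ('Date', False)]
group_cols = [name for name, wanted in candidates if wanted]

# An empty group_cols list still works: Spark returns one global row.
tdf.groupBy(group_cols).agg(F.count('*').alias('row_count')).show()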