
I want to get percentage frequencies in PySpark. In Python (pandas) I do it as follows; how can I get the same percentage frequencies in PySpark?

Companies = df['Company'].value_counts(normalize = True) 

Getting the raw frequencies is straightforward:

# Companies in descending order of complaint frequency 
df.createOrReplaceTempView('Comp') 
CompDF = spark.sql("SELECT Company, count(*) as cnt \ 
        FROM Comp \ 
        GROUP BY Company \ 
        ORDER BY cnt DESC") 
CompDF.show() 
+--------------------+----+ 
|    Company| cnt| 
+--------------------+----+ 
|BANK OF AMERICA, ...|1387| 
|  EQUIFAX, INC.|1285| 
|WELLS FARGO & COM...|1119| 
|Experian Informat...|1115| 
|TRANSUNION INTERM...|1001| 
|JPMORGAN CHASE & CO.| 905| 
|  CITIBANK, N.A.| 772| 
|OCWEN LOAN SERVIC...| 481| 

How do I get the percentage frequencies from here? I have tried a bunch of things without much luck. Any help would be appreciated.


How about using the total count to compute the percentage? – Suresh


If you find the answer helpful, please accept it - thanks. – desertnaut

Answers


As Suresh implied in the comments, assuming total_count is the number of rows in your original DataFrame, you can use withColumn to add a new column named percentage to CompDF:

total_count = df.count()  # total number of rows in the original DataFrame 

CompDF = CompDF.withColumn('percentage', CompDF.cnt / float(total_count)) 
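
For completeness, here is a minimal end-to-end sketch of the same idea using only the DataFrame API. It assumes spark is an active SparkSession and df is the original complaints DataFrame from the question; this is one possible reading of the answer, not its exact code:

from pyspark.sql import functions as F 

total_count = df.count()  # total number of complaint rows: the denominator 

CompDF = (df.groupBy('Company') 
      .count()                 # one row per company with its raw count 
      .withColumnRenamed('count', 'cnt') 
      .withColumn('percentage', F.col('cnt') / float(total_count)) 
      .orderBy(F.desc('cnt'))) 
CompDF.show() 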

This looks clean and straightforward. Thanks! – Murat


@Murat you are very welcome - please do *accept* the answer. – desertnaut


Modifying the SQL query might get you the result you want:

"SELECT Company,cnt/(SELECT SUM(cnt) from (SELECT Company, count(*) as cnt 
    FROM Comp GROUP BY Company ORDER BY cnt DESC) temp_tab) sum_freq from 
    (SELECT Company, count(*) as cnt FROM Comp GROUP BY Company ORDER BY cnt 
    DESC)" 
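
As a side note (my own suggestion, not part of this answer), a window aggregate computes the grand total in a single pass, which avoids repeating the grouped subquery:

# SUM(cnt) OVER () sums cnt across the whole result set, so each 
# company's count is divided by the grand total. 
CompDF = spark.sql("SELECT Company, cnt, \ 
         cnt/SUM(cnt) OVER () AS sum_freq \ 
       FROM (SELECT Company, count(*) AS cnt \ 
             FROM Comp GROUP BY Company) t \ 
       ORDER BY cnt DESC") 
CompDF.show() 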

Thanks for the reply! I get an error: IllegalArgumentException: u'requirement failed: subquery subquery1602 has not finished'. I had to modify it slightly to get a DataFrame, as follows: 'CDF = spark.sql("SELECT Company, cnt/(SELECT SUM(cnt) FROM (SELECT Company, count(*) AS cnt FROM Comp GROUP BY Company ORDER BY cnt DESC) temp_tab) sum_freq FROM (SELECT Company, count(*) AS cnt FROM Comp GROUP BY Company ORDER BY cnt DESC)").collect(); C = spark.createDataFrame(CDF); C.show()' Thanks again! – Murat
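
For readability, here is that workaround laid out as plain code - a best-effort reconstruction of the comment above, keeping the names CDF and C from the comment itself:

# Murat's workaround, reconstructed: run the query, collect the rows 
# locally, then rebuild a DataFrame from them before calling show(). 
CDF = spark.sql("SELECT Company, \ 
         cnt/(SELECT SUM(cnt) \ 
              FROM (SELECT Company, count(*) AS cnt \ 
                    FROM Comp GROUP BY Company \ 
                    ORDER BY cnt DESC) temp_tab) AS sum_freq \ 
       FROM (SELECT Company, count(*) AS cnt \ 
             FROM Comp GROUP BY Company \ 
             ORDER BY cnt DESC)").collect() 
C = spark.createDataFrame(CDF)  # Rows collected to the driver, rebuilt as a DataFrame 
C.show() 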
