
SAS PROC FREQ with PySpark (Frequency, Percent, Cumulative Frequency, and Cumulative Percent)

I'm looking for a way to reproduce the SAS PROC FREQ output in PySpark. I found code that does exactly what I need; however, it is written in Pandas. I want to make sure it does […]
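The Pandas code the question refers to is not shown here, but a typical Pandas emulation of PROC FREQ looks roughly like the sketch below (hypothetical sample data; not necessarily the asker's exact code):

import pandas as pd

# hypothetical sample data matching the counts in the answer's output
df = pd.DataFrame({'state': ['West Virginia'] * 5 + ['Delaware'] * 3 + ['Indiana'] * 2})

# value_counts sorts by frequency descending, like PROC FREQ's ORDER=FREQ
freq = df['state'].value_counts().to_frame('Frequency')
freq['Percent'] = 100 * freq['Frequency'] / freq['Frequency'].sum()
freq['Cumulative Frequency'] = freq['Frequency'].cumsum()
freq['Cumulative Percent'] = freq['Percent'].cumsum()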

Solution 1:

You can first group by state to get the frequency, derive the percent with a sum over an unbounded window, and then use running sums over a window ordered by descending frequency to get the cumulative frequency and percent:
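To make the solution reproducible, assume a small sample DataFrame whose contents are inferred from the output shown below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data inferred from the result table: 5 + 3 + 2 = 10 rows
df = spark.createDataFrame(
    [('West Virginia',)] * 5 + [('Delaware',)] * 3 + [('Indiana',)] * 2,
    ['state']
)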

from pyspark.sql import functions as F

result = df.groupBy('state').agg(
    # frequency: number of rows per state
    F.count('state').alias('Frequency')
).selectExpr(
    '*',
    # percent of total, via a sum over an unbounded (empty) window
    '100 * Frequency / sum(Frequency) over() AS Percent'
).selectExpr(
    '*',
    # running totals over a window ordered by descending frequency
    'sum(Frequency) over(order by Frequency desc) AS Cumulative_Frequency',
    'sum(Percent) over(order by Frequency desc) AS Cumulative_Percent'
)

result.show()
+-------------+---------+-------+--------------------+------------------+
|        state|Frequency|Percent|Cumulative_Frequency|Cumulative_Percent|
+-------------+---------+-------+--------------------+------------------+
|West Virginia|        5|   50.0|                   5|              50.0|
|     Delaware|        3|   30.0|                   8|              80.0|
|      Indiana|        2|   20.0|                  10|             100.0|
+-------------+---------+-------+--------------------+------------------+
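Equivalently, the same result can be expressed with the DataFrame API and pyspark.sql.Window instead of selectExpr. This is a sketch assuming the same df as above:

from pyspark.sql import Window
from pyspark.sql import functions as F

w_total = Window.partitionBy()                    # one window over all rows
w_cum = Window.orderBy(F.desc('Frequency'))       # running frame, most frequent first

result = (df.groupBy('state')
            .agg(F.count('state').alias('Frequency'))
            .withColumn('Percent',
                        100 * F.col('Frequency') / F.sum('Frequency').over(w_total))
            .withColumn('Cumulative_Frequency', F.sum('Frequency').over(w_cum))
            .withColumn('Cumulative_Percent', F.sum('Percent').over(w_cum)))

Note that both versions use windows without a partition clause, which pulls all rows into a single partition; that is fine here because the windows run over the already-aggregated per-state rows, not the raw data.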
