SAS PROC FREQ with PySpark (Frequency, Percent, Cumulative Frequency, and Cumulative Percent)
I'm looking for a way to reproduce the output of the SAS PROC FREQ procedure in PySpark. I found code that does exactly what I need, but it is written in Pandas; I want the equivalent in PySpark.
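(The question's original Pandas snippet isn't shown; for context, a typical Pandas emulation of PROC FREQ looks roughly like this hypothetical sketch:)

import pandas as pd

def proc_freq(pdf: pd.DataFrame, col: str) -> pd.DataFrame:
    # value_counts() sorts descending by count, like PROC FREQ's default order
    out = pdf[col].value_counts().to_frame('Frequency')
    out['Percent'] = 100 * out['Frequency'] / out['Frequency'].sum()
    out['Cumulative_Frequency'] = out['Frequency'].cumsum()
    out['Cumulative_Percent'] = out['Percent'].cumsum()
    return out.rename_axis(col).reset_index()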
Solution 1:
You can first group by state to get the frequency, derive the percent with an unordered window sum over the grand total, then use ordered window sums to get the cumulative frequency and percent:
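(As a minimal, reproducible setup: the question's actual data isn't shown, so this assumes a one-column DataFrame of states whose counts match the output below.)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical sample: 5 West Virginia, 3 Delaware, 2 Indiana rows (10 total)
states = ['West Virginia'] * 5 + ['Delaware'] * 3 + ['Indiana'] * 2
df = spark.createDataFrame([(s,) for s in states], ['state'])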
result = df.groupBy('state').agg(
    # Frequency: row count per state
    F.count('state').alias('Frequency')
).selectExpr(
    '*',
    # Percent: each state's share of the grand total, via an unordered window sum
    '100 * Frequency / sum(Frequency) over() Percent'
).selectExpr(
    '*',
    # running totals, ordered from the most to the least frequent state
    'sum(Frequency) over(order by Frequency desc) Cumulative_Frequency',
    'sum(Percent) over(order by Frequency desc) Cumulative_Percent'
)
result.show()
+-------------+---------+-------+--------------------+------------------+
| state|Frequency|Percent|Cumulative_Frequency|Cumulative_Percent|
+-------------+---------+-------+--------------------+------------------+
|West Virginia| 5| 50.0| 5| 50.0|
| Delaware| 3| 30.0| 8| 80.0|
| Indiana| 2| 20.0| 10| 100.0|
+-------------+---------+-------+--------------------+------------------+
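One caveat on the design: with only Frequency in the ORDER BY, Spark's default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so states tied on the same frequency would all receive the same cumulative values. Adding a tiebreaker column to the ordering makes the running totals strictly row-by-row, as PROC FREQ reports them:

'sum(Frequency) over(order by Frequency desc, state) Cumulative_Frequency'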