In PySpark, a window function performs a calculation across a group of rows related to the current row, as defined by a window specification. Window functions support operations like ranking, aggregation, and running totals without collapsing the data (i.e., without a group-by): each row retains its individual context while the calculation runs over its window.
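To see the difference from a group-by, compare the two calls below. This is a minimal sketch; the SparkSession setup, sample data, and column names are illustrative assumptions:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 30), ("Alice", 40), ("Bob", 30)], ["Name", "Age"]
)

# groupBy collapses each group to a single row...
df.groupBy("Name").agg(F.avg("Age").alias("avg_age")).show()

# ...while a window function keeps every row and attaches the group result to it
w = Window.partitionBy("Name")
df.withColumn("avg_age", F.avg("Age").over(w)).show()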
Syntax: -
from pyspark.sql.window import Window
from pyspark.sql.functions import col, desc, rank, row_number
# Define a Window specification
windowSpec = Window.partitionBy("column_to_group").orderBy("column_to_order")
# Descending order: orderBy(desc("column_to_order")) or orderBy(col("column_to_order").desc())
# Apply a window function
df_with_rank = df.withColumn("rank", rank().over(windowSpec))
row_number(): Assigns a unique, sequential row number to each row within a window partition.
rank(): Assigns a rank to each row within a window partition. Ties receive the same rank, and gaps are left after them.
dense_rank(): Similar to rank(), but it doesn't leave gaps in rank values.
sum(), avg(), min(), max(): Aggregate over the window, similar to group-by but with each row retaining its individual context.
lead(): Provides access to a row at a given offset after the current row within the partition.
lag(): Provides access to a row at a given offset before the current row within the partition.
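A minimal sketch demonstrating several of these side by side; the sample data, SparkSession setup, and column names are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import dense_rank, lag, lead, rank, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 30), ("Alice", 30), ("Alice", 40), ("Bob", 35)],
    ["Name", "Age"],
)

w = Window.partitionBy("Name").orderBy("Age")

df.select(
    "Name",
    "Age",
    row_number().over(w).alias("row_number"),  # 1, 2, 3, ... per partition
    rank().over(w).alias("rank"),              # ties share a rank, gaps follow
    dense_rank().over(w).alias("dense_rank"),  # ties share a rank, no gaps
    lag("Age", 1).over(w).alias("prev_age"),   # Age from the previous row (null at partition start)
    lead("Age", 1).over(w).alias("next_age"),  # Age from the next row (null at partition end)
).show()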
Example: -
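The example below assumes an existing DataFrame df. A sketch of sample data consistent with the output shown afterwards (the SparkSession setup is an assumption):

from pyspark.sql import SparkSession
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 30), ("Alice", 40), ("Alice", 50),
     ("Bob", 30), ("Bob", 35), ("Bob", 45)],
    ["Name", "Age"],
)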
from pyspark.sql.functions import sum  # note: this shadows Python's built-in sum
# Define window specification for cumulative sum
windowSpec = Window.partitionBy("Name").orderBy("Age").rowsBetween(Window.unboundedPreceding, Window.currentRow)
# Apply cumulative sum
df_with_cumsum = df.withColumn("cumulative_sum", sum("Age").over(windowSpec))
df_with_cumsum.show()
+-----+---+--------------+
| Name|Age|cumulative_sum|
+-----+---+--------------+
|Alice| 30|            30|
|Alice| 40|            70|
|Alice| 50|           120|
|  Bob| 30|            30|
|  Bob| 35|            65|
|  Bob| 45|           110|
+-----+---+--------------+
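Note that rowsBetween(Window.unboundedPreceding, Window.currentRow) frames the window by physical row count; rangeBetween instead frames it by the values of the ordering column, which changes the result when there are ties. A minimal sketch of the difference (the sample data is an illustrative assumption):

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 30), ("Alice", 30), ("Alice", 40)], ["Name", "Age"]
)

rows_w = Window.partitionBy("Name").orderBy("Age").rowsBetween(Window.unboundedPreceding, Window.currentRow)
range_w = Window.partitionBy("Name").orderBy("Age").rangeBetween(Window.unboundedPreceding, Window.currentRow)

# With tied Ages, rowsBetween yields 30, 60, 100 (rows enter one at a time),
# while rangeBetween yields 60, 60, 100 (all peer rows with the same Age enter together).
df.withColumn("rows_sum", sum_("Age").over(rows_w)) \
  .withColumn("range_sum", sum_("Age").over(range_w)) \
  .show()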