In PySpark, a window function performs a calculation across a group of rows related to the current row, as defined by a window specification. Window functions support operations like ranking, aggregation, and running totals without collapsing the data (i.e., without a group-by): each row retains its individual context while the calculation runs over its window.
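To see the difference from a group-by, compare the two calls below. This is a minimal sketch; the SparkSession setup, sample data, and column names are illustrative assumptions:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 30), ("Alice", 40), ("Bob", 30)], ["Name", "Age"]
)

# groupBy collapses each group to a single row...
df.groupBy("Name").agg(F.avg("Age").alias("avg_age")).show()

# ...while a window function keeps every row and attaches the group result to it
w = Window.partitionBy("Name")
df.withColumn("avg_age", F.avg("Age").over(w)).show()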
Syntax: -
from pyspark.sql.window import Window
from pyspark.sql.functions import col, desc, rank, row_number
# Define a Window specification
windowSpec = Window.partitionBy("column_to_group").orderBy("column_to_order")
# Descending order: orderBy(desc("column_to_order")) or orderBy(col("column_to_order").desc())
# Apply a window function
df_with_rank = df.withColumn("rank", rank().over(windowSpec))
row_number(): Assigns a unique, sequential row number to each row within a window partition.
rank(): Assigns a rank to each row within a window partition. Ties receive the same rank, and gaps are left after them.
dense_rank(): Similar to rank(), but it doesn't leave gaps in rank values.
sum(), avg(), min(), max(): Aggregate over the window, similar to group-by but with each row retaining its individual context.
lead(): Provides access to a row at a given offset after the current row within the partition.
lag(): Provides access to a row at a given offset before the current row within the partition.
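A minimal sketch demonstrating several of these side by side; the sample data, SparkSession setup, and column names are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import dense_rank, lag, lead, rank, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 30), ("Alice", 30), ("Alice", 40), ("Bob", 35)],
    ["Name", "Age"],
)

w = Window.partitionBy("Name").orderBy("Age")

df.select(
    "Name",
    "Age",
    row_number().over(w).alias("row_number"),  # 1, 2, 3, ... per partition
    rank().over(w).alias("rank"),              # ties share a rank, gaps follow
    dense_rank().over(w).alias("dense_rank"),  # ties share a rank, no gaps
    lag("Age", 1).over(w).alias("prev_age"),   # Age from the previous row (null at partition start)
    lead("Age", 1).over(w).alias("next_age"),  # Age from the next row (null at partition end)
).show()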
Example: -
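The example below assumes an existing DataFrame df. A sketch of sample data consistent with the output shown afterwards (the SparkSession setup is an assumption):

from pyspark.sql import SparkSession
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 30), ("Alice", 40), ("Alice", 50),
     ("Bob", 30), ("Bob", 35), ("Bob", 45)],
    ["Name", "Age"],
)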
from pyspark.sql.functions import sum  # note: this shadows Python's built-in sum
# Define window specification for cumulative sum
windowSpec = Window.partitionBy("Name").orderBy("Age").rowsBetween(Window.unboundedPreceding, Window.currentRow)
# Apply cumulative sum
df_with_cumsum = df.withColumn("cumulative_sum", sum("Age").over(windowSpec))
df_with_cumsum.show()
+-----+---+--------------+
| Name|Age|cumulative_sum|
+-----+---+--------------+
|Alice| 30|            30|
|Alice| 40|            70|
|Alice| 50|           120|
|  Bob| 30|            30|
|  Bob| 35|            65|
|  Bob| 45|           110|
+-----+---+--------------+
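Note that rowsBetween(Window.unboundedPreceding, Window.currentRow) frames the window by physical row count; rangeBetween instead frames it by the values of the ordering column, which changes the result when there are ties. A minimal sketch of the difference (the sample data is an illustrative assumption):

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 30), ("Alice", 30), ("Alice", 40)], ["Name", "Age"]
)

rows_w = Window.partitionBy("Name").orderBy("Age").rowsBetween(Window.unboundedPreceding, Window.currentRow)
range_w = Window.partitionBy("Name").orderBy("Age").rangeBetween(Window.unboundedPreceding, Window.currentRow)

# With tied Ages, rowsBetween yields 30, 60, 100 (rows enter one at a time),
# while rangeBetween yields 60, 60, 100 (all peer rows with the same Age enter together).
df.withColumn("rows_sum", sum_("Age").over(rows_w)) \
  .withColumn("range_sum", sum_("Age").over(range_w)) \
  .show()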