# Function

monotonically_increasing_id() is a PySpark function from the pyspark.sql.functions module. It generates a unique, monotonically increasing, 64-bit integer for each row in a DataFrame.

Characteristics

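  1. Monotonicity: The generated IDs are guaranteed to be increasing and unique, but not consecutive. In the current implementation, each ID combines the partition ID (upper 31 bits) with the record number within that partition (lower 33 bits).
  2. Partition-awareness: The IDs are unique across the entire DataFrame. The record number restarts at 0 in each partition, but the partition ID in the upper bits shifts each partition into its own range, so the first ID of partition p is p × 2^33 and large gaps appear between partitions. The function is also non-deterministic, because its result depends on how the data is partitioned.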
  3. Use cases:
    • Adding unique IDs to rows in a distributed DataFrame.
    • Generating unique keys for further transformations or joins.
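
The bit layout described above can be sketched in plain Python, without a Spark session. This is only an illustrative model of the documented layout, not Spark's own code: the helper names make_id, split_id, and simulate_ids are hypothetical, and the partition sizes are made up for the example.

```python
from typing import List, Tuple

# Illustrative model of the documented ID layout: partition ID in the
# upper 31 bits, per-partition record number in the lower 33 bits.
# All helper names here are hypothetical, not part of PySpark.
ROW_BITS = 33


def make_id(partition_id: int, row_index: int) -> int:
    """Combine a partition ID and a row index into a single 64-bit ID."""
    return (partition_id << ROW_BITS) + row_index


def split_id(unique_id: int) -> Tuple[int, int]:
    """Recover (partition_id, row_index) from an ID built by make_id."""
    return unique_id >> ROW_BITS, unique_id & ((1 << ROW_BITS) - 1)


def simulate_ids(partition_sizes: List[int]) -> List[int]:
    """IDs that partitions with the given row counts would receive."""
    return [
        make_id(partition, row)
        for partition, size in enumerate(partition_sizes)
        for row in range(size)
    ]


# Two partitions holding 3 and 2 rows (assumed sizes for illustration).
ids = simulate_ids([3, 2])
print(ids)  # [0, 1, 2, 8589934592, 8589934593]
```

The jump from 2 to 8589934592 (2^33) is the gap between the first and second partition. The values are increasing and unique, but they are not a consecutive row count.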

Syntax

from pyspark.sql.functions import monotonically_increasing_id

# df is an existing DataFrame; the new column holds 64-bit (long) values.
df_with_id = df.withColumn("unique_id", monotonically_increasing_id())