monotonically_increasing_id() is a PySpark function from the pyspark.sql.functions module. It generates a unique, monotonically increasing, 64-bit integer for each row in a DataFrame.
Characteristics
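- Monotonicity: The generated IDs increase with partition order and with row order inside each partition, but they are not guaranteed to be consecutive, because each ID is built from the partition ID and the row's index within that partition.
- Partition-awareness: The row counter starts over at 0 in each partition, but the current implementation stores the partition ID in the upper 31 bits and the row counter in the lower 33 bits, so IDs remain unique across the entire DataFrame. Consecutive partitions are therefore separated by large gaps. This layout assumes fewer than 1 billion partitions and fewer than 8 billion rows per partition.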
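- Use cases:
  - Adding unique IDs to rows in a distributed DataFrame.
  - Generating unique keys for further transformations or joins. The function is non-deterministic because its result depends on partition IDs, so the IDs can change if the DataFrame is recomputed with a different partitioning; persist or write out the result before relying on the keys.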
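The partition-based layout described above can be modeled in plain Python. This is only a sketch of the documented bit layout (partition ID in the upper bits, row index in the lower 33 bits), not Spark's actual code; the names ROW_BITS, sketch_id, and split_id are hypothetical and exist only for this illustration.

```python
# Illustrative model only; not Spark's implementation.
ROW_BITS = 33  # assumed: the lower 33 bits hold the row index within a partition


def sketch_id(partition_id, row_index):
    """Combine a partition ID and a per-partition row index into one integer ID."""
    return (partition_id << ROW_BITS) + row_index


def split_id(generated_id):
    """Recover (partition_id, row_index) from an ID built by sketch_id."""
    return generated_id >> ROW_BITS, generated_id & ((1 << ROW_BITS) - 1)


# Two hypothetical partitions holding 3 and 2 rows respectively.
rows_per_partition = [3, 2]
ids = [
    sketch_id(partition, row)
    for partition, row_count in enumerate(rows_per_partition)
    for row in range(row_count)
]
print(ids)  # [0, 1, 2, 8589934592, 8589934593]
```

The row index restarts at 0 in the second partition, yet the IDs stay unique and increasing because the partition ID shifts them up by 2^33 = 8589934592. That shift is also where the gap between 2 and 8589934592 comes from.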
Syntax
from pyspark.sql.functions import monotonically_increasing_id
# df is any existing DataFrame; the new column holds 64-bit (long) values.
df_with_id = df.withColumn("unique_id", monotonically_increasing_id())