arrays_zip()

#Function

arrays_zip() is a PySpark function that takes multiple array columns as input and "zips" them together into a single array of structs. The struct at position N holds the N-th element of each input array. If the input arrays differ in length, the missing positions are filled with null.

Syntax: -

from pyspark.sql.functions import arrays_zip

arrays_zip(*cols)

Example: -

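from pyspark.sql import SparkSession
from pyspark.sql.functions import arrays_zip

# Create (or reuse) a SparkSession so the example runs on its own
spark = SparkSession.builder.appName("arrays_zip_example").getOrCreate()

# Sample DataFrame with two array columns of equal length
data = [(1, ["a", "b", "c"], [1, 2, 3]),
        (2, ["d", "e", "f"], [4, 5, 6])]

df = spark.createDataFrame(data, ["ID", "Array1", "Array2"])

# Use arrays_zip to combine Array1 and Array2 position by position
df_zipped = df.withColumn("Zipped", arrays_zip("Array1", "Array2"))

df_zipped.show(truncate=False)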

+---+---------+---------+------------------------+
|ID |Array1   |Array2   |Zipped                  |
+---+---------+---------+------------------------+
|1  |[a, b, c]|[1, 2, 3]|[[a, 1], [b, 2], [c, 3]]|
|2  |[d, e, f]|[4, 5, 6]|[[d, 4], [e, 5], [f, 6]]|
+---+---------+---------+------------------------+
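
To make the zipping rule concrete, here is a minimal pure-Python sketch of what happens to the arrays of a single row. It is only a model of the behavior, not Spark's implementation: arrays_zip_sketch is a hypothetical helper, and its keyword names stand in for column names. It also shows the case the example above does not cover, where one array is shorter than the other.

```python
from itertools import zip_longest


def arrays_zip_sketch(**arrays):
    """Model arrays_zip for one row (hypothetical helper, not a Spark API).

    Keyword names play the role of column names; each dict plays the
    role of one struct in the zipped array.
    """
    names = list(arrays)
    # zip_longest pads the shorter inputs with None, which models the
    # null padding applied when array lengths differ.
    return [dict(zip(names, values)) for values in zip_longest(*arrays.values())]


zipped = arrays_zip_sketch(Array1=["a", "b", "c"], Array2=[1, 2])
print(zipped)
# [{'Array1': 'a', 'Array2': 1}, {'Array1': 'b', 'Array2': 2}, {'Array1': 'c', 'Array2': None}]
```

The last struct keeps "c" from Array1 and carries a null (None here) for Array2, because Array2 has no third element.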

Use Cases:

Merging Data: Combines several related arrays into one array of structs, so that values belonging to the same position stay together in a single column.
Element-wise Processing: Makes it easier to work on corresponding elements of several arrays together, for example by exploding the zipped array into one row per position.