arrays_zip() is a PySpark function (available since Spark 2.4) that takes multiple array columns as input and "zips" them together into a single array of structs. The struct at position N holds the N-th element of each input array. If the input arrays have different lengths, the missing elements of the shorter arrays are filled with null.
Syntax: -
from pyspark.sql.functions import arrays_zip
arrays_zip(*cols)
Example: -
# Sample DataFrame with array columns
# (assumes an active SparkSession named `spark`, as in the pyspark shell or a notebook)
data = [(1, ["a", "b", "c"], [1, 2, 3]),
        (2, ["d", "e", "f"], [4, 5, 6])]
df = spark.createDataFrame(data, ["ID", "Array1", "Array2"])
# Use arrays_zip to combine Array1 and Array2
df_zipped = df.withColumn("Zipped", arrays_zip("Array1", "Array2"))
df_zipped.show(truncate=False)
Output: -
+---+---------+---------+------------------------+
|ID |Array1   |Array2   |Zipped                  |
+---+---------+---------+------------------------+
|1  |[a, b, c]|[1, 2, 3]|[[a, 1], [b, 2], [c, 3]]|
|2  |[d, e, f]|[4, 5, 6]|[[d, 4], [e, 5], [f, 6]]|
+---+---------+---------+------------------------+
Note: Spark 3.x prints each struct in braces instead, e.g. [{a, 1}, {b, 2}, {c, 3}].
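To make the positional pairing concrete, here is a minimal plain-Python sketch of what arrays_zip computes for a single row. It is only a model for illustration, not Spark's implementation: the helper name arrays_zip_model is hypothetical, and dicts stand in for Spark structs.

```python
from itertools import zip_longest


def arrays_zip_model(**arrays):
    """Plain-Python model of arrays_zip for one row (hypothetical helper).

    Each keyword argument is a column name mapped to that row's array.
    Returns a list of dicts, one per position, standing in for structs.
    """
    names = list(arrays)
    # zip_longest pads the shorter arrays with None, mirroring the
    # null padding that arrays_zip applies to arrays of unequal length.
    return [
        dict(zip(names, values))
        for values in zip_longest(*arrays.values(), fillvalue=None)
    ]


# Equal-length arrays: one struct per position.
zipped = arrays_zip_model(Array1=["a", "b", "c"], Array2=[1, 2, 3])

# Unequal lengths: the shorter array is padded with None.
padded = arrays_zip_model(Array1=["a", "b", "c"], Array2=[1, 2])
```

In this model, the last element of padded pairs "c" with None, which corresponds to a null field in the struct Spark would produce.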
Use Cases:
- Merging Data: Combines several parallel arrays into one array of structs, so related values travel together in a single column.
- Element-wise Processing: Makes it easier to work on corresponding elements of multiple arrays together, for example by exploding the zipped array into one row per position.