In PySpark, collect_list() is an aggregate function available in the pyspark.sql.functions module. It is used to aggregate data by collecting all the values from a specified column into a single list for each group in a grouped dataset.
Syntax:
pyspark.sql.functions.collect_list(col)
Key Points
- Use collect_list() when you want to aggregate all values in a column into a list, including duplicates.
- If you need a list of unique values per group, use collect_set() instead (or de-duplicate the rows with .distinct() before grouping).
- The order of elements in the list is not guaranteed.
In PySpark, you can use collect_list along with struct to collect multiple columns into a list of structs. This is useful when you want to retain multiple attributes for each group while collecting the data.
Example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list, struct
# Create a Spark session
spark = SparkSession.builder.appName("StructInCollectListExample").getOrCreate()
# Sample data
data = [
("Alice", "Apple", 10),
("Alice", "Orange", 15),
("Bob", "Banana", 20),
("Bob", "Grapes", 25),
]
# Create a DataFrame
df = spark.createDataFrame(data, ["name", "item", "quantity"])
# Group by 'name' and collect 'item' and 'quantity' into a list of structs
result = df.groupBy("name").agg(collect_list(struct("item", "quantity")).alias("items"))
# Show the result
result.show(truncate=False)
+-----+----------------------------------+
| name|items |
+-----+----------------------------------+
|Alice|[{Apple, 10}, {Orange, 15}] |
|Bob |[{Banana, 20}, {Grapes, 25}] |
+-----+----------------------------------+
To work with the collected data, you can expand or manipulate the struct fields:
Access Individual Fields
Use explode to flatten the list of structs back into one row per element, or dot notation (e.g. items.item) to reference a field inside each struct.