In PySpark, collect_list() is an aggregate function available in the pyspark.sql.functions module. It is used to aggregate data by collecting all the values from a specified column into a single list for each group in a grouped dataset.
Syntax:
pyspark.sql.functions.collect_list(col)
Key Points
- Use collect_list() when you want to aggregate all values in a column into a list, including duplicates.
- If you need a list of unique values per group, use collect_set() instead (or de-duplicate the rows with .distinct() before grouping).
- The order of elements in the list is not guaranteed.
In PySpark, you can use collect_list along with struct to collect multiple columns into a list of structs. This is useful when you want to retain multiple attributes for each group while collecting the data.
Example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list, struct
# Create a Spark session
spark = SparkSession.builder.appName("StructInCollectListExample").getOrCreate()
# Sample data
data = [
("Alice", "Apple", 10),
("Alice", "Orange", 15),
("Bob", "Banana", 20),
("Bob", "Grapes", 25),
]
# Create a DataFrame
df = spark.createDataFrame(data, ["name", "item", "quantity"])
# Group by 'name' and collect 'item' and 'quantity' into a list of structs
result = df.groupBy("name").agg(collect_list(struct("item", "quantity")).alias("items"))
# Show the result
result.show(truncate=False)
+-----+----------------------------------+
| name|items |
+-----+----------------------------------+
|Alice|[{Apple, 10}, {Orange, 15}] |
|Bob |[{Banana, 20}, {Grapes, 25}] |
+-----+----------------------------------+
To work with the collected data, you can expand or manipulate the struct fields:
Access Individual Fields
Use explode to flatten the list of structs back into one row per element, or dot notation (e.g. items.item) to reference a field inside each struct.