The countDistinct() function in PySpark is used to count the number of distinct elements in a column (or columns) of a DataFrame.
Syntax:
countDistinct(col1, col2, ...)
Example 1: Count Distinct in a Single Column
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

# Create a SparkSession
spark = SparkSession.builder.appName("CountDistinctExample").getOrCreate()

# Sample data
data = [("Alice", 20), ("Bob", 30), ("Alice", 20), ("Cathy", 40)]
columns = ["Name", "Age"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Count distinct names
distinct_count = df.select(countDistinct("Name").alias("DistinctNames"))
distinct_count.show()
+-------------+
|DistinctNames|
+-------------+
| 3|
+-------------+
Key Points
- countDistinct() is often used in aggregation operations and can be combined with .agg() for grouped computations.
- Counting distinct values on multiple columns counts distinct combinations of the values in those columns.
- It is computationally expensive for large datasets due to shuffling.