# The countDistinct() Function

The countDistinct() function in PySpark is used to count the number of distinct elements in a column (or columns) of a DataFrame.

Syntax:

countDistinct(col1, col2, ...)

Example 1: Count Distinct in a Single Column:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

# Create (or reuse) a SparkSession
spark = SparkSession.builder.getOrCreate()

# Sample data
data = [("Alice", 20), ("Bob", 30), ("Alice", 20), ("Cathy", 40)]
columns = ["Name", "Age"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Count distinct names
distinct_count = df.select(countDistinct("Name").alias("DistinctNames"))
distinct_count.show()
+-------------+
|DistinctNames|
+-------------+
|            3|
+-------------+
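
Example 2: Count Distinct Across Multiple Columns:

When countDistinct() receives more than one column, it counts distinct combinations of the values in those columns. A minimal sketch, reusing the df from Example 1 (the alias DistinctPairs is just an illustrative name):

# Count distinct (Name, Age) combinations
distinct_pairs = df.select(countDistinct("Name", "Age").alias("DistinctPairs"))
distinct_pairs.show()

With the sample data above this returns 3, because ("Alice", 20) appears twice.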

Key Points

  1. countDistinct() is often used in aggregation operations and can be combined with .agg() for grouped computations (see the sketch after this list).
  2. Counting distinct values on multiple columns counts distinct combinations of the values in those columns, as in Example 2 above.
  3. Counting distinct values is computationally expensive on large datasets because it requires shuffling data across the cluster.
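
A minimal sketch of point 1, using countDistinct() inside .agg() on the same df: it groups the rows by Age and counts the distinct names within each group (the alias DistinctNames is just an illustrative label).

# Distinct names per age group
df.groupBy("Age").agg(countDistinct("Name").alias("DistinctNames")).show()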