- Applicable to: DataFrames.
- Functionality: Returns a new DataFrame containing the rows of the first DataFrame that are not present in the second, preserving duplicates.
- Duplicates: Uses multiset semantics: a row that appears m times in the first DataFrame and n times in the second appears max(m − n, 0) times in the result.
Example:

```python
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("exceptAll Example").getOrCreate()

# Create two DataFrames with the same schema
data1 = [(1, 'Alice'), (2, 'Bob'), (2, 'Bob'), (3, 'Charlie')]
data2 = [(2, 'Bob'), (3, 'Charlie')]
columns = ["id", "name"]
df1 = spark.createDataFrame(data1, columns)
df2 = spark.createDataFrame(data2, columns)

# Use exceptAll() to get rows in df1 that are not in df2,
# keeping duplicates: one (2, 'Bob') survives because df1 has
# two copies and df2 removes only one of them.
result = df1.exceptAll(df2)
result.show()
```

Output:

```
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+
```
| Feature | subtract() | exceptAll() |
|---|---|---|
| Applicable to | RDDs (also available on DataFrames) | DataFrames |
| Duplicates | Removes duplicates (distinct semantics) | Preserves duplicates (multiset semantics) |
| Schema support | RDD version compares raw elements, no schema | Requires both DataFrames to have identical schemas |
| Output type | RDD (or DataFrame) | DataFrame |
| Complexity | Simpler; element-level comparison | Structured, column-aware comparison |
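The duplicate-handling difference in the table can be modeled in plain Python with `collections.Counter`. This is a sketch of the semantics only, not Spark code; the helper names `except_all` and `subtract` are chosen here for illustration:

```python
from collections import Counter

def except_all(rows1, rows2):
    # Multiset difference, mirroring DataFrame.exceptAll():
    # a row appearing m times in rows1 and n times in rows2
    # survives max(m - n, 0) times in the result.
    diff = Counter(rows1) - Counter(rows2)
    return list(diff.elements())

def subtract(rows1, rows2):
    # Set difference, mirroring subtract() (EXCEPT DISTINCT):
    # the result is deduplicated regardless of input counts.
    seen2 = set(rows2)
    return [row for row in dict.fromkeys(rows1) if row not in seen2]

data1 = [(1, 'Alice'), (2, 'Bob'), (2, 'Bob'), (3, 'Charlie')]
data2 = [(2, 'Bob'), (3, 'Charlie')]

print(except_all(data1, data2))  # [(1, 'Alice'), (2, 'Bob')]
print(subtract(data1, data2))    # [(1, 'Alice')]
```

Note how the duplicate (2, 'Bob') survives in `except_all` (two copies minus one) but is absent from `subtract`, matching the table above.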