Problem Statement: -

Find the names of the customer that are not referred by the customer with id = 2.
Return the result table in any order.

Pasted image 20241225153926.png

Dataframe API Solution: -

schema = StructType([ StructField("id", IntegerType(), True), StructField("name", StringType(), True), StructField("referee_id", IntegerType(), True) ])
# Create an RDD with the data 
data = [ (1, 'Will', None), (2, 'Jane', None), (3, 'Alex', 2), (4, 'Bill', None), (5, 'Zack', 1), (6, 'Mark', 2) ]

df = spark.createDataFrame(data,schema)
df.show()
+---+----+----------+
| id|name|referee_id|
+---+----+----------+
|  1|Will|      NULL|
|  2|Jane|      NULL|
|  3|Alex|         2|
|  4|Bill|      NULL|
|  5|Zack|         1|
|  6|Mark|         2|
+---+----+----------+
result = df.filter((col("referee_id")!=2)|col("referee_id").isNull()).select("name")

result.show()
+----+
|name|
+----+
|Will|
|Jane|
|Bill|
|Zack|
+----+

Spark SQL Solution: -

df.createOrReplaceTempView("emp")

result = spark.sql("""
SELECT 
    name
FROM emp
WHERE referee_id IS NULL OR referee_id !=2
""")

result.show()