#DataframeMethod
In PySpark, .join() is a DataFrame method used to perform joins between two DataFrames. It allows you to combine data from multiple datasets based on one or more common key columns.
Syntax:
DataFrame.join(other, on=None, how=None)
Parameters:
- other: The DataFrame to join with.
- on: The column(s) to join on. Can be a string (single column) or a list of strings (multiple columns). If not specified, a Cartesian (cross) join is performed.
- how: Specifies the type of join to perform. The options are:
  - "inner" (default): Returns matching rows from both DataFrames.
  - "outer": Returns all rows from both DataFrames, filling with null where there is no match.
  - "left": Returns all rows from the left DataFrame, with matching rows from the right DataFrame.
  - "right": Returns all rows from the right DataFrame, with matching rows from the left DataFrame.
  - "semi": Returns rows from the left DataFrame for which there is a match in the right DataFrame.
  - "anti": Returns rows from the left DataFrame for which there is no match in the right DataFrame.
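A minimal sketch of how these join types behave (the DataFrames, column names, and values below are illustrative, not from the original note):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()

# Hypothetical sample data keyed by "id"
employees = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Cara")], ["id", "name"])
salaries = spark.createDataFrame([(1, 50000), (2, 60000), (4, 70000)], ["id", "salary"])

# Inner join (default): only ids present in both DataFrames (1 and 2)
employees.join(salaries, on="id", how="inner").show()

# Left join: all employees, with null salary where there is no match (id 3)
employees.join(salaries, on="id", how="left").show()

# Anti join: employees with no matching salary row (only id 3)
employees.join(salaries, on="id", how="anti").show()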
For performance, consider broadcasting the smaller DataFrame: Spark sends a copy of it to every executor, so the join can run without shuffling the larger DataFrame. Example:
from pyspark.sql.functions import broadcast

# Hint Spark to broadcast df2 (the smaller DataFrame) to all executors
result = df1.join(broadcast(df2), on="id", how="inner")
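To check that the broadcast hint was actually applied, you can inspect the physical plan (assuming the df1/df2 join above); a broadcast join appears as a BroadcastHashJoin node:

result.explain()  # the physical plan should show BroadcastHashJoin instead of SortMergeJoin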