#DataframeMethod
In PySpark, .join() is a DataFrame method used to perform joins between two DataFrames. It allows you to combine data from multiple datasets based on one or more common key columns.
Syntax:
DataFrame.join(other, on=None, how=None)
Parameters:
- other: The DataFrame to join with.
- on: The column(s) to join on. Can be a string (single column) or a list of strings (multiple columns). If not specified, a Cartesian (cross) join is performed.
- how: Specifies the type of join to perform. The options are:
  - "inner" (default): Returns matching rows from both DataFrames.
  - "outer": Returns all rows from both DataFrames, filling with null where there is no match.
  - "left": Returns all rows from the left DataFrame, with matching rows from the right DataFrame.
  - "right": Returns all rows from the right DataFrame, with matching rows from the left DataFrame.
  - "semi": Returns rows from the left DataFrame for which there is a match in the right DataFrame.
  - "anti": Returns rows from the left DataFrame for which there is no match in the right DataFrame.
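A minimal sketch of how these join types behave (the DataFrames, column names, and values below are illustrative, not from the original note):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()

# Hypothetical sample data keyed by "id"
employees = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Cara")], ["id", "name"])
salaries = spark.createDataFrame([(1, 50000), (2, 60000), (4, 70000)], ["id", "salary"])

# Inner join (default): only ids present in both DataFrames (1 and 2)
employees.join(salaries, on="id", how="inner").show()

# Left join: all employees, with null salary where there is no match (id 3)
employees.join(salaries, on="id", how="left").show()

# Anti join: employees with no matching salary row (only id 3)
employees.join(salaries, on="id", how="anti").show()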
For performance, consider broadcasting the smaller DataFrame: Spark sends a copy of it to every executor, so the join can run without shuffling the larger DataFrame. Example:
from pyspark.sql.functions import broadcast

# Hint Spark to broadcast df2 (the smaller DataFrame) to all executors
result = df1.join(broadcast(df2), on="id", how="inner")
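To check that the broadcast hint was actually applied, you can inspect the physical plan (assuming the df1/df2 join above); a broadcast join appears as a BroadcastHashJoin node:

result.explain()  # the physical plan should show BroadcastHashJoin instead of SortMergeJoin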