The unix_timestamp() function in PySpark converts a timestamp or date string into the number of seconds elapsed since January 1, 1970 UTC (the Unix epoch). It is particularly useful for computing timestamp differences, conversions, and other time-based calculations.
Syntax
unix_timestamp(column, format=None)
- column: The column containing a timestamp or date string.
- format (optional): The date format of the input string. If not provided, the default format "yyyy-MM-dd HH:mm:ss" is used.
Difference Between Two Timestamps
Calculate the time difference between two timestamp columns using unix_timestamp().
from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp

spark = SparkSession.builder.getOrCreate()

data = [
    ("2024-12-31 08:30:00", "2024-12-31 10:30:00"),
    ("2024-12-31 09:00:00", "2024-12-31 17:00:00"),
]
columns = ["start_time", "end_time"]
df = spark.createDataFrame(data, columns)

# Calculate time difference in seconds; both columns match the default format
df = df.withColumn(
    "time_diff_seconds",
    unix_timestamp("end_time") - unix_timestamp("start_time"),
)
df.show()
+-------------------+-------------------+-----------------+
|         start_time|           end_time|time_diff_seconds|
+-------------------+-------------------+-----------------+
|2024-12-31 08:30:00|2024-12-31 10:30:00|             7200|
|2024-12-31 09:00:00|2024-12-31 17:00:00|            28800|
+-------------------+-------------------+-----------------+
Important Notes
- Default format: If the timestamp strings match the default format (yyyy-MM-dd HH:mm:ss), the format argument can be omitted.
- Date-only strings: You can use unix_timestamp() with date-only strings by passing the format "yyyy-MM-dd"; the result is the start of that day in epoch seconds.
- null for invalid input: If the input does not match the specified format, the function returns null.