The unix_timestamp() function in PySpark converts a timestamp or date string into the number of seconds elapsed since January 1, 1970 UTC (the Unix epoch). It is particularly useful for computing timestamp differences, conversions, and other time-based calculations.
Syntax
unix_timestamp(column, format=None)
- column: The column containing a timestamp or date string.
- format (optional): The date format of the input string. If not provided, the default format "yyyy-MM-dd HH:mm:ss" is used.
Difference Between Two Timestamps
Calculate the time difference between two timestamp columns using unix_timestamp().
from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp

spark = SparkSession.builder.getOrCreate()

data = [
    ("2024-12-31 08:30:00", "2024-12-31 10:30:00"),
    ("2024-12-31 09:00:00", "2024-12-31 17:00:00"),
]
columns = ["start_time", "end_time"]
df = spark.createDataFrame(data, columns)

# Calculate time difference in seconds; both columns match the default format
df = df.withColumn(
    "time_diff_seconds",
    unix_timestamp("end_time") - unix_timestamp("start_time"),
)
df.show()
+-------------------+-------------------+-----------------+
|         start_time|           end_time|time_diff_seconds|
+-------------------+-------------------+-----------------+
|2024-12-31 08:30:00|2024-12-31 10:30:00|             7200|
|2024-12-31 09:00:00|2024-12-31 17:00:00|            28800|
+-------------------+-------------------+-----------------+
Important Notes
- Default format: If the timestamp strings match the default format (yyyy-MM-dd HH:mm:ss), the format argument can be omitted.
- Date-only strings: You can use unix_timestamp() with date-only strings by passing the format "yyyy-MM-dd"; the result is the start of that day in epoch seconds.
- null for invalid input: If the input does not match the specified format, the function returns null.