Today, I will demonstrate how to work with dates in Apache Spark. In particular, I want to cover things like getting the current date and the current time (with and without the timezone), and calculating a relative date (for example, yesterday, a month ago, the first day of the previous month, etc.).
Table of Contents
- The current date and the current time
- Changing the timezone
- Relative dates
- Relative dates with months
- The last day of the month
- The first day of the previous month
I am going to write the code using PySpark, but the API should work the same in the Scala version of Apache Spark.
The current date and the current time
There are two ways to get the current date in PySpark. We can either get only the date or the date with the time.
Let’s start by creating a DataFrame that contains only one column and one row. In the second step, we will add two columns using the current_date and current_timestamp functions. We must remember that both functions return the time in the session timezone, which defaults to the system timezone.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark_session = SparkSession.builder.getOrCreate()

df_schema = StructType([StructField('a_column', StringType())])
test_list = [
    ['1']
]
df: DataFrame = spark_session.createDataFrame(test_list, schema=df_schema)
from pyspark.sql.functions import current_date, current_timestamp

df \
    .withColumn('current_date', current_date()) \
    .withColumn('current_timestamp', current_timestamp()) \
    .show()
+--------+------------+--------------------+
|a_column|current_date| current_timestamp|
+--------+------------+--------------------+
| 1| 2020-10-04|2020-10-04 11:55:...|
+--------+------------+--------------------+
Changing the timezone
To change the timezone and get the time in UTC, we must set the Spark session timezone using the spark.sql.session.timeZone configuration parameter:
spark_session.conf.set('spark.sql.session.timeZone', 'UTC')
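After changing the setting, re-running the same transformation as before returns the timestamp in UTC (a short sketch reusing the df defined above):

df \
    .withColumn('current_date', current_date()) \
    .withColumn('current_timestamp', current_timestamp()) \
    .show()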
+--------+------------+--------------------+
|a_column|current_date| current_timestamp|
+--------+------------+--------------------+
| 1| 2020-10-04|2020-10-04 10:16:...|
+--------+------------+--------------------+
Relative dates
To calculate a relative date (for example, yesterday or tomorrow), we should use the date_sub and date_add functions:
from pyspark.sql.functions import col, date_add, date_sub

df \
    .withColumn('current_date', current_date()) \
    .withColumn('yesterday', date_sub(col('current_date'), 1)) \
    .withColumn('tomorrow', date_add(col('current_date'), 1)) \
    .show()
+--------+------------+----------+----------+
|a_column|current_date| yesterday| tomorrow|
+--------+------------+----------+----------+
| 1| 2020-10-04|2020-10-03|2020-10-05|
+--------+------------+----------+----------+
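Both functions also accept negative offsets, so date_add(date, -1) is equivalent to date_sub(date, 1). A minimal sketch (the column name yesterday_alt is mine, not from the original example):

df \
    .withColumn('current_date', current_date()) \
    .withColumn('yesterday_alt', date_add(col('current_date'), -1)) \
    .show()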
Relative dates with months
It is easy to add or subtract days, but what if we wanted to calculate a date three months from now? Months vary in length, so we cannot just add 90 days and expect a correct result. Instead, Spark provides the add_months function, which also accepts negative numbers, so we can use it to calculate the date three months ago:
from pyspark.sql.functions import add_months

df \
    .withColumn('current_date', current_date()) \
    .withColumn('in_three_months', add_months(col('current_date'), 3)) \
    .withColumn('three_months_ago', add_months(col('current_date'), -3)) \
    .show()
+--------+------------+---------------+----------------+
|a_column|current_date|in_three_months|three_months_ago|
+--------+------------+---------------+----------------+
| 1| 2020-10-04| 2021-01-04| 2020-07-04|
+--------+------------+---------------+----------------+
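One behavior worth knowing: when the input date is the last day of its month, add_months returns the last day of the resulting month. A minimal sketch with a hard-coded date (the literal and column names are only illustrative):

from pyspark.sql.functions import lit, to_date

df \
    .withColumn('end_of_january', to_date(lit('2020-01-31'))) \
    .withColumn('one_month_later', add_months(col('end_of_january'), 1)) \
    .show()
# one_month_later is 2020-02-29, the last day of February in a leap year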
The last day of the month
In addition to that, Spark supports calculating the last day of a given month with the last_day function:
from pyspark.sql.functions import last_day

df \
    .withColumn('current_date', current_date()) \
    .withColumn('end_of_the_month', last_day(col('current_date'))) \
    .show()
+--------+------------+----------------+
|a_column|current_date|end_of_the_month|
+--------+------------+----------------+
| 1| 2020-10-04| 2020-10-31|
+--------+------------+----------------+
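There is no matching first_day function, but the trunc function truncates a date to the first day of its month when called with the 'month' (or 'mm') format. A short sketch of that alternative:

from pyspark.sql.functions import trunc

df \
    .withColumn('current_date', current_date()) \
    .withColumn('start_of_the_month', trunc(col('current_date'), 'month')) \
    .show()
# start_of_the_month is 2020-10-01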
The first day of the previous month
The first day of the previous month is an example of date arithmetic where the operations get complicated, because Spark has no dedicated function for it. Instead, we have to calculate the last day of the current month, subtract two months, and add one day:
df \
    .withColumn('current_date', current_date()) \
    .withColumn('first_day_of_previous_month', date_add(add_months(last_day(col('current_date')), -2), 1)) \
    .show()
+--------+------------+---------------------------+
|a_column|current_date|first_day_of_previous_month|
+--------+------------+---------------------------+
| 1| 2020-10-04| 2020-09-01|
+--------+------------+---------------------------+
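For comparison, the same result can be reached by stepping back one month and truncating with trunc, as shown earlier. This shortcut is my variation, not the approach from the original snippet:

df \
    .withColumn('current_date', current_date()) \
    .withColumn('first_day_of_previous_month', trunc(add_months(col('current_date'), -1), 'month')) \
    .show()
# also yields 2020-09-01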