Today, I will demonstrate how to work with dates in Apache Spark. In particular, I want to cover things like getting the current date and the current time (with and without the timezone), and calculating a relative date (for example, yesterday, a month ago, the first day of the previous month, etc.).
Table of Contents
- The current date and the current time
- Changing the timezone
- Relative dates
- Relative dates with months
- The last day of the month
- The first day of the previous month
I am going to write the code using PySpark, but the API should work the same in the Scala version of Apache Spark.
The current date and the current time
There are two ways to get the current date in PySpark. We can either get only the date or the date with the time.
Let’s start by creating a DataFrame that contains only one column and one row. In the second step, we will add two columns using the current_date and current_timestamp functions. We must remember that both functions return the time in the session timezone, which defaults to the system timezone.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark_session = SparkSession.builder.getOrCreate()

df_schema = StructType([StructField('a_column', StringType())])
test_list = [
    ['1']
]
df: DataFrame = spark_session.createDataFrame(test_list, schema=df_schema)
from pyspark.sql.functions import current_date, current_timestamp

df \
    .withColumn('current_date', current_date()) \
    .withColumn('current_timestamp', current_timestamp()) \
    .show()
+--------+------------+--------------------+
|a_column|current_date| current_timestamp|
+--------+------------+--------------------+
| 1| 2020-10-04|2020-10-04 11:55:...|
+--------+------------+--------------------+
Changing the timezone
To change the timezone and get the time in UTC, we must set the Spark session timezone using the spark.sql.session.timeZone configuration parameter:
spark_session.conf.set('spark.sql.session.timeZone', 'UTC')
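After changing the setting, re-running the same transformation as before returns the timestamp in UTC (a short sketch reusing the df defined above):

df \
    .withColumn('current_date', current_date()) \
    .withColumn('current_timestamp', current_timestamp()) \
    .show()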
+--------+------------+--------------------+
|a_column|current_date| current_timestamp|
+--------+------------+--------------------+
| 1| 2020-10-04|2020-10-04 10:16:...|
+--------+------------+--------------------+
Relative dates
To calculate a relative date (for example, yesterday or tomorrow), we should use the date_sub and date_add functions:
from pyspark.sql.functions import col, date_add, date_sub

df \
    .withColumn('current_date', current_date()) \
    .withColumn('yesterday', date_sub(col('current_date'), 1)) \
    .withColumn('tomorrow', date_add(col('current_date'), 1)) \
    .show()
+--------+------------+----------+----------+
|a_column|current_date| yesterday| tomorrow|
+--------+------------+----------+----------+
| 1| 2020-10-04|2020-10-03|2020-10-05|
+--------+------------+----------+----------+
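Both functions also accept negative offsets, so date_add(date, -1) is equivalent to date_sub(date, 1). A minimal sketch (the column name yesterday_alt is mine, not from the original example):

df \
    .withColumn('current_date', current_date()) \
    .withColumn('yesterday_alt', date_add(col('current_date'), -1)) \
    .show()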
Relative dates with months
It is easy to add or subtract days, but what if we wanted to calculate a date three months from now? Months vary in length, so we cannot just add 90 days and expect a correct result. Instead, Spark provides the add_months function, which also accepts negative numbers, so we can use it to calculate the date three months ago:
from pyspark.sql.functions import add_months

df \
    .withColumn('current_date', current_date()) \
    .withColumn('in_three_months', add_months(col('current_date'), 3)) \
    .withColumn('three_months_ago', add_months(col('current_date'), -3)) \
    .show()
+--------+------------+---------------+----------------+
|a_column|current_date|in_three_months|three_months_ago|
+--------+------------+---------------+----------------+
| 1| 2020-10-04| 2021-01-04| 2020-07-04|
+--------+------------+---------------+----------------+
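One behavior worth knowing: when the input date is the last day of its month, add_months returns the last day of the resulting month. A minimal sketch with a hard-coded date (the literal and column names are only illustrative):

from pyspark.sql.functions import lit, to_date

df \
    .withColumn('end_of_january', to_date(lit('2020-01-31'))) \
    .withColumn('one_month_later', add_months(col('end_of_january'), 1)) \
    .show()
# one_month_later is 2020-02-29, the last day of February in a leap year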
The last day of the month
In addition to that, Spark supports calculating the last day of a given month with the last_day function:
from pyspark.sql.functions import last_day

df \
    .withColumn('current_date', current_date()) \
    .withColumn('end_of_the_month', last_day(col('current_date'))) \
    .show()
+--------+------------+----------------+
|a_column|current_date|end_of_the_month|
+--------+------------+----------------+
| 1| 2020-10-04| 2020-10-31|
+--------+------------+----------------+
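There is no matching first_day function, but the trunc function truncates a date to the first day of its month when called with the 'month' (or 'mm') format. A short sketch of that alternative:

from pyspark.sql.functions import trunc

df \
    .withColumn('current_date', current_date()) \
    .withColumn('start_of_the_month', trunc(col('current_date'), 'month')) \
    .show()
# start_of_the_month is 2020-10-01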
The first day of the previous month
The first day of the previous month is an example of date arithmetic where the operations get complicated, because Spark has no dedicated function for it. Instead, we have to calculate the last day of the current month, subtract two months, and add one day:
df \
    .withColumn('current_date', current_date()) \
    .withColumn('first_day_of_previous_month', date_add(add_months(last_day(col('current_date')), -2), 1)) \
    .show()
+--------+------------+---------------------------+
|a_column|current_date|first_day_of_previous_month|
+--------+------------+---------------------------+
| 1| 2020-10-04| 2020-09-01|
+--------+------------+---------------------------+
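For comparison, the same result can be reached by stepping back one month and truncating with trunc, as shown earlier. This shortcut is my variation, not the approach from the original snippet:

df \
    .withColumn('current_date', current_date()) \
    .withColumn('first_day_of_previous_month', trunc(add_months(col('current_date'), -1), 'month')) \
    .show()
# also yields 2020-09-01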