When to cache an Apache Spark DataFrame?

When should we cache a Spark DataFrame? Wouldn’t it be easier to cache everything? Well, no. That would be a terrible idea because when we use the cache function, Spark is going to spend some time serializing and storing the DataFrame even if we don’t reuse it later.

Table of Contents

  1. Cache only what is reused
  2. Make sure that you have all of the columns
  3. Use unpersist (sometimes)

In my opinion, there are three rules to follow when caching in Apache Spark:

Cache only what is reused

It is crucial to remember that caching a DataFrame that is used only once is a waste of resources and makes no sense. We should never do it.
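For illustration, here is a minimal sketch of a DataFrame that is worth caching. The dataset, column names, and filter condition are made up; the point is that the filtered DataFrame feeds two separate aggregations, so caching it saves a second read and filter of the source data.

# Hypothetical example: 'events' and its columns are placeholders.
recentEvents = events.filter(events.event_date >= '2022-01-01').cache()

# Both aggregations reuse the cached, filtered data instead of
# recomputing the filter from the source.
dailyCounts = recentEvents.groupBy('event_date').count()
userCounts = recentEvents.groupBy('user_id').count()

dailyCounts.show()
userCounts.show()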

Make sure that you have all of the columns

If one statement uses columns A, B, and C, but another statement needs columns B, C, and D, it makes no sense to cache either of those DataFrames separately. In this situation, we should cache the superset that contains all of the columns we are going to need (A, B, C, D), so both statements can use the cached data.

# This does not make sense:
firstDf = df.select('A', 'B', 'C').cache()
secondDf = df.select('B', 'C', 'D').cache()
# ... some operations that use firstDf and secondDf multiple times

Everything that happens before those cache calls must be calculated twice, and we keep two copies of the B and C columns. Instead, let’s do it like this:

superSet = df.select('A', 'B', 'C', 'D').cache()

firstDf = superSet.select('A', 'B', 'C')
secondDf = superSet.select('B', 'C', 'D')
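Note that cache is lazy: Spark materializes the cached data only when the first action on the DataFrame runs. If you want the superset to be stored before the derived DataFrames are used, you can trigger an action on it explicitly, for example:

# cache() only marks superSet for caching; this action materializes it.
superSet.count()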

Use unpersist (sometimes)

Usually, instructing Spark to remove a cached DataFrame is overkill and makes as much sense as assigning null to a no-longer-used local variable in a Java method. However, there is one exception.

Imagine that I have cached three DataFrames:

# 'something' stands for an arbitrary chain of transformations.
firstDf = df.something.cache()
secondDf = df.something.cache()
thirdDf = df.something.cache()

Now, I would like to cache more DataFrames, but I know that I no longer need the third one. I can use unpersist to tell Spark what it can remove from the cache. Otherwise, Spark uses a least-recently-used (LRU) eviction policy and may remove something I still want to use. By telling Spark what I no longer need, I avoid waiting for it to recompute data that was evicted simply because I had not used it for a while.
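Continuing the sketch above (df.something still stands for an arbitrary chain of transformations, and fourthDf is just an illustration), the eviction could look like this:

# Evict the DataFrame we no longer need so Spark does not have to
# pick a victim with its LRU policy.
thirdDf.unpersist()

# The freed memory can now hold the newly cached DataFrame.
fourthDf = df.something.cache()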
