How to use Airflow backfill to run DAGs for a specified date in the past?

Have you created a new Airflow DAG, but now you have to run it using every data snapshot created during the last six months? Don’t worry. You don’t need to experiment with the start_date parameter.

Table of Contents

  1. An example usage of the Airflow backfill feature
  2. How about re-running completed DAGs?
  3. Important note about SSH

An example usage of the Airflow backfill feature

“Backfilling” means running an Airflow DAG for a specified date in the past. The Airflow command-line interface provides a convenient command to run such backfills.

First, I have to log in to the server that is running the Airflow scheduler. If Airflow is running inside a Docker container, I have to access the command line of the container, for example like this:

docker exec -it container_id /bin/sh

To run the backfill command, I need three things: the identifier of the DAG, the start date, and the end date (note that Airflow stops one day before the end date, so the end date is not inclusive).

When I have the required information, I can run the command to start the backfill. In this case, I am running the test_dag DAG with the execution dates set to 2019-01-01, 2019-01-02, and 2019-01-03.

airflow backfill -s 2019-01-01 -e 2019-01-04 test_dag
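
Because the end date is described above as not inclusive, the value passed to -e is the last date I want to run plus one day. Here is a minimal sketch of that calculation. It assumes GNU date (the -d option does not work the same way in BSD/macOS date), and the variable names LAST_DATE and END_DATE are hypothetical:

```shell
# Last execution date that should be part of the backfill.
LAST_DATE="2019-01-03"

# Add one day to get the exclusive end date.
# Assumes GNU date; -u keeps the calculation in UTC to avoid DST surprises.
END_DATE=$(date -u -d "$LAST_DATE +1 day" +%Y-%m-%d)

# Print the command instead of running it, so it can be reviewed first.
echo "airflow backfill -s 2019-01-01 -e $END_DATE test_dag"
```

Printing the command first makes it easy to check the date range before starting a long backfill.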

How about re-running completed DAGs?

The backfill command does not re-run completed DAG runs within the given period unless we explicitly instruct it to do so. Therefore, if there was already a DAG run for 2019-01-02 and I would like to repeat it, I have to add --reset_dagruns to the airflow backfill command.

airflow backfill -s 2019-01-01 -e 2019-01-04 --reset_dagruns test_dag

Important note about SSH

Airflow backfill does not schedule all DAG runs at once! It starts the first one, waits until it finishes, and then schedules the next one.

Because of that, I need to keep an active SSH connection to the Airflow server until the backfill schedules the last DAG run. If I got disconnected before Airflow had scheduled the last DAG run, it would finish the currently running one and never schedule the remaining ones.

To avoid problems in case of a lost internet connection, I suggest using the screen application to start a durable terminal session on the Airflow server and run the backfill command inside the screen session.
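
A minimal sketch of that setup, assuming screen is installed on the Airflow server. The session name airflow_backfill and the helper function start_backfill_session are hypothetical names chosen for this example:

```shell
# Hypothetical session name; any label works.
SESSION_NAME="airflow_backfill"

# The same backfill command as in the example above.
BACKFILL_CMD="airflow backfill -s 2019-01-01 -e 2019-01-04 test_dag"

# Start the backfill inside a detached screen session.
# The session keeps running after the SSH connection drops
# and closes by itself when the command exits.
start_backfill_session() {
    screen -dmS "$SESSION_NAME" sh -c "$BACKFILL_CMD"
}

# On the Airflow server:
#   start_backfill_session        # start the backfill in the background
#   screen -ls                    # list running screen sessions
#   screen -r airflow_backfill    # reattach; Ctrl-A followed by D detaches again
```

Starting screen interactively (screen -S airflow_backfill) and typing the backfill command inside the session works as well, and keeps the output visible after the command finishes.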
