Do you think that the Airflow scheduler is unintuitive? Have you tried to use intervals, and the only method that worked was “trial and error”? Does it look like the only way to set it up correctly is to tweak the settings until you somehow get it right? If so, this is a blog post for you ;)
I am going to explain the Airflow scheduler and maybe, just maybe, help you understand how it works.
Example
Imagine that Airflow is a person with a timer. Usually, we use Airflow to move data around, and recently I published summaries of the Data Janitor talks, so let’s assume that Airflow is a janitor.
We set the start_date to 9 am today and the interval to “@hourly”. It means that we have told Airflow to start working at 9 am and move some boxes at the beginning of every hour.
Our janitor comes to work at 9 am as expected. Do you think that he/she starts working right away? Obviously not. First, the coffee has to be made. After that, there is time for watching or reading the news. No work is done between 9 am and 10 am.
What happens at 10 am? The “hourly” interval has passed, and now it is time to move some data around. The same happens at 11 am and later, every hour. At this point, my example breaks down because the real janitor goes home at 5 pm, but Airflow is going to keep working and repeat the same job every hour. Continuously, even at night.
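The janitor's day can be sketched in a few lines of plain Python. This is only a simplified model of the behavior described above, not Airflow's actual scheduler code; the function name `trigger_times` and the example date are my own assumptions.

```python
from datetime import datetime, timedelta


def trigger_times(start_date, interval, count):
    """Return the first `count` moments at which a run is triggered.

    Simplified model (an assumption, not Airflow internals): a run is
    triggered only after a full interval has elapsed, so the first
    trigger is start_date + interval, never start_date itself.
    """
    return [start_date + interval * n for n in range(1, count + 1)]


# Arbitrary example day; only the time of day (9 am) matters here.
start = datetime(2024, 1, 1, 9, 0)

for moment in trigger_times(start, timedelta(hours=1), 3):
    print(moment.strftime("%H:%M"))
```

Notice that 09:00 never shows up in the list. In this model, the first entry is always one full interval after the start.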
Start date
The problem is that the “start_date” parameter is counterintuitive. The job does not start at this time. So what does start?
In Airflow, start_date is the time when Airflow starts the timer. It just starts measuring the time. The first run is triggered only after one full interval has passed.
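If you prefer to think about the timer itself, here is a minimal sketch that counts how many full intervals have passed since start_date at a given moment. Again, this is a toy model under my own assumptions (the helper `completed_intervals` is hypothetical and not part of Airflow); it only illustrates that nothing is due until the first interval is complete.

```python
from datetime import datetime, timedelta


def completed_intervals(start_date, interval, now):
    """Count how many full intervals have elapsed since start_date.

    Toy model: the timer starts at start_date, and a run becomes due
    each time one more full interval has elapsed.
    """
    if now < start_date:
        return 0  # the timer has not started yet
    # Dividing two timedeltas with // gives a whole number of intervals.
    return (now - start_date) // interval


start = datetime(2024, 1, 1, 9, 0)  # arbitrary example day, 9 am
hourly = timedelta(hours=1)

for clock in (
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 59),
    datetime(2024, 1, 1, 10, 0),
):
    print(clock.strftime("%H:%M"), completed_intervals(start, hourly, clock))
```

At 09:00 and 09:59 the count is still zero, so there is nothing to run. It reaches one at 10:00, which is when the first run is due.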