Complex scheduling schemas in Airflow

Introduction

Recently, we decided to overhaul the way we communicate with users through SMS. As part of this effort, we looked for a straightforward way to schedule batch SMS sending. We first tested dockerizing the application and scheduling it on AWS ECS, but that requires a fair amount of overhead, and we wanted a running proof of concept (POC) as fast as possible. So we thought: why not use Airflow?

At DT One, we use Airflow to run our various ETL (extract, transform, load) pipelines and report generators. It allows us to coordinate tasks, manage their dependencies, monitor their activity and recover from failure. And that’s also what we need from a framework that runs a batch SMS sender. We wanted to schedule the sender, define its dependencies on data extracts and monitor its activity. So we decided to try using Airflow for the mission.

Integrating an SMS sender as an Airflow DAG proved to be a swift process which allowed us to start testing very rapidly. But it came with one big pain point: scheduling. The next sections detail how DAGs are scheduled in Airflow, why we needed to ‘hack’ this scheduling mechanism, and what its unavoidable limitations are.

Scheduling in Airflow

Three variables with self-explanatory names control when a DAG (directed acyclic graph, Airflow’s unit of scheduled work) runs; a minimal example follows the list:

  • start_date (datetime.datetime object)
  • end_date (datetime.datetime object)
  • schedule_interval (cron expression as str or datetime.timedelta object)
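
For reference, here is a minimal sketch of how the three variables plug into a DAG definition. The dag_id and the placeholder task are hypothetical, and the imports follow the Airflow 1.x style:

    import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    # A DAG scheduled daily between the two dates.
    dag = DAG(
        dag_id="example_scheduling",  # hypothetical name
        start_date=datetime.datetime(2019, 12, 1),
        end_date=datetime.datetime(2019, 12, 31),
        schedule_interval="@daily",  # or a cron string / datetime.timedelta
    )

    # Placeholder task; a real DAG would do the actual work here.
    DummyOperator(task_id="placeholder", dag=dag)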

Airflow was conceived for ETL orchestration, and in ETL logic tasks run at the end of their configured period, to ensure that the data for that period is available. Take a daily task as an example: the run of the 9th of December waits for the data of the 9th of December, so it can only start once the 9th of December is over, i.e. on the 10th of December. This is why the Airflow scheduler executes a task at the end of its scheduling period. Likewise, if schedule_interval is set to every 3 days, the run of the 9th of December will be executed on the 12th of December. This is a little unintuitive at first, especially when you are not a data engineer. The creators of Airflow probably realized this, since the documentation adds the following emphasis:

    “Let’s Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.” (Airflow documentation)

Fully understanding this logic and applying it to repetitive tasks at fixed intervals is a little challenging at first, but it’s workable: even if you get it wrong, the task will eventually run, missing one interval in the worst case.
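
To make the timing concrete, here is a sketch of a daily DAG (the name is hypothetical) together with when its runs actually start:

    import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG(
        dag_id="daily_extract",  # hypothetical name
        start_date=datetime.datetime(2019, 12, 9),
        schedule_interval="@daily",
    )
    DummyOperator(task_id="extract", dag=dag)

    # The run stamped 2019-12-09 is triggered shortly after midnight on
    # 2019-12-10, once the period it covers (the 9th) has ended. With
    # schedule_interval=datetime.timedelta(days=3), the run stamped
    # 2019-12-09 would instead be triggered on 2019-12-12.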

Things get more complicated, though, when you need to schedule a task to run six times: on Monday the 9th of September and Tuesday the 10th of September, @ 9:00, 13:00 and 21:00 each day.

Scheduling sporadic runs in Airflow

In a nutshell, since the Airflow scheduler executes the task at the end of each period, the trick is to add one dummy period before the first run. The task is then executed for the first time at the end of that dummy period, which is exactly the desired timing of the first real run. Without the dummy period, the run stamped Monday the 9th @ 9:00 would only be executed at the end of its period, on Monday the 9th @ 13:00.

What follows is a recipe for scheduling the example above (a consolidated DAG sketch follows the steps):

  1. Construct a cron expression as if the task were supposed to run every week. In the given example, the cron expression is:

    0 9,13,21 * * 1,2
    

    Tip: https://crontab.guru is highly recommended (and be mindful of the time zone of the times you provide).

  2. Add the day before the first run as a dummy day (in this case Sunday) to the cron expression:

    0 9,13,21 * * 0,1,2
    

    This results in a task that runs every Sunday, Monday and Tuesday @ 9:00, 13:00 and 21:00.

  3. Set schedule_interval to the cron expression in the previous step.

    schedule_interval = "0 9,13,21 * * 0,1,2"
    
  4. Set start_date to a few minutes before the last run of the dummy day.

    start_date = datetime.datetime(2019, 9, 8, 20, 55)
    

    This tells Airflow that the first task to run is the one of Sunday @ 21:00, which will be executed at the end of the period, on Monday @ 9:00.

  5. Set end_date to after the last run of the last day (but before the runs of the week after).

    end_date = datetime.datetime(2019, 9, 10, 22)
    

    This tells Airflow to stop scheduling this task after the last execution.
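
Putting the five steps together, the DAG definition looks roughly like this (a sketch; the dag_id and the task are hypothetical placeholders for the actual batch SMS sender):

    import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG(
        dag_id="batch_sms_sender",  # hypothetical name
        # Steps 1-3: Sunday as the dummy day, then Monday and Tuesday,
        # each @ 9:00, 13:00 and 21:00.
        schedule_interval="0 9,13,21 * * 0,1,2",
        # Step 4: a few minutes before the dummy day's last run (Sunday @ 21:00).
        start_date=datetime.datetime(2019, 9, 8, 20, 55),
        # Step 5: after the last run (Tuesday @ 21:00), before next week's runs.
        end_date=datetime.datetime(2019, 9, 10, 22),
    )
    DummyOperator(task_id="send_sms", dag=dag)

The run stamped Sunday @ 21:00 is then executed on Monday @ 9:00, and so on, giving exactly the six desired executions.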

What the Airflow scheduler cannot do

Suppose you would like to schedule a task that runs at different times on different days. For example: Monday @ 9:00, 13:00 and 21:00, but Tuesday @ 9:00 and 21:00. In this case, the recipe above breaks: a single cron expression cannot tie hours to particular days of the week, and that stretches the Airflow scheduler one step too far. The easiest workaround is to schedule two (or several) DAGs, one per day, as sketched below.
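
As a sketch of that workaround (all names are hypothetical), one can generate a DAG per day in a loop, applying the dummy-period recipe to each day independently:

    import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    schedules = {
        # Monday @ 9:00, 13:00 and 21:00, with Sunday as the dummy day.
        "monday": ("0 9,13,21 * * 0,1",
                   datetime.datetime(2019, 9, 8, 20, 55),
                   datetime.datetime(2019, 9, 9, 22)),
        # Tuesday @ 9:00 and 21:00, with Monday as the dummy day.
        "tuesday": ("0 9,21 * * 1,2",
                    datetime.datetime(2019, 9, 9, 20, 55),
                    datetime.datetime(2019, 9, 10, 22)),
    }

    for day, (cron, start, end) in schedules.items():
        dag = DAG(
            dag_id="batch_sms_sender_" + day,
            schedule_interval=cron,
            start_date=start,
            end_date=end,
        )
        DummyOperator(task_id="send_sms", dag=dag)
        # Airflow discovers DAGs via module-level globals, so expose each one.
        globals()[dag.dag_id] = dag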

Conclusion

Airflow can be used to schedule more sporadic tasks. It has its scheduling limitations, but many of them can be worked around with some mental effort and a few tweaks. It turned out to be suboptimal for our long-term purposes; for our trial purposes, though, it was really helpful and allowed us to move fast and create a proof-of-concept batch SMS sender in a matter of days.

At DT One we believe in bringing our ideas into contact with the world as early as possible. This allows us to gather useful feedback which is then turned into insight that helps develop those ideas further. We aim for fast and frequent iterations of ideation, testing and development in order to make sure that we are building the right thing. In the case of SMS batch sending, the use of the Airflow scheduler permitted us to test several ways of communicating with our users in a week’s time. After having done that, we were better equipped to design our own custom SMS batch sending system.
