Airflow Scheduler High Availability with Docker

Kaden Cho
Aug 29, 2020 · 3 min read

Today, I’ll cover high availability (HA) for the Airflow scheduler; worker HA has already been covered by many others. The Airflow scheduler used to be the only SPOF (single point of failure), and there have been many workarounds to keep a scheduler running through a disaster. Here, I put together a kind of POC of Airflow HA using Clairvoyant’s Failover Controller in a Docker environment.

Everything that follows is pushed to my GitHub page, so you can just clone it, run docker-compose up, and get your own HA-enabled Airflow containers on your machine.

I set up a docker-compose file to start with.
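Since the original screenshot of the compose file isn’t reproduced here, the following is only a minimal sketch of what such a docker-compose.yml could look like; the image name, volume paths, and credentials are placeholders rather than the exact values from my repository:

```yaml
version: "3.7"

services:
  master:                      # webserver, sshd and the failover controller
    image: my-airflow          # placeholder: an image with Airflow, the failover controller and sshd
    init: true                 # reap zombie sshd child processes
    hostname: master
    ports:
      - "8080:8080"            # Airflow webserver
    volumes:
      - ./config/airflow.cfg:/usr/local/airflow/airflow.cfg
      - ./config/sshd_config:/etc/ssh/sshd_config
      - ./keys:/usr/local/airflow/.ssh   # dummy SSH keys; never use in production
    depends_on:
      - mysql
      - rabbitmq

  worker:                      # Celery worker, sshd and the failover controller
    image: my-airflow
    init: true
    hostname: worker
    volumes:
      - ./config/airflow.cfg:/usr/local/airflow/airflow.cfg
      - ./config/sshd_config:/etc/ssh/sshd_config
      - ./keys:/usr/local/airflow/.ssh
    depends_on:
      - mysql
      - rabbitmq

  mysql:                       # persists the state of Airflow objects
    image: mysql:5.7
    environment:
      MYSQL_ROOT_PASSWORD: airflow
      MYSQL_DATABASE: airflow

  rabbitmq:                    # queues the Airflow (Celery) tasks
    image: rabbitmq:3-management
```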

Let’s focus on the master and worker services to address the points related to Airflow HA.

  • ‘init: true’: to avoid producing zombie sshd processes. Lots of tutorials recommend using supervisord, but I skipped it to focus solely on the core of Airflow HA. In any case, the Failover Controller needs SSH access between the target machines.
  • ‘hostname’: to make sure which machine I’m working on
  • ‘volumes’: to inject airflow.cfg, which was pre-edited with the additional configuration for the failover controller (see the sketch after this list), plus sshd_config and dummy sshd keys (do not use these in production)
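For reference, here is a sketch of the parts of airflow.cfg that matter for this setup. The values are illustrative, matching the placeholder compose file above; the [scheduler_failover] section is the one the controller’s `scheduler_failover_controller init` command appends, so verify the option names against the version you install:

```ini
[core]
executor = CeleryExecutor
sql_alchemy_conn = mysql://airflow:airflow@mysql:3306/airflow

[celery]
broker_url = amqp://guest:guest@rabbitmq:5672/
result_backend = db+mysql://airflow:airflow@mysql:3306/airflow

; Appended by `scheduler_failover_controller init`; only the option I would
; change is shown here -- the init command writes the full section.
[scheduler_failover]
; Comma-separated hostnames that are allowed to run the scheduler
scheduler_nodes_in_cluster = master,worker
```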

After you run the ‘docker-compose up’ command, you get the following containers:
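To bring everything up and check from the host (a sketch — the container names Compose generates depend on your project directory, so yours may differ):

```sh
# Start everything in the background and list the resulting containers
docker-compose up -d
docker-compose ps    # expect four containers: master, worker, mysql, rabbitmq
```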

You have two containers for the Airflow processes (webserver, scheduler, worker and sshd), one for MySQL to persist the state of Airflow objects, and the remaining one for RabbitMQ to queue the Airflow tasks.

Then, you can check the clean UI of Airflow at localhost:8080:

If you run ‘ps aux’ in each container (via docker exec, as sketched below), you get the following processes:
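A sketch of running that check from the host; the container names are placeholders for whatever names Compose generated for you, and ps assumes procps is available in the image:

```sh
# Inspect the processes inside each container
docker exec -it <master-container> ps aux
docker exec -it <worker-container> ps aux
```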

On the master container,

  • PID 1: the ‘docker-init’ entrypoint process, which triggered all the other processes
  • PID 7: the failover controller process, which regularly checks the liveness of the scheduler process and triggers a new one when there is none
  • PID 15: the SSH daemon
  • PID 32: the webserver
  • Others: gunicorn processes for the webserver

On the worker container (omitting the duplicates),

Now, let’s test the failover by manually dropping the scheduler process. After you kill the scheduler processes, you can see respawned scheduler processes within seconds.
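One way to run that test from the host, again as a sketch (the container name is a placeholder, and pkill assumes procps is available in the image):

```sh
# Kill the scheduler where it is currently running (the worker, in my case)
docker exec <worker-container> pkill -f "airflow scheduler"

# A few seconds later, the failover controller should have respawned it
docker exec <worker-container> ps aux | grep "airflow scheduler"
```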

If you drop the worker container, you can see the scheduler processes appear in the master container:
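And the container-level test, with placeholder names:

```sh
# Drop the container that was running the scheduler
docker stop <worker-container>

# After the failover controller's next poll, the scheduler should run on master
docker exec <master-container> ps aux | grep "airflow scheduler"
```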

Thank you for reading, and I hope you got a basic understanding of how HA with the Failover Controller works and where to look when you need something!
