Apache Airflow and Regression Monitoring

Image credit: Apache Airflow

Introduction:

Airflow is a platform to programmatically author, schedule and monitor workflows or data pipelines. It was originally developed and open sourced by Airbnb, and later joined the Apache Software Foundation's incubator program in 2016. A workflow is a sequence of tasks defined as a Directed Acyclic Graph (DAG), which can be started on a schedule, triggered by an event, or kicked off from the command line interface. Airflow pipelines are configuration as code (Python), which lets you write code that instantiates pipelines dynamically.
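As a minimal sketch of what "configuration as code" looks like (Airflow 1.x-style imports are assumed, and the DAG and table names are made up for illustration), a pipeline is just Python, so a loop can stamp out tasks dynamically:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# The DAG itself is ordinary Python configuration.
dag = DAG(
    dag_id='example_pipeline',          # illustrative name
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
)

# Dynamic pipeline generation: one task is instantiated per table.
for table in ['orders', 'customers', 'payments']:
    BashOperator(
        task_id='load_{}'.format(table),
        bash_command='echo "loading {}"'.format(table),
        dag=dag,
    )
```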

Components:

[Figure: Airflow components]

Metadata DB:

Stores information such as job status and task instance status.

Scheduler:

The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. The scheduler is the brain behind setting up the workflows in Airflow. Execution begins at the DAG's start date and repeats at every schedule interval.
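To make the start date and schedule interval concrete, here is a small sketch (Airflow 1.x-style imports assumed; the DAG name is illustrative). Note that the run covering a given interval only executes once that interval has passed:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id='scheduling_demo',           # illustrative name
    start_date=datetime(2019, 1, 1),    # execution dates begin here...
    schedule_interval='@daily',         # ...and repeat every day
)

# The run for 2019-01-01 is triggered once that interval has elapsed,
# i.e. shortly after midnight on 2019-01-02 for a daily schedule.
DummyOperator(task_id='noop', dag=dag)
```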

Web Interface (UI):

Airflow ships with a Flask app that tracks all the defined workflows and lets you easily change, start or stop them. The rich user interface makes it easy to visualize pipelines running in production, monitor progress and troubleshoot issues.

CLI:

Airflow has a very rich command line interface that allows you to test, run, backfill, describe and clear parts of your DAGs.

Concepts:

DAG:

A DAG is the container used to organize tasks in a way that reflects their relationships and dependencies, and to set their execution context and order.
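A short sketch of a DAG acting as that container, with the ordering expressed through Airflow's bitshift dependency syntax (Airflow 1.x-style imports assumed; names are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='ordering_demo',             # illustrative name
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,             # run only when triggered
)

extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
transform = BashOperator(task_id='transform', bash_command='echo transform', dag=dag)
load = BashOperator(task_id='load', bash_command='echo load', dag=dag)

# The DAG captures the order: extract before transform, transform before load.
extract >> transform >> load
```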

Operators:

Operators are the workers that run the tasks. Workflows are defined by creating a DAG of operators. They are broadly classified into three types – Sensors, Operators and Transfers. Airflow provides many prebuilt operators for common tasks, and new operators can be created by inheriting from the BaseOperator class.
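For instance, a custom operator is just a subclass of BaseOperator with an execute() method. The operator below is a hypothetical example, not one of Airflow's built-ins (Airflow 1.x-style imports assumed):

```python
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class PublishReportOperator(BaseOperator):
    """Hypothetical operator that publishes a report to some destination."""

    @apply_defaults
    def __init__(self, report_path, *args, **kwargs):
        super(PublishReportOperator, self).__init__(*args, **kwargs)
        self.report_path = report_path

    def execute(self, context):
        # execute() is what Airflow calls when the task instance runs.
        self.log.info('Publishing report from %s', self.report_path)
```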

Tasks:

Once an operator is instantiated, it is referred to as a “task”. Each task is user defined and responsible for performing a specific operation in the workflow. Instantiating a task requires providing a unique task_id and the DAG it belongs to. A task can be a Python function or an external script that is invoked.
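A small sketch of instantiating tasks (the names, callable and script path are assumptions): one task wraps a Python function via PythonOperator, the other invokes an external script via BashOperator, and both receive a unique task_id and the DAG they belong to:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id='task_demo',                 # illustrative name
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)


def collect_results():
    print('collecting results')


# A task wrapping a Python function.
collect = PythonOperator(
    task_id='collect_results',          # must be unique within the DAG
    python_callable=collect_results,
    dag=dag,
)

# A task invoking an external script (hypothetical path).
report = BashOperator(
    task_id='run_report_script',
    bash_command='python /opt/scripts/generate_report.py',
    dag=dag,
)

collect >> report
```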

Example:

[Figure: Example regression workflow]

In this example, we showcase how Airflow can be used to express a workflow that generates the statistics/report as part of an end-to-end regression test suite involving multiple systems working together. A traditional approach would use something very basic, like a bunch of batch scripts without cron. The challenge is that this very easily gets tangled, and developers spend a lot of time figuring out where the log files are, what failed and why, and who owns what. Airflow solves this problem by orchestrating your processes, managing the logs, and providing a really good dashboard that visualizes what failed, along with much more information.

[Figure: Code snippet of the example DAG]
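As a rough idea of what such a DAG could look like (every task name, command and the schedule below are assumptions, not the original snippet), the regression suites for the individual systems fan out in parallel and a final task aggregates the statistics into a report:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='regression_report',         # illustrative name
    start_date=datetime(2019, 1, 1),
    schedule_interval='@daily',
)

prepare_env = BashOperator(
    task_id='prepare_env',
    bash_command='echo "preparing test environment"',
    dag=dag,
)

# One regression suite per participating system (system names are assumptions).
suites = [
    BashOperator(
        task_id='regression_{}'.format(system),
        bash_command='echo "running {} regression suite"'.format(system),
        dag=dag,
    )
    for system in ['api', 'ui', 'batch']
]

generate_report = BashOperator(
    task_id='generate_report',
    bash_command='echo "aggregating statistics and generating report"',
    dag=dag,
)

# Fan out to the suites, then fan in to the report task.
prepare_env >> suites
suites >> generate_report
```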

