ETL Pipeline

ETL stands for Extract, Transform, and Load: data is extracted from a source, transformed, and then loaded into a target database or data warehouse.

During extraction, data is pulled from several heterogeneous sources, such as business systems, applications, sensors, and databases. The next stage, transformation, converts the raw data into a format that can be used by various applications. Finally, the load stage writes the transformed data into the target database or data warehouse.
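To make the three stages concrete, here is a minimal ETL sketch in plain Python; the CSV source file, the cleaning rule, and the SQLite target are assumptions made purely for illustration.

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a source (here, a hypothetical CSV file).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: convert raw records into the shape the target expects.
    return [(row["id"], row["name"].strip().title()) for row in rows]

def load(records, db_path="warehouse.db"):
    # Load: write the transformed records into the target database.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS users (id TEXT, name TEXT)")
    con.executemany("INSERT INTO users VALUES (?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("source.csv")))
```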

Apache Airflow

The Airflow platform is a tool for describing, executing, and monitoring workflows. Airflow was started in October 2014 by Maxime Beauchemin at Airbnb. It was open source from the very first commit, brought officially under the Airbnb GitHub organization, and announced in June 2015.

Tasks can be scheduled using a time-based scheduler, much like cron. When one task completes, its downstream tasks run next; if a task fails, the task graph UI shows its status and Airflow can send an email notification.
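As a rough sketch of how this looks in Airflow code, the DAG below wires two tasks together, schedules them with a cron expression, and asks for an email on failure; the task commands and the alert address are placeholders, not part of any real pipeline.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "email": ["alerts@example.com"],  # hypothetical notification address
    "email_on_failure": True,         # send an email when a task fails
}

with DAG(
    dag_id="nightly_pipeline",
    default_args=default_args,
    schedule_interval="0 22 * * *",   # cron-style schedule: 10pm every night
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # load starts as soon as extract completes successfully; if either task
    # fails, the graph view shows the failure and the email alert goes out.
    extract >> load
```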

Airflow is an open source project. It ships with a Flask app that tracks all the defined workflows and lets you easily change, start, or stop them. You can also work from the command line, but the web interface is more intuitive. All workflow configuration in Airflow is written in Python.

Concepts

Below are some of the concepts required to understand and run Airflow.

Directed Acyclic Graphs (DAGs)

A DAG is defined in a Python script, which represents the tasks and their dependencies as code. A simple DAG could consist of three tasks: A, B, and C. It could say that A has to run successfully before B can run, but C can run anytime. It could say that task A times out after 5 minutes, and B can be restarted up to 5 times in case it fails. It might also say that the workflow will run every night at 10pm, but shouldn't start until a certain date.
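Expressed as code, that example DAG might look roughly like the sketch below. It assumes a recent Airflow 2 release, and EmptyOperator is just a stand-in for real work; the DAG id and dates are illustrative.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="abc_example",
    schedule_interval="0 22 * * *",   # run every night at 10pm...
    start_date=datetime(2024, 6, 1),  # ...but don't start before this date
    catchup=False,
) as dag:
    a = EmptyOperator(
        task_id="A",
        execution_timeout=timedelta(minutes=5),  # A times out after 5 minutes
    )
    b = EmptyOperator(
        task_id="B",
        retries=5,  # B can be restarted up to 5 times in case it fails
    )
    c = EmptyOperator(task_id="C")  # C has no dependencies; it can run anytime

    a >> b  # A has to run successfully before B can run
```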

When searching for DAGs, Airflow only considers Python files that contain the strings “airflow” and “DAG” by default.