
Getting started with Apache Airflow

Apache Airflow is a platform that provides the infrastructure for setting up, running and monitoring a defined sequence of tasks, arranged as a workflow. It is particularly useful for its ability to handle complex relationships between tasks. Apache Airflow is used for a wide range of workloads, including Experimentation, Growth Analytics, Search Ranking, Data Warehousing, Anomaly Detection, Infrastructure Monitoring and Email Targeting, and it is not limited to these.

Apache Airflow Components

Airflow has four core components: the Web Server, the Scheduler, the Executor and the Metadata Database.

*Web Server- The main GUI, a Flask app under the hood. It is used to track the status of jobs and read their logs.

*Scheduler- Responsible for scheduling jobs. It is a multithreaded Python process that uses the DAG objects to decide which tasks need to be run, when and where.

*Executor- It is the mechanism that gets the tasks done.

*Metadata Database- Governs how the other components interact by storing Airflow's state. All reads and writes of workflow state go through it.

Source: https://www.astronomer.io/guides/airflow-components

Airflow Concepts

DAG: A DAG, or Directed Acyclic Graph, is the collection of all the tasks (units of work) in a pipeline, organized by the relationships and dependencies between them. A task can be retried, but once it has completed and downstream tasks have begun it cannot be rerun; in other words, the pipeline can only move forward, never backwards.

Source: https://www.astronomer.io/guides/dags

DAGs describe how to run tasks. They are defined in Python files placed in the 'dags_folder' (configurable in airflow.cfg). There may be many DAG files in that folder, and each DAG file can define multiple tasks.
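As a rough sketch of what such a file might contain (the DAG ID, task IDs and bash commands are invented for illustration):

```python
# dags/example_dag.py -- a hypothetical DAG file living in the dags_folder
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",   # run once a day
    catchup=False,                # don't backfill runs before today
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting'")
    load = BashOperator(task_id="load", bash_command="echo 'loading'")

    # 'load' only starts after 'extract' has completed successfully
    extract >> load
```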

Operators: Operators determine what actually gets done. They should be atomic, that is, each should describe a single task in the workflow that does not need to share anything with other operators. Because Airflow can run the tasks of a single DAG on separate machines, it is best to keep operators independent of one another. Some operators provided by Airflow:

*BashOperator- used to run bash commands.
*PythonOperator- used to call Python functions.
*EmailOperator- used for sending emails.
*SimpleHttpOperator- used to make HTTP requests and read the response texts.
*Database operators (e.g. MySqlOperator, SqliteOperator, PostgresOperator, OracleOperator)- used to execute SQL commands.
*Sensor- used to wait for a certain event (like a file or a row in a database) or a point in time.
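To make this concrete, here is a minimal sketch that wires a sensor and an email notification together inside a DAG. The file path, e-mail address and IDs are invented, and EmailOperator also needs SMTP settings in airflow.cfg before it can actually send mail:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="operator_examples",       # hypothetical DAG
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,           # triggered manually
) as dag:
    # Sensor: wait for a file to appear before the pipeline continues
    wait_for_file = FileSensor(task_id="wait_for_file", filepath="/data/input.csv")

    # EmailOperator: send a notification (requires SMTP configured in airflow.cfg)
    notify = EmailOperator(
        task_id="notify",
        to="team@example.com",
        subject="Input file arrived",
        html_content="The input file has landed and the pipeline can start.",
    )

    wait_for_file >> notify
```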

Task: A task is an instance of an Operator. Instantiating an operator with its parameters creates the task that will actually be executed.
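The snippet below is a sketch of the example discussed next: a PythonOperator task with the ID 'report_blackouts' assigned to the variable 'energy_operator'. Everything else (the DAG ID and the body of the callable) is a placeholder, since the original snippet is not reproduced here:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def report_blackouts():
    # Placeholder body -- the real callable would build the blackout report
    print("Reporting blackouts...")

with DAG(
    dag_id="energy_pipeline",          # hypothetical DAG ID
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    energy_operator = PythonOperator(
        task_id="report_blackouts",
        python_callable=report_blackouts,
    )
```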

Here, 'energy_operator' is an instance of PythonOperator whose task ID is 'report_blackouts'.

How to install and use Apache Airflow

Prerequisites for Airflow:
*Python: 3.6, 3.7, 3.8
*Databases: PostgreSQL 9.6, 10, 11, 12, 13; MySQL 5.7, 8; SQLite 3.15.0+

The official way to install Airflow is via pip. Run 'pip install apache-airflow' in your terminal and pip will take care of the rest of the installation. In addition to that, Airflow is also available on Docker Hub as a Docker image.

Steps to use Apache Airflow in Docker- In Docker, Airflow runs as a container, and its UI is served on localhost:8080 by default. Here are the steps to use Airflow in Docker.

Step 1: Make a directory for your project.

Project Directory
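As the original screenshot is not reproduced here, the following is only a sketch of what the project directory looks like once the steps below are done (file names are illustrative):

```
airflow-project/
├── docker-compose.yaml      # created in Step 3
├── dockerfiles/
│   └── Dockerfile           # created in Step 2
└── dags/
    └── example_dag.py       # created in Step 4
```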

Step 2: Put your Dockerfile in a folder named 'dockerfiles'.
Step 3: Make a docker-compose file, which is written in YAML format.

Example of a Docker compose file

Step 4: Create a folder named 'dags' and put your Python DAG file in that folder.

Example of Python DAG file

Step 5: Use a volume mapping in the docker-compose file to mount the 'dags' folder into the container.
Step 6: In the project directory, open a terminal and run the command 'docker-compose up --build'.

By default, the Airflow UI will be served on localhost:8080.

Example of Airflow UI on the localhost

To make any changes to the workflow, edit the Python DAG file in the 'dags' folder.
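For example, starting from the hypothetical DAG file sketched earlier, you might change the schedule and append a task. Saving the file is enough, because the scheduler periodically re-parses the mounted 'dags' folder (the cleanup command here is made up):

```python
# dags/example_dag.py -- edited version of the hypothetical file from above
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",      # changed from "@daily" to run every hour
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting'")
    load = BashOperator(task_id="load", bash_command="echo 'loading'")
    # Newly added task: remove the temporary file once loading is done
    cleanup = BashOperator(task_id="cleanup", bash_command="rm -f /tmp/staging.csv")

    extract >> load >> cleanup
```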

Final Words

Airflow provides a platform that can run all your jobs on a schedule, letting you add more and more jobs as needed, and it makes potentially large data operations easy to manage. On top of that, Apache Airflow supports dependency management, and it is extensible, scalable and open source.

References:

1. https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html
2. https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
3. https://www.astronomer.io/guides/intro-to-airflow