
Apache Airflow: What is it and why you should start using it

11 February 2020 | Big Data

In this data-driven era, the number of open-source Big Data technologies has grown exponentially in just a few years. This multitude of options has brought with it a vast range of patterns and architectures for storing, processing, and visualizing data.

Each company getting into Big Data has been able to cherry-pick the set of technologies that best suits its specific needs. Although these needs diverge from one company to another, one crucial element remains the same: a reliable data workflow management system.

Apache Airflow is the technology that rapidly managed to become the de facto choice for managing data workflows. First introduced by Airbnb back in 2015, Airflow has gained a lot of popularity thanks to its robustness and the flexibility it gets from Python.

Airflow’s biggest perk is that it relies on code to define its workflows. Its users have complete freedom to decide what code to execute at every step of the pipeline. This makes the possibilities endless when working with Airflow, since the tool itself doesn’t impose any restrictions on how your workflows should function.

On top of that, Airflow also offers a very powerful and well-equipped UI, which makes tracking jobs, rerunning them, and configuring the platform very easy:

Airflow’s UI

Beyond its resourceful UI, Airflow relies mainly on four core elements that allow it to simplify any given pipeline:

  • DAGs (Directed Acyclic Graphs): Airflow relies on DAGs to structure batch jobs in an extremely efficient way. This concept lets you structure your pipeline in whatever way suits it best, with no hurdles or additional complexity (a minimal sketch of a DAG file follows the figure below).

Example of an Airflow DAG
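To make this concrete, here is a minimal sketch of what a DAG file can look like. The DAG id, schedule, and task ids are placeholder values, Airflow 1.10-style imports are assumed, and dummy operators are used just to show the graph structure:

```python
# Minimal sketch of a DAG file (Airflow 1.10-style imports assumed).
# The DAG id, schedule, and task ids are purely illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG(
    dag_id="example_etl",                # hypothetical name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract = DummyOperator(task_id="extract")
    transform = DummyOperator(task_id="transform")
    load = DummyOperator(task_id="load")

    # The >> operator declares the edges of the graph:
    # extract runs first, then transform, then load.
    extract >> transform >> load
```

Because the graph is plain Python, you can build branches or generate tasks in a loop using ordinary language constructs.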

  • Tasks: Airflow’s DAGs are divided into tasks, which contain the code that actually gets executed. Through its operators, Airflow lets you put whatever you want in these tasks, whether it’s a Python function, a shell command, or even an Apache Spark job (see the sketch after the figure below).

A task being executed within an Airflow DAG
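As a sketch of that flexibility (hypothetical DAG and commands, Airflow 1.10-style imports), the following pipeline mixes a Python function with a shell command; a Spark job would simply use the corresponding Spark submit operator instead:

```python
# Sketch of two tasks built from different operators (illustrative only).
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator

def transform_data():
    # Any Python code can run here: pandas, API calls, model scoring, etc.
    print("transforming data")

with DAG(
    dag_id="mixed_operators",            # hypothetical name
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,              # triggered manually in this sketch
) as dag:
    transform = PythonOperator(
        task_id="transform",
        python_callable=transform_data,
    )
    archive = BashOperator(
        task_id="archive",
        bash_command="echo 'archiving output'",  # placeholder command
    )

    transform >> archive
```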

  • Scheduler: One of Airflow’s most prominent advantages is that, unlike many other workflow management tools in the Big Data ecosystem, it comes with its own scheduler. This makes working with Airflow even more convenient, because it can take care of scheduling on its own (the scheduling options are declared on the DAG itself, as sketched after the figure below).

Airflow’s scheduler
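The scheduling behaviour is declared directly on the DAG; once the scheduler process is running, it triggers runs by itself. A sketch with illustrative values:

```python
# Sketch of the scheduling-related arguments of a DAG (illustrative values).
from datetime import datetime, timedelta

from airflow import DAG

dag = DAG(
    dag_id="scheduled_pipeline",          # hypothetical name
    start_date=datetime(2020, 1, 1),      # first date the scheduler considers
    schedule_interval="0 6 * * *",        # cron expression: every day at 06:00
    catchup=False,                        # don't backfill missed intervals
    default_args={
        "retries": 2,                     # re-run a failed task twice
        "retry_delay": timedelta(minutes=5),
    },
)
```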

  • XCom: Through XComs (cross-communications), Airflow allows its users to pass information between different tasks via its own metadata database. This makes working with complex pipelines much simpler, since variables can be handed from one task to the next without persisting them in a separate tool (a sketch follows the figure below).

 

Example of an XCom variable
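Here is a rough sketch of passing a value from one task to another through XCom; the task ids and the returned path are purely illustrative, with Airflow 1.10-style imports:

```python
# Sketch of XCom: a value returned by one task is stored in Airflow's
# metadata database and pulled back by a downstream task.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def produce():
    # The return value is pushed to XCom automatically.
    return "s3://bucket/daily/output.parquet"   # hypothetical value

def consume(**context):
    # Pull the value that the "produce" task pushed.
    path = context["ti"].xcom_pull(task_ids="produce")
    print("processing %s" % path)

with DAG(
    dag_id="xcom_example",               # hypothetical name
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
) as dag:
    producer = PythonOperator(task_id="produce", python_callable=produce)
    consumer = PythonOperator(
        task_id="consume",
        python_callable=consume,
        provide_context=True,            # required in Airflow 1.x to receive the context
    )

    producer >> consumer
```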

Airflow also benefits from a very large community and elaborate documentation that leaves no questions unanswered. Among the companies that are heavily betting on the platform and using it extensively, we can mention Bloomberg, Alibaba, and Lyft.

Thanks to this set of features and to the simplicity with which you can configure an Airflow server (you can do so via a small set of commands), it’s no wonder that Airflow rapidly managed to establish itself as the go-to workflow management platform of this era.

Interested in Big Data? Have a look at our other article!