Creating your first Apache Airflow DAG

15 August 2023 | Big Data

Over the past few years, Apache Airflow has established itself as the go-to workflow management tool in modern data ecosystems. One of the main reasons Airflow became so popular so quickly is its simplicity and how easy it is to get up and running. This article will guide you through your first steps with Apache Airflow, all the way to the creation of your first Directed Acyclic Graph (DAG).

Installing Airflow

Airflow’s ease of use starts with the installation process, since a single pip command is enough to get all of its core components:

pip install apache-airflow
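
As a quick sanity check (assuming the airflow executable is now on your PATH), you can print the installed version:

airflow version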

Adding extra packages to support specific features (such as integration with your cloud provider) is just as seamless. For example, if we opt to add the Microsoft Azure subpackage, the command would be as follows:

pip install 'apache-airflow[azure]'

Afterward, you only need to initialize a database for Airflow to store its metadata. The simplest way to start is with the default SQLite database, but you can also point Airflow to another backend such as PostgreSQL or MySQL. To initialize the database, you only need to run the following command:

airflow initdb

Creating your first DAG

In a previous article on INVIVOO’s blog, we presented the main concepts that Airflow relies on. One of these concepts is the DAG, which is how Airflow organizes the tasks it needs to run and the dependencies between them.

In Airflow, a DAG is simply defined in a Python script that describes a set of tasks and their dependencies. What each task does is determined by the task’s operator. For example, a task defined with the PythonOperator will run Python code, while a task defined with the BashOperator will run a bash command.
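
To make this concrete, here is a minimal, self-contained sketch (separate from the DAG we will build below) of a task defined with the PythonOperator; the DAG id, task id and greet function are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# the function that the task will execute
def greet():
    print("Hello from Airflow!")

with DAG(
    'python_operator_example',
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
) as example_dag:
    hello_task = PythonOperator(
        task_id='hello_task',
        python_callable=greet,  # the task simply calls this function
    )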

To create our first DAG, let’s start by importing the necessary modules:

# We'll start by importing the DAG object
from airflow import DAG
# We need to import the operators used in our tasks
from airflow.operators.bash_operator import BashOperator
# We then import the days_ago function
from airflow.utils.dates import days_ago
# timedelta is needed for the retry delay and the schedule interval
from datetime import timedelta

We can then define a dictionary containing the default arguments that we want to pass to our DAG. These arguments will be applied to all of the DAG’s operators.

# initializing the default arguments that we'll pass to our DAG
default_args = {
    'owner': 'airflow',
    'start_date': days_ago(5),
    'email': ['airflow@my_first_dag.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

As we can see, Airflow accepts a wide range of default arguments that make DAG configuration even simpler. For example, we can easily define the number of retries and the retry delay that will apply to the DAG’s tasks (and, as we will see below, individual tasks can override these defaults).

The DAG itself can then be instantiated as follows:

my_first_dag = DAG(
    'first_dag',
    default_args=default_args,
    description='Our first DAG',
    schedule_interval=timedelta(days=1),
)

The first parameter, 'first_dag', is the DAG’s ID, and schedule_interval defines the interval between two consecutive runs of our DAG.

The next step consists of defining the tasks of our DAG:

task_1 = BashOperator(
    task_id='first_task',
    bash_command='echo 1',
    dag=my_first_dag,
)

task_2 = BashOperator(
    task_id='second_task',
    bash_command='echo 2',
    dag=my_first_dag,
)
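
As mentioned above, the values in default_args can be overridden at the task level. As a purely hypothetical example, a task that needs more attempts than the others could declare its own retries value:

# hypothetical task overriding the retries value from default_args
fragile_task = BashOperator(
    task_id='fragile_task',
    bash_command='echo 3',
    retries=3,  # takes precedence over default_args['retries']
    dag=my_first_dag,
)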

Lastly, we just need to specify the dependencies. In our case, we want task_2 to run after task_1:

task_1.set_downstream(task_2)
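
Airflow also supports a bitshift shorthand that declares the same dependency more concisely:

# equivalent to task_1.set_downstream(task_2)
task_1 >> task_2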

Running the DAG

By default, DAGs should be placed in the ~/airflow/dags folder. Before scheduling anything, we can check each task individually with the airflow test command, which runs a single task for a given execution date without recording its state:
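
airflow test first_dag first_task 2020-03-01

Here, first_dag and first_task are the dag_id and task_id defined above, and the date is an arbitrary execution date. Once we have verified that everything is configured correctly, we can use the airflow backfill command to run our DAG over a specific range of dates: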

airflow backfill first_dag -s 2020-03-01 -e 2020-03-05

Finally, we only need to launch Airflow’s scheduler with the airflow scheduler command, and Airflow will take care of running our DAG at the defined interval.
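
To follow the runs visually, you can also start the web interface with the airflow webserver command and open it in your browser (it listens on port 8080 by default):

airflow webserver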