Step-by-Step Guide: Build Your First Local Airflow DAG in Just 5 Minutes

Apache Airflow is an open-source platform used to schedule and manage workflows. With Airflow, you can automate complex workflows such as data ingestion, ETL processes, and machine learning pipelines. In this blog, we’ll walk through the steps to set up and create your first Airflow Directed Acyclic Graph (DAG) locally on an Airflow standalone instance in Windows using Windows Subsystem for Linux (WSL).

If you're on Windows, you must install and use WSL, which provides a Linux environment to run Airflow seamlessly. In upcoming blogs, we will explore Docker-based setups and more, but for now, let's dive into the WSL-based setup.

Prerequisites

Before you begin setting up Airflow, ensure you have the following prerequisites:

  1. Windows Subsystem for Linux (WSL): You need to install WSL to create a Linux environment on your Windows machine. You can follow the official WSL installation guide if you haven't installed it yet.

  2. Python: Make sure Python is installed. Airflow 2.10.4 supports Python versions 3.8 through 3.12. I used Python 3.11 for this setup.

  3. Linux Commands: You will be working with Linux commands through WSL, so basic knowledge of Linux shell commands will be useful.

  4. Admin Permissions: You need administrator permissions for WSL and PowerShell.

Step 1: Installing WSL and Setting Up Airflow

1.1 Install WSL

To begin, open PowerShell as an administrator and run the following command to install WSL:

wsl --install

This command installs both WSL and the Ubuntu distribution of Linux. If you already have WSL installed, you can skip this step.
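
If you want an optional sanity check, you can ask WSL for its state and list your installed distributions from PowerShell (both are standard WSL flags on recent versions):

wsl --status
wsl --list --verbose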

1.2 Install Python and pip

Once WSL is set up, open your WSL terminal, either by typing wsl in PowerShell or by finding it in Windows Search and selecting the Run as administrator option. Then install Python 3 and pip by running:

sudo apt update
sudo apt install python3 python3-pip

This will install the latest version of Python 3, which is required for Airflow.

Use the command python3 --version to get your Python version.
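
It is also worth confirming that pip came along with it; both of these are standard checks, and the exact versions reported will depend on your Ubuntu release:

python3 --version
pip3 --version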

1.3 Install Apache Airflow

With Python and pip installed, it's time to install Apache Airflow. Airflow has some specific dependencies, so you need to use a constraint file that matches your Python version. Run the following command to install Airflow 2.10.4:

AIRFLOW_VERSION=2.10.4

# Extract the version of Python you have installed. If you're currently using a Python version that is not supported by Airflow, you may want to set this manually.
# See above for supported versions.
PYTHON_VERSION="$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"

CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
# For example this would install 2.10.4 with python 3.8: https://raw.githubusercontent.com/apache/airflow/constraints-2.10.4/constraints-3.8.txt

pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

This will install Airflow along with all required dependencies, pinned by the constraint file for your Python version (3.11 in my case). If you're using a different version of Python, be sure the constraints URL reflects it.

I’ve put 2.10.4 as the value of AIRFLOW_VERSION and 3.11 as PYTHON_VERSION, and this is what my final command looks like:

pip install "apache-airflow==2.10.4" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.10.4/constraints-3.11.txt"

Step 2: Starting Airflow in Standalone Mode

With Airflow installed, you can now start the service. In your WSL terminal, run:

airflow standalone

This command starts the Airflow web server and scheduler in standalone mode, which means you don't need to configure a separate database. Airflow will automatically set up everything locally for you.
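
By default, standalone mode keeps its configuration, metadata database, and logs under ~/airflow. If you want them somewhere else, you can export AIRFLOW_HOME before starting; this is entirely optional, and the rest of this guide assumes the default location:

export AIRFLOW_HOME=~/airflow   # optional; ~/airflow is already the default
airflow standalone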

Step 3: Accessing the Airflow UI

Once Airflow is up and running, you can access the Airflow UI through your browser. Open any browser and go to:

http://127.0.0.1:8080

Check the terminal output of the airflow standalone command; it prints the login credentials for your Airflow web interface. You can also find the generated password in the file standalone_admin_password.txt inside the ~/airflow directory in your home folder. After logging in, you’ll see the web interface where you can monitor and manage your workflows.
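
If the credentials scrolled past in the terminal, you can read the generated admin password directly from that file (assuming the default ~/airflow home):

cat ~/airflow/standalone_admin_password.txt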

Step 4: Create Your First Airflow DAG

A Directed Acyclic Graph (DAG) is the core unit of work in Airflow. It represents a workflow, composed of tasks, where each task can depend on others. In this example, we’ll create a simple DAG that includes two tasks: printing the current date and time and sleeping for 5 seconds.

4.1 Create the DAG Python File

Airflow looks for DAGs in the dags directory, which is defined in the airflow.cfg file. By default, this folder is located in ~/airflow. In the WSL terminal, navigate to this directory and create a new Python file for your DAG, for example, sample_dag.py.
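
One way to do that from the WSL terminal, assuming the default ~/airflow home and using nano as the editor (any editor works):

mkdir -p ~/airflow/dags
cd ~/airflow/dags
nano sample_dag.py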

Here’s the code for a simple DAG:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Define a simple Python function for the tasks
def print_date():
    from datetime import datetime
    print(f"Current date and time: {datetime.now()}")

def sleep_task():
    import time
    time.sleep(5)
    print("Sleep task completed!")

# Default arguments for the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
with DAG(
    dag_id='sample_dag',
    default_args=default_args,
    description='A simple example DAG',
    schedule=timedelta(days=1),  # Run once per day
    start_date=datetime(2025, 1, 1),  # Any fixed date in the past works
    catchup=False,  # Skip catching up for missed runs
    tags=['example'],
) as dag:

    # Task 1: Print the current date and time
    task_print_date = PythonOperator(
        task_id='print_date',
        python_callable=print_date,
    )

    # Task 2: Sleep for 5 seconds
    task_sleep = PythonOperator(
        task_id='sleep_task',
        python_callable=sleep_task,
    )

    # Define task dependencies
    task_print_date >> task_sleep

4.2 Place the DAG File in the Correct Directory

Ensure the Python file for the DAG is placed in the ~/airflow/dags folder. You can verify the exact location by checking the dags_folder value in the airflow.cfg file.
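
For example, you can print that setting straight from the terminal; on a default setup it resolves to the dags folder under your Airflow home:

grep dags_folder ~/airflow/airflow.cfg
# e.g. dags_folder = /home/<your-username>/airflow/dags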

Attaching a snapshot of a part of my airflow.cfg file for reference.

4.3 Understand the DAG Code

  • PythonOperator: This operator allows you to execute a Python function as a task within the DAG.

  • Dependencies: The line task_print_date >> task_sleep defines the order of execution, ensuring that task_print_date runs before task_sleep.
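
Before heading back to the UI, you can confirm from the CLI that Airflow has picked up the new file, and even run a single task in isolation without involving the scheduler. Both are standard Airflow CLI commands; the date below is just an example logical date:

airflow dags list | grep sample_dag
airflow tasks test sample_dag print_date 2025-01-01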

Step 5: Trigger the DAG

After you’ve created the DAG, return to the Airflow UI, refresh the page, and you should see the new DAG listed under the "DAGs" section. You may need to unpause the DAG for it to run. To unpause:

  1. Click on the toggle button to unpause the DAG.

  2. After unpausing, Airflow will automatically trigger the DAG as per its schedule. You can also trigger it manually from the UI by clicking the play button, or from the CLI as shown below.
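
If you prefer the terminal, the same two actions are available through the Airflow CLI, using the sample_dag id defined earlier:

airflow dags unpause sample_dag
airflow dags trigger sample_dag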

Step 6: Monitor Your DAG

Once the DAG runs, you can monitor its execution in the Airflow UI. You'll see logs for each task, which helps in debugging or checking the progress of each task in the DAG.
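
Task logs are also written to disk under the Airflow home, so you can inspect them outside the UI as well; the exact sub-folder layout varies between Airflow versions and configurations:

ls ~/airflow/logs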

Conclusion

In this blog, we've learned how to set up Apache Airflow locally using WSL, create a simple DAG with Python tasks, and monitor it through the Airflow UI. WSL provides an excellent way to run Airflow on Windows, and with just a few commands, you can automate complex workflows easily.

Stay tuned for upcoming blogs where I’ll show how to set up Airflow using Docker and discuss more advanced Airflow features, and please let me know if you’re facing any issues with the setup :)

Join the DevHub community for more such articles, free resources, job opportunities, and much more!