Step-by-Step Guide: Build Your First Local Airflow DAG in Just 5 Minutes
Apache Airflow is an open-source platform used to schedule and manage workflows. With Airflow, you can automate complex workflows such as data ingestion, ETL processes, and machine learning pipelines. In this blog, we’ll walk through the steps to set up and create your first Airflow Directed Acyclic Graph (DAG) locally on an Airflow standalone instance on Windows using Windows Subsystem for Linux (WSL).
If you're on Windows, you must install and use WSL, which provides a Linux environment to run Airflow seamlessly. In upcoming blogs, we will explore Docker-based setups and more, but for now, let's dive into the WSL-based setup.
Prerequisites
Before you begin setting up Airflow, ensure you have the following prerequisites:
Windows Subsystem for Linux (WSL): You need to install WSL to create a Linux environment on your Windows machine. You can follow the official WSL installation guide if you haven't installed it yet.
Python: Make sure Python is installed. Airflow 2.10.4 supports Python versions 3.8 through 3.12. I used Python 3.11 for this setup.
Linux Commands: You will be working with Linux commands through WSL, so basic knowledge of Linux shell commands will be useful.
Admin Permissions: You need admin permissions for WSL and PowerShell.
Step 1: Installing WSL and Setting Up Airflow
1.1 Install WSL
To begin, open PowerShell as an administrator and run the following command to install WSL:
wsl --install
This command installs both WSL and the Ubuntu distribution of Linux. If you already have WSL installed, you can skip this step.
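If you’re not sure whether WSL is already present, one quick check (assuming a reasonably recent Windows build) is to list installed distributions from PowerShell:
wsl --list --verbose
If this prints an Ubuntu entry, you can skip straight to the next step.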
1.2 Install Python and pip
Once WSL is set up, open your WSL terminal, either by typing wsl in PowerShell or by launching it from Windows Search and selecting the Run as administrator option. Then install Python 3 and pip by running:
sudo apt update
sudo apt install python3 python3-pip
This installs the version of Python 3 packaged with your Ubuntu release, along with pip, both of which are required for Airflow.
Use the command python3 --version to check your Python version.
1.3 Install Apache Airflow
With Python and pip installed, it's time to install Apache Airflow. Airflow has some specific dependencies, so you need to use a constraint file that matches your Python version. Run the following command to install Airflow 2.10.4:
AIRFLOW_VERSION=2.10.4
# Extract the version of Python you have installed. If you're currently using a Python version that is not supported by Airflow, you may want to set this manually.
# See above for supported versions.
PYTHON_VERSION="$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
# For example this would install 2.10.4 with python 3.8: https://raw.githubusercontent.com/apache/airflow/constraints-2.10.4/constraints-3.8.txt
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
This will install Airflow along with all required dependencies. If you’re using a different Python version than mine (3.11), the constraint URL adjusts automatically via the PYTHON_VERSION variable above.
I’ve put 2.10.4 as the value for the variable AIRFLOW_VERSION and 3.11 for PYTHON_VERSION, and this is what my final command looks like:
pip install "apache-airflow==2.10.4" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.10.4/constraints-3.11.txt"
Step 2: Starting Airflow in Standalone Mode
With Airflow installed, you can now start the service. In your WSL terminal, run:
airflow standalone
This command starts the Airflow web server and scheduler in standalone mode, which means you don't need to configure a separate database. Airflow will automatically set up everything locally for you.
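A quick note on where Airflow keeps its files: standalone mode initializes its database, logs, and configuration under AIRFLOW_HOME, which defaults to ~/airflow. If you ever want them elsewhere, you can export the variable before starting; the path below just restates the default as an example:
# Optional: pick where Airflow stores its config, DB, and logs (default is ~/airflow)
export AIRFLOW_HOME=~/airflow
airflow standalone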
Step 3: Accessing the Airflow UI
Once Airflow is up and running, you can access the Airflow UI through your browser. Open any browser and go to:
http://127.0.0.1:8080
Check the CLI output after executing the airflow standalone command: you’ll be given login credentials for the Airflow web interface. You can also find the password in the file standalone_admin_password.txt inside the ~/airflow directory in your home folder. Once logged in, you’ll see the web interface where you can monitor and manage your workflows.
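For example, to print the generated admin password directly from the WSL terminal (assuming the default AIRFLOW_HOME of ~/airflow):
cat ~/airflow/standalone_admin_password.txt
The username for this auto-created account is admin.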
Step 4: Create Your First Airflow DAG
A Directed Acyclic Graph (DAG) is the core unit of work in Airflow. It represents a workflow, composed of tasks, where each task can depend on others. In this example, we’ll create a simple DAG that includes two tasks: printing the current date and time and sleeping for 5 seconds.
4.1 Create the DAG Python File
Airflow looks for DAGs in the dags directory, which is defined in the airflow.cfg file. By default, this folder is ~/airflow/dags. In the WSL terminal, navigate to this directory and create a new Python file for your DAG, for example sample_dag.py.
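For example, you could create the folder (in case it doesn’t exist yet) and open the new file in a terminal editor; nano here is just one option:
mkdir -p ~/airflow/dags
nano ~/airflow/dags/sample_dag.py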
Here’s the code for a simple DAG:
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from datetime import timedelta

# Define a simple Python function for the tasks
def print_date():
    from datetime import datetime
    print(f"Current date and time: {datetime.now()}")

def sleep_task():
    import time
    time.sleep(5)
    print("Sleep task completed!")

# Default arguments for the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
with DAG(
    dag_id='sample_dag',
    default_args=default_args,
    description='A simple example DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(1),  # Start 1 day ago
    catchup=False,  # Skip catching up for missed runs
    tags=['example'],
) as dag:
    # Task 1: Print the current date and time
    task_print_date = PythonOperator(
        task_id='print_date',
        python_callable=print_date,
    )

    # Task 2: Sleep for 5 seconds
    task_sleep = PythonOperator(
        task_id='sleep_task',
        python_callable=sleep_task,
    )

    # Define task dependencies
    task_print_date >> task_sleep
4.2 Place the DAG File in the Correct Directory
Ensure the Python file for the DAG is placed in the ~/airflow/dags folder. You can verify the exact location by checking the dags_folder value in the airflow.cfg file. Attaching a snapshot of a part of my airflow.cfg file for reference.
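If you’d rather verify from the terminal than open the file, something like the following should work (paths assume the default AIRFLOW_HOME of ~/airflow):
# Show the configured DAGs folder
grep dags_folder ~/airflow/airflow.cfg
# List the DAGs Airflow can currently see (new files can take a moment to be picked up)
airflow dags list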
4.3 Understand the DAG Code
PythonOperator: This operator allows you to execute a Python function as a task within the DAG.
Dependencies: The line task_print_date >> task_sleep defines the order of execution, ensuring that task_print_date runs before task_sleep.
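To confirm Airflow parsed the dependency correctly, you can print the task tree from the CLI; the --tree flag is available in Airflow 2.x, though it may show a deprecation notice in newer releases:
airflow tasks list sample_dag --tree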
Step 5: Trigger the DAG
After you’ve created the DAG, return to the Airflow UI, refresh the page, and you should see the new DAG listed under the "DAGs" section. You may need to unpause the DAG before it will run: click the toggle button next to the DAG name to unpause it.
After unpausing, Airflow will automatically trigger the DAG as per its schedule. You can also manually trigger it from the UI by clicking the play button.
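If you prefer the terminal over the UI, the same two actions are available as standard Airflow CLI commands:
# Unpause the DAG so the scheduler will run it
airflow dags unpause sample_dag
# Manually trigger a run right away
airflow dags trigger sample_dag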
Step 6: Monitor Your DAG
Once the DAG runs, you can monitor its execution in the Airflow UI. You'll see logs for each task, which helps in debugging or checking the progress of each task in the DAG.
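You can also keep an eye on runs from the terminal. Both commands below are standard Airflow 2.x CLI; airflow tasks test is especially handy because it runs a single task in isolation and prints its log straight to your console (the date argument is just an example):
# List recent runs of the DAG
airflow dags list-runs -d sample_dag
# Run one task by itself, outside the scheduler, and see its output
airflow tasks test sample_dag print_date 2024-01-01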
Conclusion
In this blog, we've learned how to set up Apache Airflow locally using WSL, create a simple DAG with Python tasks, and monitor it through the Airflow UI. WSL provides an excellent way to run Airflow on Windows, and with just a few commands, you can automate complex workflows easily.
Stay tuned for upcoming blogs where I’ll show how to set up Airflow using Docker and discuss more advanced Airflow features, and please let me know if you’re facing any issues with the setup :)
Join the DevHub community for more such articles, free resources, job opportunities, and much more!