Understanding DAG Scheduling in Apache Airflow

Hey there! Ready to dive into Apache Airflow and make your workflows run like a charm? Let’s explore how to schedule your Directed Acyclic Graphs (DAGs) effectively. We’ll cover the key scheduling parameters and the tools for backfilling, then test your knowledge with a few questions. Let’s get started!

Key DAG Scheduling Parameters

To kick things off, let’s break down the essential scheduling parameters. Each one plays a unique role in controlling when and how your DAG runs.

catchup

  • Purpose: Imagine your DAG was paused for a while, or its start_date lies in the past. Should it catch up on every run it missed? That’s what catchup controls.
  • Default Value: True in Airflow 2.x (configurable through the catchup_by_default setting)
  • Valid Values: True or False
  • Think about it: If you don’t want your DAG to run all missed intervals when you turn it back on, what value should you set for catchup?
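
To make the effect of catchup concrete, here is a minimal sketch in plain Python (no Airflow required). It is a simplified model, not Airflow's scheduler code: the function name intervals_to_run and the dates are made up for illustration, and it assumes a run only becomes eligible once its interval has fully elapsed.

```python
from datetime import datetime, timedelta


def intervals_to_run(start_date, now, interval, catchup):
    """Return the start of each completed interval that would get a run.

    Hypothetical helper: a simplified model of the scheduler's behaviour.
    """
    starts = []
    current = start_date
    # An interval is eligible only once it has fully elapsed.
    while current + interval <= now:
        starts.append(current)
        current += interval
    # With catchup disabled, keep only the most recent completed interval.
    if not catchup and starts:
        return starts[-1:]
    return starts


start = datetime(2024, 7, 1)
now = datetime(2024, 7, 5, 12, 0)
daily = timedelta(days=1)

print(len(intervals_to_run(start, now, daily, catchup=True)))  # 4
print(intervals_to_run(start, now, daily, catchup=False))      # only the July 4 interval
```

With catchup=True the model yields one run per missed day; with catchup=False it yields just the latest one.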

start_date

  • Purpose: This is the date and time your DAG should begin its journey. It’s the starting point for all its scheduled runs.
  • Default Value: None (You need to set this)
  • Valid Values: A datetime object or a string like YYYY-MM-DD HH:MM:SS
  • Think about it: When do you want your DAG to kick off? Choose your start_date wisely!

end_date

  • Purpose: Want your DAG to stop running after a certain date? Use end_date to specify that.
  • Default Value: None (Optional)
  • Valid Values: A datetime object or a string like YYYY-MM-DD HH:MM:SS
  • Think about it: Do you need an end date for your DAG, or should it run indefinitely?

schedule_interval

  • Purpose: This defines how often your DAG should run. You can use cron expressions, timedelta objects, or predefined presets. (Airflow 2.4 and later prefer the equivalent schedule argument.)
  • Default Value: timedelta(days=1), which runs the DAG once a day
  • Valid Values: Cron expressions (e.g., 0 14 * * * for 2 PM daily), timedelta objects (e.g., timedelta(days=1)), or predefined presets (e.g., @hourly, @daily, @weekly)
  • Think about it: How frequently should your DAG run? Every hour, day, or maybe once a week?
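
If the three forms feel abstract, the sketch below may help. It is plain Python and only an illustration: PRESET_TO_CRON is a hand-written table of the cron equivalents that the Airflow docs list for these presets, and next_daily_run is a hypothetical helper that mimics what a daily cron such as 0 14 * * * means. Neither is part of Airflow.

```python
from datetime import datetime, timedelta

# Cron equivalents of the presets mentioned above (illustrative table).
PRESET_TO_CRON = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
}


def next_daily_run(after, hour, minute=0):
    """Next time strictly after `after` matching '<minute> <hour> * * *'.

    Hypothetical helper covering only the simple daily case.
    """
    candidate = after.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= after:
        candidate += timedelta(days=1)
    return candidate


print(PRESET_TO_CRON["@daily"])                                # 0 0 * * *
print(next_daily_run(datetime(2024, 7, 1, 15, 30), hour=14))   # 2024-07-02 14:00:00
print(next_daily_run(datetime(2024, 7, 1, 9, 0), hour=14))     # 2024-07-01 14:00:00
```

Note that @daily fires at midnight, so a 2 PM schedule needs an explicit cron expression.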

Tools for Backfilling DAGs

Sometimes you need to backfill your DAGs, that is, run them for past dates. Here’s how you can do it (the commands below use the Airflow 2.x CLI):

  • Trigger DAG Runs Manually: Use the Airflow UI or the CLI command airflow dags trigger -e <execution_date> <dag_id> to trigger a run for a specific date.
  • CLI Commands: Use airflow dags backfill -s <start_date> -e <end_date> <dag_id> to backfill over a period.
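
If you want to script these commands, a small Python wrapper could look like the sketch below. It only builds and prints the command lines; the helper names and the example dates are made up for illustration, and the flags are exactly the ones shown in the list above. Actually running the commands assumes the airflow executable is installed and on your PATH.

```python
import shlex


def trigger_command(dag_id, execution_date):
    """Build the CLI call that triggers one run for a specific date."""
    return ["airflow", "dags", "trigger", "-e", execution_date, dag_id]


def backfill_command(dag_id, start_date, end_date):
    """Build the CLI call that backfills a DAG over a date range."""
    return ["airflow", "dags", "backfill", "-s", start_date, "-e", end_date, dag_id]


cmd = backfill_command("daily_2pm_dag", "2024-07-01", "2024-07-07")
print(shlex.join(cmd))
# airflow dags backfill -s 2024-07-01 -e 2024-07-07 daily_2pm_dag

# To execute it for real (needs an Airflow installation):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when you pass it to subprocess.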

Scheduling Examples

Let’s put theory into practice with some examples.

Schedule a DAG Every Day at 2 PM

Parameters:

  • start_date: datetime(2024, 7, 1, 14, 0, 0) (1st July 2024, 2:00 PM)
  • schedule_interval: 0 14 * * * (Cron expression for 2 PM daily)
  • catchup: False (No backfilling missed intervals)

Here’s the code for that:

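from datetime import datetime

from airflow import DAG
# Newer Airflow 2.x releases deprecate DummyOperator in favor of
# airflow.operators.empty.EmptyOperator.
from airflow.operators.dummy import DummyOperator

# default_args are handed to every task, so only task-level settings go here.
default_args = {
    'start_date': datetime(2024, 7, 1, 14, 0, 0),
}

dag = DAG(
    'daily_2pm_dag',
    default_args=default_args,
    schedule_interval='0 14 * * *',  # every day at 2 PM
    catchup=False,  # DAG-level argument; it has no effect inside default_args
)

# Placeholder task that does nothing
start = DummyOperator(task_id='start', dag=dag)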

This code sets up a DAG that starts on July 1, 2024, at 2:00 PM and runs daily at 2:00 PM. The catchup parameter is set to False, so it won’t backfill missed intervals.
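
One detail that often surprises newcomers: in Airflow 2.x a scheduled run is triggered at the end of the interval it covers, so the first run of this DAG fires on July 2 at 2:00 PM, not on July 1. The plain-Python sketch below illustrates that timing. The run_schedule helper is hypothetical and approximates the cron schedule with a fixed one-day step.

```python
from datetime import datetime, timedelta


def run_schedule(start_date, interval, count):
    """Return (logical_date, triggered_at) pairs for the first `count` runs.

    Hypothetical helper: each run is triggered once its interval has ended.
    """
    runs = []
    for i in range(count):
        logical_date = start_date + i * interval
        runs.append((logical_date, logical_date + interval))
    return runs


runs = run_schedule(datetime(2024, 7, 1, 14, 0), timedelta(days=1), 3)
for logical_date, triggered_at in runs:
    print(f"interval starting {logical_date} is triggered at {triggered_at}")
```

So if you need the first execution to happen on July 1 at 2:00 PM, set start_date one interval earlier.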

Validate a Specific Scheduling Goal

Parameters:

  • start_date: datetime(2024, 1, 1, 0, 0, 0) (1st January 2024, midnight)
  • schedule_interval: @hourly
  • catchup: True
  • Goal: Schedule the DAG to run hourly starting from 1st January 2024.

Here’s the code for that:

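from datetime import datetime

from airflow import DAG
# Newer Airflow 2.x releases deprecate DummyOperator in favor of
# airflow.operators.empty.EmptyOperator.
from airflow.operators.dummy import DummyOperator

# default_args are handed to every task, so only task-level settings go here.
default_args = {
    'start_date': datetime(2024, 1, 1, 0, 0, 0),
}

dag = DAG(
    'hourly_dag',
    default_args=default_args,
    schedule_interval='@hourly',  # at the top of every hour
    catchup=True,  # DAG-level argument; it has no effect inside default_args
)

# Placeholder task that does nothing
start = DummyOperator(task_id='start', dag=dag)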

This code sets up a DAG that starts on January 1, 2024, at midnight and runs hourly. With catchup set to True, the scheduler creates a run for every hourly interval since start_date that has not run yet, for example when the DAG is first enabled or when it is resumed after a pause.

Test Your Understanding

Let’s see how much you’ve learned! Try answering these questions.

1. What is the default value for the catchup parameter?

  • A) True
  • B) False

2. Which parameter do you use to define the starting point of a DAG’s schedule?

  • A) schedule_interval
  • B) start_date
  • C) end_date

3. How do you schedule a DAG to run every day at 2 PM?

  • A) schedule_interval='@daily'
  • B) schedule_interval='0 14 * * *'
  • C) schedule_interval='@hourly'

4. If you want your DAG to stop running on 31st December 2024, which parameter will you set?

  • A) start_date
  • B) end_date
  • C) schedule_interval

5. You can use airflow dags trigger -e <execution_date> <dag_id> to backfill DAG runs.

  • A) True
  • B) False

Answers

Check your answers below:

  1. A) True
  2. B) start_date
  3. B) schedule_interval='0 14 * * *'
  4. B) end_date
  5. A) True

How did you do? Understanding these concepts is crucial for mastering Airflow and keeping your workflows running smoothly. Keep experimenting with different scheduling parameters, and soon you’ll be an Airflow pro!

For more information, check out the Apache Airflow documentation. It’s recommended to read this blog for a deeper understanding.
