Hey there! Ready to dive into the world of Apache Airflow and make your workflows run like a charm? Let’s explore how to schedule your Directed Acyclic Graphs (DAGs) effectively. We’ll cover the key scheduling parameters, look at tools for backfilling, and finish with a few questions to test your knowledge. Let’s get started!
Key DAG Scheduling Parameters
To kick things off, let’s break down the essential scheduling parameters. Each one plays a unique role in controlling when and how your DAG runs.
catchup
- Purpose: Imagine you’ve got a DAG that was off for a bit. Do you want it to catch up on all the missed runs? That’s what catchup does.
- Default Value: True
- Valid Values: True or False
- Think about it: If you don’t want your DAG to run all missed intervals when you turn it back on, what value should you set for catchup?
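To make the effect of catchup concrete, here is a minimal sketch in plain Python (no Airflow import needed) that models which runs would be created for a DAG that sat idle for a few days. The helper name scheduled_logical_dates and its simplified rule (one run per completed interval, or only the latest one when catchup is off) are assumptions made for illustration; they approximate the behavior described above rather than reproduce Airflow’s scheduler.

```python
from datetime import datetime, timedelta


def scheduled_logical_dates(start_date, interval, now, catchup):
    """Return the start of every completed interval that would get a run.

    Simplified model (an assumption, not Airflow's real code): a run is
    created once its interval has fully elapsed.
    """
    dates = []
    current = start_date
    while current + interval <= now:
        dates.append(current)
        current += interval
    if not catchup:
        # Without catchup, only the most recent completed interval runs.
        return dates[-1:]
    return dates


start = datetime(2024, 1, 1)
now = datetime(2024, 1, 4, 12, 0)
with_catchup = scheduled_logical_dates(start, timedelta(days=1), now, catchup=True)
without_catchup = scheduled_logical_dates(start, timedelta(days=1), now, catchup=False)

print(len(with_catchup))   # 3 runs: Jan 1, Jan 2, Jan 3
print(without_catchup)     # only the Jan 3 interval
```

With catchup on, every missed interval gets its own run; with it off, only the latest one does.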
start_date
- Purpose: This is the date and time your DAG should begin its journey. It’s the starting point for all its scheduled runs.
- Default Value: None (You need to set this)
- Valid Values: A datetime object or a string like YYYY-MM-DD HH:MM:SS
- Think about it: When do you want your DAG to kick off? Choose your start_date wisely!
end_date
- Purpose: Want your DAG to stop running after a certain date? Use end_date to specify that.
- Default Value: None (Optional)
- Valid Values: A datetime object or a string like YYYY-MM-DD HH:MM:SS
- Think about it: Do you need an end date for your DAG, or should it run indefinitely?
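Here is a rough sketch of how start_date and end_date together bound a schedule. It is plain Python, and the helper logical_dates_in_window is hypothetical. It also assumes that end_date acts as an inclusive upper bound on the run date, which is worth verifying against the Airflow version you use.

```python
from datetime import datetime, timedelta


def logical_dates_in_window(start_date, end_date, interval):
    """List interval starts from start_date up to end_date.

    Assumptions for this sketch: end_date is an inclusive upper bound, and
    end_date=None means the DAG never stops, so it cannot be enumerated.
    """
    if end_date is None:
        raise ValueError("no end_date: the schedule is unbounded")
    dates = []
    current = start_date
    while current <= end_date:
        dates.append(current)
        current += interval
    return dates


window = logical_dates_in_window(
    start_date=datetime(2024, 12, 29),
    end_date=datetime(2024, 12, 31),
    interval=timedelta(days=1),
)
print([d.strftime("%Y-%m-%d") for d in window])
```

Leaving end_date unset corresponds to the "run indefinitely" case, which is why the helper refuses to list it.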
schedule_interval
- Purpose: This defines how often your DAG should run. You can use cron expressions, timedelta objects, or predefined presets.
- Default Value: timedelta(days=1) in Airflow 2.x (one run per day, counted from start_date; @daily instead aligns runs to midnight)
- Valid Values:
  - Cron expressions (e.g., 0 14 * * * for 2 PM daily)
  - timedelta objects (e.g., timedelta(days=1))
  - Predefined intervals (e.g., @hourly, @daily, @weekly)
- Think about it: How frequently should your DAG run? Every hour, day, or maybe once a week?
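The three kinds of schedule_interval values can be told apart with a few lines of plain Python. In the sketch below, describe_schedule is a hypothetical helper (not part of Airflow), and the cron check only counts fields rather than validating them. The preset-to-cron mapping follows the table in the Airflow documentation.

```python
from datetime import timedelta

# Cron equivalents of common presets, as listed in the Airflow documentation.
PRESET_TO_CRON = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def describe_schedule(schedule):
    """Classify a schedule_interval value (hypothetical helper)."""
    if isinstance(schedule, timedelta):
        return ("timedelta", schedule)
    if isinstance(schedule, str):
        if schedule in PRESET_TO_CRON:
            return ("preset", PRESET_TO_CRON[schedule])
        # Naive check: a standard cron expression has five fields.
        if len(schedule.split()) == 5:
            return ("cron", schedule)
    raise ValueError(f"unrecognized schedule: {schedule!r}")


print(describe_schedule("@daily"))
print(describe_schedule("0 14 * * *"))
print(describe_schedule(timedelta(days=1)))
```

Seeing @daily expand to 0 0 * * * makes it clear why a 2 PM run needs an explicit cron expression.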
Tools for Backfilling DAGs
Sometimes, you need to backfill your DAGs — run them for past dates. Here’s how you can do it:
- Trigger DAG Runs Manually: Use the Airflow UI or the CLI command airflow dags trigger -e <execution_date> <dag_id> to trigger runs for specific dates.
- CLI Commands: Use airflow dags backfill -s <start_date> -e <end_date> <dag_id> to backfill over a period.
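If you script your backfills, it can help to build these commands programmatically. The sketch below only assembles the argument lists for the two commands above and prints them; it does not call Airflow. The helper names and the my_dag identifier are hypothetical, and the flags are the ones shown above, so check airflow dags backfill --help on your installation before relying on them.

```python
import shlex


def trigger_command(dag_id, execution_date):
    """Build the CLI call that triggers one run for a specific date."""
    return ["airflow", "dags", "trigger", "-e", execution_date, dag_id]


def backfill_command(dag_id, start_date, end_date):
    """Build the CLI call that backfills a date range."""
    return ["airflow", "dags", "backfill", "-s", start_date, "-e", end_date, dag_id]


# "my_dag" is a placeholder DAG id used only for this example.
cmd = backfill_command("my_dag", "2024-07-01", "2024-07-07")
print(shlex.join(cmd))
# airflow dags backfill -s 2024-07-01 -e 2024-07-07 my_dag
```

On a machine where the airflow CLI is installed, the same list could be handed to subprocess.run(cmd, check=True).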
Scheduling Examples
Let’s put theory into practice with some examples.
Schedule a DAG Every Day at 2 PM
Parameters:
- start_date: datetime(2024, 7, 1, 14, 0, 0) (1st July 2024, 2:00 PM)
- schedule_interval: 0 14 * * * (Cron expression for 2 PM daily)
- catchup: False (No backfilling missed intervals)
Here’s the code for that:
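from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

# catchup is a DAG-level argument. default_args are only passed to tasks,
# so setting catchup there would have no effect.
dag = DAG(
    'daily_2pm_dag',
    start_date=datetime(2024, 7, 1, 14, 0, 0),
    schedule_interval='0 14 * * *',  # every day at 2:00 PM
    catchup=False,  # do not backfill missed intervals
)

# Placeholder task so the DAG has something to run
start = DummyOperator(task_id='start', dag=dag)
This code sets up a DAG that starts on July 1, 2024, at 2:00 PM and runs daily at 2:00 PM. The catchup parameter is set to False on the DAG itself (not in default_args, which only applies to tasks), so it won’t backfill missed intervals.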
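To see when a daily 2 PM schedule actually fires, here is a small plain-Python sketch. The helper next_daily_trigger is hypothetical and only mimics a cron expression of the form 0 14 * * *. Keep in mind that in Airflow 2.x a run is triggered once its interval has ended, so with a start_date of July 1 at 2:00 PM the first run would be expected on July 2 at 2:00 PM.

```python
from datetime import datetime, timedelta


def next_daily_trigger(after, hour=14):
    """Return the first time strictly after `after` that falls on hour:00."""
    candidate = after.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= after:
        candidate += timedelta(days=1)
    return candidate


print(next_daily_trigger(datetime(2024, 7, 1, 14, 0)))  # 2024-07-02 14:00:00
print(next_daily_trigger(datetime(2024, 7, 3, 9, 30)))  # 2024-07-03 14:00:00
```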
Validate a Specific Scheduling Goal
Parameters:
- start_date: datetime(2024, 1, 1, 0, 0, 0) (1st January 2024, midnight)
- schedule_interval: @hourly
- catchup: True
- Goal: Schedule the DAG to run hourly starting from 1st January 2024.
Here’s the code for that:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

# catchup belongs on the DAG, not in default_args (which only reach tasks).
dag = DAG(
    'hourly_dag',
    start_date=datetime(2024, 1, 1, 0, 0, 0),
    schedule_interval='@hourly',  # at minute 0 of every hour
    catchup=True,  # run every interval that was missed
)

# Placeholder task so the DAG has something to run
start = DummyOperator(task_id='start', dag=dag)
This code sets up a DAG that starts on January 1, 2024, at midnight and runs hourly. With catchup set to True on the DAG, it will execute all missed intervals: those between start_date and the moment the DAG is first enabled, as well as those skipped while it was paused.
Test Your Understanding
Let’s see how much you’ve learned! Try answering these questions.
1. What is the default value for the catchup parameter?
- A) True
- B) False
2. Which parameter do you use to define the starting point of a DAG’s schedule?
- A) schedule_interval
- B) start_date
- C) end_date
3. How do you schedule a DAG to run every day at 2 PM?
- A) schedule_interval='@daily'
- B) schedule_interval='0 14 * * *'
- C) schedule_interval='@hourly'
4. If you want your DAG to stop running on 31st December 2024, which parameter will you set?
- A) start_date
- B) end_date
- C) schedule_interval
5. You can use airflow dags trigger -e <execution_date> <dag_id> to backfill DAG runs.
- A) True
- B) False
Answers
Check your answers below:
1. A) True
2. B) start_date
3. B) schedule_interval='0 14 * * *'
4. B) end_date
5. A) True
How did you do? Understanding these concepts is crucial for mastering Airflow and keeping your workflows running smoothly. Keep experimenting with different scheduling parameters, and soon you’ll be an Airflow pro!
For more information, check out the Apache Airflow documentation. Reading this blog is also recommended for a deeper understanding.