There continue to be more tricks and tips I discover while using Airflow. I will try to collect them in some semblance of order below for your benefit!
Scheduling Intervals
Something that I initially spent a lot of time troubleshooting, because it does not seem to behave exactly as the documentation indicates, is using scheduled runs alongside known intervals. For example, setting up a DAG or task that runs each day and does something specific with only the values added on that day. For these sorts of situations, there exist the data_interval_start and data_interval_end context keys, which should be set to the start and end of the DAG run's scheduled window. However, I had some difficulty getting these to work as the documentation indicated. In truth, there are two scheduling Timetable objects that Airflow can work with here: CronDataIntervalTimetable and CronTriggerTimetable. The CronDataIntervalTimetable stores information about the desired interval and runs at the end of each scheduled interval, the better to do batch processing on that interval's data. In contrast, the CronTriggerTimetable stores no interval information and just contains a triggering time. According to the documentation, the basic syntax of
```python
@dag(
    start_date=pendulum.datetime(2025, 6, 6, tz='America/Los_Angeles'),
    schedule='0 * * * *'
)
def my_dag():
    pass
```

should create a CronDataIntervalTimetable by default, but I could not seem to get this to work. Using the above syntax never resulted in proper intervals being given to my tasks (in particular, it sets data_interval_start and data_interval_end to the same value). I was able to get around this by explicitly using the CronDataIntervalTimetable and including a timezone, as below:
```python
from airflow.timetables.interval import CronDataIntervalTimetable

@dag(
    start_date=pendulum.datetime(2025, 6, 6, tz='America/Los_Angeles'),
    schedule=CronDataIntervalTimetable('0 * * * *', timezone=pendulum.timezone('America/Los_Angeles'))
)
def my_dag():
    pass
```

After doing so, my data_interval_start and data_interval_end were correctly populated and I could use them to filter the data as I desired.
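As a minimal sketch of how those values then get used (the `rows_in_window` helper and the row shape are my own illustration, not part of Airflow), a task can treat the pair as a start-inclusive, end-exclusive window when filtering timestamped records:

```python
from datetime import datetime, timezone

def rows_in_window(rows, data_interval_start, data_interval_end):
    """Keep only rows whose timestamp falls inside the run's data
    interval, treating the window as [start, end): start inclusive,
    end exclusive, so consecutive runs never double-count a row."""
    return [r for r in rows if data_interval_start <= r["ts"] < data_interval_end]

# Two rows, only the first of which falls inside the hour-long window.
rows = [
    {"ts": datetime(2025, 6, 6, 0, 30, tzinfo=timezone.utc), "value": 1},
    {"ts": datetime(2025, 6, 6, 1, 30, tzinfo=timezone.utc), "value": 2},
]
window = rows_in_window(
    rows,
    datetime(2025, 6, 6, 0, 0, tzinfo=timezone.utc),
    datetime(2025, 6, 6, 1, 0, tzinfo=timezone.utc),
)
print([r["value"] for r in window])  # → [1]
```

In a real task you would receive data_interval_start and data_interval_end from the task context and push the same half-open comparison down into your SQL or S3 query.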
I am still not clear if this is a Docker issue, a timezone issue, a documentation issue, or some combination of all of the above. I am still testing, and will update this as I figure anything else out.
Testing DAGs
I never get my DAGs correct on the first try. So I have found it useful to adopt a few practices that help me troubleshoot and keep a smooth workflow between DAG writing and the Airflow UI. Some things I would suggest:
- Initially prepend the name of your DAG with `test_`. This will give you a named DAG that you can test with, but once you are happy with it and ready to accept it, you can remove the `test_` prefix and get a clean history in the Airflow UI.
- Even if you are eventually going to want to turn `catchup` on, start with it off while you are testing!
- If you don't see your DAG showing up when you make a new file, check whether there is a little 'DAG Parse Failed' notification icon near the top of the Dags page in the Airflow UI. If there is, you can click it to get the error message for the failed parse. Usually it is a syntax error of some sort.
- Build up the testing DAG one task at a time, testing with a manual run between each addition.
- Keep in mind that there can be differences between scheduled runs and manual runs! If you are worried something isn't working properly because of a manual run, you can always bump the schedule to a higher frequency for a brief period while you test things.
- When you edit a DAG, you can click the 'Reparse Dag' button on that DAG's page in Airflow to speed up the server detecting and reparsing your new version. Sometimes you might need to refresh the Airflow UI shortly after doing so to see the updated changes.
- While testing, make sure that your DAG isn't actually doing anything irreversible! Make copies of any database tables or S3 files that you want to read from or write to for practice.
- Ensure at the end of the day that your DAG is idempotent: running it multiple times should always leave things in a consistent state. This usually means not using things like `pendulum.now()` in your DAGs, as those will always depend on what day they happen to be run.
- You can simulate running your DAG on older days by expanding the 'Advanced Options' section when you click 'Trigger', and choosing a specific Logical Date.
- Once you have everything working to your satisfaction, go back into your DAG, remove the `test_` at the front (and update the function call at the bottom of the file if you need to) and turn on `catchup` if you need it.
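The idempotency point above can be sketched without Airflow at all (the `target_date_for_run` function is a hypothetical helper of mine, not an Airflow API): derive the day to process from the run's logical date rather than from "now", so that re-running an old DAG run, or a catchup backfill, always touches the same day's data.

```python
from datetime import datetime, timezone

def target_date_for_run(logical_date):
    """Derive the partition to process from the run's logical date.
    Unlike pendulum.now(), this gives the same answer no matter when
    the run is (re)executed, which is what makes reruns idempotent."""
    return logical_date.date().isoformat()

# Re-running the same logical date yields the same target every time,
# even if the rerun happens weeks later.
run = datetime(2025, 6, 6, 8, 0, tzinfo=timezone.utc)
print(target_date_for_run(run))  # → 2025-06-06
```

Inside a task you would get the logical date from the task context (or via the 'Logical Date' you picked under 'Advanced Options' when triggering manually), then pass it to helpers like this instead of calling a wall-clock function.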