Airflow APIs and Reliability

Jed Rembold

June 18, 2025

Announcements

Did Milestone 1 dashboards get done last night? If not, you have until tonight!
End of next Tuesday (June 24)
- Milestone 2 dashboards due
- Documentation due
- Group swap happens after this
Weekly reflections: through Canvas discussions

Tonight

Reliability checking in Airflow
API connections
Introducing a new data source
Work time

Reliability

Building for Resilience

So you have yourself a working pipeline: nice!
- But a pipeline that runs correctly a single time is of exceptionally limited use
How can we build resilient pipelines?
- If a network error occurs, we don’t skip entire DAGs
- If some data inconsistency occurs, we don’t pollute our warehouse
- If a task takes longer than expected to run, we raise alerts

Network Collisions

The computer networks that your data pipelines traverse are very busy places
Despite modern networking standards, sometimes a signal is lost
How can we account for that?

Do or do not? - Wait there IS a retry

There are several settings you might want to add to individual Airflow tasks:

Setting	Description	Type
`retries`	How many times to retry if something fails	`int`
`retry_delay`	How long to wait until trying again	`duration`
`retry_exponential_backoff`	Should each retry wait exponentially longer?	`bool`
`max_retry_delay`	Max delay between retries (caps the exponential backoff)	`duration`
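
To get a feel for how these settings interact, here is a minimal sketch of a capped exponential backoff schedule. It is a simplified model and not Airflow's actual implementation (which, as far as I know, also mixes in some jitter), and the function name `backoff_schedule` is made up purely for illustration.

```python
from datetime import timedelta


def backoff_schedule(retries, retry_delay, max_retry_delay=None):
    """Return the approximate wait before each retry, doubling every time."""
    waits = []
    for attempt in range(retries):
        # Simplified model: base delay * 2^attempt (real schedulers add jitter)
        wait = retry_delay * (2 ** attempt)
        if max_retry_delay is not None:
            wait = min(wait, max_retry_delay)  # the cap from max_retry_delay
        waits.append(wait)
    return waits


waits = backoff_schedule(
    retries=5,
    retry_delay=timedelta(minutes=1),
    max_retry_delay=timedelta(minutes=10),
)
for number, wait in enumerate(waits, start=1):
    print(f"retry {number}: wait about {wait}")
```

In this model the fifth retry would have waited 16 minutes, but the cap holds it to 10.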

Retry examples

These settings are all passed as arguments inside the parentheses after the @task decorator

import pendulum
from airflow.sdk import task  # Airflow 3 import path

@task(
  retries = 3,
  retry_delay = pendulum.duration(minutes=1),
  retry_exponential_backoff = True
)
def my_tricky_task():
  |||Task logic and code|||

Common Task arguments

If you want to do this for many tasks that make up a DAG, this can get tedious
You can always define a dictionary of common dag arguments once, and then reference it in the dag creation
This will cause each task to utilize those defaults

from airflow.sdk import dag  # Airflow 3 import path
from pendulum import duration, datetime

defaults = {
  'retries': 3,
  'retry_delay': duration(minutes=1),
  'retry_exponential_backoff': True,
  'max_retry_delay': duration(hours=1)
}

@dag(
  start_date = datetime(2025, 6, 18),
  schedule = '0 0 * * *',
  default_args = defaults
)
def my_great_dag():
  |||All my tasks which will inherit the defaults|||

Go to Timeout

Sometimes you just want to fail a task that has taken too long
- Failing is what would happen anyway once the retries were used up
You can add an execution_timeout=|||some duration||| parameter to the task
There is a concept of SLA monitoring that would just send an alert instead of failing, but it is currently deactivated in Airflow 3.0
- Will be coming back in Airflow 3.1

Speaking of Monitoring

Airflow has a variety of built-in ways to alert you to various events:
- When a task runs successfully (on_success_callback)
- When a task fails (on_failure_callback)
- When a task retries (on_retry_callback)
- When a DAG completes successfully (on_success_callback)
- When a DAG fails (on_failure_callback)
For each of these, you can specify the hook in your task/dag parameters that links to a callback function

The Callbacks

The callback function is just a Python function that performs some action when this event occurs
Possibilities could include:
- Just printing a message
- Sending a Slack or Discord message (via webhook)
- Sending an email
- Pushing to some external monitoring system
Any alert callback function gets the context passed into it as the only parameter
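
As a rough illustration of the "just printing a message" option, a task-level failure callback might look something like the sketch below. It assumes the context carries the task instance under `ti` and the raised error under `exception`. The function names, the task/DAG names, and the fake context used to try it outside of Airflow are all made up for this example.

```python
from types import SimpleNamespace


def describe_failure(context):
    """Build a short, human-readable alert message from the Airflow context."""
    ti = context["ti"]  # the task instance that triggered the callback
    error = context.get("exception", "unknown error")
    return f"Task {ti.task_id} in DAG {ti.dag_id} failed: {error}"


def print_alert(context):
    """Simplest possible callback: print the message so it lands in the logs."""
    print(describe_failure(context))


# Fake context for trying the callback outside of Airflow (illustration only)
fake_context = {
    "ti": SimpleNamespace(task_id="load_rides", dag_id="metro_pipeline"),
    "exception": ValueError("no rows returned"),
}
print_alert(fake_context)
```

Keeping the message building separate from the delivery makes it easy to swap the `print` for a webhook or email later.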

Discord

I am working on adding a function to the utils.py file that would enable Discord reporting to your private group channel on Discord
It is not quite done yet (I need an in-between layer to account for the fact that you’ll be rotating projects), but I hope to have it done by the end of the week

from utils import discord_alert

@dag(
  ...,
  on_failure_callback = discord_alert
)
def my_broken_dag():
  |||All the tasks|||

Email

The cumbersome part of email alerts is configuring the email that Airflow will use to send from
- Requires editing your airflow.cfg (example below)
You can call send_email from within a custom callback
Alternatively, set the email, email_on_failure, and email_on_retry settings in your default args

[email]
email_backend = airflow.utils.email.send_email_smtp

[smtp]
smtp_host = smtp.gmail.com
smtp_starttls = True
smtp_ssl = False
smtp_user = your_email@gmail.com
# Use an app password here, NOT your Gmail password!
smtp_password = your_app_password
smtp_port = 587
smtp_mail_from = your_email@gmail.com
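
For the default args route, the three email settings are just more dictionary entries. A small sketch, reusing the placeholder address from the config above (the variable names here are only for illustration):

```python
alert_defaults = {
    "email": ["your_email@gmail.com"],  # placeholder address, as in the config
    "email_on_failure": True,           # send mail when a task finally fails
    "email_on_retry": False,            # stay quiet on individual retries
}

# These can be merged with other defaults before being passed as default_args
retry_defaults = {"retries": 3}
default_args = {**retry_defaults, **alert_defaults}
print(default_args)
```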

Data Checks

The earlier content builds resiliency in the pipeline itself, but what about in the data moving through the pipeline?
Older provided operators include:
- SQLCheckOperator
- SQLValueCheckOperator
- SQLThresholdCheckOperator
- SQLIntervalCheckOperator
These are used less often in modern practice, but they highlight the kinds of checks you should be thinking about.

Don’t Overlook!

You might feel like constraints in your data warehouse and data sources are already accounting for much of this
- They partly are, but there are some big “holes”
The instant you read in data from the database, you have lost all the constraints
- This is as true for DuckDB as it is for SQLAlchemy, Psycopg2, or any other engines
So any SQL manipulations you do within the DAG itself are not protected by constraints
Some mistakes that you could make might be flagged when you try to write into the warehouse with its own constraints, but many might not be!

Patterns to Check For

Some common items to check for include:
- Ensuring that row counts are not 0 (unless you are specifically allowing for that)
- Ensuring you have no NULL values in important columns
- Ensuring you have no duplicates in a column that is supposed to be a primary key
- Ensuring that values are within an expected range
- Checking metrics against previous metrics if expecting monotonic growth/decay
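
A minimal sketch of the first four checks in plain Python, assuming the data has already been read into a list of dictionaries. The function name, column names, and sample rows are all hypothetical, and the comparison against previous metrics is left out since it depends on where you store those metrics.

```python
def check_rows(rows, key, required, ranges):
    """Return a list of problems found in `rows` (a list of dicts)."""
    problems = []
    if not rows:
        return ["row count is 0"]

    # No NULL (None) values in important columns
    for col in required:
        missing = sum(1 for row in rows if row.get(col) is None)
        if missing:
            problems.append(f"{missing} NULL value(s) in {col}")

    # No duplicates in the column that should act as a primary key
    keys = [row[key] for row in rows]
    if len(keys) != len(set(keys)):
        problems.append(f"duplicate values in {key}")

    # Values fall within an expected range
    for col, (low, high) in ranges.items():
        bad = [row[col] for row in rows
               if row.get(col) is not None and not low <= row[col] <= high]
        if bad:
            problems.append(f"{len(bad)} value(s) in {col} outside [{low}, {high}]")
    return problems


rides = [
    {"ride_id": 1, "station": "Central", "fare": 2.50},
    {"ride_id": 2, "station": None, "fare": 2.50},
    {"ride_id": 2, "station": "Harbor", "fare": -1.00},
]
problems = check_rows(rides, key="ride_id", required=["station", "fare"],
                      ranges={"fare": (0, 20)})
print(problems)
```

Inside a task you would typically raise an exception if `problems` is non-empty, so that the task fails before anything reaches the warehouse.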

Data Checking Tips

You usually should do these checks immediately after reading in your data or making any large transformations
If you are already passing the read data between tasks, it probably makes sense to put the checks in their own task
As a general rule, do not open a connection in one task and then try to pass around that connection object to other tasks.
- Tasks might be run on completely different workers/systems
- Connections can get closed when one worker finishes, leaving the next with a dead connection
Making these checks can feel like extra work (and it is) up until the point where they save you a TON of cleanup work
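
The rule about connections can be sketched as follows: each task opens and closes its own connection, and only a plain value (here a file path) is handed from one to the next. This uses `sqlite3` purely as a stand-in for your real warehouse, and the function names are hypothetical. A file path only works if the tasks share a filesystem, so in practice you would more likely pass a connection ID or table name.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def load_task(db_path):
    """Open a connection, write, and close it, all within one task."""
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS rides (ride_id INTEGER)")
        conn.executemany("INSERT INTO rides VALUES (?)", [(1,), (2,), (3,)])
        conn.commit()
    return db_path  # pass along *where* the data lives, not the connection


def count_task(db_path):
    """A later task opens its own fresh connection from the path."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute("SELECT COUNT(*) FROM rides").fetchone()[0]


db_path = os.path.join(tempfile.mkdtemp(), "warehouse.db")
row_count = count_task(load_task(db_path))
print(row_count)
```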

Break Time!

API Connections

Application Programming Interfaces

Everyone has worked with APIs to some degree at this point
Generally you access some particular URL, and instead of getting back a webpage, you get organized data (usually in a JSON format)
Being able to connect to various API endpoints is incredibly valuable in data pipelines, where many potential data sources may offer their information at a particular endpoint.
So how can we work with these in Airflow?

The Simple Case

from airflow.providers.http.operators.http import HttpOperator

http_task = HttpOperator(
    task_id="get_example_data",
    method="GET",
    endpoint="api/data",
    http_conn_id="my_api_conn",
    headers={
      "Accept": "application/json"
    },
    response_filter= (
      lambda response: response.json()
    ),
    log_response=True
)

There is an operator simply called HttpOperator that you can utilize
It is a bit cumbersome to get the acquired data into another task, as you must pull it through XCom
```
context['ti'].xcom_pull(
  task_ids="get_example_data"
)
```
Probably only reasonable for very simple API calls (especially calls that don’t return anything)

The Python Approach

When you want more control (which will be most of the time), you will probably want to use Python’s requests library

import requests

def fetch_users():
  url = "https://example.com/api/users"
  resp = requests.get(url)

  if resp.status_code != 200:
    raise Exception(
      f"Request failed: {resp.status_code} - {resp.text}"
    )

  return resp.json()

Headers or Query Parameters

Some APIs may require special header information or you may want to provide extra query parameters

The requests library makes this very straightforward

headers = {
  "Authorization": f"Bearer {API_KEY}"
}
params = {
  "page": 2,
  "sort": "desc"
}
resp = requests.get(|||url|||, params=params, headers=headers)

Requests will take care of handling all the parameter syntax for you!
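
To see roughly what that parameter syntax looks like, the standard library's `urlencode` produces the same kind of query string that ends up appended to the URL. This is only an approximation of what requests does internally, and the extra `q` parameter is an assumption added to show how spaces get encoded.

```python
from urllib.parse import urlencode

base_url = "https://example.com/api/users"
params = {"page": 2, "sort": "desc", "q": "blue line"}

# Roughly what ends up on the end of the URL when you pass params=params
query_string = urlencode(params)
full_url = f"{base_url}?{query_string}"
print(full_url)
```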

Working with JSON

Once you have your JSON response, you are probably going to want some method to parse it into tables
Your options generally look like:
- Parse it manually, looping over keys/entries
- Use Pandas: pd.DataFrame.from_dict(|||your json|||)
- Write to a file, then use DuckDB
```
import json
import duckdb

with open('temp.json', 'w') as fh:
  json.dump(|||your json|||, fh)

db = duckdb.connect()
# Double quotes outside so the file name can sit in single quotes
out = db.sql("SELECT * FROM 'temp.json'")
out.show()
```
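
For the manual option, a sketch of looping over keys and entries might look like the following. The payload shape here is entirely hypothetical (it is not the structure of any particular API), and `flatten_records` is a made-up helper that only handles one level of nesting.

```python
def flatten_records(payload):
    """Turn a list of nested JSON objects into flat row dictionaries."""
    rows = []
    for entry in payload:
        row = {}
        for key, value in entry.items():
            if isinstance(value, dict):
                # Promote nested keys to columns like "fare_tier"
                for sub_key, sub_value in value.items():
                    row[f"{key}_{sub_key}"] = sub_value
            else:
                row[key] = value
        rows.append(row)
    return rows


payload = [
    {"rider_id": 1, "fare": {"tier": "Student", "price": 1.25}},
    {"rider_id": 2, "fare": {"tier": "Senior", "price": 1.00}},
]
rows = flatten_records(payload)
print(rows)
```

Each resulting dictionary has the same flat set of keys, which makes it straightforward to insert as table rows.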

Pagination

Often for data pipelines you might be accessing LOTS of information from an API
You can almost never get everything at once
- Returning everything in one response would make the API more brittle and prone to timeout problems
Endpoints that return a lot of data will generally break it up into pages
Paginated endpoints often accept a query parameter that specifies which page you want
- May also be able to specify the desired page size (up to some max)
Requires making multiple API calls to gather all of the data that you want

When are we done?

You could just keep requesting new pages until you get an error, but it might not be clear what caused the error
- Was it because that page doesn’t exist or did something else go wrong?
Most paginated endpoints will return some metadata in each response about what page was selected, and how many pages exist
- Can use that information to dynamically loop over the necessary number of pages

Pagination Example

import requests

def get_all_pages(api_url, page_size):
  all_results = []
  current_page = 1
  total_pages = None

  while total_pages is None or current_page <= total_pages:
    params = {
      'page': current_page,
      'page_size': page_size
    }
    try:
      resp = requests.get(api_url, params=params)
      resp.raise_for_status()
    except requests.RequestException as e:
      # Chain the original error so the root cause stays visible
      raise RuntimeError(f"API request failed on page {current_page}") from e

    payload = resp.json()
    data = payload.get("data", [])
    # Fall back to the current page so a missing count cannot loop forever
    total_pages = payload.get("total_pages") or current_page
    all_results.extend(data)
    current_page += 1

  return all_results

Failing APIs

What if you have an API call fail?
If it was just a single call, this could be handled with Airflow’s own retry logic
But what if it fails on page 63 out of 100?
Approaches:
- You could handle this manually by wrapping each try/except block in a for loop
- A better, more modern approach (which also closely mimics Airflow's retry mechanism) is to use the tenacity library
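
For comparison, the manual approach might be sketched like this: a for loop around a try/except, with a delay that grows between attempts. The helper name `call_with_retries` and the simulated flaky call are both hypothetical. In a real pipeline `func` would be the request for a single page.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0, backoff=2.0):
    """Call func(), retrying on any exception with a growing delay."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the error reach Airflow
            time.sleep(delay)
            delay *= backoff


# Hypothetical flaky call that fails twice before succeeding
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network hiccup")
    return {"page": 63, "data": ["..."]}


result = call_with_retries(flaky_fetch, attempts=5, delay=0.01)
print(result, "after", calls["count"], "calls")
```

Because only the failing page is retried, a hiccup on page 63 does not force you to refetch pages 1 through 62.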

Coding with Tenacity

Tenacity is a library that lets you decorate any function and specify retry parameters similar to Airflow's
- retry attempts, delay time, exponential backoff, etc.

import requests
from tenacity import retry, stop_after_attempt
from tenacity import wait_exponential, retry_if_exception_type

@retry(
  stop = stop_after_attempt(5),
  wait = wait_exponential(multiplier=2, min=1, max=60),
  # Retry on any exception; pass a narrower class to be more selective
  retry = retry_if_exception_type(Exception),
)
def fetch_page(url, params):
  resp = requests.get(url, params=params)
  resp.raise_for_status()
  return resp.json()

Introducing Fares!

This all ain’t free

On any given metro system, riders will be paying some price to ride
While we may be assuming riders here prepay on some sort of card that they can tap in and out with, there can still be different “levels” of fare
In our simulated metro system, there are currently 4 tiers:
- Student
- Standard
- Senior
- Weekly

A New Source

Fare information would frequently be handled separately from other metro logic, and so it makes sense that this might not be offered up in the same database
Instead, you can access the current fare model each individual subscribes to by accessing a particular API
Two endpoints to remember:
- sources.advde:8090/api/docs gets you all the documentation for the API
- sources.advde:8090/api/|||day|||/fares gets you the subscribed fares on a given day
Historical information is available

Work Time

Your Time!

The rest of the evening is yours to meet with your group and ensure you are getting a good start on this week’s Milestone
If you still need to finish up something from last week, prioritize that