Airflow APIs and Reliability

Jed Rembold

June 18, 2025

Announcements

  • Did Milestone 1 dashboards get done last night? If not, you have until tonight!
  • End of next Tuesday (June 24)
    • Milestone 2 dashboards due
    • Documentation due
    • Group swap happens after this
  • Weekly reflections: through Canvas discussions

Tonight

  • Reliability checking in Airflow
  • API connections
  • Introducing a new data source
  • Work time

Reliability

Building for Resilience

  • So you have yourself a working pipeline: nice!
    • But a pipeline that runs correctly a single time is of exceptionally limited use
  • How can we build resilient pipelines?
    • If a network error occurs, we don’t skip entire DAGs
    • If some data inconsistency occurs, we don’t pollute our warehouse
    • If a task takes longer than expected to run, we raise alerts

Network Collisions

  • The computer networks that your data pipelines traverse are very busy places
  • Despite modern networking standards, sometimes a signal is lost
  • How can we account for that?

Do or do not? - Wait there IS a retry

  • There are several settings you might want to add to individual Airflow tasks:
Setting                      Description                                                Type
retries                      How many times to retry if something fails                 int
retry_delay                  How long to wait until trying again                        duration
retry_exponential_backoff    Should each retry wait exponentially longer?               bool
max_retry_delay              Max delay between retries (caps the exponential backoff)   duration
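
To get a feel for how these four settings interact, here is a minimal sketch of a simple doubling model. The function name backoff_schedule is hypothetical and this is not Airflow's code: Airflow's real calculation differs in detail (it adds some randomness to the waits, for example), so treat the numbers as illustrative only.

```python
from datetime import timedelta


def backoff_schedule(retries, retry_delay, max_retry_delay=None):
    """Return the wait before each retry under a simple doubling model."""
    waits = []
    for attempt in range(retries):
        # First retry waits retry_delay, then 2x, 4x, ...
        wait = retry_delay * (2 ** attempt)
        # max_retry_delay caps the growth, if one was given
        if max_retry_delay is not None and wait > max_retry_delay:
            wait = max_retry_delay
        waits.append(wait)
    return waits


schedule = backoff_schedule(
    retries=5,
    retry_delay=timedelta(minutes=1),
    max_retry_delay=timedelta(minutes=10),
)
print([str(w) for w in schedule])
```

Under this model the waits are 1, 2, 4, and 8 minutes, and the fifth is capped at 10 minutes instead of growing to 16.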

Retry examples

  • These settings all go inside the parentheses after the @task decorator
import pendulum
from airflow.sdk import task

@task(
  retries = 3,
  retry_delay = pendulum.duration(minutes=1),
  retry_exponential_backoff = True
)
def my_tricky_task():
  |||Task logic and code|||

Common Task arguments

  • If you want to do this for many tasks that make up a DAG, this can get tedious
  • You can always define a dictionary of common dag arguments once, and then reference it in the dag creation
  • This will cause each task to utilize those defaults
from airflow.sdk import dag
from pendulum import duration, datetime

defaults = {
  'retries': 3,
  'retry_delay': duration(minutes=1),
  'retry_exponential_backoff': True,
  'max_retry_delay': duration(hours=1)
}

@dag(
  start_date = datetime(2025, 6, 18),
  schedule = '0 0 * * *',
  default_args = defaults
)
def my_great_dag():
  |||All my tasks which will inherit the defaults|||

Go to Timeout

  • Sometimes you just want to fail a task that has taken too long
    • A failure is what would happen anyway once all of the retries were used up
  • Can add an execution_timeout=|||some duration||| parameter
  • There is a concept of SLA monitoring that would just send an alert instead of failing, but it is currently deactivated in Airflow 3.0
    • Will be coming back in Airflow 3.1

Speaking of Monitoring

  • Airflow has a variety of built in ways that it can alert you of various events:
    • When a task runs successfully (on_success_callback)
    • When a task fails (on_failure_callback)
    • When a task retries (on_retry_callback)
    • When a DAG completes successfully (on_success_callback)
    • When a DAG fails (on_failure_callback)
  • For each of these, you can specify the hook in your task/dag parameters that links to a callback function

The Callbacks

  • The callback function is just a Python function that performs some action when this event occurs
  • Possibilities could include:
    • Just printing a message
    • Sending a slack or Discord message (via webhook)
    • Sending an email
    • Pushing to some external monitoring system
  • Any alert callback function gets the context passed into it as the only parameter
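
As a rough illustration of the callback shape, here is a minimal sketch of a "just print a message" callback. The function names and the stand-in context are hypothetical, and the sketch assumes the context exposes a task instance under "ti" and the raised error under "exception"; check the context keys available in your Airflow version before relying on them.

```python
from types import SimpleNamespace


def describe_failure(context):
    """Build a one-line alert message from a callback context."""
    ti = context.get("ti")
    # Fall back to placeholders if the expected keys are missing
    dag_id = getattr(ti, "dag_id", "unknown_dag")
    task_id = getattr(ti, "task_id", "unknown_task")
    error = context.get("exception", "no exception recorded")
    return f"[{dag_id}.{task_id}] failed: {error}"


def print_alert(context):
    """Callback: takes the context as its only parameter."""
    print(describe_failure(context))


# Stand-in context for trying the callback outside Airflow
fake_context = {
    "ti": SimpleNamespace(dag_id="metro_dag", task_id="load_rides"),
    "exception": ValueError("row count was 0"),
}
print_alert(fake_context)
```

Keeping the message building separate from the delivery makes it easy to swap print for a webhook or email call later.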

Discord

  • I am working on adding a function to the utils.py file that would enable Discord reporting to your private group channel on Discord
  • It is not quite done yet (I need an in-between layer to account for the fact that you’ll be rotating projects), but I hope to have it done by the end of the week
from utils import discord_alert

@dag(
  ...,
  on_failure_callback = discord_alert
)
def my_broken_dag():
  |||All the tasks|||

Email

  • The cumbersome part of email alerts is configuring the email that Airflow will use to send from
    • Requires editing your airflow.cfg (example below)
  • Can use send_email in custom callback
  • Alternatively, set email, email_on_failure, email_on_retry settings in default args
[email]
email_backend = airflow.utils.email.send_email_smtp

[smtp]
smtp_host = smtp.gmail.com
smtp_starttls = True
smtp_ssl = False
smtp_user = your_email@gmail.com
smtp_password = your_app_password  # NOT your Gmail password!
smtp_port = 587
smtp_mail_from = your_email@gmail.com

Data Checks

  • The earlier content builds resiliency in the pipeline itself, but what about in the data moving through the pipeline?
  • Older provided operators include:
    • SQLCheckOperator
    • SQLValueCheckOperator
    • SQLThresholdCheckOperator
    • SQLIntervalCheckOperator
  • These are less commonly used today, but they highlight the kinds of checks you should be thinking about.

Don’t Overlook!

  • You might feel like constraints in your data warehouse and data sources are already accounting for much of this
    • They partly are, but there are some big “holes”
  • The instant you read in data from the database, you have lost all the constraints
    • This is as true for DuckDB as it is for SQLAlchemy, Psycopg2, or any other engines
  • So any SQL manipulations you do within the DAG itself are not protected by constraints
  • Some mistakes that you could make might be flagged when you try to write into the warehouse with its own constraints, but many might not be!

Patterns to Check For

  • Some common items to check for include:
    • Ensuring that row counts are not 0 (unless you are specifically allowing for that)
    • Ensuring you have no NULL values in important columns
    • Ensuring you have no duplicates in a column that is supposed to be a primary key
    • Ensuring that values are within an expected range
    • Checking metrics against previous metrics if expecting monotonic growth/decay
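
The first four checks above can be sketched in plain Python over rows that have already been read in. This is one possible shape under stated assumptions, not a standard API: the function name check_rows, its parameters, and the sample rides rows are all made up for illustration.

```python
def check_rows(rows, key, required, ranges):
    """Return a list of problems found in a list of row dictionaries."""
    problems = []
    if not rows:
        return ["row count is 0"]

    # NULL check on important columns
    for col in required:
        missing = sum(1 for row in rows if row.get(col) is None)
        if missing:
            problems.append(f"{missing} NULL value(s) in {col}")

    # Duplicate check on the supposed primary key
    keys = [row.get(key) for row in rows]
    if len(keys) != len(set(keys)):
        problems.append(f"duplicate values in {key}")

    # Range check: ranges maps column -> (low, high)
    for col, (low, high) in ranges.items():
        bad = sum(
            1 for row in rows
            if row.get(col) is not None and not low <= row[col] <= high
        )
        if bad:
            problems.append(f"{bad} value(s) out of range in {col}")

    return problems


# Made-up sample rows with one problem of each kind
rides = [
    {"ride_id": 1, "rider": "a", "minutes": 12},
    {"ride_id": 2, "rider": None, "minutes": 700},
    {"ride_id": 2, "rider": "c", "minutes": 30},
]
problems = check_rows(
    rides, key="ride_id", required=["rider"], ranges={"minutes": (0, 300)}
)
print(problems)
```

Inside a task, raising an exception when the returned list is non-empty would fail the task before anything reaches the warehouse.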

Data Checking Tips

  • You usually should do these checks immediately after reading in your data or making any large transformations
  • If you are already passing the read data between tasks, it probably makes sense to put the checks in their own task
  • As a general rule, do not open a connection in one task and then try to pass around that connection object to other tasks.
    • Tasks might be run on completely different workers/systems
    • Connections can get closed when one worker finishes, leaving the next with a dead connection
  • Making these checks can feel like extra work (and it is) up until the point where they save you a TON of cleanup work
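
To illustrate the connection rule, here is a minimal sketch in which each step opens and closes its own connection and only a small, serializable value moves between steps. It uses the standard library's sqlite3 purely as a stand-in for DuckDB or Postgres, and the function names are hypothetical. Passing a file path only works when both steps can see the same filesystem; on separate workers you would pass something like a connection ID or URI instead.

```python
import os
import sqlite3
import tempfile


def load_rows(db_path):
    """Task-like step: open its own connection, write, and close."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS rides (ride_id INTEGER)")
        conn.executemany("INSERT INTO rides VALUES (?)", [(1,), (2,), (3,)])
        conn.commit()
    finally:
        conn.close()
    # Return something small and serializable, not the connection
    return db_path


def count_rows(db_path):
    """Task-like step: open a fresh connection from the passed path."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT COUNT(*) FROM rides").fetchone()[0]
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    path = load_rows(os.path.join(tmp, "demo.db"))
    row_count = count_rows(path)
    print(row_count)
```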

Break Time!

API Connections

Application Programming Interfaces

  • Everyone has worked with APIs to some degree at this point
  • Generally you access some particular URL, and instead of getting back a webpage, you get organized data (usually in a JSON format)
  • Being able to connect to various API endpoints is incredibly valuable in data pipelines, where many potential data sources may offer their information at a particular endpoint.
  • So how can we work with these in Airflow?

The Simple Case

http_task = HttpOperator(
    task_id="get_example_data",
    method="GET",
    endpoint="api/data",
    http_conn_id="my_api_conn",
    headers={
      "Accept": "application/json"
    },
    response_filter= (
      lambda response: response.json()
    ),
    log_response=True
)
  • There is an operator just called HttpOperator that you can utilize

  • A bit cumbersome to get the acquired data into another task, as you must use the XCom method

    context['ti'].xcom_pull(
      task_ids="get_example_data"
    )
  • Probably only reasonable for very simple API calls (especially calls that don’t return anything)

The Python Approach

  • When you want more control (which will be most of the time), you will probably want to use Python’s requests library

    import requests
    
    def fetch_users():
      url = "https://example.com/api/users"
      resp = requests.get(url)
    
      if resp.status_code != 200:
        raise Exception(
          f"Request failed: {resp.status_code} - {resp.text}"
        )
    
      return resp.json()

Headers or Query Parameters

  • Some APIs may require special header information or you may want to provide extra query parameters

  • The requests library makes this very straightforward

    headers = {
      "Authorization": f"Bearer {API_KEY}"
    }
    params = {
      "page": 2,
      "sort": "desc"
    }
    resp = requests.get(|||url|||, params=params, headers=headers)
  • Requests will take care of handling all the parameter syntax for you!
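
For a sense of what "handling the parameter syntax" means, here is a small standard-library sketch that builds a query string by hand. The helper name build_url is hypothetical, and the extra "q" parameter is added only to show how a space gets encoded. With requests you would not write this yourself; passing params= does the equivalent work for you.

```python
from urllib.parse import urlencode


def build_url(base_url, params):
    """Append URL-encoded query parameters to a base URL."""
    if not params:
        return base_url
    return f"{base_url}?{urlencode(params)}"


full_url = build_url(
    "https://example.com/api/users",
    {"page": 2, "sort": "desc", "q": "red line"},
)
print(full_url)
```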

Working with JSON

  • Once you have your JSON response, you are probably going to want some method to parse it into tables
  • Your options generally look like:
    • Parse it manually, looping over keys/entries

    • Use Pandas: pd.DataFrame.from_dict(|||your json|||)

    • Write to a file, then use DuckDB

      import json
      import duckdb

      with open('temp.json', 'w') as fh:
        json.dump(|||your json|||, fh)
      
      db = duckdb.connect()
      # Double quotes outside so the file name can be single-quoted in SQL
      out = db.sql("SELECT * FROM 'temp.json'")
      out.show()
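
For the manual option, looping over keys and entries might look like the sketch below. The payload shape (a "data" list whose entries contain one level of nested dictionaries) and the sample values are made up for illustration, and flatten_records is a hypothetical helper; a real API response will need its own handling.

```python
def flatten_records(payload, list_key="data"):
    """Turn a nested JSON payload into a list of flat row dictionaries."""
    rows = []
    for entry in payload.get(list_key, []):
        row = {}
        for key, value in entry.items():
            if isinstance(value, dict):
                # Promote nested keys to "parent_child" columns
                for sub_key, sub_value in value.items():
                    row[f"{key}_{sub_key}"] = sub_value
            else:
                row[key] = value
        rows.append(row)
    return rows


# Made-up sample payload
sample = {
    "data": [
        {"id": 1, "user": {"name": "a", "tier": "Student"}},
        {"id": 2, "user": {"name": "b", "tier": "Senior"}},
    ]
}
rows = flatten_records(sample)
print(rows)
```

Each resulting dictionary has the same flat set of keys, which makes it straightforward to insert into a table.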

Pagination

  • Often for data pipelines you might be accessing LOTS of information from an API
  • You can almost never get everything at once
    • Returning everything in a single response would make the API more brittle and prone to timeout problems
  • Endpoints that return a lot of data will generally break it up into pages
  • Query parameters often let you specify which page you want
    • May also be able to specify the desired page size (up to some max)
  • Requires making multiple API calls to gather all of the data that you want

When are we done?

  • Could just keep calling on new pages until you get an error, but it might not be clear what caused the error
    • Was it because that page doesn’t exist or did something else go wrong?
  • Most paginated endpoints will return some metadata in each response about what page was selected, and how many pages exist
    • Can use that information to dynamically loop over the necessary number of pages

Pagination Example

import requests

def get_all_pages(api_url, page_size):
  all_results = []
  current_page = 1
  total_pages = None  # unknown until the first response arrives

  while total_pages is None or current_page <= total_pages:
    params = {
      'page': current_page,
      'page_size': page_size
    }
    try:
      resp = requests.get(api_url, params=params)
      resp.raise_for_status()
    except Exception as e:
      # Chain the original error so the cause is not lost
      raise RuntimeError(f"API request failed on page {current_page}") from e

    payload = resp.json()
    data = payload.get("data", [])
    total_pages = payload.get("total_pages")
    if not data:
      break  # empty page: stop rather than loop forever
    all_results.extend(data)
    current_page += 1

  return all_results

Failing APIs

  • What if you have an API call fail?
  • If it was just a single call, this could be handled with Airflow’s own retry logic
  • But what if it fails on page 63 out of 100?
  • Approaches:
    • You could handle this manually by wrapping each try/except block in a for loop
    • A better, more modern approach (which also closely mimics Airflow’s retry mechanism) is to use the tenacity library
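
The manual approach might look something like this minimal sketch. The helper call_with_retries and the flaky_fetch stand-in are hypothetical; the sleep function is passed in as a parameter only so the example can run without actually waiting.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any exception with doubling waits."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the error surface
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for an API call that fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network drop")
    return {"page": 63, "data": [1, 2, 3]}


waits = []  # record the requested waits instead of sleeping
result = call_with_retries(flaky_fetch, attempts=5, sleep=waits.append)
print(result, waits)
```

Because only the single failing call is retried, pages that were already fetched successfully do not need to be requested again.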

Coding with Tenacity

  • Tenacity is a library that lets you decorate any function and specify retry parameters similar to Airflow’s
    • retry attempts, delay time, exponential backoff, etc.
import requests
from tenacity import retry, stop_after_attempt
from tenacity import wait_exponential, retry_if_exception_type

@retry(
  stop = stop_after_attempt(5),
  wait = wait_exponential(multiplier=2, min=1, max=60),
  # retry_if_exception_type takes an exception class;
  # retry_if_exception expects a predicate function instead
  retry = retry_if_exception_type(Exception),
)
def fetch_page(url, params):
  resp = requests.get(url, params=params)
  resp.raise_for_status()
  return resp.json()

Introducing Fares!

This all ain’t free

  • On any given metro system, riders will be paying some price to ride
  • While we may be assuming riders here prepay on some sort of card that they can tap in and out with, there can still be different “levels” of fare
  • In our simulated metro system, there are currently 4 tiers:
    • Student
    • Standard
    • Senior
    • Weekly

A New Source

  • Fare information would frequently be handled separately from other metro logic, and so it makes sense that this might not be offered up in the same database
  • Instead, you can access the current fare model each individual subscribes to by accessing a particular API
  • Two endpoints to remember:
    • sources.advde:8090/api/docs gets you all the documentation for the API
    • sources.advde:8090/api/|||day|||/fares gets you the subscribed fares on a given day
  • Historical information is available

Work Time

Your Time!

  • The rest of the evening is yours to meet with your group and ensure you are getting a good start on this week’s Milestone
  • If you are still needing to finish up something from last week, prioritize that