---
title: "Airflow APIs and Reliability"
author: Jed Rembold
date: June 18, 2025
slideNumber: true
theme: catppuccin
highlightjs-theme: catppuccin-mocha
width: 1920
height: 1080
transition: slide
---


## Announcements
:::{style='font-size:.9em'}
- Did Milestone 1 dashboards get done last night? If not, you have until tonight!
- End of next Tuesday (June 24)
  - Milestone 2 dashboards due
  - Documentation due
  - Group swap happens after this
- Weekly reflections: through Canvas discussions
:::


## Tonight
- Reliability checking in Airflow
- API connections
- Introducing a new data source
- Work time

# Reliability

## Building for Resilience
- So you have yourself a working pipeline: nice!
  - But a pipeline that runs correctly a single time is of exceptionally limited use
- How can we build resilient pipelines?
  - If a network error occurs, we don't skip entire DAGs
  - If some data inconsistency occurs, we don't pollute our warehouse
  - If a task takes longer than expected to run, we raise alerts


## Network Collisions
::::::{.cols style='align-items: center'}
::::col
- The computer networks that your data pipelines traverse are very busy places
- Despite modern networking standards, sometimes a signal is lost
- How can we account for that?
::::

::::col
<iframe width="560" height="315" src="https://videopress.com/embed/aNyoaaBs?loop=1&autoPlay=1" frameborder="0" allowfullscreen allow="clipboard-write"></iframe><script src="https://videopress.com/videopress-iframe.js"></script>
::::
::::::

## Do or do not? - Wait there **IS** a retry
- There are several settings you might want to add to individual Airflow _tasks_:

:::{style='font-size:.9em'}

| Setting                     | Description                                              | Type       |
|-----------------------------|----------------------------------------------------------|------------|
| `retries`                   | How many times to retry if something fails               | `int`      |
| `retry_delay`               | How long to wait until trying again                      | `duration` |
| `retry_exponential_backoff` | Should each retry wait exponentially longer?             | `bool`     |
| `max_retry_delay`           | Max delay between retries (caps the exponential backoff) | `duration` |

:::

## Retry examples
- These settings all go inside the parentheses after the `@task` decorator

```python
@task(
  retries = 3,
  retry_delay = pendulum.duration(minutes=1),
  retry_exponential_backoff = True
)
def my_tricky_task():
  |||Task logic and code|||
```
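
## Backoff in Numbers
- A rough sketch of how the wait grows when `retry_exponential_backoff` is on, assuming the delay simply doubles each attempt and is capped at `max_retry_delay`
- `backoff_delays` is a hypothetical helper, not part of Airflow, and Airflow's own calculation also adds some jitter, so real waits will differ

```python
from datetime import timedelta

def backoff_delays(retries, retry_delay, max_retry_delay=None):
    """Approximate wait before each retry when the delay doubles every attempt."""
    delays = []
    for attempt in range(retries):
        delay = retry_delay * (2 ** attempt)      # 1x, 2x, 4x, ... the base delay
        if max_retry_delay is not None:
            delay = min(delay, max_retry_delay)   # the cap stops runaway waits
        delays.append(delay)
    return delays

delays = backoff_delays(5, timedelta(minutes=1), timedelta(minutes=10))
print([int(d.total_seconds() // 60) for d in delays])  # [1, 2, 4, 8, 10]
```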

## Common Task arguments
::::::{.cols style='align-items: center'}
::::col
- If you want to do this for many tasks that make up a DAG, this can get tedious
- You can always define a dictionary of common dag arguments _once_, and then reference it in the dag creation
- This will cause each task to utilize those defaults
::::

::::{.col style='flex-grow:1.6'}
```{.python style='font-size:.8em; max-height:800px'}
from pendulum import duration, datetime

defaults = {
  'retries': 3,
  'retry_delay': duration(minutes=1),
  'retry_exponential_backoff': True,
  'max_retry_delay': duration(hours=1)
}

@dag(
  start_date = datetime(2025, 6, 18),
  schedule = '0 0 * * *',
  default_args = defaults
)
def my_great_dag():
  |||All my tasks which will inherit the defaults|||
```

::::
::::::

## Go to Timeout
- Sometimes you just want to fail a task that has taken too long
  - A task that keeps failing ends up failed anyway once its retries run out; a timeout just gets you there without waiting indefinitely
- Can add an `execution_timeout=|||some duration|||` parameter
- There is a concept of SLA monitoring that would just send an alert instead of failing, but it is currently deactivated in Airflow 3.0
  - Will be coming back in Airflow 3.1


## Speaking of Monitoring
- Airflow has a variety of built-in ways to alert you to various events:
  - When a task runs successfully (`on_success_callback`)
  - When a task fails (`on_failure_callback`)
  - When a task retries (`on_retry_callback`)
  - When a DAG completes successfully (`on_success_callback`)
  - When a DAG fails (`on_failure_callback`)
- For each of these, you set the corresponding parameter on your task or DAG to the callback function you want called

## The Callbacks
- The callback function is just a Python function that performs some action when this event occurs
- Possibilities could include:
  - Just printing a message
  - Sending a Slack or Discord message (via webhook)
  - Sending an email
  - Pushing to some external monitoring system
- Any alert callback function gets the `context` passed into it as the only parameter
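
## A Minimal Callback
- A minimal sketch of a "just print a message" callback that can be tried outside of Airflow
- It assumes the context holds a task instance under `ti` and the error under `exception`, as is typical for a task-level failure; the task and DAG names are made up

```python
from types import SimpleNamespace

def describe_failure(context):
    """Build a one-line alert message from the context passed to the callback."""
    ti = context["ti"]                  # task instance that triggered the callback
    error = context.get("exception")    # assumed key; None if it is not present
    return f"Task {ti.task_id} in DAG {ti.dag_id} failed: {error}"

def print_alert(context):
    """Callback that only prints; swap print for a webhook or email call."""
    print(describe_failure(context))

# Stand-in context so the callback can be run without Airflow
fake_context = {
    "ti": SimpleNamespace(task_id="load_rides", dag_id="metro_pipeline"),
    "exception": ValueError("no rows returned"),
}
print_alert(fake_context)
```

- In a real DAG this would be attached with something like `@task(on_failure_callback=print_alert)`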

## Discord
- I am working on adding a function to the `utils.py` file that would enable Discord reporting to your private group channel on Discord
- It is not quite done yet (I need an in-between layer to account for the fact that you'll be rotating projects), but I hope to have it done by the end of the week

```python
from utils import discord_alert

@dag(
  ...,
  on_failure_callback = discord_alert
)
def my_broken_dag():
  |||All the tasks|||
```

## Email
::::::{.cols style='align-items: center'}
::::col
- The cumbersome part of email alerts is configuring the email that Airflow will use to send _from_
  - Requires editing your `airflow.cfg` (right)
- Can use `send_email` in custom callback
- Alternatively, set `email`, `email_on_failure`, `email_on_retry` settings in default args
::::

::::col
```{.ini style='font-size:.65em; max-height:800px;'}
[email]
email_backend = airflow.utils.email.send_email_smtp

[smtp]
smtp_host = smtp.gmail.com
smtp_starttls = True
smtp_ssl = False
smtp_user = your_email@gmail.com
smtp_password = your_app_password  # NOT your Gmail password!
smtp_port = 587
smtp_mail_from = your_email@gmail.com
```

::::
::::::

## Data Checks
- The earlier content builds resiliency in the pipeline itself, but what about in the data moving through the pipeline?
- Older provided operators include:
  - `SQLCheckOperator`
  - `SQLValueCheckOperator`
  - `SQLThresholdCheckOperator`
  - `SQLIntervalCheckOperator`
- These are less commonly used today, but they highlight the kinds of checks you might want to be thinking about.


## Don't Overlook!
- You might feel like constraints in your data warehouse and data sources are already accounting for much of this
  - They partly are, but there are some big "holes"
- The instant you read in data from the database, **you have lost all the constraints**
  - This is as true for DuckDB as it is for SQLAlchemy, Psycopg2, or any other engines
- So any SQL manipulations you do within the DAG itself are **not** protected by constraints
- Some mistakes that you could make might be flagged when you try to write into the warehouse with its own constraints, but many might not be!


## Patterns to Check For
- Some common items to check for include:
  - Ensuring that row counts are not 0 (unless you are specifically allowing for that)
  - Ensuring you have no `NULL` values in important columns
  - Ensuring you have no duplicates in a column that is supposed to be a primary key
  - Ensuring that values are within an expected range
  - Checking metrics against previous metrics if expecting monotonic growth/decay
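
## Checks in Code
- One way the first four patterns could look in plain Python, assuming the data has been read into a list of dictionaries
- `check_rows` and the column names are hypothetical; the same ideas carry over to SQL or Pandas

```python
def check_rows(rows, key, required, ranges):
    """Raise ValueError on the first failed check; return None if all pass."""
    if not rows:
        raise ValueError("no rows returned")
    for col in required:
        if any(row.get(col) is None for row in rows):
            raise ValueError(f"NULL found in required column {col!r}")
    keys = [row[key] for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError(f"duplicate values in key column {key!r}")
    # Range checks assume these columns were already checked for NULLs above
    for col, (low, high) in ranges.items():
        if any(not (low <= row[col] <= high) for row in rows):
            raise ValueError(f"value out of range in column {col!r}")

rides = [
    {"ride_id": 1, "fare": 2.50},
    {"ride_id": 2, "fare": 1.25},
]
check_rows(rides, key="ride_id", required=["ride_id", "fare"], ranges={"fare": (0, 20)})
```

- Raising an exception inside a task marks the task as failed, which keeps the bad data from moving downstream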


## Data Checking Tips
- You usually should do these checks immediately after reading in your data or making any large transformations
- If you are already passing the read data between tasks, it probably makes sense to put the checks in their own task
- As a general rule, do **not** open a connection in one task and then try to pass around that connection object to other tasks.
  - Tasks might be run on completely different workers/systems
  - Connections can get closed when one worker finishes, leaving the next with a dead connection
- Making these checks can feel like extra work (and it is) up until the point where they save you a **TON** of cleanup work

# Break Time!

# API Connections

## Application Programming Interfaces
- Everyone has worked with APIs to some degree at this point
- Generally you access some particular URL, and instead of getting back a webpage, you get organized data (usually in a JSON format)
- Being able to connect to various API endpoints is incredibly valuable in data pipelines, where many potential data sources may offer their information at a particular endpoint.
- So how can we work with these in Airflow?


## The Simple Case
::::::{.cols style='font-size:.9em; align-items: center;'}
::::{.col style='flex-grow:1.1'}
```{.python style='font-size:.75em'}
http_task = HttpOperator(
    task_id="get_example_data",
    method="GET",
    endpoint="api/data",
    http_conn_id="my_api_conn",
    headers={
      "Accept": "application/json"
    },
    response_filter= (
      lambda response: response.json()
    ),
    log_response=True
)
```
::::

::::col
- There is an operator just called `HttpOperator` that you can utilize
- A bit cumbersome to get the acquired data into another task, as you _must_ use the XCom method
  ```{.python style='font-size:.8em'}
  context['ti'].xcom_pull(
    task_ids="get_example_data"
  )
  ```
- Probably only reasonable for very simple API calls (especially calls that don't return anything)

::::
::::::

## The Python Approach
- When you want more control (which will be most of the time), you will probably want to use Python's `requests` library
  ```{.python style='font-size:.85em'}
  import requests

  def fetch_users():
    url = "https://example.com/api/users"
    resp = requests.get(url)

    if resp.status_code != 200:
      raise Exception(
        f"Request failed: {resp.status_code} - {resp.text}"
      )

    return resp.json()
  ```

## Headers or Query Parameters
- Some APIs may require special header information or you may want to provide extra query parameters
- The requests library makes this very straightforward
  ```python
  headers = {
    "Authorization": f"Bearer {API_KEY}"
  }
  params = {
    "page": 2,
    "sort": "desc"
  }
  resp = requests.get(|||url|||, params=params, headers=headers)
  ```
- Requests will take care of handling all the parameter syntax for you!
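
## What Requests Builds
- To see roughly what ends up on the wire, the same query string can be built with the standard library
- This is only an illustration with a placeholder URL; in practice you would keep passing `params=` to `requests.get`

```python
from urllib.parse import urlencode

base_url = "https://example.com/api/users"   # placeholder URL
params = {"page": 2, "sort": "desc"}

# Join the encoded parameters onto the URL after a "?"
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)  # https://example.com/api/users?page=2&sort=desc
```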


## Working with JSON
- Once you have your JSON response, you are probably going to want some method to parse it into tables
- Your options generally look like:
  - Parse it manually, looping over keys/entries
  - Use Pandas: `pd.DataFrame.from_dict(|||your json|||)`
  - Write to a file, then use DuckDB
    ```python
    import json
    import duckdb

    with open('temp.json', 'w') as fh:
      json.dump(|||your json|||, fh)

    db = duckdb.connect()
    out = db.sql("SELECT * FROM 'temp.json'")
    out.show()
    ```
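
## Parsing by Hand
- A small sketch of the manual option: flattening nested records into table-like rows
- The payload shape here is invented for illustration and is not the response format of any particular API

```python
# Made-up response: a list of records, each with a nested "fare" object
payload = {
    "data": [
        {"id": 1, "name": "Ada", "fare": {"tier": "Student", "price": 1.25}},
        {"id": 2, "name": "Grace", "fare": {"tier": "Standard", "price": 2.50}},
    ]
}

def flatten(record, prefix=""):
    """Turn nested dictionaries into one flat dict with underscore-joined keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))  # recurse into nesting
        else:
            flat[name] = value
    return flat

rows = [flatten(record) for record in payload["data"]]
print(rows[0])  # {'id': 1, 'name': 'Ada', 'fare_tier': 'Student', 'fare_price': 1.25}
```

- Each flat dictionary maps directly onto one row of a table, with the keys as column names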

## Pagination
- Often for data pipelines you might be accessing LOTS of information from an API
- You can almost never get everything at once
  - Returning everything in a single response would make the API more brittle and prone to timeouts
- Endpoints that return a lot of data will generally break it up into _pages_
- Query parameters often let you specify which page you want
  - May also be able to specify the desired page size (up to some max)
- Requires making multiple API calls to gather all of the data that you want

## When are we done?
- Could just keep calling on new pages until you get an error, but it might not be clear what caused the error
  - Was it because that page doesn't exist or did something else go wrong?
- Most paginated endpoints will return some metadata in each response about what page was selected, and how many pages exist
  - Can use that information to dynamically loop over the necessary number of pages


## Pagination Example

```{.python style='max-height:800px; white-space: pre; font-size:.9em;'}
def get_all_pages(api_url, page_size):
  all_results = []
  current_page = 1
  total_pages = None

  while total_pages is None or current_page <= total_pages:
    params = {
      'page': current_page,
      'page_size': page_size
    }
    try:
      resp = requests.get(api_url, params=params)
      resp.raise_for_status()
    except Exception as e:
      raise RuntimeError(f"API request failed on page {current_page}") from e

    payload = resp.json()
    data = payload.get("data", [])
    # Fall back to the current page so the loop ends if the metadata is missing
    total_pages = payload.get("total_pages", current_page)
    all_results.extend(data)
    current_page += 1

  return all_results
```

## Failing APIs
- What if you have an API call fail?
- If it was just a single call, this could be handled with Airflow's own retry logic
- But what if it fails on page 63 out of 100?
- Approaches:
  - You could handle this manually by wrapping each `try`/`except` block in a for loop
  - A better, more modern approach (which also closely mimics Airflow's retry mechanism) is to use the `tenacity` library
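
## Retrying by Hand
- A sketch of the manual approach: a for loop around a `try`/`except`, doubling the wait after each failure
- `call_with_retries` and `flaky_fetch` are hypothetical names; the flaky function stands in for a real API call

```python
import time

def call_with_retries(func, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any exception for up to `attempts` total tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise                           # out of tries: let the error surface
            sleep(delay * 2 ** (attempt - 1))   # wait 1x, 2x, 4x, ... the delay

# Stand-in for an API call that fails twice before succeeding
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network error")
    return {"page": 63, "data": []}

result = call_with_retries(flaky_fetch, attempts=5, delay=0.01)
print(result, calls["count"])  # {'page': 63, 'data': []} 3
```

- Retrying a single page this way means a failure on page 63 does not throw away the first 62 pages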

## Coding with Tenacity
- Tenacity is a library that lets you decorate any function with retry parameters similar to Airflow's
  - retry attempts, delay time, exponential backoff, etc.

```python
import requests
from tenacity import retry, stop_after_attempt
from tenacity import wait_exponential, retry_if_exception_type

@retry(
  stop = stop_after_attempt(5),
  wait = wait_exponential(multiplier=2, min=1, max=60),
  retry = retry_if_exception_type(Exception),
)
def fetch_page(url, params):
  resp = requests.get(url, params=params)
  resp.raise_for_status()
  return resp.json()
```

# Introducing Fares!

## This all ain't free
- On any given metro system, riders will be paying some price to ride
- While we may be assuming riders here prepay on some sort of card that they can tap in and out with, there can still be different "levels" of fare
- In our simulated metro system, there are currently 4 tiers:
  - Student
  - Standard
  - Senior
  - Weekly


## A New Source
- Fare information would frequently be handled separately from other metro logic, and so it makes sense that this might not be offered up in the same database
- Instead, you can access the current fare model each individual subscribes to by accessing a particular API
- Two endpoints to remember:
  - `sources.advde:8090/api/docs` gets you all the documentation for the API
  - `sources.advde:8090/api/|||day|||/fares` gets you the subscribed fares on a given day
- Historical information is available


# Work Time

## Your Time!
- The rest of the evening is yours to meet with your group and ensure you are getting a good start on this week's Milestone
- If you are still needing to finish up something from last week, prioritize that
