---
title: "Airflow APIs and Reliability"
author: Jed Rembold
date: June 18, 2025
slideNumber: true
theme: catppuccin
highlightjs-theme: catppuccin-mocha
width: 1920
height: 1080
transition: slide
---

## Announcements

:::{style='font-size:.9em'}
- Did Milestone 1 dashboards get done last night? If not, you have until tonight!
- End of next Tuesday (June 24)
    - Milestone 2 dashboards due
    - Documentation due
    - Group swap happens after this
- Weekly reflections: through Canvas discussions
:::

## Tonight

- Reliability checking in Airflow
- API connections
- Introducing a new data source
- Work time

# Reliability

## Building for Resilience

- So you have yourself a working pipeline: nice!
- But a pipeline that runs correctly a single time is of exceptionally limited use
- How can we build resilient pipelines?
    - If a network error occurs, we don't skip entire DAGs
    - If some data inconsistency occurs, we don't pollute our warehouse
    - If a task takes longer than expected to run, we raise alerts

## Network Collisions

::::::{.cols style='align-items: center'}
::::col
- The computer networks that your data pipelines traverse are very busy places
- Despite modern networking standards, sometimes a signal is lost
- How can we account for that?
::::

::::col
::::
::::::

## Do or do not?

- Wait, there **IS** a retry
- There are several settings you might want to add to individual Airflow _tasks_:

:::{style='font-size:.9em'}
| Setting                     | Description                                              | Type       |
|-----------------------------|----------------------------------------------------------|------------|
| `retries`                   | How many times to retry if something fails               | `int`      |
| `retry_delay`               | How long to wait until trying again                      | `duration` |
| `retry_exponential_backoff` | Should each retry wait exponentially longer?             | `bool`     |
| `max_retry_delay`           | Max delay between retries (caps the exponential backoff) | `duration` |
:::

## Retry examples

- All of these settings are placed in the `()` after the `@task` decorator

```python
import pendulum
from airflow.sdk import task  # Airflow 3 import path

@task(
    retries = 3,
    retry_delay = pendulum.duration(minutes=1),
    retry_exponential_backoff = True
)
def my_tricky_task():
    |||Task logic and code|||
```

## Common Task arguments

::::::{.cols style='align-items: center'}
::::col
- If you want to do this for many tasks that make up a DAG, this can get tedious
- You can always define a dictionary of common DAG arguments _once_, and then reference it in the DAG creation
- This will cause each task to utilize those defaults
::::

::::{.col style='flex-grow:1.6'}
```{.python style='font-size:.8em; max-height:800px'}
from pendulum import duration, datetime
from airflow.sdk import dag  # Airflow 3 import path

defaults = {
    'retries': 3,
    'retry_delay': duration(minutes=1),
    'retry_exponential_backoff': True,
    'max_retry_delay': duration(hours=1)
}

@dag(
    start_date = datetime(2025, 6, 18),
    schedule = '0 0 * * *',
    default_args = defaults
)
def my_great_dag():
    |||All my tasks which will inherit the defaults|||
```
::::
::::::

## Go to Timeout

- Sometimes you just want to fail a task that has taken too long
    - Failing is what would happen anyway once all the retries were used up
- Can add an `execution_timeout=|||some duration|||` parameter
- There is a concept of SLA monitoring that would just send an alert instead of failing, but it is currently deactivated in Airflow 3.0
    - Will be coming back in Airflow 3.1

## Speaking of Monitoring

- Airflow has a variety of built-in ways that it can alert you to various events:
    - When a task runs successfully (`on_success_callback`)
    - When a task fails (`on_failure_callback`)
    - When a task retries (`on_retry_callback`)
    - When a DAG completes successfully (`on_success_callback`)
    - When a DAG fails (`on_failure_callback`)
- For each of these, you can specify the hook in your task/DAG parameters that links to a callback function

## The Callbacks

- The callback function is just a Python function that performs some action when this event occurs
- Possibilities could include:
    - Just printing a message
    - Sending a Slack or Discord message (via webhook)
    - Sending an email
    - Pushing to some external monitoring system
- Any alert callback function gets the `context` passed into it as the only parameter

## Discord

- I am working on adding a function to the `utils.py` file that would enable Discord reporting to your private group channel on Discord
- It is not quite done yet (I need an in-between layer to account for the fact that you'll be rotating projects), but I hope to have it done by the end of the week

```python
from utils import discord_alert

@dag(
    ...,
    on_failure_callback = discord_alert
)
def my_broken_dag():
    |||All the tasks|||
```

## Email

::::::{.cols style='align-items: center'}
::::col
- The cumbersome part of email alerts is configuring the email that Airflow will use to send _from_
- Requires editing your `airflow.cfg` (right)
- Can use `send_email` in a custom callback
- Alternatively, set the `email`, `email_on_failure`, and `email_on_retry` settings in default args
::::

::::col
```{.ini style='font-size:.65em; max-height:800px;'}
[email]
email_backend = airflow.utils.email.send_email_smtp

[smtp]
smtp_host = smtp.gmail.com
smtp_starttls = True
smtp_ssl = False
smtp_user = your_email@gmail.com
# NOT your Gmail password!
smtp_password = your_app_password
smtp_port = 587
smtp_mail_from = your_email@gmail.com
```
::::
::::::

## Data Checks

- The earlier content builds resiliency in the pipeline itself, but what about in the data moving through the pipeline?
- Older provided operators include:
    - `SQLCheckOperator`
    - `SQLValueCheckOperator`
    - `SQLThresholdCheckOperator`
    - `SQLIntervalCheckOperator`
- These are used less often today, but they highlight the kinds of things you should be thinking about.

## Don't Overlook!

- You might feel like constraints in your data warehouse and data sources are already accounting for much of this
- They partly are, but there are some big "holes"
    - The instant you read in data from the database, **you have lost all the constraints**
    - This is as true for DuckDB as it is for SQLAlchemy, Psycopg2, or any other engine
    - So any SQL manipulations you do within the DAG itself are **not** protected by constraints
- Some mistakes that you could make might be flagged when you try to write into the warehouse with its own constraints, but many might not be!

## Patterns to Check For

- Some common items to check for include:
    - Ensuring that row counts are not 0 (unless you are specifically allowing for that)
    - Ensuring you have no `NULL` values in important columns
    - Ensuring you have no duplicates in a column that is supposed to be a primary key
    - Ensuring that values are within an expected range
    - Checking metrics against previous metrics if expecting monotonic growth/decay

## Data Checking Tips

- You should usually do these checks immediately after reading in your data or making any large transformations
    - If you are already passing the read data between tasks, it probably makes sense to put the checks in their own task
- As a general rule, do **not** open a connection in one task and then try to pass that connection object around to other tasks.
    - Tasks might be run on completely different workers/systems
    - Connections can get closed when one worker finishes, leaving the next with a dead connection
- Making these checks can feel like extra work (and it is) up until the point where they save you a **TON** of cleanup work

# Break Time!

# API Connections

## Application Programming Interfaces

- Everyone has worked with APIs to some degree at this point
- Generally you access some particular URL, and instead of getting back a webpage, you get organized data (usually in a JSON format)
- Being able to connect to various API endpoints is incredibly valuable in data pipelines, where many potential data sources may offer their information at a particular endpoint.
- So how can we work with these in Airflow?

## The Simple Case

::::::{.cols style='font-size:.9em; align-items: center;'}
::::{.col style='flex-grow:1.1'}
```{.python style='font-size:.75em'}
from airflow.providers.http.operators.http import (
    HttpOperator
)

http_task = HttpOperator(
    task_id="get_example_data",
    method="GET",
    endpoint="api/data",
    http_conn_id="my_api_conn",
    headers={
        "Accept": "application/json"
    },
    response_filter= (
        lambda response: response.json()
    ),
    log_response=True
)
```
::::

::::col
- There is an operator just called `HttpOperator` that you can utilize
- A bit cumbersome to get the acquired data into another task, as you _must_ use the XCom method

```{.python style='font-size:.8em'}
context['ti'].xcom_pull(
    task_ids="get_example_data"
)
```

- Probably only reasonable for very simple API calls (especially calls that don't return anything)
::::
::::::

## The Python Approach

- When you want more control (which will be most of the time), you will probably want to use Python's `requests` library

```{.python style='font-size:.85em'}
import requests

def fetch_users():
    url = "https://example.com/api/users"
    resp = requests.get(url)
    if resp.status_code != 200:
        raise Exception(
            f"Request failed: {resp.status_code} - {resp.text}"
        )
    return resp.json()
```

## Headers or Query Parameters

- Some APIs may require special header information, or you may want to provide extra query parameters
- The `requests` library makes this very straightforward

```python
headers = {
    "Authorization": f"Bearer {API_KEY}"
}
params = {
    "page": 2,
    "sort": "desc"
}
resp = requests.get(|||url|||, params=params, headers=headers)
```

- Requests will take care of handling all the parameter syntax for you!

## Working with JSON

- Once you have your JSON response, you are probably going to want some method to parse it into tables
- Your options generally look like:
    - Parse it manually, looping over keys/entries
    - Use Pandas: `pd.DataFrame.from_dict(|||your json|||)`
    - Write to a file, then use DuckDB

```python
import json
import duckdb

with open('temp.json', 'w') as fh:
    json.dump(|||your json|||, fh)
db = duckdb.connect()
out = db.sql("SELECT * FROM 'temp.json'")
out.show()
```

## Pagination

- Often for data pipelines you might be accessing LOTS of information from an API
- You can almost never get everything at once
    - Returning everything in one response would make the API more brittle and prone to timeout problems
- Endpoints that return a lot of data will generally break it up into _pages_
    - Endpoint parameters often let you specify which page you want
    - May also be able to specify the desired page size (up to some max)
- Requires making multiple API calls to gather all of the data that you want

## When are we done?

- Could just keep calling on new pages until you get an error, but it might not be clear what caused the error
    - Was it because that page doesn't exist, or did something else go wrong?
- Most paginated endpoints will return some metadata in each response about what page was selected, and how many pages exist
    - Can use that information to dynamically loop over the necessary number of pages

## Pagination Example

```{.python style='max-height:800px; white-space: pre; font-size:.9em;'}
import requests

def get_all_pages(api_url, page_size):
    all_results = []
    current_page = 1
    total_pages = None
    while total_pages is None or current_page <= total_pages:
        params = {
            'page': current_page,
            'page_size': page_size
        }
        try:
            resp = requests.get(api_url, params=params)
            resp.raise_for_status()
        except Exception as e:
            raise RuntimeError(
                f"API request failed on page {current_page}"
            ) from e
        payload = resp.json()
        data = payload.get("data", [])
        total_pages = payload.get("total_pages")
        all_results.extend(data)
        if total_pages is None:
            break  # no page count reported: stop instead of looping forever
        current_page += 1
    return all_results
```

## Failing APIs

- What if you have an API call fail?
    - If it was just a single call, this could be handled with Airflow's own retry logic
    - But what if it fails on page 63 out of 100?
- Approaches:
    - You could handle this manually by wrapping each `try`/`except` block in a for loop
    - A better, more modern approach (which also closely mimics Airflow's retry mechanism) is to use the `tenacity` library

## Coding with Tenacity

- Tenacity is a library that lets you decorate any function and specify retry parameters similar to Airflow's
    - Retry attempts, delay time, exponential backoff, etc.

```python
import requests
from tenacity import retry, stop_after_attempt
from tenacity import wait_exponential, retry_if_exception_type

@retry(
    stop = stop_after_attempt(5),
    wait = wait_exponential(multiplier=2, min=1, max=60),
    retry = retry_if_exception_type(Exception),
)
def fetch_page(url, params):
    resp = requests.get(url, params=params)
    resp.raise_for_status()
    return resp.json()
```

# Introducing Fares!

## This all ain't free

- On any given metro system, riders will be paying some price to ride
- While we may be assuming riders here prepay on some sort of card that they can tap in and out with, there can still be different "levels" of fare
- In our simulated metro system, there are currently 4 tiers:
    - Student
    - Standard
    - Senior
    - Weekly

## A New Source

- Fare information would frequently be handled separately from other metro logic, and so it makes sense that this might not be offered up in the same database
- Instead, you can access the current fare model each individual subscribes to by accessing a particular API
- Two endpoints to remember:
    - `sources.advde:8090/api/docs` gets you all the documentation for the API
    - `sources.advde:8090/api/|||day|||/fares` gets you the subscribed fares on a given day
        - Historical information is available

# Work Time

## Your Time!

- The rest of the evening is yours to meet with your group and ensure you are getting a good start on this week's Milestone
- If you still need to finish up something from last week, prioritize that