---
title: "Airflow and Snapshots"
author: Jed Rembold
date: May 26/27, 2026
slideNumber: true
theme: catppuccin
highlightjs-theme: catppuccin-mocha
width: 1920
height: 1080
transition: slide
---


## Announcements
:::{style='font-size:.8em'}
- VMs are live!
  - You should have gotten a whole host of connection details in an email
  - Some needed to connect to the VM in the first place
  - Some needed to connect to other pieces of software within the network
- For next week:
  - Check in with your group!
    - How do you like to communicate? There are Discord channels for each group should you want to use them.
    - These first few weeks the amount of work isn't going to be as large, but is important for everyone to understand. How do you want to accomplish that?
  - Get Airflow up and running on your VM
  - Ensure you can create tunnels to access the source database or RustFS (S3)
  - Create a simple DAG to take a daily snapshot of the source database content and save it in parquet format in your data lake bucket
:::

## Evenings Plan
- DuckDB
- Getting in the (Air)Flow
  - Taskflow
  - Connections
  - Accessing Databases
  - Accessing S3 buckets
  - Installation
- Group Practice Time?


# The Database of Ducks

## A Versatile Tool
- `DuckDB` is an analytic database engine (column oriented)
- It is embeddable. It runs in the same process as your application
- No server is required. It can create a database in memory, or write to a file
  - Like SQLite but for analytics
- It is fast, works everywhere, connects to many systems, and utilizes clean SQL!


## Installation
- DuckDB is commonly used within other scripting engines, but it can also operate standalone as a CLI tool
- Install the CLI tool by following the quick guide [here](https://duckdb.org/docs/installation/?version=stable&environment=cli&platform=win&download_method=direct&architecture=x86_64) if you desire. It is already installed on your VM
- Add Python bindings so that you can use it within Python
  ```bash
  pip install duckdb --upgrade
  ```
- Add R libraries so that you can use it within R
  ```r
  install.packages("duckdb")
  ```

## Why?
- I find DuckDB to be an excellent "swiss-army knife" of working with data, but a few big perks include:
  - Querying directly from parquet/CSV data
  - Uses mostly ANSI standard SQL
  - Seamlessly connecting to multiple databases simultaneously
  - Excellent parquet support
  - It is multithreaded by default
  - Excellent [documentation](https://duckdb.org/docs/stable/sql/introduction)


## Memory or Permanent?
- When you connect to DuckDB, you can provide a file location (or not)
- If you provide no location, everything you do in the "database" is ephemeral, and is not saved beyond the individual session
  - If you try to load anything greater than your system's available memory, it will overflow to disk, but will clean up when it exits
- If you do provide a location and file (usually with a `.duckdb` suffix), then the state of the database **is** maintained, and reconnecting to it later will show the same contents
- In this class, as a tool, I use it more in "memory mode", but it is very capable as a standalone database engine as well!


## Working with Parquet
- DuckDB has excellent parquet support!
- Query directly:
  ```sql
  SELECT * FROM 'my_data.parquet'
  WHERE state = 'OR';
  ```
  - This can include multiple files simultaneously!
- Write existing data:
  ```sql
  COPY my_table TO 'my_table.parquet' (FORMAT 'parquet');
  ```
  - You can also specify compression options (though it uses a good one by default)
  - It can write partitioned Parquet files

## Attaching Other Databases
- Provided you have connection credentials, you can "attach" another database directly to DuckDB
- This allows you to use DuckDB to query directly from the database
- You can also modify or update tables within the connected database (within some limits)
- Makes DuckDB a very nice "glue" to move data between databases

```python
import duckdb

con = duckdb.connect()
URI = 'postgresql://uname:pw@host:5432/db'
con.execute(f"ATTACH '{URI}' AS pg (TYPE postgres)")
con.sql("SELECT * FROM pg.my_table")
```

## Attaching to S3
- DuckDB can also read (and write) structure data directly from S3 buckets
- Need to define a "secret"
  ```{.sql style='font-size:.8em'}
  CREATE OR REPLACE SECRET (
    TYPE s3,
    PROVIDER config,
    KEY_ID 'your access key',
    SECRET 'your secret key',
    ENDPOINT 'host:port',
    REGION 'us-east-1',
    URL_STYLE 'path',
    USE_SSL false
  )
  ```
- Then can query from it:
  ```sql
  SELECT * FROM 's3://bucket_name/filename.parquet'
  ```

# Airflow

## Why not CRON?
- In past pipelines, we just used CRON to schedule ETL events. Why is it not sufficient?
  - _No dependency management_: If one task fails, no downstream tasks are aware. So they might also fail, or **worse**, they might operate on old data, potentially duplicating or misleading further downstream tasks
  - _No retry logic_: ETL tasks commonly happen on busy networks, which means sometimes resources are busy. Any interruption causes that process to fail.
  - _Difficult to debug_: Failing tasks can be straightforward to track down, but what about tasks that are returning incorrect information?
  - _Scripts everywhere_: The individual scripts that are run with CRON tend to be scattered, and there is no easy visibility on why certain scripts are scheduled when they are.

## Airflow: The Orchestra Conductor
::::::{.cols style='align-items: center'}
::::col
- Orchestration is all about controlling when, how, and in what order thing run
- Airflow was developed in 2014 by Airbnb and then donated to the Apache Foundation, where it became open-source
- Designed to programmatically define workflows, schedule them, and track their execution
::::

::::col
![](../images/airflow_conductor.svg)
::::
::::::

## Thats DAG Cool
- Airflow defines workflows through a directed acyclic graph, or _DAG_
  - Directed: Each step logically follows a previous. Individuals steps are called _tasks_
  - Acyclic: There are no loops
  - Graph: Different steps are linked to one another

:::r-stack
![](../images/dag_ex1.excalidraw.png){.fragment .current-visible .only-fragment}
![](../images/dag_ex2.excalidraw.png){.fragment .current-visible .only-fragment width=75%}
:::

## DAG Priorities
- Each DAG thus has:
  - A clear ordering: what should happen next contingent on what has happened so far
  - Task independence: if two tasks do not depend on one another and both are available, both can run simultaneously
  - Scheduling: we can schedule when the DAG should run as a whole unit
  - Observability: we can see exactly what is happening each step of the way


<!-- already covered I think
## Conductor != Transformer
- Airflow is not meant to do the heavy lifting of transformations
- It **can** do transformations for lighter tasks, but you should keep the roles separate in your head.
- For heavy transformation workloads (billions of rows), the transformations will almost always be farmed out or triggered in some distributed system, often utilizing Docker containers
- In _this_ class, our workloads will be such that both orchestration and transformations can occur on the same system.
-->

## Creating a DAG
- Airflow utilizes Python to construct DAGs and run the orchestration
- There are generally two different syntaxes that are used. We'll focus here on the more modern (and Pythonic) which goes by the name of "Taskflow"
- Taskflow uses _decorators_ on Python functions to indicate DAGs and Tasks
- We create a DAG by defining several attributes inside the `dag` decorator:

## A TaskFlow Dag
  ```{.python style='font-size:.9em; max-height:800px'}
  from airflow.sdk import dag
  import pendulum

  @dag(
    start_date = pendulum.datetime(2026, 5, 26, tz='PDT'),
    schedule = "@daily",
    catchup = False,
    tags=['demo']
  )
  def my_first_dag():
    |||tasks and ordering|||

  my_first_dag()
  ```

## DAG Properties

|            Property | Description                                                                                                                              |
|--------------------:|------------------------------------------------------------------------------------------------------------------------------------------|
|            `dag_id` | This is provided by the name of the function. This is what will show up in the Airflow GUI.                                              |
| `schedule` | Could be a full CRON expression or a preset like `@daily`                                                                                |
|        `start_date` | The date at which the DAG should begin executing on its schedule. Note that DAGS run at the _end_ of their interval after the start date |
|           `catchup` | Defaults to `False`. Indicates whether to retroactively run missed DAG runs                                                              |


## Tasks
- A task is Airflows term for one of the _nodes_ that make up a DAG, and represents a single unit of work
- A single task should be:
  - _Idempotent_: can be run multiple times without unwanted side effects
    - This is the #1 thing that students struggle with when writing good tasks. You should be able to execute a task multiple times and have the output be unchanged.
  - _Atomic_: should be entirely self-contained


## Task Decorators
- The decorator for tasks is much simpler, though it can have optional parameters
- Tasks are defined as sub-functions within the function that is the DAG
  ```{.python style='max-height:800px; font-size:.8em'}
  from airflow.sdk import dag, task
  import pendulum

  @dag(...)
  def my_first_dag():
    
    @task()
    def clean_data():
      |||Python code to read and clean a table|||
  ```

## Operators
- Airflow defines many different _operators_, which are basic templates for general actions you might want to accomplish
  - Think, "Run a Python function", or "Run some SQL", or "Run a BASH script"
- The basic taskflow syntax assumes that everything is a "PythonOperator"
  - This is a fine starting point and keeps things simpler initially
  - We'll introduce some others later (being able to easily execute SQL is a useful one)

<!--
## Operation Time
- So what can make up a task?
- Airflow defines what it calls _operators_, which are basically templates for common actions that you might want a task to accomplish
- Fall across several categories:
  - Action operators call a script to accomplish something: `PythonOperator`, `BashOperator`, `SQLExecuteQueryOperator`
  - Transfer operators move data around: `S3ToRedshiftOperator`
  - Sensor operators wait for something to happen: `FileSensor`

## More Operators

| Operator                  | Description                                  |
|---------------------------|----------------------------------------------|
| `PythonOperator`          | Runs a Python function                       |
| `BashOperator`            | Runs a bash command                          |
| `DummyOperator`           | Does nothing (placeholder)                   |
| `BranchPythonOperator`    | Runs one of several paths based on condition |
| `ShortCircuitOperator`    | Skips downstream tasks if condition fails    |
| `EmailOperator`           | Sends email alerts                           |
| `TimeDeltaSensor`         | Waits a fixed amount of time                 |
| `SimpleHttpOperator`      | Makes HTTP requests                          |
| `SQLExecuteQueryOperator` | Runs an SQL query on a given database        |

-->

## Connecting Tasks
- So far we have just defined some tasks in a DAG, but we haven't specified how they are connected!
- Within the DAG function, you need to **call** the task functions
- You specify ordering with `>>`. So if you had two tasks:
  ```{.python style='max-height:800px; font-size:.75em'}
  @dag(...)
  def my_dag():
    @task
    def task1():
      ||| Something happens |||

    @task
    def task2():
      ||| Something else happens |||

    task1() >> task2()
  ```

## Order vs I/O
::::::cols
::::{.col style='font-size:.9em'}
- Note that this just specifies **execution order**, it says nothing about if data created by `task1` should be utilized in `task2`
- If you need data from one task in the next task, you can actually do that with Taskflow, and it will create the dependencies automatically for you
- Keep the returned things small! Don't try to pass a 500MB table between tasks!
::::

::::col
```{.python style='max-height:800px; font-size:.8em'}
@dag(...)
def my_dag():
  @task
  def task1():
    return |||something|||

  @task
  def task2(x):
    ||| Do something with x|||

  out = task1()
  task2(out)
```

::::
::::::

## Understanding DAG Context
:::{style='font-size:.9em'}
- Whenever Airflow runs a DAG, it has a host of internal variables and parameters that it is keeping track of, called the _context_
- You can access the variables within your tasks, and it can be extremely useful for setting up potential backfills
- Two main ways to access:
  - Pass in the special context variable as a parameter to a `@task` function
  - Pass in a keyword-arguments parameter to your task function. I usually use `**context`. Or import from `airflow.sdk` and use `get_current_context()`
:::

::::::cols
::::col
  ```{.python style='max-height:800px; font-size:.7em'}
  @dag(...)
  def my_dag():
    @task
    def task1(**context):
      date_str = context['ds']
      |||Use the date string|||
  ```
::::

::::{.col style='flex-grow:1.25'}
  ```{.python style='max-height:800px; font-size:.7em'}
  @dag(...)
  def my_dag():
    @task
    def task1():
      context = get_current_context()
      date_str = context['ds']
      |||Do something with the date string|||
  ```
::::
::::::


## Common Context Variables

:::{style='font-size:.7em'}

| Variable Name         | Type               | Meaning / Use Case                                                                                       |
| --------------------- | ------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| `ds`                  | `str`              | The logical execution date in `YYYY-MM-DD` format. Used for templating filenames, partitioned data, etc. |
| `ds_nodash`           | `str`              | Same as `ds` but without dashes. Often used in file names.                                               |
| `logical_date `       | `datetime`         | Python `datetime` object of the logical run date. Preferred when you need a `datetime` type.             |
| `data_interval_start` | `datetime`         | Start of the data interval (inclusive). Use this for windowed data processing.                           |
| `data_interval_end`   | `datetime`         | End of the data interval (exclusive). Pairs with `data_interval_start`.                                  |
| `prev_execution_date` | `datetime or None` | Logical date of the *previous* run, if available. Useful for delta-based logic.                          |
| `next_execution_date` | `datetime or None` | Logical date of the *next* run (planned).                                                                |
| `dag_run.conf`        | `dict`             | Dictionary of parameters passed when manually triggering a DAG.                                          |

:::

<!-- doing an example as we go
## Small Example
- Suppose we wanted to run a small DAG to grab all the unique names from a file and compute the length of each.
- We might break it up as:

  ![](../images/dag_demo1.excalidraw.png){width=90%}
-->

# Break!

# Airflow Connections

## Connections
- Often, Airflows provided operators might need to interact with an outside resource
- Outside resources are often shared across multiple tasks or DAGs
- Makes sense to define all the details in a central location, and then refer to those details from within the tasks that need them
- You can find a location to set these up under the Admin → Connections option to the left in the Airflow UI

## Creating a Connection
- When you create a new connection, you need to give it a name, and choose a type
  - Airflow has been around a while: there are a **lot** of types of connections it can easily attach to!
- Once choosing a connection, it will prompt you for the standard fields of information to fill out
  - These are the usual suspects for something like a database
  - For connecting to S3 storage, we'll talk more in a moment
  - Some connection strategies will likely require an API or access key in leiu of login information
- Once configured, you need to simply save the connection, and then you are set to use it in any task you write!

## Connecting to S3
- To connect to an S3 instance you should:
  - Choose `aws` as your connection type (even if you aren't actually using AWS)
  - You are going to need your Access Key ID and Secret Access Key. Those have been provided to each group and should be saved somewhere in your documentation
  - Additionally, you need to set up some extra fields in the provided JSON:
    ```json
    {
      "endpoint_url": "http://lake.advde:9000",
      "region_name": "us-east-1",
      "addressing_style": "path"
    }
    ```

## Accessing RustFS
- You can access RustFS (from within the internal network) at `lake.advde:9001`
  - Note that this will require setting up another LocalForward in your SSH config
- Each group has had their own connection info sent to them
- Upon logging in, you will have access to a single S3 "bucket"
- You can use this to monitor files that you are adding to the bucket (or delete some if you mess something up)

## Bucket Names
- One thing that isn't saved in the S3 connection is which bucket you want to connect to
  - Imagine buckets like database tables. Generally you might have access to many on a single server
- Advisable to add the bucket name as a _Variable_ in Airflow
- Admin → Variables and then "Add Variable"
  - Give it a name and then paste in the bucket name as the value
- You can access within your Python tasks using `Variable.get({name})`
  ```python
  from airflow.sdk import Variable
  bucket_name = Variable.get("s3_bucket")
  ```

<!-- covered in next sections
## Using a Connection
- If using one of Airflow's built-in operators, using the connection is generally extremely simple, and you just provide a `conn_id` field
- If desiring to use the defined connection from within a Python function, you will likely need to "unpack" the information yourself.
  - Generally you can access the information using `BaseHook.get_connection({your conn_id})`, and then unpack it from there
  - I will provide a utility library to help with this for some common use cases
    - Database info into a database URI for instance
    - S3 connection into a DuckDB secret
-->

# Accessing Databases
## Using a Database Connection
- Initially lets focus just on accessing data from a Database
- Two likely approaches:
  - Through Pandas, with something like `pd.read_sql()`
    - Airflow actually has an easier alternative to this with `PostgresHook`
  - Through DuckDB, with the `ATTACH` method

## Getting a Pandas Dataframe
- The easiest method here is to use the `PostgresHook` function
  - You pass in the connection id for the connection you created
  ```python
  from airflow.providers.postgres.hooks.postgres import PostgresHook
  hook = PostgresHook(postgres_conn_id=|||your conn name|||)
  ```
- Then can use it to retrieve a dataframe from a query:
  ```python
  df = hook.get_df(
    sql = "SELECT * FROM mytable",
    df_type="pandas"
    )
  ```

## Getting Data in DuckDB
- There is not a built-in hook for DuckDB, so we need to extract the parameters from the connection into something we can use, namely a URI string
- I provide the utility function `get_postgres_uri_from_conn` that you can use to do this
  ```python
  from utils import get_postgres_uri_from_conn

  uri = get_postgres_uri_from_conn(|||your conn name|||)
  ```
- Then you use that to attach the Postgres database as we saw earlier, at which point you can run SQL on it as you see fit.


# Accessing S3 Buckets
## The new `ObjectStoragePath`
- Recent versions have introduced the `ObjectStoragePath` object to attempt to streamline the process of taking S3 credentials from a connection and using them directly in your tasks
- Create the object, providing the bucket and target file, as well as the connection id
  ```python
  from airflow.io.path import ObjectStoragePath

  target = ObjectStoragePath(
      "s3://bucket_name/filename.parquet", 
      conn_id=|||your conn name|||
  )
  ```

## In Pandas
- Reading parquet into a dataframe:
  ```python
  with target.open('rb') as f:
    df = pd.read_parquet(f)
  ```
- Writing parquet from a dataframe
  ```python
  with target.open("wb") as f:
    df.to_parquet(f)
  ```

## DuckDB Secrets
- `ObjectStoragePath` isn't quite there with full DuckDB support (it only supports read operations)
- Instead, we'll create and run the S3 secret directly
- To create this from a given connection id, you can import `gen_duckdb_s3_secret_from_conn` from the `utils` library I supply
  ```python
  con = duckdb.connect()
  s3_secret = gen_duckdb_s3_secret_from_conn(|||conn name|||)
  con.execute(s3_secret)
  ```

## Reading and Writing Ducks
- Then we can read in data:
  ```python
  con.execute(
    "CREATE TABLE my_tab AS SELECT * FROM 's3://bucket_name/filename'"
  )
  ```
- Or write data
  ```python
  con.execute(
    f"COPY my_tab TO 's3://bucket_name/filename' (FORMAT parquet)"
  )
  ```


# Project Beginnings

## Making the Connection
- Information for connecting to the VM has been sent out to all groups
- This includes:
  - The domain of the jump host
  - Your user name (on both the jump host and your vm)
  - The internal domain of your group's vm
- All group members share an account (but you can still have multiple members logged in simultaneously)


## The Initial Lay of the Land
- If you can connect, you will notice several folders in your home directory: `airflow`, `warehouse`, and `documentation`
- Each will eventually have a `docker-compose.yml` file within it
- Currently, just the documentation is already set up
- This documentation is already being hosted to website that you can access (with the proper tunnel) at port 8000
  - Much you can worry about later, but it _does_ have a "VM Tools" page where I explained the core tools that are installed on each VM for your use

<!--
- Enter that folder
- Grab the latest `docker-compose.yml` from Airflow's official source:
  ```bash
  curl -LfO 'https://airflow.apache.org/docs/apache-airflow/3.0.1/docker-compose.yaml'
  ```
-->

## Airflow Installation
- While the installation procedure for getting Airflow is _significantly_ simplified by Docker Compose, there are still several steps to follow
- Follow the guide [here](https://people.willamette.edu/~jjrembold/class_files/data513/Guides/Airflow_Setup.html) closely to ensure everything goes well
  - Note that you only need one installation of Airflow on your group's VM!
- Once installed and running, you can access the interface on port 8080
  - The default credentials have `airflow` for both the user and password, which is fine


<!-- moving all to just the guide
## Anatomy of Airflows Docker Compose
- There is a lot going on here! Things to look out for:
  - Defining a common set of settings near the top (airflow-common). These are to easily be able to reuse the same settings for multiple services below
  - Services being launched:
    - A Postgres server to manage metadata
    - A Redis server for caching
    - The api-server to let various services talk to Airflow
    - The scheduler
    - The dag processor that checks on and reads in new DAGs
    - A worker to actually process the tasks
    - A triggerer to list for certain situations
    - An init task that only runs the first time you launch the stack
    - The airflow-cli to let you interact through the command line


## Tweaks: Building
- You will likely come across extra tools/libraries you want access to in your DAGs that are not pre-included
- The best way to handle this is to extend the base image
  - Create a Dockerfile in the same directory
    ```dockerfile
    FROM apache/airflow:3.0.1
    RUN pip install --no-cache-dir "apache-airflow==${AIRFLOW_VERSION}" apache-airflow-providers-fab==2.0.2 duckdb #any other libraries you want
    ```
  - At the start of the `airflow-common` section in the Docker Compose:
    - Comment out the `image:` line
    - Uncomment the `build: .` line beneath it
  - Run `docker compose build` once to build the new image


## Tweaks: Don't Load Examples
- By default Airflow will load something like 70 different DAGs as examples
- These are useful, but they also demand the DAG parser to parse **way** more than you would usually need, and thus eat up system resources
- I suggest toggling them off, changing the `AIRFLOW_CORE__LOAD__EXAMPLES` environment variable to `'False'`
- You can always still look at these example online [here](https://github.com/apache/airflow/tree/3.0.1/airflow-core/src/airflow/example_dags)


## Tweaks: Timings
- Generally, only one instance of Airflow is running on a system, but that is not our case this summer
- It can greatly help resource load to dial back some of the timing settings to not be so greedy.
- In the common Airflow environment variables, add the following:
  ```{.docker-compose style='white-space: pre; font-size:.8em;'}
  AIRFLOW__DAG_PROCESSOR__DAG_FILE_PROCESSOR_TIMEOUT: 20000
  AIRFLOW__DAG_PROCESSOR__REFRESH_INTERVAL: 600
  AIRFLOW__DAG_PROCESSOR__MIN_FILE_PROCESS_INTERVAL: 20000
  AIRFLOW__DAG_PROCESSOR__BUNDLE_REFRESH_CHECK_INTERVAL: 600
  AIRFLOW__DAG_PROCESSOR__STALE_BUNDLE_CLEANUP_AGE_THRESHOLD: 20000
  AIRFLOW__DAG_PROCESSOR__STALE_BUNDLE_CLEANUP_INTERVAL: 20000
  AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL: 300
  ```
- Change the interval on each health check to 300 instead of 30


## Running it!
- Now you should be good to go! You can launch the stack with `docker compose up`
  - Once you know it is working, it would probably make more sense to launch in background with the `-d` flag
- Airflow makes its web interface available on port 8080 (as you can see in the compose file)
  - **That** is what you will need to open a tunnel to in order to be able to view on your own system
- Any DAG you add to the `dags` folder inside your `airflow` folder should show up in the interface within a few minutes!
-->

# Introducing the Project

## The Theme
:::{style='font-size:.9em'}
- All the data sources you will be working with this semester will be simulating the metro system in the city of Prague
- Starting out, this will largely focus around metro lines and riders entering and exiting at various stops
- As the semester progresses, it will grow to include:
  - Fare information for different riders
  - Station details and daily flow patterns
  - Graph DB models of how stations and lines are connected
  - GPS events and sensor alerts of trains traveling on the tracks
- You will be responsible for synthesizing this all into your data lake and data warehouse in order to build dashboards to deliver on specific business decisions/interests
:::

## Starting Data
- Initially, your data source will just be a transactional Postgres database
- In it are tables for:
  - Stations
  - Routes
  - What stations are on what routes
  - Passengers
  - Tap ins/outs
- This data will be updated every 15 minutes
- We are focusing on _batch processing_ this semester, so you will be working largely with daily snapshots


## This week
- The data should become active on Wednesday or Thursday
- At that point, you should endeavour to have the following done by next week:
  - Airflow up and running
  - A DAG written that writes a copy of the database data to parquet file(s) each day
    - One parquet file or multiple? Up to you.
- If you can have that achieved by next week, you'll be in a great spot!
