TaskFlow, Connections, and Warehousing

Jed Rembold

June 4, 2025

Announcements

I’m still working on a formal syllabus and schedule, sorry for the delay
For next week:
- Simulations will be live starting Friday morning
- A guide for the first business decisions will be out by the end of Friday
  - Start working out your dimensional models
- Should be building pipelines to:
  - Daily take a snapshot of data sources and dump it into S3 storage
  - (Probably also daily) update your data warehouse
- Start documenting! As you ramp up this project, you are going to be making lots of decisions. Make sure they are documented and that everyone understands the ramifications of those decisions.
- Reflections should be starting! I will post a template and source by end-of-day Friday

Current State

Currently, you should be set up and able to:
- SSH into your group’s server
- Ensure that your Airflow stack is running, and bring it up or down as necessary
- Access the Airflow UI through your browser
If you can not, then you are behind and need to catch up to ensure you can contribute meaningfully to your group!

Tonight

Airflow Connections
TaskFlow Syntax
Documentation
Intro to Data Warehousing

Airflow Connections

Connections

Often, Airflows provided operators might need to interact with an outside resource
- The SQLExecuteQueryOperator for instance, needs to know what database to run the SQL query against
The outside resources are often shared across multiple tasks or DAGs
Makes sense to define all the details in a central location, and then refer to those details from within the task
You can find a location to set these up under the Admin → Connections option to the left in the Airflow UI

Creating a Connection

When you create a new connection, you need to give it a name, and choose a type
- Airflow has been around a while: there are a lot of types of connections it can easily attach to!
Once choosing a connection, it will prompt you for the standard fields of information to fill out
- These are the usual suspects for something like a database
- For connecting to S3 storage, we’ll talk more in a moment
- Some connection strategies will likely require an API or access key in leiu of login information
Once configured, you need to simple save the connection, and then you are set to use it in any task you write!

Connecting to S3

To connect to an S3 instance you should:
- Choose aws as your connection type (even if you aren’t actually using AWS)
- You are going to need your Access Key ID and Secret Access Key. You will need to set those up in MinIO
- Additionally, you need to set up some extra fields in the provided JSON:
```
{
  "endpoint": "minio.advde:9000",
  "region_name": "us-east-1"
}
```

Accessing MinIO

You can access MinIO (from within the internal network) at minio.advde:9001
- Note that this will require setting up another LocalForward in your SSH config
Each group has had their own connection info sent to them
Upon logging in, you will have access to a single S3 “bucket”
To connect outside sources to be able to add or remove things from this bucket, you need to create an access key
- Access keys are tied to your account, not the bucket that you currently have access to.
  - So probably don’t add them to the project documentation
  - Maybe writing them down in your group’s private discord channel

Bucket Names

One thing that isn’t saved in the S3 connection is which bucket you want to connect to
- Imagine buckets like database tables. Generally you might have access to many on a single server
Advisable to add the bucket name as a Variable in Airflow
Admin → Variables and then “Add Variable”
- Give it a name and then paste in the bucket name as the value

You can access within your Python tasks using Variable.get({name})

from airflow.models import Variable
bucket_name = Variable.get("s3_bucket")

Using a Connection

If using one of Airflow’s built-in operators, using the connection is generally extremely simple
- The SQLExecuteQueryOperator, for instance, just takes a conn_id field
If desiring to use the defined connection from within a Python or Bash operator, you will likely need to “unpack” the information yourself.
- Generally you can access the information using BaseHook.get_connection({your conn_id}), and then unpack it from there
- I will provide a utility library to help with this for some common use cases
  - Database info into a database URI for instance
  - S3 connection into a DuckDB secret

TaskFlow Syntax

What is TaskFlow

Starting in Airflow 2.0, a newer syntax was added for writing DAGs and Tasks
Called the TaskFlow API, it embraced a more “Pythonic” approach to writing tasks and DAGs
- This is especially useful if PythonOperators were forming the bulk of tasks previously
- Additionally, it greatly simplifies passing information between tasks
Generally amounts to:
- Using function decorators to label tasks
- Using function composability to automatically detect dependencies

The dag decorator

We know that we can define a DAG using the DAG() function
Often this is also used with a context operator, to remove the need to assign the output to a value
```
with DAG(...) as dag:
  |||Define tasks and dependencies|||
```
In taskflow syntax, we can define a function and make it a DAG using the @dag operator
```
@dag(...)
def my_dag():
  |||tasks and dependencies|||
```

The task decorator

If what we want is a PythonOperator type task, we can just define a Python function and add the @task decorator
```
@task
def clean_table():
  |||Python code to read and clean a table|||
```
The task id is automatically assigned to the name of the function
The dag is assigned to whatever dag ends up calling this function

Input and Output

In the original syntax, passing information from one task to another was quite difficult
Relied on complicated and cumbersome Xcom pushes and pulls
In TaskFlow, it is as simple as returning values from one task and using them an input in another

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract(**kwargs):
    data = [1, 2, 3]
    kwargs['ti'].xcom_push(key='data', value=data)

def transform(**kwargs):
    data = kwargs['ti'].xcom_pull(key='data', task_ids='extract')
    transformed = [x * 10 for x in data]
    kwargs['ti'].xcom_push(key='transformed', value=transformed)

def load(**kwargs):
    transformed = kwargs['ti'].xcom_pull(key='transformed', task_ids='transform')
    print(f"Loaded data: {transformed}")

with DAG('classic_xcom_dag', start_date=datetime(2023, 1, 1), schedule_interval='@daily', catchup=False) as dag:
    t1 = PythonOperator(task_id='extract', python_callable=extract, provide_context=True)
    t2 = PythonOperator(task_id='transform', python_callable=transform, provide_context=True)
    t3 = PythonOperator(task_id='load', python_callable=load, provide_context=True)

    t1 >> t2 >> t3

Input and Output

In the original syntax, passing information from one task to another was quite difficult
Relied on complicated and cumbersome Xcom pushes and pulls
In TaskFlow, it is as simple as returning values from one task and using them an input in another

from airflow.sdk import dag, task
from datetime import datetime

@dag(start_date=datetime(2023, 1, 1), schedule_interval='@daily', catchup=False)
def taskflow_dag():

    @task
    def extract():
        return [1, 2, 3]

    @task
    def transform(data):
        return [x * 10 for x in data]

    @task
    def load(transformed):
        print(f"Loaded data: {transformed}")

    data = extract()
    transformed = transform(data)
    load(transformed)

dag_instance = taskflow_dag()

Automatic Dependencies

If the output of a task is used as an argument to another task, TaskFlow automatically creates the dependency!
You can still use the >> syntax (or .set_upstream()) to create other dependencies if not every task feeds into the next
Think of the variable that you assign a task function to as encompassing:
- The task itself
- The data returned from the task
When one task calls another, it not only defines the dependencies but sets up the Xcom push/pulls in the background

Parameterizing DAGs

The TaskFlow syntax also makes it simple to add parameters to the DAGs themselves
```
@dag(...)
def my_dag(name: str = 'Jed'):
  |||defined tasks|||
```
When you run a DAG manually, you can set these parameters to whatever you want, which might be useful for testing
When the DAG is run on a schedule, it will always use the default value, so you must set one (assuming you’ll be running the DAG on a schedule)

Understanding DAG Context

Whenever Airflow runs a DAG, it has a host of internal variables and parameters that it is keeping track of, called the context
You can access the variables within your tasks, and it can be extremely useful for setting up potential backfills
Two main ways to access:
- Pass in the special context variable as a parameter to a @task function
- Use get_current_context() from airflow.operators.python. Which gets you a dictionary of all the available variables

Common Context Variables

Variable Name	Type	Meaning / Use Case
`ds`	`str`	The logical execution date in `YYYY-MM-DD` format. Used for templating filenames, partitioned data, etc.
`ds_nodash`	`str`	Same as `ds` but without dashes. Often used in file names.
`logical_date`	`datetime`	Python `datetime` object of the logical run date. Preferred when you need a `datetime` type.
`data_interval_start`	`datetime`	Start of the data interval (inclusive). Use this for windowed data processing.
`data_interval_end`	`datetime`	End of the data interval (exclusive). Pairs with `data_interval_start`.
`prev_execution_date`	`datetime or None`	Logical date of the previous run, if available. Useful for delta-based logic.
`next_execution_date`	`datetime or None`	Logical date of the next run (planned).
`dag_run.conf`	`dict`	Dictionary of parameters passed when manually triggering a DAG.

Break Time!

Documentation

What to Document?

Part of the focus in this class is documentation, so that projects you work on can be easily handed off to others
There are many ways to manage documentation, but more important is thinking about what things you should be documenting
Some things to document depend on context, but in this class you should strive to document:
- The pipeline architecture
- Data dictionary and lineage
- Testing/validation and monitoring

Architecture

Visuals can be extremely useful here! (flowcharts, DAGs, etc)
Should document the components of each stage (including technologies used) and how data moves between stages
Usage of Architectural Design Records should be practiced and documented!
- A starting template of what these look like will be provided, but you can adapt as you see fit

Architectural Design Records

A short, focused document explaining a key technical decision you made, why you made it, and what it means for the future
- Think of them as a sort of project “note-to-self”
Not necessary for tiny decisions, but should document when you:
- Choose a particular tool for a task
- Change how the data flows
- Change the data model

Good ADRs

A good ADR provides:
- Context/Motivation: Why was this decision necessary and what constraints were you working with? Did you consider alternatives?
- The Decision: What did you decide and why? Will there be tradeoffs or limitations?
- Consequences: What are short and long term implications? Are any follow-up actions necessary?
- Traceability: Link or connect back to past ADRs. You should never change an ADR, just create a new one.
ALL group members should approve each ADR
Tailor the ADR to its audience! Here that is your own group or others like you in this class. Make sure it is understandable.

Data dictionaries and lineage

You should clearly document the data model you are using in your warehouse
- ERDs can still be great for this
Where did each bit of that data come from?
- Someone should be able to trace information back through your pipeline to its source
- Any transformations that occurred along the way should be clearly documented here or in the architecture portions

Testing and Validation

The real world is messy, and thus data pipelines need to contend with a variety of issues
- Incorrect data, late data, missing data, malformed data, etc.
How are you checking to make sure the data making its way into the warehouse is of the highest integrity?
How will you know if something is wrong?
Given the above, how quickly should you expect to realize that something is wrong?

Intro to Data Warehousing

Dimensional Modeling

Data Lakes are exceptionally versatile in what they can store, and can be operated fairly cheaply given modern storage
But they do not otherwise impose an organizational format on the underlying data
The data warehouse has historically served that purpose: moving data out of the lake and imposing order to it in a way that makes it useful
Even current trends of unifying the data lake with the warehouse don’t remove the necessity of imposing this order onto the underlying data
Dimensional Modeling is the process of determining how to store this data in a way that best facilitates analysis to answer business questions
The “bible” in this area is The Data Warehouse Toolkit, by Kimball and Ross, now in its third edition

Different Models

Primary purpose: facilitating transactions and maintaining state
Focus: data integrity
Models: Normal Forms
Freshness: real-time

Primary purpose: enabling analysis to answer business decisions
Focus: speed (and resources)
Models: Dimensional (denormalized)
Freshness: Batched (often nightly)

Given their different use-cases, we should not be suprised at differences in organizing the data!

Fact Tables

The fundamental table type that dimensional modeling tends to revolve around are fact tables
A fact table stores measureable, quantitative data about some business process
- Eg. sales amount, quantity sold, exam scores
Capture events or transactions with numeric measures
Contain foreign keys which link to associated dimension tables
Typically very large and growing over time
- Append-only

Dimension Tables

Dimension tables are the smaller cousins to fact tables, and store descriptive attributes that provide context for facts
- Eg. sales items, product category, exam number, student name
Attributes are often textual or categorical
These are usually much smaller than the fact tables are “relatively” static
- The same names, items, categories, etc appear in many many facts
Generally used to filter, group, and label the facts as they pertain to different business decisions

Example

Distinguishing Facts

Answers questions of the form:
- how much?
- how many?
- how often?

Answers questions of the form:
- who?
- what?
- where?
- when?

Combining Facts

Individual rows in a fact table should model a business process or event
If multiple facts all fit a given event and have shared dimensions, they usually belong in the same fact table
- All the pricing data in the NYC Taxi rides dataset
If the facts represent different events or granularities, they should go in separate fact tables
- Rule of thumb: If the chosen columns of your fact table would result in requiring NULL values for some rows, then you have likely mixed events or granularities, and should rethink your modeling

Benefits of Dimensional Modeling

Optimized for fast query performance
- Generally relatively few joins are necessary
- Usually aggregating over the fact table in some fashion
Enables historical and trends analysis
- Date is almost always factored in
Gives clean, consistent reporting for downstream business decisions

Getting Started

Facts model important business processes, so you need to start by identifying the key business processes that you are interested in analyzing
Once you have the processes down, work out what level of granularity of the events you might need for each
Then you can work out what facts you have (or can compute) from the underlying data that describe that process at the desired granularity
Identify what dimensions you have access to that can provide additional context
Create your star schema database model

Activity: Course Analytics

You’ve been hired to build a data warehouse for a university to support academic performance reporting
The dean wants to answer questions like:
- How are students performing by course, instructor, and department?
- What’s the pass rate for 100-level vs 400-level courses?
- How many credits are earned across departments each term?

The university collects the following data:

Student ID
Course ID
Instructor ID
Term
Grade received
Credits attempted/earned
Enrollment date
Department

Creating the Warehouse

Back to Postgres!

After much thought and testing, we will be using Postgres for our data warehouse
- Not technically column-oriented, but it will be fast enough for our purposes
- Everyone is already familiar with it and can access it with various clients
- Integrates easily with Airflow and Grafana
- Keeps the focus on the newer dimensional modeling skills

Docker Composing Postgres

Each group will be responsible for spinning up an instance of Postgres using Docker Compose in their VM
Some things to keep in mind:
- You can define environment variables for:
  - POSTGRES_USER
  - POSTGRES_PASSWORD
  - POSTGRES_DB
  to set logical defaults for when the server runs
- You probably want to set up a volume to ensure persistant storage
  - This is important! You don’t want to lose your entire warehouse if the container restarts! Test it!