The Role of Pipelines

Jed Rembold

May 12/13, 2026

Announcements

Resources are coming online. Are these all up?
- Course Webpage
- Canvas Page
- Discord Channel
Looking forward
- Check out the full syllabus
- Respond to potential group preferences poll
- If you are feeling rusty, some past content to review:
  - SSH-ing into a system
  - Docker container basics
  - Basic SQL

Evening's Plan

Data Architecture
Syllabus stuff
Transformations review activity

An Introduction

Who am I?

Name:

Jed Rembold

Background:

PhD in Physics with specialization in Astrophysics

Now full-time in the CS and DS departments

Teaching you the database skills I wish I’d had years ago

Office:

Ford 214

Hours: Whenever you catch me on Discord, or by appointment

Email:

jjrembold@willamette.edu

More Fun Esoterica About Me

Here are a few more random bits of Jed trivia:
- I wanted to be an astronaut when I was little
- I use the Dvorak keyboard layout instead of Qwerty
- Ice cream is the best food
- I have two cats named Phobos and Deimos

Big Picture

The Data Lake

Really just a storage location
- Usually in the cloud (S3 buckets)
Snapshots of structured data and any unstructured data are dumped here
For cost reasons, we will mimic S3 in this class with MinIO, a local S3 deployment
- Functionally identical to things like AWS S3 buckets

Orchestration

We need methods to “stitch” together the various parts of our pipeline
To do this reliably, these connections often need several steps, which can make pure Cron scheduling rigid and inflexible
To solve this issue, we have software called orchestrators, which:
- Maintain and enforce dependencies between different steps (this must happen before this can happen)
- Allow groups of tasks to be triggered either on a schedule or reactively (this must happen at 5pm every day, or this must happen when a new file shows up in this bucket)
- Facilitate connections between many different tools (databases, storage blocks, monitoring software, etc.)
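The dependency idea can be sketched with nothing but the Python standard library. This is only an illustration of "this must happen before this can happen", not how any particular orchestrator is implemented, and the step names are made up for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps. Each key maps to the set of steps
# that must finish before that step is allowed to start.
dependencies = {
    "extract_to_lake": set(),
    "load_to_warehouse": {"extract_to_lake"},
    "build_summary_tables": {"load_to_warehouse"},
    "refresh_dashboard": {"build_summary_tables"},
    "archive_raw_files": {"extract_to_lake"},
}


def run_order(deps):
    """Return one valid execution order that respects every dependency."""
    # static_order() raises graphlib.CycleError if the steps form a loop,
    # which is why orchestrators insist on acyclic graphs.
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

A real orchestrator layers scheduling, retries, logging, and connections to outside tools on top of this basic ordering.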

Orchestrator vs Transformer

There are technically two things here that are sometimes combined, sometimes not:
- The code that you want to run (the transformer)
- The code that governs the orchestration pipeline itself
For smaller pipelines, these things can often be bundled together
For large pipelines, you absolutely want the transformer code separate: it will likely run on entirely different hardware

Airflow

Has been around the longest (since 2014) and still has the greatest market share
Pipelines all arranged around DAGs (Directed Acyclic Graphs), described in Python
Individual tasks executed in a wide variety of languages/tools
Still actively maintained and improved, but does show its age in some things

Other Orchestrators

Airflow is by no means unique, and there are a host of other pieces of orchestration software, each with its pros and cons
Prefect
- Very Python-centric, dynamic DAGs, clunkier backfills
Dagster
- Orchestration is Python, but tasks can be anything, asset-based
Kestra
- Workflows defined in YAML

Data Warehouses

Warehouses are generally optimized for analytics, which makes them fundamentally different from transactional databases
- Data is stored by column rather than by row (column-oriented storage)
- Metadata about columns is usually stored as well, for faster lookups
- Often NOT normalized
  - Lots of joins in analytics is expensive and undesirable
  - The data should not be changing much, so the need for normalization to help enforce data quality is weaker
While we could use a dedicated OLAP database here, like ClickHouse, I think it is a bit too much extra mental overhead
- Instead, we’ll use Postgres as our OLAP warehouse
- If needed for speed, can always implement the pg_duckdb extension
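To make the row versus column distinction concrete, here is a toy sketch in plain Python. It is not how Postgres or any warehouse lays out data on disk, and the ride records are invented for the example.

```python
# Row-oriented layout: each record is kept together, which suits
# transactional work (look up or update one rider's record).
rows = [
    {"rider_id": 1, "station": "Muzeum", "fare": 30},
    {"rider_id": 2, "station": "Florenc", "fare": 40},
    {"rider_id": 3, "station": "Muzeum", "fare": 30},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {key: [record[key] for record in records] for key in records[0]}


# Column-oriented layout: each column is kept together, which suits
# analytics (aggregate one field across every record).
columns = to_columns(rows)

# The row layout has to walk every full record to total one field...
total_from_rows = sum(record["fare"] for record in rows)
# ...while the column layout only has to read the one column it needs.
total_from_columns = sum(columns["fare"])

print(columns["fare"], total_from_rows, total_from_columns)
```

Both totals agree. The difference is how much unrelated data has to be read to compute them, and that gap grows with the number of columns.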

OLAP Cubes

The full data warehouse is often still more unwieldy or normalized than certain analytics may desire
OLAP cubes are essentially specialized portions or versions of the warehouse custom packaged to enable quick and straightforward querying for specific tasks
May be linked to things like C-suite dashboards, machine learning flows, or further analytic APIs
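One way to picture a cube is as a table of answers computed ahead of time for every combination of the dimensions someone might ask about. The sketch below is a toy version in plain Python with invented ridership numbers. In practice this would more likely be a summary table or materialized view inside the warehouse.

```python
from collections import defaultdict

# Hypothetical fact rows: (station, weekday, riders)
facts = [
    ("Muzeum", "Mon", 120),
    ("Muzeum", "Tue", 95),
    ("Florenc", "Mon", 80),
    ("Florenc", "Mon", 20),
]


def build_cube(rows):
    """Pre-aggregate riders for each (station, weekday) cell and each roll-up.

    A None in a key position means "all values" of that dimension.
    """
    cube = defaultdict(int)
    for station, weekday, riders in rows:
        for key in (
            (station, weekday),  # one cell
            (station, None),     # station total across weekdays
            (None, weekday),     # weekday total across stations
            (None, None),        # grand total
        ):
            cube[key] += riders
    return dict(cube)


cube = build_cube(facts)
print(cube[("Florenc", "Mon")], cube[("Muzeum", None)], cube[(None, None)])
```

Once the cube is built, a dashboard question like "total riders on Mondays" is a single lookup rather than a scan of the full fact data.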

Syllabus Stuff

Learning Objectives

By the end of this class you should be able to:

Understand, model, and implement a modern data pipeline architecture, including data lakes, data warehouses, and the orchestration binding them all together
Connect and work productively on remote systems
Thoroughly document data pipelines, including lineage, data dictionaries, and architecture design records
Quickly learn and adjust to new data pipelines designed by others
Ingest and synthesize data from a wider variety of sources, including relational, graph, and document databases, logs and metrics, and event queues
Build architecture and dashboards to meet evolving business and technological requirements

Projects

This is fundamentally a single-project class
The goal is to work in groups of about 3 to build a data pipeline of increasing complexity over the course of the semester
The data sources will be the same for everyone, and provided by myself in the form of simulated data.
- The theme will revolve around the Prague metro system
The details of your pipeline will be largely up to you and your group. The output will be evaluated in the form of dashboards that address a particular business need or concern every 1-2 weeks

The Wrinkle

At two points throughout the semester, everyone will rotate to new projects!
- Approximately midway through June and midway through July
The truth is that you will almost never take a job where you are starting from scratch with a data pipeline.
- There is almost always going to be some portion of a pipeline present that you need to learn, adapt to, and then build on top of
- Stresses the importance of proper documentation

Ramifications

Because others will be inheriting your project, we will place a strong emphasis on documentation
You must document:
- What you have done. So that new groups can understand what they are working with.
- Why you made certain choices. New groups are going to need to build on top of your infrastructure, which commonly might mean adapting it. Is it safe for them to delete a particular table or workflow? If they change a timestamp field to a timestamp field with a time zone, is that going to break anything?
- How everything is linked together. This is commonly called the lineage of the data, and your orchestrator can help with visualizing some of this.

Documentation Artifacts

Each project will have a directory within it where all of this documentation can live. Images are also an important part of documentation and can live in the same directory.
I’m exploring self-hosted wiki options so that each group can more easily write and render their documentation
Each rotation, I will evaluate the documentation myself and also get feedback from the group that inherited your project

Reflecting

I will be asking each student to keep a reflection journal over the semester
One entry to be added each week
Template will include things like:
- What have you learned from working with your data pipeline this week?
- What are you currently struggling with the most?
- How have you contributed to your group’s pipeline this week?
Scored credit/no credit

Hardware

Remote Servers

Much of the work you do this semester will be on a remote system
- I am setting up virtual machines for each group on my home server
- Data sources will also be hosted on the same server
You will need to SSH into a specific address to connect, and I will require SSH keys for security purposes
- I will have you all create a keypair and then share your public key with me so that I can see about adding it to each virtual machine
- My understanding is that you didn’t get a chance to play with keys in 503, so we’ll introduce them next week
Some software has web interfaces, which you’ll need to SSH tunnel to access. This is straightforward with SSH config, and we’ll also talk about it next week.

Accessibility

I am hosting everything on my home server to ensure that you get a good experience without having to pay anything
There are some drawbacks to this:
- While I continue to upgrade my server to prepare for this class, its resources are not infinite.
  - Each VM will be limited in what it can do, but it is still possible we might see bottlenecks of system resources. Please try to respect the system.
- I am technically letting you into my home network. There are some isolation safeguards in place, but it is possible my security has faults. Please respect my home network, and only interact with resources, locations, and servers that I have provided to you.
  - Flagrant violations of this will result in being locked out of my system, at which point you will not be able to complete the coursework. And that is clearly bad.

And Finally…

I continue to do a ton of technical prep to try to ensure I am prepared for this course and can make it the best experience possible
That said, this is the second time this course is being taught
- I promise it will still have at least a few rougher edges
I have tried to plan for as many contingencies, lesson structures, and content considerations as possible, but undoubtedly some unexpected things will pop up
Please be flexible and gracious!
- I want this course to be very cool and instructive, and I think it can be. If it falls short in some areas, it won’t be for lack of effort and time on my part.

Break!

Dinner Break

Crash Course Refresher Activity

Remember what you learned?

This course is going to hit the ground running next week, and the expectation is that you remember things you have done in previous courses (shocking, I know)
- For some of you, that may have been over a year ago!
In particular, we want to make sure people are feeling decent about:
- Writing scripts (basic Python in particular)
- Docker containers
- Interacting with databases
- SQL
The following activity will be done in pairs, and is the only other thing we are doing tonight

The Task

The Docker container jrembold/advde_wk1 will create an internal CSV of individuals, birthdays, current age, and favorite foods when run
- Creates it in the internal /app/data folder
This CSV has some problems with missing data that you’d like to clean up
- You should impute these from other data in the table. Do not just drop them!
You’d like to create a table from the CSV that shows, for each favorite food, what the average age of all the individuals with that favorite food is
You’d like to load this table into your local Postgres database
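As a rough picture of what the transformation step might look like, here is a sketch that uses only the standard library. It makes several assumptions: the column names (name, birthday, age, favorite_food) and the sample rows are invented, only missing ages are imputed (from the birthday), and loading the result into Postgres is left out. The real CSV may need different handling.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Hypothetical stand-in for the CSV in /app/data. The real columns may differ.
SAMPLE_CSV = """name,birthday,age,favorite_food
Ada,1990-05-01,,pizza
Ben,2000-12-31,25,pizza
Cy,1980-01-15,46,tacos
"""


def age_on(birthday, today):
    """Whole years elapsed between birthday and today."""
    years = today.year - birthday.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years


def average_age_by_food(csv_text, today):
    """Impute missing ages from birthdays, then average age per favorite food."""
    totals = defaultdict(lambda: [0, 0])  # food -> [sum of ages, count]
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["age"]:
            age = int(row["age"])
        else:
            age = age_on(date.fromisoformat(row["birthday"]), today)
        totals[row["favorite_food"]][0] += age
        totals[row["favorite_food"]][1] += 1
    return {food: total / count for food, (total, count) in totals.items()}


result = average_age_by_food(SAMPLE_CSV, date(2026, 5, 12))
print(result)
```

Keeping the imputation and the aggregation in separate functions makes it easier to check each piece on its own.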

Along the way

How you accomplish this task is largely up to you and your partner, including the order of any steps. What matters is that the final table is in your database and factually correct.
As you go, create a flowchart of each of the major steps you took en route to your destination
Inspect your flowchart.
- Would it have made sense to break apart some steps to have better insight into what happened in the middle? Big steps tend to hide details and make debugging more difficult.
- Is this process automatable? Or did you inject human decision-making somewhere into the process? Could you eliminate that? Should you?

Tuesday Partners

Talia and Julian
Ana and Ian
Rebekah and George
Aaron and Logan M
Samuel and Chris

Graham and Victoria
Amel and Jennifer
Georgia and Logan R
Brianna and Elliana
Hayden, Margaret, and Hodan

Wednesday Partners

Brooke and Serenna
Sophia and Brandon
Alexander and Manish
Siera, Spencer and Tiffany
Jackson and Sarah
Mary and Bradley

Shanti and Luca
Emery and Aaron
Aiyana and Emily
Ben and Seira
Courtney and Rohan
Addison and Dylan
Simon, and Amaya