Window Dimensions and Document DBs

Jed Rembold

June 25, 2025

Announcements

  • The swap happened! You’ll have some time to check things out with your group later tonight
  • Next Milestone deadline will be in 2 weeks, so you have some time to learn your newly inherited project
  • I am aiming to have the new data source live and the milestone posted by this weekend
  • Don’t forget your weekly reflections!

Tonight

  • Alternative ways to create dimension/fact tables
  • Introducing Document Databases
  • A new data source?
  • Work time

Windowing Dimensions

Lake 2 Warehouse: 2 approaches

  • If your data warehouse is being populated directly from source data, you essentially always have to update it “in-place”
  • But if you have a data lake of snapshots, you have some options:
    • Populate the warehouse incrementally, pulling from each day’s snapshot
    • Recreate the warehouse tables from scratch each time, pulling from all the snapshots enmasse

Incremental Loads

  • Very efficient: only working with 1 day’s data at once
    • If you are working with massive amounts of data, that could be very important
  • Incremental changes “fit” easier with the concepts of Type-2 SCDs
  • Non-trivial to determine what rows need to change or be updated
  • Repopulating the warehouse means backfilling across each incremental daily load

Recreating from Scratch

  • A single run populates the entire dimensional model
  • Looking at the entire history at once, so easier to determine what rows changed when
  • Requires a good understanding of window functions to achieve easily
  • “Wasteful” in the sense that much of the dimension table creation is likely being duplicated work from what existed before

Recreation Requirements

  • In order to regenerate a dimension (or fact) table from scratch you need:
    • The historical record of snapshots
    • Knowledge of when the snapshot happened
      • This could be written to a column or in the snapshot name itself
    • A tool that can read in globbed files (DuckDB works very nicely for this!)

Reading in the Snapshots

  • DuckDB can accept BASH globs to expand out a pattern into a list of files

  • Usually this would utilize a wildcard character * to represent “some amount of anything”

    • For example, *.parquet would expand to represent all of the parquet files in a given folder
  • Reading in a bunch of snapshot parquet files into DuckDB is then as easy as:

    SELECT *
    FROM 's3://your-bucket/*.parquet'
  • This essentially reads in each parquet file to its own table, and then UNIONs them all together

Reading Options

  • Sometimes it is useful to know what file each row came from, if you want that, then you need to use the full read_parquet function, and add a filename=true parameter

    SELECT *
    FROM read_parquet('s3://your-bucket/*.parquet', filename=true)
  • For snapshots, your schema are likely the same, so the union is obvious. But what if your parquet files have slightly different schema?

    • Likely want to add the union_by_name=true option

Steps: from history to dimension

  • Once you have all your historical data read in with a column of when each snapshot happened, you can proceed to create your dimension table
  • Generally looks something like:
    • Filter down to only the data you want in your dimension table, if necessary
    • Add a hashed column representing all of the data that might change which you want to track
    • Identify when a given row changed by using LAG, partitioning by your row identifier and ordering by the snapshot times
    • Filter down to only keep the original rows or the rows where a change happened
    • USE LEAD to get when the next version would have started, subtracting a day to get when that rows version ended
    • Select out just the columns you want for your final dimension table

Hashing Rows

  • Often we need an easy way to compare the contents of many columns

  • Different approaches, but one common one is to concatenate all the data as text, and then run a hash function on the row

    • DuckDB can use both the md5() or hash() hash functions
  • Might look something like:

    SELECT pid, fname, age, hash(fname, age) AS current_hash
    FROM |||mytable|||

When data changes

  • To identify when data changes, we LAG the hash so that we can compare one rows hash to the previous
    • Usually call it something like prev_hash
    • Same hash (current_hash = prev_hash)? Then no change!
  • To compare the appropriate rows, we need to partition by whatever is the unique identifier of a record in that dimension table
    • This is not the primary key, which uniquely identifies every row (and which we are not concerned with at the moment)
    • These are identifiers for what represents a given piece of information (pid, station_id, etc)
  • Remember that ordering matters! So you want to order by the snapshot date so that lagging a row compares each row to the next snapshot

Keeping Originals and Changes

  • You likely have many snapshots where little or no data actually changed
  • You don’t care about those for your dimension table, you only care about the original data and whenever that data changed
  • Filter your table from the last step to only include rows where:
    • The previous hash is NULL - this implies the original entry of that row
    • The current hash is different from the previous hash - this implies that the data changed at that point

Determining the Effective and Expiry dates

  • The effective date for a given row at this point is just its snapshot date: when it entered the system
  • To get the expiry date, you can LEAD the snapshot dates to see when the next version’s snapshot began
    • Then subtract 1 from this to showcase that the current row expired one day before the next version began
      • If taking snapshots more often than once a day, adjust the interval to subtract accordingly

Trimming down and adding is_active

  • As a last step, you would usually just select out the columns that you want in your final dimension table (none of the hash stuff)

  • Technically, you could always find the current, most recent entries by filtering for where expiry_date IS NULL

  • If you want something like an is_active column, you could add such a column as a boolean easily with

    SELECT pid, fname, age, effective_date, expiry_date, 
      expiry_date IS NULL AS is_active
    FROM |||last steps table|||

Tweak as needed

  • This process should work for the vast majority of ways that your data could be changing
  • Think of this as a process though, not a recipe
    • Important that you understand the why of each step, so that you could adapt it to a slightly different situation as necessary
  • At the end of the day, you can create your dimension tables however you please: I am just trying to make you aware of options

Document Databases

Moving Beyond RDBS

  • We have spent a lot of time focusing on relation databases for good reason: they are very popular and capable!
  • They do have their drawbacks though, and for that reason other database models exist
  • The next several weeks we are going to spend learning the basics of how to access these models:
    • Mostly focusing our efforts around access and querying, assuming that such a database has already been created
    • We will not get into as much how to model or administrate these sorts of databases
  • Initially we will be starting with Document Databases

The Document Model

  • Sometimes referred to generally as NoSQL to highlight its differences from relational models
  • Best for situations where data exists largely in self-contained documents and outside relationships are fairly rare
  • Major examples: XML, JSON, MongoDB
  • One-to-many relationships give rise to a tree structure

The Document Model Pros/Cons

  • Strength of the document model:
    • It handles one-to-many relationships very well
    • More closely mimics object-oriented programming, so less impedance mismatch
    • Schema are generally quite flexible: not every record has to have the exact same information
    • Data is generally stored together or close to related data, so lookups can be fairly efficient
  • Weaknesses of the document model
    • Struggles more with many-to-many relationships
    • Joins are more difficult, and possibly need to be done entirely outside the database in some cases

MongoDB

  • The variant of document database that we will focus on is called MongoDB
  • Pretty handily the market favorite as far as document databases go
  • Open-source, with local deployment options and free cloud deployment options through MongoDB Atlas

Document DB Basics

  • The equivalent of a “table” in a document database is called a collection
  • Collections are populated by documents, where a document is comprised of an assortment of keys and corresponding values
    • Keys need to be unique within a document
    • The corresponding values may, themself, be another document. Leading to a nested, hierarchical structure
    • Documents do not need to have the exact same keys, even if they are part of the same collection

Document Syntax

  • Different document databases handle exactly how they comprise and store their keys and corresponding values in different ways
  • MongoDB uses the very known and popular JSON framework though, so this is pretty straightforward
  • Each document is a valid block of JSON keys and corresponding data

MongoDB Interactions

  • Unlike most RDBMS which use some variant of SQL to interact with the database, document databases usually have more bespoke interfaces
    • Usually ported to a wide variety of languages though
  • For MongoDB, those of most interest are likely:
    • Mongosh: MongoDB’s cli interface client (similar to psql for Postgres)
    • PyMongo: The Python library connecting and querying Mongo databases
    • MongoLite: The most recent R library for working with Mongo databases
  • The syntax between these is very similar

Making Connections

  • The primary way to connect to a Mongo database is through a connection string, extremely similar to how it worked with Postgres

  • The general form of the string is:

    mongodb://|||username|||:|||password|||@|||host|||:|||port|||/|||database|||
  • In Python then:

    from pymongo import MongoClient
    
    client = MongoClient(
      "mongodb://jed:mypswd@localhost:27017"
    )
    db = client['database']

Accessing a Particular Database

  • Once you are connected, you can see what databases are available:

    print(client.list_database_names())
  • And then you can switch to that database with:

    db = client.get_database(|||database name|||)
  • To see the available collections within the database:

    print(db.list_collection_names())

Inserting Documents

  • You can insert documents one at a time:

    db.|||collection|||.insert_one(
      {"name": "Jed", "teaching": ["DATA510", "DATA599"]}
    )
  • Or many at once:

    db.|||collection|||.insert_many(
      [{"name": "Jed", "teaching": ["DATA510", "DATA599"]},
      {"name": "Rachel", "teaching": ["DATA510", "CS549"]}]
    )

Normalization

  • What if you wanted to insert the same document into two different collections (or parent documents)?
  • Two main methods arise:
    • Embedding: you just copy the contents of the document into each of the various places. Note that this results in data duplication, but everything is self contained
    • Referencing: you insert a reference to the ObjectID of a document that exists in a different collection. This prevents data duplication, but you have to “join” the documents manually
  • For most situations, unless the document is massive, folks usually take the route of embedding it in whatever context it is needed.

Querying Collections

  • You can call the .find() method on a collection to grab all of the documents within the collection

  • Technically, in PyMongo, the .find() method returns an iterable, so you could loop over it:

    for document in db.|||collection|||.find():
      print(document)
  • If you just want a list of the documents: list(db.|||collection|||.find())

  • If you just want the first document: db.|||collection|||.find_one()

Fancier Querying

  • What if we want to filter or sort the documents / data we get back?

  • We can utilize a collection of query selectors to the .find() method to fine-tune what we get back

  • General form looks like:

    db.|||collection|||.find({|||key|||: |||filter_document|||})

    where the |||filter_document||| would either be a literal value or a document that specifies an specific query selector

Common Query Selectors

Name Description Example
$eq Matches values that are equal {'$eq': 5}
$gt Matches values that are greater {'$gt': 8}
$gte Matches values that are greater or eq {'$gte': 8}
$lt Matches values that are smaller {'$lt': 3}
$lte Matches values that are smaller or eq {'$lte': 3}
$in Matches values appearing in the array {'$in': [1,2]}
$ne Matches values that are not equal {'$ne': 'Jed'}
$regex Matches values that match the regexp {'$regex': '^J'}

Picking Fields

  • Just like we didn’t always want all the columns from a table, you may only want certain fields from the document

  • You can specify this with a second argument after the filtering argument in .find()

    db.|||collection|||.find({}, |||field_choice_document|||)
  • The document to choose fields is essentially a blacklist / whitelist of the desired fields

Black and White Fields

  • To indicate fields you want to keep, list the field name and a 1

    db.|||collection|||.find({}, {'name': 1, 'age': 1})
  • To indicate fields you do not want, just provide a 0 instead

    db.|||collection|||.find({}, {'birthday': 0, 'parents': 0})

    which will get you all of the fields except those indicated

  • You can never mix 0s and 1s except with the _id field that is automatically assigned when a document is inserted

Logical Operators

  • There are also logical operators to combine various operations

    Name Description
    $and Returns documents where both queries match
    $or Returns documents where either query matches
    $nor Returns documents where both queries fail to match
    $not Returns documents where the query does not match
  • Except for $not, each has a list of query operators following it

New Data Sneak Peek

More Station Info

  • A document database is coming online (also at sources.advde) which will contain a collection of information related to each station
  • This will include information like:
    • Zone
    • Underground/above ground
    • Turnstiles and service dates
    • Turnstile schedules (ingress / egress)
  • This will be a new data source that can be queried, snapshot, and incorporated into your data warehouse

But wait, there is more!

  • Additionally, all current taps (and all future taps) will receive an additional field: turnstile_id
  • Coupled with the station documents, this should give you the ability to uniquely determine when and where an individual enters a station and leaves a station.
  • This also means that the simulated ridership may become more complex:
    • Riders may start a new trip from a station different from the one they arrived at (they walked or caught some other transportation between)
    • Riders may travel from station to station (they won’t always travel to and from places in perfect pairs

Milestone 3

  • Milestone 3 will not be due until July 8th
  • It is going to be based around the “lost” business decision from earlier this semester
  • Official guide should be up this weekend
  • You have more time for this in part to allow for the fact that you have to be adapting to a new existing pipeline! Don’t wait until the last minute to start!

Group Work

Welcome to a new world!

  • You all have been officially moved to a new project at this point!
  • What stays the same:
    • Your VM domain name
    • Your login information
  • What changed:
    • What VM that domain takes you to
    • SSH keys to allow you access to that new VM
    • What S3 bucket you have access to when you log in to MinIO
    • What organization you have admin rights to when you log into Grafana
    • Any Discord notifications should automatically propogate to the correct channels

Learning your new system

  • The rest of the evening I have set aside for you to start learning the project you have inherited
  • Make sure you can connect to everything!
  • How are things structured and scheduled? Focus on big-picture questions first: don’t get mired in the details until you need to.
  • Read over all of their documentation! Don’t make the mistake of accidentally breaking something because you didn’t bother to read the information they gave you.
  • Each group will be getting a form early next week asking for your impressions of the documentation that was provided and its usefulness to your work.
// reveal.js plugins // Added plugins