---
title: "Window Dimensions and Document DBs"
author: Jed Rembold
date: June 25, 2025
slideNumber: true
theme: catppuccin
highlightjs-theme: catppuccin-mocha
width: 1920
height: 1080
transition: slide
---


## Announcements
:::{style='font-size:.9em'}
- The swap happened! You'll have some time to check things out with your group later tonight
- Next Milestone deadline will be in 2 weeks, so you have some time to learn your newly inherited project
- I am aiming to have the new data source live and the milestone posted by this weekend
- Don't forget your weekly reflections!
:::


## Tonight
- Alternative ways to create dimension/fact tables
- Introducing Document Databases
- A new data source?
- Work time

# Windowing Dimensions

## Lake 2 Warehouse: 2 approaches
- If your data warehouse is being populated directly from source data, you essentially always have to update it "in-place"
- But if you have a data lake of snapshots, you have some options:
  - Populate the warehouse incrementally, pulling from each day's snapshot
  - Recreate the warehouse tables from scratch each time, pulling from all the snapshots enmasse


## Incremental Loads
::::::cols
::::col
:::{.block name=Pros}
- Very efficient: only working with 1 day's data at once
  - If you are working with massive amounts of data, that could be very important
- Incremental changes "fit" easier with the concepts of Type-2 SCDs
:::
::::

::::col
:::{.block name=Cons}
- Non-trivial to determine what rows need to change or be updated
- Repopulating the warehouse means backfilling across each incremental daily load
:::

::::
::::::

## Recreating from Scratch
::::::cols
::::col
:::{.block name=Pros}
- A single run populates the entire dimensional model
- Looking at the entire history at once, so easier to determine what rows changed when
:::
::::

::::col
:::{.block name=Cons}
- Requires a good understanding of window functions to achieve easily
- "Wasteful" in the sense that much of the dimension table creation is likely being duplicated work from what existed before
:::
::::
::::::

## Recreation Requirements
- In order to regenerate a dimension (or fact) table from scratch you need:
  - The historical record of snapshots
  - Knowledge of when the snapshot happened
    - This could be written to a column or in the snapshot name itself
  - A tool that can read in _globbed_ files (DuckDB works very nicely for this!)


## Reading in the Snapshots
- DuckDB can accept BASH _globs_ to expand out a pattern into a list of files
- Usually this would utilize a wildcard character `*` to represent "some amount of anything"
  - For example, `*.parquet` would expand to represent all of the parquet files in a given folder
- Reading in a bunch of snapshot parquet files into DuckDB is then as easy as:
  ```sql
  SELECT *
  FROM 's3://your-bucket/*.parquet'
  ```
- This essentially reads in each parquet file to its own table, and then `UNION`s them all together


## Reading Options
- Sometimes it is useful to know what file each row came from, if you want that, then you need to use the full `read_parquet` function, and add a `filename=true` parameter
  ```sql
  SELECT *
  FROM read_parquet('s3://your-bucket/*.parquet', filename=true)
  ```
- For snapshots, your schema are likely the same, so the union is obvious. But what if your parquet files have slightly different schema?
  - Likely want to add the `union_by_name=true` option


## Steps: from history to dimension
- Once you have all your historical data read in with a column of when each snapshot happened, you can proceed to create your dimension table
- Generally looks something like:
  - Filter down to only the data you want in your dimension table, if necessary
  - Add a hashed column representing all of the data that might change which you want to track
  - Identify when a given row changed by using `LAG`, partitioning by your row identifier and ordering by the snapshot times
  - Filter down to only keep the original rows or the rows where a change happened
  - USE `LEAD` to get when the _next_ version would have started, subtracting a day to get when that rows version ended
  - Select out just the columns you want for your final dimension table

## Hashing Rows
- Often we need an easy way to compare the contents of many columns
- Different approaches, but one common one is to concatenate all the data as text, and then run a _hash_ function on the row
  - DuckDB can use both the `md5()` or `hash()` hash functions
- Might look something like:
  ```pgsql
  SELECT pid, fname, age, hash(fname, age) AS current_hash
  FROM |||mytable|||
  ```

## When data changes
- To identify when data changes, we `LAG` the hash so that we can compare one rows hash to the previous
  - Usually call it something like `prev_hash`
  - Same hash (`current_hash = prev_hash`)? Then no change!
- To compare the appropriate rows, we need to partition by whatever is the unique identifier of a record in that dimension table
  - This is **not** the primary key, which uniquely identifies every **row** (and which we are not concerned with at the moment)
  - These are identifiers for what represents a given piece of information (pid, station_id, etc)
- Remember that ordering matters! So you want to order by the snapshot date so that lagging a row compares each row to the _next_ snapshot


## Keeping Originals and Changes
- You likely have many snapshots where little or no data actually changed
- You don't care about those for your dimension table, you only care about the original data and whenever that data changed
- Filter your table from the last step to only include rows where:
  - The previous hash is `NULL` - this implies the original entry of that row
  - The current hash is different from the previous hash - this implies that the data changed at that point

## Determining the Effective and Expiry dates
- The effective date for a given row at this point is just its snapshot date: when it entered the system
- To get the expiry date, you can `LEAD` the snapshot dates to see when the _next_ version's snapshot began
  - Then subtract 1 from this to showcase that the current row expired one day before the next version began
    - If taking snapshots more often than once a day, adjust the interval to subtract accordingly


## Trimming down and adding is_active
- As a last step, you would usually just select out the columns that you want in your final dimension table (none of the hash stuff)
- Technically, you could always find the current, most recent entries by filtering for where `expiry_date IS NULL`
- If you want something like an `is_active` column, you could add such a column as a boolean easily with
  ```pgsql
  SELECT pid, fname, age, effective_date, expiry_date, 
    expiry_date IS NULL AS is_active
  FROM |||last steps table|||
  ```

## Tweak as needed
- This process _should_ work for the vast majority of ways that your data could be changing
- Think of this as a process though, not a recipe
  - Important that you understand the **why** of each step, so that you could adapt it to a slightly different situation as necessary
- At the end of the day, you can create your dimension tables however you please: I am just trying to make you aware of options


# Document Databases

## Moving Beyond RDBS
- We have spent a lot of time focusing on relation databases for good reason: they are very popular and capable!
- They do have their drawbacks though, and for that reason other database models exist
- The next several weeks we are going to spend learning the basics of how to access these models:
  - Mostly focusing our efforts around access and querying, assuming that such a database has already been created
  - We will not get into as much how to model or administrate these sorts of databases
- Initially we will be starting with _Document Databases_


## The Document Model
::::::cols
::::col
- Sometimes referred to generally as NoSQL to highlight its differences from relational models
- Best for situations where data exists largely in self-contained documents and outside relationships are fairly rare
- Major examples: XML, JSON, MongoDB
- One-to-many relationships give rise to a tree structure

::::

::::col
![](../images/document_tree.svg)

::::
::::::


## The Document Model Pros/Cons
- Strength of the document model:
  - It handles one-to-many relationships very well
  - More closely mimics object-oriented programming, so less impedance mismatch
  - Schema are generally quite flexible: not every record has to have the exact same information
  - Data is generally stored together or close to related data, so lookups can be fairly efficient

- Weaknesses of the document model
  - Struggles more with many-to-many relationships
  - Joins are more difficult, and possibly need to be done entirely outside the database in some cases


## MongoDB
::::::cols
::::col
- The variant of document database that we will focus on is called MongoDB
- [Pretty handily](https://db-engines.com/en/ranking_trend/document+store) the market favorite as far as document databases go
- Open-source, with local deployment options and free cloud deployment options through MongoDB Atlas
::::

::::col
![](../images/mongodb_logo.svg)

::::
::::::

## Document DB Basics
- The equivalent of a "table" in a document database is called a _collection_
- Collections are populated by _documents_, where a document is comprised of an assortment of keys and corresponding values
  - Keys need to be unique within a document
  - The corresponding values may, themself, be another document. Leading to a nested, hierarchical structure
  - Documents do **not** need to have the exact same keys, even if they are part of the same collection


## Document Syntax
::::::cols
::::col
![](../images/document_json.png)

::::

::::col
- Different document databases handle exactly how they comprise and store their keys and corresponding values in different ways
- MongoDB uses the very known and popular JSON framework though, so this is pretty straightforward
- Each document is a valid block of JSON keys and corresponding data

::::
::::::

## MongoDB Interactions
- Unlike most RDBMS which use some variant of SQL to interact with the database, document databases usually have more bespoke interfaces
  - Usually ported to a wide variety of languages though
- For MongoDB, those of most interest are likely:
  - Mongosh: MongoDB's cli interface client (similar to `psql` for Postgres)
  - PyMongo: The Python library connecting and querying Mongo databases
  - MongoLite: The most recent R library for working with Mongo databases
- The syntax between these is **very** similar


## Making Connections
- The primary way to connect to a Mongo database is through a connection string, **extremely** similar to how it worked with Postgres
- The general form of the string is:
  ```pgsql
  mongodb://|||username|||:|||password|||@|||host|||:|||port|||/|||database|||
  ```
- In Python then:
  ```python
  from pymongo import MongoClient

  client = MongoClient(
    "mongodb://jed:mypswd@localhost:27017"
  )
  db = client['database']
  ```

## Accessing a Particular Database
- Once you are connected, you can see what databases are available:
  ```python
  print(client.list_database_names())
  ```
- And then you can switch to that database with:
  ```python
  db = client.get_database(|||database name|||)
  ```
- To see the available collections within the database:
  ```python
  print(db.list_collection_names())
  ```

## Inserting Documents
- You can insert documents one at a time:
  ```python
  db.|||collection|||.insert_one(
    {"name": "Jed", "teaching": ["DATA510", "DATA599"]}
  )
  ```
- Or many at once:
  ```python
  db.|||collection|||.insert_many(
    [{"name": "Jed", "teaching": ["DATA510", "DATA599"]},
    {"name": "Rachel", "teaching": ["DATA510", "CS549"]}]
  )
  ```

## Normalization
- What if you wanted to insert the same document into two different collections (or parent documents)?
- Two main methods arise:
  - _Embedding_: you just copy the contents of the document into each of the various places. Note that this results in data duplication, but everything is self contained
  - _Referencing_: you insert a reference to the ObjectID of a document that exists in a different collection. This prevents data duplication, but you have to "join" the documents **manually**
- For most situations, unless the document is massive, folks usually take the route of embedding it in whatever context it is needed.


## Querying Collections
- You can call the `.find()` method on a collection to grab all of the documents within the collection
- Technically, in PyMongo, the `.find()` method returns an iterable, so you could loop over it:
  ```python
  for document in db.|||collection|||.find():
    print(document)
  ```
- If you just want a list of the documents: `list(db.|||collection|||.find())`{.python}
- If you just want the first document: `db.|||collection|||.find_one()`


## Fancier Querying
- What if we want to filter or sort the documents / data we get back?
- We can utilize a collection of _query selectors_ to the `.find()` method to fine-tune what we get back
- General form looks like:
  ```python
  db.|||collection|||.find({|||key|||: |||filter_document|||})
  ```
  where the `|||filter_document|||` would either be a literal value or a document that specifies an specific query selector

## Common Query Selectors

:::{style='font-size:.9em'}

| Name         | Description                           | Example            |
|--------------|---------------------------------------|--------------------|
| `$eq`        | Matches values that are equal         | `{'$eq': 5}`       |
| `$gt`        | Matches values that are greater       | `{'$gt': 8}`       |
| `$gte`       | Matches values that are greater or eq | `{'$gte': 8}`      |
| `$lt`        | Matches values that are smaller       | `{'$lt': 3}`       |
| `$lte`       | Matches values that are smaller or eq | `{'$lte': 3}`      |
| `$in`{.text} | Matches values appearing in the array | `{'$in': [1,2]}`   |
| `$ne`        | Matches values that are not equal     | `{'$ne': 'Jed'}`   |
| `$regex`     | Matches values that match the regexp  | `{'$regex': '^J'}` |

:::

## Picking Fields
- Just like we didn't always want all the columns from a table, you may only want certain fields from the document
- You can specify this with a second argument after the filtering argument in `.find()`
  ```python
  db.|||collection|||.find({}, |||field_choice_document|||)
  ```
- The document to choose fields is essentially a blacklist / whitelist of the desired fields

## Black and White Fields
- To indicate fields you want to keep, list the field name and a 1
  ```python
  db.|||collection|||.find({}, {'name': 1, 'age': 1})
  ```
- To indicate fields you do **not** want, just provide a 0 instead
  ```python
  db.|||collection|||.find({}, {'birthday': 0, 'parents': 0})
  ```
  which will get you all of the fields except those indicated
- You can never mix 0s and 1s except with the `_id`{.text} field that is automatically assigned when a document is inserted


## Logical Operators
- There are also logical operators to combine various operations

  | Name          | Description                                        |
  |---------------|----------------------------------------------------|
  | `$and`{.text} | Returns documents where both queries match         |
  | `$or`{.text}  | Returns documents where either query matches       |
  | `$nor`{.text} | Returns documents where both queries fail to match |
  | `$not`{.text} | Returns documents where the query does not match   |

- Except for `$not`{.text}, each has a list of query operators following it


# New Data Sneak Peek

## More Station Info
- A document database is coming online (also at `sources.advde`) which will contain a collection of information related to each station
- This will include information like:
  - Zone
  - Underground/above ground
  - Turnstiles and service dates
  - Turnstile schedules (ingress / egress)
- This will be a new data source that can be queried, snapshot, and incorporated into your data warehouse


## But wait, there is more!
- Additionally, all current taps (and all future taps) will receive an additional field: `turnstile_id`
- Coupled with the station documents, this should give you the ability to uniquely determine when and where an individual enters a station and leaves a station.
- This also means that the simulated ridership may become more complex:
  - Riders may start a new trip from a station different from the one they arrived at (they walked or caught some other transportation between)
  - Riders may travel from station to station (they won't always travel to and from places in perfect pairs


## Milestone 3
- Milestone 3 will not be due until July 8th
- It is going to be based around the "lost" business decision from earlier this semester
- Official guide should be up this weekend
- You have more time for this in part to allow for the fact that you have to be adapting to a new existing pipeline! Don't wait until the last minute to start!


# Group Work

## Welcome to a new world!
- You all have been officially moved to a new project at this point!
- What stays the same:
  - Your VM domain name
  - Your login information
- What changed:
  - What VM that domain takes you to
  - SSH keys to allow you access to that new VM
  - What S3 bucket you have access to when you log in to MinIO
  - What organization you have admin rights to when you log into Grafana
  - Any Discord notifications should automatically propogate to the correct channels

## Learning your new system
- The rest of the evening I have set aside for you to start learning the project you have inherited
- Make sure you can connect to everything!
- How are things structured and scheduled? Focus on big-picture questions first: don't get mired in the details until you need to.
- Read over all of their documentation! Don't make the mistake of accidentally breaking something because you didn't bother to read the information they gave you.
- Each group will be getting a form early next week asking for your impressions of the documentation that was provided and its usefulness to your work.