---
title: "Changing Warehouses and Dashboards"
author: Jed Rembold
date: June 11, 2025
slideNumber: true
theme: catppuccin
highlightjs-theme: catppuccin-mocha
width: 1920
height: 1080
transition: slide
---


## Announcements
:::{style='font-size:.9em'}
- Approaching deadlines:
  - Tuesday night (June 17):
    - Milestone 1 dashboards due
  - Following Tuesday (June 24)
    - Milestone 2 dashboards due
    - Documentation due
    - Group swap happens after this
- Weekly reflections will **actually** be starting! I will post a template and source by end-of-day tomorrow.
:::


## Tonight
- Data Warehousing Continued
- VSCode Remote Connections
- Dashboarding with Grafana
- Work time

# Data Warehousing Continued

## Fact Table Review
- Fact tables hold **numeric** _facts_ at the desired grain
- A measurable business process tends to create events that populate a fact table
- Will have multiple foreign keys linking to various dimensional tables
- Usually will have a compound primary key comprised of a collection of the dimension keys

## Dimension Table Review
- Dimension tables hold descriptive information about a business event
- Often have many columns, but relatively fewer rows
- Has a single primary key (usually surrogate) that links to the fact table
- Primary source of query constraints, grouping, and labeling


## Creating a Dimensional Model
#. Facts model important business processes, so you need to start by **identifying the key business processes** that you are interested in analyzing
#. Once you have the processes down, work out what level of granularity of the events you might need for each
#. Then you can work out what facts you have (or can compute) from the underlying data that describe that process at the desired granularity
#. Identify what dimensions you have access to that can provide additional context
#. Create your star schema database model


## Example: Course Analytics
::::::cols
::::col
- We've been hired to build a data warehouse for a university to support academic performance reporting
- The dean wants to answer questions like:
  - How are students performing by course, instructor, and department?
  - What's the pass rate for 100-level vs 400-level courses?
  - How many credits are earned across departments each term?
::::

::::col
The university collects the following data:

- Student ID
- Course ID
- Instructor ID
- Term
- Grade received
- Credits attempted/earned
- Enrollment date
- Department

::::
::::::

## Dealing with Timestamps
- If your events have a timestamp associated with them, it can seem straightforward to just include that timestamp in the fact table
  - This is rarely done in isolation though!
- What role is the timestamp playing? Usually it would be for query slicing, and so instead should be broken up into dimensional tables.
  - If you just care about days, a date dimension would work fine
  - If you can about time of day, you could _also_ have a time dimension
- You **can** still include the timestamp in the fact table, but you would usually only ever do so if you felt a need to exactly reconstruct a sequence of events


## Activity: Patient Analytics
::::::{.cols style='font-size:.8em'}
::::col
- We’re working with a regional health system to help build a data warehouse focused on outpatient care. We’ve been given data on clinic visits and are tasked with designing a dimensional model to support executive and clinical reporting.
- Want to answer questions like:
  - What’s the average visit length by clinic or provider?
  - What’s the volume of visits by diagnosis and month?
  - What’s the revenue and collection rate by payment type?
::::

::::col
We have the following data:

- Patient ID
- Patient insurance information
- Clinic ID
- Visit date and time
- Attending provider (doctor, nurse practitioner, etc.)
- Diagnoses (ICD-10 codes)
- Procedures performed (CPT codes)
- Duration of visit (in minutes)
- Amount billed and amount collected
- Payment type (insurance, out-of-pocket, Medicaid, etc.)

::::
::::::

## A Wrinkle
- Suppose some patients in the previous activity changed their insurance information
  - How do we account for this in the patient dimension table?
- Highlights the importance of dealing with **slowly changing dimensions** (SCDs)
- These are examples of dimension attributes that might change with time (but usually don't change that often)
  - E.g. names, addresses, jobs, job titles
- We need consistent methods of accounting for these changes in our data warehouses
  - Generally broken down by types


## Type 0: Keep Original
- The first type of SCD is perhaps the most straightforward: do nothing
- Here if a value changes, you ignore the change, and just continue to store the original
- Clearly this is not always wanted, and only makes sense for particular attributes
  - Commonly attributes that might be representing the original state of something
    - E.g. `original_name`, `original_credit_score`, `orignal_signup_date`


## Type 1: Overwrite
- This type of SCD overwrites the original values whenever a change occurs
  - If someone's name changes, you just update the original dimensional row with the new name
- This is great for holding the current state of things, but it **erases the history**
- May need to ensure that any downstream consumers of the data are updated to reflect the change


## Type 2: Add a Row
- In this type of SCD, a new row is added whenever a value changes
  - This row would duplicate all other columns, but hold the updated value in the SCD column
- This would **invalidate natural primary keys**, as they would then be duplicated, so tables using this type must use surrogate primary keys
- Generally need to add 3 additional columns to the table as well:
  - row effective date (or timestamp) - when did this row become valid
  - row expiration date (or timestamp) - when did this row become invalid
  - current row indicator - a flag (boolean or "Y/N") indicating if this row is active _now_

## Type 2 Checklist
- Any time a value changes in a Type 2 SCD, you need to:
  - Add a new row and generate a new surrogate primary key for that row
  - Populate that row, including the row effective and current row columns (expiry will be `NULL`)
  - Update the previous row, adding a row expiry and changing the current row columns
  - Ensure that newer rows in your fact table link to the newly created dimensional row (use the correct foreign key)

## Type 3: Add a Column
- This type of SCD adds a new column or attribute to the dimension table when a value changes
- Generally, the new value overwrites the original column, and the old value is moved to the new column
- Sometimes called an _alternate reality_, as it becomes easy to filter or group by either the current state of things or the older reality
- This quickly becomes unwieldy if the dimensions are changing very often, and so is perhaps the least used of these types

## More Types
- There are more types as well:
  - Type 4: Adding mini-dimensions
  - Type 5: Mini-dimension with Type-1 outrigger
  - Type 6: Adding Type-1 attributes to Type-2 dimension
  - Type 7: Dual Type-1 and Type-2 dimensions
- But these are mostly combinations or variants of the above
- Being comfortable with Types 0 - 3 is generally sufficient


## Example:
- Suppose we have a source `users` table that tracks current username, address, and birthplace
- Such a table would likely be translating into some form of dimension table in the warehouse
- How might the dimension table change for any of the available types should one of the columns change?


## Dimensional Design Worksheet
![](../images/dimension_design_worksheet.excalidraw.png)


# Dinner Break!

# Other Remote Methods

## Limitations of the Terminal
- While the terminal can be incredibly powerful, you may find it cumbersome to work with in some instances
  - Terminal text editors can take some tweaking to look nice (I should post some guides on this)
  - Viewing things like images in a terminal can be very cumbersome
- As such, in some cases, it can be useful to use a graphical tool that can leverage the SSH connection
- There are many tools that vary wildly across OS, but one useful cross-platform one that many will already have is VSCode

## Visual Studio Code with SSH
- The easiest way to get running with SSH in VSCode is through the "Remote - SSH" extension, officially published by Microsoft
  - If it doesn't come along automatically, the "Remote Explorer" extension is also recommended
- On the left-side menu, you'll see an icon for "Remote Explorer" now
- Clicking on it, you'll see a selection of all the profiles you have set up in your SSH config. Just click on the one you want to access!
- Once it opens, you probably want to click File, then Open Folder, and then select the home folder on the remote system!
- Now you can see everything!


## More!
::::::{.cols style='align-items: center'}
::::col

- Other nice things about the VSCode integration is that it allows:
  - Viewing images!
  - Previewing markdown!
    - Basic rendering in possible by default, can add extensions for more things like Mermaid support, qmd files, etc

  - Opening a terminal!
::::

::::col

![](../images/vscode_ssh_example.png)

::::
::::::


# Dashboarding with Grafana

## Why Dashboards
- Depending on your role in the company, you may just be responsible for creating and maintaining the data warehouse, or you might have to go beyond to deliver on the actual business questions
- Requires a distilling of the important information out of the warehouse and into a form that non-technical people can appreciate and understand
  - If you have made your warehouse well, this should be fairly easy
- Reports and dashboards are among the most common tools that you might use to convey this information

## Grafana
::::::cols
::::col
- Open-source, versatile platform for querying, visualizing, and alerting from various data sources
- Creates interactive and customizable dashboards
- Dashboards can be shared within teams or externally
- Data for dashboards queried directly from connected data sources (like the data warehouse!)
::::

::::col
![](../images/grafana_intro.png)

::::
::::::

## Accessing Grafana
- Like MinIO, a single Grafana server is running that you all will connect to
  - Accessible at `grafana.advde` on port 3000
- You access Grafana through a web interface, so this is one more tunnel you will likely want to create!
- Each project will belong to its own organization, which will give you full control over creating your data connections and dashboards
  - A single organization admin account has been created for each project
  - Group login information is the same as for MinIO


## Data Sources
- Grafana stores no data, it only accesses it from other various sources
- An **admin** account needs to set this up:
  - In the left menu, go to Connections
  - Search for Postgres (assuming you want to connect your warehouse) and select it
  - In the upper right click "Add new data source"
  - Populate the necessary portions with the data to connect to your data warehouse
    - Should be host, database name, user, and password
  - Under Additional Options, set the Postgres version to the latest possible
  - Hit "Save and Test" at the bottom. If you get the positive pop-up dialog, you are good to go!


## Creating a Dashboard
- Click on the Dashboard section in the left menu
- Click the giant "Create Dashboard" button, and then "Add Visualization"
- Select your Postgres data source
- Create your visual!
- You can then go back to the dashboard to create more visuals
- You can arrange and size visuals on the dashboard
- Make sure you save the dashboard!


## Making a Visual
:::::{.cols style='align-items: center'}
::::col
- Generally, you'll spend your time in two areas making a visual:
  - In the query builder at the bottom of the screen. 
    - Here you can either construct a query from the GUI, or, because you are all competent in SQL, from the query tool directly
  - In the configuration menu to the right of the screen
    - Here, you can choose both the type of visual, as well as all the customizable aspects of that visual
::::
::::col
![](../images/grafana_visual_example.png)
::::
:::::

## Working with Times

- Grafana dashboards generally have a time window indicator provided near the upper right
- By default, changing this will just change the _scales_ of your axes
- It can be useful to have changes to this value dynamically affect your SQl queries, which you can do by using the variables below

:::{style='font-size:.8em'}

| Variable                | Description                                                   |
|-------------------------|---------------------------------------------------------------|
| `$__timeFrom()`         | The start of the time range                                   |
| `$__timeTo()`           | The end of the time range                                     |
| `$__timeFilter(dt_col)` | Use with a `WHERE` to automatically add the `BETWEEN` content |
| `$__interval`           | A suggested bucket size give the current time range           |

:::


## Variables
- You may want to add a little interactivity to your dashboard!
  - Maybe a dropdown, or a text field for filtering?
- You can define dashboard _variables_ that you can then use in your SQL queries to make things dynamic
- While editing the dashboard (not a specific visual), go to Settings in the upper right
- Choose the variables tab and add a new variable
- Choose a type, give it a name, and fill in any other information
- In a visualization query editor, you can refer to that value with `'$variable_name'`
  - Note that it needs to be in a string if it gives text values!



# Your Time

## To be Working on!
- I am giving you the rest of the evening to work with your groups to ensure you will make next Tuesday's business decision deadline
- As a reminder of what you need by that point:
  - Airflow taking nightly snapshots into data lake
  - Designing your data warehouse for the initial business questions
  - Moving data from the lake into a data warehouse on some timetable
  - Creating dashboards that read from the warehouse and provide context and meaning to the business questions
- You should absolutely be working on documentation while you do this, but it will **not** be due next Tuesday (June 17)

