---
title: "SQL Utilities"
author: Jed Rembold
date: April 16, 2025
slideNumber: true
theme: nord_light
highlightjs-theme: nord
width: 1920
height: 1080
transition: slide
---


## Announcements
- HW10 due tonight (or as soon as you can get it in) and then you just have the project!
- I continue to be terribly behind in my grading for this class, and for that I sincerely apologize. I should be able to make meaningful progress in the coming week.
- For next week, we just have the presentations planned. I'll put together and post an order soon.
  - You can set your Grafana instance to "Serverless" as well to save on credits!
  - Make sure you are meeting the requirements on the project guide and scoring rubric!
  - The supplementary GitHub repo will be available soon. (I'll get it available tonight and post an announcement)


<!--------- SERVING ------------->
<!--
## Serving it up hot
- The last step of the data engineering pipeline is to make the data available to others, commonly called _serving_
- There are many ways this could be achieved, but often might look like:
  - Moving data into other shared databases
  - Making data available via an API
  - Displaying data in live dashboards for decisions makers


## Serving Options
- For the project, I see largely two options here
  - Many of you have been taking PowerBI this semester. You could make a dashboard and link it to your database
  - I have created a Docker container to create a simple API for you, given a few customizations
- If you have a strong desire for something else, I'm open to your ideas. But it should be something that will reflect a live view of the data.


## DB2API
- The DB2API container is similar to that used for your data transformation, in that is is best built with slight customizations.
- You can access the template repository [here](https://github.com/rembold-data-engineering-master/db_serving_api_template), from which you should make your own repo.
- For an extremely simple API, you should just need to edit the `endpoints.yaml` file, which is a plain-text format.
  - Syntax is: `/endpoint`: `SQL QUERY`
- Try to construct queries that return a reasonable amount of information at a time.

## Getting fancier
- The API is written using Python's FastAPI library, which is really nice to use if you have any Python background
- You can add more customized API endpoints by adding functions to the bottom of the `db2api.py` file
- Extra options:
  - Endpoints with a parameter in them: `/cities/{pagenum}`
    - Add the parameter to the defined function as well
    - You can use the parameter in your SQL query with `:pagenum`
  - Endpoints with optional query parameters: `/cities?state=OR`
    - Add the parameter to the defined function with a default value

## Deployment
- Within your Railway project, add a new GitHub Repo, and choose the one you made
- The API needs to be able to connect to your database, so add the `DATABASE_URL` environment variable to it, similar to how you did with the scraper.
- Deploy the container.
- Go into the container settings and find the part about Public Networking. Hit the "Generate Domain" button to get a URL on Railways servers. This is your new API root endpoint!
- Need some documentation? It is built in! Just add `/docs` to the root endpoint.


## Project Serving
- Try to serve up the data from your project that you think might be most useful
- If an API, what endpoints make the most sense?
  - In what form might someone most want the data? You don't need to just make an endpoint for every single table, nor is that probably a great idea.
- If a dashboard, what are the main insights you want your dashboard to display?
  - Don't just throw numbers on the screen. Make it cohesive and telling a story.
  - If you are making a dashboard, please ensure that you can share it with me in a way that I can actually access and view it! Send me a test run beforehand to check.
-->

<!----------- SQL --------------->

## DRY
- A common programming term is that of _DRY_: Don't repeat yourself
  - For convenience sake (why redo work you have already done?)
  - For debugging/maintenance sake (in case of a change, each replicate needs to be found and changed)
- Want to look at several methods in SQL that we can utilize to better adhere to DRY principles
  - Views
  - Custom functions
  - Triggers


# A Splendid View
## Views
- A view in SQL is essentially a saved query 
- Existing views can be treated just like tables, selecting from their contents
  - You can even, to a limited extent, add, remove, or update contents
- Syntax similar to creating a new table, just with `VIEW`
  ```pgsql
  CREATE VIEW |||view name||| AS (
    SELECT |||colnames|||
    FROM |||tablename|||
  )
  ```

## View vs Table
- Despite seeming similar, views are different from tables
  - A view stores no new internal information in the database, whereas creating a new table copies that information
    - This means a view will always reflect the latest data, whereas a created table would need to be updated
  - Technically, a view runs the original query each time it is accessed. This makes them less ideal for computationally heavy operations
- Views can be a way to give users access to only particular columns of a larger table
  - This can be a boon for security and giving only those who need them permissions to change various tables


## View Operations
- You can drop views just like tables: `DROP VIEW |||view_name|||;`
- You can create or _replace_ views if the original number of columns and types are still present: `CREATE OR REPLACE VIEW |||view_name||| ...`
  - Avoids having to drop first to recreate
  - New view could have more columns, but can not have less
- Insertions, updates and removes can be made on **simple views**, which will propogate back to the original table!
  - Views must reference single tables (no joins) and have no distinct or group by type clauses
  - Data in the original table **not** present in the view will be given `NULL` values


## Limiting Views
- Can also add `WITH LOCAL CHECK OPTION` to the end of a view creation query
- This will enforce that only data visible within the view can be edited with inserts, updates or removes
  - This includes any filtering done by the view!
- Can be an excellent way to limit what data can be changed
- Can replace `LOCAL`{.text} with `CASCADED` if you are referencing several nested views and want the restrictions to apply based on all the views' restrictions


## API Views
- PostgREST will serve up endpoints for views as well as tables!
- This can make them an _excellent_ option for collecting together information from your normalized tables that you'd like to offer up to the API user
  - Since the view always showcases the lastest data, you don't need to worry about keeping in sync
- Just create the view in your API schema, referencing your other tables
  - Don't forget to give the anonymous user account permissions to `SELECT` from it after you do!

## Activity!
- I've given you a simple starter bit of sql [here](../activity_data/boxes.sql) to create and populate some tables. Use them to practice the below in the shown pairings.
- Tasks:
  - Create a view called `newest` that will only show the "new" boxes from `best_boxes`
  - Create a view called `newred` that will only show new and red boxes using only `newest`
  - Adjust the `CHECK OPTION` setting for both views so that you could add `('Jane', 2, 7, 9, 'red', 'new')` to `newred` but not `('Bart', 6, 10, 12, 'red', 'used')` to `newred`

# Break!
## Break Time!
![](https://media2.giphy.com/media/26tP4mS08o114G2kg/giphy.gif?cid=ecf05e47lyobwhi0b3chbob5f2y4biam9e28qeku3jhralxd&rid=giphy.gif&ct=g)


# Functional SQL
## Functions
- Postgres has many prebuilt functions that we have used over the semester
- Sometimes you need something a bit more bespoke though, for which Postgres gives you some "easy" methods of writing your own functions
- What language you use to write your own function can be exceptionally flexible:
  - Pure SQL
  - PL/pgSQL is an extension of SQL that offers more programming logic
  - Python
- All functions will follow a similar structure, but the body will depend on the details of your chosen language
  - Documentation for all the languages (and functions in general) can be found [here](https://www.postgresql.org/docs/current/server-programming.html)


## Structure of Function
- The structure of any defined function will generally look like:
  ```pgsql
  CREATE FUNCTION |||function name|||(|||function arguments|||)
  RETURNS |||return type||| AS
  |||function body string|||
  LANGUAGE |||function lang|||;
  ```
- `|||function name|||` - The desired name of the function
- `|||function arguments|||` - A comma separated list of paired arguments and types
- `|||return type|||` - The variable type of what will be returned (selected)
- `|||function lang|||` - The language used for the function body
- `|||function body string|||` - a **string** containing the query or logic to process whenever the function is run


## In SQL
- `LANGUAGE SQL`
- Using the basic SQL language gives you access to any existing SQL queries that you already know
- To return something you should use the `SELECT`
- You can also write functions that do **not** return something by specifying `RETURNS void` and then using `INSERT`, `UPDATE`, `DROP`, etc.
- Using `$$` to quote the body is recommended so that you do not need to double up any single quotes


## SQL Example
- Suppose we wanted to define a function to return us the difference between two timestamps in hours

  ```pgsql
  CREATE FUNCTION diff_in_hours 
    (start_ts TIMESTAMP, end_ts TIMESTAMP)
  RETURNS FLOAT AS
  $$
  SELECT date_part('epoch', end_ts - start_ts) / 60
  $$
  LANGUAGE sql;
  ```


## In plpgSQL
- `LANGUAGE plpgsql`
- Extends SQL to add classic programming control structures
- Needs to be wrapped in `BEGIN`, `END` keywords
- Operates like a hybrid between SQL and a more typical programming language
- Each phrase needs a terminating semicolon at the end
- More flexible than pure SQL, but has additional syntax to learn


## plpgSQL Structures
- You can define variables
  ```{.pgsql style='font-size:1em'}
  x := 56;
  ```
- Control statements
  ```{.pgsql style='font-size:1em'}
  IF x > 5 THEN
    |||...do something...|||
  ELSE
    |||...do something else...|||
  END IF;
  ```
- Returns
  ```{.pgsql style='font-size:1em'}
  RETURN x;
  ```

## plpgSQL Example
- Suppose we wanted to define a function to return us the same difference between two timestamps in hours, but now in pgsql

  ```{.pgsql style='font-size:.9em'}
  CREATE FUNCTION diff_in_hours_plpgsql 
    ( start_ts TIMESTAMP, end_ts TIMESTAMP )
  RETURNS FLOAT AS
  $$
  BEGIN
  IF end_ts > start_ts THEN
    RETURN date_part('epoch', end_ts - start_ts) / 60;
  ELSE
    RETURN NULL;
  END IF;
  END;
  $$
  LANGUAGE plpgsql;
  ```

## In Python
- Python is another supported language in which functions can be written
- Requires an extension to be active:
  ```pgsql
  CREATE EXTENSION plpython3u;
  ```
- Having the extension available might depend on how your database was installed
- Most common data types with corresponding types are converted, anything else (like timestamps!) are just utilized as strings
- I'd generally write the Python in a more Python-suited editor and then copy it in.

## Activity!
:::{style='font-size:.9em'}
- Returning to the boxes tables you were working with earlier:
  - Write a function named `box_volume` to compute the volume of a cubic object, and use that function to populate a new `volume` column in `best_boxes`
  - Write a function called `scale_boxes` which updates the `best_boxes` table to scale the value in the named `width`, `height`, or `depth` columns by the provided argument. Run it to ensure it is working.
:::


# Obligatory Trigger Warning
## Triggers
- Triggers are contingencies that you can set up so that when a certain event occurs, a particular function is run
- Commonly used to update or change values in other tables when something is newly added or changed in a table
- Requires two things:
  - A defined trigger function
    - Can't use the pure SQL language, plpgsql probably best
  - A defined trigger itself


##  The Trigger Function
:::{style='font-size:.9em'}
- Initially going to avoid inputs for complexity reasons
  - Technically, this makes the function what Postgres calls a _procedure_
  - In truth, you _can_ have inputs to better generalize and potentially reuse trigger functions, but we won't cover that here.
- You still likely want a way to get information about the inserted/changed/deleted rows
  - The general names `NEW` and `OLD` will refer to the newly changed or previous table values (respectively)
- The function needs to return a special `TRIGGER` type
- What you return will depend somewhat on what type of trigger you are making
  - Only really matters for triggers firing _before_ an event
    - `UPDATE` or `INSERT` triggers should return `NEW`
    - `DELETE` triggers should return `OLD`
:::

## The Trigger Itself
:::smaller
```pgsql
CREATE TRIGGER |||trigger name|||
AFTER UPDATE
  ON |||table_name|||
FOR EACH ROW
EXECUTE FUNCTION |||func_name|||();
```
- Need to give the trigger a name
- Specify _when_ it will trigger
  - Could be _before_ or _after_ a particular event
  - Specifies what table will be "watched"
- Specify how many times it will trigger
  - Could be for each row affected by the event, or just once for the entire event
- Specify what function will be run (the earlier defined trigger function)
:::

## When to Trigger
- You can listen for `INSERT`, `UPDATE`, or `DELETE` events that affect a particular table
- Can trigger **before** the event occurs
  - Would be before any conditions are checked or changes made
  - What is being returned from your trigger function is important here!
- Can trigger **after** the event occurs
  - After all conditions are checked and changes have been made
  - What is returned from the trigger function has no bearing here
- Technically can also trigger `INSTEAD OF` the event occurring, but we won't cover that here


## Fine-tuning Triggering
- In some cases, triggering off of _any_ update or insert is too broad
- Can localize to just a single (or few) columns by using `OF colname1`:
  ```pgsql
  AFTER UPDATE OF |||column_name₁|||, |||column_name₂|||
  ```
- Can localize to only triggering off certain rows by inserting a `WHEN` condition before the `EXECUTE` line
  - Can use `NEW` and `OLD` within this condition
  ```pgsql
  WHEN OLD.|||name||| != NEW.|||name|||
  EXECUTE FUNCTION |||myfunc|||();
  ```


## How much triggering
- Using `FOR EACH ROW` will cause the trigger function to run once for _every_ row that was a part of the triggered event
  - Run an `UPDATE` that affected 10 rows? The trigger function will be run 10 times, once on each row
  - Ensures you will have access to the `NEW` and `OLD` variables in your trigger function
- Using `FOR EACH STATEMENT` will cause the trigger function to run _once_ when the triggering event occurs
  - Run an `UPDATE` that affects 10 rows? The trigger function just runs once.
  - A statement altering 0 rows will **still have an effect**!
  - Statement level triggers do not have a way to examine individual modified rows using `OLD` or `NEW`


## Extra Considerations
- You can have multiple triggers acting off a single event
  - In this case, Postgres processes them in alphabetical order
- It is possible to set up cyclical triggers, where Trigger A makes a change that triggers B, which makes a change that then triggers A again, etc
  - Postgres does not have built-in tools to prevent this, and it is up to you not to trigger your way into bad situations!

<!--
## Options
- You can trigger things after _or before_ updates, inserts, removals, etc
  - `AFTER INSERT`, `BEFORE REMOVE`, etc
- Can run once for each row: `FOR EACH ROW` or just once for the entire statement `FOR EACH STATEMENT`
-->

## Activity!
- You are very worried about the data within `best_boxes` from our earlier activity, such that you want to keep a log of any name changes that might occur. The table `name_changes` was already created for you from the earlier activity.

  - Write a function and corresponding trigger that, upon changing a `name` from `best_boxes`, will add a row to the `name_changes` table, where the operation would be `'update'` in this case. The `changed_at` column should be set to the current timestamp. Test it by changing 'Bob' to 'Bobby'.
  - Write another function and corresponding trigger that will add a row to `name_changes` whenever a new row is added to `best_boxes`. In this case, the operation would be `'insert'` and the `og_name` would be `NULL`.
  - (If time) Add a trigger so that whenever a dimension is altered (like, from your scaling function), the volume is automatically recomputed.