---
title: "Windows into Transformation"
author: Jed Rembold
date: March 12, 2025
slideNumber: true
theme: nord_light
highlightjs-theme: nord
width: 1920
height: 1080
transition: slide
---


## Announcements
- Homework 7 is due next Wednesday night (coming out this evening)
- My grading quest continues
- Project Pairs/Groups were sent out and Project Guidelines shared!
  - Some good target deadlines would be to:
    - Have your scraper up and scraping by this weekend
    - Have a transforming/cleaning container up and running by the end of next week (to be introduced later today)

# Getting in the Zone
## Time Zones
- Dealing with time zones can be a headache, and it is a very nice feature that Postgres can work with them smoothly
- By default, Postgres will _display_ any timestamp with a time zone with the time as you would measure it in your current system timezone
- What is your current system timezone?
  - `SHOW timezone;`
- Getting general information about timezones:
  - Getting abbreviations:
    - `SELECT * FROM pg_timezone_abbrevs;`
  - Getting full names:
    - `SELECT * FROM pg_timezone_names;`


## Teleportation
- It can sometimes be useful to switch your "current" time zone
  - Maybe it is easier to compare times to someone else living in that time zone
- Several methods to make the switch:
  - Change your `postgressql.conf` file, which controls your Postgres server. Only recommended if you have permanently moved elsewhere and the database time zone has not updated appropriately.
  - Set future queries in a single session to be from a new timezone:
    
    `SET |||timezone||| TO |||time_zone_name_or_abbrv|||;`
    - This will also adjust what values your `localtime` or `localtimestamp` report!
  - Transform a single query to be reported in a different time zone:
    
    ```pgsql
    SELECT |||dt_col_name||| AT TIME ZONE |||tz_name_or_abbrv|||
    FROM |||tablename|||;
    ```

## Activity
- Using the taxi rides dataset, see if you can:
  - Compute the number of rides given each hour of the day
  - Compute the average cost of rides over each day of the month
  - Compute the median cost of rides over each day of the week
  - Compute the average duration of each ride over each hour of the day
  

<!----- Windows ------>
# Through the Window
## Window Functions
- One useful piece of kit that is only briefly mentioned in the text in Ch11 is that of _window functions_
- A window function is like a mix between a normal column value and an aggregate function
  - Unlike aggregate functions, a window function returns a value for each row
  - Unlike normal column values, a window function can utilize other rows included within its "window" in making an aggregation
- Any normal aggregate function can be used as a window function, though there are specific window functions as well
- Window functions can only be used inside `SELECT` or `ORDER BY` statements
  - They are evaluated _after_ any filtering, grouping, or normal aggregations


## Over the Hill
- The defining characteristic of any window function is the `OVER ()` keyword, which comes after the aggregating window function
- Content inside the `()` determines the "window" of the window function
- By default, if nothing is provided, the entire column is the window
- The below would output the average of the column in every row
  - There will be ways we can also achieve this through subqueries
```pgsql
SELECT AVG(|||col|||) OVER ()
FROM |||table name|||;
```

## Dedicated Window Functions
- You can use any existing aggregate function as a window function, but there are also more specific window functions (full list [here](https://www.postgresql.org/docs/current/functions-window.html))


:::{style='font-size:.8em'}

| Function           | Description                                                                           |
|--------------------|---------------------------------------------------------------------------------------|
| `row_number()`     | Assigns an ascending row number to each row in a window                               |
| `rank()`           | Assigns an ascending rank to each row, with possible ties skipping the next value     |
| `dense_rank()`     | Assigns an ascending rank to each row, with possible ties not skipping the next value |
| `first_value(|||col|||)` | Returns the first value in the window of column `|||col|||`                                 |
| `last_value(|||col|||)`  | Returns the last value in the window of column `|||col|||`                                  |
| `lag(|||col|||)`         | Returns the previous row of column `|||col|||`                                              |
| `lead(|||col|||)`        | Returns the next row of column `|||col|||`                                                  |
| `nth_value(|||col|||, |||n|||)`   | Returns the `|||n|||`th row of column `|||col|||` (`NULL` if doesn't exist)                                                  |

:::

## DETERMINING ORDER
- Often, to be useful, you may want to define an ordering inside the `OVER ()` statement
- As soon as you specify an ordering, **the default window changes**
  - By default, each window now encompasses everything from the first row, up to that current row
    - Easiest to see with classic aggregate functions

```pgsql
SELECT COUNT(*) OVER (ORDER BY |||col|||)
FROM |||table name|||;
```

## Tweaking the Window
- You can tweak this window by specifying the starting and stopping point, using the syntax:
  ```pgsql
  ...|||type||| BETWEEN |||offset||| PRECEDING AND |||offset||| FOLLOWING
  ```
  which appears in the `OVER` clause, after any provided ordering
  - `|||type|||` can be either `ROWS`, `RANGE`, or `GROUP`
  - `|||offset|||` can be 
    - an non-null, non-negative integer or `UNBOUNDED` if the type is `ROWS` or `GROUP`
    - an value that makes sense to add or subtract from the ordered column if the type is `RANGE`


## Excluding the Window
- Can also exclude the current row or group from the window:
  ```pgsql
  ... EXCLUDE |||type|||
  ```
- `|||type|||` here can be:
  - `CURRENT ROW`, which excludes the current row from the window
  - `GROUP` which excludes the current row and all other rows currently tied with it
  - `TIES` which just excludes the tied rows, but keeps the current row

## Partitioning
- Additionally, you can specify a partition for the window
- The default partition is the entire column
- Partitioning here is determined _before_ the rows comprising the window are computed
  - This has a similar feel to `GROUP BY`, but _every_ row will get a value here
- Window functions evaluated within each window within each partition
```pgsql
SELECT AVG(|||column|||) OVER (
    PARTITION BY |||column|||
  )
FROM |||table|||;
```

## Activity!
:::smaller
- Contained [here](../activity_data/halloween.sql) is an SQL file to generate a table of Halloween trick-or-treater data, which contains two columns:
  - The minute within an hour span when a trick-or-treater visited
  - What candy was given to the trick-or-treater
- You had a supply of 40 of each type of candy
- Using this information, answer the following:
  - What candies did you never run out of?
  - At what time did you run out of Starburst?
  - What times did you run out of the three most popular candies?
  - Construct a 5-minute rolling average of the number of trick-or-treaters you saw each minute of the evening. According to this, about when were things busiest?
:::

# Cleaning Tips
## Cleaning: Checking for Duplicates
- We've already seen in homework sample sets that there are sometimes duplicate row entries
- Duplicate entries across _many_ columns is usually the sign
  <!--- A duplicate entry in just one or two columns is frequently not cause for concern (unless you were expecting that column to have unique values)-->
- `GROUP BY` and `HAVING` can be very useful here!
  - `GROUP BY` all the columns that you want to check for repetition
  - Use `HAVING` to only grab those groups that have more than one element (and thus a repetition)
```pgsql
SELECT |||col1, col2, col3, col4|||, COUNT(*)
FROM |||table name|||
GROUP BY |||col1, col2, col3, col4|||
HAVING COUNT(*) > 1;
```

## Cleaning: Checking for Missing
- Data sets can sometimes (often even) be missing data that really should be present
- Fixing this generally requires some knowledge of what the data is representing
- Finding missing values can still tell you important things about the quality of your data though
- A nice way to count the number of nulls in different columns is:
```pgsql
SELECT
  COUNT(*) - COUNT(|||col1|||) as col1,
  COUNT(*) - COUNT(|||col2|||) as col2
FROM |||table name|||;
```

## Cleaning: Inconsistent Data
- Especially for textual fields, there can be variation in how data is entered
  - Typos happen, or people just refer to the same thing in different ways
- To use `GROUP BY` effectively, you really need categories to be consistently named across the data set
- Several possible approaches to identify:
  - Scanning over a sorted distinct column for quasi-duplications: phrases that are just a bit distinct
    - `GROUP BY` could do similar but add counts
  - Use pattern matching to look for duplicates matching a single approximate pattern
  - Use the fuzzystrmatch module to check Levershtein distances (probably rather slow)

## Damage Control
- It can be a lot of work to clean up a table
- Evaluate whether it is worth it!
  - Maybe a better, cleaner data set exists?
- Sometimes, data will be missing that you simply can't fill in
  - Can your analysis work around those columns?


# Break!
## Break Time!
![](../images/nomming_catepillar.png)


<!---------------- SQL ------------------>
<!--
## Overview
- Chapter 10 is essentially a tale in two parts, which are closely entwined
  - Data is likely going to be messy. How do you go about cleaning up data for further analysis. What sorts of anomalies should you keep an eye out for?
  - Tables are usually not static entities, but stores of data that might be changing. How you can make adjustments not only to the contents of a table but to the design of the table itself?
- We could look at either piece first, but let's look first at cleaning data to provide the motivation for altering our tables
  - Many of the issues we'll look at could have been avoided with good use of constraints. But for now we are supposing that the data was generated outside the table, and you have just imported it in naively.



## Cleaning: Checking Text Length
- For certain fields, you'd expect text of a certain number of characters
  - 2 for state abbreviations
  - 5 for zip codes
  - 13 for isbn13
- Can be a good idea to check these using the `LENGTH` string function
  - `LENGTH(str)` just returns the number of characters in said string
```pgsql
SELECT *
FROM |||table name|||
WHERE LENGTH(|||col1|||) != 5;
```

-->

# Changing Tables
## Making Adjustments
::::::cols
::::col
- Changing existing tables can generally be broken down into two categories:
  - Changing the structure of the table itself
    - Uses keywords `ALTER TABLE`
  - Changing the row content within the table
    - Uses keyword `UPDATE`
::::

::::col
![](../images/altering_table.jpg){width=100%}

::::
::::::

## Table Altering: Part 1
- `ALTER TABLE` is generally followed by the table name and then another keyword command, depending on what you want to do
  ```pgsql
  ALTER TABLE |||table name||| ...
  ```
  - Adding columns:
    ```pgsql
    ... ADD COLUMN |||column name||| ||| desired data type|||;
    ```
  - Removing columns
    ```pgsql
    ... DROP COLUMN |||column name|||;
    ```

## Table Altering: Part 2
```pgsql
ALTER TABLE |||table name||| ...
```
- Changing columns
  ```{.pgsql style='font-size:.9em'}
  ... ALTER COLUMN |||column name||| SET DATA TYPE |||data type|||;
  ... ALTER COLUMN |||column name||| SET NOT NULL;
  ```
- Renaming columns
  ```{.pgsql style='font-size:.9em'}
  ... RENAME |||column name||| TO |||new column name|||;
  ```
- Rename entire table
  ```{.pgsql style='font-size:.9em'}
  ... RENAME TO |||new table name|||;
  ```

## Updating Tables
- If you want to change the values in a particular row (or many rows), you don't want to alter the table, you want to `UPDATE` it
- `UPDATE` sets particular columns to a particular value
  - **BE CAREFUL!** If you do not specify _which_ rows, then **ALL** of the rows will have that column changed to that value
    - This is partly why having primary keys is so nice: it gives you a method to update just a single row should you need
```pgsql
UPDATE |||table name|||
SET |||column name||| = |||new value|||;
```
- You can specify which rows should be changed by filtering with `WHERE`

<!--
## Understanding Check
::::::cols
::::col
The table named `revq` to the right is acted upon by the below SQL queries. What entries in the table are left untouched once all queries have been run? Each choice corresponds to those highlighted cells.
::::

::::col

![](../images/uc_ch9_2.svg)

::::
::::::

```{.pgsql style='font-size:.85em;'}
ALTER TABLE revq ADD COLUMN col4 INT;
UPDATE revq SET col4 = col2;
UPDATE revq SET (col2,col3)=(col3, col2) WHERE col1 IN ('B','C');
UPDATE revq SET col3 = col3 - col4;
UPDATE revq SET col2 = col2 + col4 WHERE col3 > 10;
ALTER TABLE revq DROP COLUMN col4;
```
-->


## Backup Tables
- Frequently, if you are about to heavily modify a table, you should consider working on a backup copy
- We actually have already seen the basic machinery for this:
  ```{.pgsql style='font-size:0.9em'}
  CREATE TABLE |||new table||| AS
  SELECT * FROM |||original table|||;
  ```
- **Note:** Indexes and constraints are stored separately, and so are NOT copied over using this process!
- For including constraints and indexes, you can use a Postgres specific syntax, but the newly created table will initially be missing the data
  ```{.pgsql style='font-size:0.9em'}
  CREATE TABLE |||new table|||
  (LIKE |||original table||| INCLUDING ALL);
  ```

## Table to Table
- In some cases, you'll want to update or pass information across tables
  - Maybe one table has newer values that you want to use to update the original table
- In core SQL, you'd need to use subqueries, which we'll be talking about in a few chapters
- In Postgres, to update, you can use `FROM`:
  ```{.pgsql style='font-size:.9em'}
  UPDATE table_name1
  SET |||column name||| = table_name2.|||column name|||
  FROM table_name2
  WHERE table_name1.|||col1||| = table_name2.|||col1|||;
  ```
- To insert values from another table into another:
  ```{.pgsql style='font-size:.9em'}
  INSERT INTO |||new table||| SELECT * FROM |||original table|||;
  ```

## Deletions
- Similar to changing tables, removing things from tables has two main keywords:
  - `DROP` for removing aspects of a table like columns, constraints, indexes, or the table itself
  - `DELETE FROM` for removing rows from tables
- `DROP` will frequently come after an `ALTER TABLE` unless you are dropping the table itself
- `DELETE FROM` without a filter will **delete all rows**
  - Make absolutely sure you are using a filter if you don't want that to happen!
  - Another good reason to make up your tables before editing them


## Getting Deleted
```pgsql
ALTER TABLE |||table name||| DROP COLUMN |||column name|||;
ALTER TABLE |||table name||| DROP CONSTRAINT |||constraint name|||;
DROP INDEX |||index name|||;
DROP TABLE |||table name|||;

DELETE FROM |||table name|||; -- All rows gone!
DELETE FROM |||table name||| WHERE |||condition true|||;
```
- In general, unless you have an important reason, don't remove actual data from a table
  - You can filter it, you can create new tables that are missing that data, etc.

## Practice Activity
- There is a simple CSV of presidents and debt [here](../activity_data/public_debt_increases.sql)
- It has some problems regarding missing, duplicated, or inconsistent data which you should determine and fix _before_ answering the following questions:
  - What are the top 5 presidents to have the greatest average annual increase in national debt over the years of their presidency?
  - How do the median values of annual increases in national debt compare across party lines?
  - Trickier: What is the average _change_ in the annual increase percentage of national debt each year over all the years?


## Groups
- Work on these problems in pairs
  - Only one computer used for SQL at a time (the other can be used for documentation/slides)
  - Rotate who is typing every 5 mins.


## Question Answers
  - What are the top 5 presidents to have the greatest average annual increase in national debt over the years of their presidency?
    - Reagan, HW Bush, Ford, Carter, Obama
  - How do the median values of annual increases in national debt compare across party lines?
    - Democrats: 3.65%
    - Republicans: 7.40%
  - Trickier: What is the average _change_ in the annual increase percentage of national debt overall all the years?
    - Only about 0.05%, but seemingly a slight steady increase



# Tis Transactional
## Transactions
- _Atomicity_ is an important aspect of most database changes
  - The idea that related changes should happen in a single, self-contained step
- Many changes you might make to a database have several steps though!
  - Need to change one value in one table and another value in another table
  - Need to create a new row and then copy some information into it
- Remember that, in general, others can access the database at the same time
  - What if they tried to access the data you were working on mid-operation?
- To solve these issues, SQL has the concept of a _transaction_


## Bundling Up
- A _transaction_ is essentially a bundling of several statements into one, discrete change to the database
- Commands within the transaction have not yet modified the database, but exist only in local memory
- Changes get written to the database all at once upon the conclusion of the transaction
- Starting a transaction?
  - `START TRANSACTION;` or `BEGIN;`
- Ending a transaction?
  - `COMMIT;` actually makes the changes
  - `ROLLBACK;` throws out everything within the transaction


## Uses of Transactions
:::smaller
- Protecting against system faults
  - What if you have a system crash in the middle of an operation?
    - What commands had been run? What commands had not?
  - Transactions actually write to a log what they are _going_ to do before they actually do it. So in case of a crash, then the transaction can then simply be rerun
- Protecting against simultaneous access
  - Changes occur all at once, so it is impossible for another database user to access data "mid-change"
  - Other users of the database will see none of your changes until actually committed
- Testing changes
  - Sometimes it is useful to check to see that some changes look the way you wanted before actually changing the database
  - Embedding within a transaction block always gives you the option to rollback
:::

# Project Transformations
## Transforming Data
- Often you will need to move data from one location to another to organize it, possibly cleaning it along the way
- There are a variety of automation setups that can aid with this, though most all require some scripting knowledge
- For many transformations though, we can accomplish what is necessary purely through SQL and some scheduling

## Project Transformation Checklist
:::smaller
- Create a GitHub repository of your own using the template repository [here](https://github.com/rembold-data-engineering-master/db_transformation_template)
- Download the created repository to your system
- Edit the `clean.sql`{.text} file to add whatever commands you need to move, organize, and clean your data.
- Upload the new `clean.sql`{.text} file back to your repository, keeping the filename the same!
- In your project on Railway, click New and then "GitHub Repo"
  - You'll have to agree to let Railway access your account, and you can specify only certain repositories
  - Select your created repository
- Selecting the repository, go to settings and then scroll down to the CRON schedule to set up a schedule
:::

## Things to keep in mind
- You'll need to create more tables to organize your data. These get run a **single** time, so they should not go in `clean.sql`{.text}!
- It can be useful to empty out your original dumping table after you organize things each time so that you don't "double process" anything
- If you are getting duplicate information occasionally through your API, it can be useful to understand "Upserting":
  ```pgsql
  INSERT INTO |||new table||| VALUES
  SELECT |||special id, other cols||| FROM |||old table|||
  ON CONFLICT |||column or constraint name|||
  DO NOTHING
  ```
