Windows into Transformation

Jed Rembold

March 12, 2025

Announcements

Homework 7 is due next Wednesday night (coming out this evening)
My grading quest continues
Project Pairs/Groups were sent out and Project Guidelines shared!
- Some good target deadlines would be to:
  - Have your scraper up and scraping by this weekend
  - Have a transforming/cleaning container up and running by the end of next week (to be introduced later today)

Getting in the Zone

Time Zones

Dealing with time zones can be a headache, and it is a very nice feature that Postgres can work with them smoothly
By default, Postgres will display any timestamp with a time zone with the time as you would measure it in your current system timezone
What is your current system timezone?
- SHOW timezone;
Getting general information about timezones:
- Getting abbreviations:
  - SELECT * FROM pg_timezone_abbrevs;
- Getting full names:
  - SELECT * FROM pg_timezone_names;

Teleportation

It can sometimes be useful to switch your “current” time zone
- Maybe it is easier to compare times to someone else living in that time zone
Several methods to make the switch:
- Change your postgressql.conf file, which controls your Postgres server. Only recommended if you have permanently moved elsewhere and the database time zone has not updated appropriately.
- Set future queries in a single session to be from a new timezone:
  
  SET |||timezone||| TO |||time_zone_name_or_abbrv|||;
  - This will also adjust what values your localtime or localtimestamp report!
- Transform a single query to be reported in a different time zone:
```
SELECT |||dt_col_name||| AT TIME ZONE |||tz_name_or_abbrv|||
FROM |||tablename|||;
```

Activity

Using the taxi rides dataset, see if you can:
- Compute the number of rides given each hour of the day
- Compute the average cost of rides over each day of the month
- Compute the median cost of rides over each day of the week
- Compute the average duration of each ride over each hour of the day

Through the Window

Window Functions

One useful piece of kit that is only briefly mentioned in the text in Ch11 is that of window functions
A window function is like a mix between a normal column value and an aggregate function
- Unlike aggregate functions, a window function returns a value for each row
- Unlike normal column values, a window function can utilize other rows included within its “window” in making an aggregation
Any normal aggregate function can be used as a window function, though there are specific window functions as well
Window functions can only be used inside SELECT or ORDER BY statements
- They are evaluated after any filtering, grouping, or normal aggregations

Over the Hill

The defining characteristic of any window function is the OVER () keyword, which comes after the aggregating window function
Content inside the () determines the “window” of the window function
By default, if nothing is provided, the entire column is the window
The below would output the average of the column in every row
- There will be ways we can also achieve this through subqueries
```
SELECT AVG(|||col|||) OVER ()
FROM |||table name|||;
```

Dedicated Window Functions

You can use any existing aggregate function as a window function, but there are also more specific window functions (full list here)

Function	Description
`row_number()`	Assigns an ascending row number to each row in a window
`rank()`	Assigns an ascending rank to each row, with possible ties skipping the next value
`dense_rank()`	Assigns an ascending rank to each row, with possible ties not skipping the next value
`first_value(\|\|\|col\|\|\|)`	Returns the first value in the window of column `\|\|\|col\|\|\|`
`last_value(\|\|\|col\|\|\|)`	Returns the last value in the window of column `\|\|\|col\|\|\|`
`lag(\|\|\|col\|\|\|)`	Returns the previous row of column `\|\|\|col\|\|\|`
`lead(\|\|\|col\|\|\|)`	Returns the next row of column `\|\|\|col\|\|\|`
`nth_value(\|\|\|col\|\|\|, \|\|\|n\|\|\|)`	Returns the `\|\|\|n\|\|\|`th row of column `\|\|\|col\|\|\|` (`NULL` if doesn’t exist)

DETERMINING ORDER

Often, to be useful, you may want to define an ordering inside the OVER () statement
As soon as you specify an ordering, the default window changes
- By default, each window now encompasses everything from the first row, up to that current row
  - Easiest to see with classic aggregate functions

SELECT COUNT(*) OVER (ORDER BY |||col|||)
FROM |||table name|||;

Tweaking the Window

You can tweak this window by specifying the starting and stopping point, using the syntax:
```
...|||type||| BETWEEN |||offset||| PRECEDING AND |||offset||| FOLLOWING
```
which appears in the OVER clause, after any provided ordering
- |||type||| can be either ROWS, RANGE, or GROUP
- |||offset||| can be
  - an non-null, non-negative integer or UNBOUNDED if the type is ROWS or GROUP
  - an value that makes sense to add or subtract from the ordered column if the type is RANGE

Excluding the Window

Can also exclude the current row or group from the window:
```
... EXCLUDE |||type|||
```
|||type||| here can be:
- CURRENT ROW, which excludes the current row from the window
- GROUP which excludes the current row and all other rows currently tied with it
- TIES which just excludes the tied rows, but keeps the current row

Partitioning

Additionally, you can specify a partition for the window
The default partition is the entire column
Partitioning here is determined before the rows comprising the window are computed
- This has a similar feel to GROUP BY, but every row will get a value here
Window functions evaluated within each window within each partition

SELECT AVG(|||column|||) OVER (
    PARTITION BY |||column|||
  )
FROM |||table|||;

Activity!

Contained here is an SQL file to generate a table of Halloween trick-or-treater data, which contains two columns:
- The minute within an hour span when a trick-or-treater visited
- What candy was given to the trick-or-treater
You had a supply of 40 of each type of candy
Using this information, answer the following:
- What candies did you never run out of?
- At what time did you run out of Starburst?
- What times did you run out of the three most popular candies?
- Construct a 5-minute rolling average of the number of trick-or-treaters you saw each minute of the evening. According to this, about when were things busiest?

Cleaning Tips

Cleaning: Checking for Duplicates

We’ve already seen in homework sample sets that there are sometimes duplicate row entries
Duplicate entries across many columns is usually the sign
GROUP BY and HAVING can be very useful here!
- GROUP BY all the columns that you want to check for repetition
- Use HAVING to only grab those groups that have more than one element (and thus a repetition)
```
SELECT |||col1, col2, col3, col4|||, COUNT(*)
FROM |||table name|||
GROUP BY |||col1, col2, col3, col4|||
HAVING COUNT(*) > 1;
```

Cleaning: Checking for Missing

Data sets can sometimes (often even) be missing data that really should be present
Fixing this generally requires some knowledge of what the data is representing
Finding missing values can still tell you important things about the quality of your data though
A nice way to count the number of nulls in different columns is:

SELECT
  COUNT(*) - COUNT(|||col1|||) as col1,
  COUNT(*) - COUNT(|||col2|||) as col2
FROM |||table name|||;

Cleaning: Inconsistent Data

Especially for textual fields, there can be variation in how data is entered
- Typos happen, or people just refer to the same thing in different ways
To use GROUP BY effectively, you really need categories to be consistently named across the data set
Several possible approaches to identify:
- Scanning over a sorted distinct column for quasi-duplications: phrases that are just a bit distinct
  - GROUP BY could do similar but add counts
- Use pattern matching to look for duplicates matching a single approximate pattern
- Use the fuzzystrmatch module to check Levershtein distances (probably rather slow)

Damage Control

It can be a lot of work to clean up a table
Evaluate whether it is worth it!
- Maybe a better, cleaner data set exists?
Sometimes, data will be missing that you simply can’t fill in
- Can your analysis work around those columns?

Break!

Break Time!

Changing Tables

Making Adjustments

Changing existing tables can generally be broken down into two categories:
- Changing the structure of the table itself
  - Uses keywords ALTER TABLE
- Changing the row content within the table
  - Uses keyword UPDATE

Table Altering: Part 1

ALTER TABLE is generally followed by the table name and then another keyword command, depending on what you want to do
```
ALTER TABLE |||table name||| ...
```
- Adding columns:
```
... ADD COLUMN |||column name||| ||| desired data type|||;
```
- Removing columns
```
... DROP COLUMN |||column name|||;
```

Table Altering: Part 2

ALTER TABLE |||table name||| ...

Changing columns

... ALTER COLUMN |||column name||| SET DATA TYPE |||data type|||;
... ALTER COLUMN |||column name||| SET NOT NULL;

Renaming columns

... RENAME |||column name||| TO |||new column name|||;

Rename entire table
```
... RENAME TO |||new table name|||;
```

Updating Tables

If you want to change the values in a particular row (or many rows), you don’t want to alter the table, you want to UPDATE it
UPDATE sets particular columns to a particular value
- BE CAREFUL! If you do not specify which rows, then ALL of the rows will have that column changed to that value
  - This is partly why having primary keys is so nice: it gives you a method to update just a single row should you need
```
UPDATE |||table name|||
SET |||column name||| = |||new value|||;
```
You can specify which rows should be changed by filtering with WHERE

Backup Tables

Frequently, if you are about to heavily modify a table, you should consider working on a backup copy

We actually have already seen the basic machinery for this:

CREATE TABLE |||new table||| AS
SELECT * FROM |||original table|||;

Note: Indexes and constraints are stored separately, and so are NOT copied over using this process!
For including constraints and indexes, you can use a Postgres specific syntax, but the newly created table will initially be missing the data
```
CREATE TABLE |||new table|||
(LIKE |||original table||| INCLUDING ALL);
```

Table to Table

In some cases, you’ll want to update or pass information across tables
- Maybe one table has newer values that you want to use to update the original table
In core SQL, you’d need to use subqueries, which we’ll be talking about in a few chapters

In Postgres, to update, you can use FROM:

UPDATE table_name1
SET |||column name||| = table_name2.|||column name|||
FROM table_name2
WHERE table_name1.|||col1||| = table_name2.|||col1|||;

To insert values from another table into another:

INSERT INTO |||new table||| SELECT * FROM |||original table|||;

Deletions

Similar to changing tables, removing things from tables has two main keywords:
- DROP for removing aspects of a table like columns, constraints, indexes, or the table itself
- DELETE FROM for removing rows from tables
DROP will frequently come after an ALTER TABLE unless you are dropping the table itself
DELETE FROM without a filter will delete all rows
- Make absolutely sure you are using a filter if you don’t want that to happen!
- Another good reason to make up your tables before editing them

Getting Deleted

ALTER TABLE |||table name||| DROP COLUMN |||column name|||;
ALTER TABLE |||table name||| DROP CONSTRAINT |||constraint name|||;
DROP INDEX |||index name|||;
DROP TABLE |||table name|||;

DELETE FROM |||table name|||; -- All rows gone!
DELETE FROM |||table name||| WHERE |||condition true|||;

In general, unless you have an important reason, don’t remove actual data from a table
- You can filter it, you can create new tables that are missing that data, etc.

Practice Activity

There is a simple CSV of presidents and debt here
It has some problems regarding missing, duplicated, or inconsistent data which you should determine and fix before answering the following questions:
- What are the top 5 presidents to have the greatest average annual increase in national debt over the years of their presidency?
- How do the median values of annual increases in national debt compare across party lines?
- Trickier: What is the average change in the annual increase percentage of national debt each year over all the years?

Groups

Work on these problems in pairs
- Only one computer used for SQL at a time (the other can be used for documentation/slides)
- Rotate who is typing every 5 mins.

Question Answers

What are the top 5 presidents to have the greatest average annual increase in national debt over the years of their presidency?
- Reagan, HW Bush, Ford, Carter, Obama
How do the median values of annual increases in national debt compare across party lines?
- Democrats: 3.65%
- Republicans: 7.40%
Trickier: What is the average change in the annual increase percentage of national debt overall all the years?
- Only about 0.05%, but seemingly a slight steady increase

Tis Transactional

Transactions

Atomicity is an important aspect of most database changes
- The idea that related changes should happen in a single, self-contained step
Many changes you might make to a database have several steps though!
- Need to change one value in one table and another value in another table
- Need to create a new row and then copy some information into it
Remember that, in general, others can access the database at the same time
- What if they tried to access the data you were working on mid-operation?
To solve these issues, SQL has the concept of a transaction

Bundling Up

A transaction is essentially a bundling of several statements into one, discrete change to the database
Commands within the transaction have not yet modified the database, but exist only in local memory
Changes get written to the database all at once upon the conclusion of the transaction
Starting a transaction?
- START TRANSACTION; or BEGIN;
Ending a transaction?
- COMMIT; actually makes the changes
- ROLLBACK; throws out everything within the transaction

Uses of Transactions

Protecting against system faults
- What if you have a system crash in the middle of an operation?
  - What commands had been run? What commands had not?
- Transactions actually write to a log what they are going to do before they actually do it. So in case of a crash, then the transaction can then simply be rerun
Protecting against simultaneous access
- Changes occur all at once, so it is impossible for another database user to access data “mid-change”
- Other users of the database will see none of your changes until actually committed
Testing changes
- Sometimes it is useful to check to see that some changes look the way you wanted before actually changing the database
- Embedding within a transaction block always gives you the option to rollback

Project Transformations

Transforming Data

Often you will need to move data from one location to another to organize it, possibly cleaning it along the way
There are a variety of automation setups that can aid with this, though most all require some scripting knowledge
For many transformations though, we can accomplish what is necessary purely through SQL and some scheduling

Project Transformation Checklist

Create a GitHub repository of your own using the template repository here
Download the created repository to your system
Edit the clean.sql file to add whatever commands you need to move, organize, and clean your data.
Upload the new clean.sql file back to your repository, keeping the filename the same!
In your project on Railway, click New and then “GitHub Repo”
- You’ll have to agree to let Railway access your account, and you can specify only certain repositories
- Select your created repository
Selecting the repository, go to settings and then scroll down to the CRON schedule to set up a schedule

Things to keep in mind

You’ll need to create more tables to organize your data. These get run a single time, so they should not go in clean.sql!
It can be useful to empty out your original dumping table after you organize things each time so that you don’t “double process” anything

If you are getting duplicate information occasionally through your API, it can be useful to understand “Upserting”:

INSERT INTO |||new table||| VALUES
SELECT |||special id, other cols||| FROM |||old table|||
ON CONFLICT |||column or constraint name|||
DO NOTHING

Function	Description
`row_number()`	Assigns an ascending row number to each row in a window
`rank()`	Assigns an ascending rank to each row, with possible ties skipping the next value
`dense_rank()`	Assigns an ascending rank to each row, with possible ties not skipping the next value
`first_value(\|\|\|col\|\|\|)`	Returns the first value in the window of column `\|\|\|col\|\|\|`
`last_value(\|\|\|col\|\|\|)`	Returns the last value in the window of column `\|\|\|col\|\|\|`
`lag(\|\|\|col\|\|\|)`	Returns the previous row of column `\|\|\|col\|\|\|`
`lead(\|\|\|col\|\|\|)`	Returns the next row of column `\|\|\|col\|\|\|`
`nth_value(\|\|\|col\|\|\|, \|\|\|n\|\|\|)`	Returns the `\|\|\|n\|\|\|`th row of column `\|\|\|col\|\|\|` (`NULL` if doesn’t exist)