Windows into Transformation

Jed Rembold

March 12, 2025

Announcements

  • Homework 7 is due next Wednesday night (coming out this evening)
  • My grading quest continues
  • Project Pairs/Groups were sent out and Project Guidelines shared!
    • Some good target deadlines would be to:
      • Have your scraper up and scraping by this weekend
      • Have a transforming/cleaning container up and running by the end of next week (to be introduced later today)

Getting in the Zone

Time Zones

  • Dealing with time zones can be a headache, and it is a very nice feature that Postgres can work with them smoothly
  • By default, Postgres will display any timestamp with a time zone with the time as you would measure it in your current system timezone
  • What is your current system timezone?
    • SHOW timezone;
  • Getting general information about timezones:
    • Getting abbreviations:
      • SELECT * FROM pg_timezone_abbrevs;
    • Getting full names:
      • SELECT * FROM pg_timezone_names;

Teleportation

  • It can sometimes be useful to switch your “current” time zone
    • Maybe it is easier to compare times to someone else living in that time zone
  • Several methods to make the switch:
    • Change your postgressql.conf file, which controls your Postgres server. Only recommended if you have permanently moved elsewhere and the database time zone has not updated appropriately.

    • Set future queries in a single session to be from a new timezone:

      SET |||timezone||| TO |||time_zone_name_or_abbrv|||;

      • This will also adjust what values your localtime or localtimestamp report!
    • Transform a single query to be reported in a different time zone:

      SELECT |||dt_col_name||| AT TIME ZONE |||tz_name_or_abbrv|||
      FROM |||tablename|||;

Activity

  • Using the taxi rides dataset, see if you can:
    • Compute the number of rides given each hour of the day
    • Compute the average cost of rides over each day of the month
    • Compute the median cost of rides over each day of the week
    • Compute the average duration of each ride over each hour of the day

Through the Window

Window Functions

  • One useful piece of kit that is only briefly mentioned in the text in Ch11 is that of window functions
  • A window function is like a mix between a normal column value and an aggregate function
    • Unlike aggregate functions, a window function returns a value for each row
    • Unlike normal column values, a window function can utilize other rows included within its “window” in making an aggregation
  • Any normal aggregate function can be used as a window function, though there are specific window functions as well
  • Window functions can only be used inside SELECT or ORDER BY statements
    • They are evaluated after any filtering, grouping, or normal aggregations

Over the Hill

  • The defining characteristic of any window function is the OVER () keyword, which comes after the aggregating window function
  • Content inside the () determines the “window” of the window function
  • By default, if nothing is provided, the entire column is the window
  • The below would output the average of the column in every row
    • There will be ways we can also achieve this through subqueries
    SELECT AVG(|||col|||) OVER ()
    FROM |||table name|||;

Dedicated Window Functions

  • You can use any existing aggregate function as a window function, but there are also more specific window functions (full list here)
Function Description
row_number() Assigns an ascending row number to each row in a window
rank() Assigns an ascending rank to each row, with possible ties skipping the next value
dense_rank() Assigns an ascending rank to each row, with possible ties not skipping the next value
first_value(|||col|||) Returns the first value in the window of column |||col|||
last_value(|||col|||) Returns the last value in the window of column |||col|||
lag(|||col|||) Returns the previous row of column |||col|||
lead(|||col|||) Returns the next row of column |||col|||
nth_value(|||col|||, |||n|||) Returns the |||n|||th row of column |||col||| (NULL if doesn’t exist)

DETERMINING ORDER

  • Often, to be useful, you may want to define an ordering inside the OVER () statement
  • As soon as you specify an ordering, the default window changes
    • By default, each window now encompasses everything from the first row, up to that current row
      • Easiest to see with classic aggregate functions
SELECT COUNT(*) OVER (ORDER BY |||col|||)
FROM |||table name|||;

Tweaking the Window

  • You can tweak this window by specifying the starting and stopping point, using the syntax:

    ...|||type||| BETWEEN |||offset||| PRECEDING AND |||offset||| FOLLOWING

    which appears in the OVER clause, after any provided ordering

    • |||type||| can be either ROWS, RANGE, or GROUP
    • |||offset||| can be
      • an non-null, non-negative integer or UNBOUNDED if the type is ROWS or GROUP
      • an value that makes sense to add or subtract from the ordered column if the type is RANGE

Excluding the Window

  • Can also exclude the current row or group from the window:

    ... EXCLUDE |||type|||
  • |||type||| here can be:

    • CURRENT ROW, which excludes the current row from the window
    • GROUP which excludes the current row and all other rows currently tied with it
    • TIES which just excludes the tied rows, but keeps the current row

Partitioning

  • Additionally, you can specify a partition for the window
  • The default partition is the entire column
  • Partitioning here is determined before the rows comprising the window are computed
    • This has a similar feel to GROUP BY, but every row will get a value here
  • Window functions evaluated within each window within each partition
SELECT AVG(|||column|||) OVER (
    PARTITION BY |||column|||
  )
FROM |||table|||;

Activity!

  • Contained here is an SQL file to generate a table of Halloween trick-or-treater data, which contains two columns:
    • The minute within an hour span when a trick-or-treater visited
    • What candy was given to the trick-or-treater
  • You had a supply of 40 of each type of candy
  • Using this information, answer the following:
    • What candies did you never run out of?
    • At what time did you run out of Starburst?
    • What times did you run out of the three most popular candies?
    • Construct a 5-minute rolling average of the number of trick-or-treaters you saw each minute of the evening. According to this, about when were things busiest?

Cleaning Tips

Cleaning: Checking for Duplicates

  • We’ve already seen in homework sample sets that there are sometimes duplicate row entries
  • Duplicate entries across many columns is usually the sign
  • GROUP BY and HAVING can be very useful here!
    • GROUP BY all the columns that you want to check for repetition
    • Use HAVING to only grab those groups that have more than one element (and thus a repetition)
    SELECT |||col1, col2, col3, col4|||, COUNT(*)
    FROM |||table name|||
    GROUP BY |||col1, col2, col3, col4|||
    HAVING COUNT(*) > 1;

Cleaning: Checking for Missing

  • Data sets can sometimes (often even) be missing data that really should be present
  • Fixing this generally requires some knowledge of what the data is representing
  • Finding missing values can still tell you important things about the quality of your data though
  • A nice way to count the number of nulls in different columns is:
SELECT
  COUNT(*) - COUNT(|||col1|||) as col1,
  COUNT(*) - COUNT(|||col2|||) as col2
FROM |||table name|||;

Cleaning: Inconsistent Data

  • Especially for textual fields, there can be variation in how data is entered
    • Typos happen, or people just refer to the same thing in different ways
  • To use GROUP BY effectively, you really need categories to be consistently named across the data set
  • Several possible approaches to identify:
    • Scanning over a sorted distinct column for quasi-duplications: phrases that are just a bit distinct
      • GROUP BY could do similar but add counts
    • Use pattern matching to look for duplicates matching a single approximate pattern
    • Use the fuzzystrmatch module to check Levershtein distances (probably rather slow)

Damage Control

  • It can be a lot of work to clean up a table
  • Evaluate whether it is worth it!
    • Maybe a better, cleaner data set exists?
  • Sometimes, data will be missing that you simply can’t fill in
    • Can your analysis work around those columns?

Break!

Break Time!

Changing Tables

Making Adjustments

  • Changing existing tables can generally be broken down into two categories:
    • Changing the structure of the table itself
      • Uses keywords ALTER TABLE
    • Changing the row content within the table
      • Uses keyword UPDATE

Table Altering: Part 1

  • ALTER TABLE is generally followed by the table name and then another keyword command, depending on what you want to do

    ALTER TABLE |||table name||| ...
    • Adding columns:

      ... ADD COLUMN |||column name||| ||| desired data type|||;
    • Removing columns

      ... DROP COLUMN |||column name|||;

Table Altering: Part 2

ALTER TABLE |||table name||| ...
  • Changing columns

    ... ALTER COLUMN |||column name||| SET DATA TYPE |||data type|||;
    ... ALTER COLUMN |||column name||| SET NOT NULL;
  • Renaming columns

    ... RENAME |||column name||| TO |||new column name|||;
  • Rename entire table

    ... RENAME TO |||new table name|||;

Updating Tables

  • If you want to change the values in a particular row (or many rows), you don’t want to alter the table, you want to UPDATE it
  • UPDATE sets particular columns to a particular value
    • BE CAREFUL! If you do not specify which rows, then ALL of the rows will have that column changed to that value
      • This is partly why having primary keys is so nice: it gives you a method to update just a single row should you need
      UPDATE |||table name|||
      SET |||column name||| = |||new value|||;
  • You can specify which rows should be changed by filtering with WHERE

Backup Tables

  • Frequently, if you are about to heavily modify a table, you should consider working on a backup copy

  • We actually have already seen the basic machinery for this:

    CREATE TABLE |||new table||| AS
    SELECT * FROM |||original table|||;
  • Note: Indexes and constraints are stored separately, and so are NOT copied over using this process!

  • For including constraints and indexes, you can use a Postgres specific syntax, but the newly created table will initially be missing the data

    CREATE TABLE |||new table|||
    (LIKE |||original table||| INCLUDING ALL);

Table to Table

  • In some cases, you’ll want to update or pass information across tables

    • Maybe one table has newer values that you want to use to update the original table
  • In core SQL, you’d need to use subqueries, which we’ll be talking about in a few chapters

  • In Postgres, to update, you can use FROM:

    UPDATE table_name1
    SET |||column name||| = table_name2.|||column name|||
    FROM table_name2
    WHERE table_name1.|||col1||| = table_name2.|||col1|||;
  • To insert values from another table into another:

    INSERT INTO |||new table||| SELECT * FROM |||original table|||;

Deletions

  • Similar to changing tables, removing things from tables has two main keywords:
    • DROP for removing aspects of a table like columns, constraints, indexes, or the table itself
    • DELETE FROM for removing rows from tables
  • DROP will frequently come after an ALTER TABLE unless you are dropping the table itself
  • DELETE FROM without a filter will delete all rows
    • Make absolutely sure you are using a filter if you don’t want that to happen!
    • Another good reason to make up your tables before editing them

Getting Deleted

ALTER TABLE |||table name||| DROP COLUMN |||column name|||;
ALTER TABLE |||table name||| DROP CONSTRAINT |||constraint name|||;
DROP INDEX |||index name|||;
DROP TABLE |||table name|||;

DELETE FROM |||table name|||; -- All rows gone!
DELETE FROM |||table name||| WHERE |||condition true|||;
  • In general, unless you have an important reason, don’t remove actual data from a table
    • You can filter it, you can create new tables that are missing that data, etc.

Practice Activity

  • There is a simple CSV of presidents and debt here
  • It has some problems regarding missing, duplicated, or inconsistent data which you should determine and fix before answering the following questions:
    • What are the top 5 presidents to have the greatest average annual increase in national debt over the years of their presidency?
    • How do the median values of annual increases in national debt compare across party lines?
    • Trickier: What is the average change in the annual increase percentage of national debt each year over all the years?

Groups

  • Work on these problems in pairs
    • Only one computer used for SQL at a time (the other can be used for documentation/slides)
    • Rotate who is typing every 5 mins.

Question Answers

  • What are the top 5 presidents to have the greatest average annual increase in national debt over the years of their presidency?
    • Reagan, HW Bush, Ford, Carter, Obama
  • How do the median values of annual increases in national debt compare across party lines?
    • Democrats: 3.65%
    • Republicans: 7.40%
  • Trickier: What is the average change in the annual increase percentage of national debt overall all the years?
    • Only about 0.05%, but seemingly a slight steady increase

Tis Transactional

Transactions

  • Atomicity is an important aspect of most database changes
    • The idea that related changes should happen in a single, self-contained step
  • Many changes you might make to a database have several steps though!
    • Need to change one value in one table and another value in another table
    • Need to create a new row and then copy some information into it
  • Remember that, in general, others can access the database at the same time
    • What if they tried to access the data you were working on mid-operation?
  • To solve these issues, SQL has the concept of a transaction

Bundling Up

  • A transaction is essentially a bundling of several statements into one, discrete change to the database
  • Commands within the transaction have not yet modified the database, but exist only in local memory
  • Changes get written to the database all at once upon the conclusion of the transaction
  • Starting a transaction?
    • START TRANSACTION; or BEGIN;
  • Ending a transaction?
    • COMMIT; actually makes the changes
    • ROLLBACK; throws out everything within the transaction

Uses of Transactions

  • Protecting against system faults
    • What if you have a system crash in the middle of an operation?
      • What commands had been run? What commands had not?
    • Transactions actually write to a log what they are going to do before they actually do it. So in case of a crash, then the transaction can then simply be rerun
  • Protecting against simultaneous access
    • Changes occur all at once, so it is impossible for another database user to access data “mid-change”
    • Other users of the database will see none of your changes until actually committed
  • Testing changes
    • Sometimes it is useful to check to see that some changes look the way you wanted before actually changing the database
    • Embedding within a transaction block always gives you the option to rollback

Project Transformations

Transforming Data

  • Often you will need to move data from one location to another to organize it, possibly cleaning it along the way
  • There are a variety of automation setups that can aid with this, though most all require some scripting knowledge
  • For many transformations though, we can accomplish what is necessary purely through SQL and some scheduling

Project Transformation Checklist

  • Create a GitHub repository of your own using the template repository here
  • Download the created repository to your system
  • Edit the clean.sql file to add whatever commands you need to move, organize, and clean your data.
  • Upload the new clean.sql file back to your repository, keeping the filename the same!
  • In your project on Railway, click New and then “GitHub Repo”
    • You’ll have to agree to let Railway access your account, and you can specify only certain repositories
    • Select your created repository
  • Selecting the repository, go to settings and then scroll down to the CRON schedule to set up a schedule

Things to keep in mind

  • You’ll need to create more tables to organize your data. These get run a single time, so they should not go in clean.sql!

  • It can be useful to empty out your original dumping table after you organize things each time so that you don’t “double process” anything

  • If you are getting duplicate information occasionally through your API, it can be useful to understand “Upserting”:

    INSERT INTO |||new table||| VALUES
    SELECT |||special id, other cols||| FROM |||old table|||
    ON CONFLICT |||column or constraint name|||
    DO NOTHING
// reveal.js plugins // Added plugins