Some Slight Alterations

Jed Rembold

Wednesday, October 8, 2025

Announcements

Homework 4 feedback is coming! Sorry, exam grading has demanded my attention.
Homework 5 due tonight
- Homework 6 will be posted tonight

The Showdown Returns

The GROUP BY Gauntlet!

Divide into groups based on the next slide. It is time for another SQL showdown!
Each group should have:
- one person designated as the answer submitter
- one person designated as the SQL typer (must rotate each question)
- one computer that has the superheroes tables from HW4 on it
Navigate to pollev.com/jedrembold441 and come up with a fun group name
You’ll have about 5 minutes to answer each question. Submitting an answer faster gets you more points!

Groups

Group 1: Cecil, Emma S, Kara, Julia
Group 2: Jose, Catalina, Jimmy
Group 3: Beyla, Emma M, Greg, Thomas
Group 4: Addison, Nico, Sophie
Group 5: Meghla, Connor, Finn, Evyn

Group 6: Michael, Tegan, Cody, Jackson
Group 7: KJ, Evan, Talia
Group 8: Farley, Oliver, Jude, Dominic
Group 9: Salvador, Aaron, Saul
Group 10: Lola, Judah, Mia

Locations!

Chapter 10

Overview

Chapter 10 is essentially a tale in two parts, which are closely entwined
- Data is likely going to be messy. How do you go about cleaning up data for further analysis. What sorts of anomolies should you keep an eye out for?
- Tables are usually not static entities, but stores of data that might be changing. How you can make adjustments not only to the contents of a table but to the design of the table itself?

Data Cleaning

Cleaning: Checking for Duplicates

We’ve already seen in homework sample sets that there are sometimes duplicate row entries
Duplicate entries across many columns is usually the sign
GROUP BY and HAVING can be very useful here!
- GROUP BY all the columns that you want to check for repetition
- Use HAVING to only grab those groups that have more than one element (and thus a repetition)
```
SELECT |||col1|||, |||col2|||, |||col3|||, |||col4|||, COUNT(*)
FROM |||table_name|||
GROUP BY |||col1|||, |||col2|||, |||col3|||, |||col4|||
HAVING COUNT(*) > 1;
```

Cleaning: Checking for Missing

Data sets can sometimes (often even) be missing data that really should be present
Fixing this generally requires some knowledge of what the data is representing
Finding missing values can still tell you important things about the quality of your data though
A nice way to count the number of nulls in different columns is:

SELECT
  COUNT(*) - COUNT(|||col1|||) as |||col1|||,
  COUNT(*) - COUNT(|||col2|||) as |||col2|||
FROM |||table_name|||;

Cleaning: Inconsistent Data

Especially for textual fields, there can be variation in how data is entered
- Typos happen, or people just refer to the same thing in different ways
To use GROUP BY effectively, you really need categories to be consistently named across the data set
Several possible approaches to identify:
- Scanning over a distinct column for quasi-duplications: phrases that are just a bit distinct
  - GROUP BY could do similar but add counts
- Use pattern matching to look for duplicates matching a single approximate pattern
- Use the fuzzystrmatch module to check Levershtein distances

Cleaning: Checking Text Length

For certain fields, you’d expect text of a certain number of characters
- 2 for state abbreviations
- 5 for zip codes
- 13 for isbn13
Can be a good idea to check these using the LENGTH string function
- LENGTH(str) just returns the number of characters in said string
```
SELECT *
FROM |||table_name|||
WHERE LENGTH(|||col1|||) != 5;
```

Damage Control

It can be a lot of work to clean up a table
Evaluate whether it is worth it!
- Maybe a better, cleaner data set exists?
Sometimes, data will be missing that you simply can’t fill in
- Can your analysis work around those columns?

Table Tweaking

Making Adjustments

Changing existing tables can generally be broken down into two categories:
- Changing the structure of the table itself
  - Uses keywords ALTER TABLE
- Changing the row content within the table
  - Uses keyword UPDATE

Table Altering: Part 1

ALTER TABLE is generally followed by the table name and then another keyword command, depending on what you want to do

ALTER TABLE |||table_name||| ...

Adding columns:

... ADD COLUMN |||col_name||| |||data_type|||;

Removing columns
```
... DROP COLUMN |||col_name|||;
```

Table Altering: Part 2

ALTER TABLE |||table_name||| ...

Changing columns

... ALTER COLUMN |||col_name||| SET DATA TYPE |||data_type|||;
... ALTER COLUMN |||col_name||| SET NOT NULL;

Renaming columns

... RENAME |||col_name||| TO |||new_col_name|||;

Rename entire table
```
... RENAME TO |||new_table_name|||;
```

Updating Tables

If you want to change the values in a particular row (or many rows), you don’t want to alter the table, you want to UPDATE it
UPDATE sets particular columns to a particular value
- BE CAREFUL! If you do not specify which rows, then ALL of the rows will have that column changed to that value
  - This is partly why having primary keys is so nice: it gives you a method to update just a single row should you need
```
UPDATE |||table_name|||
SET |||col_name||| = |||new_value|||;
```
You can specify which rows should be changed by filtering with WHERE

Table to Table

In some cases, you’ll want to update information across tables
- Maybe one table has newer values that you want to use to update the original table
In core SQL, you’d need to use subqueries, which we’ll be talking about in a few chapters
In Postgres, you can use FROM:

UPDATE |||table_name|||
SET |||col_name||| = |||table_name₂|||.|||col_name|||
FROM |||table_name₂|||
WHERE |||table_name|||.|||col_name₂||| = |||table_name₂|||.|||col_name₂|||