Check the Index

Jed Rembold

Wednesday, October 1, 2025

Announcements

I’ll be sending out grade reports by the end of the week so that you know where you are at in the class
Don’t forget homework 4 due tonight!
Homework 5 will be posted tonight
We’ll discuss the exam in a moment
Polling today: polling.jedrembold.prof

Exam Debrief

Average: 88.5%
Median: 91.67%
St Dev: 12.29%

Review Question

One of the commands below to create a table is valid. Which is it?

CREATE TABLE rev_a (
  "name" TEXT UNIQUE NOT NULL,
  "year" INT,
   CHECK("year" != 2022),
  "class" TEXT
);

CREATE TABLE rev_a (
  "name" TEXT PRIMARY KEY,
  "year" INT PRIMARY KEY,
  "class" TEXT
);

CREATE TABLE rev_a (
  "name" TEXT,
  "year" INT,
  "class" TEXT REFERENCES other,
  CONSTRAINT "name" PRIMARY KEY );

CREATE TABLE rev_a (
  "name" TEXT,
  "year" INT,
  "class" TEXT,
  PRIMARY KEY ("name"),
  UNIQUE ("year","class"));

Indexing

Consulting an Index

Like in a book, an index is a precomputed guide to help find things faster
We can construct similar precomputed guides for certain columns in SQL to also help find things faster!
There are several types of indexes
- By default, the assigned index type is a B-Tree index in Postgres
  - B-Tree stands for balanced tree and work best for orderable type data
- Index types like Generalized Inverted Index (GIN) and Generalized Search Tree (GiST) will be discussed later as they come up
- Different columns in the same table can be indexed with different methods

Climbing a B-Tree

Showing the structure and looking up a key in a B-Tree

Creating Indexes

Postgres will automatically index any column that is a primary key or which has the UNIQUE constraint
You can choose to set up indexes on other columns as well, but do so outside of the table creation
```
CREATE INDEX |||index_name||| ON |||table_name||| (|||col_name|||);
```
If you decide you want to remove an index, you do so using the index name:
```
DROP INDEX |||index_name|||;
```

Benchmarking

One of the prime reasons for creating an index is to speed up access or manipulation of the data later
How can we objectively test this?
- Postgres has the EXPLAIN keyword to give you information about what the database is doing in the background
- Using EXPLAIN ANALYZE will also give you timing information about how long it took a query to run
EXPLAIN always comes at the start of your query
Other SQL variants have their own versions, but most have some method of getting information back about execution time or what is being done under the hood

Benchmarking Caveats

EXPLAIN ANALYZE reports to you the time it took the server to process the query, not necessarily the time it took your client to finish getting and rendering the response!
- Especially for queries that return a huge number of rows, you might see a significant difference
Indexes are optimized for certain types of operations. Just because you index a column doesn’t mean that Postgres will use that index if you are doing an incompatible operation!
- Check more than just the timings from EXPLAIN ANALYZE
  - See comments about Bitmaps and Heaps? Then the index was used.
  - See comments about Parallel or Seq scans? Then the index was not used.

Costs

Indexes always have a cost associated with them, both in initial setup and in every time new data is added to the column
- Also, they are stored information, so they can also inflate your database size
Consider indexing only those columns that receive the heaviest of use in filters or in joining
You never need to worry about indexing primary keys or unique columns, as those are done automatically
When in doubt, benchmark your queries before and after adding an index to see if you are really gaining much from it!

Benchmarking Activity

Most likely, you already (still) have the NYC Taxi Rides data on your computer
Benchmark that data set under three different queries, for each getting an execution time before creating an index and then after creating an index over the desired column (you’ll create 2 different indices total)
- Select all the columns for the row where the ride id = 56789
- Select all the columns for the rows where the total amount charged was greater than $90
- Select all the columns for the rows where the total amount charged was greater than $10
Enter your values into the shared spreadsheet here (Just choose an unused trial number)

Normal Modeling

Data Modeling

Data Modeling is the act of best deciding how to represent and store data such that it relates to the real world
As much as possible, it is generally desirable to:
- ensure the model can easily help answer the types of questions you will want to ask
- operate at as low a level of granularity as possible
- remove redundancy
- ensure referential integrity

Normalization

First introduced by Edgar Codd in the early 1970s
Codd outlined the following goals:
- to free the collection of relations from undesirable insertion, update, and deletion dependencies
- To reduce the need for restructuring the collection of relations as new types of data are introduced
- To make the relational model more informative to users
- To make the collection of relations neutral to what queries are being run on them
To assist in these efforts, Codd introduced the idea of normal forms

What is normal?

Each normal form builds on those before
The primary normal forms are:

Denormalized: No normalization. Nested and redundant data is allowed
First normal form (1NF): Each column is unique and has a single value. The table has a unique primary key.
Second normal form (2NF): Partial dependencies are removed (only necessary if compound primary key)
Third normal form (3NF): Each table contains fields only relevant to its primary key, and has no transitive dependencies

Denormalized

APIs will commonly give denormalized data, since they tend to return JSON type entries
Consider the entry to the right about a particular order

{ "OrderID": 100,
  "OrderItems": [
    {
      "sku": 1,
      "price": 10,
      "quantity": 1,
      "name": "Thingmajig"
    },
    {
      "sku": 2,
      "price": 25,
      "quantity": 2,
      "name": "Whatchamacalit"
    }],
  "CustomerID": 5,
  "CustomerName": "Jed Rembold",
  "OrderDate": "2022-11-09" }

Initial Attempt

OrderID	OrderItems	CustomerID	CustomerName	OrderDate
100	`[ { "sku": 1, "price": 10, "quantity": 1, "name": "Thingmajig" }, { "sku": 2, "price": 25, "quantity": 2, "name": "Whatchamacalit" } ]`	5	Jed Rembold	2022-11-09

1st Normal Form

OrderID	Sku	Price	Quantity	ProductName	CustomerID	CustomerName	OrderDate
100	1	50	1	Thingmajig	5	Jed Rembold	2022-11-09
100	2	25	2	Whatchamacalit	5	Jed Rembold	2022-11-09

1st Normal Form (PK)

OrderID	ItemNum	Sku	Price	Quantity	ProductName	CustomerID	CustomerName	OrderDate
100	1	1	50	1	Thingmajig	5	Jed Rembold	2022-11-09
100	2	2	25	2	Whatchamacalit	5	Jed Rembold	2022-11-09

Evaluating 2NF

To be in second normal form, there can be no partial dependencies, where columns depend on only one of the compound keys
Here though the last 3 columns all depend only on OrderID
- Solution: Split to new tables!

2nd Normal Form

OrderID	CustomerID	CustomerName	OrderDate
100	5	Jed Rembold	2022-11-09

OrderID	ItemNum	Sku	Price	Quantity	ProductName
100	1	1	50	1	Thingmajig
100	2	2	25	2	Whatchamacalit

Evaluating 3NF

The third normal form prohibits transitive dependencies, where a column depends on another (that depends on the primary key), instead of depending directly on the primary key
Here we have 2! One in each table:
- ProductName depends on Sku
- CustomerName depends on CustomerID
Solution? More tables!

3rd Normal Form

Sku	ProductName
1	Thingmajig
2	Whatchamacalit

CustomerID	CustomerName
5	Jed Rembold

OrderID	ItemNum	Sku	Price	Quantity
100	1	1	50	1
100	2	2	25	2

OrderID	CustomerID	OrderDate
100	5	2022-11-09

Visually

Beyond

Other normal forms exist (up to 6NF in the Boyce-Codd system)
Most people only care about the first 3 to get data in a “normalized” state
Unless you have specific performance reasons for wanting otherwise, you really should strive for normalized tables in your relational database, as they make maintaining, adding, updating, and adjusting your database and tables much easier

Activity

Suppose you wanted to track your Spotify playlist information in your own database. Questions you may want to be able to answer might include:

What is the total playtime of this playlist?
What is the most common band in a particular playlist?
Of all the artists who perform songs in any playlist, which artists collaborate with others the most?
Songs from how many different albums are in a particular playlist?

In groups of two or three, sketch out ER diagrams that would allow all these questions to be answered. Your tables should include primary keys and foreign keys where appropriate, and be normalized as best as you are able.

Online sketching resources include: drawsql, ChartDB, dbdiagram, DatabaseDiagram