Foreign Subqueries about Pivots

Jed Rembold

March 19, 2025

Announcements

Homework
- HW7 is due tonight
- HW8 is coming out tonight, due Wed after break
- I’m spending part of my break getting caught up
Today is going to be mostly SQL focused
Projects
- Is your scraper up and running? Ensure that it is before Spring Break starts!
- Is your automatic cleaner/transformer up and running? Try to have that done before we meet next
- We’ll be talking the serving components after break

Database Connections

Connecting Databases

Postgres has a few ways you can access tables from one database while in another
Both are extensions that you would need to activate
- dblink: Simple for one-off data access
- postgres_fdw: Better if you need a more permanent connection
Both can be useful, so I’ll try to touch on each

The Connection

Accessing data using dblink is a two-step process: connect and then query

To connect, use dblink_connect (shocking!)

SELECT dblink_connect(|||connection string|||)

The connection string can either be a full database url, or a space delimited sequence of keywords, e.g.:
```
'host=localhost user=postgres password=1234'
```

Querying

Once the connection has been made (within a session) you can send across queries
```
SELECT * FROM dblink(|||query string|||)
```
dblink is essentially blind, in that it has no idea what would be returned by the query
- As such, similar to creating a pivot table (coming soon!), you must supply a table alias and columns/types at the end
```
SELECT * 
FROM dblink(|||query string|||)
AS |||table alias||| ( |||col₁||| |||type₁|||, |||col₂||| |||type₂|||, |||etc.|||);
```

Executing

If you have a command to run remotely, and don’t need a table back, you can use dblink_exec
This may be useful for creating, updating or deleting data from a remote table.
```
SELECT dblink_exec(
  'CREATE TABLE test (col1 text, col2 int)'
);
```

Foreign Data Wrapper

Using the postgres_fdw extension is a bit more of a setup, but makes working with remote tables a breeze
Multistep process:
- Activate the extension
- Create a foreign server object, representing the remote database server
- Create a user wrapping which allows that user to access that foreign server
- Create the foreign tables

The Foreign Server Objects

You first need to define a foreign server

CREATE SERVER |||server_name|||
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (|||server config settings|||)

Server name can be whatever you like, provided it is unique
Server settings are a comma delimited, space separated sequence of host, port, dbname, etc

The User Wrapper

You essentially need to give your user permission to access the remote server

CREATE USER MAPPING FOR CURRENT_USER
SERVER |||server_name|||
OPTIONS (|||user options|||)

Server name will be the name you just used for the created server
User options should include the corresponding remote user info like username and password

Bring On the Tables!

Last but not least, you can import in the tables you want

Can do one at a time with CREATE FOREIGN TABLE, but I far prefer IMPORT FOREIGN SCHEMA

IMPORT FOREIGN SCHEMA |||schema name|||
FROM SERVER |||server_name|||
INTO |||local schema name|||

Can also include LIMIT TO or EXCEPT followed by a list of table names to fine-tune which tables are imported

Nesting Queries

Subqueries

A subquery is simply a query embedded within another piece of SQL
Commonly used to prepare some data or compute a value to be used by the surrounding SQL
We have already seen very simple versions
- Creating a new table, for example
```
CREATE TABLE |||new table||| AS (
  SELECT * FROM |||og table|||
);
```
Often, you could have alternatively saved the subquery as its own table and then used that directly, but not always

Common Uses of Subqueries

Computing a single value to use in comparisons
Computing a single column or list of values to use either directly in a table or in comparisons
Performing “join-like” operations (potentially in a more flexible manner)
Filtering one table based on the contents of another table
Otherwise combining operations that would otherwise have taken several steps

Filter on single value

Think back to HW3 where you needed to filter out all the outlier taxi-ride speeds
- There you had to compute the quartiles, the whisker edges, and then filter directly by typing that number back in
Subqueries can combine those steps! You can directly compare a value to the subquery output so long as the query outputs only a single value
```
SELECT *
FROM |||table 1|||
WHERE |||column 1||| > (
  SELECT avg(|||column 2|||)
  FROM |||table 2|||
);
```

Derived Tables

You can use the output of a subquery anywhere you would normally reference a table name
Such a table is then called a derived table

You must give such subqueries a table alias, else you have no way to refer to them

SELECT *
FROM (
  SELECT |||column 1, column 2|||
  FROM |||table name|||
) AS |||table alias|||;

Repeating Column Values

Sometimes it is very useful to have a constant value assigned to an entire column
- We can’t just use an aggregate function (or similar) though, else we get mismatched column lengths
- Window functions could work, but only if within the same table
You can include a subquery in your selected columns if it outputs only a single value
- That value will be propagated to all the rows

Give it an alias so the column heading has some meaning!

SELECT
  |||column 1|||,
  |||column 2|||,
  (SELECT avg(|||column 3|||) FROM |||table 2|||) as |||column alias|||
FROM |||table 1|||;

Subquery Expressions

Can also use subqueries to filter based on whether certain content is present within the subquery output
Combinations of a keyword followed by the subquery
- |||expr||| IN (|||subquery|||) and |||expr||| NOT IN (|||subquery|||)
- EXISTS (|||subquery|||) and NOT EXISTS (|||subquery|||)
- |||expr||| |||op||| ANY (|||subquery|||)
- |||expr||| |||op||| ALL (|||subquery|||)
|||expr||| is a column name or value, and |||op||| is a boolean operator (=,>,>=,etc)
All subquery expressions are evaluated in short-circuit mode: they will return an answer as soon as they can

IN Subquery

Will check to see if a term appears anywhere in the subquery output
The subquery must output a single column (though it could be empty)
NOT IN just reverses the situation
Be careful of NULLs
- 1 IN (1,2,3) gives True
- 1 IN (2,3) gives False
- 1 IN (NULL,2,3) gives NULL
- Most dangerous with NOT IN, since 1 NOT IN (NULL,2,3) will give NULL, and not True as you might expect

EXISTS Subquery

Checks just to see if the subquery has any rows in the output
- No expression to compare to, just looking at subquery output
The contents of the subquery do not matter at all, so a simple SELECT 1 FROM ... is usually used
To be useful, the subquery is frequently what is called a correlated subquery, where it references the table in the outer query
- These types of subqueries can not be run in isolation
```
SELECT |||column₁, column₂|||
FROM |||table₁|||
WHERE EXISTS (
  SELECT 1
  FROM |||table₂|||
  WHERE |||table₁.column₁||| = |||table₂.column₂|||
);
```

ANY and ALL

At times you don’t want to see if a value is in the subquery output, but rather how it compares to the output
A boolean expression needs to return a single True or False though
ANY and ALL “broadcast” the boolean comparison across all the subquery rows
- ANY will return True if any of the rows evaluate to True with the expression
- ALL will only return True if all of the rows evaluate to True with the expression

|||expr||| = ANY (|||subquery|||) is thus identical to |||expr||| IN (|||subquery|||)

SELECT |||column|||
FROM |||table 1|||
WHERE |||column||| < ALL ( SELECT |||column||| FROM |||table 2|||);

Subqueries vs Joins

You may have realized that subqueries can do many similar things to what we used joins for!
When should each be used?
- Subqueries are great when you only need information from a single table, but what information depends on another table
- Joins are necessary when you need information from multiple tables
Historically, most RDMS were better optimized for joins, but many have improved substantially in recent years for subqueries
- If in doubt, use the form that makes the most sense for what you semantically want to accomplish. You can always convert it later if better optimization is needed and could be achieved

Be careful!

It can be pretty easy to leap off the deep end with subqueries
Always keep in mind what you want the subquery to achieve, and test it individually to make sure it is doing what you expect
- Use them where they fit the needs of the situation
- A nice aspect of subqueries is that you can work from the “bottom-up”, starting with smaller subqueries you know and building on them
Organize your queries nicely!
- This becomes even more important with subqueries

Activity!

I’ve prepped a simple grades database here that you can download and run the sql file to create a subq1 schema in your database of choice which will contain three common tables: roster, assignments, submissions
See if you can answer the questions below without using joins and only subqueries (as practice).
- Which student(s) have turned in nothing?
- Which assignment name has the worst scored average overall?
- Which students scored below average on every quiz?

CTEs

Common Table Expressions

Sometimes you may want to use the same derived tables multiple places in a query
You could copy and paste those subqueries, but there is a better way
Common Table Expressions or CTEs are a way for you to define up front a derived table that can be referenced anywhere in the rest of the query
To do so, you use a WITH keyword:

WITH
  |||derived table name||| (|||new column names|||) AS (
    |||subquery|||
  )
SELECT ...

CTE Details

The number of columns returned by your subquery must match the number of column names you initially define
- If you aren’t going to rename the columns, you do not even really need to include the new column names, but it can be nice for clarity
You do not need to include data types, as these will be inherited from the subquery columns
You can include multiple derived tables in the CTE statement, separated with a comma
Good use of CTE’s can help clarify and simplify queries
- Especially if you have a subquery that you need to reference a few times, use a CTE!

Example

Using a CTE and other subqueries, let’s extract the middle 50% of the taxi_rides by duration.

Pivot Tables

Why Cross Tabulation?

SQL tables tend to operate with observations (records/rows) and variables (columns)
- Commonly called a “tidy” data format
This isn’t always the most human readable format
Cross-tabulations or pivot tables compare variables to other variables

Name	Color	Condition	Number
Apple	Red	Good	5
Orange	Orange	Bad	3
Cherry	Red	Good	10
Banana	Yellow	Good	1
Pineapple	Yellow	Bad	4

Color	Good	Bad
Red	15	0
Orange	0	3
Yellow	1	4

Extensions

Standard SQL has no way to create these cross-tabulations
Postgres has a function to help, but it is located in an extension or module
- Extensions to Postgres are collections of functions, date types, and more which add some functionality to a particular database
To use an extension, you need only enable it for your particular database
```
CREATE EXTENSION |||extension_name|||;
```
To facilitate making pivot tables, we need to add the tablefunc extension to our database, which will add the crosstab function
You can always remove an extension using DROP if you need later

Pieces of the Crosstab

The crosstab function has a lot going on, so let’s break things down

SELECT *
FROM crosstab(
  |||subquery 1 string|||,
  |||subquery 2 string|||
  )
AS (
  |||row label column||| TEXT,
  |||column 1||| |||type 1|||,
  |||column 2||| |||type 2|||,
  ⋮
);

Subquery 1 must return 3 columns
- The first is the row label values
- The second is the column label values
- The third is the particular data values
Subquery 2 returns a single column
- What column categories should rows be placed into?
You still need to define the column names and data types
- First should be the row label name and data type
- Rest should be whatever column categories you are creating

Example

Let’s construct a pivot table comparing the total number of taxi rides at different days of the week across different hours.

A Case Study

On the Case

It can be useful in certain situations to employ a little conditional logic outside a filtering statement
Can be particularly useful to tweak or reclassify values
Standard SQL has the CASE statement to accomplish this
- Essentially works like programming languages if…else if…else blocks

CASE 
  WHEN |||some condition||| THEN |||output|||
  WHEN |||some other condition||| THEN |||different output|||
  ⋮
  ELSE |||fallback output|||
END

A Case Study

Comparisons are made in order, so the first condition that matches, that output is used
If you do not include an ELSE part, then NULL will be output if nothing else matches
CASE statements most often show up in SELECT statements where column outputs are being selected, but they could potentially also show up in filtering or ordering statements
A CASE statement will not evaluate results where the condition is not met, so they can be used to prevent certain errors as well (such as dividing by 0)
```
SELECT
  CASE WHEN a != 0 THEN 5/a END
FROM tablename;
```

Example

Let’s compute how many taxi rides were considered:

“Time Travelers” with a duration of less than 0
“Short Trips” with a duration less than 5 minutes
“Medium Trips” with a duration between 5 and 15 minutes
“Long Trips” with a duration greater than 15 minutes

Coalesce

Similar to CASE, SQL also has the COALESCE statement
COALESCE returns the first non-null value in the list
Frequently used to substitute a default value for NULL values

SELECT
  student_id,
  COALESCE(grade, 0)
FROM grades;

Activity!

Using the same grade book tables as earlier (here if you need them again), do the following:

Create a pivot table for student quiz point scores with full student names as the rows and individual quizes as the columns. You can concatenate two strings in SQL using the || operator. So fname || ' ' || lname would give the first and last name with a space between as a single string. Feel free to use joins here.

student_name	Quiz 1	Quiz 2	Quiz 3
Hailey Gray	1	1	21
⋮	⋮	⋮	⋮

Create a pivot table for student letter grades on each quiz. Parts of this will look really similar to what you did in the first part, but there will be some important differences.