Foreign Subqueries about Pivots

Jed Rembold

March 19, 2025

Announcements

  • Homework
    • HW7 is due tonight
    • HW8 is coming out tonight, due Wed after break
    • I’m spending part of my break getting caught up
  • Today is going to be mostly SQL focused
  • Projects
    • Is your scraper up and running? Ensure that it is before Spring Break starts!
    • Is your automatic cleaner/transformer up and running? Try to have that done before we meet next
    • We’ll be talking the serving components after break

Database Connections

Connecting Databases

  • Postgres has a few ways you can access tables from one database while in another
  • Both are extensions that you would need to activate
    • dblink: Simple for one-off data access
    • postgres_fdw: Better if you need a more permanent connection
  • Both can be useful, so I’ll try to touch on each

The Connection

  • Accessing data using dblink is a two-step process: connect and then query

  • To connect, use dblink_connect (shocking!)

    SELECT dblink_connect(|||connection string|||)
  • The connection string can either be a full database url, or a space delimited sequence of keywords, e.g.:

    'host=localhost user=postgres password=1234'

Querying

  • Once the connection has been made (within a session) you can send across queries

    SELECT * FROM dblink(|||query string|||)
  • dblink is essentially blind, in that it has no idea what would be returned by the query

    • As such, similar to creating a pivot table (coming soon!), you must supply a table alias and columns/types at the end
    SELECT * 
    FROM dblink(|||query string|||)
    AS |||table alias||| ( |||col₁||| |||type₁|||, |||col₂||| |||type₂|||, |||etc.|||);

Executing

  • If you have a command to run remotely, and don’t need a table back, you can use dblink_exec

  • This may be useful for creating, updating or deleting data from a remote table.

    SELECT dblink_exec(
      'CREATE TABLE test (col1 text, col2 int)'
    );

Foreign Data Wrapper

  • Using the postgres_fdw extension is a bit more of a setup, but makes working with remote tables a breeze
  • Multistep process:
    • Activate the extension
    • Create a foreign server object, representing the remote database server
    • Create a user wrapping which allows that user to access that foreign server
    • Create the foreign tables

The Foreign Server Objects

  • You first need to define a foreign server

    CREATE SERVER |||server_name|||
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (|||server config settings|||)
  • Server name can be whatever you like, provided it is unique

  • Server settings are a comma delimited, space separated sequence of host, port, dbname, etc

The User Wrapper

  • You essentially need to give your user permission to access the remote server

    CREATE USER MAPPING FOR CURRENT_USER
    SERVER |||server_name|||
    OPTIONS (|||user options|||)
  • Server name will be the name you just used for the created server

  • User options should include the corresponding remote user info like username and password

Bring On the Tables!

  • Last but not least, you can import in the tables you want

  • Can do one at a time with CREATE FOREIGN TABLE, but I far prefer IMPORT FOREIGN SCHEMA

    IMPORT FOREIGN SCHEMA |||schema name|||
    FROM SERVER |||server_name|||
    INTO |||local schema name|||
  • Can also include LIMIT TO or EXCEPT followed by a list of table names to fine-tune which tables are imported

Nesting Queries

Subqueries

  • A subquery is simply a query embedded within another piece of SQL
  • Commonly used to prepare some data or compute a value to be used by the surrounding SQL
  • We have already seen very simple versions
    • Creating a new table, for example
    CREATE TABLE |||new table||| AS (
      SELECT * FROM |||og table|||
    );
  • Often, you could have alternatively saved the subquery as its own table and then used that directly, but not always

Common Uses of Subqueries

  • Computing a single value to use in comparisons
  • Computing a single column or list of values to use either directly in a table or in comparisons
  • Performing “join-like” operations (potentially in a more flexible manner)
  • Filtering one table based on the contents of another table
  • Otherwise combining operations that would otherwise have taken several steps

Filter on single value

  • Think back to HW3 where you needed to filter out all the outlier taxi-ride speeds

    • There you had to compute the quartiles, the whisker edges, and then filter directly by typing that number back in
  • Subqueries can combine those steps! You can directly compare a value to the subquery output so long as the query outputs only a single value

    SELECT *
    FROM |||table 1|||
    WHERE |||column 1||| > (
      SELECT avg(|||column 2|||)
      FROM |||table 2|||
    );

Derived Tables

  • You can use the output of a subquery anywhere you would normally reference a table name

  • Such a table is then called a derived table

  • You must give such subqueries a table alias, else you have no way to refer to them

    SELECT *
    FROM (
      SELECT |||column 1, column 2|||
      FROM |||table name|||
    ) AS |||table alias|||; 

Repeating Column Values

  • Sometimes it is very useful to have a constant value assigned to an entire column

    • We can’t just use an aggregate function (or similar) though, else we get mismatched column lengths
    • Window functions could work, but only if within the same table
  • You can include a subquery in your selected columns if it outputs only a single value

    • That value will be propagated to all the rows
  • Give it an alias so the column heading has some meaning!

    SELECT
      |||column 1|||,
      |||column 2|||,
      (SELECT avg(|||column 3|||) FROM |||table 2|||) as |||column alias|||
    FROM |||table 1|||;

Subquery Expressions

  • Can also use subqueries to filter based on whether certain content is present within the subquery output
  • Combinations of a keyword followed by the subquery
    • |||expr||| IN (|||subquery|||) and |||expr||| NOT IN (|||subquery|||)
    • EXISTS (|||subquery|||) and NOT EXISTS (|||subquery|||)
    • |||expr||| |||op||| ANY (|||subquery|||)
    • |||expr||| |||op||| ALL (|||subquery|||)
  • |||expr||| is a column name or value, and |||op||| is a boolean operator (=,>,>=,etc)
  • All subquery expressions are evaluated in short-circuit mode: they will return an answer as soon as they can

IN Subquery

  • Will check to see if a term appears anywhere in the subquery output
  • The subquery must output a single column (though it could be empty)
  • NOT IN just reverses the situation
  • Be careful of NULLs
    • 1 IN (1,2,3) gives True
    • 1 IN (2,3) gives False
    • 1 IN (NULL,2,3) gives NULL
    • Most dangerous with NOT IN, since 1 NOT IN (NULL,2,3) will give NULL, and not True as you might expect

EXISTS Subquery

  • Checks just to see if the subquery has any rows in the output
    • No expression to compare to, just looking at subquery output
  • The contents of the subquery do not matter at all, so a simple SELECT 1 FROM ... is usually used
  • To be useful, the subquery is frequently what is called a correlated subquery, where it references the table in the outer query
    • These types of subqueries can not be run in isolation
    SELECT |||column₁, column₂|||
    FROM |||table₁|||
    WHERE EXISTS (
      SELECT 1
      FROM |||table₂|||
      WHERE |||table₁.column₁||| = |||table₂.column₂|||
    );

ANY and ALL

  • At times you don’t want to see if a value is in the subquery output, but rather how it compares to the output

  • A boolean expression needs to return a single True or False though

  • ANY and ALL “broadcast” the boolean comparison across all the subquery rows

    • ANY will return True if any of the rows evaluate to True with the expression
    • ALL will only return True if all of the rows evaluate to True with the expression
  • |||expr||| = ANY (|||subquery|||) is thus identical to |||expr||| IN (|||subquery|||)

    SELECT |||column|||
    FROM |||table 1|||
    WHERE |||column||| < ALL ( SELECT |||column||| FROM |||table 2|||);

Subqueries vs Joins

  • You may have realized that subqueries can do many similar things to what we used joins for!
  • When should each be used?
    • Subqueries are great when you only need information from a single table, but what information depends on another table
    • Joins are necessary when you need information from multiple tables
  • Historically, most RDMS were better optimized for joins, but many have improved substantially in recent years for subqueries
    • If in doubt, use the form that makes the most sense for what you semantically want to accomplish. You can always convert it later if better optimization is needed and could be achieved

Be careful!

  • It can be pretty easy to leap off the deep end with subqueries
  • Always keep in mind what you want the subquery to achieve, and test it individually to make sure it is doing what you expect
    • Use them where they fit the needs of the situation
    • A nice aspect of subqueries is that you can work from the “bottom-up”, starting with smaller subqueries you know and building on them
  • Organize your queries nicely!
    • This becomes even more important with subqueries

Activity!

  • I’ve prepped a simple grades database here that you can download and run the sql file to create a subq1 schema in your database of choice which will contain three common tables: roster, assignments, submissions
  • See if you can answer the questions below without using joins and only subqueries (as practice).
    • Which student(s) have turned in nothing?
    • Which assignment name has the worst scored average overall?
    • Which students scored below average on every quiz?

CTEs

Common Table Expressions

  • Sometimes you may want to use the same derived tables multiple places in a query
  • You could copy and paste those subqueries, but there is a better way
  • Common Table Expressions or CTEs are a way for you to define up front a derived table that can be referenced anywhere in the rest of the query
  • To do so, you use a WITH keyword:
WITH
  |||derived table name||| (|||new column names|||) AS (
    |||subquery|||
  )
SELECT ...

CTE Details

  • The number of columns returned by your subquery must match the number of column names you initially define
    • If you aren’t going to rename the columns, you do not even really need to include the new column names, but it can be nice for clarity
  • You do not need to include data types, as these will be inherited from the subquery columns
  • You can include multiple derived tables in the CTE statement, separated with a comma
  • Good use of CTE’s can help clarify and simplify queries
    • Especially if you have a subquery that you need to reference a few times, use a CTE!

Example

Using a CTE and other subqueries, let’s extract the middle 50% of the taxi_rides by duration.

Pivot Tables

Why Cross Tabulation?

  • SQL tables tend to operate with observations (records/rows) and variables (columns)
    • Commonly called a “tidy” data format
  • This isn’t always the most human readable format
  • Cross-tabulations or pivot tables compare variables to other variables
Name Color Condition Number
Apple Red Good 5
Orange Orange Bad 3
Cherry Red Good 10
Banana Yellow Good 1
Pineapple Yellow Bad 4
Color Good Bad
Red 15 0
Orange 0 3
Yellow 1 4

Extensions

  • Standard SQL has no way to create these cross-tabulations

  • Postgres has a function to help, but it is located in an extension or module

    • Extensions to Postgres are collections of functions, date types, and more which add some functionality to a particular database
  • To use an extension, you need only enable it for your particular database

    CREATE EXTENSION |||extension_name|||;
  • To facilitate making pivot tables, we need to add the tablefunc extension to our database, which will add the crosstab function

  • You can always remove an extension using DROP if you need later

Pieces of the Crosstab

  • The crosstab function has a lot going on, so let’s break things down
SELECT *
FROM crosstab(
  |||subquery 1 string|||,
  |||subquery 2 string|||
  )
AS (
  |||row label column||| TEXT,
  |||column 1||| |||type 1|||,
  |||column 2||| |||type 2|||,
  ⋮
);
  • Subquery 1 must return 3 columns
    • The first is the row label values
    • The second is the column label values
    • The third is the particular data values
  • Subquery 2 returns a single column
    • What column categories should rows be placed into?
  • You still need to define the column names and data types
    • First should be the row label name and data type
    • Rest should be whatever column categories you are creating

Example

Let’s construct a pivot table comparing the total number of taxi rides at different days of the week across different hours.

A Case Study

On the Case

  • It can be useful in certain situations to employ a little conditional logic outside a filtering statement
  • Can be particularly useful to tweak or reclassify values
  • Standard SQL has the CASE statement to accomplish this
    • Essentially works like programming languages if…else if…else blocks
CASE 
  WHEN |||some condition||| THEN |||output|||
  WHEN |||some other condition||| THEN |||different output|||
  ⋮
  ELSE |||fallback output|||
END

A Case Study

  • Comparisons are made in order, so the first condition that matches, that output is used

  • If you do not include an ELSE part, then NULL will be output if nothing else matches

  • CASE statements most often show up in SELECT statements where column outputs are being selected, but they could potentially also show up in filtering or ordering statements

  • A CASE statement will not evaluate results where the condition is not met, so they can be used to prevent certain errors as well (such as dividing by 0)

    SELECT
      CASE WHEN a != 0 THEN 5/a END
    FROM tablename;

Example

Let’s compute how many taxi rides were considered:

  • “Time Travelers” with a duration of less than 0
  • “Short Trips” with a duration less than 5 minutes
  • “Medium Trips” with a duration between 5 and 15 minutes
  • “Long Trips” with a duration greater than 15 minutes

Coalesce

  • Similar to CASE, SQL also has the COALESCE statement
  • COALESCE returns the first non-null value in the list
  • Frequently used to substitute a default value for NULL values
SELECT
  student_id,
  COALESCE(grade, 0)
FROM grades;

Activity!

Using the same grade book tables as earlier (here if you need them again), do the following:

  • Create a pivot table for student quiz point scores with full student names as the rows and individual quizes as the columns. You can concatenate two strings in SQL using the || operator. So fname || ' ' || lname would give the first and last name with a space between as a single string. Feel free to use joins here.
student_name Quiz 1 Quiz 2 Quiz 3
Hailey Gray 1 1 21
  • Create a pivot table for student letter grades on each quiz. Parts of this will look really similar to what you did in the first part, but there will be some important differences.

// reveal.js plugins // Added plugins