---
title: "Scraping a Railway on Schedule"
author: Jed Rembold
date: Mar 5, 2025
slideNumber: true
theme: nord_light
highlightjs-theme: nord
width: 1920
height: 1080
transition: slide
---


## Announcements
:::smaller
- Midterm video deadline tonight!
- Upcoming Project Starts
  - We are coming up on a point where it would be useful to strategize with your partner about possible project ideas
  - I'm sending out a poll later tonight about who you might want (or not want) to work with, and then I'll make the pairings and announce them tomorrow.
  - You will be setting up (individually) a scraper in the coming week to be collecting data on a daily basis from some website or API
  - You should strategize with your partner about what two difference sources you both can be collecting data for that would have interesting relationships that you could later analyze
  - Most obvious join options would be date, but could be more interesting relationships depending on what you are both gathering
- No homework this coming week! Just getting started on the project!
:::

<!-------- Scraping ----------------->
# Gathering Data
## Webscraping
- Webscraping is the act of extracting information from a URL so that it can be collected or otherwise used elsewhere
- It can take multiple forms, with varying degrees of complexity (most of which depends on the website)
  - Extracting information from a table of data
  - Extracting other information on a webpage that is not necessarily formatted
  - Extracting information using a provided API endpoint
- The method we'll showcase today will let you approach situations where you want to grab information from a particular table on a website, or in accessing an API


## Web2DB
- To focus on the acquisition of data, without worrying about the scripting details, I have created a Docker container you can use
  ```bash
  docker pull jrembold/web2db:latest
  ```
- This container will handle:
  - Grabbing a static table from a webpage, or
  - Retrieving a response from an API endpoint, and
  - Optionally: Writing the data as JSON to a file
  - Optionally: Adding the data as JSON to an indicated PostgreSQL table
  - Optionally: scheduling the same operation to run with a provided frequency


## Controlling the Environment
- You control the behavior of the Web2DB container through environment variables
  - Keeps you have needing to build your own images on top for customization
- Usually a good idea to store secrets such as database passwords in environment variables anyway
- **Locally**, the easiest way to handle the variables is to create a `vars.env` file and provide it to the docker container
  ```bash
  docker run --rm --env-file vars.env web2db
  ```
- You _can_ specify them individually, one after each `-e` flag, if you really wanted
  ```bash
  docker run --rm -e SITE_URL=www.google.com web2db
  ```

## Web2DB Environment Variables
- web2db recognizes (or requires) 6 environment variables:
  - `SITE_URL` (required)
  - `TABLE_NUM` (situational/optional)
  - `FILE_NAME` (optional)
  - `DATABASE_URL` (optional)
  - `TABLE_NAME`{.text} (situational)
  - `DEBUG`{.text} (optional)
  - `HEADER_KEYS` (optional)
- When providing in `vars.env`, use no spaces:
  ```bash
  SITE_URL=https://willamette.edu/~jjrembold  
  ```

## SITE_URL for Tables
- If you are trying to scrape a _static_ table off a site, provide:
  - The URL of the site as `SITE_URL`
  - Indicate the number of which table you want as `TABLE_NUM`
    - Defaults to 0 for the first table on the page
- The table will get converted to JSON format, with each row containing the column headers as keys
- If a table is populated from a backend database on loading the webpage (a dynamic table) this method will unfortunately not work


## Understanding APIs
- Instead of posting a table on a webpage, information providers might make available a public API where you can access the information
- Most REST APIs look just like a web address, but if you navigate to that URL, instead of getting HTML to render a webpage, you get the data directly, most often in a JSON format
  - Many APIs also let you add extra information to the URL to better specify _exactly_ what information you want back
  - These generally take the form of parameter strings, and come after a `?` in the URL
- Some APIs will require you to register for a key, which is often free. This is to safeguard against people slamming their servers with billions of requests. **Be respectful in both your API and webscraping usage!**


## Understanding Endpoints
- Many API's will offer a variety of different _endpoints_
- An endpoint is essentially just a specific URL that will get you a particular piece of information
- How do you know what endpoints are available?
  - Classic RTFM: You read the @#%!ing manual
  - Any API worth anything will have extensive documentation describing what endpoints are available and what customizations you can apply to each endpoint to get the data you want


## SITE_URL for APIs
- web2db is currently configured to handle REST GET APIs
  - I may be able to add `POST` functionality if needed
- You just need to provide the desired endpoint to `SITE_URL`
- If you have parameters to include at the end (following a `?`), **include them in your URL**
- web2db assumes that the response will be in JSON, which is probably the case in over 95% of APIs, but it will break if given something in another format

## HEADER KEYS
- Some APIs may demand that you provide certain information in what is known as a _header key_
  - This is an additional source of information that the API server receives along with the URL
- Sometimes used to indicate authentication keys or type of response requested
- Provided as a string version of a key-value dictionary:
  ```json
  "{'key_1':13, 'key_2': 'Jed'}"
  ```

## Writing to File
- If you include the `FILE_NAME` environment variable, the data will be written to a newline JSON file with this name in the container's `/app/output` folder
- This format is perfect for copying directly into Postgres tables using `COPY`
  - Instead of having a list of entries, it has 1 entry on each new line
- If you want to access these files from outside your container, you'll need to set up a volume bind!


## Writing to Database
- web2db can write the information directly to your database, provided you fill out the `DATABASE_URL` and `TABLE_NAME`{.text} variables
- `DATABASE_URL` contains **all** the info to connect to your database
  - Form: `username:password@hostname:port/database`{.text}
- `TABLE_NAME`{.text} should be a table that _already_ exists in the database and has a column named `raw_json`
  - It can have other columns, but web2db will only fill out the `raw_json` column
  - I like adding a timestamp column with a `DEFAULT now()` to automatically add the time each entry is added to the database
- If connecting to a local database, you **may** need to include `--network=host`{.text} option when running the docker container


## Practice!
- See if you can do the following:
  - Create a table in your local database with at least one column with the `jsonb` data type
  - Use the Docker container to dump the table containing our class schedule [here](https://jrembold.github.io/Website_Backup/classes/data503/hw/) into your Postgres table
  - Use the Docker container to dump the results from the API endpoint [here](http://api.open-notify.org/astros.json) into your Postgres table. The API gives information about all the humans currently in space.
- Note that although these two results have very different schema, you can dump them both into the same `jsonb` column in Postgres!


<!--------  THEORY ------------------>
<!--
## An Interface
- SQL is brilliant for working within our databases
  - It is not generally as flexible to handle other common scripting tasks
- You may at times need an _interface_ between a common scripting language (say R or Python) and SQL
- These interfaces generally allow two major things to happen:
  - You can trigger certain SQL commands to happen in the database from within your other script
  - You can query results _out_ of a database using SQL and bring them directly into your scripting language, ready for use


## All about the connection
- Regardless of the scripting language, an interface is generally built around an initial _connection_ object that gets defined in the language
- The connection object is initially formed by requiring information about the database you desire to connect to
  - IP address / Port
  - Database name
  - User you'll be connecting as
  - User password
- All other commands will use the connection object as a method to communicate with the database
- Frequently will form one connecting at the start and then maintain it, but you could connect and disconnect as needed


## Package Management
- Tidyverse already has the dbplyr package for working with databases, which brings along with it the necessary DBI package
  - The DBI package is the one responsible for facilitating the connection and providing many common functions for interacting with the database
  - Other packages are built on top of DBI to allow connections to different types of databases, all using a similar syntax
- You will also need `RPostgreSQL`, which is built on top of DBI to allow for connections to PostgreSQL databases


## Making a connection
- The `dBConnect` command is used to create a connection object
- Needs at least one input, corresponding to a DBI object representing the type of database to connect to
  - For Postgres, that can be `RPostgreSQL::PostgreSQL()`
- Further parameters can customize aspects of what database is connected to
  - `dbname` for the database name
  - `host` for the IP address (or localhost if on same system)
  - `port` for a specific port
  - `user` for the connecting username
  - `password` for the corresponding password
- Save the connection object as something! You will need it!


## Example Connection
- Connecting to a generic postgres database living on the same computer:
  ```R
  con <- dbConnect(RPostgreSQL::PostgreSQL(),
                   dbname = 'analysis',
                   host = 'localhost',
                   port = 5432,
                   user = 'postgres',
                   password = 'postgres')
  ```

## Reading information
- Once you have a connection, you can query the database and bring that information into R!
- Several approaches
  - Using DBI directly:
    ```R
    output <- dbGetQuery(con, "SELECT * FROM pokemon")
    ```
    - This reads the resulting table directly into a dataframe stored in local memory
  - Using dbplyr:
    ```R
    output <- tbl(con, sql("SELECT * FROM pokemon"))
    ```
    - This stores in the `output` a database table object, which is essentially a map to the information in the database
    - You can filter, aggregate, etc on this object as if it were a local dataframe, but _it is actually still on the server_!
    - dplyr converts all the R commands into corresponding SQL that is run serverside


## Writing information
- For writing information to the database, or any other SQL commands that don't return a table, things are more straightforward (kinda)
- I would recommend using `dbExecute`, as I've found it to be the most straightforward and convenient
- Takes two required arguments:
  - The desired connection object
  - A string containing the SQL you desire to run
  ```R
  dbExecute(
    con, 
    "INSERT INTO test VALUES ('Jed', 1985)"
  )
  ```


## Parameterizing Information
- Often, you have gotten information from another source that you want to embed inside an SQL query
- You could do this directly through string processing, using something like `paste` or `glue`
  - Doing this blindly though can open you up to SQL-Injection attacks!
- A better way is through _query parameterization_
  - Indicate placeholders in the query where you want information to be subbed in, and then supply that information
  - The interface library then takes care of sanitizing the input and making sure nothing nefarious is going on before inserting the desired strings


## RPostgresql Parameterization
- Uses ordered placeholders using a dollar sign followed by a number: `$1`, `$2`, etc
  - This is a bit different than many implementations, which let you just use `?` marks
- Can provide a parameterization list after the query:
  ```R
  dbExecute(con, "INSERT INTO test VALUES ($1, $2)", list('Jed', 1985))
  ```
- Could also provide a 1-row tibble of the necessary length
- In RPostgreSQL, there seems to be no way to do bulk inserts using parameterization, unfortunately
- Can also work with `dbGetQuery` statements:
  ```R
  dbGetQuery(con, "SELECT * FROM pokemon WHERE hp > $1", params=list(50))
  ```

## Sleeping
- For long running data acquisition scripts, it is often useful to have the program "wait" a set amount of time between grabbing data
- You can do this in R with the `Sys.sleep(duration)`
  - Duration is given in seconds
- Such statements are generally embedded in long running or infinite loops:
  ```R
  while (TRUE) {
    ...do something cool...
    Sys.sleep(60) # Wait a minute 
  }
  ```

## Activity
- Take some time now to:
  - From within R, create a new table in your database of choice
    - You can name it anything you want, and have whatever columns you want
    - Ensure after creating it that you can access and query this table from your client of choice!
  - From within R, read in a dataframe of all the number of superheros who have each available superhero power and use that to create a histogram in ggplot.
-->

# Break!
## Break Time!
<!--![](../images/nomming_catepillar.png)-->
![](https://64.media.tumblr.com/tumblr_lpc46oU6Cy1qi1pnpo1_500.gifv){width=50%}

# Online Deployments
## Railway Introduction
- There are many different webhosting options these days
- For this class, I was looking for:
  - Something fairly straightforward
  - Something free or very cheap
  - Something that could easily handle both a database and data acquisition
- [Railway.app](https://railway.app/) is what I found


## Railway Details
- As a registered new user, you get $5 free credit that does not expire
  - For what we are using, monthly costs are usually around $2-$2.50, so you _probably_ will not need to pay anything to get through the end of the semester
- Can spin up a Postgres database easily, in the cloud
- Can also launch a Docker container
  - Can build an image directly from a Dockerfile and supporting files in a GitHub repository
  - Can launch an image directly from DockerHub
- You can sign up for Railway using your GitHub account

## Creating a DB
- Creating a new Postgres database is drop dead simple:
  - Create a new project
  - Select "Provision PostgreSQL"
  - Give it a moment, and then it will be up and running!
- You can click on it in your project to bring up options where you can:
  - Add tables
  - Query the data directly
  - Get the all important connection information
- Test it! You should be able to connect using this connection information from your client of choice!

## Deploying a Docker Container: Easy Way
- Recently, Railway added the option to deploy a Docker container directly from an image hosted on DockerHub
- This makes it **very** easy to deploy the `jrembold/web2db` image!
- When it first is added to your project, it will not be running. It gives you some time to make any adjustments or corrections before you can then hit "Deploy" to get it going.

## Deploying a Custom Docker Container
- Railway can deploy Docker containers by building a Dockerfile in a provided GitHub repository
- In your project, select "New" and then choose "GitHup Repo"
- You'll then need to choose Configure/Install GitHub App
  - This will link Railway.app to your GitHub account
  - You can choose to either link it to all your repositories, or, like me, choose specific repositories
  - In that repository, all you need is a Dockerfile and any supporting files that you would normally need to build the image
- Whenever you commit a new change to that repo, Railway will automatically rebuild the Docker image and redeploy it, which is pretty slick

## Railway Environmental Variables
- Railway has available all the database environment variables, but you just need to import which ones you need over to the web2db image (the `DATABASE_URL` one!)
- Select the Docker container and then click on the "variables" tab
- Here you can click the `+ New Variable`{.text} button to add the same variables and corresponding values you had in your `var.env` file!
  - This way you never need to have something like a password or similar saved in text as part of your deployment
  - If you make changes to these variables, Railway will automatically restart the Docker container to ensure it sees the new values!

## Scheduling with Railway
- By design, the Docker container runs the program once and then exits
- Especially on remote systems, this is an efficient use of materials
- Often though, you want to scrape new data on a particular interval, so we need a way to "wake up" and run the Docker container again
- This is done in Railway by setting a _CRON Schedule_
  - This will effectively run redeploy the container and run the program at some given interval
  - Requires specifying a schedule with some special CRON syntax


## CRON Scheduling

![From [crontogo.com](https://crontogo.com/blog/the-complete-guide-to-cron/). Also useful is [this site](https://crontab.guru/)!](../images/cron_expression_syntax.png){width=70%}


<!---------------- SQL ------------------>
# Regrouping to SQL
## SHOWDOWN TIME!
- Teams to be announced on next slide
- Whoever want to be in charge of submitting your answer can change your polling name to a fun team name :)
- Only one computer can be used by the team at a time
  - We'll be using the superhero dataset from HW4
  - Another computer can have the ERD open if you want
  - 4 rounds: each round switch who is typing
- You'll have 5 minutes to answer each question. After 2 minutes you can submit a _single_ answer.
  - Submitting an answer sooner gets you more points, but you have to wait at least the two minutes.


## Showdown Teams
:::::cols 
::::col 
- Nate, Landon, Joshue
- Ryan, Kaiona, Sam P
- Jace, Charlie, Sammy T
- Sarah, Rohan, Kendall
- Hriday, Jeffrey, Cameron
::::
::::col
- Jacob, Tatum, Logan
- Andrew, Simon, Brandon
- Endy, Paxton, Dane 
- Alexander, Dylan
::::
:::::


# A Question of Timing

## Date-Time Reminders
- We have about 4 current date or time related data types
- Date time types:
  - `DATE`: holds a single individual day
  - `TIME`: holds a single individual time
  - `TIMESTAMP` or variants with `TIMESTAMPTZ`: holds a combination of date and time, along with a potential time zone
- Interval types:
  - `INTERVAL`: holds a duration of time


## Extracting Pieces
- Having all the information in one value is convenient, but sometimes you only need pieces
  - The hour from the time, or the month from the date
- These can be particularly important with aggregates!
- Two methods to extract pieces of any datetime or interval type:
  - SQL standard: `extract( part FROM |||datetime_value||| )`{.pgsql}
  - Postgres specific: `date_part( |||part|||, |||datetime_value||| )`
- Both will return a `DOUBLE PRECISION` value of whatever part was requested


## Parts To Extract
- You have a wide variety of what you can extract


:::{style='font-size:.60em'}

::::::cols
::::col
| text         | Description                                                 |
|--------------|-------------------------------------------------------------|
| century      | What century the date is in. 1st century starts 0001-01-01[★]{.orange} |
| day          | What day of the month                                       |
| decade       | The year divided by 10                                      |
| dow          | The day of the week (0-6, starting with Sunday)             |
| doy          | The day of the year                                         |
| epoch        | Number of seconds since 1970-01-01                          |
| hour         | The current hour (0-23)                                     |
| microseconds | The number of microseconds                                  |

::::

::::col
| text          | Description                                      |
|---------------|--------------------------------------------------|
| milliseconds  | The number of milliseconds                       |
| minute        | The minute                                       |
| month         | The month (1-12)                                 |
| quarter       | What quarter of the year (1-4)                   |
| second        | The number of seconds                            |
| timezone      | The timezone offset in seconds                   |
| timezone_hour | The timezone offset in hours                     |
| week          | What week of the year. ISO weeks start on Monday |
| year          | The year                                         |

::::
::::::


:::
<hr>
:::{style='font-size:.5em'}
[★]{.orange} -- If you disagree with this, please write your complaint to: Pope, Cathedral Saint-Peter of Roma, Vatican. 
:::


## Reversing it
:::{style='font-size:.9em'}
- Often times existing data sets have already separated out different aspects of the date or time
  - Year, month, and day might be in different columns for example
- It can be useful to "stitch" these together into an actual datetime type for further use.
- Postgres gives you a handful of functions to do so:
  - `make_date( |||year|||, |||month|||, |||day|||)`: Returns a new `DATE` type value
  - `make_time( |||hour|||, |||minute|||, |||seconds||| )`: Returns a new `TIME` type value (with no timezone)
  - `make_timestamptz(|||year|||,|||month|||,|||day|||,|||hour|||,|||minute|||,|||second|||,|||time zone|||)`: Returns a new `TIMESTAMPTZ` type value
  - `make_timestamp` and `make_interval` also exist
:::
## Aging well
- Subtracting two `DATE` type values will give just an `INT` (in days)
- Subtracting two `TIMESTAMP` type values will give an `INTERVAL`, with the biggest "unit" in days
- Using Postgres's `age()` function can smooth over both and give units larger than days
  - `age( |||datetime₁|||, |||datetime₂||| )`: Subtracts datetime2 from datetime1
- This can **still** give you awkward interval units at times though, so also consider using `justify_interval( |||interval||| )`, which breaks intervals into divisions that don't exceed a categories max
  - Hours would always be between 0 and 23 for instance, or months between 1 and 12
  - Especially if you want to extract a particular part, this is highly recommended


## What time is it?
:::smaller

- Standard SQL also provides constants for grabbing the current system time and date

| function                          | description                                        |
|-----------------------------------|----------------------------------------------------|
| `current_date`                  | Returns the current date                           |
| `current_time`                  | Returns the current time with timezone             |
| `localtime`                     | Returns the current time without timezone          |
| `current_timestamp`[★]{.orange} | Returns the current date and time with timezone    |
| `localtimestamp`                | Returns the current date and time without timezone |

[★]{.orange} -- Postgres also offers the shorter `now()` _function_ to do the same thing

:::


<!--## Current vs Clock-->
<!--- Any query using `current_timestamp` has it computed once at the start of a query-->
  <!--- This is frequently desired for logging, so that you just get 1 consistent time for any records added from a single query-->
<!--- If you **want** record-by-record time keeping, you should use `clock_timestamp()` instead, which will work the same way but be updated before every value written to the table-->


## Time Zones
- Dealing with time zones can be a headache, and it is a very nice feature that Postgres can work with them smoothly
- By default, Postgres will _display_ any timestamp with a time zone with the time as you would measure it in your current system timezone
- What is your current system timezone?
  - `SHOW timezone;`
- Getting general information about timezones:
  - Getting abbreviations:
    - `SELECT * FROM pg_timezone_abbrevs;`
  - Getting full names:
    - `SELECT * FROM pg_timezone_names;`


## Teleportation
- It can sometimes be useful to switch your "current" time zone
  - Maybe it is easier to compare times to someone else living in that time zone
- Several methods to make the switch:
  - Change your `postgressql.conf` file, which controls your Postgres server. Only recommended if you have permanently moved elsewhere and the database time zone has not updated appropriately.
  - Set future queries in a single session to be from a new timezone:
    
    `SET |||timezone||| TO |||time_zone_name_or_abbrv|||;`
    - This will also adjust what values your `localtime` or `localtimestamp` report!
  - Transform a single query to be reported in a different time zone:
    
    ```pgsql
    SELECT |||dt_col_name||| AT TIME ZONE |||tz_name_or_abbrv|||
    FROM |||tablename|||;
    ```

## Activity
- Using the taxi rides dataset, see if you can:
  - Compute the number of rides given each hour of the day
  - Compute the average cost of rides over each day of the month
  - Compute the median cost of rides over each day of the week
  - Compute the average duration of each ride over each hour of the day


## Results to compare against
![](../images/taxi_datetime_activity_results.png){width=70%}


<!--
## Subqueries
- A _subquery_ is simply a query embedded within another piece of SQL
- Commonly used to prepare some data or compute a value to be used by the surrounding SQL
- We have already seen very simple versions
  - Creating a new table, for example
    
  ```pgsql
  CREATE TABLE new_table AS (
    SELECT * FROM og_table
  );
  ```
- Often, you could have saved the subquery as its own table and then used that directly, but not always


## Common Uses of Subqueries
- Computing a single value to use in comparisons
- Computing a single column or list of values to use either directly in a table or in comparisons
- Performing "join-like" operations (potentially in a more flexible manner)
- Filtering one table based on the contents of another table
- Otherwise combining operations that would otherwise have taken several steps


## Filter on single value
- Think back to HW3 where you needed to filter out all the outlier taxi-ride speeds
  - There you had to compute the quartiles, the whisker edges, and then filter directly by typing that number back in
- Subqueries can combine those steps! You can directly compare a value to the subquery output **so long as the query outputs only a single value**

```{.pgsql style='font-size:1em'}
SELECT *
FROM tablename
WHERE colname > (
  SELECT avg(colname2)
  FROM tablename2
);
```

## Derived Tables
- You can use the output of a subquery anywhere you would normally reference a table name
- Such a table is then called a _derived table_
- You **must** give such subqueries a table alias, else you have no way to refer to them

```pgsql
SELECT *
FROM (
  SELECT col1, col2
  FROM tablename
) AS mytable; -- Worlds most boring table alias
```

## Repeating Column Values
- Sometimes it is very useful to have a constant value assigned to an entire column
  - We can't just use an aggregate function (or similar) though, else we get mismatched column lengths
- You can include a subquery _in your selected columns_ if it outputs only a **single value**
  - That value will be propagated to all the rows
- Give it an alias so the column heading has some meaning!

```{.pgsql style='font-size:.9em'}
SELECT
  col1,
  col2,
  (SELECT avg(col2) FROM tablename2) as const_value
FROM tablename;
```


## Subquery Expressions
- Can also use subqueries to filter based on whether certain content is present within the subquery output
- Combinations of a keyword followed by the subquery
  - `expr IN (subquery)` and `expr NOT IN (subquery)`
  - `EXISTS (subquery)` and `NOT EXISTS (subquery)`
  - `expr op ANY (subquery)`
  - `expr op ALL (subquery)`
- `expr` is a column name or value, and `op` is a boolean operator (`=,>,>=`,etc)
- All subquery expressions are evaluated in short-circuit mode: they will return an answer as soon as they can


## IN Subquery
- Will check to see if a term appears anywhere in the subquery output
- The subquery must output a **single column** (though it could be empty)
- `NOT IN` just reverses the situation
- **Be careful of NULLs**
  - `1 IN (1,2,3)` gives True
  - `1 IN (2,3)` gives False
  - `1 IN (NULL,2,3)` gives `NULL`
  - Most dangerous with `NOT IN`, since `1 NOT IN (NULL,2,3)` will give `NULL`, and not True as you might expect


## EXISTS Subquery
- Checks just to see if the subquery has **any** rows in the output
  - No expression to compare to, just looking at subquery output
- The _contents_ of the subquery do not matter at all, so a simple `SELECT 1 FROM ...` is usually used
- To be useful, the subquery is frequently what is called a _correlated subquery_, where it references the table in the outer query
  - These types of subqueries can **not** be run in isolation

```{.pgsql style='font-size:.9em'}
SELECT col1, col2
FROM table1
WHERE EXISTS (
  SELECT 1
  FROM table2
  WHERE table1.col1 = table2.col2
);
```


## ANY and ALL
- At times you don't want to see if a value is _in_ the subquery output, but rather how it _compares_ to the output
- A boolean expression needs to return a single True or False though
- `ANY` and `ALL` "broadcast" the boolean comparison across all the subquery rows
  - `ANY` will return True if _any_ of the rows evaluate to True with the expression
  - `ALL` will only return True if _all_ of the rows evaluate to True with the expression
- `expr = ANY (subquery)` is thus identical to `expr IN (subquery)`
```{.pgsql style='font-size:.9em'}
SELECT col1
FROM table1
WHERE col1 < ALL ( SELECT col2
                   FROM table2);
```

## Subqueries vs Joins
- You may have realized that subqueries can do many similar things to what we used joins for!
- When should each be used?
  - Subqueries are great when you only need information from a single table, but what information depends on another table
  - Joins are necessary when you need information from multiple tables
- Historically, most RDMS were better optimized for joins, but many have improved substantially in recent years for subqueries
  - If in doubt, use the form that makes the most sense for what you _semantically_ want to accomplish. You can always convert it later if better optimization is needed and could be achieved


## Be careful!
- It can be pretty easy to leap off the deep end with subqueries
- Always keep in mind what you want the subquery to achieve, and test it individually to make sure it is doing what you expect
  - Use them where they fit the needs of the situation
  - A nice aspect of subqueries is that you can work from the "bottom-up", starting with smaller subqueries you know and building on them
- **Organize your queries nicely!**
  - This becomes even more important with subqueries


## Pair Activity!
- I've prepped a simple grades database [here](../activity_data/assignment_submission.sql) that you can download and run the sql file to create a `mar09` schema in your database of choice which will contain three common tables: `roster`, `assignments`, `submissions`
- See if you can answer the following **without using joins** and only subqueries (as practice).
  - Which student(s) have turned in _nothing_?
  - Which assignment name has the worst scored average overall?
  - Which students scored below average on every quiz?
- Pairings
  - Karol and Jullian
  - Megan and Rochelle
  - Nina and Ellie
  - Andrew and Liam
  - Maribel and Connor

## Common Table Expressions
- Sometimes you may want to use the same derived tables multiple places in a query
- You could copy and paste those subqueries, but there is a better way
- _Common Table Expressions_ or CTEs are a way for you to define up front a derived table that can be referenced anywhere in the rest of the query
- To do so, you use a `WITH` keyword:
```pgsql
WITH
  derived_table_name (new_col_names) AS (
    subquery
  )
SELECT ...
```

## CTE Details
- The number of columns returned by your subquery **must** match the number of column names you initially define
  - If you aren't going to rename the columns, you do not even really need to include the new column names, but it can be nice for clarity
- You do not need to include data types, as these will be inherited from the subquery columns
- You can include multiple derived tables in the CTE statement, separated with a comma
- Good use of CTE's can help clarify and simplify queries
  - _Especially_ if you have a subquery that you need to reference a few times, use a CTE!
-->