---
title: "Remote Joins"
author: Jed Rembold
date: February 5, 2025
slideNumber: true
theme: nord_light
highlightjs-theme: nord
width: 1920
height: 1080
transition: slide
---


## Announcements
- Homework
  - Did HW3 get submitted?
  - Homework Solutions are being posted! You'll need a password, which I will include as a comment to the last problem feedback.
  - Homework 4 is posted
  - Deadline adjustment! Tuesday or Wednesday nights? Vote in Discord!
<!--- Polling: [rembold-class.ddns.net](http://rembold-class.ddns.net)-->
- For next week, read:
  - DeBarros: Ch 8


<!--
## The Big Picture
![Storage and Retrieval Island](../images/ch3_island.png){width=60%}


## Why so technical?
> "On a most fundamental level, a database needs to do two things: when you give it some data, it should store the data, and when you ask it again later, it should give the data back to you"

- Why should a data engineer care about the ins and outs of database storage and retrieval?
  - **Not** so you can try to design better storage systems from scratch
  - _So you can choose a storage engine to match your needs_
    - A real focus here is on knowing just enough of the field and terminology so that you could make informed decisions and understand documentation that might help you make those decisions.


## A case of indexing
- Writing data to storage and reading the data back from storage frequently have very different needs
- Most methods will enact tradeoffs between the two
  - Storing something quickly will likely make it more difficult to find later
    - Example: Tossing a book into an empty spot on your bookcase
  - Storing something so that it can be quickly found likely takes longer
    - Example: Placing your book in the alphabetically correct location by author
- Using an _index_ will generally assist with retrieval, but costs time on writes to update the index
  - Often only some columns will be indexed for fast retrieval


## Beavers Love them
::::::cols
::::col
- A _log_ in a database sense is an **append-only** sequence of records
- Updated records just have new values assigned later in the log
- Most recent entries to the log have priority
::::

::::col
![A Log](../images/db_log.png)

::::
::::::


## #Hash Indexing
- Stores key-value pairs, where the key is used to determine a _hash_
- Imagine a hash as a sign that says: "Look right here for this value"
- Log of transactions and values are stored on disk, but the hash-map is stored in memory
- When a log gets too large, a new one is started and the old log is compacted
  - New log segment gets a new hash-map in memory
  - Logs are later merged to further compact them
- Pros: sequential writes are good! Crash recovery is simple
- Cons: All the hash maps need to fit in memory. Range queries inefficient


## SSTables
- Still a form of log, so append-only
- Keys are stored in segments in sorted order

![Example SSTable structure$^1$](../images/sstable.png)


## LSM-Tree: Writing
- A _log-structure merge-tree_ stores new writes initially in a _memtable_
- The memtable is really a tree structure that sorts the keys as they come in
- When the tree gets too large, the keys are flushed to a new SSTable on disk

![LSM to SSTable$^1$](../images/lsm_to_sstable.png){width=70%}

## LSM-Tree: Reading
- Would normally still look for a corresponding key in the most recent segment
- Segments can often use an in-memory _sparse index_ that takes advantage of the ordered state of the keys for faster look-ups

![Improving finding speed with a sparse index$^1$](../images/sparse_index.png)


## LSM Cost/Benefit
- Pros
  - Faster look-ups owing to ordered keys
  - Not all keys need to exist in memory
  - Works very well for write-heavy situations, as it only performs sequential writes (which require less disk scanning)
- Cons
  - A crash would lose all the data currently in the memtable that hasn't been written to an SSTable
    - Can also keep an unordered hash table log on disk of all writes whose only existence is to restore the memtable in case of crash
  - Searching for files not in the database still takes a long time as all segments must be searched
    - Using _bloom filters_ can help by immediately reporting if the key is not in the database


## B-Trees
- Most widely used storage architecture
- Constrains segments of key-value pairings by size (often around 4 KB)
- Keys can either point to the value in memory, or they can point to another segment
  - Segments pointing to other segments is where the tree structure comes in


## B-Tree Look-up
![Showing the structure and looking up a key in a B-Tree](../images/b-tree-loopup.png){width=80%}

## B-Tree Writing
- The tree is traversed and the key is attempted to be found
  - If it is, the reference to the value is change **in place**
  - If not, the key and value are added to that correct segment
- If a segment is full (fixed size remember), then it is split into two new segments, each half full.
  - This also forces the parent segment to need to be updated (and possibly split)


## B-Tree Cost/Benefit
- Pros
  - Most values can be accessed in less than 4 segments, making for **very fast read speeds**
    - Also leads to more predictable response times
  - Store each key in only one location, so less chance of competing processes getting confused
- Cons
  - Crashes mid-write can be highly problematic
    - Often keeps a _write-ahead log_ of what is going to be done in case of crash, but this doubles the amount to be written
  - More easily fragment as a result of small amounts of space left between segments


## Different Use Cases
- Large-scale differences in how databases will be used
- OLTP: Online Transaction Processing
- OLAP: Online Analytic Processing

![View of a typical pipeline](../images/oltp_vs_olap.png){width=50%}


## Transaction vs Analytics

| Property    | Transaction Processing (OLTP)                | Analytic Processing (OLAP)              |
|-------------|----------------------------------------------|-----------------------------------------|
| Read trend  | Small number of records per query            | Aggregate over large numbers of records |
| Write trend | Random-access, low latency writes from users | Bulk import (ELT)                       |
| Used by     | End user or customers                        | Internal analyst                        |
| Represents  | Latest state of data                         | History of data events                  |
| Sizes       | Gigabytes to terabytes                       | Terabytes to petabytes                  |


## Analytic Processing
::::::cols
::::col
- Needs to focus on handling huge queries and aggregates
  - Relational databases a good fit for this
- Frequently end up with "star" or "snowflake" type schematics
  - One central fact table with other surrounding tables containing details
::::

::::col
![A simple star schema](../images/star_schema.png)

::::
::::::

## Analytic Optimizations
- Column-data stored together
  - As opposed to OLTP solutions, where data from a row is stored together
  - Most analysis looks at a single column over all rows, so this makes reads faster
- Compressing column data as bitmaps
  - Makes boolean searches very speedy, as binary arithmetic can be used
- Computing _data cubes_
  - Pre-computing common (and computationally heavy) aggregates in their own table

## Back to the Big Picture
![Storage and Retrieval Island](../images/ch3_island.png){width=60%}
-->

# Connecting Remotely
## Remote Sessions
- An extremely useful aspect of working in a shell is the simplicity with which you can connect to other remote systems
- The program usually used to do so is called `ssh`, standing for "secure shell"
- To log into a remote server, the command looks something like:
  ```bash
  ssh {user name}@{ip address or domain name}
  ```
  where
  - `user name`{.bash} is your user name on the _remote_ server (which may be different than your local name)
  - `ip address or domain name`{.bash} is either the direct ip address of the server (eg. 165.213.13.194) or the domain name (myserver.net)
  - Some servers may require a special port as well, which can be indicated with `-p`


## Working Remotely
- Upon remote connection to a server, you will just end up in another shell, usually another Bash shell
- This is where being comfortable in a shell can really shine!
  - Anything you could normally do in a shell, you can do here instead, but it happens on the remote system
- When you are done, you can type `exit`{.bash} to leave the remote connection and return to your local shell
  - Pay attention to your shell prompts! It is easy to get confused of if you are on the remote server or on your local system!


## Executing commands directly
- If you just need to run a quick command on the remote system, you can do so directly from `ssh`
- Simply add the desired command as the last `ssh` argument
  ```bash
  ssh username@server mycommand
  ```
  - This runs the command on the remote server, but then gets the stdout and brings it back to be displayed in your local stdout
  - This means you could then pipe _that_ output into a program that only exists locally on your system to further process the data!

# Connecting Efficiently
## Streamlining Security
- Entering in your account password each time isn't onerous but can be inconvenient
  - It makes it impossible to schedule automatic tasks that would connect to a server, for instance
- Instead of a password, you can take advantage of an _ssh key_, which uses a public/private key authentication system
  - You generate (or use an existing) public and private key pair on your system
  - You upload the public key to the server you want to be able to connect to
  - The private key always stays only on your system. It is not shared!


## Using Keys
- To create a new key, you can use
  ```{.bash style='font-size:.9em'}
  ssh-keygen -t ed25519 -C {desc comment}
  ```
  - You will be asked for a passphrase for the key. You can go without and the key will still be much more secure than most password systems, but you could also add a passphrase necessary to "unlock" the key
  - Two files will be created inside your `.ssh` folder in your home directory: one with just `id_ed25519` and one with `id_ed25519.pub`
- To copy the public key over to the desired server:
  ```{.bash style='font-size:.9em'}
  ssh-copy-id {username}@{servername}
  ```

  - You'll need to enter in your password one more time, but then the key contents will be copied over


## SSH Config
- Often, you are connecting to the same servers again and again
  - It would be nice not to have to repeat information about user name, server location, ports, etc
- Instead, you can set up "profiles" in your `.ssh/config` file
- A general profile entry might look something like:
  ```text
  Host {profile_name}
    User {username}
    HostName {domain name or ip address}
    Port {port, if not default}
  ```
- There are more options and settings that can be configured. See `man ssh_config`.

# Connecting Transfers
## Copy That
- One important thing that you might frequently need to do is copy files between your local system and remote server
- Here you have options
  - SSH + `tee`: The `tee` command "splits" a stream, displaying it both to stdout and writing it to a file at the same time. You can thus do things like:
    ```bash
    cat local_file | ssh remote_server tee remote_file
    ```
    using normal pipes
  - `scp`: The `scp` command combines normal `cp` and `ssh`, allowing you to include a remote server in the standard format
    ```bash
    scp local_file remote_server:remote_file
    ```

## Sync it!
:::{style='font-size:.9em'}
- The previous options can be nice for just copying single files, but what if you need to copy over entire folders?
- `rsync` is probably your best option
  ```bash
  rsync -avP local_folder/ remote_server:remote_path
  ```
  - Clever about what is transferred: only copying over necessary data that isn't already present on the other system
  - Can maintain file/folder permissions, links, etc
- Common options
  - `-a` is for archive, and basically means: "make a perfect copy"
  - `-v` is for verbose, to output more information as it is copied
  - `-P` is for partial and progress, so that partial transfers will resume and progress output to the screen
:::

## Your Turn!

:::{style='font-size:.9em'}

I emailed you all earlier with a server address and login information. Use that to work through the following:

- SSH into the server using your information and change your password using `passwd`. Note that when you type in passwords on most shells, they will not **show anything** for security but are indeed recording what you type.
- How many files are in your remote home directory initially? Some might be hidden!
- Exit out, and on your _local_ system generate an SSH key. Copy the public key over to the server. Ensure you can log in to the server now without needing your server password!
- Set up a simple profile in your `.ssh/config` file to facilitate connecting to this server


:::

## Children Processes
- Whenever another program launches another, the program that is launched is said to be a _child process_
  - This includes any program launched from the shell being a child process of the shell itself!
- When a parent program is terminated, part of a "clean" termination involves shutting down any child processes that were spawned by the parent
  - This helps prevent unnecessary or unwanted programs from continuing to run in the background and essentially doing nothing
- This can be unfortunate when remotely connecting, as it means you can not leave a program running
  - Running a multiplexer like tmux can assist with this


# Break Time!
## Break Time!
- Stretch! Eat! Don't think about data for half an hour!

<!--
## Understanding Check
::::::cols
::::col
Suppose you have a table of items with the schema shown below and wanted to find the most common difference between Price A and Price B. You are looking for bargains, so in the case of a tie, you'd like to know the biggest. Which query will deliver on your wish?

```{.pgsql style='font-size:.8em'}
CREATE TABLE rev (
  "name" TEXT,
  "pa" NUMERIC(4,2),
  "pb" NUMERIC(4,2)
);
```
::::

::::{.col style='flex-grow:1.2'}
```{.pgsql style='font-size:.6em' .answer name='A'}
SELECT mode(pb - pa) FROM rev;
```
```{.pgsql style='font-size:.6em' .answer name='B'}
SELECT
  mode() WITHIN GROUP (ORDER BY pb) -
  mode() WITHIN GROUP (ORDER BY pa)
FROM rev;
```
```{.pgsql style='font-size:.6em' .answer name='C'}
SELECT
  mode() WITHIN GROUP (ORDER BY (pb-pa))
FROM rev;
```
```{.pgsql style='font-size:.6em' .answer name='D'}
SELECT
  mode() WITHIN GROUP (ORDER BY (pb-pa) DESC)
FROM rev;
```

::::
::::::
-->

# Selecting Across Tables
## Linking Tables
::::::{.cols style='align-items: center'}
::::col
- Our whole idea of breaking apart data across multiple tables was prefaced on the fact that we could pull it back together when needed
- There is nothing special about the linkages: we can link any rows that we want
- The act of collecting data from multiple tables based on particular rows and columns is called a _join_ in SQL
::::

::::col
![](../images/join-me-vader.jpg){width=100%}
::::
::::::

## Creating the Join
- A join pulls information from multiple tables into a new table (since all queries return a table)
- The columns that are matched across tables are called _keys_
- The general idea is then to:
    - Set up your selection as usual from a single table
    - Join to that table another table
    - Specifying what columns in each table will act as keys along with a conditional relating them
        - Most common condition is equality

```{.pgsql style='font-size:0.6em'}
SELECT * FROM table_a
JOIN table_b
ON table_a.key_col = table_b.key_col;
```

## Column Names
- When you start refering to multiple table names in your query, you might get overlapping column names
    - Columns names must be unique **within a table** but might be the same **across tables**
- To avoid ambiguity, you can preface a column name with the table it is coming from, separated by a period
- This is useful both for selecting the join key columns, but also for selecting particular columns you want out of the joined table

```pgsql
SELECT tab1.name, tab1.age, tab2.name
FROM tab1
JOIN tab2 ON tab1.age = tab2.age;
```

## Cross Join
- Sometimes you want to see _all the possible_ combinations between the rows of two tables
    - Sometimes called a _cartesian product_
- A `CROSS JOIN` will return a table of all of these possibilities
- Could imagine cross joining all the values with all the suits to generate your standard 52 card deck of playing cards.
- These can get very large very fast!
    - Do **not** run on tables of millions of rows!


## Cross Joins Visualized
<div class="fig-container" data-file="../images/d3/CrossJoin.html" data-scroll="no", data-style="width:90%; display:inline;"></div>


## (Inner) Join
- The basic join only keeps rows from table 1 and table 2 that matched on the given column keys
    - This is also called an _inner join_
    - Essentially a cross join with a filter statement
- Any row in table 1 that had no counterpart in table 2 is left out
    - Identically for any row in table 2 that had no counterpart in table 1
- The key take-away is that it keeps what was in **both** tables
- If a value appears twice in one table, it will be duplicated in the joined table as well
    - One reason that many times people try to join on columns that hold unique values, but not always necessary


## Inner Joins Visualized
<div class="fig-container" data-file="../images/d3/InnerJoin.html" data-scroll="no", data-style="width:90%; display:inline;"></div>


## Left and Right Join
- Sometimes, you don't want to include _only_ the rows that were in both table
- Maybe you want **all** the rows from one table, but joining the other data when it is available
- In these cases, you can use a `LEFT JOIN` or `RIGHT JOIN`
    - `LEFT JOIN` is decidedly the more common, and you can make any `RIGHT JOIN` a `LEFT JOIN` just by flipping the table ordering
- Rows still need to have the same number of columns, so `NULL` values will be inserted for the secondary table columns if it is missing a match


## Left Joins Visualized
<div class="fig-container" data-file="../images/d3/LeftJoin.html" data-scroll="no", data-style="width:90%; display:inline;"></div>


## FULL OUTER JOIN
- On occasion, you just want _all_ the data from both tables
    - Matching where possible
    - But keeping data from both left or right tables if no match
- In these cases, a `FULL OUTER JOIN` will do what you want
    - Essentially does a `LEFT JOIN` followed by a `RIGHT JOIN` with the existing table
- Anything without a match is still represented with `NULL` values


## Outer Joins Visualized
<div class="fig-container" data-file="../images/d3/OuterJoin.html" data-scroll="no", data-style="width:90%; display:inline;"></div>


# Practicing Joins

## Gotta Practice
- The difficult part of joins isn't understanding the vocabulary of what each does, it is in understanding for a give question and data model what type of join you want to be able to answer the question.
- You can do the following practice parts without data, but sometimes it helps to have something to play with and visualize
  - Data [here](../activity_data/assignment_submission_joins.sql)

## Practice

![](../images/hw_assignment_schema2.png){width=70%}

:::{.center .quip}
First names of students who have submitted any assignment?
:::

## Practice

![](../images/hw_assignment_schema2.png){width=70%}

:::{.center .quip}
Student ID and name of the assignment for all perfect scores?
:::

## Practice

![](../images/hw_assignment_schema2.png){width=70%}

:::{.center .quip}
Number of assignments with no submissions?
:::

## Practice

![](../images/hw_assignment_schema2.png){width=70%}

:::{.center .quip}
All combinations of students and homework assignments?
:::


# Complicating Joins
## Multiple Join Conditionals
- You are not limited to just a single condition in your `ON` statement!
- You can chain multiple conditions together with `AND` or `OR`, just like you could with `WHERE`
- Just recall when comparing two rows that ALL the conditions must be true for the data to be included in the joined table
```pgsql
SELECT *
FROM table1
JOIN table2
ON table1.col1 = table2.col1 
   AND table1.col2 > table2.col2;
```

## Table Aliases
- Including long table names before each column name when referring to information from two different tables can get tedious
- You can define aliases for table names just like you can for column names!
- Syntax looks just like column aliases, using `AS`
- Can come immediately after you first reference a table name
  - Usually after a `FROM` or `JOIN` statement
- In truth, the `AS` is optional, but it helps some with readability
```{.pgsql style='font-size:.9em'}
SELECT *
FROM tablename AS tn
JOIN tablename2 AS tn2
ON tn.col1 = tn2.col2;
```


## Multiple Joins
- Nothing stops you from including multiple joins in your query
- Each additional join links back to the current growing joined table
  - This means a second join is treating the entire initial join as the "left" table
  - Commonly, you'll just be joining back to the original table, so it won't be apparent

```pgsql
SELECT *
FROM tablename AS t1
JOIN tablename2 AS t2 ON t1.col1 = t2.col1
JOIN tablename3 AS t3 ON t1.col2 = t3.col1;
```

## More Practice

![](../images/hw_assignment_schema2.png){width=70%}

:::{.center .quip}
First names of all individuals who are missing at least one assignment (no submission made)?
:::


## Self Joins
- You can actually join a table to itself!
- Why would you want to do this?
  - Hierarchy data can frequently be explored in this fashion
  - Comparisons between rows of a table
- You **need** to give unique aliases when doing this, or else you won't have a way to distinguish between which columns you want


## Joins as Venn Diagrams
- Sometimes it may help to think of different types of joins as Venn diagrams

![](../images/joins_venn_diagrams.png){width=60%}


<!--
## Image/Animation Credits
- [1] https://yetanotherdevblog.com/
- https://dataschool.com/how-to-teach-people-sql
-->