---
title: "Basic Tooling"
author: Jed Rembold
date: May 19/20, 2026
slideNumber: true
theme: catppuccin
highlightjs-theme: catppuccin-mocha
width: 1920
height: 1080
transition: slide
---


## Announcements
- Spring grades are in! So I can focus on you all now!
- Discord invite went out as Canvas announcement
- Also in announcement: a form with
  - Group preferences and strengths
  - Public SSH key (more on that in a bit)
- For next week:
  - Have the form filled out by end of Saturday please
  - Keep an eye out for connection credentials


## Evenings Plan
- SSH Remotes
- Docker Compose
- Parquet
<!--- DuckDB-->


# SSH Remotes

## Ensuring SSH Access
- On Windows:
    - Likely already have it installed. If not recognized, can enable by going to Settings → System → Optional Features and adding OpenSSH Client
    - Alternatively, install (or use) WSL2 (the Windows Subsystem for Linux). The Linux instance you get will already have SSH, so you use it through the WSL2 terminal
- On MacOS:
    - It is already there
- On Linux:
    - It is likely already there, or just install the OpenSSH package using your package manager

## Remote Sessions
- The program usually used to do so is called `ssh`, standing for "secure shell"
- To log into a remote server, the command looks something like:
  ```bash
  ssh {username}@{ip address or domain name}
  ```
  where
  - `username`{.bash} is your username on the _remote_ server (which may be different than your local username)
  - `ip address or domain name`{.bash} is either the direct ip address of the server (eg. 165.213.13.194) or the domain name (myserver.net)
  - Some servers may require a special port as well, which can be indicated with `-p`

## Streamlining Security
- Entering in your remote account password each time isn't onerous but can be inconvenient
- It is also only as secure as the quality of your password, which varies wildly
- Instead of a password, you can take advantage of an _ssh key_, which uses a public/private key authentication system
  - You generate (or use an existing) public and private key pair on your system
  - You upload the public key to the server you want to be able to connect to
  - The private key always stays only on your system. It is not shared!


## Using Keys
- To create a new key, you can use
  ```{.bash style='font-size:.9em'}
  ssh-keygen -t ed25519 -C {desc comment}
  ```
  - You will be asked for a passphrase for the key. You can go without and the key will still be much more secure than most password systems, but you could also add a passphrase necessary to "unlock" the key
  - Unless you name it differently, two files will be created inside your `.ssh` folder in your home directory: one with just `id_ed25519` and one with `id_ed25519.pub`

## Copying Keys
- Just creating the keys doesn't do anything special. You need to copy the public key over to whatever server you want to connect to.
- To copy the public key over to the desired server in Linux/MacOS:
  ```{.bash style='font-size:.9em'}
  ssh-copy-id {username}@{servername}
  ```
- In Windows:
  ```{.powershell style='font-size:.9em'}
  type $env:USERPROFILE\.ssh\id_ed25519.pub | ssh {username}@{servername} "cat >> ~/.ssh/authorized_keys"
  ```
  - Or install GitBash or WSL2


- You'll need to enter in your password one more time, but then the key contents will be copied over


## SSH Config
- Often, you are connecting to the same servers again and again
  - It would be nice not to have to repeat information about user name, server location, ports, etc
- Instead, you can set up "profiles" in your `{HOME}/.ssh/config` file
- A general profile entry might look something like:
  ```text
  Host {profile_name}
    User {username}
    HostName {domain name or ip address}
    Port {port, if not default}
  ```
- There are more options and settings that can be configured. See `man ssh_config`.


## Transferring Files
- You can transfer files between your local computer and a remote host in a few different ways
  - `scp`: The `scp` command combines normal `cp` and `ssh`, allowing you to include a remote server in the standard format
    ```bash
    scp local_file remote_server:remote_file
    ```
  - `rsync`: Syncs entire folders if desired, only transferring what information is necessary
    ```bash
    rsync -avP local_folder/ remote_server:remote_path
    ```
  - Use GitHub as an intermediatary. Write and push changes from your local system to GitHub, and then pull them down to the remote.

## Tunnels
- Often, remote software might be running that offers up servers on a **remote** port
  - A Postgres database, or a webserver
- Trying to access those services on your local machine, even if you SSH into the remote, will not work!
- So how can we still connect to those services on your _local_ machine?
  - By utilizing SSH tunnels!


## Local Tunnels
- There are technically two types of tunnels, but the type that will be most useful to us are called _local tunnels_
- Connect a resource that is available on the **remote** to a location on our _local_ system
- Requires adding an optional flag when calling ssh:
  ```{.bash style='white-space: pre'}
  ssh -L {local_dest}:{remote_target} {user}@{remote}
  ```
  - The local destination will usually just be some port on the localhost: `localhost:8080`{.bash}
  - The remote target is wherever the target is residing, often at a port on the other machine: `localhost:5432`{.bash}


## Config Tunnels
- If you _always_ want a tunnel to be created when you connect to a remote, you can set this up in your SSH config!
  ```text
  Host {profile_name}
    User {username}
    HostName {domain name or ip address}
    Port {port, if not default}
    LocalForward {local port} {remote target}
  ```
- You can create as many tunnels as you might need!

## Limiting Holes
::::::cols
::::col
- When connecting to a remote network, often times you want to make many systems or services on that network available
- Poking a multitude of holes in the firewall, one for each service, is not recommended
- Instead, generally you poke a single hole, and channel all traffic through there
::::

::::col
![Using a jump host or "bastion"](../images/jump_host.excalidraw.png)
::::
::::::

## SSH Jump Usage
- We can specify we want to use a jump host in various ways:
  - From the CLI:
    ```bash
    ssh -J {user}@{jumphost} {user}@{service}
    ```
  - From SSH Config
    ```text
    Host {your profile name}
      Hostname {service}
      User {user}
      ProxyJump {user}@{jumphost}
    ```
    - Note that the `ProxyJump` can also just point to another profile name
- Tunnels still work seamlessly! They will be created passing through the jump host


## Practice
:::{style='font-size:.9em'}

I emailed you all earlier with a server address and login information. Use that to work through the following:

- SSH into the server using your information and change your password using `passwd`. Note that when you type in passwords on most shells, they will not **show anything** for security but are indeed recording what you type.
- How many files are in your remote home directory initially? Some might be hidden!
- Exit out, and on your _local_ system generate an SSH key. Copy the public key over to the server. Ensure you can log in to the server now without needing your server password!
- Set up a simple profile in your `.ssh/config` file to facilitate connecting to this server
- This server has a webpage being served at port 8080. Can you create a tunnel to view its contents?


:::

# Break Time

# Docker Compose

## Why Compose?
- Docker by itself is great for single, isolated containers to accomplish a task
- But what if we have a collection of containers that all need to work together?
  - Eg. A database, a web scraper, and an API container?
- We could launch them and configure them all independently, but might result in ordering errors
  - The database must be running before the scraper could properly dump data to it, for instance
- Docker Compose allows us to _declaratively_ define all the settings for a collection of containers that we want to work together
  - This includes dependencies! Where one container must be running before another is started


## Installation of Compose
- Did you install Docker through Docker Desktop? The congrats! You already have Docker Compose.
- Otherwise (on Linux), install it through your package manager


## Anatomy of a Compose file
::::::cols
::::col
- Docker Compose utilizes a `docker-compose.yml` file to declare everything that should happen
- This includes:
  - Things that might affect multiple containers: `services`, `volumes`, `networks`
  - Things that affect an individual container: `image`, `ports`, `environment` variables, etc.
::::

::::col
- Docker run:
  ```bash
  docker run -p 8080:80 nginx:alpine
  ```
- Docker Compose:
  ```yaml
  services:
    web:
      image: nginx:alpine
      ports:
        - "8080:80"
  ```
::::
::::::

## Services
:::{style='font-size:.9em'}
- The predominant section of the Docker Compose file is usually the `services`
- Here you can name each of the service containers you want to create, along with customizing them:
  - `image: {name}` - the name of the Docker image you want the service to be based on
  - `build: .` - build an image from a `Dockerfile` present in this directory
  - `container-name: {name}` - an optional more meaningful container name 
  - `ports:` - a list of port mappings from host to container
  - `volumes:` - a list of volume mappings from host to container. Can be to local directories or to a more abstract volume
  - `environment:` - a collection of environment variables
  - `depends_on:` - list of other services this service depends on
:::

## YAML Gotchas
- Some things to be aware of when working with YAML files:
  - YAML is white-space sensitive!
  - Indent using spaces (not tabs), usually 2 spaces
  - Lists are created with an indented series of lines starting with `-`
  - Mappings (a dictionary) are created with an indented series of lines comprised of `key: value` pairings
- Older versions of Compose required a `version` key at the top. That is no longer necessary, but you'll still see it in many online Compose files

## Interconnectivity
- All services defined in a Compose file share an isolated network
- You can refer to service names directly (either in the Compose file or in code!) instead of always needing to reference localhost
  - This can streamline how different containers can talk to one another, without having to poke a millions port holes (ahoy!) in each container



## Persistence and Volumes
- Docker images are generally isolated. Which includes their local storage
  - If you restart the image, you start afresh with everything
- Sometimes we very much **want** persistance though (hello databases)!
- We declare a desire for persistence in Compose files by defining _volumes_, of which there are two main types:
  - Named volumes
  - Bind mounts


## Named Volumes
- A named volume essentially sets aside some space (and gives it a name) to be managed by Docker
- This space will persist across container restarts/rebuilds, and host reboots
- Best for long running persistence
  ```yaml
  services:
    db:
      image: postgres:17
      volumes:
        - pgdata:/var/lib/postgresql/data

  volumes:
    pgdata:
  ```

## Bind Mounts
- Bind mounts are basically akin to what we can do with the `-v` flag when using `docker run`
- Maps a local directory on the host to a directory in the container
- More tightly controls where on the host the data is stored
- Especially useful if you want to be able to add data to a folder and then have it accessible inside the container
  ```yaml
  services:
    db:
      image: postgres:17
      volumes:
        - ./pgdata:/var/lib/postgresql/data
        - ./mntdata:/mnt/data
  ```

## Compose Networking
- When you launch containers using Docker Compose, it automatically creates an isolated network upon start that all the containers belong to
- Containers are discoverable to one another: just using their service name as that hostname
- Inside the container, `localhost` means **that** container, not the host system.
- If you want to access content exposed inside a container, you must set up a port mapping
  ```yaml
  services:
    db:
      image: postgres:17
      ports:
        - "5050:5432"
  ```

## The lifecycle
- Now that we have these "stacks" of Docker containers defined in our Compose file, how do we use them?
- In the same folder as the `docker-compose.yml` file:
  - `docker compose up` starts everything up, with logging in the foreground
  - `docker compose up -d` starts everything up detached (in the background)
  - `docker compose down` stops and removes containers
  - `docker compose up --build` will rebuild anything that used a custom Dockerfile
  - `docker compose restart {service}` will restart a specific service from the stack

## Example
- Suppose we wanted to mimic a piece of our old setup from 503 where we have a scraping container dumping into a database container



# Parquet: The Filetype of Big Data

## What is a Parquet file?
- Parquet files are a _columnar_ storage format which enables:
  - Far better efficiency and compression
  - Schema support (data types and column names!)
  - Splittability (ideal for distributed processing)
- Example ideal use case: reading and processing 1 column from a 10GB table
- The price you pay: it is no longer human-readable natively
  - Honestly, it is worth the price, provided you are dealing with big enough data


## How?
- Parquet files are chunked up in a hierarchy:
  - Row groups represent large chunks of rows (often 10k+)
    - Each row group is chunked up into columns
      - Each column is chunked up into pages, which encode and compress the data
- Metadata is included both on the entire table and each column in each row group
  - Includes statistics, column types, row group information
- Each column can actually be compressed differently, depending on what is most ideal


## The Importance of Sorting
- Parquet is built around being fast and space-efficient for big data sets
- Proper sorting of a table can have **huge** effects on both of these!
  - It is possible to have a parquet file that offers very little benefit over something like a CSV if you do this actively badly


## Sorting for Row Groups
- Recall the first order of division is by a group of rows, before being broken up by column
- If data in a column is sorted, then this means the row group has a clear min and max, and these are stored in the metadata
- If filtering over the entire table then, only the metadate needs to be checked to see if a particular row group even needs to be accessed
- This is a cousin to the idea of creating an index on a column
  - Different approaches, but both with the idea of limiting how much information you need to look at to find the stuff you want


## Sorting for Compression
::::::cols
::::col
- Because columns are the fundamental storage unit (within their pages), Parquet files often leverage what is called _run-length encoding_
- This specifies that for repeat items in a row, only the number of repeats is stored
::::

::::col

| State | Town        | Name    |
|-------|-------------|---------|
| OR    | Salem       | Jed     |
| OR    | Salem       | Kristen |
| OR    | Portland    | Calvin  |
| OR    | Portland    | Hank    |
| OR    | Portland    | Rachel  |
| CA    | Los Angeles | Todd    |

::::
::::::

## The Big Takeaways
- Sorting on columns with low cardinality can make for huge space savings
- Sorting on columns likely to be heavily filtered can greatly speed up future query results


## Parquet Data Types
- One of the other benefits of Parquet files is that they maintain a schema
- That means understand what data types Parquet files support though
- Generally these can be broken down into two groups:
  - Physical Types
  - Logical Types


## Physical Types

| Type | Description                   |
|---------------|-------------------------------|
| `BOOLEAN`     | True/False                    |
| `INT32`       | 32-bit signed integer         |
| `INT64`       | 64-bit signed integer         |
| `FLOAT`       | 32-bit float                  |
| `DOUBLE`      | 64-bit float                  |
| `BYTE_ARRAY`  | Binary blobs or UTF-8 strings |

## Logical Types

| Type               | Backed by             | Description                   |
|--------------------|-----------------------|-------------------------------|
| `UTF8`             | `BYTE_ARRAY`          | Strings                       |
| `DECIMAL`          | `INT` or `BYTE_ARRAY` | high-precision (fixed) floats |
| `DATE`             | `INT32`               | Days since Unix epoch         |
| `TIME_MILLIS`      | `INT32`               | Time of day (ms precision)    |
| `TIMESTAMP_MICROS` | `INT64`               | DateTime with ms precision    |
| `LIST`             | Repeated fields       | Arrays/lists                  |
| `MAP`              | Nested fields         | Dictionaries                  |

## Potential Gotchas
- Most types map pretty clearly from how they are defined in other systems like SQL, Python, or R, but a few things to be aware of:
  - Some tools, especially Pandas, don't always preserve decimal precision when writing Parquet unless specifically commanded
  - Some tools may treat text as raw binary blobs unless explicitly set to strings for UT8
  - NULLS are handled a bit differently.
  - If working with compound data structures, ensure you are using a tool like Spark or DuckDB (coming soon!) that understand these types well
  - Some databases like MySQL store booleans as "TINYINT", which will map to `INT31` in Parquet unless forced to `BOOLEAN`

## Parquet Partitioning
- On top of the other ways the Parquet breaks down big datasets, it also supports one called _partitioning_
- Partitioning essentially splits a parquet file up by the values in some particular specified column
  - This is very similar to a physical manifestation of how `GROUP BY` works actually
- Writing out a partitioned parquet file will result in a folder for each unique value in the partitioned column
  - Again, this speeds up access. Any queries dealing with that value can jump immediately to the important file
  - Partitioning by things like year or day can be very common
- Tools like Spark or DuckDB can natively handle these partitions and optimize queries accordingly


## Working with Parquet
- In Python with Pandas (needs `pyarrow`)
  ```{.python style='white-space: pre; font-size:.75em'}
  df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [95, 100]})
  df.to_parquet('students.parquet', engine='pyarrow')
  df2 = pd.read_parquet('students.parquet', engine='pyarrow')
  ```
- In R (needs the `arrow` or `nanoparquet` library)
  ```{.r style='white-space: pre; font-size:.75em'}
  df <- data.frame(name = c("Alice", "Bob"), score=c(95,100))
  write_parquet(df, "students.parquet")
  df2 <- read_parquet("students.parquet")
  ```

## Activity (if time)
- Download the most recently available yellow taxi ride data from the NYC Taxi Rides dataset [here](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2026-03.parquet)
- Load and inspect the metadata to determine how many rows it has per row-group

::::::cols
::::col
:::{.block name=Python}
```python
import pyarrow.parquet as pq
df = pq.ParquetFile(filename)
df.metadata
```
:::
::::

::::col
:::{.block name=R}
```r
library(nanoparquet)
ds <- read_parquet_metadata(filename)
```
:::
::::
::::::


<!-- next week
# The Database of Ducks

## A Versatile Tool
- `DuckDB` is an analytic database engine (column oriented)
- It is embeddable. It runs in the same process as your application
- No server is required. It can create a database in memory, or write to a file
  - Like SQLite but for analytics
- It is fast, works everywhere, connects to many systems, and utilizes clean SQL!


## Installation
- DuckDB is commonly used within other scripting engines, but it can also operate standalone as a CLI tool
- Install the CLI tool by following the quick guide [here](https://duckdb.org/docs/installation/?version=stable&environment=cli&platform=win&download_method=direct&architecture=x86_64). (Choose your appropriate architecture)
- Add Python bindings so that you can use it within Python
  ```bash
  pip install duckdb --upgrade
  ```
- Add R libraries so that you can use it within R
  ```r
  install.packages("duckdb")
  ```

## Why?
- I find DuckDB to be an excellent "swiss-army knife" of working with data, but a few big perks include:
  - Querying directly from parquet/CSV data
  - Uses mostly ANSI standard SQL
  - Seamlessly connecting to multiple databases simultaneously
  - Excellent parquet support
  - It is multithreaded by default
  - Excellent [documentation](https://duckdb.org/docs/stable/sql/introduction)


## Memory or Permanent?
- When you connect to DuckDB, you can provide a file location (or not)
- If you provide no location, everything you do in the "database" is ephemeral, and is not saved beyond the individual session
  - If you try to load anything greater than your system's available memory, it will overflow to disk, but will clean up when it exits
- If you do provide a location and file (usually with a `.duckdb` suffix), then the state of the database **is** maintained, and reconnecting to it later will show the same contents
- In this class, as a tool, I foresee you using it more in "memory mode", but it is very capable as a standalone database engine as well!


## Working with Parquet
- DuckDB has excellent parquet support!
- Query directly:
  ```sql
  SELECT * FROM 'my_data.parquet'
  WHERE state = 'OR';
  ```
  - This can include multiple files simultaneously!
- Write existing data:
  ```sql
  COPY my_table TO 'my_table.parquet' (FORMAT 'parquet');
  ```
  - You can also specify compression options (though it uses a good one by default)
  - It can write partitioned Parquet files
-->
