Basic Tooling

Jed Rembold

May 19/20, 2026

Announcements

Spring grades are in! So I can focus on you all now!
Discord invite went out as Canvas announcement
Also in announcement: a form with
- Group preferences and strengths
- Public SSH key (more on that in a bit)
For next week:
- Have the form filled out by end of Saturday please
- Keep an eye out for connection credentials

Evening's Plan

SSH Remotes
Docker Compose
Parquet

SSH Remotes

Ensuring SSH Access

On Windows:
- Likely already have it installed. If not recognized, can enable by going to Settings → System → Optional Features and adding OpenSSH Client
- Alternatively, install (or use) WSL2 (the Windows Subsystem for Linux). The Linux instance you get will already have SSH, so you use it through the WSL2 terminal
On MacOS:
- It is already there
On Linux:
- It is likely already there, or just install the OpenSSH package using your package manager

Remote Sessions

The program usually used to log into a remote machine is called ssh, standing for “secure shell”
To log into a remote server, the command looks something like:
```
ssh {username}@{ip address or domain name}
```
where
- username is your username on the remote server (which may be different than your local username)
- ip address or domain name is either the direct ip address of the server (eg. 165.213.13.194) or the domain name (myserver.net)
- Some servers may require a special port as well, which can be indicated with -p
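
As a rough sketch of how those pieces fit together, the small Python helper below assembles the command from its parts. The function name and the example username, host, and port are made up for illustration; only the ssh syntax itself comes from the slide.

```python
import shlex


def build_ssh_command(username, host, port=None):
    """Return the argument list for an ssh login (hypothetical helper)."""
    args = ["ssh"]
    if port is not None:
        # -p is only needed when the server listens on a non-default port
        args += ["-p", str(port)]
    args.append(f"{username}@{host}")
    return args


# Example values are placeholders, not real credentials
print(shlex.join(build_ssh_command("student", "myserver.net", port=2222)))
```

The returned list could be handed to subprocess.run if you wanted to launch the session from a script.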

Streamlining Security

Entering in your remote account password each time isn’t onerous but can be inconvenient
It is also only as secure as the quality of your password, which varies wildly
Instead of a password, you can take advantage of an ssh key, which uses a public/private key authentication system
- You generate (or use an existing) public and private key pair on your system
- You upload the public key to the server you want to be able to connect to
- The private key always stays only on your system. It is not shared!

Using Keys

To create a new key, you can use
```
ssh-keygen -t ed25519 -C "{desc comment}"
```
- You will be asked for a passphrase for the key. You can go without and the key will still be much more secure than most password systems, but you could also add a passphrase necessary to “unlock” the key
- Unless you name it differently, two files will be created inside the .ssh folder in your home directory: one named id_ed25519 (the private key) and one named id_ed25519.pub (the public key)

Copying Keys

Just creating the keys doesn’t do anything special. You need to copy the public key over to whatever server you want to connect to.
To copy the public key over to the desired server in Linux/MacOS:
```
ssh-copy-id {username}@{servername}
```

In Windows (PowerShell):

type $env:USERPROFILE\.ssh\id_ed25519.pub | ssh {username}@{servername} "cat >> ~/.ssh/authorized_keys"

Or install GitBash or WSL2

You’ll need to enter in your password one more time, but then the key contents will be copied over

SSH Config

Often, you are connecting to the same servers again and again
- It would be nice not to have to repeat information about user name, server location, ports, etc
Instead, you can set up “profiles” in your {HOME}/.ssh/config file

A general profile entry might look something like:

Host {profile_name}
  User {username}
  HostName {domain name or ip address}
  Port {port, if not default}

There are more options and settings that can be configured. See man ssh_config.

Transferring Files

You can transfer files between your local computer and a remote host in a few different ways
- scp: The scp command combines normal cp and ssh, allowing you to include a remote server in the standard format
```
scp local_file remote_server:remote_file
```
- rsync: Syncs entire folders if desired, only transferring what information is necessary
```
rsync -avP local_folder/ remote_server:remote_path
```
- Use GitHub as an intermediary. Write and push changes from your local system to GitHub, and then pull them down to the remote.

Tunnels

Often, software running on the remote offers up services on a remote port
- A Postgres database, or a webserver
Trying to access those services from your local machine will usually not work, even if you can SSH into the remote, since those ports are typically not open to the outside!
So how can we still connect to those services on your local machine?
- By utilizing SSH tunnels!

Local Tunnels

There are technically two types of tunnels, but the type that will be most useful to us is called a local tunnel
Connect a resource that is available on the remote to a location on our local system
Requires adding an optional flag when calling ssh:
```
ssh -L {local_dest}:{remote_target} {user}@{remote}
```
- The local destination will usually just be some port on the localhost: localhost:8080
- The remote target is wherever the target is residing, often at a port on the other machine: localhost:5432
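
One way to check whether a tunnel is actually up is to see if anything is listening on the local end. The sketch below is a generic TCP check using only the standard library; the function name is made up, and port 8080 is just the example local port from above.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


if __name__ == "__main__":
    # Reports True only while a tunnel (or any other program) listens on 8080
    print(port_is_open("localhost", 8080))
```

Note that this only tells you the local port is accepting connections, not that the service on the far end is healthy.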

Config Tunnels

If you always want a tunnel to be created when you connect to a remote, you can set this up in your SSH config!

Host {profile_name}
  User {username}
  HostName {domain name or ip address}
  Port {port, if not default}
  LocalForward {local port} {remote target}

You can create as many tunnels as you might need!
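
If you manage several profiles, it can help to generate the entries rather than type them. This is a minimal sketch, assuming the profile layout shown above; the helper name and the example profile values are hypothetical.

```python
def render_ssh_profile(name, user, hostname, port=None, forwards=()):
    """Render one Host entry for ~/.ssh/config as text (hypothetical helper)."""
    lines = [f"Host {name}", f"  User {user}", f"  HostName {hostname}"]
    if port is not None:
        lines.append(f"  Port {port}")
    # One LocalForward line per tunnel: local port, then remote target
    for local_port, remote_target in forwards:
        lines.append(f"  LocalForward {local_port} {remote_target}")
    return "\n".join(lines) + "\n"


# Placeholder values: a web page and a Postgres database on the remote
print(render_ssh_profile(
    "class",
    "student",
    "myserver.net",
    forwards=[(8080, "localhost:8080"), (5433, "localhost:5432")],
))
```

The printed text can be pasted into the config file; the sketch deliberately does not write to the file itself.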

Limiting Holes

When connecting to a remote network, you often want to make many systems or services on that network available
Poking a multitude of holes in the firewall, one for each service, is not recommended
Instead, you generally poke a single hole and channel all traffic through it. The machine sitting behind that single hole is known as a jump host (or bastion host)

SSH Jump Usage

We can specify we want to use a jump host in various ways:
- From the CLI:
```
ssh -J {user}@{jumphost} {user}@{service}
```
- From SSH Config
```
Host {your profile name}
  HostName {service}
  User {user}
  ProxyJump {user}@{jumphost}
```
  - Note that the ProxyJump can also just point to another profile name
Tunnels still work seamlessly! They will be created passing through the jump host

Practice

I emailed you all earlier with a server address and login information. Use that to work through the following:

SSH into the server using your information and change your password using passwd. Note that when you type in passwords on most shells, they will not show anything for security but are indeed recording what you type.
How many files are in your remote home directory initially? Some might be hidden!
Exit out, and on your local system generate an SSH key. Copy the public key over to the server. Ensure you can log in to the server now without needing your server password!
Set up a simple profile in your .ssh/config file to facilitate connecting to this server
This server has a webpage being served at port 8080. Can you create a tunnel to view its contents?

Break Time

Docker Compose

Why Compose?

Docker by itself is great for single, isolated containers to accomplish a task
But what if we have a collection of containers that all need to work together?
- Eg. A database, a web scraper, and an API container?
We could launch and configure them all independently, but that might result in ordering errors
- The database must be running before the scraper could properly dump data to it, for instance
Docker Compose allows us to declaratively define all the settings for a collection of containers that we want to work together
- This includes dependencies, where one container must be running before another is started!

Installation of Compose

Did you install Docker through Docker Desktop? Then congrats! You already have Docker Compose.
Otherwise (on Linux), install it through your package manager

Anatomy of a Compose file

Docker Compose utilizes a docker-compose.yml file to declare everything that should happen
This includes:
- Things that might affect multiple containers: services, volumes, networks
- Things that affect an individual container: image, ports, environment variables, etc.

Docker run:
```
docker run -p 8080:80 nginx:alpine
```

Docker Compose:

services:
  web:
    image: nginx:alpine
    ports:
      - "8080:80"

Services

The predominant section of the Docker Compose file is usually the services
Here you can name each of the service containers you want to create, along with customizing them:
- image: {name} - the name of the Docker image you want the service to be based on
- build: . - build an image from a Dockerfile present in this directory
- container_name: {name} - an optional more meaningful container name
- ports: - a list of port mappings from host to container
- volumes: - a list of volume mappings from host to container. Can be to local directories or to a more abstract volume
- environment: - a collection of environment variables
- depends_on: - list of other services this service depends on
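
To get a feel for what depends_on implies, the sketch below works out a valid startup order from a dependency mapping using Python's standard graphlib module. This is only an illustration of the idea, not how Compose is implemented, and the three service names are borrowed from the earlier database/scraper/API example.

```python
from graphlib import TopologicalSorter

# Each service maps to the services it depends on (illustrative names)
depends_on = {
    "db": [],
    "scraper": ["db"],
    "api": ["db"],
}


def startup_order(dependencies):
    """Return the services ordered so that dependencies come first."""
    return list(TopologicalSorter(dependencies).static_order())


print(startup_order(depends_on))
```

Keep in mind that, by default, depends_on only waits for the dependency's container to start, not for the software inside it to be ready to accept connections.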

YAML Gotchas

Some things to be aware of when working with YAML files:
- YAML is white-space sensitive!
- Indent using spaces (not tabs), usually 2 spaces
- Lists are created with an indented series of lines starting with -
- Mappings (a dictionary) are created with an indented series of lines comprised of key: value pairings
Older versions of Compose required a version key at the top. That is no longer necessary, but you’ll still see it in many online Compose files
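
If the YAML nesting feels opaque, it may help to see the same structure written as plain Python dictionaries and lists. The sketch below mirrors the small nginx Compose example by hand; it does not parse YAML, and printing it as JSON is just a convenient way to view the nesting.

```python
import json

# Mappings become dicts, "-" lists become Python lists
compose = {
    "services": {
        "web": {
            "image": "nginx:alpine",
            "ports": ["8080:80"],
        }
    }
}

print(json.dumps(compose, indent=2))
```

Each extra level of indentation in the YAML corresponds to one more level of nesting here.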

Interconnectivity

All services defined in a Compose file share an isolated network
You can refer to service names directly (either in the Compose file or in code!) instead of always needing to reference localhost
- This can streamline how different containers talk to one another, without having to poke a million portholes (ahoy!) in each container

Persistence and Volumes

Docker containers are generally isolated, which includes their local storage
- If you remove and recreate the container, you start afresh with everything
Sometimes we very much want persistence though (hello databases)!
We declare a desire for persistence in Compose files by defining volumes, of which there are two main types:
- Named volumes
- Bind mounts

Named Volumes

A named volume essentially sets aside some space (and gives it a name) to be managed by Docker
This space will persist across container restarts/rebuilds, and host reboots

Best for long running persistence

services:
  db:
    image: postgres:17
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

Bind Mounts

Bind mounts are basically akin to what we can do with the -v flag when using docker run
Maps a local directory on the host to a directory in the container
More tightly controls where on the host the data is stored

Especially useful if you want to be able to add data to a folder and then have it accessible inside the container

services:
  db:
    image: postgres:17
    volumes:
      - ./pgdata:/var/lib/postgresql/data
      - ./mntdata:/mnt/data

Compose Networking

When you launch containers using Docker Compose, it automatically creates an isolated network upon start that all the containers belong to
Containers are discoverable to one another: just use their service name as the hostname
Inside the container, localhost means that container, not the host system.
If you want to access content inside a container from the host system, you must set up a port mapping
```
services:
  db:
    image: postgres:17
    ports:
      - "5050:5432"
```
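
The practical consequence is that the same database has two different addresses depending on where your code runs. The sketch below builds both, assuming the db service and 5050:5432 mapping above; the helper name, user, password, and database name are placeholders.

```python
def postgres_url(user, password, host, port, database):
    """Assemble a Postgres connection URL (hypothetical helper)."""
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# From another container in the same Compose file: service name + container port
from_container = postgres_url("app", "secret", "db", 5432, "appdb")

# From the host system: localhost + the mapped host port
from_host = postgres_url("app", "secret", "localhost", 5050, "appdb")

print(from_container)
print(from_host)
```

Mixing these up (for example, using localhost from inside another container) is a common source of connection errors.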

The lifecycle

Now that we have these “stacks” of Docker containers defined in our Compose file, how do we use them?
In the same folder as the docker-compose.yml file:
- docker compose up starts everything up, with logging in the foreground
- docker compose up -d starts everything up detached (in the background)
- docker compose down stops and removes containers
- docker compose up --build will rebuild anything that used a custom Dockerfile
- docker compose restart {service} will restart a specific service from the stack

Example

Suppose we wanted to mimic a piece of our old setup from 503 where we have a scraping container dumping into a database container

Parquet: The Filetype of Big Data

What is a Parquet file?

Parquet files are a columnar storage format which enables:
- Far better efficiency and compression
- Schema support (data types and column names!)
- Splittability (ideal for distributed processing)
Example ideal use case: reading and processing 1 column from a 10GB table
The price you pay: it is no longer human-readable natively
- Honestly, it is worth the price, provided you are dealing with big enough data

How?

Parquet files are chunked up in a hierarchy:
- Row groups represent large chunks of rows (often 10k+)
  - Each row group is chunked up into columns
    - Each column is chunked up into pages, which encode and compress the data
Metadata is included both on the entire table and each column in each row group
- Includes statistics, column types, row group information
Each column can actually be compressed differently, depending on what is most ideal

The Importance of Sorting

Parquet is built around being fast and space-efficient for big data sets
Proper sorting of a table can have huge effects on both of these!
- It is possible to have a parquet file that offers very little benefit over something like a CSV if you do this actively badly

Sorting for Row Groups

Recall the first order of division is by a group of rows, before being broken up by column
If data in a column is sorted, then each row group covers a narrow, non-overlapping range of values, and its min and max are stored in the metadata
When filtering over the entire table, only the metadata needs to be checked to see if a particular row group even needs to be accessed
This is a cousin to the idea of creating an index on a column
- Different approaches, but both with the idea of limiting how much information you need to look at to find the stuff you want
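
A toy simulation can make the effect concrete. The sketch below is plain Python rather than a real Parquet reader: it chops a column into made-up "row groups" of 10 values, records each group's min and max, and counts how many groups a filter for one value would have to open when the column is sorted versus shuffled.

```python
import random


def make_row_groups(values, rows_per_group):
    """Split a column into chunks and record min/max statistics for each."""
    chunks = [values[i:i + rows_per_group]
              for i in range(0, len(values), rows_per_group)]
    return [{"min": min(c), "max": max(c), "rows": c} for c in chunks]


def groups_to_scan(row_groups, target):
    """Count the groups whose min/max range could contain the target."""
    return sum(1 for g in row_groups if g["min"] <= target <= g["max"])


values = list(range(100))
sorted_groups = make_row_groups(values, 10)

shuffled = values[:]
random.Random(0).shuffle(shuffled)  # fixed seed so runs are repeatable
shuffled_groups = make_row_groups(shuffled, 10)

print("sorted:", groups_to_scan(sorted_groups, 42))
print("shuffled:", groups_to_scan(shuffled_groups, 42))
```

With sorted data only one group can contain the value, while shuffled data tends to spread wide min/max ranges across many groups, so far fewer can be skipped.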

Sorting for Compression

Because columns are the fundamental storage unit (within their pages), Parquet files often leverage what is called run-length encoding
This specifies that for items repeated consecutively in a column, only the value and the number of repeats are stored

State	Town	Name
OR	Salem	Jed
OR	Salem	Kristen
OR	Portland	Calvin
OR	Portland	Hank
OR	Portland	Rachel
CA	Los Angeles	Todd
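
The idea can be sketched in a few lines of Python using the State and Town columns from the table above. Real Parquet encoders are more involved than this, so treat it purely as an illustration of why long runs compress well.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


state = ["OR", "OR", "OR", "OR", "OR", "CA"]
town = ["Salem", "Salem", "Portland", "Portland", "Portland", "Los Angeles"]

print(run_length_encode(state))  # [('OR', 5), ('CA', 1)]
print(run_length_encode(town))   # [('Salem', 2), ('Portland', 3), ('Los Angeles', 1)]

# The same values in an unsorted order produce more, shorter runs
print(run_length_encode(["OR", "CA", "OR", "OR", "OR", "OR"]))
```

Six State entries shrink to two pairs when sorted, but the unsorted version needs three.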

The Big Takeaways

Sorting on columns with low cardinality can make for huge space savings
Sorting on columns likely to be heavily filtered can greatly speed up future query results

Parquet Data Types

One of the other benefits of Parquet files is that they maintain a schema
That means we need to understand what data types Parquet files support, though
Generally these can be broken down into two groups:
- Physical Types
- Logical Types

Physical Types

Type	Description
`BOOLEAN`	True/False
`INT32`	32-bit signed integer
`INT64`	64-bit signed integer
`FLOAT`	32-bit float
`DOUBLE`	64-bit float
`BYTE_ARRAY`	Binary blobs or UTF-8 strings

Logical Types

Type	Backed by	Description
`UTF8`	`BYTE_ARRAY`	Strings
`DECIMAL`	`INT32`, `INT64`, or `BYTE_ARRAY`	Exact fixed-precision decimal numbers
`DATE`	`INT32`	Days since Unix epoch
`TIME_MILLIS`	`INT32`	Time of day (ms precision)
`TIMESTAMP_MICROS`	`INT64`	DateTime with µs precision
`LIST`	Repeated fields	Arrays/lists
`MAP`	Nested fields	Dictionaries
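
To illustrate what "backed by" means, the sketch below shows how a calendar date can be reduced to a single integer, the number of days since the Unix epoch, as the DATE row describes. The helper names are made up, and libraries like pyarrow handle this conversion for you.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def date_to_days(d):
    """Convert a date to the integer a DATE column would store."""
    return (d - EPOCH).days


def days_to_date(n):
    """Convert a stored day count back into a date."""
    return EPOCH + timedelta(days=n)


stored = date_to_days(date(2026, 5, 19))
print(stored, days_to_date(stored))
```

The logical type is what tells a reader to interpret that integer as a date rather than as a plain number.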

Potential Gotchas

Most types map pretty clearly from how they are defined in other systems like SQL, Python, or R, but a few things to be aware of:
- Some tools, especially Pandas, don’t always preserve decimal precision when writing Parquet unless specifically commanded
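- Some tools may treat text as raw binary blobs unless explicitly set to strings for UTF8
- NULLs are handled a bit differently: they are tracked separately from the values themselves rather than stored as special placeholder values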
- If working with compound data structures, ensure you are using a tool like Spark or DuckDB (coming soon!) that understand these types well
- Some databases like MySQL store booleans as “TINYINT”, which will map to INT32 in Parquet unless forced to BOOLEAN
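
The decimal gotcha comes down to the difference between binary floats and exact decimals. The short sketch below shows that difference in plain Python; it does not use Pandas or Parquet, so it only illustrates why a DECIMAL column silently converted to floats can lose exactness.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly
float_total = 0.1 + 0.2
print(float_total == 0.3)  # False

# Decimal values keep the exact digits that were written
decimal_total = Decimal("0.10") + Decimal("0.20")
print(decimal_total == Decimal("0.30"))  # True
```

For things like currency amounts, that exactness is usually worth preserving all the way through to the file.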

Parquet Partitioning

On top of the other ways that Parquet breaks down big datasets, it also supports one called partitioning
Partitioning essentially splits a parquet file up by the values in some particular specified column
- This is very similar to a physical manifestation of how GROUP BY works actually
Writing out a partitioned parquet file will result in a folder for each unique value in the partitioned column
- Again, this speeds up access. Any queries dealing with that value can jump immediately to the important file
- Partitioning by things like year or day can be very common
Tools like Spark or DuckDB can natively handle these partitions and optimize queries accordingly
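
As a rough picture of the resulting layout, the sketch below groups some made-up rows into one "folder" per value of a chosen column. It assumes the common column=value folder naming and leaves the partition column out of the stored rows; it only builds a dictionary in memory and writes no files.

```python
from collections import defaultdict


def partition_rows(rows, column):
    """Group rows into folder names of the form column=value."""
    folders = defaultdict(list)
    for row in rows:
        folder = f"{column}={row[column]}"
        # The partition value lives in the folder name, so drop it from the row
        folders[folder].append({k: v for k, v in row.items() if k != column})
    return dict(folders)


# Made-up example rows
rides = [
    {"year": 2024, "fare": 12.5},
    {"year": 2025, "fare": 8.0},
    {"year": 2024, "fare": 30.0},
]

for folder, rows in partition_rows(rides, "year").items():
    print(folder, rows)
```

A query that filters on year == 2024 would then only need to look inside the year=2024 folder.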

Working with Parquet

In Python with Pandas (needs pyarrow)

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [95, 100]})
df.to_parquet('students.parquet', engine='pyarrow')
df2 = pd.read_parquet('students.parquet', engine='pyarrow')

In R (needs the arrow or nanoparquet library)

library(arrow)  # or library(nanoparquet)
df <- data.frame(name = c("Alice", "Bob"), score=c(95,100))
write_parquet(df, "students.parquet")
df2 <- read_parquet("students.parquet")

Activity (if time)

Download the most recently available yellow taxi ride data from the NYC Taxi Rides dataset here
Load and inspect the metadata to determine how many rows it has per row-group

import pyarrow.parquet as pq
pf = pq.ParquetFile(filename)  # filename: path to the downloaded .parquet file
pf.metadata  # includes num_rows and num_row_groups

library(nanoparquet)
ds <- read_parquet_metadata(filename)