Scraping a Railway on Schedule

Jed Rembold

Mar 5, 2025

Announcements

Midterm video deadline tonight!
Upcoming Project Starts
- We are coming up on a point where it would be useful to strategize with your partner about possible project ideas
- I’m sending out a poll later tonight about who you might want (or not want) to work with, and then I’ll make the pairings and announce them tomorrow.
- You will be setting up (individually) a scraper in the coming week to be collecting data on a daily basis from some website or API
- You should strategize with your partner about what two difference sources you both can be collecting data for that would have interesting relationships that you could later analyze
- Most obvious join options would be date, but could be more interesting relationships depending on what you are both gathering
No homework this coming week! Just getting started on the project!

Gathering Data

Webscraping

Webscraping is the act of extracting information from a URL so that it can be collected or otherwise used elsewhere
It can take multiple forms, with varying degrees of complexity (most of which depends on the website)
- Extracting information from a table of data
- Extracting other information on a webpage that is not necessarily formatted
- Extracting information using a provided API endpoint
The method we’ll showcase today will let you approach situations where you want to grab information from a particular table on a website, or in accessing an API

Web2DB

To focus on the acquisition of data, without worrying about the scripting details, I have created a Docker container you can use
```
docker pull jrembold/web2db:latest
```
This container will handle:
- Grabbing a static table from a webpage, or
- Retrieving a response from an API endpoint, and
- Optionally: Writing the data as JSON to a file
- Optionally: Adding the data as JSON to an indicated PostgreSQL table
- Optionally: scheduling the same operation to run with a provided frequency

Controlling the Environment

You control the behavior of the Web2DB container through environment variables
- Keeps you have needing to build your own images on top for customization
Usually a good idea to store secrets such as database passwords in environment variables anyway
Locally, the easiest way to handle the variables is to create a vars.env file and provide it to the docker container
```
docker run --rm --env-file vars.env web2db
```
You can specify them individually, one after each -e flag, if you really wanted
```
docker run --rm -e SITE_URL=www.google.com web2db
```

Web2DB Environment Variables

web2db recognizes (or requires) 6 environment variables:
- SITE_URL (required)
- TABLE_NUM (situational/optional)
- FILE_NAME (optional)
- DATABASE_URL (optional)
- TABLE_NAME (situational)
- DEBUG (optional)
- HEADER_KEYS (optional)

When providing in vars.env, use no spaces:

SITE_URL=https://willamette.edu/~jjrembold

SITE_URL for Tables

If you are trying to scrape a static table off a site, provide:
- The URL of the site as SITE_URL
- Indicate the number of which table you want as TABLE_NUM
  - Defaults to 0 for the first table on the page
The table will get converted to JSON format, with each row containing the column headers as keys
If a table is populated from a backend database on loading the webpage (a dynamic table) this method will unfortunately not work

Understanding APIs

Instead of posting a table on a webpage, information providers might make available a public API where you can access the information
Most REST APIs look just like a web address, but if you navigate to that URL, instead of getting HTML to render a webpage, you get the data directly, most often in a JSON format
- Many APIs also let you add extra information to the URL to better specify exactly what information you want back
- These generally take the form of parameter strings, and come after a ? in the URL
Some APIs will require you to register for a key, which is often free. This is to safeguard against people slamming their servers with billions of requests. Be respectful in both your API and webscraping usage!

Understanding Endpoints

Many API’s will offer a variety of different endpoints
An endpoint is essentially just a specific URL that will get you a particular piece of information
How do you know what endpoints are available?
- Classic RTFM: You read the @#%!ing manual
- Any API worth anything will have extensive documentation describing what endpoints are available and what customizations you can apply to each endpoint to get the data you want

SITE_URL for APIs

web2db is currently configured to handle REST GET APIs
- I may be able to add POST functionality if needed
You just need to provide the desired endpoint to SITE_URL
If you have parameters to include at the end (following a ?), include them in your URL
web2db assumes that the response will be in JSON, which is probably the case in over 95% of APIs, but it will break if given something in another format

HEADER KEYS

Some APIs may demand that you provide certain information in what is known as a header key
- This is an additional source of information that the API server receives along with the URL
Sometimes used to indicate authentication keys or type of response requested
Provided as a string version of a key-value dictionary:
```
"{'key_1':13, 'key_2': 'Jed'}"
```

Writing to File

If you include the FILE_NAME environment variable, the data will be written to a newline JSON file with this name in the container’s /app/output folder
This format is perfect for copying directly into Postgres tables using COPY
- Instead of having a list of entries, it has 1 entry on each new line
If you want to access these files from outside your container, you’ll need to set up a volume bind!

Writing to Database

web2db can write the information directly to your database, provided you fill out the DATABASE_URL and TABLE_NAME variables
DATABASE_URL contains all the info to connect to your database
- Form: username:password@hostname:port/database
TABLE_NAME should be a table that already exists in the database and has a column named raw_json
- It can have other columns, but web2db will only fill out the raw_json column
- I like adding a timestamp column with a DEFAULT now() to automatically add the time each entry is added to the database
If connecting to a local database, you may need to include --network=host option when running the docker container

Practice!

See if you can do the following:
- Create a table in your local database with at least one column with the jsonb data type
- Use the Docker container to dump the table containing our class schedule here into your Postgres table
- Use the Docker container to dump the results from the API endpoint here into your Postgres table. The API gives information about all the humans currently in space.
Note that although these two results have very different schema, you can dump them both into the same jsonb column in Postgres!

Break!

Break Time!

Online Deployments

Railway Introduction

There are many different webhosting options these days
For this class, I was looking for:
- Something fairly straightforward
- Something free or very cheap
- Something that could easily handle both a database and data acquisition
Railway.app is what I found

Railway Details

As a registered new user, you get $5 free credit that does not expire
- For what we are using, monthly costs are usually around $2-$2.50, so you probably will not need to pay anything to get through the end of the semester
Can spin up a Postgres database easily, in the cloud
Can also launch a Docker container
- Can build an image directly from a Dockerfile and supporting files in a GitHub repository
- Can launch an image directly from DockerHub
You can sign up for Railway using your GitHub account

Creating a DB

Creating a new Postgres database is drop dead simple:
- Create a new project
- Select “Provision PostgreSQL”
- Give it a moment, and then it will be up and running!
You can click on it in your project to bring up options where you can:
- Add tables
- Query the data directly
- Get the all important connection information
Test it! You should be able to connect using this connection information from your client of choice!

Deploying a Docker Container: Easy Way

Recently, Railway added the option to deploy a Docker container directly from an image hosted on DockerHub
This makes it very easy to deploy the jrembold/web2db image!
When it first is added to your project, it will not be running. It gives you some time to make any adjustments or corrections before you can then hit “Deploy” to get it going.

Deploying a Custom Docker Container

Railway can deploy Docker containers by building a Dockerfile in a provided GitHub repository
In your project, select “New” and then choose “GitHup Repo”
You’ll then need to choose Configure/Install GitHub App
- This will link Railway.app to your GitHub account
- You can choose to either link it to all your repositories, or, like me, choose specific repositories
- In that repository, all you need is a Dockerfile and any supporting files that you would normally need to build the image
Whenever you commit a new change to that repo, Railway will automatically rebuild the Docker image and redeploy it, which is pretty slick

Railway Environmental Variables

Railway has available all the database environment variables, but you just need to import which ones you need over to the web2db image (the DATABASE_URL one!)
Select the Docker container and then click on the “variables” tab
Here you can click the + New Variable button to add the same variables and corresponding values you had in your var.env file!
- This way you never need to have something like a password or similar saved in text as part of your deployment
- If you make changes to these variables, Railway will automatically restart the Docker container to ensure it sees the new values!

Scheduling with Railway

By design, the Docker container runs the program once and then exits
Especially on remote systems, this is an efficient use of materials
Often though, you want to scrape new data on a particular interval, so we need a way to “wake up” and run the Docker container again
This is done in Railway by setting a CRON Schedule
- This will effectively run redeploy the container and run the program at some given interval
- Requires specifying a schedule with some special CRON syntax

CRON Scheduling

From crontogo.com. Also useful is this site!

Regrouping to SQL

SHOWDOWN TIME!

Teams to be announced on next slide
Whoever want to be in charge of submitting your answer can change your polling name to a fun team name :)
Only one computer can be used by the team at a time
- We’ll be using the superhero dataset from HW4
- Another computer can have the ERD open if you want
- 4 rounds: each round switch who is typing
You’ll have 5 minutes to answer each question. After 2 minutes you can submit a single answer.
- Submitting an answer sooner gets you more points, but you have to wait at least the two minutes.

Showdown Teams

Nate, Landon, Joshue
Ryan, Kaiona, Sam P
Jace, Charlie, Sammy T
Sarah, Rohan, Kendall
Hriday, Jeffrey, Cameron

Jacob, Tatum, Logan
Andrew, Simon, Brandon
Endy, Paxton, Dane
Alexander, Dylan

A Question of Timing

Date-Time Reminders

We have about 4 current date or time related data types
Date time types:
- DATE: holds a single individual day
- TIME: holds a single individual time
- TIMESTAMP or variants with TIMESTAMPTZ: holds a combination of date and time, along with a potential time zone
Interval types:
- INTERVAL: holds a duration of time

Extracting Pieces

Having all the information in one value is convenient, but sometimes you only need pieces
- The hour from the time, or the month from the date
These can be particularly important with aggregates!
Two methods to extract pieces of any datetime or interval type:
- SQL standard: extract( part FROM |||datetime_value||| )
- Postgres specific: date_part( |||part|||, |||datetime_value||| )
Both will return a DOUBLE PRECISION value of whatever part was requested

Parts To Extract

You have a wide variety of what you can extract

text	Description
century	What century the date is in. 1st century starts 0001-01-01★
day	What day of the month
decade	The year divided by 10
dow	The day of the week (0-6, starting with Sunday)
doy	The day of the year
epoch	Number of seconds since 1970-01-01
hour	The current hour (0-23)
microseconds	The number of microseconds

text	Description
milliseconds	The number of milliseconds
minute	The minute
month	The month (1-12)
quarter	What quarter of the year (1-4)
second	The number of seconds
timezone	The timezone offset in seconds
timezone_hour	The timezone offset in hours
week	What week of the year. ISO weeks start on Monday
year	The year

★ – If you disagree with this, please write your complaint to: Pope, Cathedral Saint-Peter of Roma, Vatican.

Reversing it

Often times existing data sets have already separated out different aspects of the date or time
- Year, month, and day might be in different columns for example
It can be useful to “stitch” these together into an actual datetime type for further use.
Postgres gives you a handful of functions to do so:
- make_date( |||year|||, |||month|||, |||day|||): Returns a new DATE type value
- make_time( |||hour|||, |||minute|||, |||seconds||| ): Returns a new TIME type value (with no timezone)
- make_timestamptz(|||year|||,|||month|||,|||day|||,|||hour|||,|||minute|||,|||second|||,|||time zone|||): Returns a new TIMESTAMPTZ type value
- make_timestamp and make_interval also exist

Aging well

Subtracting two DATE type values will give just an INT (in days)
Subtracting two TIMESTAMP type values will give an INTERVAL, with the biggest “unit” in days
Using Postgres’s age() function can smooth over both and give units larger than days
- age( |||datetime₁|||, |||datetime₂||| ): Subtracts datetime2 from datetime1
This can still give you awkward interval units at times though, so also consider using justify_interval( |||interval||| ), which breaks intervals into divisions that don’t exceed a categories max
- Hours would always be between 0 and 23 for instance, or months between 1 and 12
- Especially if you want to extract a particular part, this is highly recommended

What time is it?

Standard SQL also provides constants for grabbing the current system time and date

function	description
`current_date`	Returns the current date
`current_time`	Returns the current time with timezone
`localtime`	Returns the current time without timezone
`current_timestamp`★	Returns the current date and time with timezone
`localtimestamp`	Returns the current date and time without timezone

★ – Postgres also offers the shorter now() function to do the same thing

Time Zones

Dealing with time zones can be a headache, and it is a very nice feature that Postgres can work with them smoothly
By default, Postgres will display any timestamp with a time zone with the time as you would measure it in your current system timezone
What is your current system timezone?
- SHOW timezone;
Getting general information about timezones:
- Getting abbreviations:
  - SELECT * FROM pg_timezone_abbrevs;
- Getting full names:
  - SELECT * FROM pg_timezone_names;

Teleportation

It can sometimes be useful to switch your “current” time zone
- Maybe it is easier to compare times to someone else living in that time zone
Several methods to make the switch:
- Change your postgressql.conf file, which controls your Postgres server. Only recommended if you have permanently moved elsewhere and the database time zone has not updated appropriately.
- Set future queries in a single session to be from a new timezone:
  
  SET |||timezone||| TO |||time_zone_name_or_abbrv|||;
  - This will also adjust what values your localtime or localtimestamp report!
- Transform a single query to be reported in a different time zone:
```
SELECT |||dt_col_name||| AT TIME ZONE |||tz_name_or_abbrv|||
FROM |||tablename|||;
```

Activity

Using the taxi rides dataset, see if you can:
- Compute the number of rides given each hour of the day
- Compute the average cost of rides over each day of the month
- Compute the median cost of rides over each day of the week
- Compute the average duration of each ride over each hour of the day