Interfaces to Regular Expressions

Jed Rembold

April 2, 2025

Announcements

  • Homework
    • Homework 8 due tonight
    • HW5 and HW6 feedback went out!
    • HW9 goes live tonight
  • Project
    • Normalization went pretty rough on HW6. Have your partner check your data model and ensure you are properly normalizing things!
      • I can also look at it for some feedback
    • In the coming week, I’d see about getting your partner’s database tables linked into your database using a foreign data wrapper as discussed before break

Serving: Part I

APIs

  • One common way that you might want to make your data available is through an API
    • This gives others a streamlined method to retrieve information, in an identical fashion to how you may have scraped an API initially
    • API’s can do more than just read information though, they can also be used, if set up to do so, to write or change information in a database
  • Creating an API is a bit like creating a dynamic website:
    • A server waits for a request at a particular endpoint
    • The server runs SQL (or other code) to interact with the server’s database
    • The server returns the desired information to the requestor

API Creation

  • Many scripting languages have libraries and other utilities to help create web-based APIs
    • In Python, FastAPI is probably the most popular and nice to use
    • In R, Plumber seems the most common
  • Writing a bespoke API undeniably gives you the most control over how it works, but not everyone has extensive enough scripting backgrounds
  • Alternatives:
    • Docker-based solutions (mine uses FastAPI under the hood)
    • PostgREST

PostgREST Basics

  • PostgREST is a service that connects to a database and automatically offers up an API to connect to desired tables
  • Each table (or view) essentially gets assigned an endpoint
  • A user accessing that endpoint receives JSON of the contents in that table

More Advanced PostgREST

  • PostgREST also offers up a slew of endpoint options that can be included to further filter the data returned
    • This is actually a huge boon, as such capabilities otherwise have to be manually coded
  • PostgREST can not only read from the database, but perform most CRUD operations (create, read, update, or delete)
    • If you are going to let the API edit the database though, you should DEFINITELY implement access controls
    • PostgREST can help you handle this without a ton of effort, but it does complicate things.
    • My suggestion for the project is to just implement various useful read-only endpoints

Installing PostgREST

  • Docker is the easiest route!
    • Image is postgrest/postgrest
  • One primary environment variable:
    • PGRST_DB_URI is the same address we’ve used several times already to connect to a database: postgres://user:password@host:port/database
  • Need to tell the container to use the system network
    • On Windows or Linux, add the --net=host flag when running the container
    • On MacOS, use host.docker.internal instead of localhost in the URI
  • Basic access on port 3000

Authenticating PostgREST

  • If you try to access and endpoint currently, you’ll get a response about how anonymous access is disabled
  • If you wanted the API to be able to write and update the database, you can set PostgREST up with full user authentication
  • For just read-access though, we can enable anonymous access. We just need to:
    • Create a schema we want the accessible tables to live in
      • I like to call it something like api
    • Create a database user for the anonymous accessor
    • Give that database user permission to (read) access the schema and tables in that schema

Postgres User Permissions

  • A fundamental part of database management is user management, which are called roles in Postgres

  • You can create new roles with various permission attributes

    CREATE ROLE |||name||| |||attributes|||;
  • You can see all existing roles in the build-in table!

    SELECT * FROM pg_roles;
  • A little confusingly, roles can represent either individuals or groups

    • You can grant one role to another (inheriting that roles permissions) to create groups

PostgREST Roles

  • For anonymous validation, we can create a role that is not allowed to login

    CREATE ROLE web_anon NOLOGIN;
  • Note that this will NOT change the PGRST_DB_URI login. It isn’t a bad idea to login though with a non-superuser account. So, optionally:

    CREATE ROLE web_auth 
      NOINHERIT LOGIN PASSWORD 'secretsecret';
    GRANT web_anon TO web_auth;

    then use that for your URI:

    postgresql://web_auth:secretsecret@localhost:port/database

GRANTING Permissions

  • Roles by themselves have very limited permissions! Need to explicitly grant them (unless they inherited them elsewhere)

  • We need to GRANT USAGE rights to schemas

    GRANT USAGE ON SCHEMA api TO web_anon;
  • We need to GRANT SELECT rights to tables/views

    GRANT SELECT ON api.|||table name||| TO web_anon;

Updating Environment Variables

  • Need to add a few more environment variables:
    • PGRST_DB_SCHEMA: The schema that we want PostgREST to generate endpoints for (and the one we just gave permissions to
    • PGRST_DB_ANON_ROLE: The role we created and gave the permissions to (web_anon)
  • With those specified, we should be able to then access our endpoint!

Documentation

  • It is often a good idea to document your APIs, so that someone can tell what you are making available
  • PostgREST can create a basic layout for you using the OpenAPI documentation syntax
  • We can hook another service into this (SwaggerUI is common) to have a website where we can view our lovely documentation
  • Requires setting another environment variable in PostgREST to make this available:
    • PGRST_OPENAPI_SERVER_PROXY_URI: Set to localhost or to your PostgREST server URL

SwaggerUI

  • There are many website documentation programs that can “talk” to the OpenAPI service to discover what endpoints your API offers and display them in a pretty fashion
  • A common one is SwaggerUI, installed most easy through (you guessed it) Docker
    • swaggerapi/swagger-ui on DockerHub
    • Requires setting 1 environment variable: API_URL which should match the proxy URI we just defined for PostgREST
    • “Poke a Docker hole” in port 8080 to then be able to access the documentation

Break!

Break Time!

String Basics

Stringy Functions (CORE)

Function Description
|||str₁||| || |||str₂||| Concatenates string 1 and string 2 together
upper(|||str|||) Converts a string to all uppercase characters
lower(|||str|||) Converts a string to all lowercase characters
char_length(|||str|||) Returns the number of characters in the string
position(|||str||| IN |||substr|||) Find the number of the character where the substring begins
trim(|||opt||| |||chr||| FROM |||str|||) Removes the given characters from the string, optionally taking from the leading or trailing edge
substring(|||str||| FROM |||n||| FOR |||l|||) Returns the portion of the string starting at position n and continuing for l characters

String Functions (Postgres)

Function Description
initcap(|||str|||) Converts the first character of each word to uppercase, and the rest lower
left(|||str|||,|||n|||) Returns the first n characters of the string
right(|||str|||,|||n|||) Returns the last n characters of the string
ltrim(|||str|||,|||chr|||) Remove the characters (space by default) from the start of the string
rtrim(|||str|||,|||chr|||) Remove the characters (space by default) from the end of the string
replace(|||str|||,|||from|||,|||to|||) Replaces all occurrences of from in the string to
length(|||str|||) Returns the number of characters in the string
substr(|||str|||, |||n|||, |||l|||) Returns the portion of the string starting at position n and continuing l characters

Regular Expressions

Enhanced Pattern Matching

  • We’ve already seen basic pattern matching with LIKE and LIKE
    • Some flexibility with wildcard characters: % and _
  • To get (much) more flexibility, we need to pivot to something made for exactly this purpose: regular expressions (or regex)
  • Regular expressions are a sequence of mostly single character symbols that denote exactly what patterns one could wish for
    • These sequences of characters can initially look very inscrutable! Stick with it!
  • Regex’s are useful all over, and supported in almost all programming languages as well. Learning at least the basics is time very well spent.

Basic Regex Terms

Expression Description
. Matches any character except a new line
[abc] Matches any character in the square brackets (a or b or c)
[a-z] Matches a range of characters (all lowercase letters here)
[^a-z] Caret negates what follows (so no lowercase letters here)
\w Any word character, digit or underscore
\d Any digit
\s A space
\t A tab character
\n A newline character
Expression Description
^ Match at the start of the string
$ Match at the end of the string
? Get the preceding match 0 or one time
* Get the preceding match zero or more times
+ Get the preceding match one or more times
{m} Get the preceding match exactly m times
{m,n} Get the preceding match between m and n times
a|b Match on either a or b
( ) Create a capture group or set precedence
(?: ) Negate reporting a capture group

Other Regex Concepts

  • If you ever want to match off a symbol that has special meaning in regex (a parenthese, for instance) you must escape it with a backslash: \(
  • Reserved characters include: { } [ ] / \ + * . $ ^ | ?
  • Flags can be added at the end to tweak matching
    • /i means that matches will be case insensitive
    • /g means that all instances of the match will be returned, not just the first
    • /m allows the anchor characters (^ and $) to operate on each line, not just across the entire string.

Activity!

  • Practicing with regular expressions is the only way to improve at your ability to see the possible patterns
  • The link here has a nice sequence of short problems to test your skills against
  • Most problems consist of:
    • Terms that you want to match correctly
    • Terms that you want to not match
    • Capture groups that you’d like to capture
  • Working with your neighbors, see how many you can figure out in the next 20 minutes

Back to SQL

  • One of the main ways we previously used pattern matching was for filtering
  • You can also use regexes for filtering!
    • ~ is a case sensitive match using the following regex
    • ~* is a case insensitive match using the following regex
    • Either can have a ! in front to negate the search (where things do not match the regex)
SELECT colname
FROM tablename
WHERE colname ~ '[a-z]*\s\d{2}';

Extracting Data

  • Another hugely common use of regex is to extract only the data you want from a much larger string

  • This can be particularly useful when cleaning data or constructing useful database tables

  • Since PostgreSQL 15, you have had a perfect tool to accomplish this easily: regexp_substr

    SELECT
      regexp_substr('today is March 21, 2024', '\d{4}');

Further Extracting Data

  • What if you have multiple capture groups though?
  • regexp_match(|||str|||, |||regex|||) returns the first matching instance in the string
    • What is returned is whatever is in any capture groups you may have included in your regex, or the entire match if there are no capture groups
    • Output is returned as an array, to allow for potentially multiple capture groups
    • Index it out specific values using [N] at the end after wrapping entire expression in ()
    • Can also UNNEST
SELECT (regexp_match('today is March 15, 2022', '\d{4}'))[1];

Regular Splitting

  • You can also use regular expressions to replace or split text
  • regexp_replace(|||text|||, |||re|||, |||replacement|||) will replace the first matches of re in text with the replacement
    • Add another 'g' argument on the end to do so globally (replacing all matches)
  • regexp_split_to_array(|||text|||, |||re|||) will split the text into an array on each match to re
  • regexp_split_to_table(|||text|||, |||re|||) will split the text into a table column on each match to re. This is just like unnesting the array.

Now to You!

  • With your neighbors, take a look at this CSV file, which contains a simple subset of artists and dimensions from the MoMA data set
  • Create the simple table and import in the data from the CSV
  • See if you can achieve the following:
    • Create and populate new columns for first, middle, and last name
    • Create and populate new columns to hold the width and height in inches (in 1 1/4 form)
    • If the above was easy and you finish quickly, create new columns to hold decimal equivalents of the width and height

Full Text Tools

Text To Vector

  • English (and other languages) have many connecting words that help convey meaning, intent, or relationships
  • These are commonly not what are desired when searching text for particular ideas
  • Instead it is useful to focus on lexemes
    • A lexeme is an abstract unit of meaning that underlies a set of words
    • RUN: run, runs, ran, running
  • Postgres’s to_tsvector will break a string down into its component lexemes, and keep track of where each occured in the original string
SELECT to_tsvector('I flew back to Salem on Saturday');

>> 'back':3 'flew':2 'saturday':7 'salem':5

Lexeme Queries

  • Text that you want to search through will need to be converted to a vector through to_tsvector
  • Text that you want to search for will be converted using to_tsquery
  • to_tsquery takes a sequence of words with symbols connecting them conveying relationships
    • & - and
    • | - or
    • ! - not
    • <-> - followed by
    SELECT to_tsquery('fly & monday')

Combining Vectors and Queries

  • To actually complete a full text search, you ask Postgres to look through a tsvector object for a particular tsquery

  • The syntax to do so utilizes the match operator, which is two “at” symbols: @@

    SELECT |||some_tsvector||| @@ |||some_tsquery|||;
  • Using @@ is a true/false search, so the query is either found or not

    • This means it can be used for filtering with WHERE as well!

Indexing TSVectors

  • Individual tsvectors can not be easily ordered, so indexing a column with tsvector contents using the normal B-Tree method would not be effective

  • For this sort of content, use a Generalized Inverted Index or GIN indexing method instead

    CREATE INDEX |||idx_name||| ON |||tab_name||| 
    USING GIN (|||col_name|||);
  • Seriously consider adding your tsvector column as an index, as it can significantly speed up these sorts of searches

Getting More Info

  • Sometimes it can be useful to get a bit more information about where a match shows up in the text
  • The ts_headline function can capture snippets of text around a match and display them
    • ts_headline operates on the original text, not the tsvector! This will absolutely make it slower, so use it wisely!
  • There are a few required parameters and some options for ts_headline:
    • the text to search
    • the tsquery to look for
    • Other options appear in an option string:
      • StartSel/StopSel: the delimiters that will showcase the word
      • MinWords/MaxWords: the min or max number of words to show around the match
      • MaxFragments: a max number of fragments to show if the match occurs multiple times

Ranking

  • Sometimes a query might return many results, such that you would want a method to rank or compare them and only select the “best”
  • Ranking by relevancy is vague and tends to be very application specific, but it can be useful
  • ts_rank will give an arbitrary rank based on how many times your query words appear in the text
    • This might mean that longer texts will always receive a higher rank! You can normalize by the text length by providing an extra numeric code as a third argument (2 or 8 probably best)
  • ts_rank_cd does similar, but also considers the proximity of searched lexemes
  • Both functions take 2 required arguments:
    • a tsvector of the contents to be ranked
    • a tsquery determining how they will be ranked

Your Turn!

The file here contains the SQL commands to generate and populate a simple table alice which hold the raw chapter contents of the book: Alice in Wonderland. You will need to set up your own tsvector column and index. With yet one more new partner, see if you can use the data to answer the following:

  • In what chapters does the “Chesire cat” appear?
  • In what chapter does the word “mushroom” appear the most? How many times does it appear?
// reveal.js plugins // Added plugins