Interfaces to Regular Expressions

Jed Rembold

April 2, 2025

Announcements

Homework
- Homework 8 due tonight
- HW5 and HW6 feedback went out!
- HW9 goes live tonight
Project
- Normalization went pretty rough on HW6. Have your partner check your data model and ensure you are properly normalizing things!
  - I can also look at it for some feedback
- In the coming week, I’d see about getting your partner’s database tables linked into your database using a foreign data wrapper as discussed before break

Serving: Part I

APIs

One common way that you might want to make your data available is through an API
- This gives others a streamlined method to retrieve information, in an identical fashion to how you may have scraped an API initially
- API’s can do more than just read information though, they can also be used, if set up to do so, to write or change information in a database
Creating an API is a bit like creating a dynamic website:
- A server waits for a request at a particular endpoint
- The server runs SQL (or other code) to interact with the server’s database
- The server returns the desired information to the requestor

API Creation

Many scripting languages have libraries and other utilities to help create web-based APIs
- In Python, FastAPI is probably the most popular and nice to use
- In R, Plumber seems the most common
Writing a bespoke API undeniably gives you the most control over how it works, but not everyone has extensive enough scripting backgrounds
Alternatives:
- Docker-based solutions (mine uses FastAPI under the hood)
- PostgREST

PostgREST Basics

PostgREST is a service that connects to a database and automatically offers up an API to connect to desired tables
Each table (or view) essentially gets assigned an endpoint
A user accessing that endpoint receives JSON of the contents in that table

Installing PostgREST

Docker is the easiest route!
- Image is postgrest/postgrest
One primary environment variable:
- PGRST_DB_URI is the same address we’ve used several times already to connect to a database: postgres://user:password@host:port/database
Need to tell the container to use the system network
- On Windows or Linux, add the --net=host flag when running the container
- On MacOS, use host.docker.internal instead of localhost in the URI
Basic access on port 3000

Authenticating PostgREST

If you try to access and endpoint currently, you’ll get a response about how anonymous access is disabled
If you wanted the API to be able to write and update the database, you can set PostgREST up with full user authentication
For just read-access though, we can enable anonymous access. We just need to:
- Create a schema we want the accessible tables to live in
  - I like to call it something like api
- Create a database user for the anonymous accessor
- Give that database user permission to (read) access the schema and tables in that schema

Postgres User Permissions

A fundamental part of database management is user management, which are called roles in Postgres
You can create new roles with various permission attributes
```
CREATE ROLE |||name||| |||attributes|||;
```
You can see all existing roles in the build-in table!
```
SELECT * FROM pg_roles;
```
A little confusingly, roles can represent either individuals or groups
- You can grant one role to another (inheriting that roles permissions) to create groups

PostgREST Roles

For anonymous validation, we can create a role that is not allowed to login
```
CREATE ROLE web_anon NOLOGIN;
```
Note that this will NOT change the PGRST_DB_URI login. It isn’t a bad idea to login though with a non-superuser account. So, optionally:
```
CREATE ROLE web_auth 
  NOINHERIT LOGIN PASSWORD 'secretsecret';
GRANT web_anon TO web_auth;
```
then use that for your URI:
```
postgresql://web_auth:secretsecret@localhost:port/database
```

GRANTING Permissions

Roles by themselves have very limited permissions! Need to explicitly grant them (unless they inherited them elsewhere)
We need to GRANT USAGE rights to schemas
```
GRANT USAGE ON SCHEMA api TO web_anon;
```

We need to GRANT SELECT rights to tables/views

GRANT SELECT ON api.|||table name||| TO web_anon;

Updating Environment Variables

Need to add a few more environment variables:
- PGRST_DB_SCHEMA: The schema that we want PostgREST to generate endpoints for (and the one we just gave permissions to
- PGRST_DB_ANON_ROLE: The role we created and gave the permissions to (web_anon)
With those specified, we should be able to then access our endpoint!

Documentation

It is often a good idea to document your APIs, so that someone can tell what you are making available
PostgREST can create a basic layout for you using the OpenAPI documentation syntax
We can hook another service into this (SwaggerUI is common) to have a website where we can view our lovely documentation
Requires setting another environment variable in PostgREST to make this available:
- PGRST_OPENAPI_SERVER_PROXY_URI: Set to localhost or to your PostgREST server URL

SwaggerUI

There are many website documentation programs that can “talk” to the OpenAPI service to discover what endpoints your API offers and display them in a pretty fashion
A common one is SwaggerUI, installed most easy through (you guessed it) Docker
- swaggerapi/swagger-ui on DockerHub
- Requires setting 1 environment variable: API_URL which should match the proxy URI we just defined for PostgREST
- “Poke a Docker hole” in port 8080 to then be able to access the documentation

Break!

Break Time!

String Basics

Stringy Functions (CORE)

Function	Description
`\|\|\|str₁\|\|\| \|\| \|\|\|str₂\|\|\|`	Concatenates string 1 and string 2 together
`upper(\|\|\|str\|\|\|)`	Converts a string to all uppercase characters
`lower(\|\|\|str\|\|\|)`	Converts a string to all lowercase characters
`char_length(\|\|\|str\|\|\|)`	Returns the number of characters in the string
`position(\|\|\|str\|\|\| IN \|\|\|substr\|\|\|)`	Find the number of the character where the substring begins
`trim(\|\|\|opt\|\|\| \|\|\|chr\|\|\| FROM \|\|\|str\|\|\|)`	Removes the given characters from the string, optionally taking from the leading or trailing edge
`substring(\|\|\|str\|\|\| FROM \|\|\|n\|\|\| FOR \|\|\|l\|\|\|)`	Returns the portion of the string starting at position n and continuing for l characters

String Functions (Postgres)

Function	Description
`initcap(\|\|\|str\|\|\|)`	Converts the first character of each word to uppercase, and the rest lower
`left(\|\|\|str\|\|\|,\|\|\|n\|\|\|)`	Returns the first n characters of the string
`right(\|\|\|str\|\|\|,\|\|\|n\|\|\|)`	Returns the last n characters of the string
`ltrim(\|\|\|str\|\|\|,\|\|\|chr\|\|\|)`	Remove the characters (space by default) from the start of the string
`rtrim(\|\|\|str\|\|\|,\|\|\|chr\|\|\|)`	Remove the characters (space by default) from the end of the string
`replace(\|\|\|str\|\|\|,\|\|\|from\|\|\|,\|\|\|to\|\|\|)`	Replaces all occurrences of from in the string to
`length(\|\|\|str\|\|\|)`	Returns the number of characters in the string
`substr(\|\|\|str\|\|\|, \|\|\|n\|\|\|, \|\|\|l\|\|\|)`	Returns the portion of the string starting at position n and continuing l characters

Regular Expressions

Enhanced Pattern Matching

We’ve already seen basic pattern matching with LIKE and LIKE
- Some flexibility with wildcard characters: % and _
To get (much) more flexibility, we need to pivot to something made for exactly this purpose: regular expressions (or regex)
Regular expressions are a sequence of mostly single character symbols that denote exactly what patterns one could wish for
- These sequences of characters can initially look very inscrutable! Stick with it!
Regex’s are useful all over, and supported in almost all programming languages as well. Learning at least the basics is time very well spent.

Basic Regex Terms

Expression	Description
`.`	Matches any character except a new line
`[abc]`	Matches any character in the square brackets (a or b or c)
`[a-z]`	Matches a range of characters (all lowercase letters here)
`[^a-z]`	Caret negates what follows (so no lowercase letters here)
`\w`	Any word character, digit or underscore
`\d`	Any digit
`\s`	A space
`\t`	A tab character
`\n`	A newline character

Expression	Description
`^`	Match at the start of the string
`$`	Match at the end of the string
`?`	Get the preceding match 0 or one time
`*`	Get the preceding match zero or more times
`+`	Get the preceding match one or more times
`{m}`	Get the preceding match exactly m times
`{m,n}`	Get the preceding match between m and n times
`a\|b`	Match on either a or b
`( )`	Create a capture group or set precedence
`(?: )`	Negate reporting a capture group

Other Regex Concepts

If you ever want to match off a symbol that has special meaning in regex (a parenthese, for instance) you must escape it with a backslash: \(
Reserved characters include: { } [ ] / \ + * . $ ^ | ?
Flags can be added at the end to tweak matching
- /i means that matches will be case insensitive
- /g means that all instances of the match will be returned, not just the first
- /m allows the anchor characters (^ and $) to operate on each line, not just across the entire string.

Activity!

Practicing with regular expressions is the only way to improve at your ability to see the possible patterns
The link here has a nice sequence of short problems to test your skills against
Most problems consist of:
- Terms that you want to match correctly
- Terms that you want to not match
- Capture groups that you’d like to capture
Working with your neighbors, see how many you can figure out in the next 20 minutes

Back to SQL

One of the main ways we previously used pattern matching was for filtering
You can also use regexes for filtering!
- ~ is a case sensitive match using the following regex
- ~* is a case insensitive match using the following regex
- Either can have a ! in front to negate the search (where things do not match the regex)

SELECT colname
FROM tablename
WHERE colname ~ '[a-z]*\s\d{2}';

Extracting Data

Another hugely common use of regex is to extract only the data you want from a much larger string
This can be particularly useful when cleaning data or constructing useful database tables
Since PostgreSQL 15, you have had a perfect tool to accomplish this easily: regexp_substr
```
SELECT
  regexp_substr('today is March 21, 2024', '\d{4}');
```

Further Extracting Data

What if you have multiple capture groups though?
regexp_match(|||str|||, |||regex|||) returns the first matching instance in the string
- What is returned is whatever is in any capture groups you may have included in your regex, or the entire match if there are no capture groups
- Output is returned as an array, to allow for potentially multiple capture groups
- Index it out specific values using [N] at the end after wrapping entire expression in ()
- Can also UNNEST

SELECT (regexp_match('today is March 15, 2022', '\d{4}'))[1];

Regular Splitting

You can also use regular expressions to replace or split text
regexp_replace(|||text|||, |||re|||, |||replacement|||) will replace the first matches of re in text with the replacement
- Add another 'g' argument on the end to do so globally (replacing all matches)
regexp_split_to_array(|||text|||, |||re|||) will split the text into an array on each match to re
regexp_split_to_table(|||text|||, |||re|||) will split the text into a table column on each match to re. This is just like unnesting the array.

Now to You!

With your neighbors, take a look at this CSV file, which contains a simple subset of artists and dimensions from the MoMA data set
Create the simple table and import in the data from the CSV
See if you can achieve the following:
- Create and populate new columns for first, middle, and last name
- Create and populate new columns to hold the width and height in inches (in 1 1/4 form)
- If the above was easy and you finish quickly, create new columns to hold decimal equivalents of the width and height

Full Text Tools

Full Text Search

Pattern matching or regular expressions are great for parsing longer text for particular known patterns
What if you want more general information about the text within a longer passage?
- We need some different tools
Using these new tools will require using some new functions and data types:
- ts_vector
- ts_query

Text To Vector

English (and other languages) have many connecting words that help convey meaning, intent, or relationships
These are commonly not what are desired when searching text for particular ideas
Instead it is useful to focus on lexemes
- A lexeme is an abstract unit of meaning that underlies a set of words
- RUN: run, runs, ran, running
Postgres’s to_tsvector will break a string down into its component lexemes, and keep track of where each occured in the original string

SELECT to_tsvector('I flew back to Salem on Saturday');

>> 'back':3 'flew':2 'saturday':7 'salem':5

Lexeme Queries

Text that you want to search through will need to be converted to a vector through to_tsvector
Text that you want to search for will be converted using to_tsquery
to_tsquery takes a sequence of words with symbols connecting them conveying relationships
- & - and
- | - or
- ! - not
- <-> - followed by
```
SELECT to_tsquery('fly & monday')
```

Combining Vectors and Queries

To actually complete a full text search, you ask Postgres to look through a tsvector object for a particular tsquery
The syntax to do so utilizes the match operator, which is two “at” symbols: @@
```
SELECT |||some_tsvector||| @@ |||some_tsquery|||;
```
Using @@ is a true/false search, so the query is either found or not
- This means it can be used for filtering with WHERE as well!

Indexing TSVectors

Individual tsvectors can not be easily ordered, so indexing a column with tsvector contents using the normal B-Tree method would not be effective
For this sort of content, use a Generalized Inverted Index or GIN indexing method instead
```
CREATE INDEX |||idx_name||| ON |||tab_name||| 
USING GIN (|||col_name|||);
```
Seriously consider adding your tsvector column as an index, as it can significantly speed up these sorts of searches

Getting More Info

Sometimes it can be useful to get a bit more information about where a match shows up in the text
The ts_headline function can capture snippets of text around a match and display them
- ts_headline operates on the original text, not the tsvector! This will absolutely make it slower, so use it wisely!
There are a few required parameters and some options for ts_headline:
- the text to search
- the tsquery to look for
- Other options appear in an option string:
  - StartSel/StopSel: the delimiters that will showcase the word
  - MinWords/MaxWords: the min or max number of words to show around the match
  - MaxFragments: a max number of fragments to show if the match occurs multiple times

Ranking

Sometimes a query might return many results, such that you would want a method to rank or compare them and only select the “best”
Ranking by relevancy is vague and tends to be very application specific, but it can be useful
ts_rank will give an arbitrary rank based on how many times your query words appear in the text
- This might mean that longer texts will always receive a higher rank! You can normalize by the text length by providing an extra numeric code as a third argument (2 or 8 probably best)
ts_rank_cd does similar, but also considers the proximity of searched lexemes
Both functions take 2 required arguments:
- a tsvector of the contents to be ranked
- a tsquery determining how they will be ranked

Your Turn!

The file here contains the SQL commands to generate and populate a simple table alice which hold the raw chapter contents of the book: Alice in Wonderland. You will need to set up your own tsvector column and index. With yet one more new partner, see if you can use the data to answer the following:

In what chapters does the “Chesire cat” appear?
In what chapter does the word “mushroom” appear the most? How many times does it appear?