---
title: "Interfaces to Regular Expressions"
author: Jed Rembold
date: April 2, 2025
slideNumber: true
theme: nord_light
highlightjs-theme: nord
width: 1920
height: 1080
transition: slide
---


## Announcements
- Homework
  - Homework 8 due tonight
  - HW5 and HW6 feedback went out!
  - HW9 goes live tonight
- Project
  - Normalization went pretty rough on HW6. Have your partner check your data model and ensure you are properly normalizing things!
    - I can also look at it for some feedback
  - In the coming week, I'd see about getting your partner's database tables linked into your database using a foreign data wrapper as discussed before break

# Serving: Part I
## APIs
- One common way that you might want to make your data available is through an API
  - This gives others a streamlined method to retrieve information, in an identical fashion to how you may have scraped an API initially
  - API's can do more than just read information though, they can also be used, if set up to do so, to write or change information in a database
- Creating an API is a bit like creating a dynamic website:
  - A server waits for a request at a particular endpoint
  - The server runs SQL (or other code) to interact with the server's database
  - The server returns the desired information to the requestor


## API Creation
- Many scripting languages have libraries and other utilities to help create web-based APIs
  - In Python, [FastAPI](https://fastapi.tiangolo.com/) is probably the most popular and nice to use
  - In R, [Plumber](https://www.rplumber.io/) seems the most common
- Writing a bespoke API undeniably gives you the most control over how it works, but not everyone has extensive enough scripting backgrounds
- Alternatives:
  - Docker-based solutions ([mine](https://github.com/rembold-data-engineering-master/db_serving_api_template/) uses FastAPI under the hood)
  - [PostgREST](https://docs.postgrest.org/en/v12/)


## PostgREST Basics
- PostgREST is a service that connects to a database and automatically offers up an API to connect to desired tables
- Each table (or view) essentially gets assigned an endpoint
- A user accessing that endpoint receives JSON of the contents in that table


## More Advanced PostgREST 
- PostgREST also offers up a slew of endpoint [options](https://docs.postgrest.org/en/v12/references/api/tables_views.html#operators) that can be included to further filter the data returned
  - This is actually a huge boon, as such capabilities otherwise have to be manually coded
- PostgREST can not only read from the database, but perform most CRUD operations (create, read, update, or delete)
  - If you are going to let the API edit the database though, you should DEFINITELY implement access controls
  - PostgREST can help you handle this without a ton of effort, but it does complicate things.
  - My suggestion for the project is to just implement various useful read-only endpoints


## Installing PostgREST
- [Docker](https://docs.postgrest.org/en/v12/explanations/install.html#docker) is the easiest route!
  - Image is `postgrest/postgrest`
- One primary environment variable:
  - `PGRST_DB_URI` is the same address we've used several times already to connect to a database: `postgres://user:password@host:port/database`{.text}
- Need to tell the container to use the system network
  - On Windows or Linux, add the `--net=host`{.text} flag when running the container
  - On MacOS, use `host.docker.internal` instead of `localhost` in the URI
- Basic access on port 3000


## Authenticating PostgREST
- If you try to access and endpoint currently, you'll get a response about how anonymous access is disabled
- If you wanted the API to be able to write and update the database, you can set PostgREST up with full user [authentication](https://docs.postgrest.org/en/v12/references/auth.html)
- For just read-access though, we can enable anonymous access. We just need to:
  - Create a schema we want the accessible tables to live in
    - I like to call it something like `api`
  - Create a database user for the anonymous accessor
  - Give that database user permission to (read) access the schema and tables in that schema


## Postgres User Permissions
- A fundamental part of database management is user management, which are called [roles](https://www.postgresql.org/docs/current/database-roles.html) in Postgres
- You can create new roles with various permission attributes
  ```postgresql
  CREATE ROLE |||name||| |||attributes|||;
  ```
- You can see all existing roles in the build-in table!
  ```postgresql
  SELECT * FROM pg_roles;
  ```
- A little confusingly, roles can represent either individuals or groups
  - You can grant one role to another (inheriting that roles permissions) to create groups


## PostgREST Roles
- For anonymous validation, we can create a role that is not allowed to login
  ```{.postgresql style='font-size:.9em'}
  CREATE ROLE web_anon NOLOGIN;
  ```
- Note that this will NOT change the `PGRST_DB_URI` login. It isn't a bad idea to login though with a non-superuser account. So, **optionally**:
  ```{.postgresql style='font-size:.9em'}
  CREATE ROLE web_auth 
    NOINHERIT LOGIN PASSWORD 'secretsecret';
  GRANT web_anon TO web_auth;
  ```
  then use that for your URI:
  ```{.text style='font-size:.9em'}
  postgresql://web_auth:secretsecret@localhost:port/database
  ```

## GRANTING Permissions
- Roles by themselves have very limited permissions! Need to explicitly [grant](https://www.postgresql.org/docs/17/sql-grant.html) them (unless they inherited them elsewhere)
- We need to `GRANT` `USAGE` rights to schemas
  ```postgresql
  GRANT USAGE ON SCHEMA api TO web_anon;
  ```
- We need to `GRANT` `SELECT` rights to tables/views
  ```postgresql
  GRANT SELECT ON api.|||table name||| TO web_anon;
  ```


## Updating Environment Variables
- Need to add a few more environment variables:
  - `PGRST_DB_SCHEMA`: The schema that we want PostgREST to generate endpoints for (and the one we just gave permissions to
  - `PGRST_DB_ANON_ROLE`: The role we created and gave the permissions to (`web_anon`)
- With those specified, we should be able to then access our endpoint!


## Documentation
- It is often a good idea to document your APIs, so that someone can tell what you are making available
- PostgREST can create a basic layout for you using the OpenAPI documentation syntax
- We can hook another service into this (SwaggerUI is common) to have a website where we can view our lovely documentation
- Requires setting another environment variable in PostgREST to make this available:
  - `PGRST_OPENAPI_SERVER_PROXY_URI`: Set to localhost or to your PostgREST server URL

## SwaggerUI
- There are many website documentation programs that can "talk" to the OpenAPI service to discover what endpoints your API offers and display them in a pretty fashion
- A common one is SwaggerUI, installed most easy through (you guessed it) Docker
  - `swaggerapi/swagger-ui` on DockerHub
  - Requires setting 1 environment variable: `API_URL` which should match the proxy URI we just defined for PostgREST
  - "Poke a Docker hole" in port 8080 to then be able to access the documentation


# Break!
## Break Time!
![](../images/cat_eating.gif)


<!--------  SQL ------------------>
# String Basics
## Stringy Functions (CORE)

:::{style='font-size:.9em'}

| Function                  | Description                                                                                           |
|-----------------------------------|---------------------------------------------|
| `|||str₁||| || |||str₂|||` | Concatenates string 1 and string 2 together |
| `upper(|||str|||)`              | Converts a string to all uppercase characters                                                         |
| `lower(|||str|||)`              | Converts a string to all lowercase characters                                                         |
| `char_length(|||str|||)`        | Returns the number of characters in the string                                                        |
| `position(|||str||| IN |||substr|||)` | Find the number of the character where the substring begins                                           |
| `trim(|||opt||| |||chr||| FROM |||str|||)`  | Removes the given characters from the string, optionally taking from the _leading_ or _trailing_ edge |
| `substring(|||str||| FROM |||n||| FOR |||l|||)` | Returns the portion of the string starting at position n and continuing for l characters |

:::

## String Functions (Postgres)

:::{style='font-size:.9em'}

| Function                  | Description                                                                              |
|--------------------------------|------------------------------------------------------------------------------------------|
| `initcap(|||str|||)`            | Converts the first character of each word to uppercase, and the rest lower               |
| `left(|||str|||,|||n|||)`             | Returns the first n characters of the string                                             |
| `right(|||str|||,|||n|||)`            | Returns the last n characters of the string                                              |
| `ltrim(|||str|||,|||chr|||)`          | Remove the characters (space by default) from the start of the string                    |
| `rtrim(|||str|||,|||chr|||)`          | Remove the characters (space by default) from the end of the string                      |
| `replace(|||str|||,|||from|||,|||to|||)`    | Replaces all occurrences of _from_ in the string _to_                                   |
| `length(|||str|||)`             | Returns the number of characters in the string                                           |
| `substr(|||str|||, |||n|||, |||l|||)` | Returns the portion of the string starting at position n and continuing l characters |

:::


# Regular Expressions
## Enhanced Pattern Matching
- We've already seen basic pattern matching with `LIKE` and `LIKE`
  - Some flexibility with wildcard characters: `%` and `_`
- To get (much) more flexibility, we need to pivot to something made for exactly this purpose: _regular expressions_ (or _regex_)
- Regular expressions are a sequence of mostly single character symbols that denote exactly what patterns one could wish for
  - These sequences of characters can initially look very inscrutable! Stick with it!
- Regex's are useful all over, and supported in almost all programming languages as well. Learning at least the basics is time very well spent.


## Basic Regex Terms

:::{style='font-size:.7em'}

::::::cols
::::col

| Expression | Description                                                |
|------------|------------------------------------------------------------|
| `.`        | Matches _any_ character except a new line                  |
| `[abc]`    | Matches any character in the square brackets (a or b or c) |
| `[a-z]`    | Matches a range of characters (all lowercase letters here) |
| `[^a-z]`   | Caret negates what follows (so no lowercase letters here)  |
| `\w`       | Any word character, digit or underscore                    |
| `\d`       | Any digit                                                  |
| `\s`       | A space                                                    |
| `\t`       | A tab character                                            |
| `\n`       | A newline character                                        |

::::

::::col

| Expression | Description                                   |
|------------|-----------------------------------------------|
| `^`        | Match at the start of the string                           |
| `$`        | Match at the end of the string                             |
| `?`        | Get the preceding match 0 or one time         |
| `*`        | Get the preceding match zero or more times    |
| `+`        | Get the preceding match one or more times     |
| `{m}`      | Get the preceding match exactly m times       |
| `{m,n}`    | Get the preceding match between m and n times |
| `a|b`      | Match on either a or b                        |
| `( )`      | Create a capture group or set precedence      |
| `(?: )`    | Negate reporting a capture group              |
::::
::::::

:::

## Other Regex Concepts
- If you ever want to match off a symbol that has special meaning in regex (a parenthese, for instance) you must _escape it_ with a backslash: `\(`
- Reserved characters include: `{ } [ ] / \ + * . $ ^ | ?`
- Flags can be added at the end to tweak matching
  - `/i` means that matches will be case insensitive
  - `/g` means that all instances of the match will be returned, not just the first
  - `/m` allows the anchor characters (`^` and `$`) to operate on each line, not just across the entire string.


## {data-background-iframe="https://regexr.com/6hfai"}

## Activity!
- Practicing with regular expressions is the only way to improve at your ability to see the possible patterns
- The link [here](https://regexone.com/problem/matching_phone_numbers) has a nice sequence of short problems to test your skills against
- Most problems consist of:
  - Terms that you want to match correctly
  - Terms that you want to **not** match
  - Capture groups that you'd like to capture
- Working with your neighbors, see how many you can figure out in the next 20 minutes


## Back to SQL
- One of the main ways we previously used pattern matching was for filtering
- You can also use regexes for filtering!
  - `~` is a case sensitive match using the following regex
  - `~*` is a case insensitive match using the following regex
  - Either can have a `!` in front to negate the search (where things do **not** match the regex)

```pgsql
SELECT colname
FROM tablename
WHERE colname ~ '[a-z]*\s\d{2}';
```

## Extracting Data
- Another hugely common use of regex is to extract only the data you want from a much larger string
- This can be particularly useful when cleaning data or constructing useful database tables
- Since PostgreSQL 15, you have had a perfect tool to accomplish this easily: `regexp_substr`
  ```pgsql
  SELECT
    regexp_substr('today is March 21, 2024', '\d{4}');
  ```

## Further Extracting Data
- What if you have multiple capture groups though?
- `regexp_match(|||str|||, |||regex|||)` returns the first matching instance in the string
  - What is returned is whatever is in any _capture groups_ you may have included in your regex, or the entire match if there are no capture groups
  - Output is returned as an array, to allow for potentially multiple capture groups
  - Index it out specific values using `[N]` at the end after wrapping entire expression in ()
  - Can also `UNNEST`

```{.pgsql style='font-size:.9em'}
SELECT (regexp_match('today is March 15, 2022', '\d{4}'))[1];
```

## Regular Splitting
- You can also use regular expressions to replace or split text
- `regexp_replace(|||text|||, |||re|||, |||replacement|||)` will replace the first matches of re in text with the replacement
  - Add another `'g'` argument on the end to do so globally (replacing all matches)
- `regexp_split_to_array(|||text|||, |||re|||)` will split the text into an array on each match to re
- `regexp_split_to_table(|||text|||, |||re|||)` will split the text into a table column on each match to re. This is just like unnesting the array.


## Now to You!
- With your neighbors, take a look at [this](../activity_data/regex_parsing.csv) CSV file, which contains a simple subset of artists and dimensions from the MoMA data set
- Create the simple table and import in the data from the CSV
- See if you can achieve the following:
  - Create and populate new columns for first, middle, and last name
  - Create and populate new columns to hold the width and height in inches (in `1 1/4` form)
  - If the above was easy and you finish quickly, create new columns to hold decimal equivalents of the width and height


# Full Text Tools
## Full Text Search
- Pattern matching or regular expressions are great for parsing longer text for particular known patterns
- What if you want more general information about the text within a longer passage?
  - We need some different tools
- Using these new tools will require using some new functions and data types:
  - `ts_vector`
  - `ts_query`


## Text To Vector
- English (and other languages) have many _connecting_ words that help convey meaning, intent, or relationships
- These are commonly not what are desired when searching text for particular ideas
- Instead it is useful to focus on _lexemes_
  - A lexeme is an abstract unit of meaning that underlies a set of words
  - RUN: run, runs, ran, running
- Postgres's `to_tsvector` will break a string down into its component lexemes, and keep track of where each occured in the original string
```{.postgresql style='font-size:1em'}
SELECT to_tsvector('I flew back to Salem on Saturday');

>> 'back':3 'flew':2 'saturday':7 'salem':5
```

## Lexeme Queries
- Text that you want to search __through__ will need to be converted to a vector through `to_tsvector`
- Text that you want to search __for__ will be converted using `to_tsquery`
- `to_tsquery` takes a sequence of words with symbols connecting them conveying relationships
  - `&` - and
  - `|` - or
  - `!` - not
  - `<->` - followed by
```postgresql
SELECT to_tsquery('fly & monday')
```

## Combining Vectors and Queries
- To actually complete a full text search, you ask Postgres to look through a `tsvector` object for a particular `tsquery`
- The syntax to do so utilizes the match operator, which is two "at" symbols: `@@`
  ```postgresql
  SELECT |||some_tsvector||| @@ |||some_tsquery|||;
  ```
- Using `@@` is a true/false search, so the query is either found or not
  - This means it can be used for filtering with `WHERE` as well!


## Indexing TSVectors
- Individual `tsvector`s can not be easily ordered, so indexing a column with `tsvector` contents using the normal B-Tree method would not be effective
- For this sort of content, use a _Generalized Inverted Index_ or GIN indexing method instead
  ```postgresql
  CREATE INDEX |||idx_name||| ON |||tab_name||| 
  USING GIN (|||col_name|||);
  ```
- Seriously consider adding your `tsvector` column as an index, as it can significantly speed up these sorts of searches

## Getting More Info
:::smaller
- Sometimes it can be useful to get a bit more information about _where_ a match shows up in the text
- The `ts_headline` function can capture snippets of text around a match and display them
  - `ts_headline` operates on the original text, **not** the tsvector! This will absolutely make it slower, so use it wisely!
- There are a few required parameters and some options for `ts_headline`:
  - the text to search
  - the tsquery to look for
  - Other options appear in an option string:
    - `StartSel`/`StopSel`: the delimiters that will showcase the word
    - `MinWords`/`MaxWords`: the min or max number of words to show around the match
    - `MaxFragments`: a max number of fragments to show if the match occurs multiple times
:::

## Ranking
:::smaller
- Sometimes a query might return many results, such that you would want a method to rank or compare them and only select the "best"
- Ranking by relevancy is vague and tends to be very application specific, but it can be useful
- `ts_rank` will give an arbitrary rank based on how many times your query words appear in the text
  - This might mean that longer texts will always receive a higher rank! You can normalize by the text length by providing an extra numeric code as a third argument (2 or 8 probably best)
- `ts_rank_cd` does similar, but also considers the proximity of searched lexemes
- Both functions take 2 required arguments:
  - a tsvector of the contents to be ranked
  - a tsquery determining how they will be ranked
:::

## Your Turn!
The file [here](../activity_data/alice.sql) contains the SQL commands to generate and populate a simple table `alice` which hold the raw chapter contents of the book: Alice in Wonderland. You will need to set up your own `tsvector` column and index. With yet one more new partner, see if you can use the data to answer the following:

- In what chapters does the "Chesire cat" appear?
- In what chapter does the word "mushroom" appear the most? How many times does it appear?