Regular Text

Jed Rembold

Monday, November 3, 2025

Announcements

HW8 posted was posted last week!
- Working with subqueries, dates and times
Trying to get Test 2 results by Wednesday, but might need to wait longer to hand back
Spending this week in class with text mining
Polling: polling.jedrembold.prof

Given the starting table called rev, what is the output of the query?

SELECT
  CASE
    WHEN num % 2 = 0 THEN name
    WHEN name > 'B' THEN 'D'
    ELSE 'A'
  END
FROM rev
WHERE num < 4
ORDER BY num DESC
LIMIT 1

Time to focus on everything we can do with strings!
Chapter topics fall into several main ideas:
- Manipulating strings
- More complicated pattern matching
- Full text searching using normalization and lexemes
All are geared around making using text and strings much more powerful and flexible

Function	Description
`\|\|\|str\|\|\| \|\| \|\|\|str2\|\|\|`	Concatenates string 1 and string 2 together
`upper(\|\|\|str\|\|\|)`	Converts a string to all uppercase characters
`lower(\|\|\|str\|\|\|)`	Converts a string to all lowercase characters
`char_length(\|\|\|str\|\|\|)`	Returns the number of characters in the string
`position(\|\|\|str\|\|\| IN \|\|\|substr\|\|\|)`	Find the number of the character where the substring begins
`trim(\|\|\|opt\|\|\| \|\|\|chr\|\|\| FROM \|\|\|str\|\|\|)`	Removes the given characters from the string, optionally taking from the leading or trailing edge
`substring(\|\|\|str\|\|\| FROM \|\|\|n\|\|\| FOR \|\|\|l\|\|\|)`	Returns the portion of the string starting at position n and continuing for l characters

Function	Description
`initcap(\|\|\|str\|\|\|)`	Converts the first character of each word to uppercase, and the rest lower
`left(\|\|\|str\|\|\|,\|\|\|n\|\|\|)`	Returns the first n characters of the string
`right(\|\|\|str\|\|\|,\|\|\|n\|\|\|)`	Returns the last n characters of the string
`ltrim(\|\|\|str\|\|\|,\|\|\|chr\|\|\|)`	Remove the characters (space by default) from the start of the string
`rtrim(\|\|\|str\|\|\|,\|\|\|chr\|\|\|)`	Remove the characters (space by default) from the end of the string
`replace(\|\|\|str\|\|\|,\|\|\|from\|\|\|,\|\|\|to\|\|\|)`	Replaces all occurance of from in the string to to
`length(\|\|\|str\|\|\|)`	Returns the number of characters in the string
`substr(\|\|\|str\|\|\|, \|\|\|n\|\|\|, \|\|\|l\|\|\|)`	Returns the portion of the string starting at position n and continuing l characters

We’ve already seen basic pattern matching with LIKE and LIKE
- Some flexibility with wildcard characters: % and _
To get (much) more flexibility, we need to pivot to something made for exactly this purpose: regular expressions (or regex)
Regular expressions are a sequence of mostly single character symbols that denote exactly what patterns one could wish for
- These sequences of characters can initially look very inscrutable! Stick with it!
Regex’s are useful all over, and supported in almost all programming languages as well. Learning at least the basics is time very well spent.

Expression	Description
`.`	Matches any character except a new line (this can vary some in other implementations)
`[abc]`	Matches any character in the square brackets (a or b or c)
`[a-z]`	Matches a range of characters (all lowercase letters here)
`[^a-z]`	Caret negates what follows (so no lowercase letters here)
`\w`	Any word character, digit or underscore
`\d`	Any digit
`\s`	A space
`\t`	A tab character
`\n`	A newline character

Expression	Description
`^`	Match at the start of the string
`$`	Match at the end of the string
`?`	Get the preceding match 0 or one time
`*`	Get the preceding match zero or more times
`+`	Get the preceding match one or more times
`{m}`	Get the preceding match exactly m times
`{m,n}`	Get the preceding match between m and n times
`a \| b`	Match on either a or b, where a and b are full matching expressions
`( )`	Create a capture group or set precedence
`(?: )`	Negate reporting a capture group

If you ever want to match off a symbol that has special meaning in regex (a parenthese, for instance) you must escape it with a backslash: \(
Reserved characters include: { } [ ] / \ + * . $ ^ | ?
Flags can be added at the end to tweak matching
- /i means that matches will be case insensitive
- /g means that all instances of the match will be returned, not just the first
- /m allows the anchor characters (^ and $) to operate on each line, not just across the entire string.

The link here has a nice sequence of short problems to test your skills against
Most problems consist of:
- Terms that you want to match correctly
- Terms that you want to not match
- Capture groups that you’d like to capture
Alongside your neighbors, see how many you can figure out in the next 20 minutes
- I want you talking. Only 1 computer allowed for the app and for someone to type on. Another can have the earlier table open.

One of the main ways we previously used pattern matching was for filtering
You can also use regexes for pattern matching!
- ~ is a case sensitive match using the following regex
- ~* is a case insensitive match using the following regex
- Either can have a ! in front to negate the search (where things do not match the regex)

SELECT colname
FROM tablename
WHERE colname ~ '[a-z]*\s\d{2}';

Another hugely common use of regex is to extract only the data you want from a much larger string
This can be particularly useful when cleaning data or constructing useful database tables
regexp_match(str, regex) returns the first matching instance in the string
- What is returned is whatever is in any capture groups you may have included in your regex, or the entire match if there are no capture groups
- Output is returned as an array, to allow for potentially multiple capture groups
- If you just have one capture group and don’t want it in an array, index it out using [1] at the end after wrapping entire expression in ()

SELECT (regexp_match('today is March 15, 2022', '\d{4}'))[1];

In the same pairings as earlier, take a look at this CSV file, which contains a simple subset of artists and dimensions from the MoMA data set
Create the simple table and import in the data from the CSV
See if you can achieve the following:
- Create and populate new columns for first, middle, and last name
- Create and populate new columns to hold the width and height in inches (in 1 1/4 form)
- If feeling good about the above, create new columns to hold decimal equivalents of the width and height