The Right Type

Jed Rembold

Wednesday, September 3, 2025

Announcements

HW1 is due at the end of the day today
- You just need to have uploaded your solutions back to the GitHub repository. Nothing else needs to be done.
HW2 will be posted by the end of the day, due next Wednesday night
Look over Ch 5 for next Monday
Polling: polling.jedrembold.prof

Review Question

Suppose a table had been created using the SQL statement below. How would you then query for the most recently measured number of positive cases from schools beginning with “University”?

CREATE TABLE cases (
  id INT,
  school_name VARCHAR(50),
  case_positive INT,
  tests_run INT,
  date_taken DATE
);

SELECT case_positive FROM cases
WHERE school_name LIKE 'University_'
ORDER BY date_taken;

SELECT case_positive FROM cases
WHERE school_name LIKE 'University%'
ORDER BY date_taken DESC;

SELECT case_positive FROM cases
WHERE school_name LIKE '%University%'
ORDER BY case_positive DESC, 
         date_taken DESC;

SELECT case_positive FROM cases
WHERE school_name LIKE '%University%'
ORDER BY date_taken DESC;

Data Type Fundamentals

Each column in a table can only have data of a single type
- Or be missing an entry, designated with a special NULL entry
Assigning the correct data types of columns can
- Improve memory allocation
- Improve performance
- Prevent errors in data entry
- Prevent mistakes in calculations
Postgres can accept a large number of data types, but today we will focus in depth on the most common: characters, numbers, and dates and times

Text

Characters or Text

One of the more common things to be stored are characters or sequences of characters
Postgres has several types you can use to store this sort of information:
- CHAR(n): A fixed length column of n characters. Always uses this amount of characters, padding with spaces if your stored sequence of characters is shorter.
  - Omitting the (n) is the same as CHAR(1)
- VARCHAR(n): A variable length column with a maximum of n characters. If your stored sequence is smaller, then no padding is done and the data is stored as is.
  - Omitting the (n) is the same as TEXT
- TEXT★: A variable length column of “unlimited length” (about 1 GB).
In Postgres (unlike in some other variants) there is not really a difference in performance between all 3 types

★ – Postgres specific, though similar implementations exist in other variants

Character Takeaways

In general, don’t use CHAR unless what you are storing is always a given length of characters
- State abbreviations for instance, or possibly other single character encodings
Use VARCHAR or TEXT in most situations
- VARCHAR is more portable, but make sure to choose a max that is well above whatever your longest sequence of characters could be
- Can also choose VARCHAR if you want an error to be output if too many characters are entered
- If you don’t mind being Postgres specific, TEXT should always work

Numbers

Types of numbers

Important to be able to do built-in math calculations on columns
If the information you want to store is numeric, always use a numeric data type rather than the characters of a number
A few different options for numbers:
- Integers: whole numbers (positive and negative)
- Fixed-point and Floating-point: fractions of whole numbers (also positive and negative)
For each there are several data types that you can choose from, largely depending on the size of the numbers you are expecting.

Integers

Probably the most common form of number you will use in a database
Can use 3 different data types in SQL to represent an integer
- Differ just in the size of the numbers they can hold
Type Size Range

SMALLINT 2 bytes ±32,767

INTEGER 4 bytes ±2,147,483,648

BIGINT 8 bytes ±9,223,372,036,854,775,808
SMALLINT is generally only used if disk space is at a premium
Probably only use BIGINT if sure that INTEGER is insufficient

Type	Size	Range
`SMALLINT`	2 bytes	±32,767
`INTEGER`	4 bytes	±2,147,483,648
`BIGINT`	8 bytes	±9,223,372,036,854,775,808

Serial Integers

If you want a column to be auto-incrementing, you can use the serial form of the integer types:
- SMALLSERIAL
- SERIAL
- BIGSERIAL
Have the same allowed ranges as their integer counterparts
No need to specify that column when adding a row: Postgres will automatically increment and add it
While each increment will be unique, you may not have perfect even intervals
- Removed row values are not reused
- If an INSERT is interrupted because of an error, that unique integer is basically skipped

Your Identity

The SERIAL datatypes are Postgres specific
Since Postgres 10 the SQL core version with IDENTITY has been supported
Automatically fill the column with the incremented number:
```
INTEGER GENERATED ALWAYS AS IDENTITY
```
Automatically fill column but allow overwriting:
```
INTEGER GENERATED BY DEFAULT AS IDENTITY
```

Fixed-Point Numbers

Represents a fractional value
Two methods of writing:
- NUMERIC(|||precision|||, |||scale|||)
- DECIMAL(|||precision|||, |||scale|||)
precision is a positive integer representing the max total number of digits comprising the number (on both sides of the decimal point)
- If provided, needs to be between 1 and 1000
scale is a positive integer representing the total number of digits to the right of the decimal point
Will round or pad the decimal digits with 0’s if needed to get the correct number of digits after the decimal
Will return an error if the input number can’t fit within the precision

Fixed-Point Examples

Consider the number: 192.837465
In various fixed-point formats:
- NUMERIC(5,2) = 192.84
- NUMERIC(10,2) = 192.84
- NUMERIC(15,10) = 192.8374650000
- NUMERIC(6,3) = 192.837
- NUMERIC(4,2) = ERROR

If you don’t include inputs:
- NUMERIC(5) = 193
  - Treats the scale as 0
- NUMERIC = 192.837465
  - Will store up to the maximum precision and scale allowed (131,072 digits before and 16,383 digits after the decimal, lol)

Floating-Point Numbers

Still represents a fractional value
Differs from Fixed-Point in that they use exponents to store the information, so the decimal point could “float”
Two data types that can be used:
- REAL
- DOUBLE PRECISION
They differ just in their precision (and how much storage space they take)
- REAL will show at most 9 significant digits and takes up 4 bytes
- DOUBLE PRECISION will show at most 17 significant digits and takes up 8 bytes
  - Postgres will also accept the type FLOAT to be synonymous with DOUBLE PRECISION

Floating Pitfalls

Float-point representations are computed using binary math to make them easy to store in the computer
Some fractional values cannot be perfectly captured with a binary decimal though
- In the same way that we can’t exactly write 1/3 as a decimal
Postgres will use the best approximation that it can given the precision, but it is still an approximation!
Takeaways:
- Do not use floating-point values if you will be doing highly sensitive calculations with the numbers!
  - i.e. money, landing a rocket on the Moon
- Comparing two float-point values for equality might not work as expected, due to these tiny approximations
  - Better to check if greater or less than, or within a small interval around the desired value

Understanding Check!

A table is made as shown to the right. Only one of the rows below could be added to that table using INSERT INTO without causing an error. Which row could be successfully added?

CREATE TABLE revq (
  col1 CHAR(2),
  col2 TEXT,
  col3 SMALLINT,
  col4 NUMERIC(3,2),
  col5 REAL
);

Answer	col1	col2	col3	col4	col5
A	‘OR’	‘Hello!’	30200	12.2	3.14
B	‘12’	‘How odd!’	43210	2.23	2.718281828
C	‘C’	‘Puzzling.’	13.6	1.1	1.41
D	‘okay’	‘Mysterious…’	2132	8	1.618

It’s a Date (and time)

Dates and Times

One very nice aspect of all SQL databases is that they can work with times and dates very easily, assuming the correct data type is used
- This shouldn’t be trivialized! Working with times and dates is often a pain given all the complexities of the Gregorian calendar, time zones, daylight savings time, etc.
Approximately 4 major time and date data types, summarized in the below table

Name	Storage Size	Lowest	Highest	Resolution
`DATE`	4 bytes	4713 BC	5874897 AD	1 day
`TIME`	8 bytes	00:00:00	24:00:00	1 μs
`TIMESTAMP`	8 bytes	4713 BC	294276 AD	1 μs
`INTERVAL`	16 bytes	-178000000 yrs	-178000000 yrs	1 μs

DATES

Holds information for a single day
- No information about time
Data needs to be enclosed in single quotes, like text strings
Postgres can actually correctly parse a ridiculous number of ways of writing the date, but you can never go wrong with the ISO standard of YYYY-MM-DD
Other acceptable variants include:

January 18, 2022	1/18/2022	01/18/22
2022-Jan-18	18-Jan-2022	Jan-18-22
20220118	Jan 18, 22

Times

Holds information about a single time
- No information about date is stored
- A date CAN be included, but it is largely ignored
Still needs to be enclosed in single quotes
Has variants that are both with and without timezone
- TIME is technically without timezone
- TIME WITH TIME ZONE is, as expected, including a timezone
  - Postgres has a shorthand notation for this, called TIMETZ★
  - Due to the way timezones are stored and some issues that arise, using this data type is generally discouraged.

More times

ISO time formats looks like HH:MM:SS.FFFF, where the hours are on the 24 hour clock
Other formats are accepted, including

16:05:06	16:05	04:05 PM
4:05:06.123 PM 0	4:05:06 AM

Timezones

If specifying a timezone, it comes after the time and there are a few ways to describe it:
- Abbreviation: PST (for Pacific Standard Time)
- Full name: America/Los_Angeles
  - This method requires you to enter the date as well, so that it can tell if daylight savings is active or not!
- UTC Offset: -8 (our clocks are currently 8 hours behind GMT)
If no timezone is specified, but the field requires one, the system timezone is used
Examples of times with time zones:
- 04:05:06 PM PST
- 2022-01-18 16:05:06 America/Los_Angeles
- 16:05:06-8

Timestamps

A TIMESTAMP field holds basically both a date and a time (in that order)
The same properties and formats that apply to each individually apply here as well
Also have the option of with or without a timezone
- TIMESTAMP by itself is without a timezone, as per SQL standards
- TIMESTAMP WITH TIME ZONE is with a timezone. Shorthand in Postgres is TIMESTAMPTZ★
Internally, timestamps with a timezone are always stored in UTC, but then when displayed are converted back to the local timezone
Examples:
- 2022-01-18 14:30:00
- Jan-18-2022 2:30 PM PST

Intervals

Represents a span of time
Generally given by first a number and then by a unit
- Possible units: microsecond, millisecond, second, minute, hour, day, week, month, year, decade, century, millennium
- Abbreviations or plurals of the above also work
Examples:
- 1 day
- 3 century 2 mins
- 45 ms
- 1 mon 87 us
Intervals can be used in calculations with other date/time data types
- Can also be used in comparisons if subtracting timestamps

MORE!

These are just a small sample of some of the most common data types
Postgres supports many more!
- Booleans
- Geometric types
- Network address types
- JSON types
- A full list can be found here
We’ll address others as they come up sporadically through the rest of the semester, but you have the core basics now

Mix and Match

Conversions

There are plenty of instances where you may need to convert between types
- Sometimes calculations (coming soon!) need a certain data type
- Sometimes the information was just not stored in the most ideal type
Can use the CAST function to convert between data types (where it makes sense)
Core syntax is
```
CAST(col_name AS new_data_type)
```
Postgres has a shorthand conversion syntax that uses double colons:
```
col_name::new_data_type
```