And So It Begins

Jed Rembold

January 15, 2025

Announcements

  • Welcome to DATA 503: Fundamentals of Data Engineering!
  • Things to do / be aware of:
    • Access the course webpage here
      • Also linked to from Canvas page
    • Read over the full syllabus
    • Procure the book if you haven’t already
    • Join the class Discord channel, to facilitate remote questions, answers, and general communication
      • Invite link in announcement on Canvas
    • Homework will be channeled through GitHub Classroom, so you’ll need to get a GitHub account if you don’t already have one
  • First homework assignment is up!

Who Am I?

Name:
Jed Rembold
Background:
PhD in Physics with specialization in Astrophysics
  • Astronomy is essentially a data science field these days
  • Have been teaching CS/DS now for past 5 years
Office:
Ford 214
Hours:
MW 2:00-4:00, TTh 3:00-4:30, or anytime online or by appointment!
Email:
jjrembold@willamette.edu

Syllabus Stuff

Grade Breakdown

  • The grade breakdown in the class is fairly straightforward
Category Weight
Homework 55%
Project 30%
Midterm Video 15%

Homework

  • Homework due the night before class at midnight
  • Will be submitted through GitHub Classroom
  • Problems will commonly require a mix of written and coded pieces:
    • Both will have either template files that you can add to and re-upload, or clear instructions as to what filename you should save your code or resulting data as.
  • 72 cumulative, unpenalized hours allotted over the semester for late submissions
    • Late work past that point will drop by 20% credit each day
    • Talk to me if something major comes up, and we can make a plan

Project

  • I continue to grow the project in this course, and that trend will continue this semester
  • Plan to incorporate:
    • Data acquisition
    • Data pipelines (ETL)
    • Database modeling
    • Data Serving
  • Projects are linked to partners, though certain parts will be more or less independent
  • At the end of the semester, you and your partner will take the culmination of your efforts and then make a presentation on it

Midterm Video

  • I have done away with exams in this course
    • Oral exams were most effective and great practice for technical interviews, but I can’t do them with 45 students
  • Instead you will prepare a midterm video
    • There will be a list of skills and techniques to showcase
    • You choose a data set that will allow you to demonstrate those skills, and then prepare a video showcasing those skills and explaining what you are doing
    • Videos should be 10-15 minutes.
    • Happens in lieu of homework, so you have a week to work on it. Can practice as much as you want, split it across multiple takes, etc.

Expectations

  • Most of you will not end up being data engineers
  • Trying to teach you “enough to be dangerous”
    • If you end up in a data scientist position, you will almost assuredly need to interface and work with your engineer
    • If you end up in charge of data at a smaller enterprise, your responsibilities may well entail both the work of a data scientist and the work of a data engineer
    • It is always useful in the modern era to understand how to store information

What even is Data Engineering?

Varied opinions!

Data engineering is a set of operations aimed at creating interfaces and mechanisms for the flow and access of information. It takes dedicated specialists—data engineers— to maintain data so that it remains available and usable by others. In short, data engineers set up and operate the organization’s data infrastructure, preparing it for further analysis by data analysts and scientists

“Data Engineering and Its Main Concepts” by AlexSoft

The first type of data engineering is SQL-focused. The work and primary storage of the data is in relational databases. All of the data processing is done with SQL or a SQL-based language. Sometimes, this data processing is done with an ETL tool. The second type of data engineering is Big Data–focused. The work and primary storage of the data is in Big Data technologies like Hadoop, Cassandra, and HBase. All of the data processing is done in Big Data frameworks like MapReduce, Spark, and Flink. While SQL is used, the primary processing is done with programming languages like Java, Scala, and Python.

“The Two Types of Data Engineering” Jesse Anderson

In relation to previously existing roles, the data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so-called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and in computation at scale.

“The Rise of the Data Engineer” Maxime Beauchemin

Data engineering is all about the movement, manipulation, and management of data.

“What is Data Engineering?” Lewis Gavin

Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning.

“Fundamentals of Data Engineering” Joe Reiss & Matt Housley

The Data Engineer’s Role

  • Data Engineers are responsible for ensuring that the right people can find and access the data they need as efficiently as possible
    • Who are the right people? Usually Data Scientists / Analysts
  • Data Engineers are focused on the task of data collection, organization, and preparation
  • The tasks of the Data Scientists are impossible without the Data Engineer!

The Five V’s

  • Data Engineers need to contend with multiple facets of data

    Volume:
    the quantity of data points
    Variety:
    the type and nature of the data (text, images, video, audio)
    Velocity:
    the speed at which new data is generated and processed
    Veracity:
    how trustworthy the sources are
    Value:
    how actionable the data is

The Data Software Landscape 2020

The Data Software Landscape 2024

Sampling the Landscape

  • Clearly, I neither know all of those nor could teach them in a semester
  • My goal for you in this class is three-fold:
    • Understand the big picture
      • Where do many of these pieces of software fit in a pipeline or day-to-day work?
    • Understand what some options are in common use-case categories
      • What is out there for databases?
      • What is out there for data ingestion?
    • Gain experience using a few representative pieces of software

Alternative Storage Structures

Structured Semi-Structured Unstructured
Ability to search and organize data? Easy Moderate Difficult
Format of Data? Tables of numbers and some text Tables, lists of numbers and text Images, videos, detailed text
Where is it stored? Relational databases NoSQL, XML, JSON Data Lakes

Languages to Know?

SQL
The most common interface for databases and data lakes
Python
The bridge language between data engineering and data science
JVM languages (Java or Scala)
Prevalent for common open source Apache projects like Spark and Hive
Bash
The default text interface for Linux operating systems

The Data Engineering Lifecycle

  • We are going to be largely concerned with the area between generation and final usage
  • This three-step process is commonly given the name ETL, standing for Extract, Transform, Load

When to Organize

  • There tend to be two different paradigms for when to run ETL operations
  • Batch processing
    • ETL operations are run on a particular interval, and all the most recent data is processed at once as a single “batch”
    • Can take a while to run, depending on the size of the batch
  • Stream processing
    • ETL operations are run on the data as it is generated, in real-time
    • Speed depends on how fast new data is being generated

Group Activity

  • Use LinkedIn and perform a job search on “Data Engineer” jobs in a US metro area of your choice. Identify 3 returned job listings per group member to investigate in detail (so a group of 3 should look at 9 total postings). Across your 3 listings, identify the following:
    • Goals/objectives/purpose/responsibilities of the position
    • Qualifications required / preferred
    • Skills desired
  • Spend 15 minutes on independent investigation and then 15 minutes on group synthesis
    • Produce a concise group-wide bullet-point list summarizing a “competitive applicant” in the field of Data Engineering based on the above 3 dimensions
    • We’ll summarize at the end what people found

Break

Dinner Break!

  • Take 30 minutes to relax, grab a bite, and stretch your legs!

Why Relational Databases?

How to store data?

  • There are many different ways information can be stored, with varying trade-offs!
  • Suppose you wanted to keep track of your friend’s birthdays
First Name Last Name Birthday
Frank Stein 4/2/2000
Tessa Loch 8/23/2003
Bobby Wolf 12/14/2005

Options!

As the previous slide may have suggested, you could store the information in some sort of table or comma separate values file:

First Name Last Name Birthday
Frank Stein 4/2/2000
Tessa Loch 8/23/2003
Bobby Wolf 12/14/2005
First Name,Last Name,Birthday
Frank,Stein,4/2/2000
Tessa,Loch,8/23/2003
Bobby,Wolf,12/14/2005

Alternatively, you might use some other form of common data structure like JSON to store the information:

[
  {"First Name": "Frank",
   "Last Name": "Stein",
   "Birthday": "4/2/2000"},
  {"First Name": "Tessa",
   "Last Name": "Loch",
   "Birthday": "8/23/2003"},
  {"First Name": "Bobby",
   "Last Name": "Wolf",
   "Birthday": "12/14/2005"}
]

The Plot Thickens…

  • Suppose now you’d also like to keep track of what courses they are currently taking, and what times those courses are held
  • Suppose each is taking 2-3 classes, some of which overlap
  • This significantly complicates both example storage methods!

Example Table Storage

  • We can’t store tables inside of tables, so we usually need to duplicate information over multiple rows:
First Name Last Name Birthday Class Class Day Class Time
Frank Stein 4/2/2000 CS151 MWF 1:00pm
Frank Stein 4/2/2000 MATH256 MWF 9:00am
Tessa Loch 8/23/2003 CS151 MWF 1:00pm
Tessa Loch 8/23/2003 IDS236 TTh 1:00pm
Tessa Loch 8/23/2003 HIST123 MWF 12:00pm
Bobby Wolf 12/14/2005 IDS236 TTh 1:00pm
Bobby Wolf 12/14/2005 MATH256 MWF 9:00am
  • Duplication is generally bad!

Example JSON Storage

  • We still have duplication issues with JSON as well.
[
  {"First Name": "Frank", "Last Name": "Stein", "Birthday": "4/2/2000",
   "Classes": [
     {"class": "CS151", "day": "MWF", "time":"1:00pm"},
     {"class": "MATH256", "day": "MWF", "time":"9:00am"}
    ]},
  {"First Name": "Tessa", "Last Name": "Loch", "Birthday": "8/23/2003",
   "Classes": [
     {"class": "CS151", "day": "MWF", "time":"1:00pm"},
     {"class": "IDS236", "day": "TTh", "time":"1:00pm"},
     {"class": "HIST123", "day": "MWF", "time":"12:00pm"}
    ]},
  {"First Name": "Bobby", "Last Name": "Wolf", "Birthday": "12/14/2005",
   "Classes": [
     {"class": "IDS236", "day": "TTh", "time":"1:00pm"},
     {"class": "MATH256", "day": "MWF", "time":"9:00am"}
    ]},
]

Relational Tables

  • One solution to this issue is realizing that we are trying to actually keep track of two things: friends and classes.
  • So we break things up into two tables, and then create relationships between them
  • This is the core of what occurs in a relational database!
First Name Last Name Birthday
Frank Stein 4/2/2000
Tessa Loch 8/23/2003
Bobby Wolf 12/14/2005
Class Day Time
CS151 MWF 1:00pm
MATH256 MWF 9:00am
IDS236 TTh 1:00pm
HIST123 MWF 12:00pm
PHYS221 MWF 10:00pm

Viewing Relational Databases

  • In general, you’d probably use a third table to represent all the linkages
  • Unique id keys are used to connect the different tables
Relational Database Schema

Relational Databases and SQL

  • SQL is a language that allows you to define and query relational databases
  • These days it is pronounced “ESS-CUE-ELL”, though you may still hear it occasionally referred to in an older form of “SEQUEL”
  • Does not technically stand for “Structured Query Language”!
    • Is not structured in a programming sense
    • Does far more than just query
    • Is not Turing complete in a language sense
  • Comes in several variants, though the core standards are governed by ANSI and ISO, so none stray too far from the standards
    • We’ll be focusing on a variant called PostgreSQL this semester

SQL Basics

SQL Servers

  • PostgreSQL (Postgres for short) operates on a server model, where clients contact the server and ask it to manipulate or query a particular database
    • Multiple databases can exist on the server at a time
  • Works well for large distributions, but you can also just run a local server on your computer, where you are the only client that can connect
  • Several ways you can interact with the server once it is running
    • From a terminal prompt (Ch 18)
    • Using pgAdmin as detailed in the text
    • Using an alternative database client like Beekeeper Studio (Community Edition)
      • This is my absolute favorite SQL client at the moment, and I’d highly recommend it. It is what I’ll showcase much of the semester

Adding a New Database

  • SQL has commands to help with the administration of the database, as well as creating, manipulating, and querying specific tables within the database

  • New installs will come with a database already existing (called postgres), but it is a good idea to create a new one and leave the default untouched.

  • Can run SQL commands either by opening the terminal or running the command in a query.

    • The commands look the same regardless
  • To create a new database, the syntax is:

    CREATE DATABASE |||name of database|||;

Adding a new Table

  • Creating tables is one of the more fundamental actions you may need to take with a database
  • Need to specify several things upon creation
    • The name of the table
    • The names of the columns and associated data types
CREATE TABLE |||table name||| (
    |||column_name₁||| |||type₁|||,
    |||column_name₂||| |||type₂|||,
    etc...
);

A Note on Syntax

  • SQL in general requires no special formatting in the form of capitalization or tabbing
  • Following some conventions can help make your commands easier to read though:
    • Use uppercase for SQL keywords
    • Use lowercase and underscores for table or column names
    • Indent clauses and blocks of code for ease of readability
  • A semicolon does indicate the end of a command though
  • Text and dates will need single quotes, numbers do not

Adding Values to a Table

  • Your tables are initially empty!
  • You add data to the table by inserting new values into the columns
  • New rows are concatenated to the end of the table
  • Need to include an entry for each column (rows can’t be unequal in length)
INSERT INTO |||table name||| (|||column₁|||, |||column₂|||)
VALUES ('row1a', 'row2a'),
       ('row1b', 'row2b'),
       ('row1c', 'row2c');

Asking Questions

Query Time

  • The prime use of SQL for individuals not in charge of managing a database is using it to query the database for information

  • Queries return only a subset of information from the database, and don’t alter the database in any way

  • The keyword that begins any query is SELECT, followed by:

    • The column name(s) you would like information from, and
    • The table you want to get the information from
  • Basic syntax then could look something like:

    SELECT |||column₁|||, |||column₂||| FROM |||table name|||;
  • An asterisk *, can stand in for the column names to stand for “all the columns”

    SELECT * FROM |||table name|||;

You are Distinct

  • In many instances, a particular column may have duplicate entries
    • Each row should be unique, but if looking at just a single column, there very well could be duplicates
  • The DISTINCT keyword allows you to only display the unique values that appear in a column
    • Appears before the names of the desired columns
    SELECT DISTINCT |||column||| FROM |||table name|||;
  • Can be applied to multiple columns at a time
    • Then interpreted as “For each X in the table, what are all the Y values?”
    SELECT DISTINCT |||column₁|||, |||column₂|||
    FROM |||table name|||;

Reversing Entropy

  • Sometimes patterns can be more evident if some ordering is applied to the query results
    • The ordering only applies to the query, the original data is untouched
  • If you do not specify an ordering, you have no guarantee about what order results will come out in!
  • SQL provides the ORDER BY keyword to indicate a column to use for ordering
  • Can use ASC (the default) or DESC to specify the direction of ordering
  • Can also order by multiple columns (separate by comma)
SELECT |||column₁|||, |||column₂|||
FROM |||table name|||
ORDER BY |||column₂||| DESC;

Ordering Basics

  • When your database was created, Postgres assigned it a collation method based on your systems locale
    • You can see the provided collation method for a connected database using: SHOW lc_collate;
  • This collation method is what determines what characters “come before” other characters
    • On my system, the collation method looks like: en_US.utf8
  • If your system is using a different collation method, then your sorting might come out slightly different!
    • In particular, be aware of this on MacOS
    • You can choose a different collation method by providing one after the ORDER BY statement
    SELECT |||column|||
    FROM |||table name|||
    ORDER BY |||column||| DESC COLLATE C

Filtration

  • Especially in tables with thousands to millions of rows, you likely don’t want all of the information pertaining to a column

  • Instead it makes sense to filter or only return rows that meet certain criteria

  • SQL’s WHERE keyword provides exactly this functionality:

    SELECT |||column||| FROM |||table name|||
    WHERE |||some condition|||;
  • Each condition is some sort of comparison check, which could be summarized as a true / false question

Comparison Operations

Operator Function Example
= Equal to WHERE colname = 5
<> or != Not equal to WHERE colname != 'fish'
> Greater than WHERE colname > 0
< Less than WHERE colname < 100
>= Greater than or equal to WHERE colname >= 0
<= Less than or equal to WHERE colname <= 1000
BETWEEN Within a range WHERE colname BETWEEN 50 and 100
IN Matches one of a set WHERE colname IN ('red', 'blue')
LIKE or ILIKE Match a pattern WHERE colname LIKE '%ed'
NOT Negates a condition WHERE colname NOT LIKE '%ed'

Pattern Matching

  • LIKE and ILIKE differ only in whether capitalization matters in the match
    • LIKE is capitalization sensitive, ILIKE is not
  • Each can use several special characters in the desired pattern
    • % is a wildcard matching one or more of any character
    • _ is a wildcard matching just a single character
    • If you want to actually match off one of these characters, you need to “escape” it with a backslash (\_ or \%)

Conditional Combinations

  • Often, you might need to filter on several or more complicated conditions
  • SQL also has the boolean AND and OR operators that you can use to stitch together multiple conditions
  • Order of operations is left to right, so surround terms in parentheses if you need them evaluated earlier
SELECT *
FROM |||table name|||
WHERE |||column₁||| LIKE 'F%' AND
      (|||column₂||| > 50 OR |||column₃||| <= 10);

Reaching Your Limit

  • Often times, especially if sorting, you may not even be interested in all the results
  • Maybe you just want the first few, or even just the first result
  • You can control the maximum number of rows returned using LIMIT
  • Comes last, after all the other keywords
SELECT * FROM |||table_name|||
WHERE |||some_condition|||
ORDER BY |||some_column|||
LIMIT 5;

Example

Introduce yourself to a neighbor and then work together to construct a query to solve the following situation:

  • Suppose you are particularly short and thus can only access cereals on the first two shelves. You’d like to grab the cereal with “Cheerios” somewhere in its name that has the highest amount of sugar per serving. Construct a query that could help you figure out what cereal you should be grabbing!
// reveal.js plugins // Added plugins