Jed Rembold
January 15, 2025

| Category | Weight |
|---|---|
| Homework | 55% |
| Project | 30% |
| Midterm Video | 15% |
Data engineering is a set of operations aimed at creating interfaces and mechanisms for the flow and access of information. It takes dedicated specialists—data engineers— to maintain data so that it remains available and usable by others. In short, data engineers set up and operate the organization’s data infrastructure, preparing it for further analysis by data analysts and scientists
“Data Engineering and Its Main Concepts” by AlexSoft
The first type of data engineering is SQL-focused. The work and primary storage of the data is in relational databases. All of the data processing is done with SQL or a SQL-based language. Sometimes, this data processing is done with an ETL tool. The second type of data engineering is Big Data–focused. The work and primary storage of the data is in Big Data technologies like Hadoop, Cassandra, and HBase. All of the data processing is done in Big Data frameworks like MapReduce, Spark, and Flink. While SQL is used, the primary processing is done with programming languages like Java, Scala, and Python.
“The Two Types of Data Engineering” Jesse Anderson
In relation to previously existing roles, the data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so-called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and in computation at scale.
“The Rise of the Data Engineer” Maxime Beauchemin
Data engineering is all about the movement, manipulation, and management of data.
“What is Data Engineering?” Lewis Gavin
Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning.
“Fundamentals of Data Engineering” Joe Reiss & Matt Housley

Data Engineers need to contend with multiple facets of data


| Structured | Semi-Structured | Unstructured | |
|---|---|---|---|
| Ability to search and organize data? | Easy | Moderate | Difficult |
| Format of Data? | Tables of numbers and some text | Tables, lists of numbers and text | Images, videos, detailed text |
| Where is it stored? | Relational databases | NoSQL, XML, JSON | Data Lakes |

| First Name | Last Name | Birthday |
|---|---|---|
| Frank | Stein | 4/2/2000 |
| Tessa | Loch | 8/23/2003 |
| Bobby | Wolf | 12/14/2005 |
As the previous slide may have suggested, you could store the information in some sort of table or comma separate values file:
| First Name | Last Name | Birthday |
|---|---|---|
| Frank | Stein | 4/2/2000 |
| Tessa | Loch | 8/23/2003 |
| Bobby | Wolf | 12/14/2005 |
First Name,Last Name,Birthday
Frank,Stein,4/2/2000
Tessa,Loch,8/23/2003
Bobby,Wolf,12/14/2005
Alternatively, you might use some other form of common data structure like JSON to store the information:
[
{"First Name": "Frank",
"Last Name": "Stein",
"Birthday": "4/2/2000"},
{"First Name": "Tessa",
"Last Name": "Loch",
"Birthday": "8/23/2003"},
{"First Name": "Bobby",
"Last Name": "Wolf",
"Birthday": "12/14/2005"}
]
| First Name | Last Name | Birthday | Class | Class Day | Class Time |
|---|---|---|---|---|---|
| Frank | Stein | 4/2/2000 | CS151 | MWF | 1:00pm |
| Frank | Stein | 4/2/2000 | MATH256 | MWF | 9:00am |
| Tessa | Loch | 8/23/2003 | CS151 | MWF | 1:00pm |
| Tessa | Loch | 8/23/2003 | IDS236 | TTh | 1:00pm |
| Tessa | Loch | 8/23/2003 | HIST123 | MWF | 12:00pm |
| Bobby | Wolf | 12/14/2005 | IDS236 | TTh | 1:00pm |
| Bobby | Wolf | 12/14/2005 | MATH256 | MWF | 9:00am |
[
{"First Name": "Frank", "Last Name": "Stein", "Birthday": "4/2/2000",
"Classes": [
{"class": "CS151", "day": "MWF", "time":"1:00pm"},
{"class": "MATH256", "day": "MWF", "time":"9:00am"}
]},
{"First Name": "Tessa", "Last Name": "Loch", "Birthday": "8/23/2003",
"Classes": [
{"class": "CS151", "day": "MWF", "time":"1:00pm"},
{"class": "IDS236", "day": "TTh", "time":"1:00pm"},
{"class": "HIST123", "day": "MWF", "time":"12:00pm"}
]},
{"First Name": "Bobby", "Last Name": "Wolf", "Birthday": "12/14/2005",
"Classes": [
{"class": "IDS236", "day": "TTh", "time":"1:00pm"},
{"class": "MATH256", "day": "MWF", "time":"9:00am"}
]},
]
| First Name | Last Name | Birthday |
|---|---|---|
| Frank | Stein | 4/2/2000 |
| Tessa | Loch | 8/23/2003 |
| Bobby | Wolf | 12/14/2005 |
| Class | Day | Time |
|---|---|---|
| CS151 | MWF | 1:00pm |
| MATH256 | MWF | 9:00am |
| IDS236 | TTh | 1:00pm |
| HIST123 | MWF | 12:00pm |
| PHYS221 | MWF | 10:00pm |
SQL has commands to help with the administration of the database, as well as creating, manipulating, and querying specific tables within the database
New installs will come with a database already existing (called
postgres), but it is a good idea to create a
new one and leave the default untouched.
Can run SQL commands either by opening the terminal or running the command in a query.
To create a new database, the syntax is:
CREATE DATABASE |||name of database|||;CREATE TABLE |||table name||| (
|||column_name₁||| |||type₁|||,
|||column_name₂||| |||type₂|||,
etc...
);
INSERT INTO |||table name||| (|||column₁|||, |||column₂|||)
VALUES ('row1a', 'row2a'),
('row1b', 'row2b'),
('row1c', 'row2c');
The prime use of SQL for individuals not in charge of managing a database is using it to query the database for information
Queries return only a subset of information from the database, and don’t alter the database in any way
The keyword that begins any query is
SELECT, followed by:
Basic syntax then could look something like:
SELECT |||column₁|||, |||column₂||| FROM |||table name|||;An asterisk *, can stand in for the
column names to stand for “all the columns”
SELECT * FROM |||table name|||;DISTINCT keyword allows you to only
display the unique values that appear in a column
SELECT DISTINCT |||column||| FROM |||table name|||;SELECT DISTINCT |||column₁|||, |||column₂|||
FROM |||table name|||;ORDER BY keyword to
indicate a column to use for orderingASC (the default) or
DESC to specify the direction of
orderingSELECT |||column₁|||, |||column₂|||
FROM |||table name|||
ORDER BY |||column₂||| DESC;
SHOW lc_collate;en_US.utf8ORDER BY statementSELECT |||column|||
FROM |||table name|||
ORDER BY |||column||| DESC COLLATE CEspecially in tables with thousands to millions of rows, you likely don’t want all of the information pertaining to a column
Instead it makes sense to filter or only return rows that meet certain criteria
SQL’s WHERE keyword provides exactly
this functionality:
SELECT |||column||| FROM |||table name|||
WHERE |||some condition|||;Each condition is some sort of comparison check, which could be summarized as a true / false question
| Operator | Function | Example |
|---|---|---|
= |
Equal to | WHERE colname = 5 |
<> or
!= |
Not equal to | WHERE colname != 'fish' |
> |
Greater than | WHERE colname > 0 |
< |
Less than | WHERE colname < 100 |
>= |
Greater than or equal to | WHERE colname >= 0 |
<= |
Less than or equal to | WHERE colname <= 1000 |
BETWEEN |
Within a range | WHERE colname BETWEEN 50 and 100 |
IN |
Matches one of a set | WHERE colname IN ('red', 'blue') |
LIKE or
ILIKE |
Match a pattern | WHERE colname LIKE '%ed' |
NOT |
Negates a condition | WHERE colname NOT LIKE '%ed' |
LIKE and
ILIKE differ only in whether capitalization
matters in the match
LIKE is capitalization sensitive,
ILIKE is not% is a wildcard matching one or more of
any character_ is a wildcard matching just a single
character\_ or
\%)AND and
OR operators that you can use to stitch
together multiple conditionsSELECT *
FROM |||table name|||
WHERE |||column₁||| LIKE 'F%' AND
(|||column₂||| > 50 OR |||column₃||| <= 10);
LIMITSELECT * FROM |||table_name|||
WHERE |||some_condition|||
ORDER BY |||some_column|||
LIMIT 5;
Introduce yourself to a neighbor and then work together to construct a query to solve the following situation: