Document Aggregations and Interfaces

Jed Rembold

July 2, 2025

Announcements

I got some responses back to all of your reflections so far
I’m working on feedback for Milestone 2
- Feedback on the documentation will come next week probably
- Poll about documentation going out tomorrow
I actually have a completed syllabus for you (gasp)!
Milestone 3 due next Tuesday
Don’t forget your weekly reflections today (if you haven’t already for this week)!

Tonight

MongoDB
- Aggregations
- Updates
- Deletions
Working with Documents and JSON
- Airflow’s MongoHook
- Serialization
- DuckDB Unpacking
Work time

Further MongoDB Operations

Querying SubDocuments

We saw last class that when we can query a collection for documents that match various key requirements
What if we wanted to filter on the contents of a sub-document though?

Consider the following document:

{ 
  "name": "Billy Johnson",
  "year": "senior",
  "classes": [
    {"name": "algebra", "grade": "A", "semester": "Fall"},
    {"name": "english", "grade": "C", "semester": "Spring"},
    {"name": "chemistry", "grade": "F", "semester": "Fall"},
    {"name": "spanish", "grade": "A", "semester": "Spring"}
  ]
}

Accessing Subdocuments

We can specify a subdocument in an array through dot notation
For instance:
```
db.|||collection|||.find({'classes.name': 'algebra'})
```
would find all students who took a class named algebra
You can specify location in an array with the same dot notation (0-indexed)
Keep in mind that you are still searching at the document level though!
```
db.|||collection|||.find({'classes.name': 'algebra',
                  'classes.grade': 'A'})
```
would find all documents where a student took a class named algebra and took a class where they got an A. They might not have gotten the A in algebra!

Matching subdocuments with several properties

If we want to find all documents where a student got an A in algebra, we need some different tools
$elemMatch will allow us to specify that a particular element of an array must meet all of the provided requirements

Example:

db.|||collection|||.find({'classes': {
                    '$elemMatch': {
                      'name': 'algebra', 
                      'grade': 'A'
                    } } })

Selecting only subdocuments

It is not difficult to imagine situations where you only care about the sub-documents
Could we extract only those from find?
- Sorta. We could select only the surrounding key ('classes' in our current example)
But that really isn’t the same as getting back a list of all the classes students have taken
To go further, we need to add another tool: MongoDB’s aggregate

MongoDB’s Aggregate

The aggregate function in MongoDB actually does far more than just compute aggregate values
Best to think of it as a series of steps or a pipeline in and of itself
- You give it a series of calculations to compute, and the result of each is used as the input to the next
- Should be very familiar to those of you coming from a strong tidyverse background

Basic syntax will look like:

db.|||collection|||.aggregate([|||list of operations|||])

Aggregate Stages

Each stage of the aggregate pipeline can take a variety of forms. Common forms include:

Stage Name	Description	Example
`"$match"`	Filters documents	`{"$match": {"year": "senior"}}`
`"$project"`	Include/Exclude fields	`{"$project": {"name": 1}`
`"$unwind"`	Flatten arrays	`{"$unwind": "$classes"}`
`"$group"`	Aggregate by group	`{"$group": {"_id": "year"}}`
`"$sort"`	Order the results	`{"$sort": {"name":1}}`
`"$count"`	Count the documents	`{"$count": "name"}`

Actual Aggregates

If using $group, you need to follow a particular syntax:
- "_id" is the field that you want to be grouping by
  - If you want an aggregate over everything (no grouping) set this to null
- a comma separated list of field names and their corresponding aggregate calculation
Common aggregate functions include:

Name	Description	Example
`"$sum"`	Sum numeric values or literals	`{"$sum": 1}`
`"$avg"`	Take the average of numeric fields	`{"$avg": "$fieldname"}`
`"$min"/"$max"`	Find the min or max of a field	`{"$min": "$name"}`

What is a field worth?

You may have noticed that sometimes we put a $ in front of a field name, and sometimes we don’t!
This can be confusing, and early on it is probably worth trying both ways if you are getting some confusing results
In general though, it breaks down to:
- Are you talking about the literal name of the field? Usually for projecting, or creating fields? Then no $ in needed
- Are you talking about the contents of a field? For filtering or comparisons? Then you need the $ in front
The $ basically tells MongoDB that you are looking specifically at the contents OF that field (and the field name itself is otherwise unimportant)

Practice Time!

On the sources.advde server you can connect to the demos using the same login info as the metro_source relational database
Use the students collection to answer the following questions
- What is the most popular class taken?
- What is the average age broken down by letter grade? (The average age of students who got an A vs B vs C, etc?)

Updating Entries

Updating entries follows a very similar pattern to SQL. You need:
- A filter to specify which entries you want to be updating
- A statement to set the new value
You can update one or many documents at once with update_one or update_many

db.collection.update_one(
  {"name": "Amelia Rivera"},
  {"$set": {"age": 17}})

Beyond Set

While "$set" will replace the entire field value, you sometimes want more control.
There are a variety of other ways you could interact with the value of a field:
- "$unset" will just remove the field
- "$inc" will increment a specified numeric field by some amount
- "$rename" will rename the field itself to something different
- "$addToSet" will add a value to an array if it is not already there

Deleting Entries

The same rules apply for deleting entries as in SQL: you need to specify the filter condition
Just like with SQL, I’d highly suggest you test your filter condition in find before accidentally deleting documents you didn’t mean to!
The exists both delete_one and delete_many

db.|||collection|||.delete_many(
  {"age": {"$gte": 18}})

Break Time!

Working with Documents and JSON

Filling the Gaps

There are a variety of smaller tools to be aware of when it comes to working with document data in our pipelines
This next portion is intended to help fill or smooth over some of those little gaps, so that you don’t have to discover them on your own
In particular, we want to ensure that we have an understanding of:
- How to connect to MongoDB from Airflow
- How to snapshot documents/collections
- How to reconcile those documents with the relational data you already have

MongoDB from Airflow

We have been using the pymongo library to interact with MongoDB
This can still work in Airflow, but you’d have to code the connections in yourself
Alternatively, you can import MongoHook from airflow.providers.mongo.hooks.mongo
- This is essentially just a wrapper around pymongo that let’s you connect to a defied MongoDB connection
Either library does not come with the stock Airflow image, so you’ll need to add them to your Dockerfile
- pymongo or
- apache-airflow-providers-mongo

The Mongo Provider

If you go for adding the Mongo providers library, it will also unlock a “mongo” connection type within Airflow
You can define your connection just like you could for the other connections we have used
Important: Under the “Extra Fields” tab, click on the “Use SSL” option and then click it back off.
- It needs to be set to false, but defaults to null (although the toggle looks initially off)
- You can check the “Extra Fields JSON” to enture it is set to false (the other extra values should be null or false)

Hooking the Mongo

To use your newly setup connection, make sure you import MongoHook into your DAG
```
from airflow.providers.mongo.hooks.mongo import MongoHook
```

Then you can connect to the Airflow connection you have made

hook = MongoHook(mongo_conn_id=|||your connection name|||)
client = hook.get_conn()

Once you have the client, everything is as it was when we were just using pymongo
- Get the database you want, then the collection, and start querying away

BSON vs JSON

MongoDB technically uses BSON under the hood
- Stands for Binary JSON
This comes with a few perks:
- Support for some extra types that JSON does not have
- Supports metadata that improves lookup speeds
- Field order is preserved
- Fast, and optimized for quick decoding
In the majority of cases, the transition to JSON is seamless, and so we can treat it as such
- But there are some cases that require some special treatment

Embedded Objects

Sometimes a MongoDB document will contain embedded types
- PyMongo will generally handle converting these to a Python equivalent, but that doesn’t not equate necessarily to a perfect JSON equivalent
A prime example is the "_id" field, which has as a corresponding value some ObjectId() object
- This object will not natively map to JSON without some help
The easiest fix is to import json_util from the bson library:
```
from bson import json_util
```
This gives you the json_util.dumps function, which will take a bson object and convert it to a legitimate JSON string

Working with JSON

Some of you have more or less experience in working with JSON in Python
In general, you want to import json, which gives you access to:
- json.load: reads from an open JSON file to populate a Python dictionary
- json.dump: writes to an open JSON file to write a Python dictionary to disk
- json.loads: reads a string of JSON into a Python dictionary
- json.dumps: writes a Python dictionary to a JSON string

If working directly with files, you need to open the file yourself:

with open(|||filename|||, |||mode|||) as fh:
  data = json.load(fh)
  json.dump(data, fh)

JSON to S3

You can place any time of file in S3 storage, and JSON is probably an obvious choice for document snapshots
- Allows you to keep it in as close to native form as possible
Easiest method is to use the Boto3 client
- Some of you have been using this already, which is fine!
- I have added another function to the utils.py package to help with this called get_minio_boto_client

Usage would look like:

from utils import get_minio_boto_client

s3_client = get_minio_boto_client(|||your minio connection name|||)

Working with the Boto Client

The boto client then gives you methods to directly read or write files from the S3 storage with |||client|||.get_object() and |||client|||.put_object()
Need to provide those methods several parameters:
- Bucket: the name of the bucket you are reading or writing to
- Key: essentially the path/filename of where you are reading or writing to
- Body: the text you are writing (or binary data). Only needed for writing
- ContentType: the mime type of the document (application/json). Optional

If reading from the S3 storage, you will then need to parse out the response object:

resp = client.get_object(|||your info|||)
content = resp["Body"].read().decode("utf-8")

Enriching with JSON

Eventually you will want to be reading the JSON data back out of your data lake to use in populating your data warehouse
Some options available here:
- Read the data into a Python dictionary, and work with it in that format
- Read the data directly into a tabular format using DuckDB
In almost all cases, working with the data inside of DuckDB is going to be massively faster, and also makes it easy to relate to other relational data
- Requires a little (but not huge) amount of work to handle the nested data

DuckDB Structs

Nested data shows up in DuckDB as a struct data type
DuckDB has a whole system in place for working with structs
Perhaps the single most useful function though is UNNEST, which can also take a recursive := true option
- This flattens out the nested content doing one of two things:
  - Unnesting an array? You’ll get new rows added to the table for each element in the array
  - Unnesting a map/dictionary? You’ll get the keys added as new columns

SELECT name, age, UNNEST(classes, recursive := true)
FROM 's3://bucket/myjson.json'

Accessing DuckDB values

Unnesting can be extremely useful if the main content that you want belongs in one of the sub-documents
If you just need to access a bit of information from a sub-document though, you can access it with dot notation:
```
SELECT name, age, classes.name
FROM 's3://bucket/myjson.json
```
Use classic python indexing square brackets to extract enumerated items from arrays
- This is 1-based to agree with other SQL conventions (and thus different from MongoDB’s 0-based)

Your Time!

For Next Week

Milestone 3 is live, and the MongoDB database has been live since Saturday
You have a dashboard due on Tuesday!
If you haven’t gotten snapshots going yet, I’d recommend making sure those are working before you leave tonight
Then you can just worry about the new dimensional modeling and how you will use the document data to enrich the data you already have
Keep an eye out for my poll about how useful the past group’s documentation has been to you