Document Aggregations and Interfaces

Jed Rembold

July 2, 2025

Announcements

  • I got some responses back to all of your reflections so far
  • I’m working on feedback for Milestone 2
    • Feedback on the documentation will come next week probably
    • Poll about documentation going out tomorrow
  • I actually have a completed syllabus for you (gasp)!
  • Milestone 3 due next Tuesday
  • Don’t forget your weekly reflections today (if you haven’t already for this week)!

Tonight

  • MongoDB
    • Aggregations
    • Updates
    • Deletions
  • Working with Documents and JSON
    • Airflow’s MongoHook
    • Serialization
    • DuckDB Unpacking
  • Work time

Further MongoDB Operations

Querying SubDocuments

  • We saw last class that when we can query a collection for documents that match various key requirements

  • What if we wanted to filter on the contents of a sub-document though?

  • Consider the following document:

    { 
      "name": "Billy Johnson",
      "year": "senior",
      "classes": [
        {"name": "algebra", "grade": "A", "semester": "Fall"},
        {"name": "english", "grade": "C", "semester": "Spring"},
        {"name": "chemistry", "grade": "F", "semester": "Fall"},
        {"name": "spanish", "grade": "A", "semester": "Spring"}
      ]
    }

Accessing Subdocuments

  • We can specify a subdocument in an array through dot notation

  • For instance:

    db.|||collection|||.find({'classes.name': 'algebra'})

    would find all students who took a class named algebra

  • You can specify location in an array with the same dot notation (0-indexed)

  • Keep in mind that you are still searching at the document level though!

    db.|||collection|||.find({'classes.name': 'algebra',
                      'classes.grade': 'A'})

    would find all documents where a student took a class named algebra and took a class where they got an A. They might not have gotten the A in algebra!

Matching subdocuments with several properties

  • If we want to find all documents where a student got an A in algebra, we need some different tools

  • $elemMatch will allow us to specify that a particular element of an array must meet all of the provided requirements

  • Example:

    db.|||collection|||.find({'classes': {
                        '$elemMatch': {
                          'name': 'algebra', 
                          'grade': 'A'
                        } } })

Selecting only subdocuments

  • It is not difficult to imagine situations where you only care about the sub-documents
  • Could we extract only those from find?
    • Sorta. We could select only the surrounding key ('classes' in our current example)
  • But that really isn’t the same as getting back a list of all the classes students have taken
  • To go further, we need to add another tool: MongoDB’s aggregate

MongoDB’s Aggregate

  • The aggregate function in MongoDB actually does far more than just compute aggregate values

  • Best to think of it as a series of steps or a pipeline in and of itself

    • You give it a series of calculations to compute, and the result of each is used as the input to the next
    • Should be very familiar to those of you coming from a strong tidyverse background
  • Basic syntax will look like:

    db.|||collection|||.aggregate([|||list of operations|||])

Aggregate Stages

  • Each stage of the aggregate pipeline can take a variety of forms. Common forms include:
Stage Name Description Example
"$match" Filters documents {"$match": {"year": "senior"}}
"$project" Include/Exclude fields {"$project": {"name": 1}
"$unwind" Flatten arrays {"$unwind": "$classes"}
"$group" Aggregate by group {"$group": {"_id": "year"}}
"$sort" Order the results {"$sort": {"name":1}}
"$count" Count the documents {"$count": "name"}

Actual Aggregates

  • If using $group, you need to follow a particular syntax:
    • "_id" is the field that you want to be grouping by
      • If you want an aggregate over everything (no grouping) set this to null
    • a comma separated list of field names and their corresponding aggregate calculation
  • Common aggregate functions include:
Name Description Example
"$sum" Sum numeric values or literals {"$sum": 1}
"$avg" Take the average of numeric fields {"$avg": "$fieldname"}
"$min"/"$max" Find the min or max of a field {"$min": "$name"}

What is a field worth?

  • You may have noticed that sometimes we put a $ in front of a field name, and sometimes we don’t!
  • This can be confusing, and early on it is probably worth trying both ways if you are getting some confusing results
  • In general though, it breaks down to:
    • Are you talking about the literal name of the field? Usually for projecting, or creating fields? Then no $ in needed
    • Are you talking about the contents of a field? For filtering or comparisons? Then you need the $ in front
  • The $ basically tells MongoDB that you are looking specifically at the contents OF that field (and the field name itself is otherwise unimportant)

Practice Time!

  • On the sources.advde server you can connect to the demos using the same login info as the metro_source relational database
  • Use the students collection to answer the following questions
    • What is the most popular class taken?
    • What is the average age broken down by letter grade? (The average age of students who got an A vs B vs C, etc?)

Updating Entries

  • Updating entries follows a very similar pattern to SQL. You need:
    • A filter to specify which entries you want to be updating
    • A statement to set the new value
  • You can update one or many documents at once with update_one or update_many
db.collection.update_one(
  {"name": "Amelia Rivera"},
  {"$set": {"age": 17}})

Beyond Set

  • While "$set" will replace the entire field value, you sometimes want more control.
  • There are a variety of other ways you could interact with the value of a field:
    • "$unset" will just remove the field
    • "$inc" will increment a specified numeric field by some amount
    • "$rename" will rename the field itself to something different
    • "$addToSet" will add a value to an array if it is not already there

Deleting Entries

  • The same rules apply for deleting entries as in SQL: you need to specify the filter condition
  • Just like with SQL, I’d highly suggest you test your filter condition in find before accidentally deleting documents you didn’t mean to!
  • The exists both delete_one and delete_many
db.|||collection|||.delete_many(
  {"age": {"$gte": 18}})

Break Time!

Working with Documents and JSON

Filling the Gaps

  • There are a variety of smaller tools to be aware of when it comes to working with document data in our pipelines
  • This next portion is intended to help fill or smooth over some of those little gaps, so that you don’t have to discover them on your own
  • In particular, we want to ensure that we have an understanding of:
    • How to connect to MongoDB from Airflow
    • How to snapshot documents/collections
    • How to reconcile those documents with the relational data you already have

MongoDB from Airflow

  • We have been using the pymongo library to interact with MongoDB
  • This can still work in Airflow, but you’d have to code the connections in yourself
  • Alternatively, you can import MongoHook from airflow.providers.mongo.hooks.mongo
    • This is essentially just a wrapper around pymongo that let’s you connect to a defied MongoDB connection
  • Either library does not come with the stock Airflow image, so you’ll need to add them to your Dockerfile
    • pymongo or
    • apache-airflow-providers-mongo

The Mongo Provider

  • If you go for adding the Mongo providers library, it will also unlock a “mongo” connection type within Airflow
  • You can define your connection just like you could for the other connections we have used
  • Important: Under the “Extra Fields” tab, click on the “Use SSL” option and then click it back off.
    • It needs to be set to false, but defaults to null (although the toggle looks initially off)
    • You can check the “Extra Fields JSON” to enture it is set to false (the other extra values should be null or false)

Hooking the Mongo

  • To use your newly setup connection, make sure you import MongoHook into your DAG

    from airflow.providers.mongo.hooks.mongo import MongoHook
  • Then you can connect to the Airflow connection you have made

    hook = MongoHook(mongo_conn_id=|||your connection name|||)
    client = hook.get_conn()
  • Once you have the client, everything is as it was when we were just using pymongo

    • Get the database you want, then the collection, and start querying away

BSON vs JSON

  • MongoDB technically uses BSON under the hood
    • Stands for Binary JSON
  • This comes with a few perks:
    • Support for some extra types that JSON does not have
    • Supports metadata that improves lookup speeds
    • Field order is preserved
    • Fast, and optimized for quick decoding
  • In the majority of cases, the transition to JSON is seamless, and so we can treat it as such
    • But there are some cases that require some special treatment

Embedded Objects

  • Sometimes a MongoDB document will contain embedded types

    • PyMongo will generally handle converting these to a Python equivalent, but that doesn’t not equate necessarily to a perfect JSON equivalent
  • A prime example is the "_id" field, which has as a corresponding value some ObjectId() object

    • This object will not natively map to JSON without some help
  • The easiest fix is to import json_util from the bson library:

    from bson import json_util
  • This gives you the json_util.dumps function, which will take a bson object and convert it to a legitimate JSON string

Working with JSON

  • Some of you have more or less experience in working with JSON in Python

  • In general, you want to import json, which gives you access to:

    • json.load: reads from an open JSON file to populate a Python dictionary
    • json.dump: writes to an open JSON file to write a Python dictionary to disk
    • json.loads: reads a string of JSON into a Python dictionary
    • json.dumps: writes a Python dictionary to a JSON string
  • If working directly with files, you need to open the file yourself:

    with open(|||filename|||, |||mode|||) as fh:
      data = json.load(fh)
      json.dump(data, fh)

JSON to S3

  • You can place any time of file in S3 storage, and JSON is probably an obvious choice for document snapshots

    • Allows you to keep it in as close to native form as possible
  • Easiest method is to use the Boto3 client

    • Some of you have been using this already, which is fine!
    • I have added another function to the utils.py package to help with this called get_minio_boto_client
  • Usage would look like:

    from utils import get_minio_boto_client
    
    s3_client = get_minio_boto_client(|||your minio connection name|||)

Working with the Boto Client

  • The boto client then gives you methods to directly read or write files from the S3 storage with |||client|||.get_object() and |||client|||.put_object()

  • Need to provide those methods several parameters:

    • Bucket: the name of the bucket you are reading or writing to
    • Key: essentially the path/filename of where you are reading or writing to
    • Body: the text you are writing (or binary data). Only needed for writing
    • ContentType: the mime type of the document (application/json). Optional
  • If reading from the S3 storage, you will then need to parse out the response object:

    resp = client.get_object(|||your info|||)
    content = resp["Body"].read().decode("utf-8")

Enriching with JSON

  • Eventually you will want to be reading the JSON data back out of your data lake to use in populating your data warehouse
  • Some options available here:
    • Read the data into a Python dictionary, and work with it in that format
    • Read the data directly into a tabular format using DuckDB
  • In almost all cases, working with the data inside of DuckDB is going to be massively faster, and also makes it easy to relate to other relational data
    • Requires a little (but not huge) amount of work to handle the nested data

DuckDB Structs

  • Nested data shows up in DuckDB as a struct data type
  • DuckDB has a whole system in place for working with structs
  • Perhaps the single most useful function though is UNNEST, which can also take a recursive := true option
    • This flattens out the nested content doing one of two things:
      • Unnesting an array? You’ll get new rows added to the table for each element in the array
      • Unnesting a map/dictionary? You’ll get the keys added as new columns
SELECT name, age, UNNEST(classes, recursive := true)
FROM 's3://bucket/myjson.json'

Accessing DuckDB values

  • Unnesting can be extremely useful if the main content that you want belongs in one of the sub-documents

  • If you just need to access a bit of information from a sub-document though, you can access it with dot notation:

    SELECT name, age, classes.name
    FROM 's3://bucket/myjson.json
  • Use classic python indexing square brackets to extract enumerated items from arrays

    • This is 1-based to agree with other SQL conventions (and thus different from MongoDB’s 0-based)

Your Time!

For Next Week

  • Milestone 3 is live, and the MongoDB database has been live since Saturday
  • You have a dashboard due on Tuesday!
  • If you haven’t gotten snapshots going yet, I’d recommend making sure those are working before you leave tonight
  • Then you can just worry about the new dimensional modeling and how you will use the document data to enrich the data you already have
  • Keep an eye out for my poll about how useful the past group’s documentation has been to you
// reveal.js plugins // Added plugins