---
title: "Document Aggregations and Interfaces"
author: Jed Rembold
date: July 2, 2025
slideNumber: true
theme: catppuccin
highlightjs-theme: catppuccin-mocha
width: 1920
height: 1080
transition: slide
---


## Announcements
:::{style='font-size:.9em'}
- I got some responses back to all of your reflections so far
- I'm working on feedback for Milestone 2
  - Feedback on the documentation will come next week probably
  - Poll about documentation going out tomorrow
- I actually have a completed syllabus for you (gasp)!
- Milestone 3 due next Tuesday
- Don't forget your weekly reflections today (if you haven't already for this week)!
:::


## Tonight
- MongoDB
  - Aggregations
  - Updates
  - Deletions
- Working with Documents and JSON
  - Airflow's MongoHook
  - Serialization
  - DuckDB Unpacking
- Work time

# Further MongoDB Operations

## Querying SubDocuments
- We saw last class that when we can query a collection for documents that match various key requirements
- What if we wanted to filter on the contents of a sub-document though?
- Consider the following document:
  ```{.json style='font-size:.8em'}
  { 
    "name": "Billy Johnson",
    "year": "senior",
    "classes": [
      {"name": "algebra", "grade": "A", "semester": "Fall"},
      {"name": "english", "grade": "C", "semester": "Spring"},
      {"name": "chemistry", "grade": "F", "semester": "Fall"},
      {"name": "spanish", "grade": "A", "semester": "Spring"}
    ]
  }
  ```

## Accessing Subdocuments
- We can specify a subdocument in an array through dot notation
- For instance:
  ```python
  db.|||collection|||.find({'classes.name': 'algebra'})
  ```
  would find all students who took a class named algebra
- You can specify location in an array with the same dot notation (0-indexed)
- Keep in mind that you are still searching at the _document_ level though!
  ```python
  db.|||collection|||.find({'classes.name': 'algebra',
                    'classes.grade': 'A'})
  ```
  would find all documents where a student took a class named algebra and took a class where they got an A. **They might not have gotten the A in algebra!**

## Matching subdocuments with several properties
- If we want to find all documents where a student got an A in algebra, we need some different tools
- `$elemMatch` will allow us to specify that a particular element of an array must meet all of the provided requirements
- Example:
  ```python
  db.|||collection|||.find({'classes': {
                      '$elemMatch': {
                        'name': 'algebra', 
                        'grade': 'A'
                      } } })
  ```

## Selecting only subdocuments
- It is not difficult to imagine situations where you only care about the sub-documents
- Could we extract only those from find?
  - Sorta. We could select only the surrounding key (`'classes'` in our current example)
- But that really isn't the same as getting back a list of all the classes students have taken
- To go further, we need to add another tool: MongoDB's **aggregate**


## MongoDB's Aggregate
- The `aggregate` function in MongoDB actually does far more than just compute aggregate values
- Best to think of it as a series of steps or a pipeline in and of itself
  - You give it a series of calculations to compute, and the result of each is used as the input to the next
  - Should be very familiar to those of you coming from a strong tidyverse background
- Basic syntax will look like:
  ```python
  db.|||collection|||.aggregate([|||list of operations|||])
  ```

## Aggregate Stages
- Each stage of the aggregate pipeline can take a variety of forms. Common forms include:

:::{style='font-size:.8em'}

| Stage Name   | Description            | Example                          |
|--------------|------------------------|----------------------------------|
| `"$match"`   | Filters documents      | `{"$match": {"year": "senior"}}` |
| `"$project"` | Include/Exclude fields | `{"$project": {"name": 1}`       |
| `"$unwind"`  | Flatten arrays         | `{"$unwind": "$classes"}`        |
| `"$group"`   | Aggregate by group     | `{"$group": {"_id": "year"}}`    |
| `"$sort"`    | Order the results      | `{"$sort": {"name":1}}`          |
| `"$count"`   | Count the documents    | `{"$count": "name"}`             |


:::

## Actual Aggregates
- If using `$group`, you need to follow a particular syntax:
  - `"_id"` is the field that you want to be grouping by
    - If you want an aggregate over _everything_ (no grouping) set this to `null`
  - a comma separated list of field names and their corresponding aggregate calculation
- Common aggregate functions include:

:::{style='font-size:.8em'}
| Name            | Description                        | Example                  |
|-----------------|------------------------------------|--------------------------|
| `"$sum"`        | Sum numeric values or literals     | `{"$sum": 1}`            |
| `"$avg"`        | Take the average of numeric fields | `{"$avg": "$fieldname"}` |
| `"$min"/"$max"` | Find the min or max of a field     | `{"$min": "$name"}`      |
:::

## What is a field worth?
- You may have noticed that sometimes we put a `$` in front of a field name, and sometimes we don't!
- This can be confusing, and early on it is probably worth trying both ways if you are getting some confusing results
- In general though, it breaks down to:
  - Are you talking about the literal name of the field? Usually for projecting, or creating fields? Then no `$` in needed
  - Are you talking about the _contents_ of a field? For filtering or comparisons? Then you need the `$` in front
- The `$` basically tells MongoDB that you are looking specifically at the contents OF that field (and the field name itself is otherwise unimportant)


## Practice Time!
- On the `sources.advde` server you can connect to the `demos` using the same login info as the `metro_source` relational database
- Use the `students` collection to answer the following questions
  - What is the most popular class taken?
  - What is the average age broken down by letter grade? (The average age of students who got an A vs B vs C, etc?)


## Updating Entries
- Updating entries follows a very similar pattern to SQL. You need:
  - A filter to specify which entries you want to be updating
  - A statement to set the new value
- You can update one or many documents at once with `update_one` or `update_many`
```python
db.collection.update_one(
  {"name": "Amelia Rivera"},
  {"$set": {"age": 17}})
```

## Beyond Set
- While `"$set"` will replace the entire field value, you sometimes want more control.
- There are a variety of other ways you could interact with the value of a field:
  - `"$unset"` will just remove the field
  - `"$inc"` will increment a specified numeric field by some amount
  - `"$rename"` will rename the field itself to something different
  - `"$addToSet"` will add a value to an array if it is not already there

## Deleting Entries
- The same rules apply for deleting entries as in SQL: you need to specify the filter condition
- Just like with SQL, I'd **highly** suggest you test your filter condition in `find` before accidentally deleting documents you didn't mean to!
- The exists both `delete_one` and `delete_many`
```python
db.|||collection|||.delete_many(
  {"age": {"$gte": 18}})
```

# Break Time!

# Working with Documents and JSON

## Filling the Gaps
- There are a variety of smaller tools to be aware of when it comes to working with document data in our pipelines
- This next portion is intended to help fill or smooth over some of those little gaps, so that you don't have to discover them on your own
- In particular, we want to ensure that we have an understanding of:
  - How to connect to MongoDB from Airflow
  - How to snapshot documents/collections
  - How to reconcile those documents with the relational data you already have


## MongoDB from Airflow
- We have been using the `pymongo` library to interact with MongoDB
- This can still work in Airflow, but you'd have to code the connections in yourself
- Alternatively, you can import `MongoHook` from `airflow.providers.mongo.hooks.mongo`
  - This is essentially just a wrapper around `pymongo` that let's you connect to a defied MongoDB connection
- Either library does not come with the stock Airflow image, so you'll need to add them to your `Dockerfile`
  - `pymongo` or
  - `apache-airflow-providers-mongo`


## The Mongo Provider
- If you go for adding the Mongo providers library, it will also unlock a "mongo" connection type within Airflow
- You can define your connection just like you could for the other connections we have used
- **Important**: Under the "Extra Fields" tab, click on the "Use SSL" option and then click it back off.
  - It needs to be set to false, but defaults to null (although the toggle looks initially off)
  - You can check the "Extra Fields JSON" to enture it is set to `false` (the other extra values should be `null` or `false`)


## Hooking the Mongo
- To use your newly setup connection, make sure you import `MongoHook` into your DAG
  ```python
  from airflow.providers.mongo.hooks.mongo import MongoHook
  ```
- Then you can connect to the Airflow connection you have made
  ```python
  hook = MongoHook(mongo_conn_id=|||your connection name|||)
  client = hook.get_conn()
  ```
- Once you have the client, everything is as it was when we were just using `pymongo`
  - Get the database you want, then the collection, and start querying away


## BSON vs JSON
- MongoDB technically uses _BSON_ under the hood
  - Stands for _Binary JSON_
- This comes with a few perks:
  - Support for some extra types that JSON does not have
  - Supports metadata that improves lookup speeds
  - Field order is preserved
  - **Fast**, and optimized for quick decoding
- In the majority of cases, the transition to JSON is seamless, and so we can treat it as such
  - But there are some cases that require some special treatment


## Embedded Objects
- Sometimes a MongoDB document will contain embedded types
  - PyMongo will generally handle converting these to a Python equivalent, but that doesn't not equate necessarily to a perfect JSON equivalent
- A prime example is the `"_id"` field, which has as a corresponding value some `ObjectId()` object
  - This object will **not** natively map to JSON without some help
- The easiest fix is to import `json_util` from the `bson` library:
  ```python
  from bson import json_util
  ```
- This gives you the `json_util.dumps` function, which will take a `bson` object and convert it to a legitimate JSON _string_


## Working with JSON
- Some of you have more or less experience in working with JSON in Python
- In general, you want to import `json`, which gives you access to:
  - `json.load`: reads from an open JSON file to populate a Python dictionary
  - `json.dump`: writes to an open JSON file to write a Python dictionary to disk
  - `json.loads`: reads a string of JSON into a Python dictionary
  - `json.dumps`: writes a Python dictionary to a JSON string
- If working directly with files, you need to open the file yourself:
  ```python
  with open(|||filename|||, |||mode|||) as fh:
    data = json.load(fh)
    json.dump(data, fh)
  ```

## JSON to S3
- You can place any time of file in S3 storage, and JSON is probably an obvious choice for document snapshots
  - Allows you to keep it in as close to native form as possible
- Easiest method is to use the Boto3 client
  - Some of you have been using this already, which is fine!
  - I have added another function to the `utils.py` package to help with this called `get_minio_boto_client`
- Usage would look like:
  ```python
  from utils import get_minio_boto_client

  s3_client = get_minio_boto_client(|||your minio connection name|||)
  ```

## Working with the Boto Client
- The boto client then gives you methods to directly read or write files from the S3 storage with `|||client|||.get_object()` and `|||client|||.put_object()`
- Need to provide those methods several parameters:
  - `Bucket`: the name of the bucket you are reading or writing to
  - `Key`: essentially the path/filename of where you are reading or writing to
  - `Body`: the text you are writing (or binary data). Only needed for writing
  - `ContentType`: the mime type of the document (`application/json`). Optional
- If reading from the S3 storage, you will then need to parse out the response object:
  ```python
  resp = client.get_object(|||your info|||)
  content = resp["Body"].read().decode("utf-8")
  ```


## Enriching with JSON
- Eventually you will want to be reading the JSON data back out of your data lake to use in populating your data warehouse
- Some options available here:
  - Read the data into a Python dictionary, and work with it in that format
  - Read the data directly into a tabular format using DuckDB
- In almost all cases, working with the data inside of DuckDB is going to be massively faster, and also makes it easy to relate to other relational data
  - Requires a little (but not huge) amount of work to handle the nested data


## DuckDB Structs
- Nested data shows up in DuckDB as a `struct` data type
- DuckDB has a [whole system](https://duckdb.org/docs/stable/sql/data_types/struct) in place for working with structs
- Perhaps the single most useful function though is `UNNEST`, which can also take a `recursive := true` option
  - This flattens out the nested content doing one of two things:
    - Unnesting an array? You'll get new rows added to the table for each element in the array
    - Unnesting a map/dictionary? You'll get the keys added as new columns

```sql
SELECT name, age, UNNEST(classes, recursive := true)
FROM 's3://bucket/myjson.json'
```

## Accessing DuckDB values
- Unnesting can be extremely useful if the main content that you want belongs in one of the sub-documents
- If you just need to access a bit of information from a sub-document though, you can access it with dot notation:
  ```sql
  SELECT name, age, classes.name
  FROM 's3://bucket/myjson.json
  ```
- Use classic python indexing square brackets to extract enumerated items from arrays
  - This is 1-based to agree with other SQL conventions (and thus different from MongoDB's 0-based)


# Your Time!

## For Next Week
- Milestone 3 is live, and the MongoDB database has been live since Saturday
- You have a dashboard due on Tuesday!
- If you haven't gotten snapshots going yet, I'd recommend making sure those are working before you leave tonight
- Then you can just worry about the new dimensional modeling and how you will use the document data to enrich the data you already have
- Keep an eye out for my poll about how useful the past group's documentation has been to you
