Jed Rembold
July 2, 2025
We saw last class that when we can query a collection for documents that match various key requirements
What if we wanted to filter on the contents of a sub-document though?
Consider the following document:
{
"name": "Billy Johnson",
"year": "senior",
"classes": [
{"name": "algebra", "grade": "A", "semester": "Fall"},
{"name": "english", "grade": "C", "semester": "Spring"},
{"name": "chemistry", "grade": "F", "semester": "Fall"},
{"name": "spanish", "grade": "A", "semester": "Spring"}
]
}We can specify a subdocument in an array through dot notation
For instance:
db.|||collection|||.find({'classes.name': 'algebra'})
would find all students who took a class named algebra
You can specify location in an array with the same dot notation (0-indexed)
Keep in mind that you are still searching at the document level though!
db.|||collection|||.find({'classes.name': 'algebra',
'classes.grade': 'A'})
would find all documents where a student took a class named algebra and took a class where they got an A. They might not have gotten the A in algebra!
If we want to find all documents where a student got an A in algebra, we need some different tools
$elemMatch will allow us to specify
that a particular element of an array must meet all of the provided
requirements
Example:
db.|||collection|||.find({'classes': {
'$elemMatch': {
'name': 'algebra',
'grade': 'A'
} } })'classes' in our current example)The aggregate function in MongoDB
actually does far more than just compute aggregate values
Best to think of it as a series of steps or a pipeline in and of itself
Basic syntax will look like:
db.|||collection|||.aggregate([|||list of operations|||])| Stage Name | Description | Example |
|---|---|---|
"$match" |
Filters documents | {"$match": {"year": "senior"}} |
"$project" |
Include/Exclude fields | {"$project": {"name": 1} |
"$unwind" |
Flatten arrays | {"$unwind": "$classes"} |
"$group" |
Aggregate by group | {"$group": {"_id": "year"}} |
"$sort" |
Order the results | {"$sort": {"name":1}} |
"$count" |
Count the documents | {"$count": "name"} |
$group, you need to follow a
particular syntax:
"_id" is the field that you want to be
grouping by
null| Name | Description | Example |
|---|---|---|
"$sum" |
Sum numeric values or literals | {"$sum": 1} |
"$avg" |
Take the average of numeric fields | {"$avg": "$fieldname"} |
"$min"/"$max" |
Find the min or max of a field | {"$min": "$name"} |
$ in front of a field name, and sometimes we
don’t!$ in
needed$ in front$ basically tells MongoDB that you
are looking specifically at the contents OF that field (and the field
name itself is otherwise unimportant)sources.advde server you can
connect to the demos using the same login
info as the metro_source relational
databasestudents collection to answer
the following questions
update_one or
update_manydb.collection.update_one(
{"name": "Amelia Rivera"},
{"$set": {"age": 17}})
"$set" will replace the entire
field value, you sometimes want more control."$unset" will just remove the field"$inc" will increment a specified
numeric field by some amount"$rename" will rename the field itself
to something different"$addToSet" will add a value to an array
if it is not already therefind before
accidentally deleting documents you didn’t mean to!delete_one and
delete_manydb.|||collection|||.delete_many(
{"age": {"$gte": 18}})
pymongo library
to interact with MongoDBMongoHook
from airflow.providers.mongo.hooks.mongo
pymongo that let’s you connect to a defied
MongoDB connectionDockerfile
pymongo orapache-airflow-providers-mongofalse (the other extra values should be
null or
false)To use your newly setup connection, make sure you import
MongoHook into your DAG
from airflow.providers.mongo.hooks.mongo import MongoHookThen you can connect to the Airflow connection you have made
hook = MongoHook(mongo_conn_id=|||your connection name|||)
client = hook.get_conn()Once you have the client, everything is as it was when we were
just using pymongo
Sometimes a MongoDB document will contain embedded types
A prime example is the "_id" field,
which has as a corresponding value some
ObjectId() object
The easiest fix is to import
json_util from the
bson library:
from bson import json_utilThis gives you the json_util.dumps
function, which will take a bson object and
convert it to a legitimate JSON string
Some of you have more or less experience in working with JSON in Python
In general, you want to import json,
which gives you access to:
json.load: reads from an open JSON file
to populate a Python dictionaryjson.dump: writes to an open JSON file
to write a Python dictionary to diskjson.loads: reads a string of JSON into
a Python dictionaryjson.dumps: writes a Python dictionary
to a JSON stringIf working directly with files, you need to open the file yourself:
with open(|||filename|||, |||mode|||) as fh:
data = json.load(fh)
json.dump(data, fh)You can place any time of file in S3 storage, and JSON is probably an obvious choice for document snapshots
Easiest method is to use the Boto3 client
utils.py package to help with this called
get_minio_boto_clientUsage would look like:
from utils import get_minio_boto_client
s3_client = get_minio_boto_client(|||your minio connection name|||)The boto client then gives you methods to directly read or write
files from the S3 storage with
|||client|||.get_object() and
|||client|||.put_object()
Need to provide those methods several parameters:
Bucket: the name of the bucket you are
reading or writing toKey: essentially the path/filename of
where you are reading or writing toBody: the text you are writing (or
binary data). Only needed for writingContentType: the mime type of the
document (application/json). OptionalIf reading from the S3 storage, you will then need to parse out the response object:
resp = client.get_object(|||your info|||)
content = resp["Body"].read().decode("utf-8")struct data typeUNNEST, which can also take a
recursive := true option
SELECT name, age, UNNEST(classes, recursive := true)
FROM 's3://bucket/myjson.json'
Unnesting can be extremely useful if the main content that you want belongs in one of the sub-documents
If you just need to access a bit of information from a sub-document though, you can access it with dot notation:
SELECT name, age, classes.name
FROM 's3://bucket/myjson.jsonUse classic python indexing square brackets to extract enumerated items from arrays