Jed Rembold
June 25, 2025
DuckDB can accept BASH globs to expand out a pattern into a list of files
Usually this would utilize a wildcard character
* to represent “some amount of anything”
*.parquet would expand to
represent all of the parquet files in a given folderReading in a bunch of snapshot parquet files into DuckDB is then as easy as:
SELECT *
FROM 's3://your-bucket/*.parquet'This essentially reads in each parquet file to its own table, and
then UNIONs them all together
Sometimes it is useful to know what file each row came from, if
you want that, then you need to use the full
read_parquet function, and add a
filename=true parameter
SELECT *
FROM read_parquet('s3://your-bucket/*.parquet', filename=true)For snapshots, your schema are likely the same, so the union is obvious. But what if your parquet files have slightly different schema?
union_by_name=true optionLAG, partitioning by your row identifier and
ordering by the snapshot timesLEAD to get when the next
version would have started, subtracting a day to get when that rows
version endedOften we need an easy way to compare the contents of many columns
Different approaches, but one common one is to concatenate all the data as text, and then run a hash function on the row
md5() or
hash() hash functionsMight look something like:
SELECT pid, fname, age, hash(fname, age) AS current_hash
FROM |||mytable|||LAG
the hash so that we can compare one rows hash to the previous
prev_hashcurrent_hash = prev_hash)?
Then no change!NULL - this implies
the original entry of that rowLEAD the
snapshot dates to see when the next version’s snapshot began
As a last step, you would usually just select out the columns that you want in your final dimension table (none of the hash stuff)
Technically, you could always find the current, most recent
entries by filtering for where
expiry_date IS NULL
If you want something like an
is_active column, you could add such a
column as a boolean easily with
SELECT pid, fname, age, effective_date, expiry_date,
expiry_date IS NULL AS is_active
FROM |||last steps table|||

psql for Postgres)The primary way to connect to a Mongo database is through a connection string, extremely similar to how it worked with Postgres
The general form of the string is:
mongodb://|||username|||:|||password|||@|||host|||:|||port|||/|||database|||In Python then:
from pymongo import MongoClient
client = MongoClient(
"mongodb://jed:mypswd@localhost:27017"
)
db = client['database']Once you are connected, you can see what databases are available:
print(client.list_database_names())And then you can switch to that database with:
db = client.get_database(|||database name|||)To see the available collections within the database:
print(db.list_collection_names())You can insert documents one at a time:
db.|||collection|||.insert_one(
{"name": "Jed", "teaching": ["DATA510", "DATA599"]}
)Or many at once:
db.|||collection|||.insert_many(
[{"name": "Jed", "teaching": ["DATA510", "DATA599"]},
{"name": "Rachel", "teaching": ["DATA510", "CS549"]}]
)You can call the .find() method on a
collection to grab all of the documents within the collection
Technically, in PyMongo, the .find()
method returns an iterable, so you could loop over it:
for document in db.|||collection|||.find():
print(document)If you just want a list of the documents:
list(db.|||collection|||.find())
If you just want the first document:
db.|||collection|||.find_one()
What if we want to filter or sort the documents / data we get back?
We can utilize a collection of query selectors to the
.find() method to fine-tune what we get
back
General form looks like:
db.|||collection|||.find({|||key|||: |||filter_document|||})
where the |||filter_document||| would
either be a literal value or a document that specifies an specific query
selector
| Name | Description | Example |
|---|---|---|
$eq |
Matches values that are equal | {'$eq': 5} |
$gt |
Matches values that are greater | {'$gt': 8} |
$gte |
Matches values that are greater or eq | {'$gte': 8} |
$lt |
Matches values that are smaller | {'$lt': 3} |
$lte |
Matches values that are smaller or eq | {'$lte': 3} |
$in |
Matches values appearing in the array | {'$in': [1,2]} |
$ne |
Matches values that are not equal | {'$ne': 'Jed'} |
$regex |
Matches values that match the regexp | {'$regex': '^J'} |
Just like we didn’t always want all the columns from a table, you may only want certain fields from the document
You can specify this with a second argument after the filtering
argument in .find()
db.|||collection|||.find({}, |||field_choice_document|||)The document to choose fields is essentially a blacklist / whitelist of the desired fields
To indicate fields you want to keep, list the field name and a 1
db.|||collection|||.find({}, {'name': 1, 'age': 1})To indicate fields you do not want, just provide a 0 instead
db.|||collection|||.find({}, {'birthday': 0, 'parents': 0})
which will get you all of the fields except those indicated
You can never mix 0s and 1s except with the
_id field that is automatically assigned when
a document is inserted
There are also logical operators to combine various operations
| Name | Description |
|---|---|
$and |
Returns documents where both queries match |
$or |
Returns documents where either query matches |
$nor |
Returns documents where both queries fail to match |
$not |
Returns documents where the query does not match |
Except for $not, each has a list of
query operators following it
sources.advde) which will contain a
collection of information related to each stationturnstile_id