Jed Rembold
May 26/27, 2026
DuckDB is an analytic database engine
(column oriented)DuckDB is commonly used within other scripting engines, but it can also operate standalone as a CLI tool
Install the CLI tool by following the quick guide here if you desire. It is already installed on your VM
Add Python bindings so that you can use it within Python
pip install duckdb --upgradeAdd R libraries so that you can use it within R
install.packages("duckdb").duckdb suffix), then the state of the
database is maintained, and reconnecting to it later
will show the same contentsDuckDB has excellent parquet support!
Query directly:
SELECT * FROM 'my_data.parquet'
WHERE state = 'OR';
Write existing data:
COPY my_table TO 'my_table.parquet' (FORMAT 'parquet');
import duckdb
con = duckdb.connect()
URI = 'postgresql://uname:pw@host:5432/db'
con.execute(f"ATTACH '{URI}' AS pg (TYPE postgres)")
con.sql("SELECT * FROM pg.my_table")
DuckDB can also read (and write) structure data directly from S3 buckets
Need to define a “secret”
CREATE OR REPLACE SECRET (
TYPE s3,
PROVIDER config,
KEY_ID 'your access key',
SECRET 'your secret key',
ENDPOINT 'host:port',
REGION 'us-east-1',
URL_STYLE 'path',
USE_SSL false
)Then can query from it:
SELECT * FROM 's3://bucket_name/filename.parquet'

dag decorator:from airflow.sdk import dag
import pendulum
@dag(
start_date = pendulum.datetime(2026, 5, 26, tz='PDT'),
schedule = "@daily",
catchup = False,
tags=['demo']
)
def my_first_dag():
|||tasks and ordering|||
my_first_dag()
| Property | Description |
|---|---|
dag_id |
This is provided by the name of the function. This is what will show up in the Airflow GUI. |
schedule |
Could be a full CRON expression or a preset like
@daily |
start_date |
The date at which the DAG should begin executing on its schedule. Note that DAGS run at the end of their interval after the start date |
catchup |
Defaults to False. Indicates whether to
retroactively run missed DAG runs |
The decorator for tasks is much simpler, though it can have optional parameters
Tasks are defined as sub-functions within the function that is the DAG
from airflow.sdk import dag, task
import pendulum
@dag(...)
def my_first_dag():
@task()
def clean_data():
|||Python code to read and clean a table|||So far we have just defined some tasks in a DAG, but we haven’t specified how they are connected!
Within the DAG function, you need to call the task functions
You specify ordering with >>. So if
you had two tasks:
@dag(...)
def my_dag():
@task
def task1():
||| Something happens |||
@task
def task2():
||| Something else happens |||
task1() >> task2()task1
should be utilized in task2@dag(...)
def my_dag():
@task
def task1():
return |||something|||
@task
def task2(x):
||| Do something with x|||
out = task1()
task2(out)
@task function**context. Or import from
airflow.sdk and use
get_current_context()@dag(...)
def my_dag():
@task
def task1(**context):
date_str = context['ds']
|||Use the date string|||
@dag(...)
def my_dag():
@task
def task1():
context = get_current_context()
date_str = context['ds']
|||Do something with the date string|||
| Variable Name | Type | Meaning / Use Case |
|---|---|---|
ds |
str |
The logical execution date in YYYY-MM-DD
format. Used for templating filenames, partitioned data, etc. |
ds_nodash |
str |
Same as ds but without dashes. Often
used in file names. |
logical_date |
datetime |
Python datetime object of the logical
run date. Preferred when you need a datetime
type. |
data_interval_start |
datetime |
Start of the data interval (inclusive). Use this for windowed data processing. |
data_interval_end |
datetime |
End of the data interval (exclusive). Pairs with
data_interval_start. |
prev_execution_date |
datetime or None |
Logical date of the previous run, if available. Useful for delta-based logic. |
next_execution_date |
datetime or None |
Logical date of the next run (planned). |
dag_run.conf |
dict |
Dictionary of parameters passed when manually triggering a DAG. |
Choose aws as your connection type
(even if you aren’t actually using AWS)
You are going to need your Access Key ID and Secret Access Key. Those have been provided to each group and should be saved somewhere in your documentation
Additionally, you need to set up some extra fields in the provided JSON:
{
"endpoint_url": "http://lake.advde:9000",
"region_name": "us-east-1",
"addressing_style": "path"
}lake.advde:9001
One thing that isn’t saved in the S3 connection is which bucket you want to connect to
Advisable to add the bucket name as a Variable in Airflow
Admin → Variables and then “Add Variable”
You can access within your Python tasks using
Variable.get({name})
from airflow.sdk import Variable
bucket_name = Variable.get("s3_bucket")pd.read_sql()
PostgresHookATTACH
methodThe easiest method here is to use the
PostgresHook function
from airflow.providers.postgres.hooks.postgres import PostgresHook
hook = PostgresHook(postgres_conn_id=|||your conn name|||)Then can use it to retrieve a dataframe from a query:
df = hook.get_df(
sql = "SELECT * FROM mytable",
df_type="pandas"
)There is not a built-in hook for DuckDB, so we need to extract the parameters from the connection into something we can use, namely a URI string
I provide the utility function
get_postgres_uri_from_conn that you can use
to do this
from utils import get_postgres_uri_from_conn
uri = get_postgres_uri_from_conn(|||your conn name|||)Then you use that to attach the Postgres database as we saw earlier, at which point you can run SQL on it as you see fit.
ObjectStoragePathRecent versions have introduced the
ObjectStoragePath object to attempt to
streamline the process of taking S3 credentials from a connection and
using them directly in your tasks
Create the object, providing the bucket and target file, as well as the connection id
from airflow.io.path import ObjectStoragePath
target = ObjectStoragePath(
"s3://bucket_name/filename.parquet",
conn_id=|||your conn name|||
)Reading parquet into a dataframe:
with target.open('rb') as f:
df = pd.read_parquet(f)Writing parquet from a dataframe
with target.open("wb") as f:
df.to_parquet(f)ObjectStoragePath isn’t quite there
with full DuckDB support (it only supports read operations)
Instead, we’ll create and run the S3 secret directly
To create this from a given connection id, you can import
gen_duckdb_s3_secret_from_conn from the
utils library I supply
con = duckdb.connect()
s3_secret = gen_duckdb_s3_secret_from_conn(|||conn name|||)
con.execute(s3_secret)Then we can read in data:
con.execute(
"CREATE TABLE my_tab AS SELECT * FROM 's3://bucket_name/filename'"
)Or write data
con.execute(
f"COPY my_tab TO 's3://bucket_name/filename' (FORMAT parquet)"
)airflow,
warehouse, and
documentationdocker-compose.yml file within itairflow for
both the user and password, which is fine