Jed Rembold
June 4, 2025
SQLExecuteQueryOperator, for instance, needs to know what database to run the SQL query against
Choose aws as your connection type (even if you aren’t actually using AWS)
You are going to need your Access Key ID and Secret Access Key. You will need to set those up in MinIO
Additionally, you need to set up some extra fields in the provided JSON:
{
  "endpoint": "minio.advde:9000",
  "region_name": "us-east-1"
}
minio.advde:9001
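The same connection can also be described outside the UI. As a minimal sketch, assuming Airflow 2.3 or later (which accepts JSON-formatted connections in `AIRFLOW_CONN_{CONN_ID}` environment variables), the snippet below builds such a value with the extra fields shown above. The connection id `minio_s3` and the placeholder credentials are hypothetical; substitute the keys you created in MinIO.

```python
import json


def build_connection_json(access_key: str, secret_key: str) -> str:
    """Return a JSON string describing an aws-type connection with the MinIO extras."""
    connection = {
        "conn_type": "aws",
        "login": access_key,     # Access Key ID
        "password": secret_key,  # Secret Access Key
        "extra": {
            "endpoint": "minio.advde:9000",
            "region_name": "us-east-1",
        },
    }
    return json.dumps(connection)


# Hypothetical placeholder credentials, not real keys
conn_json = build_connection_json("YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY")

# Airflow would read this value from an environment variable with this name
print(f"AIRFLOW_CONN_MINIO_S3='{conn_json}'")
```

Defining the connection this way keeps the settings in one place, though the UI route described above works just as well.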
One thing that isn’t saved in the S3 connection is which bucket you want to connect to
Advisable to add the bucket name as a Variable in Airflow
Admin → Variables and then “Add Variable”
You can access it within your Python tasks using Variable.get({name})
from airflow.models import Variable
bucket_name = Variable.get("s3_bucket")
SQLExecuteQueryOperator, for instance, just takes a conn_id field
In your own Python code, use BaseHook.get_connection({your conn_id}), and then unpack it from there
PythonOperators were forming the bulk of tasks previously
We know that we can define a DAG using the DAG() function
Often this is also used with a context manager (a with block), to remove the need to assign the output to a variable
with DAG(...) as dag:
    """Define tasks and dependencies"""
In TaskFlow syntax, we can define a function and make it a DAG using the @dag decorator
@dag(...)
def my_dag():
    """tasks and dependencies"""
If what we want is a PythonOperator type task, we can just define a Python function and add the @task decorator
@task
def clean_table():
    """Python code to read and clean a table"""
The task id is automatically assigned to the name of the function
The DAG is assigned to whatever DAG ends up calling this function
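To see how a decorator can pick up an id from the function name, here is a toy stand-in written in plain Python. It only illustrates the mechanism and is not Airflow's implementation; `toy_task` and `REGISTRY` are hypothetical names invented for this sketch.

```python
from typing import Callable, Dict

# Maps an id (taken from the function name) to the function itself
REGISTRY: Dict[str, Callable] = {}


def toy_task(func: Callable) -> Callable:
    """Register func under its own name, mimicking a default task id."""
    REGISTRY[func.__name__] = func
    return func


@toy_task
def clean_table():
    return "cleaned"


# The decorated function was registered under the name "clean_table"
print(list(REGISTRY))
```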
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract(**kwargs):
    data = [1, 2, 3]
    # Push the data to XCom explicitly so downstream tasks can pull it
    kwargs['ti'].xcom_push(key='data', value=data)

def transform(**kwargs):
    data = kwargs['ti'].xcom_pull(key='data', task_ids='extract')
    transformed = [x * 10 for x in data]
    kwargs['ti'].xcom_push(key='transformed', value=transformed)

def load(**kwargs):
    transformed = kwargs['ti'].xcom_pull(key='transformed', task_ids='transform')
    print(f"Loaded data: {transformed}")

# schedule replaces the older schedule_interval argument (Airflow 2.4+)
with DAG('classic_xcom_dag', start_date=datetime(2023, 1, 1), schedule='@daily', catchup=False) as dag:
    # provide_context is no longer needed: the context is passed to **kwargs automatically
    t1 = PythonOperator(task_id='extract', python_callable=extract)
    t2 = PythonOperator(task_id='transform', python_callable=transform)
    t3 = PythonOperator(task_id='load', python_callable=load)
    t1 >> t2 >> t3
from airflow.sdk import dag, task
from datetime import datetime

# schedule replaces the older schedule_interval argument
@dag(start_date=datetime(2023, 1, 1), schedule='@daily', catchup=False)
def taskflow_dag():
    @task
    def extract():
        return [1, 2, 3]

    @task
    def transform(data):
        return [x * 10 for x in data]

    @task
    def load(transformed):
        print(f"Loaded data: {transformed}")

    # Return values are passed between tasks through XCom automatically
    data = extract()
    transformed = transform(data)
    load(transformed)

dag_instance = taskflow_dag()
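Because TaskFlow tasks take ordinary arguments and return ordinary values, the task logic can be checked without a running scheduler. The sketch below assumes the task bodies are kept as plain functions (the names `extract_values`, `transform_values`, and `load_values` are hypothetical) that would then receive the @task decorator inside the DAG function.

```python
def extract_values():
    return [1, 2, 3]


def transform_values(data):
    return [x * 10 for x in data]


def load_values(transformed):
    message = f"Loaded data: {transformed}"
    print(message)
    return message


# Chained the same way as in the DAG body, but run immediately as plain Python
result = load_values(transform_values(extract_values()))
```

Here the calls execute right away. Inside a DAG function the same chaining only declares the dependencies, and Airflow runs the tasks later.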
You can still use the >> syntax (or .set_upstream()) to create other dependencies if not every task feeds into the next
The TaskFlow syntax also makes it simple to add parameters to the DAGs themselves
@dag(...)
def my_dag(name: str = 'Jed'):
    """defined tasks"""
When you run a DAG manually, you can set these parameters to whatever you want, which might be useful for testing
When the DAG is run on a schedule, it will always use the default value, so you must set one (assuming you’ll be running the DAG on a schedule)
Within a @task function, you can call get_current_context() from airflow.operators.python, which gets you a dictionary of all the available variables

| Variable Name | Type | Meaning / Use Case |
|---|---|---|
| ds | str | The logical execution date in YYYY-MM-DD format. Used for templating filenames, partitioned data, etc. |
| ds_nodash | str | Same as ds but without dashes. Often used in file names. |
| logical_date | datetime | Python datetime object of the logical run date. Preferred when you need a datetime type. |
| data_interval_start | datetime | Start of the data interval (inclusive). Use this for windowed data processing. |
| data_interval_end | datetime | End of the data interval (exclusive). Pairs with data_interval_start. |
| prev_execution_date | datetime or None | Logical date of the previous run, if available. Useful for delta-based logic. |
| next_execution_date | datetime or None | Logical date of the next run (planned). |
| dag_run.conf | dict | Dictionary of parameters passed when manually triggering a DAG. |
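As a small illustration of how these values can be used, the sketch below works on a plain dictionary shaped like the one get_current_context() returns. The helper names and the file layout are assumptions made for the example; only the keys come from the table above.

```python
from datetime import datetime, timedelta, timezone


def partition_key(context: dict, prefix: str = "raw") -> str:
    """Build a date-partitioned object key from ds and ds_nodash."""
    return f"{prefix}/{context['ds']}/data_{context['ds_nodash']}.csv"


def in_interval(timestamp: datetime, context: dict) -> bool:
    """True if timestamp falls in [data_interval_start, data_interval_end)."""
    return context["data_interval_start"] <= timestamp < context["data_interval_end"]


# Hand-built stand-in for the context of a single daily run
start = datetime(2023, 1, 1, tzinfo=timezone.utc)
fake_context = {
    "logical_date": start,
    "ds": start.strftime("%Y-%m-%d"),
    "ds_nodash": start.strftime("%Y%m%d"),
    "data_interval_start": start,
    "data_interval_end": start + timedelta(days=1),
}

print(partition_key(fake_context))
# The start of the interval is included, the end is excluded
print(in_interval(start, fake_context))
print(in_interval(start + timedelta(days=1), fake_context))
```

In a real task the dictionary would come from get_current_context() rather than being built by hand.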
Given their different use cases, we should not be surprised at differences in how the data is organized!
The university collects the following data:
POSTGRES_USER
POSTGRES_PASSWORD
POSTGRES_DB