---
title: "The Shortest Path"
author: Jed Rembold
date: July 16, 2025
slideNumber: true
theme: catppuccin
highlightjs-theme: catppuccin-mocha
width: 1920
height: 1080
transition: slide
---


## Announcements
:::{style='font-size:.9em'}
- My only goal over the next two days is getting you feedback
- Documentation poll link announced earlier today
- Milestone 4 due Tuesday night
  - Many of you are already going to have a busy Tuesday evening, so plan accordingly
  - Transfer happening next Wednesday afternoon, so stuff needs to be completely done by noon
  - Don't forget documentation!
- Don't forget your weekly reflections today (if you haven't already for this week)!
:::


## Tonight
- Determining Shortest Paths
  - Neo4j
  - NetworkX
- Airflow Organization
  - Triggering other dags
  - Assets
- Work time


# Determining the Shortest Path(s)

## Connecting Paths
- One thing that graph databases facilitate is looking at the connecting paths between nodes
  - This can be quite complicated in other databases, requiring explicit looping or huge numbers of joins
- Looking at the various ways you can get from one node to another is a common problem in mathematical graph theory, and comes up frequently in real world applications
  - E.g. Route finding between two locations on a map, supply chain logistics, power grid optimization, neural connectivity
- These problems generally involve two sorts of questions:
  - Is the graph fully connected? Could you get to any node starting from any other node?
  - What are the optimal routes between different nodes?


## Visualized

![](../images/graph_paths.svg)


## Shortest Path Algorithms
- There are many shortest path algorithms
  - Which one you use likely only matters for the largest of graphs
- Most important is that almost any software that models data as a graph will provide _some_ methods for you to compute shortest distance paths
- The requirement will be that you need the data in a graph form
  - This has some implications for handling snapshots, which are likely no longer in a graph form


## Option 1: Shortest Paths within Neo4j
- One option is to pre-compute all possible shortest paths between each node pair and snapshot that as well
  - Likely only works reasonably for smaller graph networks, since the number of possible node pairs grows with the square of the number of nodes
- The benefit is that the computation happens on the server where the Neo4j database lives, which may well have more compute available than wherever your Airflow worker runs


## Simple Shortest Paths: Neo4j
- Cypher actually has syntax for extracting the shortest path by connection between two points:
  ```cypher
  MATCH p = SHORTEST 1 (A:Loc)-[:LINK*]-(B:Loc)
  WHERE A.name = 'Start' AND B.name = 'End'
  RETURN [n in nodes(p) | n.name] AS stops
  ```
- Instead of `SHORTEST 1` to get just a single path, you can use `ALL SHORTEST` to get all shortest paths, in the case of ties
- If you just need to query a single pair of nodes, this is all you need!


## The Neo4j Graph Data Science Library
- This syntax though is missing two potentially important requirements:
  - Weighting the paths by some attribute
  - Easily computing all possible pairs
- For this purpose, Neo4j has what it calls the Graph Data Science library, or GDS
- The GDS gives access to lower-level, parallelized versions of common graph algorithms
- You need to enable it in your Docker Compose file
  - Add the environment variable: `NEO4J_PLUGINS: ["graph-data-science"]`


## Using the GDS
- The usual GDS workflow looks something like:
  - Project an in-memory graph from the Neo4j database
  - Choose and run the desired algorithm in one of several modes:
    - `stream` just lets you retrieve the results as a query result
    - `mutate` updates the in-memory graph with the output of the algorithm
    - `write` updates the Neo4j database with the output of the algorithm
- In-memory graphs are only cleaned up automatically _when the Neo4j server is reset_
  - If you are going to use this route, **please** clean up after yourself

## The Projection
- There are a few ways you can project a graph from a Neo4j database, but the easiest is what is called the _native projection_:
  ```cypher
  CALL gds.graph.project(
    graphName,
    nodeProj,
    relationProj,
    { relationshipProperties: ["weight_property"]}
  )
  ```

## The Calculation
- Once you have projected the graph, you can compute all the distances
  ```cypher
  CALL gds.allShortestPaths.stream(
    graphName,
    { relationshipWeightProperty: "weight_property"}
  ) YIELD sourceNodeId, targetNodeId, distance
  RETURN
    gds.util.asNode(sourceNodeId).name AS source_name,
    gds.util.asNode(targetNodeId).name AS target_name,
    distance
  ```
- **Note that this only gets you the distances, not the paths!**

## Finishing up
- To get the actual paths, you'd have to loop over the pairs and query
  - This is unavoidably cumbersome, and why I'll suggest another path if that is what you want
    - A shorter path, if you will...
- Regardless, to remove the in-memory graph you created:
  ```cypher
  CALL gds.graph.drop(graphName)
  ```

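## Sketch: The GDS Workflow from Python

A minimal sketch of the project, compute, and drop steps wrapped so that the in-memory graph is always removed, even if the calculation fails. It assumes `session` is an open session from the `neo4j` Python driver and that the yielded columns are named `sourceNodeId` and `targetNodeId`, as in recent GDS releases. The function name and its arguments are hypothetical.

```python
def all_pairs_distances(session, graph_name, label, rel_type, weight_prop):
    """Project a graph, stream all-pairs distances, then drop the projection."""
    # Step 1: project the in-memory graph (native projection)
    session.run(
        "CALL gds.graph.project($name, $label, $rel, "
        "{relationshipProperties: [$weight]})",
        {"name": graph_name, "label": label,
         "rel": rel_type, "weight": weight_prop},
    ).consume()
    try:
        # Step 2: stream the weighted distance for every node pair
        result = session.run(
            "CALL gds.allShortestPaths.stream($name, "
            "{relationshipWeightProperty: $weight}) "
            "YIELD sourceNodeId, targetNodeId, distance "
            "RETURN gds.util.asNode(sourceNodeId).name AS source_name, "
            "gds.util.asNode(targetNodeId).name AS target_name, distance",
            {"name": graph_name, "weight": weight_prop},
        )
        return [
            (row["source_name"], row["target_name"], row["distance"])
            for row in result
        ]
    finally:
        # Step 3: clean up after yourself, whether or not step 2 succeeded
        session.run("CALL gds.graph.drop($name)", {"name": graph_name}).consume()
```
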
## NetworkX
- If you are trying to get the shortest path after you have taken data out of the graph database, you are going to need to recreate the graph
- The NetworkX library in Python is all about setting this up
- Import as `import networkx as nx`
  - You may need to add it to your Airflow Dockerfile


## Graph Types
- NetworkX maintains a few types of graphs
- Undirected graphs have no "arrows" pointing from node to node
  - The connection from node A to node B is the same as from node B to node A
  - Created with `G = nx.Graph()`{.python}
- Directed graphs have directionality from node to node
  - This is technically how Neo4j represents everything: it may or may not matter to your model
  - Created with `G = nx.DiGraph()`{.python}

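## Sketch: Directed vs. Undirected

A small sketch of how the two graph types differ in practice. The node labels `"A"` and `"B"` are made up for illustration.

```python
import networkx as nx

undirected = nx.Graph()
undirected.add_edge("A", "B")

directed = nx.DiGraph()
directed.add_edge("A", "B")

# In an undirected graph the edge can be traversed either way
print(undirected.has_edge("B", "A"))    # True
# In a directed graph only the A -> B direction exists
print(directed.has_edge("B", "A"))      # False
print(nx.has_path(directed, "B", "A"))  # False
```

If direction does not matter to your model, the undirected form is usually the simpler choice for path finding.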

## Populating NetworkX Graphs
- You can add nodes individually:
  ```python
  G.add_node(|||node label|||, |||property keywords and values|||)
  ```
- You can add edges:
  ```python
  G.add_edge(|||start node label|||, |||end node label|||, |||properties|||)
  ```
  - Any nodes that do not already exist will be created automatically
- You can import from a pandas edgelist
  ```python
  G = nx.from_pandas_edgelist(
    |||df|||, |||source col|||, |||target col|||, |||list of property columns|||
  )
  ```
  - Makes an undirected graph by default; pass `create_using=nx.DiGraph` for a directed one

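## Sketch: Building a Small Graph

A minimal sketch of the three approaches above, assuming `pandas` is installed alongside NetworkX. The labels, the `kind` and `minutes` properties, and the column names are all invented for illustration.

```python
import networkx as nx
import pandas as pd

# Build by hand: one explicit node, then edges
G = nx.Graph()
G.add_node("A", kind="station")
G.add_edge("A", "B", minutes=4)
G.add_edge("B", "C", minutes=7)  # node "C" is created automatically

# Build from a DataFrame with one row per edge
edges = pd.DataFrame({
    "start": ["A", "B"],
    "end": ["B", "C"],
    "minutes": [4, 7],
})
H = nx.from_pandas_edgelist(
    edges,
    source="start",
    target="end",
    edge_attr=["minutes"],
    create_using=nx.DiGraph,  # omit for the default undirected graph
)
```
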
## Accessing Nodes and Edges
- Once you have created the graph, you can access various nodes and edges fairly easily
- Getting nodes (by label)
  ```python
  des_node = G.nodes[|||node label|||]
  ```
- Getting an edge (by adjacent nodes)
  ```python
  des_edge = G[|||node 1|||][|||node 2|||]
  ```
- You can loop over all the possible nodes or edges using `G.nodes()` or `G.edges()`

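## Sketch: Reading Nodes and Edges

A short sketch of the access patterns above on a made-up three-node graph. Passing `data=True` when looping is an addition here: it returns each attribute dictionary alongside the labels.

```python
import networkx as nx

G = nx.Graph()
G.add_node("A", kind="station")
G.add_edge("A", "B", minutes=4)
G.add_edge("B", "C", minutes=7)

print(G.nodes["A"])            # attribute dict for node A
print(G["A"]["B"])             # attribute dict for the A-B edge
print(G["A"]["B"]["minutes"])  # a single edge property

# Loop over everything, pulling the attributes along
for label, attrs in G.nodes(data=True):
    print(label, attrs)
for u, v, attrs in G.edges(data=True):
    print(u, v, attrs["minutes"])
```
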
## Finding Shortest NetworkX Paths
- You can compute the shortest path between two nodes:
  ```python
  nx.shortest_path(
    |||your graph|||, 
    source=|||start label|||, 
    target=|||end label|||)
  ```
  - You can also add a `weight` keyword that references an edge property
- You can also compute them all at once:
  ```python
  nx.all_pairs_shortest_path(|||your graph|||)
  ```
  - This does **not** support a `weight` keyword. Use `all_pairs_dijkstra_path` instead for that

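## Sketch: Hops vs. Weights

A small sketch showing why the `weight` keyword matters. In this made-up graph the direct A-C edge is the fewest hops, but going through B is quicker. The `minutes` property name is an assumption.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("A", "B", minutes=1)
G.add_edge("B", "C", minutes=1)
G.add_edge("A", "C", minutes=5)

# Without a weight, every edge counts as one hop
fewest_hops = nx.shortest_path(G, source="A", target="C")
print(fewest_hops)  # ['A', 'C']

# With a weight, the edge property is summed along the path
quickest = nx.shortest_path(G, source="A", target="C", weight="minutes")
print(quickest)     # ['A', 'B', 'C']

# All pairs at once, weighted: a dict of {source: {target: path}}
all_quickest = dict(nx.all_pairs_dijkstra_path(G, weight="minutes"))
print(all_quickest["C"]["A"])  # ['C', 'B', 'A']
```
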
## Actual Lengths
- The previous functions all gave you the sequence of nodes from start to finish
- You could loop over them to extract the distances and add them yourself for the full length
- Alternatively, you can use `shortest_path_length`
  - `all_pairs_shortest_path_length` and `all_pairs_dijkstra_path_length` exist as well

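## Sketch: Path Lengths

A minimal sketch comparing the manual sum with the built-in length functions, again on a made-up graph with an assumed `minutes` edge property.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("A", "B", minutes=3)
G.add_edge("B", "C", minutes=4)
G.add_edge("A", "C", minutes=10)

# Manual approach: walk consecutive node pairs and add up the property
path = nx.shortest_path(G, "A", "C", weight="minutes")
manual_total = sum(G[u][v]["minutes"] for u, v in zip(path, path[1:]))

# Built-in approach: ask for the length directly
direct_total = nx.shortest_path_length(G, "A", "C", weight="minutes")
hops = nx.shortest_path_length(G, "A", "C")  # unweighted: counts edges

# All pairs at once: a dict of {source: {target: length}}
all_lengths = dict(nx.all_pairs_dijkstra_path_length(G, weight="minutes"))

print(manual_total, direct_total, hops)  # 7 7 1
```

The manual sum is mostly useful when you want a different property than the one used to choose the path.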

## An Aside: GIS Shenanigans
- You have WKT strings in the Neo4j DB containing exact routes between stations
- If you want to do something with them in Python, some functions are useful
  ```python
  from shapely.wkt import loads
  from shapely.ops import linemerge
  ```
- The first takes a WKT _string_ and parses it into a format that the shapely library understands
- The second can take a list of LineString objects and concatenate them into a single LineString

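## Sketch: Merging Route Segments

A small sketch of the two functions working together, assuming `shapely` is installed. The WKT strings here are toy coordinates, not real station routes.

```python
from shapely.ops import linemerge
from shapely.wkt import loads

# Two segments that share an endpoint at (1, 1)
wkt_segments = [
    "LINESTRING (0 0, 1 1)",
    "LINESTRING (1 1, 2 0)",
]

# Parse each WKT string into a shapely LineString
segments = [loads(text) for text in wkt_segments]

# Stitch the touching segments together into one geometry
route = linemerge(segments)
print(route.geom_type)  # LineString
print(route.wkt)
print(route.length)
```

If the segments do not share endpoints, `linemerge` returns a `MultiLineString` instead, so it is worth checking `geom_type` before assuming you have a single line.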

# Break Time!

# Airflow Organization

## Why?
- At this point, you likely have at least 5-10 separate DAGs written to manage everything that is going on
- **Some** of these DAGs very likely have dependencies on one another
  - For instance, you probably want your DAG to populate your warehouse to run after your DAG to take a snapshot
- The instant that we start using schedules to account for these dependencies (and we are all guilty of this), we are making our pipeline more fragile
- So how can we account for this otherwise?


## Option 1: The Super DAG
- Perhaps the most obvious option is to collapse multiple DAGs into a single DAG
- This lets you very explicitly define the dependencies between them
- The good news about this is that it is fairly easy
- The bad news is that it:
  - Can rapidly get very confusing with a single monolithic DAG
  - Doesn't give you the option to run just a portion of the DAG, so you lose flexibility and visibility

## Option 2: Explicit DAG Run Triggers
- Airflow has a built-in operator called the `TriggerDagRunOperator`
  ```python
  from airflow.operators.trigger_dagrun import TriggerDagRunOperator
  ```
- This is an explicit task that you could add at the end of one DAG to call the next DAG when the current one finishes
- You would then usually **unschedule** the DAG that is being triggered in this fashion
- You can also pass down other information to the downstream DAG should it be necessary


## Triggering Anatomy
- The `TriggerDagRunOperator` has the general form of:
  ```python
  next_dag_task = TriggerDagRunOperator(
    task_id = |||your chosen id|||,
    trigger_dag_id = |||the id of the dag you want to trigger|||,
    conf = |||optional dictionary of info to pass on|||,
    wait_for_completion = |||boolean if you want to wait|||,
    logical_date = '{{ ds }}', #to keep same date
    reset_dag_run = True # allows overwriting on same date
    )
  ```
- If you don't set `reset_dag_run` to `True`, the downstream DAG can only be triggered once per logical date (here, a max of once per day)


## Option 3: Assets
- The recommended and more modern method of handling these sorts of dependencies is to utilize and define _assets_
- An asset is an abstract concept here, and could be anything: an S3 bucket, a database table, a JSON file, etc
- The idea is that we add logic to our DAGs and tasks that informs Airflow of whether any task requires a specific asset or **updates/creates** a given asset
- Once we make Airflow aware of this, we can then schedule DAGs to only run when certain assets have been updated


## Creating Assets
- Each asset is just labeled with a URI string
  - There is nothing magical about this URI string. It is not being used for anything other than to uniquely identify the asset
- Several syntactical ways to create, but I find the most transparent to be:
  ```python
  from airflow.sdk import Asset

  m1_fact_table = Asset(
    uri="postgres://localhost:5432/wh.public.m1fact", 
    name='m1fact')
  s3_bucket = Asset('s3_bucket')
  ```

## Outputting Assets
- Once you have created an asset, you can reference it elsewhere
- To specify that a given task updates or creates a given asset, add it to an `outlets` list
  ```python
  @task(outlets=[s3_bucket])
  def write_to_s3():
    |||body of code|||
  ```
  or, for classic operators,
  ```python
  mytask = SQLExecuteQueryOperator(
    |||usual keywords and corresponding values|||
    outlets = [m1_fact_table]
    )
  ```

## Consuming Assets
- You can also specify that a given task consumes or requires a particular asset
- Just use the `inlets` keyword
  ```python
  @task(inlets=[s3_bucket])
  def read_from_s3():
    |||body of code|||
  ```

<br>

:::{.block .alert name=Important!}
Setting up an inlet to a task does **nothing** as far as your code is concerned. It just helps tools that track lineage to have a record of what tasks touch or work with different blocks of data. It is thus more optional than `outlets`, but recommended.
:::

## Listening for Assets
- Now suppose you have a DAG that you only want to run once a particular asset has been updated or created
  - Maybe a dimension table that you only want to update once the data lake is updated
- In the desired dag, just change the `schedule`:
  ```python
  @dag(
    start_date = pendulum.datetime(2025,7,16),
    schedule = [asset1] #include as many as you want
  )
  def your_dag():
    |||normal dag code|||
  ```

## Scheduling Details
- `schedule = [asset1, asset2]`{.python} would mean that the DAG will wait until **both** assets are updated before it runs
- You can actually add more logic to this if you want:
  - `schedule = (asset1 | asset2)` would trigger when either `asset1` or `asset2` were updated
  - `schedule = (asset1 & (asset2 | asset3))` would trigger once `asset1` and either `asset2` or `asset3` were updated

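## Sketch: Reading the Schedule Logic

A plain-Python truth table for the three example schedules, to make the AND/OR behavior concrete. This is **not** Airflow code: `would_trigger` is a hypothetical helper, and it ignores how Airflow tracks and clears asset updates between runs.

```python
def would_trigger(updated):
    """Evaluate the three example schedules for a set of updated asset names."""
    a1, a2, a3 = (name in updated for name in ("asset1", "asset2", "asset3"))
    return {
        "[asset1, asset2]": a1 and a2,                      # list: all required
        "(asset1 | asset2)": a1 or a2,                      # either one
        "(asset1 & (asset2 | asset3))": a1 and (a2 or a3),  # mixed
    }

for updated in [{"asset1"}, {"asset1", "asset3"}, {"asset1", "asset2"}]:
    print(sorted(updated), would_trigger(updated))
```

With only `asset1` updated, just the `|` schedule would fire. Adding `asset3` also satisfies the mixed schedule, while the list form still waits on `asset2`.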

## Final Assessment
- Assets represent the state of something _at a particular point in time_
  - This means that if you update an asset as part of a backfill, and such an update triggers a downstream DAG, the logical backfill date would be used for the downstream DAG as well
  - This is really nice, but **does** mean that backfills that touch an asset will trigger all downstream DAGs. Make sure you want that
    - If you need to do a backfill and don't want that, you can deactivate the downstream DAG, run the backfill, and then reactivate it
- Assets are just Airflow bookkeeping! If an outside source modifies a table or bucket, nothing is going to trigger!
  - The same is true if you forget to add the asset to a task's `outlets` within Airflow

# Your Time!

## For Next Week
- Milestone 4 is up and the Neo4j database is live!
- Priorities:
  - Getting snapshots set up for the new database if you haven't already (and you are behind if that is the case)
  - Figure out your dimensional model for Milestone 4
  - Work on DAG(s) to populate that model
  - Ensure your documentation is up to date
- Milestone 4 is due next Tuesday
  - Next transfer happens next Wednesday, so dashboards and documentation **must** be in on time (or at least by noon on Wednesday)