The Shortest Path

Jed Rembold

July 16, 2025

Announcements

My only goal over the next two days is getting you feedback
Documentation poll link announced earlier today
Milestone 4 due Tuesday night
- Many of you are already going to have a busy Tuesday evening, so plan accordingly
- Transfer happening next Wednesday afternoon, so stuff needs to be completely done by noon
- Don’t forget documentation!
Don’t forget your weekly reflections today (if you haven’t already for this week)!

Tonight

Determining Shortest Paths
- Neo4j
- NetworkX
Airflow Organization
- Triggering other dags
- Assets
Work time

Determining the Shortest Path(s)

Connecting Paths

One thing that graph databases facilitate is looking at the connecting paths between nodes
- This can be quite complicated in other databases, requiring explicit looping or huge numbers of joins
Looking at the various ways you can get from one node to another is a common problem in mathematical graph theory, and comes up frequently in real-world applications
- E.g. route finding between two locations on a map, supply chain logistics, power grid optimization, neural connectivity
Generally involve two sorts of questions:
- Is the graph fully connected? Could you get to any node starting from any node?
- What are the optimal routes for moving between different nodes?
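The connectivity question can be answered with a breadth-first search. Below is a minimal sketch in plain Python, assuming the graph is stored as a dictionary mapping each node to the set of its neighbors. The helper name `is_connected` and the toy graphs are made up for illustration:

```python
from collections import deque

def is_connected(adjacency):
    """Return True if every node is reachable from an arbitrary start node.

    `adjacency` maps each node to the set of its neighbors (undirected graph).
    """
    if not adjacency:
        return True  # an empty graph is trivially connected
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(adjacency)

# Hypothetical examples: one chain A-B-C, and one graph in two separate pieces
linked = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
split = {"A": {"B"}, "B": {"A"}, "C": {"D"}, "D": {"C"}}

print(is_connected(linked), is_connected(split))
```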

Visualized

Shortest Path Algorithms

There are many shortest path algorithms
- Which one you use likely only matters for very large graphs
More importantly, nearly any software that models data as a graph will provide methods for computing shortest paths
The one requirement is that the data be in graph form
- This has some implications for handling snapshots, which are likely no longer in a graph form
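To make "shortest path algorithm" concrete, here is a minimal sketch of Dijkstra's algorithm, one well-known choice for graphs with non-negative edge weights. It is only an illustration: the `dijkstra_path` helper, the dictionary layout, and the `roads` data are assumptions, and the libraries discussed later may use different algorithms internally.

```python
import heapq

def dijkstra_path(graph, source, target):
    """Return (total_weight, path) for the lightest path, or (inf, []) if none.

    `graph` maps node -> {neighbor: edge_weight}; weights must be non-negative.
    """
    best = {source: 0.0}          # lightest known total weight to each node
    previous = {}                 # node -> node we arrived from
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry, a lighter route was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in best:
        return float("inf"), []
    # Walk backwards from the target to rebuild the path
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]

# Hypothetical weighted graph
roads = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 2},
    "C": {"A": 5, "B": 2, "D": 1},
    "D": {"C": 1},
}
print(dijkstra_path(roads, "A", "D"))
```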

Option 1: Shortest Paths within Neo4j

One option is to pre-compute all possible shortest paths between each node pair and snapshot that as well
- Likely only works reasonably for smaller graph networks, as otherwise the number of possible node pairs gets huge
The benefit is that the computation happens on the server where the Neo4j database lives, which may well have more compute available than wherever your Airflow worker runs
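A quick bit of arithmetic shows how fast the number of pairs grows. The `pair_count` helper below is just for illustration:

```python
def pair_count(n_nodes, directed=False):
    """Number of distinct (source, target) pairs among n_nodes nodes."""
    ordered = n_nodes * (n_nodes - 1)       # every ordered pair, no self-pairs
    return ordered if directed else ordered // 2

for n in (10, 1_000, 100_000):
    print(n, pair_count(n))
```

The count grows with the square of the number of nodes, which is why snapshotting every pair only stays practical for small graphs.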

Simple Shortest Paths: Neo4j

Cypher actually has syntax for extracting the shortest path (by number of connections) between two nodes:

MATCH p = SHORTEST 1 (A:Loc)-[:LINK*]-(B:Loc)
WHERE A.name = 'Start' AND B.name = 'End'
RETURN [n in nodes(p) | n.name] AS stops

Instead of SHORTEST 1 to get just a single path, you can use ALL SHORTEST to get all shortest paths, in the case of ties
If you just need to query a single pair of nodes, this is all you need!

The Neo4j Graph Data Science Library

This syntax though is missing two potentially important requirements:
- Weighting the paths by some attribute
- Easily computing all possible pairs
For this purpose, Neo4j has what it calls the Graph Data Science library, or GDS
The GDS gives access to lower-level, parallelized versions of common graph algorithms
Need to activate in the Docker Compose
- Add the environment variable: NEO4J_PLUGINS: ["graph-data-science"]

Using the GDS

The usual GDS workflow looks something like:
- Project an in-memory graph from the Neo4j database
- Choose and run the desired algorithm in one of several modes:
  - stream just lets you retrieve the results as a query result
  - mutate updates the in-memory graph with the output of the algorithm
  - write updates the Neo4j database with the output of the algorithm
In-memory graphs are only cleaned up automatically when the Neo4j server is reset
- If you are going to use this route, please clean up after yourself

The Projection

There are a few ways you can project a graph from a Neo4j database, but the easiest is what is called the native projection:
```
CALL gds.graph.project(
  graphName,
  nodeProj,
  relationProj,
  { relationshipProperties: ["weight_property"]}
)
```

The Calculation

Once you have projected the graph, you can compute all the distances

CALL gds.allShortestPaths.stream(
  graphName,
  { relationshipWeightProperty: "weight_property"}
) YIELD sourceNodeId, targetNodeId, distance
RETURN
  gds.util.asNode(sourceNodeId).name AS source_name,
  gds.util.asNode(targetNodeId).name AS target_name,
  distance

Note that this only gets you the distances, not the paths!

Finishing up

To get the actual paths, you’d have to loop over the pairs and query
- This is unavoidably cumbersome, and why I’ll suggest another path if that is what you want
  - A shorter path, if you will…
Regardless, to remove the in-memory graph you created:
```
CALL gds.graph.drop(graphName)
```
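For a sense of what the pair-by-pair loop involves, here is a sketch that only builds the parameterized queries, one per pair. It reuses the unweighted SHORTEST 1 pattern from earlier (so it ignores weights), and the names `PATH_QUERY`, `pair_queries`, and the station names are made up for illustration:

```python
from itertools import combinations

# Same pattern as the earlier single-pair query, with parameters for the names
PATH_QUERY = """
MATCH p = SHORTEST 1 (A:Loc)-[:LINK*]-(B:Loc)
WHERE A.name = $start AND B.name = $end
RETURN [n IN nodes(p) | n.name] AS stops
"""

def pair_queries(names):
    """Yield (query, parameters) for every unordered pair of node names."""
    for start, end in combinations(sorted(names), 2):
        yield PATH_QUERY, {"start": start, "end": end}

jobs = list(pair_queries(["Salem", "Albany", "Eugene"]))
for _, params in jobs:
    print(params)
```

Each (query, parameters) pair would then be sent through whatever client you use to talk to Neo4j, which means one round trip per pair.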

NetworkX

If you are trying to get the shortest path after you have taken data out of the graph database, you are going to need to recreate the graph
The NetworkX library in Python is all about setting this up
Import it with: import networkx as nx
- You may need to add it to your Airflow Dockerfile

Graph Types

NetworkX maintains a few types of graphs
Undirected graphs have no “arrows” pointing from node to node
- The connection from node A to node B is the same as from node B to node A
- Created with G = nx.Graph()
Directed graphs have directionality from node to node
- This is technically how Neo4j represents everything: it may or may not matter to your model
- Created with G = nx.DiGraph()
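A small sketch of the difference, using made-up node labels A and B:

```python
import networkx as nx

undirected = nx.Graph()
undirected.add_edge("A", "B")

directed = nx.DiGraph()
directed.add_edge("A", "B")

# Undirected: the A-B edge can be looked up from either side
print(undirected.has_edge("B", "A"))
# Directed: only the A -> B edge exists
print(directed.has_edge("B", "A"))
```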

Populating NetworkX Graphs

You can add nodes individually:

G.add_node(|||node label|||, |||property keywords and values|||)

You can add edges:
```
G.add_edge(|||start node label|||, |||end node label|||, |||properties|||)
```
- Any nodes named in an edge that do not exist yet will be created automatically

You can import from a pandas edgelist

G = nx.from_pandas_edgelist(
  |||df|||, |||source col|||, |||target col|||, |||list of property columns|||
)

Makes an undirected graph by default
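A concrete version of the node and edge calls, as a minimal sketch. The station names and the kind and miles properties are invented for the example:

```python
import networkx as nx

G = nx.Graph()

# Properties are passed as keyword arguments
G.add_node("Salem", kind="station")
G.add_edge("Salem", "Albany", miles=25)
# "Eugene" did not exist yet, so it is created automatically (with no properties)
G.add_edge("Albany", "Eugene", miles=45)

print(sorted(G.nodes))
print(G.number_of_edges())
```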

Accessing Nodes and Edges

Once you have created the graph, you can access various nodes and edges fairly easily
Getting nodes (by label)
```
des_node = G.nodes[|||node label|||]
```

Getting an edge (by adjacent nodes)

des_edge = G[|||node 1|||][|||node 2|||]

You can loop over all the possible nodes or edges using G.nodes() or G.edges()
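Putting those lookups together on a tiny made-up graph (the labels and properties are assumptions for the example). Passing data=True when looping also returns each property dictionary:

```python
import networkx as nx

G = nx.Graph()
G.add_node("Salem", kind="station")
G.add_edge("Salem", "Albany", miles=25)

salem = G.nodes["Salem"]      # dictionary of node properties
leg = G["Salem"]["Albany"]    # dictionary of edge properties

for label, props in G.nodes(data=True):
    print(label, props)
for start, end, props in G.edges(data=True):
    print(start, end, props)
```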

Finding Shortest NetworkX Paths

You can compute the shortest path between two nodes:
```
nx.shortest_path(
  |||your graph|||, 
  source=|||start label|||, 
  target=|||end label|||)
```
- You can also add a weight keyword that references an edge property
You can also compute them all at once:
```
nx.all_pairs_shortest_path(|||your graph|||)
```
- This does not support a weight keyword. Use all_pairs_dijkstra_path instead for that
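A small sketch showing how the weight keyword changes the answer. The graph and the miles property are invented: the direct A-C edge has the fewest hops, but going through B is fewer miles.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("A", "B", miles=10)
G.add_edge("B", "C", miles=10)
G.add_edge("A", "C", miles=50)

# Without a weight, every edge counts as one step
fewest_hops = nx.shortest_path(G, source="A", target="C")
# With a weight, the named edge property is summed along the path
fewest_miles = nx.shortest_path(G, source="A", target="C", weight="miles")

# All pairs at once, weighted; wrap in dict() to index by source then target
all_weighted = dict(nx.all_pairs_dijkstra_path(G, weight="miles"))

print(fewest_hops, fewest_miles, all_weighted["C"]["A"])
```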

Actual Lengths

The previous functions all gave you the sequence of nodes from start to finish
You could loop over them to extract the distances and add them yourself for the full length
Alternatively, you can use shortest_path_length
- all_pairs_shortest_path_length and all_pairs_dijkstra_path_length exist as well
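Both routes to a total length, sketched on an invented three-node graph with a miles property:

```python
import networkx as nx

G = nx.Graph()
G.add_edge("A", "B", miles=10)
G.add_edge("B", "C", miles=10)
G.add_edge("A", "C", miles=50)

# Manual: walk consecutive node pairs along the path and add up the edge property
path = nx.shortest_path(G, source="A", target="C", weight="miles")
manual_total = sum(G[u][v]["miles"] for u, v in zip(path, path[1:]))

# Direct: let NetworkX report the total
direct_total = nx.shortest_path_length(G, source="A", target="C", weight="miles")

# Every pair at once, weighted
all_totals = dict(nx.all_pairs_dijkstra_path_length(G, weight="miles"))

print(manual_total, direct_total, all_totals["A"]["C"])
```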

An Aside: GIS Shenanigans

You have WKT strings in the Neo4j DB containing exact routes between stations
If you want to do something with them in Python, some functions are useful
```
from shapely.wkt import loads
from shapely.ops import linemerge
```
The first takes a WKT string and parses it into a format that the shapely library understands
The second can take a list of LineString objects and merge them into a single LineString, provided they connect end to end (otherwise you get a MultiLineString back)
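A minimal sketch of the two functions together. The WKT strings here are invented and use plain x/y coordinates, so the length is in those coordinate units; real longitude/latitude routes would need more care to get distances in miles or kilometers.

```python
from shapely.wkt import loads
from shapely.ops import linemerge

# Two made-up route segments that share the endpoint (3 4)
leg_one = loads("LINESTRING (0 0, 3 4)")
leg_two = loads("LINESTRING (3 4, 6 8)")

# Because the segments touch end to end, the result is a single LineString
route = linemerge([leg_one, leg_two])

print(route.geom_type, route.length)
```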

Break Time!

Airflow Organization

Why?

At this point, you likely have at least 5-10 separate DAGs written to manage everything that is going on
Some of these DAGs very likely have dependencies on one another
- For instance, you probably want your DAG to populate your warehouse to run after your DAG to take a snapshot
The instant that we start using schedules to account for these dependencies (and we are all guilty of this), we are making our pipeline more frail
So how can we account for this otherwise?

Option 1: The Super DAG

Perhaps the most obvious option is to collapse multiple DAGs into a single DAG
This lets you very explicitly define the dependencies between them
The good news about this is that it is fairly easy
The bad news is that it:
- Can rapidly get very confusing with a single monolithic DAG
- Doesn’t give you the option to run just a portion of the DAG, so you lose flexibility and visibility

Option 2: Explicit DAG Run Triggers

Airflow has a built-in operator called the TriggerDagRunOperator

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

This is an explicit task that you could add at the end of one DAG to call the next DAG when the current one finishes.
You would then usually unschedule the DAG that is being triggered in this fashion
You can also pass down other information to the downstream DAG should it be necessary

Triggering Anatomy

The TriggerDagRunOperator has the general form of:

next_dag_task = TriggerDagRunOperator(
  task_id = |||your chosen id|||,
  trigger_dag_id = |||the id of the dag you want to trigger|||,
  conf = |||optional dictionary of info to pass on|||,
  wait_for_completion = |||boolean if you want to wait|||,
  logical_date = '{{ ds }}', #to keep same date
  reset_dag_run = True # allows overwriting on same date
  )

If you don’t set reset_dag_run to True, the downstream DAG can be triggered at most once per logical date, since a second run with the same date would collide with the first

Option 3: Assets

The recommended and more modern method of handling these sorts of dependencies is to utilize and define assets
An asset is an abstract concept here, and could be anything: an S3 bucket, a database table, a JSON file, etc
The idea is that we add logic to our DAGs and tasks that inform Airflow of whether any task requires a specific asset or updates/creates a given asset
Once we make Airflow aware of this, we can then schedule DAGs to only run when certain assets have been updated

Creating Assets

Each asset is just labeled with a URI string
- There is nothing magical about this URI string. It is not being used for anything other than to uniquely identify the asset

There are several syntactic ways to create one, but I find the most transparent to be:

from airflow.sdk import Asset

m1_fact_table = Asset(
  uri="postgres://localhost:5432/wh.public.m1fact", 
  name='m1fact')
s3_bucket = Asset('s3_bucket')

Outputting Assets

Once you have created an asset, you can reference it elsewhere

To specify that a given task updates or creates a given asset, add it to an outlets list

@task(outlets=[s3_bucket])
def write_to_s3():
  |||body of code|||

or, for classic operators,

mytask = SQLExecuteQueryOperator(
  |||usual keywords and corresponding values|||,
  outlets = [m1_fact_table]
  )

Consuming Assets

You can also specify that a given task consumes or requires a particular asset

Just use the inlets keyword

@task(inlets=[s3_bucket])
def read_from_s3():
  |||body of code|||

Setting up an inlet to a task does nothing as far as your code is concerned. It just helps tools that track lineage to have a record of what tasks touch or work with different blocks of data. Unlike outlets, inlets are therefore optional, but still recommended.

Listening for Assets

Now suppose you have a DAG that you only want to run once a particular asset has been updated or created
- Maybe a dimension table that you only want to update once the data lake is updated

In the desired dag, just change the schedule:

@dag(
  start_date = pendulum.datetime(2025,7,16),
  schedule = [asset1] #include as many as you want
)
def your_dag():
  |||normal dag code|||

Scheduling Details

schedule = [asset1, asset2] would mean that the DAG will wait until both assets are updated before it runs
You can actually add more logic to this if you want:
- schedule = (asset1 | asset2) would trigger when either asset1 or asset2 were updated
- schedule = (asset1 & (asset2 | asset3)) would trigger once asset1 and either asset2 or asset3 were updated
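To see how the & and | combinations read, here is a toy model in plain Python. It is not Airflow's implementation and does not import Airflow: the `Cond` class, the `toy_asset` helper, and the names a1, a2, a3 are all made up, and it ignores the bookkeeping Airflow does about which updates have already been consumed by a previous run.

```python
class Cond:
    """Toy stand-in for an asset condition; not Airflow's implementation."""

    def __init__(self, check):
        self.check = check  # function: set of updated asset names -> bool

    def __and__(self, other):
        return Cond(lambda updated: self.check(updated) and other.check(updated))

    def __or__(self, other):
        return Cond(lambda updated: self.check(updated) or other.check(updated))

    def ready(self, updated):
        """Would a DAG with this schedule be ready, given these updated assets?"""
        return self.check(set(updated))

def toy_asset(name):
    """A condition satisfied once `name` appears among the updated assets."""
    return Cond(lambda updated: name in updated)

asset1, asset2, asset3 = (toy_asset(n) for n in ("a1", "a2", "a3"))
schedule = asset1 & (asset2 | asset3)

print(schedule.ready({"a1"}))          # asset1 alone is not enough
print(schedule.ready({"a1", "a3"}))    # asset1 plus either of the others is
```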

Final Assessment

Assets represent the state of something at a particular point in time
- This means that if you update an asset as part of a backfill, and such an update triggers a downstream DAG, the logical backfill date would be used for the downstream DAG as well
- This is really nice, but does mean that backfills that touch an asset will trigger all downstream DAGs. Make sure you want that
  - If you need to do a backfill and don’t want that, you can deactivate the downstream DAG, run the backfill, and then reactivate it
Assets are just Airflow bookkeeping! If an outside source modifies a table or bucket, nothing is going to trigger!
- This also includes if you forget to set an outlet within Airflow

Your Time!

For Next Week

Milestone 4 is up and the Neo4j database is live!
Priorities:
- Getting snapshots set up for the new database if you haven’t already (and you are behind if that is the case)
- Figure out your dimensional model for Milestone 4
- Work on DAG(s) to populate that model
- Ensure your documentation is up to date
Milestone 4 is due next Tuesday
- Next transfer happens next Wednesday, so dashboards and documentation must be in on time (or at least by noon on Wednesday)