The Shortest Path

Jed Rembold

July 16, 2025

Announcements

  • My only goal over the next two days is getting you feedback
  • Documentation poll link announced earlier today
  • Milestone 4 due Tuesday night
    • Many of you are already going to have a busy Tuesday evening, so plan accordingly
    • Transfer happening next Wednesday afternoon, so stuff needs to be completely done by noon
    • Don’t forget documentation!
  • Don’t forget your weekly reflections today (if you haven’t already for this week)!

Tonight

  • Determining Shortest Paths
    • Neo4j
    • NetworkX
  • Airflow Organization
    • Triggering other dags
    • Assets
  • Work time

Determining the Shortest Path(s)

Connecting Paths

  • One thing that graph databases facilitate is looking at the connecting paths between nodes
    • This can be quite complicated in other databases, requiring explicit looping or huge numbers of joins
  • Looking at the various ways you can get from one node to another is a common problem in mathematical graph theory, and it comes up frequently in real-world applications
    • E.g. Route finding between two locations on a map, supply chain logistics, power grid optimization, neural connectivity
  • Generally involve two sorts of questions:
    • Is the graph fully connected? Could you get to any node starting from any node?
    • What are the optimal routes to move between different nodes?
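The first question can be answered with a breadth-first search. Below is a minimal sketch in plain Python, assuming an undirected graph stored as an adjacency dictionary. The `is_connected` helper and the node names are made up for illustration and are not part of any library.

```python
from collections import deque

def is_connected(adjacency):
    """Return True if every node can be reached from every other node.

    `adjacency` maps each node to the set of its neighbors and is assumed
    to describe an undirected graph (each edge listed in both directions).
    """
    if not adjacency:
        return True  # an empty graph is trivially connected
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    # Connected only if the search reached every node
    return len(seen) == len(adjacency)

# Hypothetical example graphs
linked = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
split = {"A": {"B"}, "B": {"A"}, "C": set()}
print(is_connected(linked), is_connected(split))
```

In `split`, node C has no neighbors, so the search never reaches it and the graph is reported as not connected.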

Visualized

[Figure: a weighted graph with nodes A through H and edge weights between 1 and 7]

Shortest Path Algorithms

  • There are many shortest path algorithms
    • Which one you use likely only matters for the largest of graphs
  • Most important is that almost any software that models data as a graph will provide some methods for you to compute shortest paths
  • The requirement will be that you need the data in a graph form
    • This has some implications for handling snapshots, which are likely no longer in a graph form
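To make "shortest path algorithm" concrete, here is a minimal sketch of one such algorithm, Dijkstra's, in plain Python. The graph and its weights are invented for illustration. In practice you would normally rely on the implementation your graph software provides rather than writing your own.

```python
import heapq

def dijkstra(graph, source):
    """Return the shortest distance from `source` to every reachable node.

    `graph` maps each node to a dict of {neighbor: non-negative edge weight}.
    """
    distances = {source: 0}
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > distances.get(node, float("inf")):
            continue  # stale entry; a shorter route was already found
        for neighbor, weight in graph[node].items():
            candidate = dist + weight
            if candidate < distances.get(neighbor, float("inf")):
                distances[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return distances

# Hypothetical weighted, undirected graph
graph = {
    "A": {"B": 5, "C": 2},
    "B": {"A": 5, "C": 1, "D": 2},
    "C": {"A": 2, "B": 1},
    "D": {"B": 2},
}
print(dijkstra(graph, "A"))
```

Note that the cheapest way from A to B here goes through C (total 3) rather than along the direct edge (5), which is exactly the kind of result a weighted search gives and a hop count does not.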

Option 1: Shortest Paths within Neo4j

  • One option is to pre-compute all possible shortest paths between each node pair and snapshot that as well
    • Likely only works reasonably for smaller graph networks, as otherwise the number of possible node pairs gets huge
  • The benefit is that the computation happens on the server where the Neo4j database lives, which may well have more compute available than wherever your Airflow worker runs
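To get a feel for how quickly the number of node pairs grows, here is a small arithmetic sketch. It assumes undirected pairs, so A to B and B to A count once; for a directed graph the count roughly doubles. The node counts are arbitrary examples.

```python
from math import comb

def undirected_pairs(n_nodes):
    """Number of unordered node pairs, i.e. n * (n - 1) / 2."""
    return comb(n_nodes, 2)

for n in (10, 1_000, 100_000):
    print(f"{n:>7} nodes -> {undirected_pairs(n):>13,} pairs")
```

The growth is quadratic: multiplying the number of nodes by 100 multiplies the number of pairs by roughly 10,000.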

Simple Shortest Paths: Neo4j

  • Cypher actually has syntax for extracting the shortest path by connection between two points:

    MATCH p = SHORTEST 1 (A:Loc)-[:LINK*]-(B:Loc)
    WHERE A.name = 'Start' AND B.name = 'End'
    RETURN [n in nodes(p) | n.name] AS stops
  • Instead of SHORTEST 1 to get just a single path, you can use ALL SHORTEST to get all shortest paths, in the case of ties

  • If you just need to query a single pair of nodes, this is all you need!

The Neo4j Graph Data Science Library

  • This syntax though is missing two potentially important requirements:
    • Weighting the paths by some attribute
    • Easily computing all possible pairs
  • For this purpose, Neo4j has what it calls the Graph Data Science library, or GDS
  • The GDS gives access to lower-level, parallelized versions of common graph algorithms
  • You need to activate it in your Docker Compose file
    • Add the environment variable: NEO4J_PLUGINS: ["graph-data-science"]

Using the GDS

  • The usual GDS workflow looks something like:
    • Project an in-memory graph from the Neo4j database
    • Choose and run the desired algorithm in one of several modes:
      • stream just lets you retrieve the results as a query result
      • mutate updates the in-memory graph with the output of the algorithm
      • write updates the Neo4j database with the output of the algorithm
  • In-memory graphs are only cleaned up automatically when the Neo4j server is reset
    • If you are going to use this route, please clean up after yourself

The Projection

  • There are a few ways you can project a graph from a Neo4j database, but the easiest is what is called the native projection:

    CALL gds.graph.project(
      graphName,
      nodeProj,
      relationProj,
      { relationshipProperties: ["weight_property"]}
    )

The Calculation

  • Once you have projected the graph, you can compute all the distances

    CALL gds.allShortestPaths.stream(
      graphName,
      { relationshipWeightProperty: "weight_property"}
    ) YIELD sourceNodeId, targetNodeId, distance
    RETURN
      gds.util.asNode(sourceNodeId).name AS source_name,
      gds.util.asNode(targetNodeId).name AS target_name,
      distance
  • Note that this only gets you the distances, not the paths!

Finishing up

  • To get the actual paths, you’d have to loop over the pairs and query

    • This is unavoidably cumbersome, and why I’ll suggest another path if that is what you want
      • A shorter path, if you will…
  • Regardless, to remove the in-memory graph you created:

    CALL gds.graph.drop(graphName)
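
One way to make sure the cleanup always happens is to wrap the projection in a Python context manager. This is only a sketch: `run` is a hypothetical callable taking a Cypher string and a parameter dictionary (in practice it might wrap a Neo4j driver session), and the query strings passed in below are placeholders, not complete statements.

```python
from contextlib import contextmanager

@contextmanager
def projected_graph(run, graph_name, project_query):
    """Project an in-memory GDS graph and always drop it afterwards.

    `run` is a hypothetical callable taking (cypher, parameters).
    """
    run(project_query, {"graphName": graph_name})
    try:
        yield graph_name
    finally:
        # Runs even if the algorithm call inside the `with` block fails
        run("CALL gds.graph.drop($graphName)", {"graphName": graph_name})

# Stand-in that just records the queries, so the sketch runs without a server
executed = []

def fake_run(cypher, parameters):
    executed.append(cypher)

with projected_graph(fake_run, "demo", "CALL gds.graph.project(...)"):
    fake_run("CALL gds.allShortestPaths.stream(...)", {})

print(executed[-1])
```

Because the drop sits in a `finally` clause, a failed algorithm call does not leave an orphaned in-memory graph behind on the server.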

NetworkX

  • If you are trying to get the shortest path after you have taken data out of the graph database, you are going to need to recreate the graph
  • The NetworkX library in Python is all about setting this up
  • Import it with import networkx as nx
    • You may need to add it to your Airflow Dockerfile

Graph Types

  • NetworkX maintains a few types of graphs
  • Undirected graphs have no “arrows” pointing from node to node
    • The connection from node A to node B is the same as from node B to node A
    • Created with G = nx.Graph()
  • Directed graphs have directionality from node to node
    • This is technically how Neo4j represents everything: it may or may not matter to your model
    • Created with G = nx.DiGraph()

Populating NetworkX Graphs

  • You can add nodes individually:

    G.add_node(|||node label|||, |||property keywords and values|||)
  • You can add edges:

    G.add_edge(|||start node label|||, |||end node label|||, |||properties|||)
    • Any nodes that do not already exist will be created automatically
  • You can import from a pandas edgelist

    G = nx.from_pandas_edgelist(
      |||df|||, |||source col|||, |||target col|||, |||list of property columns|||
    )
    • Makes an undirected graph by default
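
As a concrete illustration of the first two approaches, here is a small sketch. The place names and the `region` and `miles` properties are made up for the example.

```python
import networkx as nx

G = nx.Graph()  # undirected; use nx.DiGraph() for a directed graph

# Add a node with keyword properties (names and values are made up)
G.add_node("Salem", region="valley")

# Add edges with a property; "Portland" and "Eugene" are created automatically
G.add_edge("Salem", "Portland", miles=47)
G.add_edge("Salem", "Eugene", miles=66)

print(sorted(G.nodes))
print(G.number_of_edges())
```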

Accessing Nodes and Edges

  • Once you have created the graph, you can access various nodes and edges fairly easily

  • Getting nodes (by label)

    des_node = G.nodes[|||node label|||]
  • Getting an edge (by adjacent nodes)

    des_edge = G[|||node 1|||][|||node 2|||]
  • You can loop over all the possible nodes or edges using G.nodes() or G.edges()
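
A short sketch of these lookups on a tiny made-up graph. Passing `data=True` is optional and returns the property dictionaries alongside the labels.

```python
import networkx as nx

G = nx.Graph()
G.add_node("Salem", region="valley")
G.add_edge("Salem", "Portland", miles=47)

# Node and edge lookups return dict-like views of the stored properties
print(G.nodes["Salem"])        # properties of one node
print(G["Salem"]["Portland"])  # properties of one edge

# data=True yields the property dictionaries alongside the labels
for label, props in G.nodes(data=True):
    print(label, props)
for start, end, props in G.edges(data=True):
    print(start, end, props["miles"])
```

Since the graph is undirected, `G["Portland"]["Salem"]` returns the same edge properties.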

Finding Shortest NetworkX Paths

  • You can compute the shortest path between two points:

    nx.shortest_path(
      |||your graph|||, 
      source=|||start label|||, 
      target=|||end label|||)
    • You can also add a weight keyword that references an edge property
  • You can also compute them all at once:

    nx.all_pairs_shortest_path(|||your graph|||)
    • This does not support a weight keyword. Use all_pairs_dijkstra_path instead for that
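
To see the difference the weight keyword makes, here is a sketch on a small invented graph where the direct edge is not the cheapest route. The `cost` property name is an assumption for this example.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("A", "B", cost=5)
G.add_edge("A", "C", cost=2)
G.add_edge("C", "B", cost=1)
G.add_edge("B", "D", cost=2)

# Fewest hops: edge properties are ignored
by_hops = nx.shortest_path(G, source="A", target="B")

# Lowest total cost: weight names the edge property to minimize
by_cost = nx.shortest_path(G, source="A", target="B", weight="cost")

# Every pair at once, collected into a dict of dicts keyed by source then target
all_paths = dict(nx.all_pairs_dijkstra_path(G, weight="cost"))

print(by_hops, by_cost, all_paths["A"]["D"])
```

Without a weight the direct A to B edge wins; with `weight="cost"` the route through C is chosen because its total cost is 3 rather than 5.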

Actual Lengths

  • The previous functions all gave you the sequence of nodes from start to finish
  • You could loop over them to extract the distances and add them yourself for the full length
  • Alternatively, you can use shortest_path_length
    • all_pairs_shortest_path_length and all_pairs_dijkstra_path_length exist as well
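
Both approaches side by side, as a sketch on a small made-up graph (again assuming an edge property named `cost`):

```python
import networkx as nx

G = nx.Graph()
G.add_edge("A", "B", cost=5)
G.add_edge("A", "C", cost=2)
G.add_edge("C", "B", cost=1)

path = nx.shortest_path(G, source="A", target="B", weight="cost")

# Manual route: add up the property on each consecutive pair of stops
manual = sum(G[u][v]["cost"] for u, v in zip(path, path[1:]))

# Direct route: let NetworkX return the total
direct = nx.shortest_path_length(G, source="A", target="B", weight="cost")

print(path, manual, direct)
```

The two totals agree; the manual loop is mostly useful when you want to sum a different property than the one used to choose the path.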

An Aside: GIS Shenanigans

  • You have WKT strings in the Neo4j DB containing exact routes between stations

  • If you want to do something with them in Python, some functions are useful

    from shapely.wkt import loads
    from shapely.ops import linemerge
  • The first takes a WKT string and parses it into a format that the shapely library understands

  • The second can take a list of LineString objects and concatenate them into a single LineString
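
A minimal sketch of the two functions working together. The WKT strings here are invented stand-ins for two consecutive route segments, not data from the course database.

```python
from shapely.wkt import loads
from shapely.ops import linemerge

# Hypothetical WKT strings for two consecutive route segments
segments_wkt = [
    "LINESTRING (0 0, 1 1)",
    "LINESTRING (1 1, 2 0)",
]

segments = [loads(wkt) for wkt in segments_wkt]  # parse into LineString objects
route = linemerge(segments)                      # join segments that share endpoints

print(route.geom_type, list(route.coords))
```

If the segments do not share endpoints, `linemerge` cannot join them and returns a MultiLineString instead, so it is worth checking `geom_type` on the result.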

Break Time!

Airflow Organization

Why?

  • At this point, you likely have at least 5-10 separate DAGs written to manage everything that is going on
  • Some of these DAGs very likely have dependencies on one another
    • For instance, you probably want your DAG to populate your warehouse to run after your DAG to take a snapshot
  • The instant that we start using schedules to account for these dependencies (and we are all guilty of this), we are making our pipeline more fragile
  • So how can we account for this otherwise?

Option 1: The Super DAG

  • Perhaps the most obvious option is to collapse multiple DAGs into a single DAG
  • This lets you very explicitly define the dependencies between them
  • The good news about this is that it is fairly easy
  • The bad news is that it:
    • Can rapidly get very confusing with a single monolithic DAG
    • Doesn’t give you the option to run just a portion of the DAG, so you lose flexibility and visibility

Option 2: Explicit DAG Run Triggers

  • Airflow has a built-in operator called the TriggerDagRunOperator

    from airflow.operators.trigger_dagrun import TriggerDagRunOperator
  • This is an explicit task that you could add at the end of one DAG to call the next DAG when the current one finishes

  • You would then usually unschedule the DAG that is being triggered in this fashion

  • You can also pass down other information to the downstream DAG should it be necessary

Triggering Anatomy

  • The TriggerDagRunOperator has the general form of:

    next_dag_task = TriggerDagRunOperator(
      task_id = |||your chosen id|||,
      trigger_dag_id = |||the id of the dag you want to trigger|||,
      conf = |||optional dictionary of info to pass on|||,
      wait_for_completion = |||boolean if you want to wait|||,
      logical_date = '{{ ds }}', #to keep same date
      reset_dag_run = True # allows overwriting on same date
      )
  • If you don’t set reset_dag_run to True, the downstream DAG will only be triggered a max of once per day

Option 3: Assets

  • The recommended and more modern method of handling these sorts of dependencies is to utilize and define assets
  • An asset is an abstract concept here, and could be anything: an S3 bucket, a database table, a JSON file, etc
  • The idea is that we add logic to our DAGs and tasks that informs Airflow of whether any task requires a specific asset or updates/creates a given asset
  • Once we make Airflow aware of this, we can then schedule DAGs to only run when certain assets have been updated

Creating Assets

  • Each asset is just labeled with a URI string

    • There is nothing magical about this URI string. It is not being used for anything other than to uniquely identify the asset
  • There are several syntactic ways to create one, but I find the most transparent to be:

    from airflow.sdk import Asset
    
    m1_fact_table = Asset(
      uri="postgres://localhost:5432/wh.public.m1fact", 
      name='m1fact')
    s3_bucket = Asset('s3_bucket')

Outputting Assets

  • Once you have created an asset, you can reference it elsewhere

  • To specify that a given task updates or creates a given asset, add it to an outlets list

    @task(outlets=[s3_bucket])
    def write_to_s3():
      |||body of code|||

    or, for classic operators,

    mytask = SQLExecuteQueryOperator(
      |||usual keywords and corresponding values|||
      outlets = [m1_fact_table]
      )

Consuming Assets

  • You can also specify that a given task consumes or requires a particular asset

  • Just use the inlets keyword

    @task(inlets=[s3_bucket])
    def read_from_s3():
      |||body of code|||


Setting up an inlet to a task does nothing as far as your code is concerned. It just helps tools that track lineage keep a record of which tasks touch or work with different blocks of data. It is thus more optional than outlets, but still recommended.

Listening for Assets

  • Now suppose you have a DAG that you only want to run once a particular asset has been updated or created

    • Maybe a dimension table that you only want to update once the data lake is updated
  • In the desired dag, just change the schedule:

    @dag(
      start_date = pendulum.datetime(2025,7,16),
      schedule = [asset1] #include as many as you want
    )
    def your_dag():
      |||normal dag code|||

Scheduling Details

  • schedule = [asset1, asset2] would mean that the DAG will wait until both assets are updated before it runs
  • You can actually add more logic to this if you want:
    • schedule = (asset1 | asset2) would trigger when either asset1 or asset2 were updated
    • schedule = (asset1 & (asset2 | asset3)) would trigger once asset1 and either asset2 or asset3 were updated
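
To reason about when a combined schedule would fire, it can help to play with a toy model of the logic. The sketch below is plain Python and is not Airflow's API or implementation: `ToyAsset`, `Both`, and `Either` are made-up classes that only mimic how `&` and `|` combine in the schedules above.

```python
from dataclasses import dataclass

class Condition:
    """Gives every condition the & and | operators used in the schedules above."""
    def __and__(self, other):
        return Both(self, other)
    def __or__(self, other):
        return Either(self, other)

@dataclass(frozen=True)
class ToyAsset(Condition):
    name: str
    def ready(self, updated):
        return self.name in updated

@dataclass(frozen=True)
class Both(Condition):
    left: Condition
    right: Condition
    def ready(self, updated):
        return self.left.ready(updated) and self.right.ready(updated)

@dataclass(frozen=True)
class Either(Condition):
    left: Condition
    right: Condition
    def ready(self, updated):
        return self.left.ready(updated) or self.right.ready(updated)

asset1, asset2, asset3 = ToyAsset("asset1"), ToyAsset("asset2"), ToyAsset("asset3")
schedule = asset1 & (asset2 | asset3)

# `updated` is the set of asset names refreshed since the DAG last ran
print(schedule.ready({"asset1"}))            # asset1 alone is not enough
print(schedule.ready({"asset1", "asset3"}))  # asset1 plus one of the others
```

In this model a plain list such as `[asset1, asset2]` corresponds to `asset1 & asset2`. The parentheses matter, since Python evaluates `&` before `|`.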

Final Assessment

  • Assets represent the state of something at a particular point in time
    • This means that if you update an asset as part of a backfill, and such an update triggers a downstream DAG, the logical backfill date would be used for the downstream DAG as well
    • This is really nice, but does mean that backfills that touch an asset will trigger all downstream DAGs. Make sure you want that
      • If you need to do a backfill and don’t want that, you can deactivate the downstream DAG, run the backfill, and then reactivate it
  • Assets are just Airflow bookkeeping! If an outside source modifies a table or bucket, nothing is going to trigger!
    • This also includes if you forget to set an outlet within Airflow

Your Time!

For Next Week

  • Milestone 4 is up and the Neo4j database is live!
  • Priorities:
    • Getting snapshots set up for the new database if you haven’t already (and you are behind if that is the case)
    • Figure out your dimensional model for Milestone 4
    • Work on DAG(s) to populate that model
    • Ensure your documentation is up to date
  • Milestone 4 is due next Tuesday
    • Next transfer happens next Wednesday, so dashboards and documentation must be in on time (or at least by noon on Wednesday)