Jed Rembold
July 16, 2025
Cypher actually has syntax for extracting the shortest path by connection between two points:
MATCH p = SHORTEST 1 (A:Loc)-[:LINK*]-(B:Loc)
WHERE A.name = 'Start' AND B.name = 'End'
RETURN [n in nodes(p) | n.name] AS stops
Instead of SHORTEST 1 to get just a single path, you can use ALL SHORTEST to get all shortest paths, in the case of ties
If you just need to query a single pair of nodes, this is all you need!
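To make the difference between a single shortest path and all tied shortest paths concrete, here is a minimal pure-Python sketch of a breadth-first search by hop count. It is only an illustration, not how Neo4j evaluates the query, and the function name all_shortest_paths and the links graph are hypothetical.

```python
from collections import deque

def all_shortest_paths(adjacency, start, end):
    """Return every minimum-hop path from start to end (empty list if none)."""
    # Breadth-first search, recording each node's hop distance and every
    # predecessor that reaches it along a shortest route.
    dist = {start: 0}
    preds = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adjacency.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                preds[nbr] = [node]
                queue.append(nbr)
            elif dist[nbr] == dist[node] + 1:
                preds[nbr].append(node)  # a tie: another equally short way in
    if end not in dist:
        return []

    # Walk the predecessor links backwards to rebuild the paths.
    def build(node):
        if node == start:
            return [[start]]
        return [path + [node] for p in preds[node] for path in build(p)]

    return build(end)

# Hypothetical undirected graph with two tied routes from Start to End.
links = {
    "Start": ["A", "B"],
    "A": ["Start", "End"],
    "B": ["Start", "End"],
    "End": ["A", "B"],
}
paths = all_shortest_paths(links, "Start", "End")
print(paths)
```

Taking only the first entry of paths corresponds roughly to SHORTEST 1, while the full list corresponds to ALL SHORTEST.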
NEO4J_PLUGINS: ["graph-data-science"]
stream just lets you retrieve the results as a query result
mutate updates the in-memory graph with the output of the algorithm
write updates the Neo4j database with the output of the algorithm
There are a few ways you can project a graph from a Neo4j database, but the easiest is what is called the native projection:
CALL gds.graph.project(
graphName,
nodeProj,
relationProj,
{ relationshipProperties: ["weight_property"]}
)
Once you have projected the graph, you can compute all the distances:
CALL gds.allShortestPaths.stream(
graphName,
{ relationshipWeightProperty: "weight_property"}
) YIELD sourceNodeId, targetNodeId, distance
RETURN
gds.util.asNode(sourceNodeId).name AS source_name,
gds.util.asNode(targetNodeId).name AS target_name,
distance
Note that this only gets you the distances, not the paths!
To get the actual paths, you’d have to loop over the pairs and query each one
Regardless, to remove the in-memory graph you created:
CALL gds.graph.drop(graphName)
import networkx as nx
G = nx.Graph()  # undirected graph
G = nx.DiGraph()  # or a directed graph
You can add nodes individually:
G.add_node(|||node label|||, |||property keywords and values|||)
You can add edges:
G.add_edge(|||start node label|||, |||end node label|||, |||properties|||)
You can import from a pandas edgelist
G = nx.from_pandas_edgelist(
|||df|||, |||source col|||, |||target col|||, |||list of property columns|||
)
Once you have created the graph, you can access various nodes and edges fairly easily
Getting nodes (by label)
des_node = G.nodes[|||node label|||]
Getting an edge (by adjacent nodes)
des_edge = G[|||node 1|||][|||node 2|||]
You can loop over all the possible nodes or edges using G.nodes() or G.edges()
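As a minimal sketch of building and inspecting a graph with these calls, assuming networkx is installed. The node labels and the zone and minutes properties are made up for illustration.

```python
import networkx as nx

G = nx.Graph()  # undirected; nx.DiGraph() would make the edges one-way
G.add_node("Start", zone=1)
G.add_node("End", zone=2)
G.add_edge("Start", "Mid", minutes=4)  # "Mid" is created automatically
G.add_edge("Mid", "End", minutes=6)

start_props = G.nodes["Start"]   # dict-like view of the node's properties
edge_props = G["Start"]["Mid"]   # dict of the edge's properties

# Passing data=True also yields the property dictionaries while looping.
for label, props in G.nodes(data=True):
    print(label, props)
for u, v, props in G.edges(data=True):
    print(u, v, props)
```

Nodes that only appear in add_edge calls are still added to the graph, just without any properties.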
You can compute the shortest path between two points:
nx.shortest_path(
|||your graph|||,
source=|||start label|||,
target=|||end label|||)
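A small sketch of this call on a made-up graph, assuming networkx is installed, may help show how the optional weight keyword changes the answer. The labels and the minutes property are hypothetical.

```python
import networkx as nx

# Hypothetical network: the direct A-C link is one hop but slow.
G = nx.Graph()
G.add_edge("A", "B", minutes=2)
G.add_edge("B", "C", minutes=2)
G.add_edge("A", "C", minutes=10)

# Without weight, every edge counts as one step.
fewest_hops = nx.shortest_path(G, source="A", target="C")

# With weight, the named edge property is summed along the path.
fastest = nx.shortest_path(G, source="A", target="C", weight="minutes")
fastest_time = nx.shortest_path_length(G, source="A", target="C", weight="minutes")

# The all-pairs functions yield (source, results) pairs, so dict() collects them.
all_times = dict(nx.all_pairs_dijkstra_path_length(G, weight="minutes"))

print(fewest_hops, fastest, fastest_time, all_times["A"]["C"])
```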
To account for edge weights, add a weight keyword that references an edge property
You can also compute them all at once:
nx.all_pairs_shortest_path(|||your graph|||)
This version does not accept a weight keyword. Use all_pairs_dijkstra_path instead for that
shortest_path_length, all_pairs_shortest_path_length, and all_pairs_dijkstra_path_length exist as well
You have WKT strings in the Neo4j DB containing exact routes between stations
If you want to do something with them in Python, some functions are useful
from shapely.wkt import loads
from shapely.ops import linemerge
The first takes a WKT string and parses it into a format that the shapely library understands
The second can take a list of LineString objects and concatenate them into a single LineString
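A minimal sketch of the two functions working together, assuming shapely is installed. The WKT strings here are invented stand-ins for what you would pull out of the database.

```python
from shapely.wkt import loads
from shapely.ops import linemerge

# Hypothetical WKT strings for two consecutive route segments.
segment_wkts = [
    "LINESTRING (0 0, 1 1)",
    "LINESTRING (1 1, 2 0)",
]

# Parse each string into a shapely LineString.
segments = [loads(wkt) for wkt in segment_wkts]

# Join segments that share endpoints into one geometry.
route = linemerge(segments)
print(route.geom_type, route.length)
```

If the segments do not touch end to end, linemerge returns a MultiLineString instead of a single LineString, so it is worth checking geom_type before relying on the result.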
Airflow has a built-in operator called the TriggerDagRunOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
This is an explicit task that you could add at the end of one DAG to call the next DAG when the current one finishes.
You would then usually unschedule the DAG that is being triggered in this fashion
You can also pass down other information to the downstream DAG should it be necessary
The TriggerDagRunOperator has the general form of:
next_dag_task = TriggerDagRunOperator(
task_id = |||your chosen id|||,
trigger_dag_id = |||the id of the dag you want to trigger|||,
conf = |||optional dictionary of info to pass on|||,
wait_for_completion = |||boolean if you want to wait|||,
logical_date = '{{ ds }}', #to keep same date
reset_dag_run = True # allows overwriting on same date
)
If you don’t set reset_dag_run to True, the downstream DAG will be triggered at most once per day
Each asset is just labeled with a URI string
Several syntactical ways to create, but I find the most transparent to be:
from airflow.sdk import Asset
m1_fact_table = Asset(
uri="postgres://localhost:5432/wh.public.m1fact",
name='m1fact')
s3_bucket = Asset('s3_bucket')
Once you have created an asset, you can reference it elsewhere
To specify that a given task updates or creates a given asset, add it to an outlets list
@task(outlets=[s3_bucket])
def write_to_s3():
|||body of code|||
or, for classic operators,
mytask = SQLExecuteQueryOperator(
|||usual keywords and corresponding values|||,
outlets = [m1_fact_table]
)
You can also specify that a given task consumes or requires a particular asset
Just use the inlets keyword
@task(inlets=[s3_bucket])
def read_from_s3():
|||body of code|||
Setting up an inlet to a task does nothing as far as your code is concerned. It just helps tools that track lineage to have a record of what tasks touch or work with different blocks of data. It is thus more optional than outlets, but recommended.
Now suppose you have a DAG that you only want to run once a particular asset has been updated or created
In the desired DAG, just change the schedule:
@dag(
start_date = pendulum.datetime(2025,7,16),
schedule = [asset1] #include as many as you want
)
def your_dag():
|||normal dag code|||
schedule = [asset1, asset2] would mean that the DAG will wait until both assets are updated before it runs
schedule = (asset1 | asset2) would trigger when either asset1 or asset2 was updated
schedule = (asset1 & (asset2 | asset3)) would trigger once asset1 and either asset2 or asset3 were updated
An asset counts as updated when a task that lists it as an outlet within Airflow completes successfully
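To see how the & and | combinations read as logical conditions, here is a toy pure-Python model. It is not Airflow's implementation, and the class names Cond, ToyAsset, AnyOf, and AllOf are hypothetical. It only mimics the idea that each expression is checked against the set of assets that have been updated.

```python
class Cond:
    """Toy stand-in for an asset condition; not Airflow's own class."""

    def __or__(self, other):
        return AnyOf(self, other)

    def __and__(self, other):
        return AllOf(self, other)


class ToyAsset(Cond):
    def __init__(self, name):
        self.name = name

    def satisfied(self, updated):
        # True once this asset's name appears among the updated ones.
        return self.name in updated


class AnyOf(Cond):
    def __init__(self, *parts):
        self.parts = parts

    def satisfied(self, updated):
        return any(p.satisfied(updated) for p in self.parts)


class AllOf(Cond):
    def __init__(self, *parts):
        self.parts = parts

    def satisfied(self, updated):
        return all(p.satisfied(updated) for p in self.parts)


asset1, asset2, asset3 = ToyAsset("asset1"), ToyAsset("asset2"), ToyAsset("asset3")
schedule = asset1 & (asset2 | asset3)

print(schedule.satisfied({"asset1"}))            # asset1 alone is not enough
print(schedule.satisfied({"asset1", "asset3"}))  # asset1 plus one of the others
```

In this model a plain list such as [asset1, asset2] would correspond to AllOf(asset1, asset2).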