Measuring Events

Jed Rembold

July 30, 2025

Announcements

  • Milestone 5 due next Tuesday night!
  • One last optional milestone after that, whose score will replace your lowest other milestone grade
  • Don’t forget your weekly reflections today (if you haven’t already for this week)!

Tonight

  • Some Event Backtracking
    • Kafka’s Group ID is brittle for data pipelines. How can we do better?
  • Metrics
    • What are metrics?
    • Prometheus
      • Postgres Exporter
      • Node Exporter
      • Airflow Exporter
  • Project Work

Event Backtracking

The Problem

  • Last week when introducing Kafka, we discussed the importance of offsets
  • Utilizing a group id and auto-committing was put forth as a method to ensure you didn’t double count events
  • This is simple, and works
  • But it is fragile
    • What happens if you consume events but then some downstream task fails?
    • What happens if you want to backfill?
  • So what other options do we have?

Manual Offsets

  • To get full flexibility, we have to accept responsibility to manage the offsets ourselves
  • We would thus read in events only with offsets in a particular range during a snapshot
  • Options:
    • Track read event offsets in a database somewhere
    • Look up desired offsets from timestamps
  • Given that our snapshots are usually timestamp bracketed anyway, the second option probably makes the most sense

Deactivate Autocommit

  • To get started, in our config we will want to deactivate confluent_kafka’s default automatic committing

  • This ensures that we are purely in control of what offset(s) we begin reading events from

    config = {
      'bootstrap.servers': |||your server|||,
      'group.id': |||your group|||,
      'enable.auto.commit': False
    }

Partitions

  • To request specific offsets, we also need to explicitly account for the presence of partitions
    • Recall that events from a particular topic might be spread across several partitions to distribute load
  • To ensure we get all the events, we need to explicitly request events from each different partition
  • How do we know what partitions exist?
    • We can request that metadata from the consumer
    con = Consumer(config)
    metadata = con.list_topics(topic=|||your topic|||, timeout=10)
    partitions = metadata.topics[|||your topic|||].partitions
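As a small sketch of working with that metadata: the partitions attribute behaves like a dictionary keyed by partition id, so the ids can be pulled out directly. The helper name partition_ids is made up for illustration, and the function only assumes an object with the same .topics[...].partitions shape that list_topics returns.

```python
def partition_ids(metadata, topic):
    """Return the sorted partition ids for one topic.

    `metadata` is assumed to look like the object returned by
    Consumer.list_topics(): metadata.topics[topic].partitions is a
    dict keyed by integer partition id.
    """
    if topic not in metadata.topics:
        raise KeyError(f"topic {topic!r} not found in cluster metadata")
    # Iterating a dict yields its keys, which here are the partition ids.
    return sorted(metadata.topics[topic].partitions)
```

With the consumer from above this would be called as partition_ids(metadata, |||your topic|||).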

Offset by Time

  • Now, for each partition, we need to figure out which offsets we want, ideally based on a timestamp
  • The Consumer class has a method exactly for this, called .offsets_for_times
    • Takes as an argument a list of TopicPartition objects (a one-element list if you only care about one partition)
      • These essentially describe a single partition of a topic (they are aptly named)
      • The timestamp is passed in the slot where the offset would normally go
    example = con.offsets_for_times(
      [TopicPartition(|||your topic|||, |||chosen partition|||, |||kafka timestamp|||)],
      timeout=10)[0]
  • It returns a list of TopicPartition objects (hence the [0] above), in which each timestamp has been replaced by the earliest offset whose timestamp is at or after the one given

Kafka Timestamps

  • Don’t forget that Kafka uses timestamps in a Unix format, but in milliseconds

  • If you have a pendulum DateTime object, you need to convert it to a timestamp and then multiply by 1000

    normal_time = pendulum.now()
    kafka_time = int(normal_time.timestamp() * 1000)
  • DateTime objects usually track to the microsecond, so you’ll have some decimals left over that you should truncate with int
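For snapshots you typically need two of these timestamps, one at each end of the window. Below is a minimal sketch of that, using the standard library's datetime instead of pendulum so that it has no dependencies. The function names are hypothetical. It also uses integer arithmetic rather than multiplying a float, which sidesteps any floating-point rounding at the millisecond boundary.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_kafka_ms(moment):
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # timedelta // timedelta is exact integer division, so leftover
    # microseconds are simply dropped.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def day_window_ms(day_start):
    """Return (start_ms, end_ms) bracketing the 24 hours that begin at day_start."""
    return to_kafka_ms(day_start), to_kafka_ms(day_start + timedelta(days=1))


start_ms, end_ms = day_window_ms(datetime(2025, 7, 30, tzinfo=timezone.utc))
```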

Extracting Offsets

  • offsets_for_times returns to you a list of TopicPartitions (possibly with only a single entry)

  • In some cases, that is exactly what you want

  • In other cases, you may want to just extract the offset(s) from the TopicPartition

  • Doing so is simple:

    desired_offset = example.offset
  • You can of course loop over a list of TopicPartitions to extract all the offsets
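Putting the last few slides together, one way to get a starting and ending offset for every partition might look like the sketch below. The function name offset_ranges and its parameters are made up for illustration. The try/except is only there so the sketch can run without confluent_kafka installed, in which case a stand-in with the same three attributes is used. One thing it does not handle: the returned offset can be negative when no message exists at or after the requested timestamp, and the sketch passes that through unchanged, so a real pipeline would need to decide what to do in that case.

```python
try:
    from confluent_kafka import TopicPartition
except ImportError:
    # Stand-in that mimics only the attributes used below (an assumption
    # made so this sketch is runnable without the library).
    from collections import namedtuple
    TopicPartition = namedtuple("TopicPartition", ["topic", "partition", "offset"])


def offset_ranges(consumer, topic, partition_ids, start_ms, end_ms, timeout=10):
    """Return {partition_id: (start_offset, end_offset)} for a time window."""
    # The timestamp rides in the offset slot of each TopicPartition.
    starts = consumer.offsets_for_times(
        [TopicPartition(topic, p, start_ms) for p in partition_ids], timeout=timeout)
    ends = consumer.offsets_for_times(
        [TopicPartition(topic, p, end_ms) for p in partition_ids], timeout=timeout)
    # Loop over the returned lists to pull out just the offsets.
    start_by_part = {tp.partition: tp.offset for tp in starts}
    end_by_part = {tp.partition: tp.offset for tp in ends}
    return {p: (start_by_part[p], end_by_part[p]) for p in partition_ids}
```

Since the ending offset belongs to the first event at or after the end time, each range is best read as half-open: start included, end excluded.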

Assigning Things

  • Recall previously that we told the consumer to subscribe to a particular topic
  • When handling things manually, we instead assign particular TopicPartitions to the consumer
    • assign expects a list, even if that list holds only a single TopicPartition
    |||consumer|||.assign(|||your list of topic partitions|||)
  • Past this, the poll loop looks very similar

Overall Gameplan

  • For snapshots then, you might follow the below general algorithm:
    • Determine all the partitions utilized by the desired topic
    • Determine the start and end times of the snapshot in a Kafka timestamp
    • Use offsets_for_times to generate TopicPartition(s) for both the start and end times
      • This effectively gets you the starting offset and the ending offset for each partition during that day. These are what you want to loop over and snapshot
    • Extract the ending offsets from the TopicPartition. The starts are already good to go.
    • Record or save what the ending offset is for each partition
    • As long as you haven’t read in all the desired offset events, keep looping and polling the consumer
      • Only “save” messages that are in the correct range of offsets
      • When you reach the end of a particular partition, remove it from your list/dictionary/set of desired offset_ranges
    • Union the new data with any previous snapshot (if it exists)
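The polling portion of that algorithm could be sketched as follows. This is only one possible shape, not a definitive implementation: the name read_offset_ranges, the max_empty_polls safety valve, and the half-open (start, end) convention are all assumptions. It expects a consumer that has already been assigned TopicPartitions positioned at the starting offsets, and it only relies on poll() plus the error(), partition(), and offset() methods of the returned messages.

```python
def read_offset_ranges(consumer, ranges, poll_timeout=1.0, max_empty_polls=5):
    """Collect messages whose offsets fall in ranges[partition] = (start, end).

    Ranges are half-open: start is included, end is not.
    """
    # Partitions with nothing in the window are dropped up front.
    remaining = {p: r for p, r in ranges.items() if r[0] < r[1]}
    saved = []
    empty_polls = 0
    # Give up after several empty polls in a row so we never spin forever.
    while remaining and empty_polls < max_empty_polls:
        msg = consumer.poll(poll_timeout)
        if msg is None:
            empty_polls += 1
            continue
        empty_polls = 0
        if msg.error():
            continue  # a real pipeline should log or raise here
        part, off = msg.partition(), msg.offset()
        if part not in remaining:
            continue  # partition already finished or never requested
        start, end = remaining[part]
        if start <= off < end:
            saved.append(msg)
        if off >= end - 1:
            # Last wanted offset reached (or passed): this partition is done.
            del remaining[part]
    return saved
```

Checking for off >= end - 1 rather than waiting to see an offset past the end means a partition can finish even if nothing has been written after the window.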

Tricky Edge Cases!

  • Partitions within a topic are append-only and immutable, so no data would ever change, and could only be added to
  • However, topics usually have retention-policies!
  • What happens if you try to take a snapshot of a day when all events have been retired due to retention?
    • Your starting and ending offsets would give the same value! You should probably not loop or create a new snapshot at all in that case.
  • What happens if you try to manually take a snapshot halfway through a day?
    • The starting and ending offsets would be different, so you’d try to snapshot
    • But you’d have lost half your events, and overwriting your old snapshot would lose you data!
    • This is why the UNION step is important
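A minimal sketch of both safeguards is below. It assumes events are stored as dictionaries with "partition" and "offset" keys, which is an illustrative choice rather than anything the slides prescribe, and both function names are hypothetical. Because partitions are append-only, the pair (partition, offset) uniquely identifies an event, which makes it a safe key for the union.

```python
def has_events(start_offset, end_offset):
    """True only when the offset range actually contains something to read."""
    return end_offset > start_offset


def union_snapshots(previous, new):
    """Merge two lists of event dicts, keeping one copy per (partition, offset)."""
    merged = {(e["partition"], e["offset"]): e for e in previous}
    for event in new:
        # Events already in the old snapshot are kept rather than overwritten.
        merged.setdefault((event["partition"], event["offset"]), event)
    return [merged[key] for key in sorted(merged)]
```

With this, a rerun after retention has removed part of the day can only add events to the stored snapshot, never remove them.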

Break!

Metrics

Observability

  • For the remainder of the semester I want to spend some time dwelling on topics of observability
  • These are topics that revolve largely around monitoring and understanding what your pipeline is doing
    • Many of these concepts can be data sources in and of themselves, though we will focus mostly on their ability to help us manage and understand our complex pipelines
  • Why now?
    • I think most everyone has reached a level of complexity in their projects where they could start appreciating what good observability might lend them
    • It exposes you to some concepts and software that might still be a data source you’d see in the future

The Triangle of Observability

  • Logs: Why is it happening?
  • Metrics: What is happening?
  • Traces: Where is it happening?

A Bit on Traces

  • Of these, the one that we will spend the least time with in this class is traces
  • A trace is a selective monitoring of a particular event and any cascading events that result
  • It is generally impossible to trace everything, and so only a representative sample of the key events is followed and recorded
  • Looking at an aggregate of these though can still inform about how information is moving through a system
  • Can also be useful for troubleshooting, to identify where things have gone wrong or where bottlenecks are located

Metrics

  • For the rest of today, we want to focus on metrics
  • A metric is a numeric count or aggregation of events that are occurring
    • E.g. Visits to a website, clicks of a button, DAG run time
  • These usually have a focus on real-time feedback, so that you could quickly identify if something is wrong from various metrics
  • What metrics might help you monitor your data pipelines?

The Gift of Fire

  • If you have a lot of systems, collecting metrics is almost a full data-pipeline in itself, just simpler
  • Generally, both push and pull models are available
  • Easily the biggest open-source player at the moment is Prometheus
  • Prometheus operates on a pull model, where it periodically polls connected clients to access their latest metrics
  • Stores everything in a time-series database, which it then makes available for downstream users (hello Grafana)

Prometheus Visualized

Exporters

  • Once Prometheus itself is set up, most of the effort comes from the exporters
  • These make various metrics available to Prometheus, usually through an API endpoint
  • These could be handwritten, for bespoke software
  • But for existing software, tons of exporters already exist!
    • Node exporter for metrics about the computer or VM
    • Postgres exporter for metrics about a Postgres database
    • Statsd exporter for Airflow metrics
    • And many more!
  • Many bits of software also offer native Prometheus metrics
    • MinIO, for instance, has a metrics endpoint that Prometheus can read directly from
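To give a feel for what a handwritten exporter involves, here is a bare-bones sketch using only the Python standard library. It renders metrics in the Prometheus text format and serves them at /metrics. The metric name, the port, and the function and class names are all made up for illustration. In practice the official prometheus_client package would likely be a better starting point.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(metrics):
    """Render {name: (type, help_text, value)} in the Prometheus text format."""
    lines = []
    for name, (metric_type, help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical metric; a real exporter would refresh these values.
    metrics = {"pipeline_snapshot_rows": ("gauge", "Rows in the latest snapshot", 0)}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.metrics).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (arbitrary) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint, which Prometheus could then be pointed at like any other target.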

Prometheus Setup!

  • As per usual, we can handle this through Docker and Docker Compose
    • You may want to add your various exporters to the same Docker Compose file, as deemed appropriate
    services:
      prometheus:
        image: prom/prometheus
        ports:
          - 9090:9090
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - ./prom_data:/prometheus
        user: 1000:1000

Configuring Prometheus

  • Because Prometheus pulls data to it, it needs a configuration file to tell it what it should be pulling and how often

  • This should be in the same folder as your docker-compose.yml file (at least as set up there)

  • A highly basic configuration might initially look like:

    global:
      scrape_interval: 15s # How frequently to scrape
      evaluation_interval: 15s # for rules/alerts
    
    scrape_configs:
      - job_name: 'prometheus' # Name of the scraping job
        static_configs:
          - targets: ['localhost:9090'] # API endpoint 

Exporter 1: The Node Exporter

  • Sometimes you just want to be able to monitor the overall system (or systems) where code is running
  • This is where the node exporter comes in!
  • We can run it in a container, and thus just add it to our docker stack
    • Does require a few extra settings to ensure we get the system metrics, not the container metrics
    services:
      node-exporter:
        image: quay.io/prometheus/node-exporter
        command:
          - '--path.rootfs=/host'
        network_mode: host
        pid: host
        restart: unless-stopped
        volumes:
          - '/:/host:ro,rslave'

Exporter 1: Adding to Config

  • Launching the exporter container only makes the metrics available. We still need to tell Prometheus to grab these new values

  • Under scrape_configs, we just need to add a new job entry:

    ...
    scrape_configs:
      ...
    
      - job_name: 'vm_node'
        static_configs:
          - targets: ['example.advde:9100']
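One way to check that a new target is actually being scraped is to ask Prometheus for its built-in up metric, which is 1 for a healthy target and 0 otherwise. The sketch below does this against Prometheus' HTTP query API using only the standard library. The function names are hypothetical, and the base URL in the usage note is an assumption to adjust to wherever your Prometheus is running.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload):
    """Turn an instant-query JSON response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    # Each sample's value is a [timestamp, "number as string"] pair.
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def query(base_url, promql, timeout=10):
    """Run an instant query and return the parsed samples."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_vector(json.load(resp))
```

For instance, query("http://localhost:9090", 'up{job="vm_node"}') should return one sample per target in that job once the scrape is working.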

Exporter 1: Visualization

  • We can access this information directly through the Prometheus dashboard on port 9090
  • But an even nicer approach is to take advantage of Grafana
  • In your Grafana instance:
    • Add a new data source, and select Prometheus
    • Enter in the URL, but nothing else needs to change (test it! It should work!)
  • Now, we could build our own dashboard with this information, but an even nicer use-case is to leverage something someone else has made
    • Go to dashboards, select new and then select Import
    • In the Grafana Dashboard ID field, enter 1860 and Load!

Exporter 2: The Postgres Exporter

  • What if we want more details about specifically what is happening in our database?

  • There exists a Postgres exporter!

    services:
      postgres-exporter:
        image: prometheuscommunity/postgres-exporter
        environment:
          DATA_SOURCE_NAME: "postgresql://user:pass@host:port/dbname?sslmode=disable"
        ports:
          - 9187:9187

Exporter 2: Adding to Config

  • Don’t forget to update the Prometheus config!

    ...
    scrape_configs:
      ...
    
      - job_name: 'postgres'
        static_configs:
          - targets: ['example.advde:9187']

Exporter 2: Visualization

  • There is a pre-built dashboard for the Postgres-Exporter as well, but it requires a bit of adaptation
  • I have made the necessary tweaks and offer the configuration JSON to you here
  • You can still go to create a new dashboard, import, but then either copy and paste in the JSON or upload the file

Exporter 3: Airflow

  • Getting Airflow metrics into Prometheus requires a bit of a middle step

  • Airflow has built in ways of exporting metrics to a system called StatsD

  • We can effectively turn this on, and then set up a statsd-exporter that makes metrics available in a Prometheus format

  • The easiest way to turn this on is to set 3 environment variables in Airflow’s docker-compose.yml:

    AIRFLOW__METRICS__STATSD_ON: 'true'
    AIRFLOW__METRICS__STATSD_PORT: 9125
    AIRFLOW__METRICS__STATSD_HOST: 'example.advde'

Exporter 3: StatsD Exporter Setup

  • The StatsD Exporter benefits from a cheat-sheet for how it can translate from statsd to Prometheus names

    • Grab it from here, then either copy/paste or download it into a file named statsd_mapping.yml in the same location as prometheus.yml
  • Then we can set up the exporter in Docker Compose

    services:
      statsd-exporter:
        image: prom/statsd-exporter
        ports:
          - 9102:9102 # Where to access metrics
          - 9125:9125 # Incoming metrics
          - 9125:9125/udp
        volumes:
          - ./statsd_mapping.yml:/tmp/statsd_mapping.yml
        command:
          - '--statsd.mapping-config=/tmp/statsd_mapping.yml'

Exporter 3: Prometheus Config

  • There is nothing magical about the new Prometheus config:

    ...
    
    scrape_configs:
      ...
      - job_name: 'airflow'
        static_configs:
          - targets: ['example.advde:9102']

Exporter 3: Visualization

  • Unfortunately, all of the dashboards I’ve found for this so far are either broken or terrible
  • I’ll see if I can’t put one together for next week that is more useful

Work Time!

Your Turn

  • The rest of the evening is set aside for you to work on your projects
  • Deadline is Tuesday!
  • Before you leave tonight, make sure that you at the very least have reliable event snapshots occurring