---
title: "Monte Carlo Markov Chains"
author: Jed Rembold
date: April 8, 2025
slideNumber: true
theme: tokyo-night-light
highlightjs-theme: tokyo-night-light
width: 1920
height: 1080
transition: slide
p5js: true
---


## Announcements
- Homework
  - Homework 5 posted! You should be good to go on Prob 1 and Part A of Prob 2 after today
  - I'm working on HW3 feedback
- Quiz 2 on Exoplanets and Galaxies at end of class on Thursday
  - Study guide is posted!


## Recap
- When it comes to fitting models, the best-fitting parameters may tell an incomplete story
- Often, we also want to know the error or uncertainty in our parameters
  - Methods like least-squares can do this, but only reliably for narrow use-cases
  - We'd like other methods to get at similar information
- Random or "Monte Carlo" methods can help deal with high-dimensional data
  - Errors for this type of data otherwise increase exponentially for most numeric methods

## Sampling from a PDF in R
- Last class we saw a Python function to sample from a given probability distribution
- We can of course build something similar in R:
  ```{.r style='font-size:.9em'}
  sample_from_prob_dist <- function(pdf, lower, upper, num) {
    samples <- c()
    while (length(samples) < num) {
      x <- runif(1, lower, upper)
      p <- pdf(x)
      if (runif(1) <= p) {
        samples <- c(samples, x)
      }
    }
    return(samples)
  }
  ```

## An Unknown PDF
- What if we don't know or have a PDF to work with?
- The _perfect_ PDF would look just like the function we are trying to integrate, but scaled to have area = 1
  - But if we could easily calculate the area, we wouldn't be doing this integration in the first place
- Instead we will construct a process to semi-randomly "walk" it's way around our initial function
  - Goal is to have it spend more of its time near "higher" parts of the function
  - Then we'll look at the distribution of where it spent time

## Today
- Sampling data using a Markov Chain
  - Accepting or rejecting based on Metropolis-Hastings
- Applications to model fitting

<!--
## Example 2
:::{.block name=Previously...}
Now suppose we wanted to evaluate the area under the piece-wise function:
$$ f(x) = \begin{cases}0 & x \leq 4;\\ 5x-20 & 4 < x \leq 5;\\ -5x+30 & 5 < x \leq 6;\\ 0 & x > 6;\end{cases} $$
from $x=0$ to $x=10$, using Monte Carlo methods. Check into both uniform sampling and sampling targeted about the central peak.
:::
- Uniform sampling worked here, but our targeted sampling was struggling. What went wrong?


## Issue #1
- **Make absolutely sure your probability distribution function (pdf) actually integrates to 1!**
- Previously, I had equated a triangle with height 0.5 to thus have slopes of 0.5, and -0.5
  - This is, of course, not necessarily true
- A valid pdf:
  $$
  p(x) = \begin{cases}
            0, & x <= 3 \\
            0.25x - 0.75, & 3 < x <= 5 \\
            -0.25x + 1.75, & 5 < x <= 7 \\
            0, & x > 7
          \end{cases}
  $$

## Issue #2
- Trying to select from a continuous probability distribution discretely is not ideal
  - But it is what is immediately available though numpy
- For a better option, roll your own selector:
  ```{.python style='line-height:1.2em'}
  def sample_from_prob_dist(pdf, lower, upper, num):
      samples = []
      while len(samples) < num:
          x = np.random.uniform(lower, upper)
          p = pdf(x)
          if np.random.uniform(0,1) <= p:
              samples.append(x)
      return np.array(samples)
  ```


## Returning to Example 2
Now suppose we wanted to evaluate the area under the piece-wise function:
$$ f(x) = \begin{cases}0 & x \leq 4;\\ 5x-20 & 4 < x \leq 5;\\ -5x+30 & 5 < x \leq 6;\\ 0 & x > 6;\end{cases} $$
from $x=0$ to $x=10$, using Monte Carlo methods. Sample from the probability distribution defined as:
  $$
  p(x) = \begin{cases}
            0, & x <= 3 \\
            0.25x - 0.75, & 3 < x <= 5 \\
            -0.25x + 1.75, & 5 < x <= 7 \\
            0, & x > 7
          \end{cases}
  $$
-->


# Chaining Together a Walk

## Markov Chains
- A Markov chain is a model that represents a sequence of events where the probability of the next event is determined only by the current event
- In our case, the point we are currently "at" will make it likely that our next point is nearby
  - However, we don't just want a totally _random_ nearby point
- We want to generally walk "uphill", because that's where the important "stuff" is, but we don't want to **always** go uphill
  - Why not?
    - We could get "stuck" in a local maximum
    - Our whole goal is to "sample" the space, not directly optimize it. A bit of wandering is a good thing!


## Metropolis-Hastings
:::incremental
- There are a variety of potential algorithms, but we'll follow the work of Metropolis and Hastings
- Our process will be as follows:
  #. Start a list of all the visited points (initially empty)
  #. Choose a starting point, and add it to the visited list
  #. For some number of iterations:
     - Randomly choose point nearby to your current location
     - Compare that point to your current point, to see if it would move you "higher" up the function
     - Randomly select a number and compare to a threshold. If the the number is smaller, move to that point, otherwise revisit your current location.
       - Add your destination to your visited list
:::

## Choosing a Nearby Point
- Randomly choosing a nearby point inherently pulls from a probability distribution centered on the current point
- Most frequently used is a Gaussian probability distribution
  - Centered (mean) on the current point
  - Spread (std) can be tuned, but 1 is a reasonable starting value
- Working in higher dimensions? A multivariate Gaussian will still work!
  ```python
  np.random.multivariate_normal(mean=VEC, cov=MAT)
  ```
  ```r
  MASS::mvrnorm(n, mu=VEC, Sigma=MAT)
  ```

## Move onward or stay?
- Once you have the new point, you need to decide whether to keep it or your current point
- In Metropolis-Hastings, you:
  - Take the ratio of your sampled function evaluated at the new point over the sampled function evaluated at the current
    $$ \frac{f(x^\prime)}{f(x)} $$
    where $x$ is your current position and $x^\prime$ your proposed new position
  - Compare that ratio of a randomly generated number between 0 and 1
    - If bigger, keep the new position, making it your new current
    - If smaller, keep the old position (but still add it to your visited list again)

## Hand Example
::::::cols
::::col
- It can be instructive to see how this works by hand in a simple example
- Suppose we want to sample the 1D function space to the right
- We will start at the point x = 1.5
- For convenience, say we randomly move in steps of 1
::::

::::col

![](../images/metropolis_hastings_hand_example.svg)

::::
::::::


## Backing out a PDF
- The history of the random walk contains information about where the walker spent the most time
  - Due to how we constructed the walk, this **should** be near the maximum portions of our function
- Creating a histogram or KDE plot will allow us to visualize the probability distribution.
- We could use the histogram information to directly sample from the distribution


## Example: Walking the Function
- Last class we were looking at the piece-wise function described by:
  $$
  f(x) = \begin{cases}
  0 & x < 3; \\
  -\frac{4}{5}x + 4 & 3 \leq x \leq 5; \\
  0 & x > 5 \end{cases}
  $$
- Let's "walk" this function according to Metropolis-Hastings to generate a PDF of the same shape


# Back to Model Fitting

## Model Fitting and Baye's Theorem
- Here we are not as concerned with the best fit
- Concerned instead with our confidence about our fit parameters
  - "What was the probability of getting these parameters given this data?"
- Baye's Theorem provides a way to compute this probability
  $$ P(\theta | D) = \frac{P(D | \theta) \cdot P(\theta)}{P(D)} $$
  where $\theta$ represents our fit parameters and $D$ represents the data
- For our purposes, viewing this through a Bayesian lens will be more informative


## Baye Life

![](../images/bayes.svg)


## The Breakdown (Part I)
- The **Prior** is the probability of the parameters without any consideration of the data
  - This generally reflects any knowns or assumptions you are making about the parameters
  - Are they all positive? Are they bounded in some way?
- The **Likelihood** is the probability of getting the data given the parameters
  - For model fitting, this is where we compare the actual data to the predicted data by our model
  - The better the match, the greater the likelihood


## The Breakdown (Part II)
- The **Evidence** is the probability of the data being the way it is
  - This is _extremely_ hard to measure, and also largely pointless for our use-case
  - It doesn't depend on the parameters, so would just be a constant scaling factor
- The **Posterior** is the probability of the parameters given the data
  - This is what we want!
  - "How likely are these parameters given our data?"

## The Big Picture
- Our goal here is to sample the right-hand side of Baye's theorem using MCMC
- The resulting distribution would also describe the probabilities on the left-hand side
  - At least within a scaling factor
- What we are really interested in though is the _spread_ of the posterior probability distribution, so this scaling is of no consequence to us

## Logging
- We can get a pretty huge dynamic range when computing the values of the right-hand side
- Recall that we compute these for each step to see if we take the random step or not
- To avoid computational overflow/underflow errors, it can be recommended to work in ln-space instead
- Here, the accept/rejection step becomes:
  $$\frac{f(\theta^\prime)}{f(\theta)} > r \quad\rightarrow\quad \ln f(\theta^\prime) - \ln f(\theta) > \ln r $$


## The Prior
- Defining the prior pdf is usually straightforward
- If using an unbounded, flat, prior, then it should just return 1 always (0 in ln-space)
- If bounded, check the parameters and return 1 (0 in ln-space) if within the bounds or $-\infty$ otherwise (`-np.inf` in Python, `-Inf` in R)
- Pseudo-example:
  ```mypython
  |||function ln_prior(|||params|||)|||
    if |||illegal condition|||
      return -|||infinity|||
    return 0
  ```

## The Likelihood
- The likelihood essentially compares the data our model would output to our actual data
- The goal is to minimize the differences between the two
  - Additionally, things are normally scaled by known errors, so that values with more error have less weight
- As such:
  $$ P(\theta | D) \propto \exp(-\chi^2(\theta) / 2) $$
  where
  $$ \chi^2(\theta) = \sum_{i=1}^N \frac{(y_{i,D} - y_{i,\theta})^2}{\sigma_{i,D}^2} $$

## The Likelihood (In Code)
- The ln-likelihood then could look like:
  ```{.mypython style='font-size:.9em'}
  |||function ln_likelihood(|||params, data|||)|||
    m,b = params
    x,y,errY = data # extract data
    y_model = m * x + b # compute model result
    return - 0.5 * |||sum(|||
      (y - y_model) ** 2 / errY ** 2
    |||)|||
  ```

## All together now...
- Bring both pieces together (since we don't care about $P(D)$):
  ```{.mypython style='font-size:.9em'}
  |||function ln_pdf(|||params, data|||)|||
    p = |||call ln_prior|||
    if p == -|||infinity||| # no sense continuing
      return -|||infinity|||
    return p + ln_likelihood(params, data)
  ```

<!--
## Quiz Time!
- The remaining time is set aside for Quiz 2
-->