Fitting

Jed Rembold

April 10, 2023

Announcements

HW5 should hopefully be coming out over the coming weekend
- Things have been a bit slowed by my travel
I almost got through HW2 feedback before I left, but not quite. I’m so sorry.
If my current connection does not allow a decent stream, then I’ll make a recording and post it.
On Tuesday we will have another remote lecture

Recap

Rotation or velocity curves measure the speed of orbiting objects as you move outwards from a central body
- Commonly assume circular or largely circular orbits
- Generally measured through Doppler shifts
The shape of a rotation curve tells you about the distribution of material throughout the region
The rotation curves of galaxies do not match what we would expect, thus leading to conjectures about dark matter

Discussing Today

What do we want from a model fit?
Why might we need methods beyond least-mean squares?
How can randomness help us determine the output of integrals?

Why fit data?

A “fit” model computes the values of unknown constants within the model
In science, and in astronomy in particular, such models are usually derived from fundamental first principles, and thus those constants are linked to physical properties or laws
- The T when we fit Planck’s Law, for instance
Individual fits themselves may still not be that important: rather, they can contribute to an overall story that the data is telling
- “Hotter things emit more light and bluer light”

Fitting Neglect

In the fitting that we have done so far this semester, we have been leaving some important things out
- What role does any known error or variance play in our fitting of data?
- How confident should we be in our fit parameters? That is, what variance should we expect in the fit parameters?
All of these things can sometimes be estimated easily from current techniques, and can sometimes have easy analytic results.
- But what about in more complicated situations?

A seeming aside: Solving Integrals

Suppose you want to find the area under a function \(f(x)\) between \(x=a\) and \(x=b\), such that you are interested in \[ \int_a^b f(x)\,dx \]
If \(f(x)\) is simple and you know a bit of calculus, this is relatively straightforward
But what if \(f(x)\) is some terribly complicated function that you don’t know how to integrate?
- Or worse, what if \(f(x)\) is a largely black-box, wherein you put in an input and then some time later get an output, without seeing what happens in the middle?

Alternative Approaches

When posed with an intractable integral, most folks these days look to:
- Wolfram Alpha, or some other computation engine, for aid with analytic integration
- Numerical techniques
Numerical techniques rely on breaking the integral up into simple approximations
- The key here is that they are always approximations (midpoint rule, trapezoid rule, Simpson’s rule, etc.)
- As approximations, there will always be some inherent error associated with using these methods

Entering the Multiverse

For nice one-dimensional problems, these methods can work out nicely
But what if our function was instead dependent on, for instance, 6 parameters? \[ f(x,y,z,a,b,c) \quad\rightarrow\quad \int\int\int\int\int\int f(x,y,z,a,b,c)\,dx\,dy\,dz\,da\,db\,dc \]
The trapezoid rule is a decent approximation method, but if you work it out, its error scales as: \[ \epsilon \propto \frac{1}{N^{2/d}} \] where \(N\) is the number of sample points (how skinny are the rectangles) and \(d\) the number of dimensions being integrated over

Compiling Errors

In order to keep a constant error value then as the number of dimensions increase: \[ N = (1/\epsilon)^{d/2} \]
This is highly problematic for higher dimensional data, as we need exponentially more data to keep the errors at a given level!
- We need to compute the area of each of those trapezoids, so exponentially more trapezoids means exponentially longer running time

A Random Solution

What if, instead of slicing up the entire region of interest, we instead randomly sampled from that region
Computing, for instance, just the area at a given, randomly selected point
It turns out this error scales as: \[ \epsilon \propto \frac{1}{\sqrt{N}} \] which no longer has any \(d\) dependence!!
This is the basis behind random (or Monte Carlo) selection methods

Random Areas?

How do random points get us an accurate portrayal of the area under the curve?
- At each random point, compute the value of the function at that point, this will be the height of your rectangle
- The width of your rectangle is just the span of your boundaries
Some points will overestimate the area, and some will underestimate the area
The amazing aspect is that given enough points, the average of the point areas will closely approximate the actual area under the curve

Example

Suppose we wanted to evaluate the area under the curve \[ f(x) = -\frac{4}{5}x + 4 \] from \(x=0\) to \(x=5\), using Monte Carlo methods. This describes a triangle with height 4 and width 5, and thus should have an area of 20. Is that what we get?

Example 2

What if we slightly complicate this function by looking instead at a piece-wise function: \[ f(x) = \begin{cases} 0 & x < 3; \\ -\frac{4}{5}x + 4 & 3 \leq x \leq 5; \\ 0 & x > 5 \end{cases} \] This now describes a triangle with height of 1.6 and width of 2, thus having an area of 1.6. Does the Monte Carlo method still work?

Probability Distributions

A probability distribution describes the likelihood of selecting a particular value over a range of values.
The area under the probability distribution must sum to 1, so the odds add up.
In our current examples, any number between 0 and 5 was equally likely, which would have led to the probability distribution to the right.

Importance

For some functions, all points might not be equally interesting, and thus a constant probability distribution makes no sense.
We instead might want to weigh certain regions higher in our random selection, as that is where the interesting things are happening
If we do so, we need to recognize that in computing our areas, we are now doing \[ A = \frac{f(x_k)}{p(x_k)} \] where \(p(x_k)\) is the probability of selecting a particular \(x\)
- This is the same as what we were doing before, since for a uniform sampling, the probability weights looked like 0.2, and 1/0.2 is 5.

Selecting from a probability distribution

Trying to select from a continuous probability distribution discretely is not ideal, but that is all that is immediately available through Numpy

For a better option, roll your own selector:

def sample_from_prob_dist(pdf, lower, upper, num):
    samples = []
    while len(samples) < num:
        x = np.random.uniform(lower, upper)
        p = pdf(x)
        if np.random.uniform(0,1) <= p:
            samples.append(x)
    return np.array(samples)

Example 3

Now suppose we wanted to evaluate the area under the same piece-wise function as earlier: \[ f(x) = \begin{cases} 0 & x < 3; \\ -\frac{4}{5}x + 4 & 3 \leq x \leq 5; \\ 0 & x > 5 \end{cases} \] But this time choosing points based on two different probability distributions: \[ {p(x) = \begin{cases} 0 & x < 3; \\ 0.5 & 3 \leq x \leq 5; \\ 0 & x > 5 \end{cases}}\qquad {p(x) = \begin{cases} 0 & x < 3; \\ \frac{1}{2}\left(x - 3\right) & 3 \leq x \leq 5; \\ 0 & x > 5 \end{cases}} \]

An Unknown PDF

What if we don’t know or have a PDF to work with?
The perfect PDF would look just like the function we are trying to integrate, but scaled to have area = 1
- But if we could easily calculate the area, we wouldn’t be doing this integration in the first place
Instead we will construct a Markov Chain to semi-randomly “walk” it’s way around our initial function
- Goal is to have it spend more of its time near “higher” parts of the function
- Then we’ll look at the distribution of where it spent time

Markov Chains

A Markov chain is a model that represents a sequence of events where the probability of the next event is determined only by the current event
In our case, the point we are currently “at” will make it likely that our next point is nearby
Our process will then follow that of Metropolis-Hastings and is shown below. Note that you must keep track of the visited points along the way!
- Choose a starting point
- For some number of iterations:
  - Randomly choose a nearby point
  - Compare that point to your current point, to see if it would move you “higher” up the function
  - If above a random threshold, move to that point, otherwise stay where you are at

Choosing a Nearby Point

Randomly choosing a nearby point inherently pulls from a probability distribution centered on the current point
Most frequently used is a Gaussian probability distribution
- Centered (mean) on the current point
- Spread (std) can be tuned, but 1 is a reasonable starting value
```
np.random.normal(loc=0, spread=1)
```
Working in higher dimensions? A multi-variate Gaussian will still work!
```
np.random.multivariate_normal(mean=VEC, cov=MAT)
```

Move onward or stay?

Once you have the new point candidate, you need to decide whether to keep it or your current point
In Metropolis-Hastings, you:
- Take the ratio of your sampled function evaluated at the new point over the sampled function evaluated at the current \[ \frac{f(x^\prime)}{f(x)} \] where \(x\) is your current position and \(x^\prime\) your proposed new position
- Compare that ratio to a randomly generated number between 0 and 1
  - If bigger, keep the new position, making it your new current
  - If smaller, keep the old position (but still add it to your visited list again)

Backing out a PDF

The history of the random walk contains information about where the walker spent the most time
- Due to how we constructed the walk, this should be near the maximum portions of our function
Creating a histogram or KDE plot will allow us to visualize the probability distribution.
We could use the histogram information to directly sample from the distribution