---
title: "Logistic Regression"
author: Jed Rembold
date: March 18, 2025
slideNumber: true
theme: tokyo-night-light
highlightjs-theme: tokyo-night-light
width: 1920
height: 1080
transition: slide
---


## Announcements
- HW4 is out!
  - You can already do the first two problems in their entirety!
  - We'll be spending the rest of this week and a day after break getting you the techniques for Problem 3
  - I'm actually bumping the HW4 deadline back to Thursday (Apr 3) because of the need for the day after break.
- I'm 80% of the way through the HW2 feedback
  - Once I finish it, I'll give you 2-3 days of **non-break** time to make any updates to HW3 that you might desire based on the feedback.
  - I'm targeting by the end of Thursday, which I think is actually doable

## Recap
::: {style='font-size:.8em'}
- Galaxies largely form as the result of large regions of mostly hydrogen gas collapsing inwards under gravity
- Galaxies come in three main "flavors":
  - Spiral Galaxies
  - Elliptical Galaxies
  - Irregular
- Classification models commonly use confusion matrices to judge the relative quality of the model
  - There are ways to attach numbers to these for comparison, but the goal is almost never to just optimize a number
- Machine Learning methods are more concerned with predicting, as opposed to statistical methods focusing on inference
- Supervised learning requires labeled data for training, where the categories are already known

:::

## Discussing Today
- Basic data prep in Python and R
- Our first model: Logistic Regression
  - Building, training, and testing models in Python
  - Building, training, and testing models in R

# Prepping the Data
## Training vs Testing
- Because of the iterative approach, many models will, if given enough time, _perfectly_ model the data
  - **THIS IS A BAD THING!**
- If a model too perfectly matches a given set of data, the chances of it being able to accurately predict other data are greatly diminished
  - Generally called _overfitting_
- It is common then to set aside a portion of data that the model is **not** trained on to serve as a test to compare the model against
- These are generally denoted as the "training" and "testing" data sets
  - A common split is to put about 80% of the observations into the training set, and reserve the remaining 20% for the testing
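
A minimal sketch of why the held-out test set matters, on synthetic data. The decision tree here is only a convenient stand-in for a model that can memorize its training data, and the dataset, sizes, and noise level are all made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data where 20% of the samples get a randomly assigned label (noise)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# An unrestricted tree keeps splitting until it memorizes the training set
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The training accuracy comes out perfect because the tree has also memorized the noise, while the accuracy on the unseen test data is noticeably lower.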

## The Libraries
- Doing this sort of work benefits greatly from a streamlined architecture and common syntax
  - It helps greatly for the code to look similar despite whatever model is used
- In Python, the gold standard is Scikit-Learn:
  ```bash
  pip install -U scikit-learn
  ```
- In R, TidyModels is the best similar resource I've seen:
  ```r
  install.packages('tidymodels')
  ```
- Both can take a bit of time to install

<!--
## The Very Basics (In Python)
- Scikit-learn provides the `sklearn` package in Python, which is the industry standard for doing most ML in Python
- Several useful function to probably always import:
  ```python
  from sklearn.model_selection import train_test_split
  from sklearn.metrics import confusion_matrix
  ```
- Will also commonly want to import in your basic ML model
  - We will initially just look at Logistic Regression
    ```python
    from sklearn.linear_model import LogisticRegression
    ```
-->

## Making the Split (Python)
- To split your data, `train_test_split` can assist:
  ```python
  from sklearn.model_selection import train_test_split
  ```
- Include the option `test_size=frac`, where `frac` is the fraction of the data that you want to set aside for _testing_ (if omitted, Scikit-Learn defaults to 0.25)
- `train_test_split` shuffles the data before making the splits, so you don't need to worry about that
- Usage:
  ```{.python style='font-size:.95em'}
  train_df, test_df = train_test_split(df, 
                                       test_size=0.2,
                                       random_state=0)
  ```
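
As a quick sanity check, the split can be tried on a small made-up table (the `df` below is hypothetical, just 100 placeholder rows):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row table standing in for your real data
df = pd.DataFrame({"x": range(100), "label": ["a", "b"] * 50})

train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

print(len(train_df), len(test_df))               # 80 20
# No row lands in both sets
print(set(train_df.index) & set(test_df.index))  # set()
```

Keeping `random_state` fixed means the same rows end up in each set every time the code is run.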

## Making the Split (R)
- To split your data in R, several functions from `rsample` (part of `tidymodels`) are useful
- You indicate the proportion of observations you want to use for _training_
  ```r
  splits <- initial_split(df, prop=0.8)
  train_df <- training(splits)
  test_df <- testing(splits)
  ```
- Useful to also set the seed before splitting for reproducibility:
  ```r
  set.seed(num)
  ```

# Logistic Regression
## Modeling
- Within supervised machine learning for classification, there are many specific models that can be trained
- We'll end up looking at several in this class:
  - Logistic Regression (Multinomial Regression)
  - Decision Trees
  - Random Forests
- Our ML libraries make working with any of these a very similar experience!


## Binomial Logistic Regression
::::::cols
::::col
- Draws a line (or plane or hyperplane) that separates the two groups
- Closer to the division equates to less confidence in the assigned type
::::

::::col
![](../images/log_regression.svg)
::::
::::::
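
A small sketch of the "closer to the division means less confidence" idea, using two made-up groups of points. The centers, sample size, and the two probe points are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Two hypothetical groups centered at x = -2 and x = +2
X, y = make_blobs(n_samples=200, centers=[(-2, 0), (2, 0)],
                  cluster_std=1.0, random_state=0)
model = LogisticRegression().fit(X, y)

# One point close to the dividing line, one far from it
points = np.array([[0.1, 0.0], [4.0, 0.0]])
confidence = model.predict_proba(points).max(axis=1)
print(confidence)  # the second value should be much closer to 1
```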


## Behind the Scenes: Multi-Logistic Regression
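- One way to handle more than two categories is a "one-vs-rest" model (available in Scikit-Learn through `OneVsRestClassifier`; recent versions of `LogisticRegression` default to a single multinomial fit instead)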
- Binary logistic regressions are run on each category vs **all** the **other** categories
- Final assignment is determined by whichever model is most confident about that point's category
  - Confidence usually builds as you move away from the division line

![](../images/one_vs_rest.svg)
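
The one-vs-rest idea can be sketched by hand. This is only an illustration of the concept on made-up data, not a copy of what any library does internally, and it scores itself on the same points it was fit on just to keep things short:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Three hypothetical, well-separated groups labeled 0, 1, and 2
X, y = make_blobs(n_samples=300, centers=[(0, 0), (6, 0), (3, 6)],
                  cluster_std=1.0, random_state=0)
classes = np.unique(y)

# One binary model per category: "this category" (1) vs "everything else" (0)
binary_models = [LogisticRegression().fit(X, (y == c).astype(int))
                 for c in classes]

# Each model reports how confident it is that a point belongs to its category
confidences = np.column_stack([m.predict_proba(X)[:, 1]
                               for m in binary_models])

# The most confident model decides the final label
predicted = classes[confidences.argmax(axis=1)]
print("accuracy:", (predicted == y).mean())
```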

## The Logistic Regression Model (Python)
- The logistic regression model is provided directly from Scikit-Learn:
  ```python
  from sklearn.linear_model import LogisticRegression
  ```
- You need to _initialize_ a model before you can try to fit anything to it
- At its most simple:
  ```python
  model = LogisticRegression()
  ```
- Note that there are a **lot** of options that can be further provided to the model as arguments
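
For instance, two commonly adjusted options are `C` and `max_iter`. The values below are arbitrary, chosen only to show the syntax:

```python
from sklearn.linear_model import LogisticRegression

# C sets the inverse regularization strength (smaller = stronger penalty);
# max_iter caps the number of solver iterations
model = LogisticRegression(C=0.5, max_iter=1000)
print(model.get_params()["C"], model.get_params()["max_iter"])  # 0.5 1000
```

Raising `max_iter` is a common first step if the solver warns that it did not converge.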

## The Logistic Regression Model (R)
- Tidymodels distinguishes between Logistic Regression (2 categories) and Multinomial Regression (>2 categories)
- You still need to _initialize_ a model before you can try to fit anything to it
  ```r
  model <- logistic_reg() # or
  model <- multinom_reg()
  ```
- Can tweak the underlying engine that powers these, but the default is fine


## Fitting the Model
- Fitting the model is the act of iteratively improving on the fit parameters
- You need to provide your model with the training data when doing so, both the feature data and the classification labels (this is supervised, remember!)
  ```python
  model_fit = model.fit(train_df[[feature_cols]], 
                        train_df[label_col]
                        )
  ```
  ```r
  model_fit <- model %>% 
    fit(label_col ~ feature_cols, data=train_df)
  ```
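
A runnable Python sketch of the fitting step with concrete names filled in. The data, the column names `x1`, `x2`, and `label`, and the category names are all made up for illustration. Here `feature_cols` is already a list, so a single pair of brackets selects those columns:

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: two feature columns and a three-category label column
X, y = make_blobs(n_samples=300, centers=[(0, 0), (6, 0), (3, 6)],
                  random_state=0)
df = pd.DataFrame(X, columns=["x1", "x2"])
df["label"] = pd.Series(y).map({0: "spiral", 1: "elliptical", 2: "irregular"})

train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

feature_cols = ["x1", "x2"]   # a list of column names
label_col = "label"           # a single column name

model = LogisticRegression()
model_fit = model.fit(train_df[feature_cols], train_df[label_col])
print(model_fit.classes_)     # the categories the model learned
```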


# How did we do?
## Evaluating the Model
- Once the model has been fit, the fit can be used to make predictions
  - Generally, the first predictions that should be made should be made on the testing data!
    ```python
    test_df['predicted'] = (
      model_fit.predict(test_df[[feature_cols]])
    )
    ```
    ```r
    test_df <- test_df %>% 
      bind_cols(model_fit %>% predict(test_df))
    ```
  - This should return a list of label predictions
- These could be compared directly to the known labels of the testing data, or, more likely, you may want to create a confusion matrix
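
Comparing predictions directly to the known labels might look like the sketch below. The six rows and the column names `label` and `predicted` are hypothetical stand-ins for a real testing set:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical known labels and model predictions for six test galaxies
test_df = pd.DataFrame({
    "label":     ["spiral", "spiral", "elliptical",
                  "irregular", "elliptical", "spiral"],
    "predicted": ["spiral", "elliptical", "elliptical",
                  "irregular", "elliptical", "spiral"],
})

# Fraction of rows where the prediction matches the known label
manual_accuracy = (test_df["predicted"] == test_df["label"]).mean()
print(manual_accuracy)                                         # 5 of 6 correct
print(accuracy_score(test_df["label"], test_df["predicted"]))  # same value
```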

## Creating Confusion (Python)
- You could construct the confusion matrix manually, but imports can help
  ```{.python style='font-size:.9em'}
  from sklearn.metrics import confusion_matrix
  from sklearn.metrics import ConfusionMatrixDisplay as CMD
  ```
- Armed with the predictions, you can construct your confusion matrix
  ```{.python style='font-size:.9em'}
  confusion_matrix(test_df[label_col], test_df['predicted'])
  ```

  which returns the matrix as a NumPy array
- Want graphics? 
  ```{.python style='font-size:.85em'}
  CMD.from_predictions(test_df[label_col], test_df['predicted'])
  ```
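
A tiny worked example, with made-up labels, shows how to read the result: rows are the known labels and columns are the predicted labels, in the order given by `labels`:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical known labels and predictions for six test galaxies
known     = ["spiral", "spiral", "elliptical",
             "irregular", "elliptical", "spiral"]
predicted = ["spiral", "elliptical", "elliptical",
             "irregular", "elliptical", "spiral"]

# Fixing the label order makes the rows and columns unambiguous
labels = ["elliptical", "irregular", "spiral"]
cm = confusion_matrix(known, predicted, labels=labels)
print(cm)
# [[2 0 0]
#  [0 1 0]
#  [1 0 2]]
```

The single off-diagonal entry (bottom left) is the one spiral that was predicted to be elliptical.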

## Sowing Confusion (R)
- The `yardstick` library in `tidymodels` provides the `conf_mat` function
- Provides a special confusion matrix object, which can then have a variety of things done with it
  ```r
  cm <- conf_mat(test_df, label_col, .pred_class)
  ```
- Printing this will just give a text representation of the matrix
- Want graphics?
  ```r
  autoplot(cm, type='heatmap')
  ```

## Understanding Probabilities
- Classification models usually internally assign a probability to a point as to what label it should have
  - The dominant probability is what wins, and that label gets assigned
- It can be useful sometimes to see the predicted probabilities for each point, rather than the final category
  ```python
  model_fit.predict_proba(test_df[[feature_cols]])
  ```
  ```r
  predict(model_fit, test_df, type='prob')
  ```
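
A short Python sketch, on made-up three-category data, of how the probabilities relate to the final labels:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Hypothetical three-category data
X, y = make_blobs(n_samples=300, centers=[(0, 0), (6, 0), (3, 6)],
                  random_state=0)
model_fit = LogisticRegression().fit(X, y)

probs = model_fit.predict_proba(X[:5])  # one row per point, one column per label
print(probs.round(3))
print(probs.sum(axis=1))                # each row sums to 1

# The label with the largest probability matches what predict() returns
winners = model_fit.classes_[probs.argmax(axis=1)]
print(winners, model_fit.predict(X[:5]))
```

The columns of `probs` follow the order of `model_fit.classes_`.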

## Decision Boundaries
:::::cols
::::col
- It can be a useful aid to visualize where the decision boundaries lie
- This is not quite as simple as extracting the lines that bisect each region, since the decision regions will involve the areas of most confidence in a particular classification
::::
::::col
![](../images/decision_boundary.png){width=100%}
::::
:::::

## Decision Boundary (Python)
- Need to import:
  ```{.python style='font-size:.87em'}
  from sklearn.inspection import DecisionBoundaryDisplay as DBD
  ```
- Create the plot from the estimator:
  ```python
  DBD.from_estimator(model, df[[features]])
  ```
  - Unlike the confusion matrix, here the display needs both the fitted model and the feature values to predict from
  - Can also pass in other arguments, like axis labels or the actual axis you want to add the plot to


# Your Turn!
## Activity!
- The dataset [here](../demos/basic_labeling.csv) has two independent variables and then a label column that can be one of three options
- Fit a Multinomial Logistic Regression model to the data and compute the resulting confusion matrix and model accuracy