---
title: "Such Pretty Trees"
author: Jed Rembold
date: March 20, 2025
slideNumber: true
theme: tokyo-night-light
highlightjs-theme: tokyo-night-light
width: 1920
height: 1080
transition: slide
---


## Announcements
- Homework 4!
  - Due the Thursday after we return (2 weeks from today)
  - A check-in this weekend before you "check out" on Spring Break--don't forget!
- I'll be around my computer over Spring Break if questions come up, but also give yourself a break


## Recap
- Split data into training and testing sets before fitting to have something to compare against
- Logistic (and Multinomial) Regression labels categories by determining lines/planes/hyperplanes to separate groups
- The general flow of supervised classification learning looks like:
  - Split your data
  - Choose your model and initialize it
  - Fit the model to your training data
  - Use the fitted model to predict labels for your testing data
  - Evaluate how the model did: confusion matrix
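
As a reminder, the flow above might look roughly like the sketch below. The synthetic `make_blobs` data, the 25% test split, and the variable names are all assumptions standing in for a real dataset.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 300 points, 2 features, 3 labels
X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

# 1. Split your data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Choose your model and initialize it
model = LogisticRegression(max_iter=1000)

# 3. Fit the model to your training data
model.fit(X_train, y_train)

# 4. Use the fitted model to predict labels for your testing data
y_pred = model.predict(X_test)

# 5. Evaluate how the model did: rows are true labels, columns are predictions
cm = confusion_matrix(y_test, y_pred)
print(cm)
```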

## Discussing Today
- Understanding model results
  - Visualizing decision boundaries (when reasonable)
- Practice time!
- Introducing model #2: Decision Trees
- Ensembles (if time)

# Drawing Boundaries
## Decision Boundaries
:::::cols
::::col
- It can be a useful aid to visualize where the decision boundaries lie
- This is not quite as simple as extracting the lines that bisect each region, since each decision region covers the area where the model is most confident in one particular class
- Really only reasonable for data with 2 independent variables or features
::::
::::col
![](../images/decision_boundary.png){width=100%}
::::
:::::

## Decision Boundary (Python)
- Need to import:
  ```{.python style='font-size:.87em'}
  from sklearn.inspection import DecisionBoundaryDisplay as DBD
  ```
- Create the plot from the estimator:
  ```python
  DBD.from_estimator(model, df[["feat1", "feat2"]])  # the two feature columns
  ```
  - Unlike the confusion matrix, `from_estimator` needs both the fitted model and the feature values to predict from
  - Can also pass in other arguments, like axis labels or the matplotlib axis you want to add the plot to


## Decision Boundary (R)
:::{style='font-size:.9em'}
- Tidymodels doesn't really have a comparable function to DecisionBoundaryDisplay, but it can be computed directly with other tools
- Main idea is to compute a fine grid of points in feature-space that you'll use to get predictions
  ```r
  pred_df <- expand.grid(
                feat1 = seq(low, high, length.out=1000),
                feat2 = seq(low, high, length.out=1000))
  ```
- Then get predictions from your model and attach them to the grid
  ```r
  pred_df <- pred_df %>% bind_cols(predict(model_fit, pred_df))
  ```
- Plot the results with a raster plot with the fill determined by the predicted label
  ```r
  ggplot(pred_df, aes(feat1, feat2, fill=.pred_class)) +
    geom_raster(alpha = 0.5)
  ```
:::

# Practice Time!
## Activity!
- The dataset [here](../demos/basic_labeling.csv) has two independent variables and then a label column that can be one of three options
- Fit a Multinomial Logistic Regression model to the data and compute the resulting confusion matrix and model accuracy
- If time, plot the decision boundaries


# A strong, independent tree
## Why other models?
:::::cols
::::col
- We are going to look at some alternative classification models, but why?
- Logistic Regression will not classify well when the groups are separated by non-linear boundaries
- Different strengths with respect to preprocessing or feature distribution
::::
::::col
![](../images/banana_dist.png){width=80%}
::::
:::::

## A Decision Tree

![](../images/decision_tree.png){width=80%}


## Planting Trees 1 {data-transition="fade" data-transition-speed="slow"}
![](../images/banana_tree1.png){width=90%}

## Planting Trees 2 {data-transition="fade" data-transition-speed="slow"}
![](../images/banana_tree2.png){width=90%}

## Planting Trees 3 {data-transition="fade" data-transition-speed="slow"}

![](../images/banana_tree3.png){width=90%}

## Minimizing Impurity
- Two common methods used to evaluate impurity
  - Gini Index:
    $$ H_{gini} = \sum_{k\in y} p_{mk}(1-p_{mk}) $$
  - Cross-Entropy:
    $$ H_{CE} = - \sum_{k\in y} p_{mk}\log(p_{mk}) $$
  - In both:
    - $y$ are all the different classes
    - $p_{mk}$ is the proportion of points in node $m$ that are of class $k$
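
To make the two formulas concrete, here is a small pure-Python sketch that computes them from the class counts in a node. The function names are hypothetical, and scoring a split by the size-weighted average of its two children is the usual convention rather than something stated above.

```python
from math import log

def proportions(counts):
    """Turn class counts in a node into the proportions p_mk."""
    total = sum(counts)
    return [c / total for c in counts]

def gini(counts):
    """Gini index: sum over classes of p * (1 - p)."""
    return sum(p * (1 - p) for p in proportions(counts))

def cross_entropy(counts):
    """Cross-entropy: -sum of p * log(p), skipping empty classes."""
    return -sum(p * log(p) for p in proportions(counts) if p > 0)

def split_impurity(left, right, measure=gini):
    """Size-weighted average impurity of the two child nodes."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * measure(left) + (n_right / n) * measure(right)

parent = [10, 10]                  # a 50/50 node is as impure as two classes get
print(gini(parent))                # 0.5
print(cross_entropy(parent))       # log(2), about 0.693
print(split_impurity([10, 0], [0, 10]))  # a perfect split has zero impurity
```

A tree-building step then amounts to trying candidate splits and keeping the one with the lowest `split_impurity`.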


## Creating Decision Trees (Python)
- Import the classifier:
  ```python
  from sklearn.tree import DecisionTreeClassifier
  ```
- Create your model:
  ```python
  tree = DecisionTreeClassifier()
  ```
- We will mention available options in just a moment, as they can be more important here
- Everything else works the same as the logistic regression models!

## Creating Decision Trees (R)
- You already have the model as part of the parsnip library
- Create your model:
  ```r
  tree <- decision_tree(mode="classification")
  ```
- Everything else works the same as the logistic regression models!


## Visualizing Decision Trees (Python)
- Can also import a plotter for Matplotlib to sketch out nice trees
  ```python
  from sklearn.tree import plot_tree
  ```
- Then just pass the tree model after fitting it
  ```python
  plot_tree(tree)
  ```
- Can adjust the `filled` option to color the nodes, or `feature_names` to label each split with the feature's name

## Visualizing Decision Trees (R)
- You need another package to plot the decision trees nicely
  ```r
  library(rpart.plot)
  ```
- Then just pass the tree model after fitting it
  ```r
  rpart.plot(tree_fit$fit)
  ```
- Can adjust the `type` and `extra` to different numbers to further customize

## Parameter Tuning
- Decision Tree classifiers have several parameters that can be tuned to control the size and shape of the tree
- Often called _pruning_, and falls into either _pre-pruning_ or _post-pruning_ categories
  - Pre-pruning limits the tree's size as it is built (pick one, maybe two)
    
    ::::::{.cols style='align-items:flex-start'}
    ::::col
    - Python
      - `max_depth`
      - `max_leaf_nodes`
      - `min_samples_split`
      - `min_impurity_decrease`
    ::::
    
    ::::col
    - R
      - `tree_depth`
      - `min_n`
    ::::
    ::::::
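
A quick sketch of what a couple of the Python pre-pruning options do to the tree's size. The `make_moons` data and the specific limits (depth 3, 6 leaves) are arbitrary assumptions for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Noisy banana-shaped stand-in data (an assumption)
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

# No limits: the tree grows until every leaf is pure
full = DecisionTreeClassifier(random_state=0).fit(X, y)
# Pre-pruned by depth
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# Pre-pruned by number of leaves
few_leaves = DecisionTreeClassifier(max_leaf_nodes=6, random_state=0).fit(X, y)

for name, tree in [("full", full), ("max_depth=3", shallow),
                   ("max_leaf_nodes=6", few_leaves)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```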
    

## Post-Pruning
- Post-pruning takes the full tree and then proceeds to "snip" off branches that don't have much happening
- Most common approach is likely _cost complexity pruning_
  $$ R_\alpha(T) = R(T) + \alpha |T| $$
  - Where 
    - $R(T)$ is the total leaf impurity
    - $|T|$ is the number of leaf nodes
    - $\alpha$ is a free parameter that is chosen
- In Python: indicate with `ccp_alpha` when creating the tree
- In R: indicate with `cost_complexity` when creating the tree
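
One way to pick a value of $\alpha$ in Python is to ask scikit-learn for the pruning path, sketched below. The `make_moons` data is a stand-in, and grabbing the middle alpha is an arbitrary choice for illustration; in practice you would compare several values on held-out data.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Noisy stand-in data (an assumption)
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

# Grow the full tree first
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# The alphas at which the pruned tree changes, with the matching leaf impurities
path = full.cost_complexity_pruning_path(X, y)

# Arbitrary illustrative choice: an alpha from the middle of the path
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)

print("full leaves:  ", full.get_n_leaves())
print("pruned leaves:", pruned.get_n_leaves())
```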


## Instability
- A drawback of decision trees is that they are inherently unstable
- The tree you get depends heavily on which points randomly land in the training set
- Different splits might do a similarly good job of prediction, but the trees can look wildly different
- The `random_state` option (or setting a seed) in the examples has been masking this so far. Remove it and run the same trial multiple times to see the effect
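
A rough way to see this for yourself, assuming noisy `make_moons` data as a stand-in: vary only the seed of the train/test split and compare the size and accuracy of the resulting trees.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy stand-in data (an assumption)
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

results = []
for seed in range(5):
    # Only the split changes from one trial to the next
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    results.append(
        (seed, tree.get_depth(), tree.get_n_leaves(), tree.score(X_test, y_test))
    )

for seed, depth, leaves, acc in results:
    print(f"split seed {seed}: depth={depth}, leaves={leaves}, accuracy={acc:.2f}")
```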


## Feature Importance
- It can be useful to get an idea of what features are most important in constructing the tree
- Once the model has been fit, you can query the model to get this information:
  ```python
  tree.feature_importances_
  ```
- This returns an array of relative importances, one per feature, in the same order as the features you passed into the model originally
- In R you need the `vip` library
  ```r
  library(vip)
  vip(tree_fit)
  ```
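
Since the Python result is just an array of numbers, it can help to pair it back up with the feature names. A minimal sketch, assuming made-up data where one feature fully determines the label and the other is pure noise:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data (an assumption): the label depends only on the first feature
rng = np.random.default_rng(0)
informative = rng.uniform(-1, 1, size=500)
noise = rng.uniform(-1, 1, size=500)
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Importances come back in the same order as the columns of X
feature_names = ["informative", "noise"]
importances = dict(zip(feature_names, tree.feature_importances_))
for name, value in importances.items():
    print(f"{name}: {value:.2f}")
```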


## Activity
- Taking the same dataset from earlier ([here!](../demos/basic_labeling.csv)) build a classifier using a decision tree
- For several different forms of pruning, create a tree and then compare its confusion matrix to the others

# Our Powers Combined...
## Why Ensembles?
- Ensembles leverage the idea that many independent attempts at the right answer will each be off in a **random** way
- This is in fact what the Galaxy Zoo project does!
  - Most volunteers are not experts, and will make mistakes from time to time
  - But the mistakes will be random, and thus averaging them will get close to the correct answer
- You can actually do this with any machine learning model
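
The averaging argument can be checked with a tiny simulation. This sketch assumes a yes/no question, 101 fully independent voters who are each right 60% of the time, and a simple majority vote; all of those numbers are made up for illustration.

```python
import random

random.seed(0)

P_CORRECT = 0.6   # assumed chance that any single voter is right
N_VOTERS = 101    # odd, so there are no ties
N_TRIALS = 2000

def majority_is_correct():
    """Simulate one question: do more than half the voters get it right?"""
    correct_votes = sum(random.random() < P_CORRECT for _ in range(N_VOTERS))
    return correct_votes > N_VOTERS / 2

majority_accuracy = sum(majority_is_correct() for _ in range(N_TRIALS)) / N_TRIALS
print("single voter accuracy:", P_CORRECT)
print("majority vote accuracy:", majority_accuracy)
```

The catch is the independence assumption: if the voters (or models) all make the same mistakes, combining them gains much less.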


## Voting Classifiers
- You can construct your own arbitrary ensembles
- In Python:
  ```python
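  from sklearn.ensemble import VotingClassifier

  combo = VotingClassifier(
              [
                ('logreg', log_mod),  # first model
                ('dec_tree', tree)    # second model
              ],
              voting='soft')
  # fit needs the labels too; 'label' is an assumed column name
  combo.fit(training_df[['x', 'y']], training_df['label'])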
  ```
- I haven't yet found an equivalent in R, but I'd imagine it is out there
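
To see what `voting='soft'` changes compared to a plain majority ("hard") vote, here is a pure-Python sketch for a single point. The three sets of probabilities are made-up numbers, chosen so that the two schemes disagree.

```python
from collections import Counter

classes = ["A", "B"]

# Hypothetical predicted probabilities [P(A), P(B)] from three models
probas = [
    [0.55, 0.45],  # model 1 leans slightly toward A
    [0.55, 0.45],  # model 2 leans slightly toward A
    [0.10, 0.90],  # model 3 is very confident in B
]

def argmax_class(p):
    """Return the class with the largest probability."""
    return classes[p.index(max(p))]

# Hard voting: each model casts one vote for its top class
hard_votes = Counter(argmax_class(p) for p in probas)
hard_winner = hard_votes.most_common(1)[0][0]

# Soft voting: average the probabilities, then pick the top class
mean_proba = [sum(col) / len(probas) for col in zip(*probas)]
soft_winner = argmax_class(mean_proba)

print(hard_winner, soft_winner)  # A B
```

Soft voting lets one confident model outweigh two lukewarm ones, which is why it needs models that can report probabilities.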


## Gerrymandering (Voting Boundaries)
![](../images/python/VotingClassifier.png){width=100%}