Such Pretty Trees

Jed Rembold

March 16, 2026

Announcements

  • Homework 8 due tonight!
  • Homework 9 is due…when?
    • Next Monday? March 23?
    • Monday after Spring Break? March 30?
  • Quiz 2 is the Wednesday after Spring Break
    • Study materials will go up Wednesday of this week
  • I’ll be around my computer Spring Break if you have questions come up, but also give yourself a break

Recap

  • Split data into training and testing sets before fitting to have something to compare against
  • Logistic (and Multinomial) Regression labels categories by determining lines/planes/hyperplanes to separate groups
  • The general flow of supervised classification learning looks like:
    • Split your data
    • Choose your model and initialize it
    • Fit the model to your training data
    • Use the fitted model to predict labels for your testing data
    • Evaluate how the model did: confusion matrix
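
The five steps above can be sketched end to end in Python. This is a minimal sketch, not the course's own code: the data come from make_blobs as a synthetic stand-in (an assumption), and the variable names are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 2 features, 3 possible labels
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# 1. Split your data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Choose your model and initialize it
model = LogisticRegression(max_iter=1000)

# 3. Fit the model to your training data
model.fit(X_train, y_train)

# 4. Use the fitted model to predict labels for your testing data
y_pred = model.predict(X_test)

# 5. Evaluate: rows are true labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
acc = accuracy_score(y_test, y_pred)
print(cm)
print(f"accuracy = {acc:.3f}")
```

Swapping in a different classifier only changes step 2; the rest of the flow stays the same.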

Discussing Today

  • Understanding model results
    • Visualizing decision boundaries (when reasonable)
  • Practice time!
  • Introducing model #2: Decision Trees
  • Ensembles (if time)

Drawing Boundaries

Decision Boundaries

  • It can be a useful aid to visualize where the decision boundaries lie
  • This is not quite as simple as extracting the lines that bisect each region, since each decision region is the area where the model is most confident in one particular classification
  • Really only reasonable for data with 2 independent variables or features

Decision Boundary (Python)

  • Need to import:

    from sklearn.inspection import DecisionBoundaryDisplay as DBD
  • Create the plot from the estimator:

    DBD.from_estimator(model, df[['feat1', 'feat2']])
    • Unlike the confusion matrix, here from_estimator needs both the fitted model and the feature values to predict from
    • Can also pass in other arguments, like axis labels or the matplotlib axis you want to add the plot to

Decision Boundary (R)

  • Tidymodels doesn’t really have a comparable function to DecisionBoundaryDisplay, but it can be computed directly with other tools

  • Main idea is to compute a fine grid of points in feature-space that you’ll use to get predictions

    pred_df <- expand.grid(
                  feat1 = seq(low, high, length.out=1000),
                  feat2 = seq(low, high, length.out=1000))
  • Then get predictions from your model and attach them to the grid

    pred_cat <- model_fit %>% predict(pred_df) %>% bind_cols(pred_df)
  • Plot the results with a raster plot with the fill determined by the predicted label

    ggplot(pred_cat, aes(feat1, feat2, fill=.pred_class)) +
      geom_raster(alpha = 0.5)

Practice Time!

Activity!

  • The dataset here has two independent variables and then a label column that can be one of three options
  • Fit a Multinomial Logistic Regression model to the data and compute the resulting confusion matrix and model accuracy
  • If time, plot the decision boundaries

A strong, independent tree

Why other models?

  • We are going to look at some alternative classification models, but why?
  • Logistic Regression is not going to classify non-linear relationships well
  • Different strengths with respect to preprocessing or feature distribution

A Decision Tree

Planting Trees 1

Planting Trees 2

Planting Trees 3

Minimizing Impurity

  • Two common methods used to evaluate impurity
    • Gini Index: \[ H_{gini} = \sum_{k\in y} p_{mk}(1-p_{mk}) \]
    • Cross-Entropy: \[ H_{CE} = - \sum_{k\in y} p_{mk}\log(p_{mk}) \]
    • In both:
      • \(y\) are all the different classes
      • \(p_{mk}\) is the distribution or proportion of points that are of class \(k\) in node \(m\)
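
Both impurity measures are easy to compute by hand for a single node. The sketch below uses hypothetical helper functions (gini and cross_entropy are illustrative names, not library calls) and assumes the natural logarithm; a different base only rescales the cross-entropy.

```python
import math


def gini(proportions):
    """Gini index: sum over classes of p * (1 - p)."""
    return sum(p * (1 - p) for p in proportions)


def cross_entropy(proportions):
    """Cross-entropy, written as p * log(1/p), which equals -p * log(p).

    Classes with p = 0 are skipped, since p * log(p) -> 0 as p -> 0.
    """
    return sum(p * math.log(1 / p) for p in proportions if p > 0)


pure = [1.0, 0.0, 0.0]     # every point in the node has the same class
mixed = [0.5, 0.25, 0.25]  # classes are mixed together in the node

for name, node in [("pure", pure), ("mixed", mixed)]:
    print(f"{name}: gini = {gini(node):.3f}, "
          f"cross-entropy = {cross_entropy(node):.3f}")
```

The pure node scores 0 on both measures, while the mixed node scores 0.625 (Gini) and about 1.040 (cross-entropy). A split is good when it produces child nodes with lower impurity than the parent.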

Creating Decision Trees (Python)

  • Import the classifier:

    from sklearn.tree import DecisionTreeClassifier
  • Create your model:

    tree = DecisionTreeClassifier()
  • We will mention available options in just a moment, as they can be more important here

  • Everything else works the same as the logistic regression models!
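
As a rough sketch of "everything else works the same", here is a tree fit and scored on synthetic stand-in data (an assumption). The criterion option selects the impurity measure; "gini" is the scikit-learn default and "entropy" is the alternative.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption): 2 features, 3 labels
X, y = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same fit/predict pattern as logistic regression, just a different model
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

print("depth:", tree.get_depth())
print("leaves:", tree.get_n_leaves())
print(f"test accuracy: {accuracy_score(y_test, y_pred):.3f}")
```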

Creating Decision Trees (R)

  • You already have the model as part of the parsnip library

  • Create your model:

    tree <- decision_tree(mode="classification")
  • Everything else works the same as the logistic regression models!

Visualizing Decision Trees (Python)

  • Can also import a plotter for Matplotlib to sketch out nice trees

    from sklearn.tree import plot_tree
  • Then just pass the tree model after fitting it

    plot_tree(tree)
  • Can adjust the filled option to color the nodes, or feature_names to show better comparisons
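
A short sketch of plot_tree with those options set. The data, the feature names, and the class names are all made up for illustration, and max_depth=2 is only there to keep the drawing small.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data (an assumption)
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

fig, ax = plt.subplots(figsize=(8, 5))
annotations = plot_tree(
    tree,
    filled=True,                       # color nodes by majority class
    feature_names=["feat1", "feat2"],  # shown in each split condition
    class_names=["a", "b", "c"],       # hypothetical label names
    ax=ax,
)
```

Call plt.show() to display the figure when running as a script.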

Visualizing Decision Trees (R)

  • You need another package to plot the decision trees nicely

    library(rpart.plot)
  • Then just pass the tree model after fitting it

    rpart.plot(tree_fit$fit)
  • Can adjust the type and extra to different numbers to further customize

Parameter Tuning

  • Decision Tree classifiers have several parameters that can be tuned to adjust the size and shape of the tree
  • This is often called pruning, and falls into either pre-pruning or post-pruning categories
    • Pre-pruning limits the tree’s size as it is built (pick one, maybe two)

      • Python
        • max_depth
        • max_leaf_nodes
        • min_samples_split
        • min_impurity_decrease
      • R
        • tree_depth
        • min_n
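
To see what the Python pre-pruning options do, one possible sketch fits the same data under a few settings and compares the resulting tree sizes. The data are synthetic with deliberately overlapping groups, and the specific parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Overlapping synthetic groups (an assumption) so the unpruned tree grows large
X, y = make_blobs(n_samples=300, centers=3, cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

settings = {
    "unpruned": {},
    "max_depth=3": {"max_depth": 3},
    "max_leaf_nodes=5": {"max_leaf_nodes": 5},
    "min_samples_split=20": {"min_samples_split": 20},
}

results = {}
for name, kwargs in settings.items():
    tree = DecisionTreeClassifier(random_state=0, **kwargs)
    tree.fit(X_train, y_train)
    results[name] = (
        tree.get_depth(),
        tree.get_n_leaves(),
        tree.score(X_test, y_test),  # accuracy on the test set
    )
    depth, leaves, acc = results[name]
    print(f"{name:22s} depth={depth:2d} leaves={leaves:3d} accuracy={acc:.3f}")
```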

Post-Pruning

  • Post-pruning takes the full tree and then proceeds to “snip” off branches that don’t have much happening
  • Most common approach is likely cost complexity pruning \[ R_\alpha(T) = R(T) + \alpha |T| \]
    • Where
      • \(R(T)\) is the total leaf impurity
      • \(|T|\) is the number of leaf nodes
      • \(\alpha\) is a free parameter that is chosen
  • In Python: indicate with ccp_alpha when creating the tree
  • In R: indicate with cost_complexity when creating the tree
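
One way to explore \(\alpha\) in Python is to ask scikit-learn for the candidate values at which the tree would change, then refit with each one. This is a sketch on synthetic stand-in data (an assumption); in practice you would pick \(\alpha\) by comparing test or cross-validated accuracy.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Overlapping synthetic groups (an assumption) so there is something to prune
X, y = make_blobs(n_samples=300, centers=3, cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alpha values at which the pruned tree changes
full_tree = DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

leaf_counts = []
for alpha in path.ccp_alphas:
    # Guard against tiny negative values caused by floating point rounding
    alpha = max(float(alpha), 0.0)
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    leaf_counts.append(pruned.get_n_leaves())
    print(f"alpha={alpha:.5f} leaves={pruned.get_n_leaves():3d} "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```

Larger values of \(\alpha\) penalize leaf nodes more heavily, so the tree shrinks as \(\alpha\) grows.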

Instability

  • A drawback of decision trees is that they are inherently unstable
  • The tree you get will depend heavily on the randomized training and test sets
  • They might do a similarly good job of prediction, but they can look wildly different
  • The random_state option (or setting a seed) in the examples has been holding this fixed so far. Remove it and run the same trial multiple times to see the variation
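
A quick way to watch the instability is to vary only the train/test split and look at how the tree changes. The sketch below uses synthetic stand-in data (an assumption) and reports the root split, depth, and leaf count for five different splits.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Overlapping synthetic groups (an assumption)
X, y = make_blobs(n_samples=300, centers=3, cluster_std=3.0, random_state=0)

shapes = []
for seed in range(5):
    # Only the split changes between trials
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=seed
    )
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # Node 0 is the root: which feature it splits on, and where
    root_feature = tree.tree_.feature[0]
    root_threshold = tree.tree_.threshold[0]
    shapes.append((tree.get_depth(), tree.get_n_leaves()))

    print(f"split seed {seed}: root splits feature {root_feature} "
          f"at {root_threshold:.2f}, depth={tree.get_depth()}, "
          f"leaves={tree.get_n_leaves()}, "
          f"accuracy={tree.score(X_test, y_test):.3f}")
```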

Feature Importance

  • It can be useful to get an idea of what features are most important in constructing the tree

  • Once the model has been fit, in Python you can query the model to get this information:

    tree.feature_importances_
  • This returns an array of relative importances, one per feature, in the same order as the features you passed into the model originally

  • In R you need the vip library

    library(vip)
    vip(tree_fit)
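
In Python, pairing the importances with feature names makes them easier to read. This sketch adds a pure-noise column to synthetic stand-in data (both are assumptions for illustration) so there is a feature the tree has little reason to use.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Two informative synthetic features plus one column of random noise
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
rng = np.random.default_rng(0)
noise = rng.normal(size=(300, 1))
X_all = np.hstack([X, noise])
feature_names = ["feat1", "feat2", "noise"]  # hypothetical names

tree = DecisionTreeClassifier(random_state=0).fit(X_all, y)

# One value per feature, in the same order as the columns of X_all
importances = tree.feature_importances_
for name, value in zip(feature_names, importances):
    print(f"{name:6s} {value:.3f}")
```

The values are normalized to sum to 1, so each one reads as a share of the total impurity reduction.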

Activity

  • Taking the same dataset from earlier (here!), build a classifier using a decision tree
  • For several different forms of pruning, create a tree and then compare its confusion matrix to the others

Our Powers Combined…

Why Ensembles?

  • Ensembles leverage the idea that many efforts trying to get the right answer will be off in a random way
  • This is in fact what the Galaxy Zoo project does!
    • Most volunteers are not experts, and will make mistakes from time to time
    • But the mistakes will be random, and thus averaging them will get close to the correct answer
  • You can actually do this with any machine learning model

Voting Classifiers

  • You can construct your own arbitrary ensembles

  • In Python:

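    from sklearn.ensemble import VotingClassifier

    combo = VotingClassifier(
                [
                  ('logreg', log_mod), # first model
                  ('dec_tree', tree)   # second model
                ],
                voting='soft')
    # fit needs the labels too; 'label' is an assumed column name
    combo.fit(training_df[['x','y']], training_df['label'])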
  • I haven’t yet found an equivalent in R, but I’d imagine it is out there
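
A self-contained sketch of the Python ensemble, run on synthetic stand-in data (an assumption) so that it works without the class dataset. With voting='soft', the ensemble averages the predicted class probabilities of its members and picks the most probable class.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Overlapping synthetic groups (an assumption)
X, y = make_blobs(n_samples=300, centers=3, cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

log_mod = LogisticRegression(max_iter=1000)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

combo = VotingClassifier(
    [
        ("logreg", log_mod),  # first model
        ("dec_tree", tree),   # second model
    ],
    voting="soft",
)
combo.fit(X_train, y_train)

# Averaged class probabilities, one row per test point
probs = combo.predict_proba(X_test)
print(f"ensemble accuracy: {combo.score(X_test, y_test):.3f}")

# The ensemble fits its own copies; the fitted members live here
for name, est in combo.named_estimators_.items():
    print(f"{name}: {est.score(X_test, y_test):.3f}")
```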

Gerrymandering (Voting Boundaries)
