The Dark Forest

Jed Rembold

March 18, 2026

Announcements

  • HW9 is due the Monday after break!
    • You have everything you need after today
  • Quiz 2 on Exoplanets and Galaxies two weeks from today
    • Review and study materials going out today
  • Will start a new unit the Monday we get back

Recap

  • For certain data, alternative models are necessary to capture relationships within the data
  • Decision Trees are one alternative method of creating a classifier model
    • Comprised of a conditional tree of yes/no questions
    • Each question attempts to minimize the resulting overall “impurity” of the classifications
    • Utilizing a full decision tree will often overfit the data, and thus various parameters can be chosen to “prune” the tree
    • Decision trees are unstable in that the resulting tree can heavily depend on the randomized initial training data

Discussing Today

  • Useful Tidbits
  • Combining Models
    • When and why might an ensemble of models give better results?
    • How can we easily generate new “models”?
  • Introducing and understanding random forest models

Tidbits

Selecting all but one

  • In both Python and R, we commonly need to select many feature columns, which can be cumbersome

  • Can select all “but one” fairly easily:

    model.fit(
      train_df.drop('label', axis=1), 
      train_df.label
    )
    model_fit <- model %>% 
      fit(label ~ . - label, data=train_df)

Iterations and Convergence

  • ML models operate on iterative approaches toward some optimum

  • If they move slowly toward that optimum, you might need to allow for more iterations

    model = LogisticRegression(max_iter=10000)
    model <- multinom_reg() %>% 
      set_engine("nnet", maxit=10000)
  • Can also try different solvers/engines

    LogisticRegression(solver="newton-cg")
    model <- multinom_reg() %>% 
      set_engine("glmnet", maxit=10000)

Strength in Numbers

Why Ensembles?

  • Ensembles leverage the idea that many efforts trying to get the right answer will be off in a random way
  • This is in fact what the Galaxy Zoo project does!
    • Most volunteers are not experts, and will make mistakes from time to time
    • But the mistakes will be random, and thus averaging them will get close to the correct answer
  • You can actually do this with any machine learning model

Voting Classifiers

  • You can construct your own arbitrary ensembles

  • In Python:

    combo = VotingClassifier( 
                [ 
                  ('logreg',log_mod), # first model
                  ('dec_tree',tree)   # second model
                ], 
                voting='soft')
    combo.fit(training_df[['x','y']])
  • A bit more complicated, but can be done in R with the stacks library

Gerrymandering (Voting Boundaries)

Generating Chaos

Creating Model Variations

  • Changing the random_state won’t always generate a very different model
    • And in some cases, like logistic regression, there is no random element
  • A better way is using bagging (Bootstrap Aggregation)
  • In SkLearn as BaggingClassifier
  • In R from baguette library

Bias and Variance

Applications to Ensembles

  • The ability of an ensemble of classifiers to generalize depends on:
    • The strength of the individual classifiers (how well does each individual model do at predicting a class?)
    • The inverse correlation of the models
      • 100 of essentially the same model isn’t going to gain you anything in the averaging
      • The more distinct the models are in their own unique errors, the better the ensemble will operate as a whole
  • Often worth making an individual model worse if it uncorrelates the ensemble models

Entering the Forest

Building a Forest

Randomizing Trees

  • Random forests are created by randomizing each tree in two ways:
    • Bagging the training data for each tree
    • For each split, picking a random subset of features to use
  • More trees are always better, but will slow training

Creating a Random Forest

  • In Python, random forest models come from the Scikit-learn ensemble module

    from sklearn.ensemble import RandomForestClassifier
    
    forest = RandomForestClassifier()
  • In R, they are one of Parsnips available models

    forest <- rand_forest(mode="classification")
  • Fitting, computing confusion matrices, visualizing classification boundaries, etc. proceed exactly as they have for other models

Trimming the Forest

Tuning Forests (Python)

  • How many trees in the forest?
    • n_estimators (default=100)
  • Main tuning parameter: max_features
    • Determines how many features to keep at each split
    • For classification, using about \(\sqrt{\text{n_features}}\) is best, and this is the default
  • Tree pre-pruning can still help!
    • If nothing else with model size and training time
    • Still max_depth, max_leaf_nodes, etc.

Tuning Forests (R)

  • How many trees in the forest?
    • trees (default=100)
  • Main tuning parameter: mtry
    • Determines how many features to keep at each split
    • Default is the same \(\sqrt{\text{n_features}}\)
  • Tree pre-pruning can still help!
    • min_n is the minimum number of points still in a node to be split further

Warming Up

  • More trees is always better, but at a cost to time
  • How many trees get you the most “bang for your (time) buck”?
  • Sklearn’s warm_start option lets you resume from last training point

Activity

  • Taking the same dataset from before (here!) build a classifier using a random forest
  • For several sizes of forest and pre-pruning parameters, create a forest and then compare its confusion matrix to the others.
  • If you have a time, see if you can determine the least number of trees you need while still maximizing your model’s effectiveness.

Working Time!

HW9 Working Time!

  • I’ve set aside the rest of class for you to touch base/work with your partner(s) on HW9
  • Ensure you are on track! Limit what you need to do over break!
  • Questions for me? Ask them now! :)
// reveal.js plugins // Added plugins ,