Random forest

Author

Clayton Cafiero

Published

2026-06-22

Ensemble learning with random forest

Recall that with ensemble learning, rather than trying to construct the one, “perfect” classifier to solve a classification problem, we generate a number of classifiers and then look for consensus (or a majority). In the case of decision trees, we can build an ensemble using a technique called “random forest,” which was developed by Tin Kam Ho at Bell Labs in 1995 and improved by Leo Breiman, of the University of California, Berkeley, in 1996 (and later).

This is a powerful technique, and it is particularly useful for complex problems. It’s quite possible that a problem might be more than a single decision tree can handle.

Here’s how it works:

We make multiple decision trees.
We let the decision trees vote on a given problem instance to arrive at a consensus prediction.
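The voting step above can be sketched in a few lines. This is a minimal illustration, not a full implementation: the function name `majority_vote` and the example labels are hypothetical, and ties are broken in favor of the label encountered first.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted most often; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for a single problem instance.
tree_predictions = ["spam", "ham", "spam", "spam", "ham"]
print(majority_vote(tree_predictions))  # spam
```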

Of course, the devil is in the details, and the most crucial detail when it comes to constructing an ensemble is that individual members of the ensemble should be independent of one another when making predictions. Why? Because if the errors the various trees make are correlated, then they cannot correct one another when we take the consensus. This is crucial: for a random forest to work, the individual trees must be independent to the greatest extent possible.
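To see why independence matters, consider an idealized case (an assumption for illustration only): a two-class problem in which each of m trees is correct with the same probability p, and the trees err independently. The probability that the majority is correct is then a binomial tail sum. If instead the trees' errors were perfectly correlated, the vote would be right exactly as often as a single tree, that is, with probability p.

```python
from math import comb

def majority_accuracy(m, p):
    """Probability that a strict majority of m independent voters is correct,
    when each voter is correct with probability p."""
    return sum(
        comb(m, k) * p**k * (1 - p) ** (m - k)
        for k in range(m // 2 + 1, m + 1)
    )

# Each tree is only modestly better than a coin flip (p = 0.6).
for m in (1, 11, 101):
    print(m, round(majority_accuracy(m, 0.6), 3))
```

Under these assumptions the accuracy of the vote climbs as m grows, which is the benefit that correlated errors would erase. Real trees are never fully independent, so this should be read as an upper bound on what voting can buy us.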

So how do we achieve this? There are two things we can do:

“bagging” or bootstrap aggregation, and
sampling input features.

Bagging

“Bagging” is (IMHO) an unfortunate portmanteau of bootstrap aggregation. (Blame Breiman.) When we construct our random forest, bagging is used to select the data that are presented during training, as we build the trees. It means that we build a separate training data set for each decision tree, and we construct these sets by sampling, with replacement, from the training data. That is, we train each tree in our random forest on a random sample of the training data, in which some items may appear more than once and others not at all.
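Sampling with replacement can be sketched as follows. The helper name `bootstrap_sample` and the ten-item stand-in data set are hypothetical, chosen only to make the idea concrete.

```python
import random

def bootstrap_sample(data, n, rng):
    """Draw n items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(n)]

rng = random.Random(0)           # fixed seed so the sketch is repeatable
training_data = list(range(10))  # stand-in for ten training examples

# One bootstrap sample per tree; here, three trees.
samples = [bootstrap_sample(training_data, len(training_data), rng) for _ in range(3)]
for s in samples:
    print(sorted(s), "distinct items:", len(set(s)))
```

Because we sample with replacement, each tree typically sees some items several times and misses others entirely. When the sample is as large as the training set, roughly 63% of the distinct items show up in any one sample for large data sets.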

Feature selection

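Feature selection is a process whereby a random subset of features (aka “attributes”) is chosen for each tree. So, for example, if we have twenty features, tree A might have features 3, 7, 8, 10, and 19 available to it, and tree B might have features 1, 5, 11, 12, and 15 available to it. We train each tree in our random forest on a random selection of features. (In Breiman’s formulation, and in many library implementations, a fresh random subset is drawn at each split rather than once per tree; the goal of decorrelating the trees is the same.)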
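The tree A and tree B example above can be sketched like this. The helper name `sample_features` is hypothetical, and the particular indices printed depend on the seed, so they will not match the ones in the text.

```python
import random

def sample_features(num_features, k, rng):
    """Choose k distinct feature indices (1-based) out of num_features."""
    return sorted(rng.sample(range(1, num_features + 1), k))

rng = random.Random(42)  # fixed seed so the sketch is repeatable
for tree_name in ("A", "B"):
    print("tree", tree_name, "may use features", sample_features(20, 5, rng))
```

Note that features are drawn without replacement: offering the same feature to a tree twice would gain nothing.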

Random forest

By using bagging to select training data, and feature selection to select features, we can substantially increase the likelihood of our trees being independent. If trees see different data, and can be built with differing subsets of attributes, it will be less likely that their errors are correlated.

When creating a random forest, we must specify:

m, the number of trees in our forest,
n, the number of items selected from the training data, using bagging, for each tree we train,
k, the number of features available to each tree.

It is often the case that we also specify the depth limit for each tree, d, as an additional safeguard against overfitting.
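To make the roles of m, n, k, and d concrete, here is a minimal sketch that only plans a forest: for each tree it draws the bagged row indices and the feature indices, and records the depth limit. The function name `plan_forest` and its arguments `num_items` and `num_features` are hypothetical, and no tree is actually trained; a tree learner would consume each plan.

```python
import random

def plan_forest(m, n, k, d, num_items, num_features, seed=0):
    """Return one (row_indices, feature_indices, depth_limit) plan per tree."""
    rng = random.Random(seed)
    plans = []
    for _ in range(m):
        # Bagging: n row indices, sampled with replacement.
        rows = [rng.randrange(num_items) for _ in range(n)]
        # Feature selection: k distinct feature indices.
        features = sorted(rng.sample(range(num_features), k))
        plans.append((rows, features, d))
    return plans

# Three trees, eight bagged items each, two of five features, depth limit four.
for rows, features, depth in plan_forest(m=3, n=8, k=2, d=4, num_items=8, num_features=5):
    print("rows:", rows, "features:", features, "depth limit:", depth)
```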

Explainability

One of the big advantages of decision trees is that they are easily explained. When a decision tree makes a prediction, we can see exactly which decisions led to the prediction. This much-valued explainability carries over to random forests, though less directly: a forest’s prediction is a vote rather than a single path of decisions. With random forests, we can still inspect multiple trees to ascertain which features (attributes) are most salient. This can give us useful intelligence for feature engineering (to be discussed later).

Benefits

Random forests do not require pruning of trees.
Random forests give us a good measure of the relative importance of features (explainability).
Random forests tend to be highly accurate.
Random forests run efficiently on large data sets.
Random forests can handle thousands of features.
Random forests can handle missing data.

Problems

As good as they are, random forests can still overfit. We should always keep an eye on the difference between training accuracy and test accuracy.
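Keeping an eye on that difference amounts to computing two accuracies and subtracting. The sketch below uses made-up labels and predictions (all hypothetical) in which the model is perfect on its training data but no better than chance on the test data.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical labels and predictions, for illustration only.
train_true = [1, 0, 1, 1, 0, 1, 0, 0]
train_pred = [1, 0, 1, 1, 0, 1, 0, 0]  # every training item is classified correctly
test_true = [1, 0, 1, 0]
test_pred = [1, 1, 0, 0]               # only half the test items are correct

gap = accuracy(train_pred, train_true) - accuracy(test_pred, test_true)
print("train/test accuracy gap:", gap)  # 0.5
```

A large positive gap like this one is the usual warning sign of overfitting; how large is too large depends on the problem.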

If we have data with categorical features (not continuous variables), random forests are biased towards features with more categories. For example, if our features include color and we have 400 color values, then the color attribute would be favored over an attribute with fewer values like finish (matte, eggshell, gloss, and high gloss).
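One way to see this bias is to score two features that are pure noise. The sketch below assumes a split criterion of information gain with one branch per category; that choice is an assumption for illustration, as are all the names and the synthetic data. Neither feature has any real relationship to the labels, yet the many-valued one scores far higher, simply because it carves the data into many tiny groups.

```python
import math
import random
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy, in bits, of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy reduction from splitting into one group per feature value."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

rng = random.Random(0)  # fixed seed so the sketch is repeatable
num_items = 400
labels = [rng.randint(0, 1) for _ in range(num_items)]  # labels are random noise
color = [rng.randrange(400) for _ in range(num_items)]  # 400 possible values
finish = [rng.choice(["matte", "eggshell", "gloss", "high gloss"]) for _ in range(num_items)]

gain_color = information_gain(color, labels)
gain_finish = information_gain(finish, labels)
print(f"color gain:  {gain_color:.3f}")
print(f"finish gain: {gain_finish:.3f}")
```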

(The author tips his hat to TA Layla Musallam for assistance in typing up hen-scratched notes.)

No generative AI was used in producing drafts of this material. This was written the old-fashioned way. AI was used to rewrite existing pseudocode in LaTeX to produce standalone *.tex files for rendering, and for revisions toward satisfying WCAG 2.1 AA-level accessibility standards as required by UVM policy. AI may also have been used to proofread selected human-written prose. Claude 2.1 with model Sonnet 4.6. Revisions, if any, were performed by the author. AI was not used in generating any code whatsoever. All code samples and starter code are by the author only.

Reuse

CC BY-NC-SA 4.0