Decision trees: some problems and fixes

Author

Clayton Cafiero

Published

2026-06-22

Pros and cons of decision trees

Decision trees are easy to construct and easy to understand and interpret. Even non-experts can often look at a decision tree and understand how a given classification or prediction was reached.
They are robust to noisy or incomplete data.
They naturally capture non-linear relationships in data.
They can tell us which features are most significant in producing a classification or prediction.
They are computationally efficient, and they can be pruned to improve performance.

On the “con” side…

Like any machine learning approach they can overfit.
ID3 and other algorithms can produce good trees, but solving for an optimal tree is NP-complete.
They can be biased with imbalanced data.

Let’s see how we can address some of these problems or make our trees even better.

Handling imbalanced data

Let’s say we were building a model to predict wildfires. As often as we hear about them, they are still quite infrequent for any region under investigation. For example, out of 1,000 days in county X somewhere out west, we might observe one or two wildfires. That is, they occur on only 0.1–0.2% of days. If we had 10,000 days of training data (about 27 years), we’d have around 9,980 to 9,990 negative samples, but only 10 or 20 positive samples. What’s the problem with that? Our model could ignore important signals and always predict “no wildfire today” and it would have 99.8–99.9% accuracy! But this is one we don’t want to get wrong: lives may be at stake!
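To make the arithmetic concrete, here is a small sketch using the low end of the example above (10 fires in 10,000 days; the variable names are just illustrative). It scores a "model" that always predicts no fire, and also reports recall (the fraction of actual fires detected), which is a measure not introduced above but one that exposes the problem.

```python
# Counts taken from the wildfire example: 10,000 days, 10 of them with fires.
n_days = 10_000
n_fires = 10

y_true = [1] * n_fires + [0] * (n_days - n_fires)
y_pred = [0] * n_days  # a "model" that always predicts "no wildfire today"

# Accuracy: fraction of days where the prediction matches the truth.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_days

# Recall: fraction of actual fires that the model detected.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_fires

print(f"accuracy = {accuracy:.3f}, recall = {recall:.1f}")
# prints: accuracy = 0.999, recall = 0.0
```

High accuracy, yet not a single fire is caught.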

What can be done? One fix is class weights. We tell the algorithm that misclassifying some class (e.g., positive wildfire) is especially costly by increasing the weight for that class. For this we need to modify our measure of entropy.

Recall that our standard entropy is computed with

H(S) \equiv \displaystyle\sum^c_{i=1} -P_i \log_2 P_i.

We can modify this with class weights, replacing each raw class proportion with a weighted proportion:

P_i = \frac{w_i n_i}{\displaystyle\sum_{j=1}^{c} w_j n_j}

where the w_i are the class weights and the n_i are the raw counts of samples in each class.

The term \sum\limits_{j=1}^{c} w_j n_j normalizes the weighted counts so that the P_i sum to 1, and we get

H_{\text{weighted}}(S) = -\sum\limits_{i=1}^{c} \left(\frac{w_i n_i}{\sum\limits_{j=1}^{c} w_j n_j} \right) \log_2 \left(\frac{w_i n_i}{\sum\limits_{j=1}^{c} w_j n_j}\right).
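The weighted entropy formula translates almost directly into code. What follows is a minimal sketch, not a reference implementation: the function name `weighted_entropy` and the choice of weights in the usage example are assumptions made for illustration. With all weights equal to 1 it reduces to the standard entropy.

```python
from math import log2

def weighted_entropy(counts, weights=None):
    """Entropy in bits from class counts n_i and optional class weights w_i."""
    if weights is None:
        weights = [1.0] * len(counts)  # unweighted: standard entropy
    weighted = [w * n for w, n in zip(weights, counts)]
    total = sum(weighted)  # the normalizing term: sum over j of w_j * n_j
    if total == 0:
        return 0.0
    probs = [x / total for x in weighted]
    # Skip zero probabilities, using the convention 0 * log2(0) = 0.
    return -sum(p * log2(p) for p in probs if p > 0)

# Wildfire example: 9,990 negative days and 10 positive days.
counts = [9990, 10]
plain = weighted_entropy(counts)
# Hypothetical weights chosen so both classes carry equal total weight.
balanced = weighted_entropy(counts, weights=[1, 999])
print(round(plain, 4), balanced)
```

Unweighted, the set looks nearly pure (entropy of roughly 0.011 bits), so a tree has little incentive to split it. With the positive class weighted by 999, both classes carry equal total weight and the entropy is 1 bit, the maximum for two classes.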

Resampling

Resampling is a rather different approach to the same problem. Rather than apply weights to various classes, we sample differentially to balance our data.

With oversampling, we duplicate existing minority samples to increase their abundance in the training data. This is not without its drawbacks. For example, we can overfit by training on duplicate samples, and if there’s noise in the under-represented data, we amplify that noise in our training data.
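Random oversampling can be sketched in a few lines. The function below is illustrative only: its name, its parameters, and the choice to balance the classes exactly are assumptions, and it treats every label other than `minority_label` as the majority.

```python
import random

def oversample_minority(samples, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until classes are balanced."""
    rng = random.Random(seed)
    pairs = list(zip(samples, labels))
    minority = [p for p in pairs if p[1] == minority_label]
    majority = [p for p in pairs if p[1] != minority_label]
    if not minority:
        raise ValueError("no minority samples to duplicate")
    n_extra = max(0, len(majority) - len(minority))
    extra = rng.choices(minority, k=n_extra)  # sampling with replacement
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    xs = [x for x, _ in balanced]
    ys = [y for _, y in balanced]
    return xs, ys

# Toy data: 95 negative days and 5 positive days.
xs, ys = oversample_minority(list(range(100)), [1] * 5 + [0] * 95, minority_label=1)
print(ys.count(0), ys.count(1))  # prints: 95 95
```

Note that the 95 positive samples are copies of only 5 distinct ones, which is exactly why overfitting is a risk here.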

With undersampling, we reduce the number of majority class samples in our training data. That is, we select at random only a subset of the data representing the majority class. This reduces the size of our data set and reduces bias due to imbalance. However, we can also throw away useful data, or underfit, where our model never achieves high accuracy.
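Random undersampling is the mirror image. Again this is a sketch under stated assumptions: the function name and parameters are hypothetical, labels other than `minority_label` are treated as the majority, and the majority is cut down to exactly the size of the minority.

```python
import random

def undersample_majority(samples, labels, minority_label, seed=0):
    """Keep all minority samples and an equally sized random subset of the majority."""
    rng = random.Random(seed)
    pairs = list(zip(samples, labels))
    minority = [p for p in pairs if p[1] == minority_label]
    majority = [p for p in pairs if p[1] != minority_label]
    n_keep = min(len(majority), len(minority))
    kept = rng.sample(majority, k=n_keep)  # sampling without replacement
    balanced = minority + kept
    rng.shuffle(balanced)
    xs = [x for x, _ in balanced]
    ys = [y for _, y in balanced]
    return xs, ys

# Toy data: 95 negative days and 5 positive days.
xs, ys = undersample_majority(list(range(100)), [1] * 5 + [0] * 95, minority_label=1)
print(len(xs), ys.count(0), ys.count(1))  # prints: 10 5 5
```

Here 90 of the 95 negative samples are discarded, which illustrates how much potentially useful data this approach can throw away.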

Unnecessary depth

Deep trees can overfit by memorizing data, especially by memorizing noise in the data. Accordingly, we can prune our trees by either (or both) of two methods.

With pre-pruning we just stop branching after a certain cutoff or threshold is reached. This is essentially an early stopping strategy. Pruning (terminating construction of the tree along a given branch) may be triggered by

a depth threshold,
reaching a minimum number of samples in leaf nodes, or
information gain falling below some given threshold.
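The three pre-pruning triggers listed above can be combined into a single check that a tree-building routine would call before splitting a node. This is a sketch only: the function name `should_stop` and the default thresholds are hypothetical, and in practice the thresholds would be tuned for the data at hand.

```python
def should_stop(depth, n_samples, best_gain,
                max_depth=5, min_samples=10, min_gain=0.01):
    """Return True if any pre-pruning criterion says to stop splitting here."""
    if depth >= max_depth:       # depth threshold reached
        return True
    if n_samples < min_samples:  # too few samples left at this node
        return True
    if best_gain < min_gain:     # best available split is not informative enough
        return True
    return False

print(should_stop(depth=5, n_samples=200, best_gain=0.30))  # True: too deep
print(should_stop(depth=2, n_samples=4, best_gain=0.30))    # True: too few samples
print(should_stop(depth=2, n_samples=200, best_gain=0.001)) # True: negligible gain
print(should_stop(depth=2, n_samples=200, best_gain=0.30))  # False: keep splitting
```

When the check returns True, the node becomes a leaf labeled with its majority class.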

With post-pruning, we construct the whole tree first, then remove unhelpful branches. In order to do this, we evaluate each internal node and ask what would happen if we were to replace its subtree with a single leaf node (indicating the majority class). If we determine this does not have a large adverse effect on accuracy, we prune. In fact, pruning might actually improve generalization and reduce overfitting.
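One way to make post-pruning concrete is the sketch below. It rests on several assumptions not stated above: accuracy is measured on a held-out validation set, nodes are visited bottom-up, and the `Node` class, the `tolerance` parameter, and the toy feature names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    majority: int                  # majority class of training samples at this node
    feature: Optional[str] = None  # feature tested here; None means leaf
    children: dict = field(default_factory=dict)  # feature value -> child Node

    def is_leaf(self):
        return self.feature is None

def predict(node, x):
    while not node.is_leaf():
        child = node.children.get(x[node.feature])
        if child is None:          # unseen feature value: fall back to majority
            return node.majority
        node = child
    return node.majority

def accuracy(root, data):
    return sum(predict(root, x) == y for x, y in data) / len(data)

def prune(root, node, validation, tolerance=0.0):
    """Bottom-up: collapse a subtree to a leaf unless accuracy drops too much."""
    if node.is_leaf():
        return
    for child in node.children.values():
        prune(root, child, validation, tolerance)
    before = accuracy(root, validation)
    saved = (node.feature, node.children)
    node.feature, node.children = None, {}       # tentatively make it a leaf
    if accuracy(root, validation) < before - tolerance:
        node.feature, node.children = saved      # pruning hurt: restore subtree

# Toy tree: the split on "noise" is spurious; the split on "wind" is useful.
tree = Node(0, "wind", {
    "high": Node(1, "noise", {"a": Node(1), "b": Node(0)}),
    "low": Node(0),
})
validation = [
    ({"wind": "high", "noise": "a"}, 1), ({"wind": "high", "noise": "b"}, 1),
    ({"wind": "low", "noise": "a"}, 0), ({"wind": "low", "noise": "b"}, 0),
]
prune(tree, tree, validation)
print(tree.children["high"].is_leaf(), tree.is_leaf())  # prints: True False
```

In this toy case, collapsing the "noise" subtree raises validation accuracy from 0.75 to 1.0, so it is pruned, while collapsing the root would drop accuracy to 0.5, so the "wind" split is kept.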

No generative AI was used in producing drafts of this material. This was written the old-fashioned way. AI was used to rewrite existing pseudocode in LaTeX to produce standalone *.tex files for rendering, and for revisions toward satisfying WCAG 2.1 AA-level accessibility standards as required by UVM policy. AI may also have been used to proofread selected human-written prose. Claude 2.1 with model Sonnet 4.6. Revisions, if any, were performed by the author. AI was not used in generating any code whatsoever. All code samples and starter code are by the author only.

Reuse

CC BY-NC-SA 4.0