Decision trees: an introduction

Author

Clayton Cafiero

Published

2026-06-22

Decision Trees

A decision tree is a function that maps a sequence of decisions to a discrete-valued target. A decision tree classifies instances in some data set by partitioning them into successively smaller subsets, with the goal of producing “pure” subsets. A pure subset is one in which all elements share the same value of the discrete target used in classification. Each interior node of the tree specifies a test on some attribute. Leaf nodes are predictions or classifications.
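To make the definition concrete, here is a minimal sketch of one way such a tree could be represented in Python. The class names `Node` and `Leaf`, the function `classify`, and the attribute `has_feathers` with its two labels are all assumptions made up for illustration; nothing here is a standard library API or the only reasonable design.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    label: str  # the prediction or classification stored at this leaf


@dataclass
class Node:
    attribute: str  # the attribute tested at this interior node
    children: dict = field(default_factory=dict)  # attribute value -> subtree


def classify(tree, instance):
    """Follow the tests from the root until a leaf is reached."""
    while isinstance(tree, Node):
        value = instance[tree.attribute]  # answer to this node's test
        tree = tree.children[value]       # descend along the matching branch
    return tree.label


# A deliberately tiny tree: one test, two leaves (hypothetical labels).
tiny_tree = Node("has_feathers", {True: Leaf("bird"), False: Leaf("not a bird")})
print(classify(tiny_tree, {"has_feathers": True}))
```

An interior node holds a test and one child per possible answer, and a leaf holds only a label, which mirrors the description above.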

A naïve approach

It’s quite likely that you’ve played a game like twenty questions before. In this game, one person thinks of an object, and the others must guess what the object is. The guessers may ask up to twenty yes/no questions to help ascertain the nature of the object in question. The person thinking of the object must answer honestly.

Without realizing it, players are often constructing a decision tree. They ask questions sequentially, and narrow down the possibilities based on the answers they receive.

As a preliminary exercise to set the stage for a more formal treatment, consider this scenario. Imagine you are told you’re playing a game of twenty questions, but that the universe of possibilities is limited to the following: parakeet, raven, goldfish, shark, dog, cat, horse. How would you discover the secret animal? Take a moment and think about how you’d guess correctly from among these animals using the fewest questions. What questions would you ask, and in what order would you ask them?

Example #1

Does it have feathers?
- If yes: Is it black?
  - If yes: raven
  - If no: parakeet
- If no (does not have feathers): Is it warm-blooded?
  - If yes: Can I ride on its back?
    - If yes: horse
    - If no: Does it fetch a stick?
      - If yes: dog
      - If no: cat
  - If no (not warm-blooded): Does it have sharp teeth?
    - If yes: shark
    - If no: goldfish
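The questions in example #1 translate almost directly into nested conditionals. What follows is only a sketch: the function name `classify_example_1` and the dictionary keys (`has_feathers`, `is_black`, and so on) are names invented here to stand in for the questions, and each animal is assumed to be described by a dictionary of yes/no answers.

```python
def classify_example_1(animal):
    """Ask the questions of example #1 in order and return the animal's name."""
    if animal["has_feathers"]:
        # Feathered branch: only the two birds remain.
        return "raven" if animal["is_black"] else "parakeet"
    if animal["warm_blooded"]:
        if animal["rideable"]:
            return "horse"
        return "dog" if animal["fetches_stick"] else "cat"
    # Not feathered and not warm-blooded: only the two fish remain.
    return "shark" if animal["sharp_teeth"] else "goldfish"


# Only the answers along one path are ever consulted.
secret = {"has_feathers": False, "warm_blooded": True,
          "rideable": False, "fetches_stick": True}
print(classify_example_1(secret))
```

Notice that the function never looks at `sharp_teeth` for this animal. Which question comes next depends on the answers already given.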

Example #2

Does it have four legs?
- If yes: Does it weigh more than 100 kg?
  - If yes: horse
  - If no: Does it meow?
    - If yes: cat
    - If no: dog
- If no (does not have four legs): Does it have any legs at all?
  - If yes: Does it eat seeds?
    - If yes: parakeet
    - If no: raven
  - If no (no legs): Might a cat try to eat it?
    - If yes: goldfish
    - If no: shark

No doubt you came up with different questions, or perhaps the same or similar questions but in a different order. That’s fine, as long as your classification system works. Notice also that the questions we’d ask depend on the answers given to earlier questions. This imposes a tree-like structure on our questions. Indeed, both of these examples can be rendered quite nicely as trees.

Notice a few things here. Both sets of questions allowed us to uniquely identify the target in question: whether the secret animal was a goldfish, raven, cat, shark, horse, parakeet, or dog. The two sets of questions are also completely distinct: they share no common questions. Finally, the trees have different topologies (branching structures). Nevertheless, both seem to do well with our guessing game.
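One way to see the difference in topology is to count how many questions each tree asks before it reaches each animal. The sketch below assumes a simple encoding chosen just for this illustration: a leaf is a string, and an interior node is a tuple `(question, yes_subtree, no_subtree)`. The function name `questions_per_leaf` is likewise made up.

```python
# Examples #1 and #2, encoded as (question, yes_subtree, no_subtree) tuples.
tree_1 = ("feathers?",
          ("black?", "raven", "parakeet"),
          ("warm-blooded?",
           ("ride on its back?", "horse",
            ("fetches a stick?", "dog", "cat")),
           ("sharp teeth?", "shark", "goldfish")))

tree_2 = ("four legs?",
          ("more than 100 kg?", "horse",
           ("meows?", "cat", "dog")),
          ("any legs?",
           ("eats seeds?", "parakeet", "raven"),
           ("might a cat eat it?", "goldfish", "shark")))


def questions_per_leaf(tree, depth=0):
    """Map each animal to the number of questions asked to reach it."""
    if isinstance(tree, str):  # a leaf: no more questions
        return {tree: depth}
    _, yes_subtree, no_subtree = tree
    counts = questions_per_leaf(yes_subtree, depth + 1)
    counts.update(questions_per_leaf(no_subtree, depth + 1))
    return counts


for name, tree in (("example #1", tree_1), ("example #2", tree_2)):
    print(name, questions_per_leaf(tree))
```

Under this encoding, example #1 needs four questions to reach dog or cat but only two for either bird, while example #2 never needs more than three. Both trees identify every animal, but their shapes differ.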

In both cases, each question determines some attribute of the secret animal. Does it have four legs? Does it have sharp teeth? Is it black? These are all attributes that may or may not apply to each animal.

Now, let’s loosen things up just a little and permit questions with non-binary answers: not just “yes” or “no”, but discrete values from some set of possibilities. With this permitted, we might ask questions like this (example #3):

How many legs does it have?
- If four: Can it be litter-trained?
  - If yes: cat
  - If no: Does it have hooves?
    - If yes: horse
    - If no: dog
- If two: Quoth it “nevermore”?
  - If yes: raven
  - If no: parakeet
- If zero: Does it have bones?
  - If yes: goldfish
  - If no: shark

This can be rendered as a tree, thus:

Here we have yet another working classification, based on yet another set of entirely different questions.
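Once answers are no longer restricted to yes or no, a node simply needs one branch per possible answer. As a sketch, example #3 could be written as nested dictionaries. The keys `"question"` and `"branches"`, the attribute names, and the function `classify` are assumptions chosen for this illustration.

```python
# Example #3: the root has three branches (4, 2, or 0 legs).
tree_3 = {
    "question": "legs",
    "branches": {
        4: {"question": "litter_trainable",
            "branches": {
                True: "cat",
                False: {"question": "has_hooves",
                        "branches": {True: "horse", False: "dog"}},
            }},
        2: {"question": "quoth_nevermore",
            "branches": {True: "raven", False: "parakeet"}},
        0: {"question": "has_bones",
            "branches": {True: "goldfish", False: "shark"}},
    },
}


def classify(tree, animal):
    """Descend by looking up the animal's answer at each node; strings are leaves."""
    while isinstance(tree, dict):
        answer = animal[tree["question"]]
        tree = tree["branches"][answer]
    return tree


print(classify(tree_3, {"legs": 0, "has_bones": False}))
print(classify(tree_3, {"legs": 4, "litter_trainable": False, "has_hooves": True}))
```

The same loop would handle a yes/no tree unchanged, since a binary question is just the special case of a node with two branches.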

Before we take up the question of how we might evaluate trees—determining whether one is “better” than another—or how we might systematize the construction of trees, let’s look a little more closely at what’s going on in this process.

Let’s say we started with a sample of one individual of each of the species in question, and let’s choose a tree—say #2. Then, before asking any questions we’d have a set of seven animals: goldfish, raven, cat, shark, horse, parakeet, and dog. At each non-leaf node in the tree, some question partitions the data into subsets. Here’s a modified view of tree #2, indicating the root data set (all seven animals), and how data are partitioned as we approach leaf nodes by asking more questions.

Tree #2a (rendering example #2, showing root data and subsets obtained)

So we start with a set of samples. We ask questions about attributes. Values of attributes partition data into subsets. In the end, each leaf node represents a classification.
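The partitioning step can also be sketched in code. The example below follows the first two questions of tree #2. The table `animals`, the attribute names `four_legs` and `any_legs`, and the helper `partition` are assumptions introduced for illustration, with attribute values read off from the answers in example #2.

```python
from collections import defaultdict

# One sample per species, described by two of the attributes tree #2 asks about.
animals = {
    "goldfish": {"four_legs": False, "any_legs": False},
    "raven":    {"four_legs": False, "any_legs": True},
    "cat":      {"four_legs": True,  "any_legs": True},
    "shark":    {"four_legs": False, "any_legs": False},
    "horse":    {"four_legs": True,  "any_legs": True},
    "parakeet": {"four_legs": False, "any_legs": True},
    "dog":      {"four_legs": True,  "any_legs": True},
}


def partition(names, attribute):
    """Split a set of animal names into subsets, one per value of the attribute."""
    subsets = defaultdict(set)
    for name in names:
        subsets[animals[name][attribute]].add(name)
    return dict(subsets)


root = set(animals)                                       # all seven animals
by_four_legs = partition(root, "four_legs")               # first question
by_any_legs = partition(by_four_legs[False], "any_legs")  # asked only on the "no" side

print(by_four_legs)
print(by_any_legs)
```

Each question replaces one subset with smaller ones. Repeating this until every subset holds a single species gives the leaves of the tree.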

No generative AI was used in producing drafts of this material. This was written the old-fashioned way. AI was used to rewrite existing pseudocode in LaTeX to produce standalone *.tex files for rendering, and for revisions toward satisfying WCAG 2.1 AA-level accessibility standards as required by UVM policy. AI may also have been used to proofread selected human-written prose. Claude 2.1 with model Sonnet 4.6. Revisions, if any, were performed by the author. AI was not used in generating any code whatsoever. All code samples and starter code are by the author only.

Reuse

CC BY-NC-SA 4.0