Q-learning

Author

Clayton Cafiero

Published

2026-06-22


Value iteration is not the only way to generate a policy for an MDP, and there are cases where it is impractical or impossible. Value iteration requires complete transition and reward models, and in many cases we do not have them. It may also be that the environment we wish to model is too large (too many states, too many actions) to fit into memory. In cases like these, we require a different approach.

Q-learning is one such approach. The big difference is that Q-learning is model-free: we do not need a model in advance. Instead, we sample state/action pairs from the environment, observe what rewards are gained, and update our action-value estimates Q(s, a), and with them our policy, accordingly.
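The sample, observe, update cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the five-state corridor, the `step` simulator, its 0.9 success probability, and all hyperparameter values are assumptions made up for this example. The learner never reads the simulator's probabilities; it only sees sampled outcomes.

```python
import random

# Hypothetical corridor: states 0..4, start in state 0, reward +1 on reaching state 4.
N_STATES = 5
GOAL = N_STATES - 1
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(state, action, rng):
    """Simulated environment (an assumption for this sketch).

    The intended move succeeds with probability 0.9; otherwise the agent stays put.
    """
    if rng.random() < 0.9:
        delta = 1 if action == 1 else -1
        next_state = min(max(state + delta, 0), GOAL)
    else:
        next_state = state
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def q_learning(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else act greedily
            # (ties between equally good actions are broken at random).
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action, rng)
            # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a));
            # a terminal s' contributes no future value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES)]
print(greedy_policy)
```

Under these assumptions, the greedy action in every non-terminal state should come out as 1 (move right). The entry for the goal state is never updated, so its greedy action is arbitrary.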

The significant differences are:

Value iteration:

requires a model of the environment (P(s' \mid s, a) and R(s)),
plans using this model, and
updates all states in sweeps.

Q-learning:

is model-free (learns from experience),
updates only visited state-action pairs, and
can learn online from interaction.
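To make "updates only visited state-action pairs" concrete, here is a single Q-learning update worked through in Python. The state names, action names, starting estimates, and the values of \alpha and \gamma are all hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from collections import defaultdict


def q_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update in place for a single (s, a, r, s') sample."""
    best_next = max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])


actions = ("left", "right")
q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
q[("B", "left")] = 2.0
q[("B", "right")] = 4.0
q[("A", "right")] = 1.0

# The agent was in A, chose "right", received reward 1.0, and landed in B.
q_update(q, "A", "right", r=1.0, s_next="B", actions=actions)

# target       = 1.0 + 0.9 * max(2.0, 4.0) = 4.6
# new estimate = 1.0 + 0.5 * (4.6 - 1.0)   = 2.8 (up to floating-point rounding)
print(q[("A", "right")])
```

Only the entry for (A, right) changes. The estimates for B are read to form the target but are left untouched, which is the sense in which the update is local to the visited pair.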

That is, value iteration is a planning algorithm, while Q-learning is (true to its name) a learning algorithm.

	value iteration	Q-learning
model	model-based—requires P(s' \mid s, a) and R(s)	model-free—learns directly from experience, no P or R needed
what’s learned	state-value function \mathcal{U}(s)	action-value function Q(s, a)
update rule	\mathcal{U}'(s) = R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\,\mathcal{U}(s')	Q(s, a) \leftarrow Q(s, a) + \alpha\big(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big)
update style	synchronous—one full sweep over every state per iteration	incremental—one (s, a, r, s') sample at a time
source of s'	exact expectation over all s', from the known transition model	a single s' sampled by acting in (or simulating) the environment
exploration	not needed: the full model is already known	required: agent must explore (e.g., \epsilon-greedy) to visit state-action pairs
learning rate	none	\alpha \in (0, 1] controls how much each new sample updates the estimate
policy extraction	separate step, needs P: \pi^{*}(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\,\mathcal{U}^{*}(s')	direct, no model needed: \pi^{*}(s) = \arg\max_a Q^{*}(s, a)
convergence	guaranteed: the error shrinks geometrically, so any tolerance is reached after finitely many sweeps (finite S, A; \gamma < 1)	guaranteed in the limit, given infinite visits to every (s, a) and a suitably decaying \alpha
setting	planning—dynamics known in advance	reinforcement learning—dynamics unknown, learned by trial and error
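For contrast with the sampled update, the value iteration column of the table can be sketched as follows: synchronous sweeps using the update rule with R(s), followed by the separate policy extraction step that needs P. The three-state MDP below (its state names, rewards, and transition probabilities) is entirely invented for illustration, and the stopping tolerance is an assumption.

```python
# Hypothetical 3-state MDP; every number below is made up for this sketch.
STATES = ("cool", "warm", "hot")
ACTIONS = ("slow", "fast")
R = {"cool": 1.0, "warm": 2.0, "hot": -10.0}
# P[s][a] maps each successor s' to P(s' | s, a).
P = {
    "cool": {"slow": {"cool": 1.0}, "fast": {"cool": 0.5, "warm": 0.5}},
    "warm": {"slow": {"cool": 0.5, "warm": 0.5}, "fast": {"warm": 0.2, "hot": 0.8}},
    "hot": {"slow": {"hot": 1.0}, "fast": {"hot": 1.0}},
}
GAMMA = 0.9


def expected_utility(u, s, a):
    """Exact expectation over s', computed from the known transition model."""
    return sum(p * u[s2] for s2, p in P[s][a].items())


def value_iteration(tol=1e-8):
    u = {s: 0.0 for s in STATES}
    sweeps = 0
    while True:
        # Synchronous sweep: every state is updated from the previous estimates.
        new_u = {
            s: R[s] + GAMMA * max(expected_utility(u, s, a) for a in ACTIONS)
            for s in STATES
        }
        sweeps += 1
        delta = max(abs(new_u[s] - u[s]) for s in STATES)
        u = new_u
        if delta < tol:
            return u, sweeps


def extract_policy(u):
    """Separate extraction step: argmax over actions of the expected utility."""
    return {s: max(ACTIONS, key=lambda a: expected_utility(u, s, a)) for s in STATES}


utilities, sweeps = value_iteration()
policy = extract_policy(utilities)
print(utilities, policy, sweeps)
```

For this made-up MDP the utilities settle near 14.5 (cool), 15.5 (warm), and -100 (hot), and the extracted policy is "fast" in cool and "slow" in warm. Both actions are equivalent in hot, so the choice there is arbitrary. No exploration and no learning rate appear anywhere, because every expectation is computed exactly from P.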

When would we use one rather than the other?

The knowledge requirements of value iteration and Q-learning differ. All other things being equal, we’d use value iteration when we have complete reward and transition models. If we did not have this information, we’d need to take a sampling approach, and thus, we’d use Q-learning.

Another thing to consider is risk. If we cannot afford costly mistakes on the part of our agent, a sample-and-explore approach could be dangerous. In situations where there is considerable risk and a model is available, value iteration is preferred.

Value iteration:

Factory automation with known machine capabilities
Traffic light timing with known traffic patterns
Inventory management with known supply chains
Robot motion planning with known physics models
Game AI with known rules (e.g., chess)
Industrial processes with known dynamics
Financial planning with well-understood market models

Q-learning:

Recommendation systems learning from user interactions
Autonomous vehicles adapting to different conditions
Trading algorithms learning market patterns
Robot learning in unstructured, dynamic environments
Game AI for complex games (Atari, RPGs)
Customer behavior prediction
Systems with dynamics that are hard to model

The key questions to ask when choosing:

Do I have an accurate model?
Can I afford exploration?
Which is more expensive: computation or real-world interaction?
How critical are mistakes during learning?
How dynamic is the environment?

These considerations will guide which approach is more suitable for any specific problem.

No generative AI was used in producing drafts of this material. This was written the old-fashioned way. AI was used to rewrite existing pseudocode in LaTeX to produce standalone *.tex files for rendering, and for revisions toward satisfying WCAG 2.1 AA-level accessibility standards as required by UVM policy. AI may also have been used to proofread selected human-written prose. Claude 2.1 with model Sonnet 4.6. Revisions, if any, were performed by the author. AI was not used in generating any code whatsoever. All code samples and starter code are by the author only.

Reuse

CC BY-NC-SA 4.0