Q-learning

Author: Clayton Cafiero

Published: 2026-06-22

Value iteration is not the only way to generate a policy for an MDP, and there are cases where it is impractical or impossible. Value iteration requires complete transition and reward models, and in many cases we do not have that information. The environment we wish to model may also be too large (too many states, too many actions) to fit into memory. In cases like these, we require a different approach.

Q-learning is one such approach. The big difference is that Q-learning is model-free: we do not need a model in advance. Instead, we sample state/action pairs from the environment, observe what rewards are gained and which states follow, and update our action-value estimates (and with them our policy) accordingly.
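The sample-and-update step described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `q_update`, the dictionary-backed table, the state and action labels, and the default values of `alpha` (learning rate) and `gamma` (discount factor) are all invented for illustration.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update for the sample (s, a, r, s_next)."""
    # Best value currently estimated for any action in the next state.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    # Move Q(s, a) a fraction alpha toward the sampled target.
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical sample: in state "A" we took "right", got reward 1.0, landed in "B".
Q = defaultdict(float)  # every estimate starts at 0.0
actions = ["left", "right"]
new_value = q_update(Q, s="A", a="right", r=1.0, s_next="B", actions=actions)
# new_value is 0.5, since 0 + 0.5 * (1.0 + 0.9 * 0 - 0) = 0.5
```

Note that the update touches only the one (s, a) entry that was actually sampled, and that no transition or reward model appears anywhere in the function.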

The significant differences are:

Value iteration:

  • requires a model of the environment (P(s' \mid s, a) and R(s)),
  • plans using this model, and
  • updates all states in sweeps.

Q-learning:

  • is model-free (learns from experience),
  • updates only visited state-action pairs, and
  • can learn online from interaction.

That is, value iteration is a planning algorithm, while Q-learning is (true to its name) a learning algorithm.
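The planning side of this contrast can be sketched as follows. This is a minimal sketch, assuming a hypothetical two-state MDP whose states, actions, transition probabilities, rewards, discount factor `gamma`, and stopping tolerance `tol` are all invented for illustration. It uses rewards of the form R(s) and shows that both the sweeps and the policy extraction read directly from the model.

```python
# Hypothetical two-state MDP, invented for illustration only.
states = ["A", "B"]
actions = ["stay", "go"]
R = {"A": 0.0, "B": 1.0}  # reward model R(s)
# Transition model: P[(s, a)] maps each next state s' to P(s' | s, a).
P = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"A": 0.2, "B": 0.8},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 0.8, "B": 0.2},
}
gamma = 0.9


def expected_utility(P, U, s, a):
    """Exact expectation of U over all next states, taken from the model."""
    return sum(p * U[s2] for s2, p in P[(s, a)].items())


def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    U = {s: 0.0 for s in states}
    while True:
        # Synchronous sweep: every new value is computed from the old U.
        U_new = {
            s: R[s] + gamma * max(expected_utility(P, U, s, a) for a in actions)
            for s in states
        }
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new
        U = U_new


def extract_policy(states, actions, P, U):
    """Policy extraction is a separate step, and it needs P."""
    return {
        s: max(actions, key=lambda a: expected_utility(P, U, s, a))
        for s in states
    }


U = value_iteration(states, actions, P, R, gamma)
policy = extract_policy(states, actions, P, U)
```

In this toy model the extracted policy is to "go" from A and "stay" in B, and U("B") approaches 1 / (1 - 0.9) = 10. The agent never acts in the environment; everything is computed from P and R.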

| | value iteration | Q-learning |
|---|---|---|
| model | model-based: requires P(s' \mid s, a) and R(s) | model-free: learns directly from experience, no P or R needed |
| what’s learned | state-value function \mathcal{U}(s) | action-value function Q(s, a) |
| update rule | \mathcal{U}'(s) = R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\,\mathcal{U}(s') | Q(s, a) \leftarrow Q(s, a) + \alpha\big(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big) |
| update style | synchronous: one full sweep over every state per iteration | incremental: one (s, a, r, s') sample at a time |
| source of s' | exact expectation over all s', from the known transition model | a single s' sampled by acting in (or simulating) the environment |
| exploration | not needed: the full model is already known | required: agent must explore (e.g., \epsilon-greedy) to visit state-action pairs |
| learning rate | none | \alpha \in (0, 1] controls how much each new sample updates the estimate |
| policy extraction | separate step, needs P: \pi^{*}(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\,\mathcal{U}^{*}(s') | direct, no model needed: \pi^{*}(s) = \arg\max_a Q^{*}(s, a) |
| convergence | guaranteed to within any chosen tolerance after finitely many sweeps (finite S, A; \gamma < 1) | guaranteed in the limit, given infinite visits to every (s, a) and a suitably decaying \alpha |
| setting | planning: dynamics known in advance | reinforcement learning: dynamics unknown, learned by trial and error |
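The Q-learning column can be sketched end to end on a toy problem. This is a minimal sketch under stated assumptions: the four-state corridor environment, the `step` function, and the values of `episodes`, `alpha`, `gamma`, `epsilon`, and `seed` are hypothetical and chosen only for illustration. A constant learning rate is used because this toy environment is deterministic; the convergence guarantee in the table assumes a suitably decaying \alpha in general.

```python
import random

# Hypothetical 4-state corridor, invented for illustration: states 0..3,
# state 3 is terminal, and entering it yields reward 1.
N_STATES = 4
ACTIONS = [-1, +1]  # step left, step right
GOAL = N_STATES - 1


def step(s, a):
    """The environment. The agent only sees samples from it, never P or R."""
    s_next = min(max(s + a, 0), GOAL)
    r = 1.0 if s_next == GOAL else 0.0
    return r, s_next


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # Terminal-state entries are never updated, so they stay at 0.
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            if rng.random() < epsilon:
                # Explore: pick any action uniformly at random.
                a = rng.choice(ACTIONS)
            else:
                # Exploit: pick a best-known action, breaking ties at random.
                best = max(Q[(s, a2)] for a2 in ACTIONS)
                a = rng.choice([a2 for a2 in ACTIONS if Q[(s, a2)] == best])
            r, s_next = step(s, a)
            # Incremental update from the single sample (s, a, r, s_next).
            best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q


def greedy_policy(Q):
    """Direct policy extraction: argmax over Q, no model needed."""
    return {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}


Q = q_learning()
policy = greedy_policy(Q)
```

In this corridor the learned greedy policy steps right (+1) from every non-terminal state. Only the pairs the agent actually visits are ever updated, which is why some exploration is needed.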


When would we use one rather than the other?

The knowledge requirements of value iteration and Q-learning differ. All other things being equal, we’d use value iteration when we have complete reward and transition models. If we did not have this information, we’d need to take a sampling approach, and thus, we’d use Q-learning.

Another thing to consider is risk. If we cannot afford costly mistakes on the part of our agent, a sample-and-explore approach could be dangerous. In situations where there’s considerable risk and a model is available, value iteration is preferred.

Value iteration:

  • Factory automation with known machine capabilities
  • Traffic light timing with known traffic patterns
  • Inventory management with known supply chains
  • Robot motion planning with known physics models
  • Game AI with known rules and a state space small enough to enumerate (e.g., tic-tac-toe)
  • Industrial processes with known dynamics
  • Financial planning with well-understood market models

Q-learning:

  • Recommendation systems learning from user interactions
  • Autonomous vehicles adapting to different conditions
  • Trading algorithms learning market patterns
  • Robot learning in unstructured, dynamic environments
  • Game AI for complex games (Atari, RPGs)
  • Customer behavior prediction
  • Systems with dynamics that are hard to model

The key questions to ask when choosing:

  1. Do I have an accurate model?
  2. Can I afford exploration?
  3. Which is more expensive: computation or real-world interaction?
  4. How critical are mistakes during learning?
  5. How dynamic is the environment?

These considerations will guide which approach is more suitable for any specific problem.

Copyright © 2023–2026 Clayton Cafiero

No generative AI was used in producing drafts of this material. This was written the old-fashioned way. AI was used to rewrite existing pseudocode in LaTeX to produce standalone *.tex files for rendering, and for revisions toward satisfying WCAG 2.1 AA-level accessibility standards as required by UVM policy. AI may also have been used to proofread selected human-written prose. Claude 2.1 with model Sonnet 4.6. Revisions, if any, were performed by the author. AI was not used in generating any code whatsoever. All code samples and starter code are by the author only.