From single shots to sequences
Lectures 1–4 are a ladder of difficulty. Each rung keeps the goal — decide rationally — but makes the world harder.
We started with a single agent against an unspecified environment, single-shot. Then game theory, where the difficulty comes from the interaction between agents. But even there, the decisions are still single-shot. Now: instead of a single decision, a sequence of decisions made in time, that may influence each other.
Decision theory
One agent, one shot, an indifferent environment. Pick the single best action. (Lecture 1)
Game theory
Many agents, still one shot each. Difficulty = their interests interact. (Lecture 3)
Sequential decisions
One agent, many decisions over time. Today's action changes tomorrow's state. (today)
Why does sequentiality matter so much? Because your decisions change the future — not just the immediate reward, but the very situations you'll face next, and even your own uncertainty about them.
Imagine a decision-support system for treating patients. If you model each treatment as single-shot, you assume your advice never affects the future. But a good system improves the hospital's performance, which changes which patients arrive next. Pretending the decisions are independent gives you a suboptimal, irrational system.
The same shape shows up everywhere decisions repeat: recommender systems (Amazon/Netflix update what they show based on how you react), adaptive pricing (your trades move the asset's price), and resource allocation (how you spend oil/gas/compute now changes what's left later).
We'll meet three settings, sorted by how much the agent knows about its environment:
🟢 Knows everything → Markov Decision Processes (solved with dynamic programming) · 🔴 Knows nothing → Reinforcement Learning (Q-learning) · 🟡 Has a simulator → Monte Carlo Tree Search.
Markov Decision Processes
Our mathematical model for an agent interacting with an environment that changes over time.
Time is broken into discrete steps (we skip continuous time — too complex for this course). At each step the same little loop repeats:
S — the set of states the environment can be in. A — the actions the agent can take. P — the transition function: the probability of landing in state s′ if you do action a in state s. R — the reward (the utility) for acting in a state. γ — the discount factor, a number in [0,1].
Imagine you can have 1000 euros today or 1000 euros in a year. You'd prefer today — maybe there's inflation, maybe the future is uncertain. Gamma models exactly this: some reward today is worth more than the same reward tomorrow.
With γ close to 1 the agent is patient (the future barely shrinks); with γ close to 0 it's myopic (only the next reward matters). You'll feel this directly in the explorer below.
The one assumption: the Markov property
An MDP assumes the current state already contains all the information you need to predict the future. Given the present, the past is irrelevant. The prediction may still be uncertain — but adding history wouldn't make it any sharper.
It is not true that the next word you say depends only on the previous word. Typically it depends on all the words before it. So language does not satisfy the Markov property — but we assume it anyway, when we can, because it makes the model tractable.
The goal is a policy, not an action
In single-shot decision theory we wanted one best action. Here, because we act again and again, we want a policy — a rule that tells the agent what to do in every state.
Deterministic π(s)
One action per state. "If it rains, take the umbrella; if it's sunny, dress lightly." This is the sequential cousin of a pure strategy.
Stochastic π(a|s)
A probability distribution over actions in each state. The sequential cousin of a mixed strategy from game theory.
Run a policy and you get a stream: state → action → reward → state → action → reward → …. That stream is an episode. If it always ends (chess, Go — someone wins or it's a draw), the task is episodic. If it can go on forever (your relationship with a recommender system), it's continuing.
Value functions & the Bellman idea
To find a good policy we first need to measure how good a situation is. That's what value functions do.
The thing we actually want to maximise is the discounted cumulative reward — add up all future rewards, shrinking each one a bit more by γ the further off it is. Think of it as the sequential version of expected utility.
State value Vπ(s) = "if I start in state s and follow policy π forever, how much reward do I expect in total?" It scores how good it is to be in a state.
Action value Qπ(s,a) = "…and if I commit to action a first, then follow π?" It scores how good a particular move is in a state. (Q is the star of reinforcement learning, later.)
The Bellman relationship: value = now + later
Here's the key insight that powers every algorithm in this lecture. The value of a state can be written in terms of the value of the next state:
If you act optimally, the value of a state is: pick the single best action, collect its immediate reward, and add the discounted value of the best place it can take you. In one line — "the value of a state = the best immediate reward plus the discounted value of the next state."
This is the professor's literal exam example. You don't memorise the symbols; you explain this.
The optimal behaviour of the agent is greedy with respect to the optimal value: in each state, take the action that maximises the value. Once you know V*, the optimal policy falls out for free.
Solving a known MDP
If the agent knows P and R, it can compute the optimal policy before ever acting — offline — by repeating the Bellman update. This family of methods is dynamic programming.
Start by guessing every state is worth 0 ("I expect nothing"). Then sweep through the states applying the Bellman optimality update again and again. Each sweep propagates reward information one step further back through the states. Keep going until the values barely change — that fixed point is the optimal value function. Read off the greedy policy and you're done.
Watch it happen 🤖
The professor's running example: a robot in a 4-region corridor. Reaching Region 4 (the goal) pays +10; every move costs −1. Transitions are deterministic (go right → next region, go left → previous). Press Step to apply one Bellman sweep and watch the values fill in from the goal backwards — and the optimal policy (arrows) appear.
Goal = reach Region 4 (+10) · every move costs −1. Values start at 0 and converge to the optimal V; arrows show the greedy policy.
With γ = 0.9 it converges to V(1)=5.39, V(2)=7.1, V(3)=9 — and the policy is "always go right," exactly as intuition says. Slide γ down and watch the far-off goal lose its pull on the early regions.
Policy iteration — the sibling method
Same goal, different rhythm. Instead of only updating values and reading the policy at the end, policy iteration alternates two steps: (1) evaluate the current policy's values, (2) improve the policy greedily — repeat until the policy stops changing.
Value iteration: cheaper per step, but needs more steps. Policy iteration: fewer steps, but each step costs more (it updates the policy too). Neither is universally better — and both are guaranteed to converge to the optimal policy. They're both special cases of dynamic programming.
Reinforcement learning
Now drop the assumption. The agent does not know how the environment works — no transition model, maybe not even the rewards. The only way to learn is to interact and observe.
Exploration means trying new actions to improve your knowledge. Exploitation means, if you know an action gives a good reward, keep using it. Only exploit and you may miss something better; only explore and you waste effort even when you already know what's best. The art is the balance.
Q-learning is value iteration's cousin for the unknown-model world. It estimates the action-value Q(s,a) directly from experience. After each real step, it nudges its current estimate toward a target = the reward you just got + the discounted value of the best next move. The size of the nudge is the learning rate α.
You should recognise this update: a current estimate, plus a step in the direction you want to move. It's a kind of gradient descent — only here you don't compute a gradient, you use this temporal-difference error.
With a known model, dynamic programming always converged. Q-learning converges to the optimal Q only under conditions: (1) every state–action pair is visited infinitely often (so the task is effectively continuing, not episodic), and (2) the learning rates shrink just right — Σα = ∞ but Σα² < ∞ (e.g. α = 1/k). Then, in the limit, Q-learning finds the optimal policy — the same "always go right" the robot found by dynamic programming.
Simulators & Monte Carlo Tree Search
The realistic middle ground: you can't write down the transition model, but you can simulate it. That simulator is called a generative model.
A black box that, given a state and an action, returns a plausible next state and reward — even if you can't write the formula. Examples: a game engine (the rules are tangled, but you can play it), a logistics or healthcare simulator, or a digital twin — a software replica of a costly real system. You can sample futures cheaply instead of touching reality.
The trouble: the number of possible futures is astronomical — you can't simulate them all. So, just like in RL, you must focus on the promising actions. That idea leads to one of the most influential algorithms in the field:
You may have heard of AlphaGo, which reached superhuman performance at Go. A key ingredient was not only deep learning, but — probably most importantly — an application of this Monte Carlo Tree Search technique.
Dynamic programming computes a global policy offline. Q-learning learns a global policy online. MCTS is the hybrid: online and local — it doesn't compute what to do everywhere, it figures out the best action from where you are right now, by simulating possible futures.
The four steps (repeated many times)
MCTS grows a tree whose nodes are states and whose edges are actions. Each iteration runs four steps:
① Selection
Walk down the existing tree, at each node picking the action that maximises the UCB score, until you hit a not-fully-explored node.
② Expansion
Pick one untried action, use the simulator to generate the resulting state, and add that new node to the tree.
③ Simulation (rollout)
From the new node, simulate a quick episode with a simple default policy (e.g. random) until a terminal state or a fixed horizon. Add up the reward.
④ Backpropagation
Send that reward back up the path, updating each visited node's visit count and its value (a running average of rollout returns).
After many iterations, the agent takes the best-looking action at the root. That's its locally optimal decision for right now.
Selection picks the action maximising Q(s,a) + c·√( ln N(s) / N(s,a) ). The first term is pure exploitation (take what looks best). The second is an exploration bonus: it's large for actions you've barely tried. For a never-tried action the count is 0, so the bonus shoots to infinity — guaranteeing you give every option at least one chance.
The big picture
Three settings, one question — how much does the agent know? That's the whole map. Tap each to compare.
MDP + Dynamic Programming
Knows P and R. Compute the optimal policy offline, globally (all states), via value/policy iteration. Acts greedily.
Reinforcement Learning
Knows nothing. Learns a global policy online through interaction (Q-learning), balancing explore vs exploit.
Monte Carlo Tree Search
Has a generative model. Plans online and locally — best action from the current state, by simulating futures.
Real systems mix them: use RL to guide the rollouts inside MCTS; use MCTS to make an RL agent more efficient; use a bit of dynamic programming where you do know the model, and learn the rest. The setting just tells you which tool fits the knowledge you have.
This concludes our discussion of normative decision theory — how we should decide, if we could design a fully autonomous system. But the course is called decision support systems. So next we need to understand how humans actually make decisions — because to support them, we first need to know whether they are rational.
Lectures 1–4 answered "what is the rational decision?" across single-shot, multi-agent, and now sequential settings. From here the course pivots from deciding for people to deciding with them.
The Exam Lab
The professor was unusually explicit about this lecture: explain the intuition, don't reproduce formulas. Every answer below is written that way.
Open questions where you discuss a topic (same shape as the Advanced Data Management exam). He said plainly: "I will not ask you to write explicit formulas… I want to test that you understood what the formula means." His example question: "explain the Bellman optimality condition in intuitive terms." So: ① name it · ② say what it means in plain words · ③ give the intuition / an example.
Explain the Bellman optimality condition in intuitive terms.
What is a Markov Decision Process? Describe its components and the Markov property. Discuss.
Explain how value iteration finds the optimal policy, and why it works. How does it compare to policy iteration? Discuss.
When the model is unknown, what changes? Explain reinforcement learning, the exploration–exploitation trade-off, and the idea of Q-learning. Discuss.
Explain Monte Carlo Tree Search: its setting, its four steps, and what the UCB rule achieves. Discuss.
Compare the three approaches to sequential decision making by the knowledge the agent has. Discuss.
• Explain the Bellman optimality condition in one plain sentence.
• Why do we need γ (the discount factor)? Give the 1000-euros example.
• What is the Markov property — and one place it fails?
• What does each value-iteration sweep do, and why does it converge?
• Value iteration vs policy iteration — one line each.
• Explore vs exploit: why does an agent need both?
• MCTS: name the four steps and say what UCB does for an untried action.
• The knowledge spectrum: full model / no model / simulator → which technique, online or offline, global or local?