Lecture 4 · Sequential Decision Making

1the next step up

From single shots to sequences

Lectures 1–4 are a ladder of difficulty. Each rung keeps the goal — decide rationally — but makes the world harder.

🎙 the progression

We started with a single agent against an unspecified environment, single-shot. Then game theory, where the difficulty comes from the interaction between agents. But even there, the decisions are still single-shot. Now: instead of a single decision, a sequence of decisions made in time, that may influence each other.

Rung 1 · tap+

Decision theory

One agent, one shot, an indifferent environment. Pick the single best action. (Lecture 1)

Rung 2 · tap+

Game theory

Many agents, still one shot each. Difficulty = their interests interact. (Lecture 3)

Rung 3 · tap+

Sequential decisions

One agent, many decisions over time. Today's action changes tomorrow's state. (today)

Why does sequentiality matter so much? Because your decisions change the future — not just the immediate reward, but the very situations you'll face next, and even your own uncertainty about them.

★The hospital example

Imagine a decision-support system for treating patients. If you model each treatment as single-shot, you assume your advice never affects the future. But a good system improves the hospital's performance, which changes which patients arrive next. Pretending the decisions are independent gives you a suboptimal, irrational system.

The same shape shows up everywhere decisions repeat: recommender systems (Amazon/Netflix update what they show based on how you react), adaptive pricing (your trades move the asset's price), and resource allocation (how you spend oil/gas/compute now changes what's left later).

★Today's map — the "knowledge spectrum"

We'll meet three settings, sorted by how much the agent knows about its environment:

🟢 Knows everything → Markov Decision Processes (solved with dynamic programming) · 🔴 Knows nothing → Reinforcement Learning (Q-learning) · 🟡 Has a simulator → Monte Carlo Tree Search.

2the model

Markov Decision Processes

Our mathematical model for an agent interacting with an environment that changes over time.

Time is broken into discrete steps (we skip continuous time — too complex for this course). At each step the same little loop repeats:

observe state → choose action → environment transitions → receive reward…then repeat from the new state

★An MDP is a 5-part tuple ⟨S, A, P, R, γ⟩

S — the set of states the environment can be in. A — the actions the agent can take. P — the transition function: the probability of landing in state s′ if you do action a in state s. R — the reward (the utility) for acting in a state. γ — the discount factor, a number in [0,1].

🎙 why we discount the future (γ)

Imagine you can have 1000 euros today or 1000 euros in a year. You'd prefer today — maybe there's inflation, maybe the future is uncertain. Gamma models exactly this: some reward today is worth more than the same reward tomorrow.

With γ close to 1 the agent is patient (the future barely shrinks); with γ close to 0 it's myopic (only the next reward matters). You'll feel this directly in the explorer below.

The one assumption: the Markov property

★"The present is enough"

An MDP assumes the current state already contains all the information you need to predict the future. Given the present, the past is irrelevant. The prediction may still be uncertain — but adding history wouldn't make it any sharper.

🎙 when it's false — language

It is not true that the next word you say depends only on the previous word. Typically it depends on all the words before it. So language does not satisfy the Markov property — but we assume it anyway, when we can, because it makes the model tractable.

The goal is a policy, not an action

In single-shot decision theory we wanted one best action. Here, because we act again and again, we want a policy — a rule that tells the agent what to do in every state.

Policy type · tap+

Deterministic π(s)

One action per state. "If it rains, take the umbrella; if it's sunny, dress lightly." This is the sequential cousin of a pure strategy.

Policy type · tap+

Stochastic π(a|s)

A probability distribution over actions in each state. The sequential cousin of a mixed strategy from game theory.

iEpisodes: episodic vs continuing

Run a policy and you get a stream: state → action → reward → state → action → reward → …. That stream is an episode. If it always ends (chess, Go — someone wins or it's a draw), the task is episodic. If it can go on forever (your relationship with a recommender system), it's continuing.

◇ Check yourself

What does the Markov property assume?

The Markov property says the present state is a sufficient summary — given it, history is irrelevant for prediction. (Whether episodes end is the episodic-vs-continuing distinction, a separate idea.)

3what makes a state "good"

Value functions & the Bellman idea

To find a good policy we first need to measure how good a situation is. That's what value functions do.

The thing we actually want to maximise is the discounted cumulative reward — add up all future rewards, shrinking each one a bit more by γ the further off it is. Think of it as the sequential version of expected utility.

G_t = r_t + γ·r_t+1 + γ²·r_t+2 + …rewards far in the future count for less (γ < 1)

★Two value functions

State value V^π(s) = "if I start in state s and follow policy π forever, how much reward do I expect in total?" It scores how good it is to be in a state.

Action value Q^π(s,a) = "…and if I commit to action a first, then follow π?" It scores how good a particular move is in a state. (Q is the star of reinforcement learning, later.)

The Bellman relationship: value = now + later

Here's the key insight that powers every algorithm in this lecture. The value of a state can be written in terms of the value of the next state:

V(s) = immediate reward + γ · (expected value of where you land next)a recursive definition: value defined in terms of value

★The Bellman optimality equation — say this in words

If you act optimally, the value of a state is: pick the single best action, collect its immediate reward, and add the discounted value of the best place it can take you. In one line — "the value of a state = the best immediate reward plus the discounted value of the next state."

This is the professor's literal exam example. You don't memorise the symbols; you explain this.

V*(s) = max_a [ R(s,a) + γ · Σ_s′ P(s′|s,a) · V*(s′) ]the Bellman optimality equation — the "max over a" is acting optimally

🎙 from values to the policy

The optimal behaviour of the agent is greedy with respect to the optimal value: in each state, take the action that maximises the value. Once you know V*, the optimal policy falls out for free.

4setting 1 · full knowledge

Solving a known MDP

If the agent knows P and R, it can compute the optimal policy before ever acting — offline — by repeating the Bellman update. This family of methods is dynamic programming.

★Value iteration, in one breath

Start by guessing every state is worth 0 ("I expect nothing"). Then sweep through the states applying the Bellman optimality update again and again. Each sweep propagates reward information one step further back through the states. Keep going until the values barely change — that fixed point is the optimal value function. Read off the greedy policy and you're done.

Watch it happen 🤖

The professor's running example: a robot in a 4-region corridor. Reaching Region 4 (the goal) pays +10; every move costs −1. Transitions are deterministic (go right → next region, go left → previous). Press Next step to apply one Bellman pass — each cell shows its new value (and what it was), and a readable breakdown of why appears below.

🤖 Value Iteration Explorer

A robot learns the best way out of a 4-region corridor — by working out how good each region is.

How to read this. The number in each region means “how much total reward I expect if I start here and then act optimally.” We don’t know it yet, so every region starts at 0 and the guesses get better one pass at a time.

The robot earns +10 for reaching Region 4 and pays −1 per move — so stepping into the goal is worth +10−1 = +9.
Press Next step: every region asks “of my moves (▶ right / ◀ left), which gives the best reward now + γ × the next region’s current value?” and keeps that number.
Each pass lets the goal’s reward seep one region further left. When nothing changes any more, that’s the answer.

Sweep 0 patience γ 0.90

Region 1

0.00

Region 2

0.00

Region 3

0.00

Region 4 · GOAL

🏁

+10 reward

With γ = 0.9 it converges to V(1)=5.39, V(2)=7.1, V(3)=9 — and the policy is "always go right," exactly as intuition says. Slide γ down and watch the far-off goal lose its pull on the early regions.

Policy iteration — the sibling method

Same goal, different rhythm. Instead of only updating values and reading the policy at the end, policy iteration alternates two steps: (1) evaluate the current policy's values, (2) improve the policy greedily — repeat until the policy stops changing.

⇄Which to use?

Value iteration: cheaper per step, but needs more steps. Policy iteration: fewer steps, but each step costs more (it updates the policy too). Neither is universally better — and both are guaranteed to converge to the optimal policy. They're both special cases of dynamic programming.

◇ Check yourself

In value iteration, what is each sweep actually doing?

Value iteration repeatedly applies the Bellman optimality update; each sweep lets reward information flow one more state backwards, until the values reach the optimal fixed point. (Trying actions for real = RL; averaging rollouts = MCTS.)

5setting 2 · no knowledge

Reinforcement learning

Now drop the assumption. The agent does not know how the environment works — no transition model, maybe not even the rewards. The only way to learn is to interact and observe.

🎙 the core tension: explore vs exploit

Exploration means trying new actions to improve your knowledge. Exploitation means, if you know an action gives a good reward, keep using it. Only exploit and you may miss something better; only explore and you waste effort even when you already know what's best. The art is the balance.

★Q-learning, in one breath

Q-learning is value iteration's cousin for the unknown-model world. It estimates the action-value Q(s,a) directly from experience. After each real step, it nudges its current estimate toward a target = the reward you just got + the discounted value of the best next move. The size of the nudge is the learning rate α.

Q(s,a) ← Q(s,a) + α · [ r + γ·max_a′Q(s′,a′) − Q(s,a) ]the bracket is the "temporal-difference error": target minus current guess

🎙 if you've seen gradient descent…

You should recognise this update: a current estimate, plus a step in the direction you want to move. It's a kind of gradient descent — only here you don't compute a gradient, you use this temporal-difference error.

!Convergence needs more than before

With a known model, dynamic programming always converged. Q-learning converges to the optimal Q only under conditions: (1) every state–action pair is visited infinitely often (so the task is effectively continuing, not episodic), and (2) the learning rates shrink just right — Σα = ∞ but Σα² < ∞ (e.g. α = 1/k). Then, in the limit, Q-learning finds the optimal policy — the same "always go right" the robot found by dynamic programming.

◇ Check yourself

Why does reinforcement learning need to interact with the environment, while dynamic programming doesn't?

Dynamic programming knows P and R, so it computes the policy offline. RL faces an unknown environment — experience is its only source of information, which is exactly why exploration matters.

6setting 3 · a simulator

Simulators & Monte Carlo Tree Search

The realistic middle ground: you can't write down the transition model, but you can simulate it. That simulator is called a generative model.

★What's a generative model?

A black box that, given a state and an action, returns a plausible next state and reward — even if you can't write the formula. Examples: a game engine (the rules are tangled, but you can play it), a logistics or healthcare simulator, or a digital twin — a software replica of a costly real system. You can sample futures cheaply instead of touching reality.

The trouble: the number of possible futures is astronomical — you can't simulate them all. So, just like in RL, you must focus on the promising actions. That idea leads to one of the most influential algorithms in the field:

🎙 this is the AlphaGo trick

You may have heard of AlphaGo, which reached superhuman performance at Go. A key ingredient was not only deep learning, but — probably most importantly — an application of this Monte Carlo Tree Search technique.

★MCTS is "planning": online and local

Dynamic programming computes a global policy offline. Q-learning learns a global policy online. MCTS is the hybrid: online and local — it doesn't compute what to do everywhere, it figures out the best action from where you are right now, by simulating possible futures.

The four steps (repeated many times)

MCTS grows a tree whose nodes are states and whose edges are actions. Each iteration runs four steps:

Step 1 · tap+

① Selection

Walk down the existing tree, at each node picking the action that maximises the UCB score, until you hit a not-fully-explored node.

Step 2 · tap+

② Expansion

Pick one untried action, use the simulator to generate the resulting state, and add that new node to the tree.

Step 3 · tap+

③ Simulation (rollout)

From the new node, simulate a quick episode with a simple default policy (e.g. random) until a terminal state or a fixed horizon. Add up the reward.

Step 4 · tap+

④ Backpropagation

Send that reward back up the path, updating each visited node's visit count and its value (a running average of rollout returns).

After many iterations, the agent takes the best-looking action at the root. That's its locally optimal decision for right now.

★UCB — how selection balances explore/exploit

Selection picks the action maximising Q(s,a) + c·√( ln N(s) / N(s,a) ). The first term is pure exploitation (take what looks best). The second is an exploration bonus: it's large for actions you've barely tried. For a never-tried action the count is 0, so the bonus shoots to infinity — guaranteeing you give every option at least one chance.

◇ Check yourself

In MCTS's UCB rule, what happens to an action that has never been tried?

An untried action has visit count 0 in the denominator, pushing the exploration term to +∞ — UCB's built-in guarantee that nothing promising is left untested.

7zoom out

The big picture

Three settings, one question — how much does the agent know? That's the whole map. Tap each to compare.

🟢 Full model · tap+

MDP + Dynamic Programming

Knows P and R. Compute the optimal policy offline, globally (all states), via value/policy iteration. Acts greedily.

🔴 No model · tap+

Reinforcement Learning

Knows nothing. Learns a global policy online through interaction (Q-learning), balancing explore vs exploit.

🟡 Simulator · tap+

Monte Carlo Tree Search

Has a generative model. Plans online and locally — best action from the current state, by simulating futures.

✓They're not exclusive

Real systems mix them: use RL to guide the rollouts inside MCTS; use MCTS to make an RL agent more efficient; use a bit of dynamic programming where you do know the model, and learn the rest. The setting just tells you which tool fits the knowledge you have.

🎙 where the course turns next

This concludes our discussion of normative decision theory — how we should decide, if we could design a fully autonomous system. But the course is called decision support systems. So next we need to understand how humans actually make decisions — because to support them, we first need to know whether they are rational.

Lectures 1–4 answered "what is the rational decision?" across single-shot, multi-agent, and now sequential settings. From here the course pivots from deciding for people to deciding with them.

★study like the exam

The Exam Lab

The professor was unusually explicit about this lecture: explain the intuition, don't reproduce formulas. Every answer below is written that way.

📋 Format — straight from the professor

Open questions where you discuss a topic (same shape as the Advanced Data Management exam). He said plainly: "I will not ask you to write explicit formulas… I want to test that you understood what the formula means." His example question: "explain the Bellman optimality condition in intuitive terms." So: ① name it · ② say what it means in plain words · ③ give the intuition / an example.

★ His literal example · Bellman optimality

Explain the Bellman optimality condition in intuitive terms.

① WHATIt's a recursive description of the optimal value of a state in a sequential decision problem — value defined in terms of value.

② IN WORDSThe optimal value of a state equals the best immediate reward you can get there, plus the discounted value of the next state you'd land in — assuming you keep acting optimally. The "best" is a maximisation over actions: in every state you imagine taking the single action that maximises reward-now-plus-discounted-future.

③ INTUITION"How good is it to be here?" = "the best I can grab right now" + "how good the place I move to is, shrunk by γ because the future is worth a little less." Once you know these optimal values, the optimal policy is just to act greedily — in each state pick the action that maximises this quantity.

Question · What is an MDP?

What is a Markov Decision Process? Describe its components and the Markov property. Discuss.

① DEFINEAn MDP is a model of an agent interacting with a changing environment over discrete time, given by a tuple ⟨S, A, P, R, γ⟩: states, actions, a transition function (probability of the next state given state+action), a reward function (utility of acting in a state), and a discount factor γ ∈ [0,1].

② MARKOVThe key assumption is the Markov property: the current state contains all the information needed to predict the future — given the present, the past is irrelevant. It may not hold in reality (e.g. language, where the next word depends on the whole sentence), but we assume it for tractability.

③ GOALBecause decisions repeat, the goal isn't a single action but a policy (deterministic or stochastic) mapping states to actions, chosen to maximise the expected discounted cumulative reward.

Question · Value iteration

Explain how value iteration finds the optimal policy, and why it works. How does it compare to policy iteration? Discuss.

① HOWStart with all state values at 0, then repeatedly apply the Bellman optimality update. Each sweep propagates reward information one step further back through the states; the values converge to the optimal value function, from which the optimal policy is read off greedily.

② WHYRepeatedly applying the Bellman optimality update is a contraction with a unique fixed point — the optimal value function — so as long as the agent knows P and R, it's guaranteed to converge (and can be done offline, without acting).

③ VS POLICY ITERATIONPolicy iteration alternates evaluating the current policy and improving it greedily. It needs fewer iterations but each is costlier; value iteration is cheaper per step but needs more. Both are dynamic programming and both converge to the optimal policy.

Question · Reinforcement learning

When the model is unknown, what changes? Explain reinforcement learning, the exploration–exploitation trade-off, and the idea of Q-learning. Discuss.

① CHANGEIf the agent doesn't know the transition function or rewards, it can't plan offline — the only way to learn the optimal behaviour is to interact with the environment and learn from the feedback. That's reinforcement learning.

② TRADE-OFFExploration = try new actions to gain knowledge; exploitation = reuse actions known to be good. Pure exploitation may miss something better; pure exploration wastes effort. Good RL balances the two.

③ Q-LEARNINGLike value iteration but for the action-value Q(s,a), learned from experience: after each step, nudge the estimate toward "reward now + discounted value of the best next move," by a learning-rate-sized step (a temporal-difference update, like a gradient step). It converges if every state–action pair is visited infinitely often and the learning rates shrink appropriately.

Question · Monte Carlo Tree Search

Explain Monte Carlo Tree Search: its setting, its four steps, and what the UCB rule achieves. Discuss.

① SETTINGThe hybrid "planning" case: we can't write the model down but we have a generative model (simulator) — a game engine, a digital twin, etc. MCTS plans online and locally: it finds the best action from the current state, not a global policy.

② FOUR STEPSSelection (walk the tree by the UCB policy) → Expansion (add a new node for an untried action via the simulator) → Simulation/rollout (play out a quick episode with a default policy, sum the reward) → Backpropagation (update visit counts and value averages up the path). Repeat many times, then take the best action at the root.

③ UCBSelection maximises value + an exploration bonus that's large for rarely tried actions (infinite for untried ones). This is exactly the explore/exploit balance: exploit good actions, but keep sampling uncertain ones.

Question · Compare the three settings

Compare the three approaches to sequential decision making by the knowledge the agent has. Discuss.

FULL MODELMDP + dynamic programming (value/policy iteration). Knows P and R → computes the optimal policy offline, globally. Acts greedily.

NO MODELReinforcement learning (Q-learning). Knows nothing → learns a global policy online by interacting; balances explore/exploit. Policy may be stochastic since knowledge is incomplete.

SIMULATORMonte Carlo Tree Search. Has a generative model → plans online and locally by simulating futures. The three are complementary and often combined.

🗣 Say these out loud (cover the page)

• Explain the Bellman optimality condition in one plain sentence.
• Why do we need γ (the discount factor)? Give the 1000-euros example.
• What is the Markov property — and one place it fails?
• What does each value-iteration sweep do, and why does it converge?
• Value iteration vs policy iteration — one line each.
• Explore vs exploit: why does an agent need both?
• MCTS: name the four steps and say what UCB does for an untried action.
• The knowledge spectrum: full model / no model / simulator → which technique, online or offline, global or local?

When one choice shapes the next

🎯 How the professor will test this

From single shots to sequences

Decision theory

Game theory

Sequential decisions

Markov Decision Processes

The one assumption: the Markov property

The goal is a policy, not an action

Deterministic π(s)

Stochastic π(a|s)

Value functions & the Bellman idea

The Bellman relationship: value = now + later

Solving a known MDP

Watch it happen 🤖

Policy iteration — the sibling method

Reinforcement learning

Simulators & Monte Carlo Tree Search

The four steps (repeated many times)

① Selection

② Expansion

③ Simulation (rollout)

④ Backpropagation

The big picture

MDP + Dynamic Programming

Reinforcement Learning

Monte Carlo Tree Search

The Exam Lab