Introduction to Markov Decision Processes and Dynamic Programming
Judith Bütepage and Marcus Klasson
KTH, Royal Institute of Technology, Stockholm
[email protected], [email protected]
February 14, 2017
Judith Butepage and Marcus Klasson (RPL) Introduction to RL February 14, 2017 1 / 46
Overview
1. Introduction to Markov Decision Processes
   - Formal Modelling of RL Tasks
   - Value Functions
   - Bellman and his equations
   - Optimal Value Function

2. Dynamic Programming
   - Policy Evaluation
   - Policy Improvement
   - Policy Iteration
   - Value Iteration
The Agent-Environment Interaction
- $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of possible states
- $A_t \in \mathcal{A}(S_t)$, where $\mathcal{A}(S_t)$ is the set of actions available in state $S_t$
- $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ is a numerical reward
- $\pi_t(a \mid s)$ is a policy, denoting the probability of choosing action $A_t = a$ in state $S_t = s$
The agent’s goal is to maximize the total amount of reward it receives over the long run.
Help us to maximize our rewards!
The states are the slides of this lecture. The actions are your reactions. We get more reward when you understand and when you ask questions.
So raise your hand and do not get lost in this mathematical jungle!
A Short Discourse Into Multi-Armed Bandits
The agent can choose between $k$ actions and receives a reward for each action. The expected reward for taking action $a$ at time $t$ is

$$q_*(a) = \mathbb{E}[R_t \mid A_t = a].$$

If the agent has chosen actions up to time $t$, the average received reward is

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}(A_i = a)}{\sum_{i=1}^{t-1} \mathbb{1}(A_i = a)}.$$
Multi-Armed Bandits Example - Dragon Finder

We can choose the actions

$$\mathcal{A} = \{d_1, d_2, d_3\}$$

We have chosen actions and received rewards

$$A_{1:t-1} = [d_1, d_2, d_1, d_3, d_2, d_3, d_3], \quad R_{1:t-1} = [2.6, 1.1, 3.4, 6.1, 0.8, 4.6, 5.2]$$
Multi-Armed Bandits Example - Dragon Finder
We have chosen actions and received rewards

$$A_{1:t-1} = [d_1, d_2, d_1, d_3, d_2, d_3, d_3], \quad R_{1:t-1} = [2.6, 1.1, 3.4, 6.1, 0.8, 4.6, 5.2]$$

Then we have

$$Q_t(d_1) = \frac{2.6 + 3.4}{2} = 3.0, \quad Q_t(d_2) = \frac{1.1 + 0.8}{2} = 0.95, \quad Q_t(d_3) = \frac{6.1 + 4.6 + 5.2}{3} = 5.3$$
We can be greedy and exploit this estimate by choosing the action with the highest estimated value. Or we can explore our action space and choose a random action with probability $\varepsilon$.
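The sample-average computation above can be checked with a short Python sketch (the `sample_average` helper is illustrative, not from the slides):

```python
def sample_average(actions, rewards, a):
    """Sample-average estimate Q_t(a) of an action's value."""
    picked = [r for act, r in zip(actions, rewards) if act == a]
    return sum(picked) / len(picked)

# History from the dragon-finder example
A = ["d1", "d2", "d1", "d3", "d2", "d3", "d3"]
R = [2.6, 1.1, 3.4, 6.1, 0.8, 4.6, 5.2]

for d in ["d1", "d2", "d3"]:
    print(d, sample_average(A, R, d))  # 3.0, 0.95 and (up to float rounding) 5.3
```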
Multi-Armed Bandits Example - ε-greedy
Figure: Average reward over 1000 steps for ε = 0 (greedy), ε = 0.01, and ε = 0.1.

Comparing the greedy method with two ε-greedy methods (ε = 0.01 and ε = 0.1). Rewards are normally distributed as $R_d \sim \mathcal{N}(\mu_d, \sigma_d)$ with $\mu = [3, 1, 5]$ and $\sigma = [0.5, 0.25, 1]$. Each run takes $t = 1000$ steps, and results are averaged over 1000 runs.
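A minimal sketch of how one such run could be simulated, assuming the three-armed Gaussian testbed above (`run_bandit` and its incremental-average update are illustrative, not from the slides):

```python
import random

def run_bandit(epsilon, mu, sigma, steps=1000, seed=0):
    """One epsilon-greedy run on a k-armed Gaussian bandit.

    Returns the sequence of received rewards."""
    rng = random.Random(seed)
    k = len(mu)
    counts = [0] * k       # times each arm was pulled
    Q = [0.0] * k          # sample-average value estimates
    rewards = []
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                   # explore
        else:
            a = max(range(k), key=lambda i: Q[i])  # exploit
        r = rng.gauss(mu[a], sigma[a])
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]             # incremental sample average
        rewards.append(r)
    return rewards

# One run per epsilon; the slide's curves average 1000 such runs
for eps in (0.0, 0.01, 0.1):
    rewards = run_bandit(eps, mu=[3, 1, 5], sigma=[0.5, 0.25, 1])
    print(eps, sum(rewards) / len(rewards))
```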
Markov Decision Processes
A Markov Decision Process (MDP) is defined by a 5-tuple $(\mathcal{S}, \mathcal{A}, p, \mathcal{R}, \gamma)$:

- $\mathcal{S}$ is a finite set of possible states
- $\mathcal{A}(S_t)$ is a finite set of actions available in state $S_t$
- $p(s' \mid s, a)$ is the state-transition probability of reaching state $s'$ from state $s$ when taking action $a$
- $\mathcal{R}$ is a set of numerical rewards
- $\gamma$ is a discount factor, $0 \le \gamma \le 1$
A finite MDP has a finite number of states and actions.
The Valentine’s Dilemma
The final goal of the princess is to rescue her prince. However, there are obstacles on the way. Valentine's day is only ONCE a year, so she needs to be fast! For every step she gets a reward of -1, unless she meets a dragon and needs to fight it. Then the reward is -5.
Goals and Rewards
Goal: the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Reward signal: what we want to achieve, not how to achieve it.
Discounted Rewards
Episodic task: $T \in \mathbb{N}$, each episode ends in a terminal state. Continuing task: $T = \infty$.

Return: $G_t$ is some specific function of the reward sequence.
Episodic task: $G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$

Continuing task: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

$0 \le \gamma \le 1$ is called the discount rate.
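A finite discounted return can be computed directly; the following sketch uses a made-up reward sequence and $\gamma = 0.9$:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical reward sequence with gamma = 0.9:
G = discounted_return([1.0, 1.0, 1.0], 0.9)  # 1 + 0.9 + 0.81 = 2.71
```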
Unified Notation
Episodic task: $T \in \mathbb{N}$, each episode ends in a terminal state. Continuing task: $T = \infty$.

$$G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$$

$T$ can be $\infty$ and $0 \le \gamma \le 1$, but not both $T = \infty$ and $\gamma = 1$.

Myopic agent: $\gamma = 0$. Far-sighted agent: $\gamma \to 1$.
State Representations
Figure: three possible state representations (Representation 1, Representation 2, Representation 3).
A state can include sensory signals, abstract environmental information or even mental states. However, it should only contain information relevant for decision making.
The Valentine’s Dilemma - The Markov Property
Generally, the current response could depend on the entire past:

$$p(S_{t+1} = s', R_{t+1} = r \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t)$$

The Markov property assumes independence of the past given the present:

$$p(s', r \mid s, a) \doteq p(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$$
Markov Decision Processes

A Markov Decision Process is defined by a 5-tuple $(\mathcal{S}, \mathcal{A}, p, \mathcal{R}, \gamma)$:

- $\mathcal{S}$ is a finite set of possible states
- $\mathcal{A}(S_t)$ is a finite set of actions available in state $S_t$
- $p(s' \mid s, a)$ is the state-transition probability of reaching state $s'$ from state $s$ when taking action $a$
- $\mathcal{R}$ is a set of numerical rewards
- $\gamma$ is a discount factor, $0 \le \gamma \le 1$
Expected rewards for state–action pair
r(s, a) .= E[Rt+1|St = s, At = a] =ÿ
rœR
rÿ
sÕœS
p(sÕ, r |s, a)
State-transition probabilities
p(sÕ|s, a) .= p(St+1 = sÕ, |St = s, At = a) =ÿ
rœR
p(sÕ, r |s, a)
Expected rewards for state–action–next-state triple
r(s, a, sÕ) .= E[Rt+1|St = s, At = a, St+1 = sÕ] =
qrœR r p(sÕ, r |s, a)
p(sÕ|s, a)
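All three quantities can be derived from the joint dynamics $p(s', r \mid s, a)$; here is a sketch for a single state-action pair, with toy dynamics made up for illustration:

```python
# Joint dynamics p(s', r | s, a) for one hypothetical state-action pair,
# stored as {(next_state, reward): probability}. The numbers are made up.
joint = {("fight", -5.0): 0.3, ("free", -1.0): 0.7}

# r(s, a): expected reward, summing r * p(s', r | s, a) over s' and r
r_sa = sum(r * p for (s_next, r), p in joint.items())

# p(s' | s, a): marginalize the reward out of the joint
p_next = {}
for (s_next, r), p in joint.items():
    p_next[s_next] = p_next.get(s_next, 0.0) + p

# r(s, a, s'): expected reward given that s' was actually reached
r_sas = {s_next: sum(r * p for (sn, r), p in joint.items() if sn == s_next)
                 / p_next[s_next]
         for s_next in p_next}

print(r_sa)  # -5.0 * 0.3 + -1.0 * 0.7 = -2.2
```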
MDP Transition Graph - Encountering a Dragon
Figure: Transition graph and table.
States: Sm = Smashed against the wall, Fi = Fighting, Wo = Won.
Actions: A = Attacking, H = Hitting, S = Sneaking past the dragon.
Edge labels: $[p(s' \mid s, a), r(s, a, s')]$.
Value Functions
$\mathbb{E}_\pi[G_t]$ denotes the expectation of $G_t$ when following policy $\pi(a \mid s)$.

State-value function for policy $\pi$:

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right]$$

Action-value function for policy $\pi$:

$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right]$$
Bellman Equation for State–Value Functions
Figure: Richard Ernest Bellman (August 26, 1920 - March 19, 1984)
$$\begin{aligned}
v_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right] \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\Big|\, S_t = s\right] \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \left[ r + \gamma\, \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\Big|\, S_{t+1} = s'\right] \right] \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right], \quad \forall s \in \mathcal{S}
\end{aligned}$$
Bellman Equation for Action–Value functions
$$\begin{aligned}
q_\pi(s, a) &= \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right] \\
&= \dots \\
&= \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \left[ r + \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a') \right], \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A}
\end{aligned}$$
Backup Diagrams
(a) $v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right], \quad \forall s \in \mathcal{S}$

(b) $q_\pi(s, a) = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \left[ r + \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a') \right], \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A}$
Optimal Value Function
We say that policy $\pi$ is better than or equal to $\pi'$, written $\pi \ge \pi'$, iff $v_\pi(s) \ge v_{\pi'}(s)$ $\forall s \in \mathcal{S}$.

There always exists a policy $\pi$ such that $\pi \ge \pi'$ $\forall \pi'$; this is the optimal policy $\pi_*$, and

$$v_*(s) \doteq \max_\pi v_\pi(s), \quad \forall s \in \mathcal{S}$$

is the optimal state-value function, while

$$q_*(s, a) \doteq \max_\pi q_\pi(s, a), \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A}(s)$$

is the optimal action-value function.
Bellman Optimality Equation
$$\begin{aligned}
v_*(s) &= \max_{a \in \mathcal{A}(s)} q_*(s, a) \\
&= \max_a \mathbb{E}_{\pi_*}[G_t \mid S_t = s, A_t = a] \\
&= \max_a \mathbb{E}_{\pi_*}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right] \\
&= \max_a \mathbb{E}_{\pi_*}\left[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\Big|\, S_t = s, A_t = a\right] \\
&= \max_a \mathbb{E}_{\pi_*}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]
\end{aligned}$$
Bellman Optimality Equation - Backup Diagrams
(a) $v_*(s) = \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$

(b) $q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a' \in \mathcal{A}(s')} q_*(s', a') \right]$
Introduction to Dynamic Programming
In general, Dynamic Programming techniques optimize subproblems of the main problem to reach a globally optimal solution. In the context of RL, Dynamic Programming is a collection of algorithms that can compute the optimal value function of a finite MDP given a perfect model of the environment.
Evaluating a Policy fi
We have a policy $\pi(a \mid s)$ and want to compute the value function $v_\pi(s)$, $\forall s \in \mathcal{S}$. The Bellman equation can be solved directly:

$$v_\pi(s) \doteq \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$

$$\begin{aligned}
v_\pi(s_1) &= c_{1,0} + c_{1,1} v_\pi(s_1) + c_{1,2} v_\pi(s_2) + c_{1,3} v_\pi(s_3) + \dots \\
v_\pi(s_2) &= c_{2,0} + c_{2,1} v_\pi(s_1) + c_{2,2} v_\pi(s_2) + c_{2,3} v_\pi(s_3) + \dots \\
v_\pi(s_3) &= c_{3,0} + c_{3,1} v_\pi(s_1) + c_{3,2} v_\pi(s_2) + c_{3,3} v_\pi(s_3) + \dots \\
v_\pi(s_4) &= \dots
\end{aligned}$$
If the MDP is not finite we are in trouble! A large number of states and actions also makes this approach infeasible: solving the linear system has computational complexity $O(n^3)$, where $n$ is the number of states.
Policy Evaluation
Assume that the environment is a finite MDP. We can use an iterative approach:
$$\begin{aligned}
v_{k+1}(s) &\doteq \mathbb{E}_\pi[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right]
\end{aligned}$$
This update uses an operation called a full backup, and the resulting algorithm is called Iterative Policy Evaluation. It converges to the fixed point $v_k = v_\pi$ as $k \to \infty$.
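A minimal sketch of iterative policy evaluation; the MDP interface (dicts of `(prob, next_state, reward)` transitions) is an assumption for illustration, not from the slides:

```python
def policy_evaluation(states, actions, pi, p, gamma, theta=1e-8):
    """Iterative policy evaluation with full backups.

    pi[s][a] -- probability of action a in state s under the policy
    p[s][a]  -- list of (prob, next_state, reward) transitions
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(pi[s][a] * sum(prob * (r + gamma * V[s2])
                                       for prob, s2, r in p[s][a])
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # stop once a full sweep barely changes V
            return V
```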
Iterative Policy Evaluation
Running Example
Shaded squares are terminal states. Actions that would take the agent off the grid leave it in the same state.
Running Example - "Random Policy"
$$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right]$$
Policy Improvement
We have a policy $\pi(a \mid s)$ but it is not optimal. How can we improve it?

Policy improvement theorem: if $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s \in \mathcal{S}$, then the policy $\pi'$ must be as good as, or better than, $\pi$.

That is, it obtains greater or equal expected return in all states: $v_{\pi'}(s) \ge v_\pi(s)$.
Proof of Policy Improvement Theorem
$$\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
&= \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \\
&\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s] \\
&= \mathbb{E}_{\pi'}[R_{t+1} + \gamma \mathbb{E}_{\pi'}[R_{t+2} + \gamma v_\pi(S_{t+2})] \mid S_t = s] \\
&= \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s] \\
&= v_{\pi'}(s)
\end{aligned}$$
Greedy Policy
We have the state-value function $v_\pi(s)$, $\forall s \in \mathcal{S}$, and greedily choose the actions that maximize it:

$$\begin{aligned}
\pi'(s) &\doteq \arg\max_a q_\pi(s, a) \\
&= \arg\max_a \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \arg\max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]
\end{aligned}$$

If the greedy policy $\pi'$ is as good as, but not better than, $\pi$, then $v_{\pi'} = v_\pi$, $\forall s \in \mathcal{S}$, and

$$\begin{aligned}
v_{\pi'}(s) &= \max_a \mathbb{E}[R_{t+1} + \gamma v_{\pi'}(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_{\pi'}(s') \right]
\end{aligned}$$

which is the Bellman optimality equation, so both $\pi$ and $\pi'$ must be optimal.
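Acting greedily with respect to a given value function is a one-step lookahead; a sketch, assuming a dict-of-transitions MDP interface made up for illustration:

```python
def greedy_policy(V, states, actions, p, gamma):
    """Deterministic policy that is greedy w.r.t. a one-step lookahead on V.

    p[s][a] -- list of (prob, next_state, reward) transitions
    """
    pi = {}
    for s in states:
        def q(a, s=s):
            # expected one-step return of action a in state s
            return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[s][a])
        pi[s] = max(actions, key=q)
    return pi
```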
Policy Iteration
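Policy iteration alternates full policy evaluation with greedy improvement until the policy is stable. A minimal sketch, assuming a dict-of-transitions MDP interface made up for illustration:

```python
def policy_iteration(states, actions, p, gamma, theta=1e-8):
    """Alternate full policy evaluation and greedy policy improvement
    until the (deterministic) policy no longer changes.

    p[s][a] -- list of (prob, next_state, reward) transitions
    """
    def q(V, s, a):
        # expected one-step return of action a in state s under values V
        return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[s][a])

    pi = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        # Policy evaluation for the current deterministic policy
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v = q(V, s, pi[s])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Greedy policy improvement
        stable = True
        for s in states:
            best = max(actions, key=lambda a: q(V, s, a))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, V
```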
Generalized Policy Iteration
Any method that interleaves the two processes of policy evaluation and policy improvement falls under the umbrella of generalized policy iteration. The two processes can be seen as opposing forces that agree on a single joint solution in the long run.
Value Iteration
Value iteration combines policy improvement and truncated policy evaluation steps.
$$\begin{aligned}
v_{k+1}(s) &\doteq \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_k(s') \right], \quad \forall s \in \mathcal{S}
\end{aligned}$$
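Turning the Bellman optimality equation into an update rule gives the following sketch; the dict-of-transitions MDP interface is an assumption made up for illustration:

```python
def value_iteration(states, actions, p, gamma, theta=1e-8):
    """Value iteration: a max-backup sweep until values stop changing.

    p[s][a] -- list of (prob, next_state, reward) transitions
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(sum(prob * (r + gamma * V[s2])
                        for prob, s2, r in p[s][a])
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V
```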
Convergence and Termination
All methods presented up to here are only guaranteed to converge for $k \to \infty$.

However, we often get reasonable results by setting a convergence criterion such as $|V_{k+1}(s) - V_k(s)| < \theta$.

Dynamic programming methods scale polynomially in the number of states and actions. They are therefore exponentially faster than any direct search in the policy space. On today's computers, MDPs with millions of states can be solved with DP methods.
Limitations of MDPs
Take off your rose-tinted glasses: the real world is not a video game!
Limitations of MDPs
Circumvent the problem of high-dimensional state and action spaces by dividing your problem into subproblems.
Summary
In reinforcement learning we have an agent that interacts with its environment and receives rewards based on its decisions. The goal is to learn to choose actions that maximize the expected future reward.
states – states should contain all relevant information for making decisions

actions – an action brings you from state $s$ into state $s'$ according to $p(s' \mid s, a)$

rewards – an agent receives rewards for being in a state

policy – a policy is a stochastic rule for choosing actions as a function of states

Markov Decision Process – $(\mathcal{S}, \mathcal{A}, p, \mathcal{R}, \gamma)$ + Markov property

value functions – $v_\pi(s)$ and $q_\pi(s, a)$ summarize the expected reward for following a policy $\pi$

policy evaluation – given a policy $\pi$, we iteratively compute $v_\pi(s)$ $\forall s \in \mathcal{S}$

policy improvement – given $v_\pi(s)$, improve your policy $\pi$, e.g. by being greedy

policy iteration – alternate between policy evaluation and policy improvement

value iteration – combine policy evaluation and policy improvement
Questions?
References
Sutton, Richard S. and Barto, Andrew G. (2016, draft in progress). Reinforcement Learning: An Introduction. MIT Press, Cambridge.
Thanks!
Policy Evaluation - Proof Sketch
Assume we converged at K.
$$v_K(s) \doteq \mathbb{E}_\pi[R_{t+1} + \gamma v_{K-1}(S_{t+1}) \mid S_t = s]$$

$$v_K(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \underbrace{\mathbb{E}_\pi[v_{K-1}(s')]}_{\sum_a \pi(a \mid s') \sum_{s'', r} p(s'', r \mid s', a)\left[ r + \gamma \mathbb{E}_\pi[v_{K-2}(s'')] \right]} \Big]$$
Since we follow $\pi$ in every step, we effectively approximate $v_\pi$.