Markov Decision Processes: Infinite Horizon Problems
Alan Fern *
* Based in part on slides by Craig Boutilier and Daniel Weld
What is a solution to an MDP?

MDP Planning Problem:
  Input: an MDP (S, A, R, T)
  Output: a policy that achieves an "optimal value"

- This depends on how we define the value of a policy
- There are several choices, and the solution algorithms depend on the choice
- We will consider two common choices:
  - Finite-Horizon Value
  - Infinite-Horizon Discounted Value
Discounted Infinite Horizon MDPs

- Defining value as total reward is problematic with infinite horizons (r1 + r2 + r3 + r4 + ...)
  - many or all policies have infinite expected reward
  - some MDPs are OK (e.g., zero-cost absorbing states)
- "Trick": introduce a discount factor 0 ≤ β < 1
  - future rewards are discounted by β per time step
- Note:

$$V^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \beta^{t} r^{t} \,\middle|\, \pi, s\right]$$

$$V^{\pi}(s) \le E\left[\sum_{t=0}^{\infty} \beta^{t} R^{\max}\right] = \frac{1}{1-\beta}\, R^{\max} \qquad \text{(Bounded Value)}$$

- Motivation: economic? probability of death? convenience?
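The bound is just a geometric series; for instance, with $\beta = 0.9$ and $R^{\max} = 1$, no policy can achieve value greater than $1/(1-0.9) = 10$. Spelled out:

$$\sum_{t=0}^{\infty} \beta^{t} R^{\max} = R^{\max} \sum_{t=0}^{\infty} \beta^{t} = \frac{R^{\max}}{1-\beta} \qquad (0 \le \beta < 1)$$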
Notes: Discounted Infinite Horizon
- Optimal policies are guaranteed to exist (Howard, 1960)
  - i.e., there is a policy that maximizes value at each state
- Furthermore, there is always an optimal stationary policy
  - Intuition: why would we change the action at s at a new time, when there is always forever ahead?
- We define $V^{*}(s)$ to be the optimal value function.
  - That is, $V^{*}(s) = V^{\pi}(s)$ for some optimal stationary $\pi$
Computational Problems
- Policy Evaluation
  - Given a policy $\pi$ and an MDP, compute $V^{\pi}$
- Policy Optimization
  - Given an MDP, compute an optimal policy $\pi^{*}$ and $V^{*}$.
  - We'll cover two algorithms for doing this: value iteration and policy iteration
Policy Evaluation

- Value equation for a fixed policy $\pi$:

$$V^{\pi}(s) = R(s) + \beta \sum_{s'} T(s,\pi(s),s')\, V^{\pi}(s')$$

(immediate reward, plus the discounted expected value of following $\pi$ in the future)

- The equation can be derived from the original definition of infinite-horizon discounted value
Policy Evaluation

- Value equation for a fixed policy $\pi$:

$$V^{\pi}(s) = R(s) + \beta \sum_{s'} T(s,\pi(s),s')\, V^{\pi}(s')$$

- How can we compute the value function for a fixed policy?
  - we are given R, T, and $\pi$, and want to find $V^{\pi}(s)$ for each s
  - this is a linear system with n variables and n constraints
    - Variables are the values of the states: V(s1), ..., V(sn)
    - Constraints: one value equation (above) per state
  - Use linear algebra to solve for V (e.g., matrix inverse)
Policy Evaluation via Matrix Inverse

- $V^{\pi}$ and R are n-dimensional column vectors (one element for each state)
- T is an n×n matrix such that $T(i,j) = T(s_i, \pi(s_i), s_j)$

$$V^{\pi} = R + \beta T V^{\pi}$$
$$(I - \beta T)\, V^{\pi} = R$$
$$V^{\pi} = (I - \beta T)^{-1} R$$
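A minimal sketch of this computation in Python/NumPy (not from the slides; the array names and the 2-state example are illustrative). Solving the linear system directly is numerically preferable to forming the inverse:

```python
import numpy as np

def evaluate_policy(R, T_pi, beta):
    """Exact policy evaluation: solve (I - beta*T_pi) V = R.

    R    : (n,) reward vector, R[i] = R(s_i)
    T_pi : (n, n) matrix, T_pi[i, j] = T(s_i, pi(s_i), s_j)
    beta : discount factor in [0, 1)
    """
    n = len(R)
    # Solving the system is more stable than computing the inverse explicitly.
    return np.linalg.solve(np.eye(n) - beta * T_pi, R)

# Tiny 2-state example (made-up numbers, purely for illustration).
R = np.array([1.0, 0.0])
T_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
print(evaluate_policy(R, T_pi, beta=0.9))
```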
Computing an Optimal Value Function
- Bellman equation for the optimal value function:

$$V^{*}(s) = R(s) + \beta \max_{a} \sum_{s'} T(s,a,s')\, V^{*}(s')$$

(immediate reward, plus the discounted expected value of the best action, assuming we get optimal value in the future)

- Bellman proved this is always true for an optimal value function
Computing an Optimal Value Function

- Bellman equation for the optimal value function:

$$V^{*}(s) = R(s) + \beta \max_{a} \sum_{s'} T(s,a,s')\, V^{*}(s')$$

- How can we solve this equation for V*?
  - The max operator makes the system non-linear, so the problem is more difficult than policy evaluation
- Idea: let's pretend that we have a finite, but very, very long, horizon and apply finite-horizon value iteration
  - Adjust the Bellman backup to take discounting into account.
Bellman Backups (Revisited)
[Figure: one-step backup diagram. From state s, actions a1 and a2 lead to successor states s1-s4 with transition probabilities 0.7/0.3 and 0.4/0.6; first compute the expectation of Vk over successors for each action, then the max over actions to obtain Vk+1(s).]

$$V_{k+1}(s) = R(s) + \beta \max_{a} \sum_{s'} T(s,a,s')\, V_{k}(s')$$
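The backup pictured above, written as a per-state Python sketch (assuming, for illustration, rewards R as an (n,) array and transitions T as an (n, m, n) array with T[s, a, s2] = T(s, a, s')):

```python
import numpy as np

def bellman_backup(s, V, R, T, beta):
    """One Bellman backup: V_{k+1}(s) = R(s) + beta * max_a sum_{s'} T(s,a,s') V(s')."""
    # T[s] has shape (m, n); T[s] @ V gives the expected value of each action.
    return R[s] + beta * (T[s] @ V).max()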
Value Iteration

- Can compute the optimal policy using value iteration based on Bellman backups, just like finite-horizon problems (but include the discount term)

$$V_{0}(s) = 0 \qquad \text{;; could also initialize to } R(s)$$

$$V_{k+1}(s) = R(s) + \beta \max_{a} \sum_{s'} T(s,a,s')\, V_{k}(s')$$

- Will it converge to the optimal value function as k gets large?
  - Yes: $\lim_{k \to \infty} V_{k} = V^{*}$
- Why?
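A compact value-iteration sketch in Python under the same assumed array layout (a fixed iteration count here; the principled stopping condition appears on a later slide):

```python
import numpy as np

def value_iteration(R, T, beta, num_iters=1000):
    """R: (n,) rewards; T: (n, m, n) transition probabilities T[s, a, s']."""
    n, m, _ = T.shape
    V = np.zeros(n)                      # V_0 = 0 (could also start from R)
    for _ in range(num_iters):
        Q = R[:, None] + beta * T @ V    # Q[s, a] = R(s) + beta * sum_s' T(s,a,s') V(s')
        V = Q.max(axis=1)                # Bellman backup: max over actions
    return V
```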
Convergence of Value Iteration

- Bellman Backup Operator: define B to be an operator that takes a value function V as input and returns a new value function after a Bellman backup:

$$B[V](s) = R(s) + \beta \max_{a} \sum_{s'} T(s,a,s')\, V(s')$$

- Value iteration is just the iterative application of B:

$$V_{0} = 0, \qquad V_{k+1} = B[V_{k}]$$
Convergence: Fixed Point Property
- Bellman equation for the optimal value function:

$$V^{*}(s) = R(s) + \beta \max_{a} \sum_{s'} T(s,a,s')\, V^{*}(s')$$

- Fixed-Point Property: the optimal value function is a fixed point of the Bellman backup operator B.
  - That is, B[V*] = V*

$$B[V](s) = R(s) + \beta \max_{a} \sum_{s'} T(s,a,s')\, V(s')$$
Convergence: Contraction Property

- Let ||V|| denote the max-norm of V, which returns the maximum-magnitude element of the vector.
  - E.g., ||(0.1 100 5 6)|| = 100
- B is a contraction operator with respect to the max-norm
- For any V and V', $\|B[V] - B[V']\| \le \beta \|V - V'\|$
  - You will prove this.
- That is, applying B to any two value functions causes them to get closer together in the max-norm sense!
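A quick numerical check of the contraction property on a random MDP (an illustrative sketch, same assumed array layout as the earlier examples):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, beta = 5, 3, 0.9

# Random MDP: rewards and row-stochastic transitions T[s, a, s'].
R = rng.random(n)
T = rng.random((n, m, n))
T /= T.sum(axis=2, keepdims=True)

def B(V):
    """Bellman backup operator."""
    return R + beta * (T @ V).max(axis=1)

V1, V2 = rng.random(n), rng.random(n)
lhs = np.max(np.abs(B(V1) - B(V2)))   # ||B[V1] - B[V2]||
rhs = beta * np.max(np.abs(V1 - V2))  # beta * ||V1 - V2||
print(lhs <= rhs + 1e-12)             # expected: True
```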
Convergence

- Using the properties of B we can prove convergence of value iteration.
- Proof:
  1. For any V: $\|V^{*} - B[V]\| = \|B[V^{*}] - B[V]\| \le \beta \|V^{*} - V\|$ (fixed-point property, then contraction)
  2. So applying a Bellman backup to any value function V brings us closer to V* by a constant factor β: $\|V^{*} - V_{k+1}\| = \|V^{*} - B[V_{k}]\| \le \beta \|V^{*} - V_{k}\|$
  3. This means that $\|V_{k} - V^{*}\| \le \beta^{k} \|V^{*} - V_{0}\|$
  4. Thus $\lim_{k \to \infty} \|V^{*} - V_{k}\| = 0$
Value Iteration: Stopping Condition

- Want to stop when we can guarantee that the value function is near optimal.
- Key property (not hard to prove): if $\|V_{k} - V_{k-1}\| \le \epsilon$ then $\|V_{k} - V^{*}\| \le \epsilon\beta/(1-\beta)$
- Continue iterating until $\|V_{k} - V_{k-1}\| \le \epsilon$
  - Select ε small enough for the desired error guarantee
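The stopping rule drops into the value-iteration loop like this (sketch, same assumed layout):

```python
import numpy as np

def value_iteration_eps(R, T, beta, eps=1e-6):
    """Iterate until ||V_k - V_{k-1}|| <= eps, which guarantees
    ||V_k - V*|| <= eps * beta / (1 - beta)."""
    V = np.zeros(len(R))
    while True:
        V_new = (R[:, None] + beta * T @ V).max(axis=1)
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new
        V = V_new
```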
How to Act

- Given a $V_{k}$ from value iteration that closely approximates $V^{*}$, what should we use as our policy?
- Use the greedy policy (one-step lookahead):

$$\text{greedy}[V_{k}](s) = \arg\max_{a} \sum_{s'} T(s,a,s')\, V_{k}(s')$$

- Note that the value of the greedy policy may not be equal to $V_{k}$
  - Why?
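Extracting the greedy policy from a value function is a one-line argmax under the same assumed layout; note that R(s) and β are constant with respect to a, so they drop out of the argmax:

```python
import numpy as np

def greedy_policy(V, T):
    """greedy[V](s) = argmax_a sum_{s'} T(s,a,s') V(s')."""
    return (T @ V).argmax(axis=1)   # (n,) array: one action index per state
```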
How to Act

- Use the greedy policy (one-step lookahead):

$$\text{greedy}[V_{k}](s) = \arg\max_{a} \sum_{s'} T(s,a,s')\, V_{k}(s')$$

- We care about the value of the greedy policy, which we denote by $V_{g}$
  - This is how good the greedy policy will be in practice.
- How close is $V_{g}$ to $V^{*}$?
Value of Greedy Policy

$$\text{greedy}[V_{k}](s) = \arg\max_{a} \sum_{s'} T(s,a,s')\, V_{k}(s')$$

- Define $V_{g}$ to be the value of this greedy policy
  - This is likely not the same as $V_{k}$
- Property: if $\|V_{k} - V^{*}\| \le \lambda$ then $\|V_{g} - V^{*}\| \le 2\lambda\beta/(1-\beta)$
  - Thus, $V_{g}$ is not too far from optimal if $V_{k}$ is close to optimal
- Our previous stopping condition allows us to bound λ based on $\|V_{k+1} - V_{k}\|$
- Set the stopping condition so that $\|V_{g} - V^{*}\| \le \Delta$
  - How?
Property: if $\|V_{k} - V^{*}\| \le \lambda$ then $\|V_{g} - V^{*}\| \le 2\lambda\beta/(1-\beta)$

Property: if $\|V_{k} - V_{k-1}\| \le \epsilon$ then $\|V_{k} - V^{*}\| \le \epsilon\beta/(1-\beta)$

Goal: $\|V_{g} - V^{*}\| \le \Delta$

Answer: if $\|V_{k} - V_{k-1}\| \le \frac{\Delta(1-\beta)^{2}}{2\beta^{2}}$ then $\|V_{g} - V^{*}\| \le \Delta$. (Chain the two properties: this ε gives $\lambda = \epsilon\beta/(1-\beta) = \Delta(1-\beta)/(2\beta)$, and then $2\lambda\beta/(1-\beta) = \Delta$.)
Policy Evaluation Revisited

- Sometimes policy evaluation is expensive due to the matrix operations
- Can we have an iterative algorithm, like value iteration, for policy evaluation?
- Idea: given a policy π and an MDP M, create a new MDP M[π] that is identical to M, except that in each state s we only allow the single action π(s)
  - What is the optimal value function V* for M[π]?
- Since the only valid policy for M[π] is π, V* = V^π.
Policy Evaluation Revisited
- Running VI on M[π] will converge to $V^{*} = V^{\pi}$.
  - What does the Bellman backup look like here?
- The Bellman backup now only considers one action in each state, so there is no max
  - We are effectively applying a backup restricted by π

Restricted Bellman Backup:

$$B_{\pi}[V](s) = R(s) + \beta \sum_{s'} T(s,\pi(s),s')\, V(s')$$
Iterative Policy Evaluation

- Running VI on M[π] is equivalent to iteratively applying the restricted Bellman backup.

Iterative Policy Evaluation:

$$V_{0} = 0, \qquad V_{k+1} = B_{\pi}[V_{k}]$$

Convergence:

$$\lim_{k \to \infty} V_{k} = V^{\pi}$$

- $V_{k}$ often becomes close to $V^{\pi}$ for small k
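Iterative policy evaluation via the restricted backup, sketched in Python (same assumed layout; representing `pi` as an array of action indices is an illustrative choice):

```python
import numpy as np

def iterative_policy_evaluation(R, T, pi, beta, eps=1e-8):
    """Apply the restricted backup B_pi until the max-norm change is <= eps.

    pi: (n,) array of action indices, pi[s] = action chosen in state s.
    """
    n = len(R)
    T_pi = T[np.arange(n), pi]          # (n, n) matrix: T(s, pi(s), s')
    V = np.zeros(n)
    while True:
        V_new = R + beta * T_pi @ V     # restricted Bellman backup (no max)
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new
        V = V_new
```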
Optimization via Policy Iteration

- Policy iteration uses policy evaluation as a subroutine for optimization
- It alternates steps of policy evaluation and policy improvement:

1. Choose a random policy π
2. Loop:
   (a) Evaluate $V^{\pi}$
   (b) π' = ImprovePolicy($V^{\pi}$)   ;; given $V^{\pi}$, returns a strictly better policy if π isn't optimal
   (c) Replace π with π'
   Until no improving action is possible at any state
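The full loop as a Python sketch, combining the exact evaluation and greedy improvement steps shown earlier (all names illustrative; same assumed array layout):

```python
import numpy as np

def policy_iteration(R, T, beta):
    """R: (n,) rewards; T: (n, m, n) transitions T[s, a, s']."""
    n, m, _ = T.shape
    pi = np.zeros(n, dtype=int)                      # arbitrary initial policy
    while True:
        # (a) Evaluate V^pi exactly by solving (I - beta*T_pi) V = R.
        T_pi = T[np.arange(n), pi]
        V = np.linalg.solve(np.eye(n) - beta * T_pi, R)
        # (b) Improve: greedy one-step lookahead w.r.t. V^pi.
        pi_new = (T @ V).argmax(axis=1)
        # (c) Stop when no state changes its action.
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```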
Policy Improvement
- Given $V^{\pi}$, how can we compute a policy π' that is strictly better than a sub-optimal π?
- Idea: given a state s, take the action that looks best assuming that we follow policy π thereafter
  - That is, assume the next state s' has value $V^{\pi}(s')$

For each s in S, set

$$\pi'(s) = \arg\max_{a} \sum_{s'} T(s,a,s')\, V^{\pi}(s')$$

Proposition: $V^{\pi'} \ge V^{\pi}$, with strict inequality for sub-optimal π.
Proposition: $V^{\pi'} \ge V^{\pi}$, with strict inequality for sub-optimal π.

$$\pi'(s) = \arg\max_{a} \sum_{s'} T(s,a,s')\, V^{\pi}(s')$$

Useful Properties for the Proof:

- For any two value functions $V_{1}$ and $V_{2}$, we write $V_{1} \ge V_{2}$ to indicate that $V_{1}(s) \ge V_{2}(s)$ for all states s.
- 1) Monotonicity: for any $V_{1}$ and $V_{2}$, if $V_{1} \ge V_{2}$ then $B[V_{1}] \ge B[V_{2}]$ (and likewise for the restricted backup $B_{\pi}$)
Proposition: $V^{\pi'} \ge V^{\pi}$, with strict inequality for sub-optimal π.

$$\pi'(s) = \arg\max_{a} \sum_{s'} T(s,a,s')\, V^{\pi}(s')$$

Proof sketch:
1. $V^{\pi}$ is the fixed point of $B_{\pi}$, and by the choice of π', $B_{\pi'}[V^{\pi}] \ge B_{\pi}[V^{\pi}] = V^{\pi}$ (π' maximizes the one-step backup).
2. By monotonicity, repeatedly applying $B_{\pi'}$ gives a non-decreasing sequence: $V^{\pi} \le B_{\pi'}[V^{\pi}] \le B_{\pi'}^{2}[V^{\pi}] \le \cdots$
3. The sequence converges to the fixed point of $B_{\pi'}$, which is $V^{\pi'}$, so $V^{\pi'} \ge V^{\pi}$.
4. If π is sub-optimal, the greedy backup strictly improves $V^{\pi}$ at some state, and the inequality is strict there.
Optimization via Policy Iteration
1. Choose a random policy π
2. Loop:
   (a) Evaluate $V^{\pi}$
   (b) For each s in S, set $\pi'(s) = \arg\max_{a} \sum_{s'} T(s,a,s')\, V^{\pi}(s')$
   (c) Replace π with π'
   Until no improving action is possible at any state

Proposition: $V^{\pi'} \ge V^{\pi}$, with strict inequality for sub-optimal π. Policy iteration thus goes through a sequence of improving policies.
Policy Iteration: Convergence
- Convergence is assured in a finite number of iterations
  - Since there are finitely many policies and each step improves the value, PI must converge to an optimal policy
- Gives the exact value of the optimal policy
Policy Iteration Complexity

- Each iteration runs in polynomial time in the number of states and actions
- There are at most $|A|^{n}$ policies, and PI never repeats a policy
  - So there are at most exponentially many iterations
  - Not a very good complexity bound
- Empirically O(n) iterations are required; often it seems like O(1)
  - Challenge: try to generate an MDP that requires more than n iterations
- Still no polynomial bound on the number of PI iterations (open problem)!
  - But it may have been solved recently...
Value Iteration vs. Policy Iteration

- Which is faster, VI or PI?
  - It depends on the problem
- VI takes more iterations than PI, but PI requires more time per iteration
  - PI must perform policy evaluation on each iteration, which involves solving a linear system
- VI is easier to implement since it does not require the policy evaluation step
  - But see the next slide
- We will see that both algorithms serve as inspiration for more advanced algorithms
Modified Policy Iteration

- Modified Policy Iteration replaces the exact policy evaluation step with inexact iterative evaluation
  - Uses a small number of restricted Bellman backups for evaluation
- Avoids the expensive policy evaluation step
- Perhaps easier to implement
- Often faster than PI and VI
- Still guaranteed to converge under mild assumptions on the starting points
Modified Policy Iteration
1. Choose an initial value function V
2. Loop:
   (a) For each s in S, set $\pi(s) = \arg\max_{a} \sum_{s'} T(s,a,s')\, V(s')$   ;; as in policy iteration
   (b) Partial Policy Evaluation: repeat K times: $V \leftarrow B_{\pi}[V]$   ;; approximate evaluation
   Until the change in V is minimal
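Modified policy iteration as a Python sketch (same assumed layout; the choice of K and the tolerance are illustrative):

```python
import numpy as np

def modified_policy_iteration(R, T, beta, K=10, eps=1e-6):
    """Alternate greedy improvement with K restricted backups per iteration."""
    n, m, _ = T.shape
    V = np.zeros(n)                          # initial value function
    while True:
        pi = (T @ V).argmax(axis=1)          # (a) greedy policy w.r.t. V
        T_pi = T[np.arange(n), pi]
        V_old = V.copy()
        for _ in range(K):                   # (b) K restricted Bellman backups
            V = R + beta * T_pi @ V
        if np.max(np.abs(V - V_old)) <= eps: # stop when V barely changes
            return pi, V
```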
Recap: things you should know

- What is an MDP?
- What is a policy?
  - Stationary and non-stationary
- What is a value function?
  - Finite-horizon and infinite-horizon
- How to evaluate policies?
  - Finite-horizon and infinite-horizon
  - Time/space complexity?
- How to optimize policies?
  - Finite-horizon and infinite-horizon
  - Time/space complexity?
  - Why are they correct?