
Introduction to Markov Decision Processes and Dynamic Programming

Judith Bütepage and Marcus Klasson

KTH, Royal Institute of Technology, Stockholm

[email protected], [email protected]

February 14, 2017


Overview

1 Introduction to Markov Decision Processes
   Formal Modelling of RL Tasks
   Value Functions
   Bellman and his equations
   Optimal Value Function

2 Dynamic Programming
   Policy Evaluation
   Policy Improvement
   Policy Iteration
   Value Iteration


The Agent-Environment Interaction

S_t ∈ S, where S is the set of possible states
A_t ∈ A(S_t), where A(S_t) is the set of actions available in state S_t
R_{t+1} ∈ R ⊂ ℝ is a numerical reward
π_t(a|s) is a policy denoting the probability of choosing action A_t = a in state S_t = s

The agent’s goal is to maximize the total amount of reward it receives over the long run.


Help us to maximize our rewards!

The states are the slides of this lecture.
The actions are your reactions.
We get more reward when you understand and when you ask questions.

So raise your hand and do not get lost in this mathematical jungle!


A Short Discourse Into Multi-Armed Bandits

The agent can choose between k actions and receives a reward for each action. The expected reward for taking action a at time t is

q*(a) = E[R_t | A_t = a].

If the agent has chosen actions up to time t, the average received reward is

Q_t(a) = ( Σ_{i=1}^{t−1} R_i · 1(A_i = a) ) / ( Σ_{i=1}^{t−1} 1(A_i = a) ).


Multi-Armed Bandits Example - Dragon Finder

We can choose the actions

A = {d1, d2, d3}

We have chosen actions and received rewards

A_{1:t−1} = [d1, d2, d1, d3, d2, d3, d3]
R_{1:t−1} = [2.6, 1.1, 3.4, 6.1, 0.8, 4.6, 5.2]


Multi-Armed Bandits Example - Dragon Finder

We have chosen actions and received rewards

A_{1:t−1} = [d1, d2, d1, d3, d2, d3, d3]
R_{1:t−1} = [2.6, 1.1, 3.4, 6.1, 0.8, 4.6, 5.2]

Then we have

Q_t(d1) = (2.6 + 3.4) / 2 = 3.0
Q_t(d2) = (1.1 + 0.8) / 2 = 0.95
Q_t(d3) = (6.1 + 4.6 + 5.2) / 3 = 5.3

We can be greedy and exploit these estimates by choosing the action that gives us the highest expected reward, or we can explore our action space and choose a random action with probability ε.
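As a small illustration of these two choices, here is a minimal Python sketch (not from the slides; the function and variable names, such as epsilon_greedy_bandit and true_means, are mine) of a k-armed bandit agent that keeps sample-average estimates Q_t(a) and acts ε-greedily:

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=1000, seed=0):
    """Run an epsilon-greedy agent on a k-armed bandit with Gaussian rewards."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k          # number of times each action was chosen
    q_est = [0.0] * k         # sample-average estimate Q_t(a)
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                       # explore: random action
        else:
            a = max(range(k), key=lambda i: q_est[i])  # exploit: greedy action
        r = rng.gauss(true_means[a], 1.0)              # sample a reward
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]         # incremental sample average
        total_reward += r
    return q_est, total_reward / steps

# Example: three "dragons" with hypothetical mean rewards
print(epsilon_greedy_bandit([3.0, 1.0, 5.0], epsilon=0.1))
```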


Multi-Armed Bandits Example - ε-greedy Graph

Figure: Average reward (0 to 5) over 1000 steps for ε = 0 (greedy), ε = 0.01, and ε = 0.1.

Comparing the greedy method with two ε-greedy methods (ε = 0.01 and ε = 0.1). Rewards are normally distributed as

R_d ∼ N(μ_d, σ_d), with μ = [3, 1, 5] and σ = [0.5, 0.25, 1].

Each run takes t = 1000 steps and the curves are averaged over 1000 runs.
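A rough sketch of how such a comparison could be reproduced; the means and standard deviations below are the ones given in the caption, the plotting is omitted, and fewer runs are used than on the slide to keep it quick:

```python
import numpy as np

def run_bandit(epsilon, mu, sigma, steps=1000, runs=200, seed=0):
    """Average reward per step of an epsilon-greedy agent, averaged over runs."""
    rng = np.random.default_rng(seed)
    k = len(mu)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        q_est = np.zeros(k)
        counts = np.zeros(k)
        for t in range(steps):
            if rng.random() < epsilon:
                a = int(rng.integers(k))            # explore
            else:
                a = int(np.argmax(q_est))           # exploit
            r = rng.normal(mu[a], sigma[a])
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]  # sample-average update
            avg_reward[t] += r
    return avg_reward / runs

mu, sigma = [3, 1, 5], [0.5, 0.25, 1]   # values from the figure caption
for eps in (0.0, 0.01, 0.1):
    curve = run_bandit(eps, mu, sigma)  # slide uses 1000 runs; 200 here for speed
    print(f"epsilon={eps}: final average reward ~ {curve[-1]:.2f}")
```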


Markov Decision Processes

A Markov Decision Process (MDP) is defined by a 5-tuple (S, A, p, R, γ)

S is a finite set of possible states
A(S_t) is a finite set of actions in state S_t
p(s'|s, a) is the state-transition probability to state s' from state s when taking action a
R is a numerical reward
γ is a discount factor, 0 ≤ γ ≤ 1

A finite MDP has a finite number of states and actions.
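In code, a finite MDP can be stored as a nested mapping from states and actions to outcome lists. The sketch below is only illustrative; the states, probabilities, and rewards are placeholders of mine, not values from the lecture:

```python
# A finite MDP as a nested dict: P[s][a] -> list of (probability, next_state, reward).
# All states, actions, probabilities, and rewards below are illustrative placeholders.
P = {
    "corridor": {
        "walk":  [(1.0, "dragon_room", -1.0)],
    },
    "dragon_room": {
        "fight": [(0.7, "treasure", -5.0), (0.3, "corridor", -5.0)],
        "sneak": [(0.5, "treasure", -1.0), (0.5, "dragon_room", -1.0)],
    },
    "treasure": {},   # terminal state: no actions available
}
gamma = 0.9  # discount factor, 0 <= gamma <= 1

# Sanity check: outgoing probabilities for each (s, a) should sum to 1.
for s, actions in P.items():
    for a, outcomes in actions.items():
        assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9, (s, a)
```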


The Valentine’s Dilemma

The final goal of the princess is to rescue her prince. However, there are obstacles on the way. Valentine's day is only ONCE a year, so she needs to be fast!
For every step she gets a reward of -1, unless she meets a dragon and needs to fight it. Then the reward is -5.


Goals and Rewards

Goal: The maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Reward signal: What we want to achieve, not how to achieve it.


Discounted Rewards

Episodic task: T ∈ ℕ, each episode ends in a terminal state. Continuing task: T = ∞.

Expected return: G_t is some specific function of the reward sequence.

Episodic task: G_t = R_{t+1} + R_{t+2} + R_{t+3} + ... + R_T

Continuing task: G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + γ³R_{t+4} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}

0 ≤ γ ≤ 1 is called the discount rate.


Unified Notation

Episodic task: T ∈ ℕ, each episode ends in a terminal state. Continuing task: T = ∞.

G_t = Σ_{k=0}^{T−t−1} γ^k R_{t+k+1}

T can be ∞, or γ can be 1, but not both T = ∞ and γ = 1.

Myopic agent: γ = 0. Far-sighted agent: γ → 1.
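The unified return is easy to check numerically; a short sketch (the reward sequence below is made up):

```python
def discounted_return(rewards, gamma):
    """G_t = sum_{k=0}^{T-t-1} gamma^k * R_{t+k+1} for the reward sequence R_{t+1}, ..., R_T."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Episodic example: rewards R_{t+1}, ..., R_T observed after time t.
print(discounted_return([-1, -1, -5, -1, 10], gamma=1.0))   # undiscounted sum: 2.0
print(discounted_return([-1, -1, -5, -1, 10], gamma=0.9))   # later rewards count less
```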


State Representations

Figure: Three alternative state representations (Representation 1, Representation 2, Representation 3).

A state can include sensory signals, abstract environmental information or even mental states. However, it should only contain information relevant for decision making.


The Valentine’s Dilemma - The Markov Property

Generally, the current response could depend on the entire past:
p(S_{t+1} = s', R_{t+1} = r | S_0, A_0, R_1, ..., S_{t−1}, A_{t−1}, R_t, S_t, A_t)

The Markov property assumes independence of the past given the present:
p(s', r | s, a) ≐ p(S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a)


Markov Decision Processes

A Markov Decision Process is defined by a 5-tuple (S, A, p, R, γ)

S is a finite set of possible states
A(S_t) is a finite set of actions in state S_t
p(s'|s, a) is the state-transition probability to state s' from state s when taking action a
R is a numerical reward
γ is a discount factor, 0 ≤ γ ≤ 1

Expected reward for a state–action pair

r(s, a) ≐ E[R_{t+1} | S_t = s, A_t = a] = Σ_{r∈R} r Σ_{s'∈S} p(s', r | s, a)

State-transition probabilities

p(s'|s, a) ≐ p(S_{t+1} = s' | S_t = s, A_t = a) = Σ_{r∈R} p(s', r | s, a)

Expected reward for a state–action–next-state triple

r(s, a, s') ≐ E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s'] = Σ_{r∈R} r p(s', r | s, a) / p(s'|s, a)
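These marginals can be computed mechanically from the joint dynamics p(s', r | s, a). A sketch, assuming the joint for one fixed (s, a) pair is stored as a dict mapping (s', r) to its probability (a hypothetical layout of mine, with illustrative numbers):

```python
def transition_prob(joint, s_next):
    """p(s'|s,a) = sum_r p(s', r | s, a) for a fixed (s, a)."""
    return sum(p for (sp, r), p in joint.items() if sp == s_next)

def expected_reward(joint):
    """r(s,a) = sum_{s',r} r * p(s', r | s, a) for a fixed (s, a)."""
    return sum(r * p for (sp, r), p in joint.items())

def expected_reward_triple(joint, s_next):
    """r(s,a,s') = sum_r r * p(s', r | s, a) / p(s'|s,a)."""
    num = sum(r * p for (sp, r), p in joint.items() if sp == s_next)
    return num / transition_prob(joint, s_next)

# Joint dynamics p(s', r | s, a) for one fixed (s, a); numbers are illustrative only.
joint = {("win", 10.0): 0.3, ("fight", -5.0): 0.5, ("lose", -10.0): 0.2}
print(transition_prob(joint, "fight"))          # 0.5
print(expected_reward(joint))                   # 0.3*10 + 0.5*(-5) + 0.2*(-10) = -1.5
print(expected_reward_triple(joint, "fight"))   # -5.0
```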


MDP Transition Graph - Encountering a Dragon

Figure: Transition graph and table.
States: Sm = Smashed against the wall, Fi = Fighting, Wo = Won.
Actions: A = Attacking, H = Hitting, S = Sneaking past the dragon.
Functions: [p(s'|s, a), r(s, a, s')]


Value Functions

E_π[G_t] denotes the expectation of G_t when following policy π(a|s).

State–value function for policy π

v_π(s) ≐ E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]

Action–value function for policy π

q_π(s, a) ≐ E_π[G_t | S_t = s, A_t = a] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]


Bellman Equation for State–Value Functions

Figure: Richard Ernest Bellman (August 26, 1920 - March 19, 1984)

v_π(s) ≐ E_π[G_t | S_t = s]
       = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]
       = E_π[ R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_t = s ]
       = Σ_{a∈A} π(a|s) Σ_{r∈R} Σ_{s'∈S} p(s', r | s, a) [ r + γ E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_{t+1} = s' ] ]
       = Σ_{a∈A} π(a|s) Σ_{r∈R} Σ_{s'∈S} p(s', r | s, a) [ r + γ v_π(s') ],   ∀s ∈ S
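The last line is a one-step backup that can be evaluated for any state once a model and a policy are available. A minimal sketch, assuming the hypothetical layout P[s][a] = [(probability, next_state, reward), ...] and policy[s] = {action: π(a|s)} used in the earlier MDP sketch:

```python
def bellman_backup_v(P, policy, V, s, gamma):
    """Right-hand side of the Bellman equation for v_pi at state s:
       sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v_pi(s')]."""
    value = 0.0
    for a, pi_a in policy[s].items():           # policy[s]: dict action -> pi(a|s)
        for prob, s_next, reward in P[s][a]:    # model outcomes for (s, a)
            value += pi_a * prob * (reward + gamma * V.get(s_next, 0.0))  # terminal states default to 0
    return value
```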


Bellman Equation for Action–Value functions

q_π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]
          = ...
          = Σ_{r∈R} Σ_{s'∈S} p(s', r | s, a) [ r + γ Σ_{a'∈A} π(a'|s') q_π(s', a') ],   ∀s ∈ S, ∀a ∈ A


Backup Diagrams

(a) v_π(s) = Σ_{a∈A} π(a|s) Σ_{r∈R} Σ_{s'∈S} p(s', r | s, a) [ r + γ v_π(s') ],   ∀s ∈ S

(b) q_π(s, a) = Σ_{r∈R} Σ_{s'∈S} p(s', r | s, a) [ r + γ Σ_{a'∈A} π(a'|s') q_π(s', a') ],   ∀s ∈ S, ∀a ∈ A


Optimal Value Function

We say that policy π is better than (or equal to) π′ if π ≥ π′, where π ≥ π′ iff v_π(s) ≥ v_{π′}(s) ∀s ∈ S.

It is always the case that ∃π : π ≥ π′ ∀π′; this π is the optimal policy π*, and

v*(s) ≐ max_π v_π(s), ∀s ∈ S is the optimal state-value function,

q*(s, a) ≐ max_π q_π(s, a), ∀s ∈ S, ∀a ∈ A(s) is the optimal action-value function.


Bellman Optimality Equation

v*(s) = max_{a∈A(s)} q*(s, a)
      = max_a E_{π*}[G_t | S_t = s, A_t = a]
      = max_a E_{π*}[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]
      = max_a E_{π*}[ R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_t = s, A_t = a ]
      = max_a E_{π*}[ R_{t+1} + γ v*(S_{t+1}) | S_t = s, A_t = a ]
      = max_{a∈A(s)} Σ_{s',r} p(s', r | s, a) [ r + γ v*(s') ]


Bellman Optimality Equation - Backup Diagrams

(a) v*(s) = max_{a∈A(s)} Σ_{s',r} p(s', r | s, a) [ r + γ v*(s') ]

(b) q*(s, a) = Σ_{s',r} p(s', r | s, a) [ r + γ max_{a'∈A(s')} q*(s', a') ]


Introduction to Dynamic Programming

In general, Dynamic Programming techniques optimize subproblems of the main problem to reach a globally optimal solution.
In the context of RL, Dynamic Programming is a collection of algorithms that can compute the optimal value function of a finite MDP given a perfect model of the environment.


Evaluating a Policy fi

We have a policy π(a|s) and want to compute the value function v_π(s), ∀s ∈ S. The Bellman equation can be solved directly:

v_π(s) ≐ Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ]

v_π(s1) = c_{1,0} + c_{1,1} v_π(s1) + c_{1,2} v_π(s2) + c_{1,3} v_π(s3) + ...
v_π(s2) = c_{2,0} + c_{2,1} v_π(s1) + c_{2,2} v_π(s2) + c_{2,3} v_π(s3) + ...
v_π(s3) = c_{3,0} + c_{3,1} v_π(s1) + c_{3,2} v_π(s2) + c_{3,3} v_π(s3) + ...
v_π(s4) = ...

If the MDP is not finite we are in trouble! A large number of states and actions also makes this approach infeasible: solving the linear system has computational complexity O(n³), where n is the number of states.


Policy Evaluation

Assume that the environment is a finite MDP. We can use an iterative approach:

v_{k+1}(s) ≐ E_π[R_{t+1} + γ v_k(S_{t+1}) | S_t = s]
           = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_k(s') ]

This update is an operation called a full backup, and the resulting algorithm is called Iterative Policy Evaluation. It converges to the fixed point v_k = v_π.


Iterative Policy Evaluation
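The algorithm box from this slide is not reproduced in the transcript. The following is a minimal Python sketch of iterative policy evaluation with synchronous full backups, under the same assumed conventions as the earlier sketches (P[s][a] = [(probability, next_state, reward), ...], policy[s] = {action: π(a|s)}, terminal states have no actions):

```python
def policy_evaluation(P, policy, gamma=1.0, theta=1e-6):
    """Apply the full backup v_{k+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v_k(s')]
       over all states until the largest change in a sweep is below theta."""
    V = {s: 0.0 for s in P}
    while True:
        V_new = {}
        delta = 0.0
        for s in P:
            if not P[s]:                       # terminal state: value fixed at 0
                V_new[s] = 0.0
                continue
            V_new[s] = sum(
                pi_a * prob * (reward + gamma * V[s_next])
                for a, pi_a in policy[s].items()
                for prob, s_next, reward in P[s][a]
            )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < theta:
            return V

# Tiny usage example (hypothetical 2-state chain: "s0" -> "end" with reward -1)
P = {"s0": {"go": [(1.0, "end", -1.0)]}, "end": {}}
policy = {"s0": {"go": 1.0}, "end": {}}
print(policy_evaluation(P, policy, gamma=0.9))   # {'s0': -1.0, 'end': 0.0}
```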


Running Example

Shaded squares are terminal states. Actions that would take the agent off the grid leave it in the same state.


Running Example - "Random Policy"

v_{k+1}(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_k(s') ]


Policy Improvement

We have a policy π(a|s) but it is not optimal. How can we improve it?

Policy improvement theorem: If q_π(s, π′(s)) ≥ v_π(s) for all s ∈ S, then the policy π′ must be as good as, or better than, π.

It must obtain greater or equal expected return in all states: v_{π′}(s) ≥ v_π(s).


Proof of Policy Improvement Theorem

v_π(s) ≤ q_π(s, π′(s))
       = E_{π′}[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]
       ≤ E_{π′}[R_{t+1} + γ q_π(S_{t+1}, π′(S_{t+1})) | S_t = s]
       = E_{π′}[R_{t+1} + γ E_{π′}[R_{t+2} + γ v_π(S_{t+2})] | S_t = s]
       = E_{π′}[R_{t+1} + γ R_{t+2} + γ² v_π(S_{t+2}) | S_t = s]
       ...
       ≤ E_{π′}[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s]
       = v_{π′}(s)


Greedy Policy

We have the state-value function v_π(s), ∀s ∈ S, and greedily choose actions that maximize it.

π′(s) ≐ argmax_a q_π(s, a)
      = argmax_a E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a]
      = argmax_a Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ]

If the greedy policy π′ is as good as, but not better than, π, then v_{π′}(s) = v_π(s) ∀s ∈ S, and

v_{π′}(s) = max_a E[R_{t+1} + γ v_{π′}(S_{t+1}) | S_t = s, A_t = a]
          = max_a Σ_{s',r} p(s', r | s, a) [ r + γ v_{π′}(s') ]

This is the Bellman optimality equation, so v_{π′} = v* and both π and π′ are optimal policies.
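Acting greedily with respect to v_π is a one-step lookahead through the model. A sketch under the same assumed P[s][a] = [(probability, next_state, reward), ...] layout; it returns a deterministic policy mapping each state to one action:

```python
def greedy_policy(P, V, gamma=1.0):
    """pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]."""
    policy = {}
    for s, actions in P.items():
        if not actions:                    # terminal state: nothing to choose
            continue
        q = {
            a: sum(prob * (reward + gamma * V.get(s_next, 0.0))
                   for prob, s_next, reward in outcomes)
            for a, outcomes in actions.items()
        }
        policy[s] = max(q, key=q.get)      # greedy (deterministic) action
    return policy
```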


Running Example


Policy Iteration
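The policy iteration diagram and algorithm box from this slide are not in the transcript. Below is a compact sketch of the evaluate/improve loop for deterministic policies, reusing the assumed P[s][a] = [(probability, next_state, reward), ...] layout (in-place value sweeps are used for brevity):

```python
def policy_iteration(P, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    states = [s for s in P if P[s]]                    # non-terminal states
    policy = {s: next(iter(P[s])) for s in states}     # arbitrary initial deterministic policy
    V = {s: 0.0 for s in P}

    def q_value(s, a):
        """One-step lookahead q(s, a) under the current value estimates V."""
        return sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation for the current deterministic policy
        while True:
            delta = 0.0
            for s in states:
                v_new = q_value(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Greedy policy improvement
        stable = True
        for s in states:
            best = max(P[s], key=lambda a: q_value(s, a))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```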


Generalized Policy Iteration

Any method that interleaves the two processes of policy evaluation and policy improvement falls under the umbrella of generalized policy iteration.
The two processes of policy evaluation and policy improvement can be seen as opposing forces that will agree on a single joint solution in the long run.


Value Iteration

Value iteration combines policy improvement and truncated policy evaluation steps.

v_{k+1}(s) ≐ max_a E[R_{t+1} + γ v_k(S_{t+1}) | S_t = s, A_t = a]
           = max_a Σ_{s',r} p(s', r | s, a) [ r + γ v_k(s') ],   ∀s ∈ S.


Value Iteration
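The value iteration algorithm box is likewise missing from the transcript. A minimal sketch of the max-backup with a greedy policy read-out at the end, again under the assumed P[s][a] = [(probability, next_state, reward), ...] layout:

```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    """v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma * v_k(s')], iterated until convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:                          # terminal state: value stays 0
                continue
            v_new = max(
                sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Read out a greedy policy with one final one-step lookahead
    policy = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][a]))
        for s in P if P[s]
    }
    return V, policy
```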


Convergence and Termination

All methods presented up to here are only guaranteed to converge for k → ∞.

However, often we get reasonable results by setting a convergence criterion such as |V_{k+1}(s) − V_k(s)| < θ.

Dynamic programming methods scale polynomially in the number of states and actions. Therefore they are exponentially faster than any direct search in the policy space. On today's computers, MDPs with millions of states can be solved with DP methods.


Limitations of MDPs

Take off the rose-tinted glasses: the real world is not a video game!


Limitations of MDPs

Circumvent the problem of high-dimensional state and action spaces by dividing your problem into subproblems.


Summary

In reinforcement learning we have an agent that interacts with its environment and receives rewards based on its decisions. The goal is to learn to choose actions that maximize the expected future reward.

states – states should contain all relevant information for making decisions

actions – an action brings you from state s into state s' according to p(s'|s, a)

rewards – an agent receives rewards for being in a state

policy – a policy is a stochastic rule for choosing actions as a function of states

Markov Decision Process – (S, A, p, R, γ) + Markov property

value functions – v_π(s) & q_π(s, a) summarize the expected return when following a policy π

policy evaluation – given a policy π(s) we iteratively compute v_π(s) ∀s ∈ S

policy improvement – given v_π(s), improve your policy π(s), e.g. by being greedy

policy iteration – alternate between policy evaluation and policy improvement

value iteration – combine policy evaluation and policy improvement


Questions?


References

Sutton, Richard S. and Barto, Andrew G. (2016++). Reinforcement Learning: An Introduction. MIT Press, Cambridge.


Thanks!


Policy Evaluation - Proof Sketch

Assume we converged at iteration K.

v_K(s) ≐ E_π[R_{t+1} + γ v_{K−1}(S_{t+1}) | S_t = s]

v_K(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ E_π[v_{K−1}(s')] ],
where E_π[v_{K−1}(s')] = Σ_a π(a|s') Σ_{s'',r} p(s'', r | s', a) [ r + γ E_π[v_{K−2}(s'')] ].

Since we follow π in every step, we will effectively approximate v_π.
