

An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING

Andrew G. Barto, Department of Computer Science

University of Massachusetts – Amherst

UPF Lecture 2

Autonomous Learning Laboratory – Department of Computer Science


A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998.

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

❐ Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience

❐ Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL



Lecture 2, Part 1: Dynamic Programming

Objectives of this part:

❐ Overview of a collection of classical solution methods for MDPs known as Dynamic Programming (DP)

❐ Show how DP can be used to compute value functions, and hence, optimal policies

❐ Discuss efficiency and utility of DP


Policy Evaluation

Recall:

Policy Evaluation: for a given policy π, compute the state-value function V^π.

State-value function for policy π:

    V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }

Bellman equation for V^π:

    V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

— a system of |S| simultaneous linear equations
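Because the Bellman equation for V^π is linear in the values, it can also be solved directly with one linear solve rather than iteratively. A minimal sketch, assuming the model is supplied as NumPy arrays P[s, a, s'] (transition probabilities), R[s, a, s'] (expected rewards), and pi[s, a] (action probabilities); these array names are chosen for illustration and are not part of the lecture:

```python
import numpy as np

def exact_policy_evaluation(P, R, pi, gamma=0.9):
    """Solve the Bellman equations V = r_pi + gamma * P_pi V in one linear solve."""
    n_states = P.shape[0]
    # State-to-state transition matrix and expected one-step reward under pi
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sat,sat->s', pi, P, R)
    # V^pi = (I - gamma * P_pi)^{-1} r_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```

Solving the |S| x |S| system is exact but costs on the order of |S|^3 operations; the iterative methods on the next slide trade that for repeated cheap sweeps.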


Iterative Methods

The sequence of value functions produced by iterative policy evaluation:

    V_0 → V_1 → … → V_k → V_{k+1} → … → V^π

A full policy evaluation backup:

    V_{k+1}(s) ← Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]

A "sweep" consists of applying this backup operation to each state.


A Small Gridworld

❐ An undiscounted episodic task
❐ Nonterminal states: 1, 2, . . ., 14
❐ One terminal state (shown twice as shaded squares)
❐ Actions that would take the agent off the grid leave the state unchanged
❐ Reward is –1 until the terminal state is reached


Iterative Policy Eval for the Small Gridworld

π = random (uniform) action choices
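A minimal sketch of iterative policy evaluation on this gridworld under the equiprobable random policy, assuming the 4x4 layout with states numbered 0 to 15 row by row and the terminal state in the top-left and bottom-right corners (the usual convention for this example):

```python
import numpy as np

ACTIONS = ('up', 'down', 'right', 'left')
TERMINAL = (0, 15)   # the single terminal state, shown twice as shaded squares

def next_state(s, a):
    """Deterministic moves; actions that would leave the grid leave the state unchanged."""
    row, col = divmod(s, 4)
    if a == 'up':    row = max(row - 1, 0)
    if a == 'down':  row = min(row + 1, 3)
    if a == 'right': col = min(col + 1, 3)
    if a == 'left':  col = max(col - 1, 0)
    return 4 * row + col

def iterative_policy_evaluation(theta=1e-4):
    """Full backups under the random policy; reward is -1 per step, undiscounted (gamma = 1)."""
    V = np.zeros(16)
    while True:
        V_new = V.copy()
        for s in range(16):
            if s in TERMINAL:
                continue
            V_new[s] = sum(0.25 * (-1.0 + V[next_state(s, a)]) for a in ACTIONS)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new.reshape(4, 4)
        V = V_new

print(np.round(iterative_policy_evaluation()))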


Policy Improvement

Suppose we have computed V^π for a deterministic policy π.

For a given state s, would it be better to do an action a ≠ π(s)?

The value of doing a in state s is:

    Q^π(s,a) = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a }
             = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

It is better to switch to action a for state s if and only if

    Q^π(s,a) > V^π(s)


Policy Improvement Cont.

Do this for all states to get a new policy π′ that is greedy with respect to V^π:

    π′(s) = argmax_a Q^π(s,a)
          = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Then V^{π′} ≥ V^π.

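A minimal sketch of this greedification step, using the same assumed P[s, a, s'] and R[s, a, s'] arrays as in the earlier policy evaluation sketch:

```python
import numpy as np

def greedy_policy(P, R, V, gamma=0.9):
    """pi'(s) = argmax_a sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * V[s'])."""
    q = np.einsum('sat,sat->sa', P, R) + gamma * np.einsum('sat,t->sa', P, V)
    return q.argmax(axis=1)   # one greedy action index per state
```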

Policy Improvement Cont.

What if V^{π′} = V^π ?

i.e., for all s ∈ S,  V^{π′}(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ] ?

But this is the Bellman Optimality Equation.

So V^{π′} = V* and both π and π′ are optimal policies.


Policy Iteration

    π_0 → V^{π_0} → π_1 → V^{π_1} → … → π* → V* → π*

where each π → V step is policy evaluation and each V → π step is policy improvement ("greedification").
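A minimal sketch of the full policy iteration loop under the same assumed model arrays, alternating exact evaluation with greedification until the policy stops changing:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)      # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve V = r_pi + gamma * P_pi V for the current policy
        P_pi = P[np.arange(n_states), policy]                       # (S, S')
        r_pi = np.einsum('st,st->s', P_pi, R[np.arange(n_states), policy])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement ("greedification")
        q = np.einsum('sat,sat->sa', P, R) + gamma * np.einsum('sat,t->sa', P, V)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V                    # optimal policy and its value function
        policy = new_policy
```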


Value Iteration

Recall the full policy evaluation backup:

    V_{k+1}(s) ← Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]

Here is the full value iteration backup:

    V_{k+1}(s) ← max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]
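A minimal sketch of value iteration under the same assumed model arrays; each sweep applies the max-backup to every state:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Repeat the backup V(s) <- max_a sum_{s'} P (R + gamma V) until the change is tiny."""
    V = np.zeros(P.shape[0])
    while True:
        q = np.einsum('sat,sat->sa', P, R) + gamma * np.einsum('sat,t->sa', P, V)
        V_new = q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, q.argmax(axis=1)      # near-optimal values and a greedy policy
        V = V_new
```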


Asynchronous DP

❐ All the DP methods described so far require exhaustive sweeps of the entire state set.

❐ Asynchronous DP does not use sweeps. Instead it works like this: Repeat until a convergence criterion is met:

– Pick a state at random and apply the appropriate backup

❐ Still need lots of computation, but does not get locked into hopelessly long sweeps

❐ Can you select states to backup intelligently? YES: an agent's experience can act as a guide.
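A minimal sketch of one asynchronous variant (random-state value iteration backups), again over the assumed P and R arrays; this is just one possible ordering of backups, not the only one:

```python
import numpy as np

def asynchronous_value_iteration(P, R, gamma=0.9, n_backups=100_000, seed=0):
    """Apply the value iteration backup to one randomly chosen state at a time (no sweeps)."""
    rng = np.random.default_rng(seed)
    V = np.zeros(P.shape[0])
    for _ in range(n_backups):
        s = rng.integers(len(V))                                   # pick a state at random
        V[s] = np.max(np.einsum('at,at->a', P[s], R[s]) + gamma * P[s] @ V)
    return V
```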


Efficiency of DP

❐ To find an optimal policy is polynomial in the number of states…

❐ BUT, the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").

❐ In practice, classical DP can be applied to problems with a few million states.

❐ Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation.

❐ It is surprisingly easy to come up with MDPs for which DP methods are not practical.


Summary

❐ Policy evaluation: backups without a max
❐ Policy improvement: form a greedy policy, if only locally
❐ Policy iteration: alternate the above two processes
❐ Value iteration: backups with a max
❐ Full backups (to be contrasted later with sample backups)
❐ Asynchronous DP: a way to avoid exhaustive sweeps
❐ Bootstrapping: updating estimates based on other estimates



Lecture 2, Part 2: Simple Monte Carlo Methods

❐ Simple Monte Carlo methods learn from complete sample returns
    Only defined for episodic tasks

❐ Simple Monte Carlo methods learn directly from experience
    On-line: No model necessary
    Simulated: No need for a full model


(First-visit) Monte Carlo policy evaluation
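A minimal sketch of first-visit Monte Carlo policy evaluation, averaging the returns observed after the first visit to each state. The episode generator `generate_episode()`, returning (state, reward) pairs obtained by following the policy being evaluated, is an assumed helper, not something defined in the lecture:

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, n_episodes, gamma=1.0):
    """V(s) = average of the returns following the first visit to s in each episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode = generate_episode()          # [(s_0, r_1), (s_1, r_2), ...]
        # Returns computed backwards: G_t = r_{t+1} + gamma * G_{t+1}
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)      # remember only the first occurrence of s
        for s, t in first_visit.items():
            returns_sum[s] += returns[t]
            returns_count[s] += 1
            V[s] = returns_sum[s] / returns_count[s]
    return V
```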


Backup diagram for Simple Monte Carlo

❐ Entire episode included
❐ Only one choice at each state (unlike DP)
❐ MC does not bootstrap
❐ Time required to estimate one state does not depend on the total number of states


Monte Carlo Estimation of Action Values (Q)

❐ Monte Carlo is most useful when a model is not available
❐ Qπ(s,a) - average return starting from state s and action a, following π
❐ Also converges asymptotically if every state-action pair is visited infinitely often

We are really interested in estimates of V* and Q*, i.e., Monte Carlo Control


Learning about π while following π′


Summary

❐ MC has several advantages over DP:
    Can learn directly from interaction with environment
    No need for full models
    No need to learn about ALL states
    Less harm by Markovian violations (later in book)
❐ MC methods provide an alternate policy evaluation process
❐ One issue to watch for: maintaining sufficient exploration
    exploring starts, soft policies
❐ No bootstrapping (as opposed to DP)
❐ Estimating values for one policy while behaving according to another policy: importance sampling



Lecture 2, Part 3: Temporal Difference Learning

Objectives of this part:

❐ Introduce Temporal Difference (TD) learning
❐ Focus first on policy evaluation, or prediction, methods
❐ Then extend to control methods


TD Prediction

Recall: Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.

Simple (every-visit) Monte Carlo method:

    V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

    target: the actual return after time t

The simplest TD method, TD(0):

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

    target: an estimate of the return
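A minimal sketch of tabular TD(0) prediction. The `env.reset()`, `env.step(a) -> (next_state, reward, done)` and `policy(state) -> action` interfaces are assumptions made for this sketch, not part of the lecture:

```python
from collections import defaultdict

def td0_prediction(env, policy, n_episodes, alpha=0.1, gamma=1.0):
    """After every step: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])   # V(terminal) = 0
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```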


Simple Monte Carlo

(constant-α MC)

    V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

where R_t is the actual return following state s_t.

[Backup diagram: the entire episode is followed from s_t all the way to a terminal state.]


cf. Dynamic Programming

    V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }

[Backup diagram: from s_t, all possible actions and next states s_{t+1} with rewards r_{t+1} are considered, one step deep.]


Simplest TD Method

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

[Backup diagram: a single sample transition from s_t to s_{t+1} with reward r_{t+1}.]


TD Bootstraps and Samples

❐ Bootstrapping: update involves an estimate
    MC does not bootstrap
    DP bootstraps
    TD bootstraps

❐ Sampling: update does not involve an expected value
    MC samples
    DP does not sample
    TD samples


Advantages of TD Learning

❐ TD methods do not require a model of the environment, only experience

❐ TD, but not MC, methods can be fully incremental
    You can learn before knowing the final outcome
        – Less memory
        – Less peak computation
    You can learn without the final outcome
        – From incomplete sequences

❐ Both MC and TD converge (under certain assumptions), but which is faster/better?


Random Walk Example

Values learned by TD(0) after various numbers of episodes, on a random walk with equiprobable transitions, α = 0.1.


TD and MC on the Random Walk

Data averaged over 100 sequences of episodes


You are the Predictor

Suppose you observe the following 8 episodes:

A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

V(A)?

V(B)?



You are the Predictor

❐ The prediction that best matches the training data is V(A) = 0
    This minimizes the mean-square error on the training set
    This is what a Monte Carlo method gets

❐ If we consider the sequentiality of the problem, then we would set V(A) = .75
    This is correct for the maximum-likelihood estimate of a Markov model generating the data
    i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how? in the fitted model, A always leads to B with reward 0, and B terminates with reward 1 six times out of eight, so V(B) = 0.75 and V(A) = 0 + V(B) = 0.75)
    This is called the certainty-equivalence estimate
    This is what TD(0) gets


Learning An Action-Value Function

Estimate Q^π for the current behavior policy π.

After every transition from a nonterminal state s_t, do this:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.


Sarsa: On-Policy TD Control

Turn this into a control method by always updating the policy to be greedy with respect to the current estimate. The update uses the quintuple (s, a, r, s′, a′), hence the name Sarsa.


Q-Learning: Off-Policy TD Control

One-step Q-learning:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
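A minimal sketch of one-step Q-learning with ε-greedy behavior, under the same assumed environment interface. The target uses max_a Q(s′, a) regardless of which action the behavior policy takes next, which is what makes it off-policy:

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy TD control with the max-backup target."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:                 # epsilon-greedy behavior policy
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s2, x)] for x in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```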


Cliffwalking

ε−greedy, ε = 0.1


Actor-Critic Architecture

[Block diagram with components: Environment, Actor, Primary Critic, and Adaptive Critic. Situations or states flow from the Environment to the Actor; the Actor sends actions to the Environment; the Primary Critic supplies primary rewards, which the Adaptive Critic turns into effective rewards (involving values) for training the Actor.]


Actor-Critic Methods

❐ Explicit representation of policy as well as value function
❐ Minimal computation to select actions
❐ Can learn an explicit stochastic policy
❐ Can put constraints on policies
❐ Appealing as psychological and neural models


Actor-Critic Details

TD error is used to evaluate actions:

    δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

If actions are determined by preferences, p(s, a), as follows:

    π_t(s, a) = Pr{ a_t = a | s_t = s } = e^{p(s,a)} / Σ_b e^{p(s,b)},

then you can update the preferences like this:

    p(s_t, a_t) ← p(s_t, a_t) + β δ_t
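A minimal sketch of these updates, with V and the preferences p stored as arrays indexed by state (and action); the learning rates alpha (critic) and beta (actor) and the array layout are assumptions of this sketch:

```python
import numpy as np

def softmax_policy(p, s, rng):
    """Gibbs/softmax action selection from the preferences p[s, a]."""
    prefs = p[s] - p[s].max()                  # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(len(probs), p=probs)

def actor_critic_update(p, V, s, a, r, s_next, terminal, alpha=0.1, beta=0.1, gamma=1.0):
    """One step: the critic's TD error updates V(s) and the actor's preference p[s, a]."""
    delta = r + (0.0 if terminal else gamma * V[s_next]) - V[s]   # TD error
    V[s] += alpha * delta                                         # critic update
    p[s, a] += beta * delta                                       # actor update
    return delta
```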


Afterstates

❐ Usually, a state-value function evaluates states in which the agent can take an action.

❐ But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.

❐ Why is this useful?

❐ What is this in general?


Summary

❐ TD prediction
❐ Introduced one-step tabular model-free TD methods
❐ Extend prediction to control by employing some form of generalized policy iteration (GPI)
    On-policy control: Sarsa
    Off-policy control: Q-learning
❐ These methods bootstrap and sample, combining aspects of DP and MC methods



Lecture 2, Part 4: Unified Perspective


n-step TD Prediction

❐ Idea: Look farther into the future when you do a TD backup (1, 2, 3, …, n steps)


Mathematics of n-step TD Prediction

❐ Monte Carlo:

    R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … + γ^{T−t−1} r_T

❐ TD: use V to estimate the remaining return

    1-step return: R_t^{(1)} = r_{t+1} + γ V_t(s_{t+1})

❐ n-step TD:

    2-step return: R_t^{(2)} = r_{t+1} + γ r_{t+2} + γ^2 V_t(s_{t+2})

    n-step return: R_t^{(n)} = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})


Learning with n-step Backups

❐ Backup (on-line or off-line):

    ΔV_t(s_t) = α [ R_t^{(n)} − V_t(s_t) ]

❐ Error reduction property of n-step returns:

    max_s | E_π{ R_t^{(n)} | s_t = s } − V^π(s) |  ≤  γ^n max_s | V(s) − V^π(s) |

    (maximum error using the n-step return ≤ γ^n times the maximum error using V)

❐ Using this, you can show that n-step methods converge


Random Walk Examples

❐ How does 2-step TD work here?
❐ How about 3-step TD?


A Larger Example

❐ Task: 19-state random walk

❐ Do you think there is an optimal n (for everything)?


Averaging n-step Returns

❐ n-step methods were introduced to help with understanding TD(λ)

❐ Idea: backup an average of several returns
    e.g. backup half of the 2-step return and half of the 4-step return:

    R_t^{avg} = ½ R_t^{(2)} + ½ R_t^{(4)}

❐ Called a complex backup
    Draw each component
    Label with the weights for that component

(One backup)


Forward View of TD(λ)

❐ TD(λ) is a method for averaging all n-step backups
    weight by λ^{n−1} (time since visitation)

    λ-return:  R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^{(n)}

❐ Backup using λ-return:

    ΔV_t(s_t) = α [ R_t^λ − V_t(s_t) ]


λ-return Weighting Function


Relation to TD(0) and MC

❐ λ-return can be rewritten as:

    R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)}  +  λ^{T−t−1} R_t

    (the sum covers returns until termination; the final term covers after termination)

❐ If λ = 1, you get MC:

    R_t^λ = (1 − 1) Σ_{n=1}^{T−t−1} 1^{n−1} R_t^{(n)} + 1^{T−t−1} R_t = R_t

❐ If λ = 0, you get TD(0):

    R_t^λ = (1 − 0) Σ_{n=1}^{T−t−1} 0^{n−1} R_t^{(n)} + 0^{T−t−1} R_t = R_t^{(1)}


Forward View of TD(λ)

❐ Look forward from each state to determine the update from future states and rewards:


λ-return on the Random Walk

❐ Same 19-state random walk as before
❐ Why do you think intermediate values of λ are best?


Backward View of TD(λ)

❐ The forward view was for theory
❐ The backward view is for mechanism

❐ New variable called the eligibility trace: e_t(s) ∈ ℝ⁺
    On each step, decay all traces by γλ and increment the trace for the current state by 1 (an accumulating trace):

    e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
           = γλ e_{t−1}(s) + 1    if s = s_t


Backward View

❐ Shout δ_t backwards over time
❐ The strength of your voice decreases with temporal distance by γλ

    δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)


On-line Tabular TD(λ)

Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s′
        δ ← r + γV(s′) − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
        s ← s′
    Until s is terminal
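The same procedure as runnable Python, a minimal sketch using the assumed `env`/`policy` interfaces from the earlier TD(0) sketch and dictionaries for V and the traces:

```python
from collections import defaultdict

def td_lambda(env, policy, n_episodes, alpha=0.1, gamma=1.0, lam=0.9):
    """On-line tabular TD(lambda) with accumulating eligibility traces."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        e = defaultdict(float)                 # traces start at zero each episode
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s2]) - V[s]
            e[s] += 1.0                        # accumulating trace for the current state
            for x in list(e):                  # update every state with a nonzero trace
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam            # decay all traces by gamma*lambda
            s = s2
    return V
```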


Relation of Backwards View to MC & TD(0)

❐ Using update rule:

    ΔV_t(s) = α δ_t e_t(s)

❐ As before, if you set λ to 0, you get TD(0)
❐ If you set λ to 1, you get MC but in a better way
    Can apply TD(1) to continuing tasks
    Works incrementally and on-line (instead of waiting to the end of the episode)


Forward View = Backward View

❐ The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating

❐ The book shows that the sums of the two kinds of updates are equal (I_{ss_t} is 1 if s = s_t and 0 otherwise):

    Backward updates:  Σ_{t=0}^{T−1} ΔV_t^{TD}(s) = Σ_{t=0}^{T−1} α I_{ss_t} Σ_{k=t}^{T−1} (γλ)^{k−t} δ_k

    Forward updates:   Σ_{t=0}^{T−1} ΔV_t^{λ}(s_t) I_{ss_t} = Σ_{t=0}^{T−1} α I_{ss_t} Σ_{k=t}^{T−1} (γλ)^{k−t} δ_k

    so  Σ_{t=0}^{T−1} ΔV_t^{TD}(s) = Σ_{t=0}^{T−1} ΔV_t^{λ}(s_t) I_{ss_t}   (algebra shown in book)

❐ On-line updating with small α is similar


Control: Sarsa(λ)

❐ Save eligibility for state-action pairs instead of just states:

    e_t(s, a) = γλ e_{t−1}(s, a) + 1    if s = s_t and a = a_t
              = γλ e_{t−1}(s, a)        otherwise

    Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)

    δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)


Sarsa(λ) Algorithm

Initialize Q(s, a) arbitrarily and e(s, a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s′, a′) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + αδe(s, a)
            e(s, a) ← γλe(s, a)
        s ← s′; a ← a′
    Until s is terminal


Sarsa(λ) Gridworld Example

❐ With one trial, the agent has much more information about how to get to the goal (not necessarily the best way)

❐ Can considerably accelerate learning


Conclusions

❐ Eligibility traces provide an efficient, incremental way to combine MC and TD
    Includes advantages of MC (can deal with lack of Markov property)
    Includes advantages of TD (using TD error, bootstrapping)
❐ Can significantly speed learning
❐ Does have a cost in computation



TD error:  δ_t = r_t + V_t − V_{t−1}

[Figure: time courses of the prediction V and the TD error δ for regular predictors of z over this interval: early in learning, after learning is complete, and when the reward r is omitted.]


Dopamine Neurons and TD Error

W. Schultz et al., Université de Fribourg


Dopamine Modulated Synaptic Plasticity


Basal Ganglia as Adaptive Critic Architecture

Houk, Adams, & Barto, 1995
