CSE 190: Reinforcement Learning: An Introduction
Chapter 7: Eligibility Traces
Acknowledgment: A good number of these slides are cribbed from Rich Sutton
The Book: Where we are and where we’re going
• Part I: The Problem
  • Introduction
  • Evaluative Feedback
  • The Reinforcement Learning Problem
• Part II: Elementary Solution Methods
  • Dynamic Programming
  • Monte Carlo Methods
  • Temporal Difference Learning
• Part III: A Unified View
  • Eligibility Traces
  • Generalization and Function Approximation
  • Planning and Learning
  • Dimensions of Reinforcement Learning
  • Case Studies
Chapter 7: Eligibility Traces
Simple Monte Carlo
$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$, where $R_t$ is the actual return following state $s_t$.

[Backup diagram: from $s_t$, Monte Carlo backs up along the entire sampled episode to the terminal state T, using the full return $R_t$.]
Simplest TD Method
$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$

[Backup diagram: from $s_t$, TD(0) backs up using only the next reward $r_{t+1}$ and the estimated value of $s_{t+1}$.]
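For concreteness, here is a minimal Python sketch (my own illustration, not from the slides) of the two tabular updates shown above, assuming `V` is a dict mapping states to value estimates:

```python
def mc_update(V, s_t, R_t, alpha):
    """Monte Carlo: move V(s_t) toward the actual return R_t."""
    V[s_t] += alpha * (R_t - V[s_t])

def td0_update(V, s_t, r_tp1, s_tp1, gamma, alpha):
    """TD(0): move V(s_t) toward the one-step bootstrapped target."""
    V[s_t] += alpha * (r_tp1 + gamma * V[s_tp1] - V[s_t])
```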
Is there something in between?
[Backup diagrams: something between the one-step TD backup and the full Monte Carlo backup.]
N-step TD Prediction
• Idea: Look farther into the future when you do a TD backup (1, 2, 3, …, n steps)
Mathematics of N-step TD Prediction

• Monte Carlo:
  $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$
• TD:
  $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
  • I.e., use V to estimate the remaining return
• n-step TD:
  • 2-step return: $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$
  • n-step return: $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$
• Note the “T”: the Monte Carlo return runs all the way to the terminal time of the episode.
• In the n-step return, the first n rewards are data; the final term $\gamma^n V_t(s_{t+n})$ is an estimate.
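A small Python sketch (my own illustration, not from the slides) of how the n-step return could be computed from a recorded trajectory. The containers are assumptions: `rewards[k]` holds $r_{k+1}$, `states[k]` holds $s_k$ (non-terminal states only), and `V` maps states to current value estimates.

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """Compute R_t^(n): n rewards of real data, then bootstrap with V.

    If the episode ends within n steps, this reduces to the Monte Carlo return.
    """
    T = len(rewards)                      # terminal time of the episode
    steps = min(n, T - t)                 # don't look past termination
    G = 0.0
    for k in range(steps):                # discounted sum of the first n rewards
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:                         # bootstrap only if s_{t+n} is non-terminal
        G += (gamma ** n) * V[states[t + n]]
    return G
```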
Note the Notation

• Monte Carlo: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$
• TD: $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
  • I.e., use V to estimate the remaining return
• n-step TD, n-step return: $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$
• An “(n)” at the top of the “R” means n-step return (easy to miss!); $R_t$ with no superscript is basically the full return, running out to the terminal time T.
Learning with N-step Backups
• Backup (on-line or off-line):
  $\Delta V_t(s_t) = \alpha \left[ R_t^{(n)} - V_t(s_t) \right]$

• Error reduction property of n-step returns:
  $\max_s \left| E_\pi\{ R_t^{(n)} \mid s_t = s \} - V^\pi(s) \right| \;\le\; \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$

  (maximum error using the n-step return is at most $\gamma^n$ times the maximum error using V)

• Using this, you can show that n-step methods converge
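A one-step argument (my own paraphrase, not spelled out on the slide) for why this bound holds: the first n rewards in $R_t^{(n)}$ have the same expectation under $\pi$ as the first n rewards in the definition of $V^\pi(s)$, so they cancel, leaving only the discounted bootstrap error (treating both $V$ and $V^\pi$ as zero at the terminal state):

$E_\pi\{ R_t^{(n)} \mid s_t = s \} - V^\pi(s) \;=\; \gamma^n \, E_\pi\{ V_t(s_{t+n}) - V^\pi(s_{t+n}) \mid s_t = s \}$

Taking absolute values and maximizing over s then gives the bound above, since the expectation of a quantity is at most its maximum.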
Random Walk Examples
• How does 2-step TD work here?
• How about 3-step TD?
A Larger Example

• Task: 19-state random walk
• Each curve corresponds to the error after 10 episodes.
• Each curve uses a different n.
• The x-axis is the learning rate.
• Do you think there is an optimal n? For everything?
Averaging N-step Returns
• n-step methods were introduced to help with understanding TD(λ) (next!)
• Idea: back up an average of several returns
  • e.g., back up half of the 2-step return and half of the 4-step return:
    $R_t^{avg} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$
  • The whole average still counts as one backup
• Called a complex backup
• To draw the backup diagram:
  • Draw each component
  • Label with the weights for that component
Forward View of TD(λ)

• TD(λ) is a method for averaging all n-step backups
  • weighted by $\lambda^{n-1}$ (time since visitation)
• λ-return:
  $R_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
• Backup using the λ-return:
  $\Delta V_t(s_t) = \alpha \left[ R_t^\lambda - V_t(s_t) \right]$
Forward View of TD(λ) (cont.)

• TD(λ) is a method for averaging all n-step backups
  • weighted by $\lambda^{n-1}$ (time since visitation)
• Note: Since the weights add up to 1, $r_{t+1}$ is not “overemphasized.”
λ-return Weighting Function

$R_t^\lambda = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$

(The first term covers the n-step returns until termination; after termination, all remaining weight, $\lambda^{T-t-1}$, goes to the full return $R_t$.)
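A minimal Python sketch (again my own illustration) of this episodic λ-return, reusing the hypothetical `n_step_return` helper from the earlier slide:

```python
def lambda_return(rewards, states, V, t, gamma, lam):
    """R_t^lambda: (1 - lam) * sum of lam^(n-1) * R_t^(n) until termination,
    plus lam^(T-t-1) times the full (Monte Carlo) return."""
    T = len(rewards)
    G_lambda = 0.0
    for n in range(1, T - t):             # n = 1, ..., T-t-1
        G_lambda += (1 - lam) * (lam ** (n - 1)) * n_step_return(
            rewards, states, V, t, n, gamma)
    # All remaining weight goes to the full return (every n >= T-t gives the same value).
    full_return = n_step_return(rewards, states, V, t, T - t, gamma)
    G_lambda += (lam ** (T - t - 1)) * full_return
    return G_lambda
```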
Relation to TD(0) and MC

• The λ-return can be rewritten as:
  $R_t^\lambda = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$
  (until termination)  (after termination)

• If λ = 1, you get MC:
  $R_t^\lambda = (1 - 1) \sum_{n=1}^{T-t-1} 1^{n-1} R_t^{(n)} + 1^{T-t-1} R_t = R_t$

• If λ = 0, you get TD(0):
  $R_t^\lambda = (1 - 0) \sum_{n=1}^{T-t-1} 0^{n-1} R_t^{(n)} + 0^{T-t-1} R_t = R_t^{(1)}$
Forward View of TD(λ) II
• Look forward from each state to determine its update from future states and rewards:
λ-return on the Random Walk

• Same 19-state random walk as before
• Why do you think intermediate values of λ are best?
Backward View
• Shout $\delta_t$ backwards over time
• The strength of your voice decreases with temporal distance by $\gamma\lambda$

$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$
Backward View of TD(λ)

• The forward view was for theory
• The backward view is for mechanism
• New variable called the eligibility trace, $e_t(s) \in \mathbb{R}^+$
  • On each step, decay all traces by $\gamma\lambda$ and increment the trace for the current state by 1:

  $e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$

• This gives the equivalent weighting as the forward algorithm when used incrementally
Policy Evaluation Using TD(λ)

Initialize V(s) arbitrarily
Repeat (for each episode):
    e(s) = 0, for all s ∈ S
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s′
        δ ← r + γV(s′) − V(s)          /* compute error */
        e(s) ← e(s) + 1                /* increment trace for current state */
        For all s:                     /* update ALL s's - really only the ones we have visited */
            V(s) ← V(s) + αδe(s)       /* incrementally update V(s) according to TD(λ) */
            e(s) ← γλe(s)              /* decay trace */
        s ← s′                         /* move to next state */
    Until s is terminal
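Below is a runnable Python sketch of the backward-view loop above. The environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the `policy(s)` function are assumptions for illustration, not part of the slides.

```python
from collections import defaultdict

def td_lambda_evaluate(env, policy, num_episodes, alpha=0.1, gamma=0.9, lam=0.9):
    """Backward-view TD(lambda) policy evaluation with accumulating traces."""
    V = defaultdict(float)                       # value table, initialized to 0
    for _ in range(num_episodes):
        e = defaultdict(float)                   # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v_next = 0.0 if done else V[s_next]  # terminal state has value 0
            delta = r + gamma * v_next - V[s]    # TD error
            e[s] += 1.0                          # accumulate trace for current state
            for state in list(e.keys()):         # update every state with a nonzero trace
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam          # decay trace
            s = s_next
    return V
```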
Relation of Backward View to MC & TD(0)
• Using the update rule:
  $\Delta V_t(s) = \alpha \delta_t e_t(s)$

• As before, if you set λ to 0, you get TD(0)
• If you set λ to 1, you get MC, but in a better way
  • Can apply TD(1) to continuing tasks
  • Works incrementally and on-line (instead of waiting until the end of the episode)
• Again, by the time you get to the terminal state, these increments add up to the one you would have made via the forward (theoretical) view.
Forward View = Backward View
• The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
• The book shows (algebra in the book) that the total updates over an episode are equal for every state s:

  Backward updates:
  $\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) \;=\; \sum_{t=0}^{T-1} \alpha \, I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$

  Forward updates:
  $\sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t) \, I_{s s_t} \;=\; \sum_{t=0}^{T-1} \alpha \, I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$

  and therefore

  $\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) \;=\; \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t) \, I_{s s_t}$

  (where $I_{s s_t}$ is an indicator that equals 1 if $s = s_t$ and 0 otherwise)

• On-line updating with small α is similar
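As a sanity check, here is a small Python sketch (my own illustration, not from the book) that sums the off-line backward-view increments and the off-line forward-view increments over one fixed episode; with V held fixed, the two totals should agree state by state. It reuses the hypothetical `lambda_return` and `n_step_return` helpers sketched earlier.

```python
from collections import defaultdict

def offline_backward_updates(rewards, states, V, alpha, gamma, lam):
    """Sum of backward-view TD(lambda) increments over one episode, V held fixed."""
    T = len(rewards)
    e = defaultdict(float)
    total = defaultdict(float)
    for t in range(T):
        v_next = V[states[t + 1]] if t + 1 < T else 0.0   # terminal value is 0
        delta = rewards[t] + gamma * v_next - V[states[t]]
        e[states[t]] += 1.0
        for s in e:
            total[s] += alpha * delta * e[s]
        for s in e:
            e[s] *= gamma * lam
    return total

def offline_forward_updates(rewards, states, V, alpha, gamma, lam):
    """Sum of forward-view (lambda-return) increments over one episode, V held fixed."""
    T = len(rewards)
    total = defaultdict(float)
    for t in range(T):
        G = lambda_return(rewards, states, V, t, gamma, lam)
        total[states[t]] += alpha * (G - V[states[t]])
    return total

# Example: states lists the T non-terminal states s_0..s_{T-1}; rewards[k] holds r_{k+1}.
# For any fixed V, the two dicts returned above should match state by state.
```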
On-line versus Off-line on the Random Walk

• Same 19-state random walk
• On-line performs better over a broader range of parameters
Control: Sarsa(λ)

• Standard control idea: learn Q values, so save eligibility traces for state-action pairs instead of just states:

  $e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$

  $Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$
  $\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$
Sarsa(λ) Algorithm

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
        δ ← r + γQ(s′,a′) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            e(s,a) ← γλe(s,a)
        s ← s′; a ← a′
    Until s is terminal
Sarsa(λ) Gridworld Example

• With one trial, the agent has much more information about how to get to the goal
• not necessarily the best way
• Can considerably accelerate learning
Three Approaches to Q(λ)

• How can we extend this to Q-learning?
• Recall Q-learning is an off-policy method to learn Q*, and it uses the max of the Q values for a state in its backup
• What happens if we make an exploratory move? That is NOT the right thing to back up over then…
• What to do?
• Three answers: Watkins, Peng, and the “naïve” answer.
Three Approaches to Q(λ)

• If you mark every state-action pair as eligible, you back up over a non-greedy policy
• Watkins: zero out the eligibility trace after a non-greedy action. Do the max when backing up at the first non-greedy choice.

$e_t(s,a) = \begin{cases} 1 + \gamma\lambda\, e_{t-1}(s,a) & \text{if } s = s_t,\ a = a_t,\ Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ 0 & \text{if } Q_{t-1}(s_t,a_t) \ne \max_a Q_{t-1}(s_t,a) \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$

$Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$
$\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$
Watkins’s Q(λ)

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
        a* ← argmax_b Q(s′,b)   (if a′ ties for the max, then a* ← a′)
        δ ← r + γQ(s′,a*) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            If a′ = a*, then e(s,a) ← γλe(s,a)
            else e(s,a) ← 0
        s ← s′; a ← a′
    Until s is terminal
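A short Python sketch (illustrative only, with hypothetical tabular `Q` and `e` dict-of-dicts) of the step that distinguishes Watkins's Q(λ): the backup always uses the greedy value, and every trace is cut whenever the action actually taken was exploratory. Terminal-state handling is omitted for brevity.

```python
def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next, alpha, gamma, lam):
    """One step of Watkins's Q(lambda) for tabular Q[s][a] and traces e[s][a]."""
    # Greedy action in the next state (break ties in favor of the action we will take).
    a_star = max(Q[s_next], key=Q[s_next].get)
    if Q[s_next][a_next] == Q[s_next][a_star]:
        a_star = a_next
    delta = r + gamma * Q[s_next][a_star] - Q[s][a]   # backup uses the greedy (max) value
    e[s][a] += 1.0                                    # accumulate trace for (s, a)
    for state in e:
        for action in e[state]:
            Q[state][action] += alpha * delta * e[state][action]
            if a_next == a_star:
                e[state][action] *= gamma * lam       # greedy action taken: decay as usual
            else:
                e[state][action] = 0.0                # exploratory action: cut all traces
```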
Peng’s Q(λ)

• Disadvantage of Watkins’s method:
  • Early in learning, the eligibility trace will be “cut” (zeroed out) frequently, resulting in little advantage to traces
• Peng:
  • Back up the max action except at the end
  • Never cut traces
• Disadvantage:
  • Complicated to implement
Naïve Q(λ)

• Idea: is it really a problem to back up exploratory actions?
• Never zero traces
• Always back up the max at the current action (unlike Peng or Watkins’s)
• Is this truly naïve?
• Well, it works well in some empirical studies
What is the backup diagram?
Comparison Task
From McGovern and Sutton (1997), Towards a Better Q(λ)

• Compared Watkins’s, Peng’s, and naïve (called McGovern’s here) Q(λ) on several tasks.
  • See McGovern and Sutton (1997), Towards a Better Q(λ), for other tasks and results (stochastic tasks, continuing tasks, etc.)
• Deterministic gridworld with obstacles
  • 10x10 gridworld
  • 25 randomly generated obstacles
  • 30 runs
  • α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces
Comparison Results
From McGovern and Sutton (1997), Towards a Better Q(λ)
Convergence of the Q(λ)’s
• None of the methods are proven to converge.
  • Much extra credit if you can prove any of them.
• Watkins’s is thought to converge to Q*
• Peng’s is thought to converge to a mixture of Q^π and Q*
• Naïve - Q*?
Eligibility Traces for Actor-Critic Methods
• Critic: on-policy learning of V^π. Use TD(λ) as described before.
• Actor: needs eligibility traces for each state-action pair.
• We change the update equation from

  $p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha \delta_t & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$

  to

  $p_{t+1}(s,a) = p_t(s,a) + \alpha \delta_t e_t(s,a)$
Eligibility Traces for Actor-Critic Methods
• Critic: on-policy learning of V^π. Use TD(λ) as described before.
• Actor: needs eligibility traces for each state-action pair.
• Can change the other actor-critic update from

  $p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha \delta_t \left[ 1 - \pi(s,a) \right] & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$

  to

  $p_{t+1}(s,a) = p_t(s,a) + \alpha \delta_t e_t(s,a)$

  where

  $e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t) & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$
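A brief Python sketch (my own illustration) of this second actor update; `prefs` (action preferences), `e`, and `pi` (current policy probabilities) are hypothetical tabular dict-of-dict structures, and `delta` is the critic's TD error.

```python
def actor_update(prefs, e, pi, s_t, a_t, delta, alpha, gamma, lam):
    """Actor-critic actor update with eligibility traces (second variant above)."""
    for s in e:
        for a in e[s]:
            e[s][a] *= gamma * lam                  # decay all actor traces
    e[s_t][a_t] += 1.0 - pi[s_t][a_t]               # credit the taken action, less its current probability
    for s in e:
        for a in e[s]:
            prefs[s][a] += alpha * delta * e[s][a]  # move preferences along the traces
    return prefs, e
```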
Replacing Traces
• Using accumulating traces, frequently visited states can have eligibilities greater than 1
  • This can be a problem for convergence
• Replacing traces: instead of adding 1 when you visit a state, set that trace to 1:

  $e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \ne s_t \\ 1 & \text{if } s = s_t \end{cases}$
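The difference between the two kinds of traces is one line of code. A Python sketch (illustrative, with a hypothetical trace dict `e`):

```python
def trace_step(e, s_t, gamma, lam, replacing=False):
    """Decay all traces, then credit the current state s_t."""
    for s in e:
        e[s] *= gamma * lam
    if replacing:
        e[s_t] = 1.0        # replacing traces: reset the visited state's trace to 1
    else:
        e[s_t] += 1.0       # accumulating traces: can grow beyond 1 for frequent states
    return e
```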
Replacing Traces Example

• Same 19-state random walk task as before
• Replacing traces perform better than accumulating traces over more values of λ
Why Replacing Traces?

• Replacing traces can significantly speed learning
• They can make the system perform well for a broader set of parameters
• Accumulating traces can do poorly on certain types of tasks
Why is this task particularly onerous for accumulating traces?
More Replacing Traces
• Off-line replacing trace TD(1) is identical to first-visit MC
• Extension to action values (Q values):
  • When you revisit a state, what should you do with the traces for the actions you didn’t take?
  • Singh and Sutton say to set them to zero:

  $e_t(s,a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ 0 & \text{if } s = s_t \text{ and } a \ne a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{if } s \ne s_t \end{cases}$
Implementation Issues with Traces
• Could require much more computation
  • But most eligibility traces are VERY close to zero
• If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices)
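In the same vectorized spirit, a NumPy sketch (my own illustration rather than the Matlab code the slide alludes to) in which the whole backup over all states is a couple of array operations:

```python
import numpy as np

def vectorized_td_lambda_step(V, e, s, r, s_next, alpha, gamma, lam, terminal):
    """One TD(lambda) step with V and e as NumPy arrays indexed by state."""
    delta = r + (0.0 if terminal else gamma * V[s_next]) - V[s]
    e[s] += 1.0                  # accumulating trace for the current state
    V += alpha * delta * e       # update every state's value in one vector operation
    e *= gamma * lam             # decay every trace in one vector operation
    return V, e
```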
Variable λ

• Can generalize to a variable λ
• Here λ is a function of time; could define

  $e_t(s) = \begin{cases} \gamma\lambda_t\, e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda_t\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$

  with, e.g., $\lambda_t = \lambda(s_t)$ (λ depending on the current state) or any other time-indexed sequence $\lambda_t$
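A tiny Python sketch (illustrative) of the trace update when λ depends on the current state; `lam_of` is a hypothetical function mapping states to λ values.

```python
def variable_lambda_trace_step(e, s_t, gamma, lam_of):
    """Trace update with a state-dependent lambda_t = lam_of(s_t)."""
    lam_t = lam_of(s_t)          # lambda for this time step, chosen from the current state
    for s in e:
        e[s] *= gamma * lam_t    # decay all traces by gamma * lambda_t
    e[s_t] += 1.0                # accumulate for the visited state
    return e
```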
Conclusions
• Provides an efficient, incremental way to combine MC and TD
  • Includes advantages of MC (can deal with lack of the Markov property)
  • Includes advantages of TD (using TD error, bootstrapping)
• Can significantly speed learning
• Does have a cost in computation
The two views
END
Unified View