Model-Free vs. Model-Based RL: Q, SARSA, & E3
Administrivia
•Reminder:
•Office hours tomorrow truncated
•9:00-10:15 AM
•Can schedule other times if necessary
•Final projects
•Final presentations Dec 2, 7, 9
•20 min (max) presentations
•3 or 4 per day
•Sign up for presentation slots today!
The Q-learning algorithm
Algorithm: Q_learn
Inputs: State space S; Act. space A
Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
Repeat {
s=get_current_world_state()
a=pick_next_action(Q,s)
(r,s’)=act_in_world(a)
Q(s,a)=Q(s,a)+α*(r+γ*max_a’(Q(s’,a’))-Q(s,a))
} Until (bored)
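A minimal Python sketch of this loop, assuming a hypothetical tabular environment with reset() and step(a) methods (step returning a reward and the next state) and ε-greedy action selection; the names env, actions, epsilon, and steps are illustrative, not from the slide:

import random
from collections import defaultdict

def q_learn(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, steps=10000):
    Q = defaultdict(float)                    # Q(s,a) table, defaults to 0

    def pick_next_action(s):
        # epsilon-greedy: explore with probability epsilon, else act greedily
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()                           # get_current_world_state()
    for _ in range(steps):                    # "Repeat ... Until (bored)"
        a = pick_next_action(s)
        r, s_next = env.step(a)               # act_in_world(a)
        # off-policy target: max over next actions under the current Q
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q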
SARSA-learning algorithm
Algorithm: SARSA_learn
Inputs: State space S; Act. space A
Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
s=get_current_world_state()
a=pick_next_action(Q,s)
Repeat {
(r,s’)=act_in_world(a)
a’=pick_next_action(Q,s’)
Q(s,a)=Q(s,a)+α*(r+γ*Q(s’,a’)-Q(s,a))
a=a’; s=s’;
} Until (bored)
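The same loop as an on-policy Python sketch: the update target uses the action actually selected at s', not the max. Same hypothetical env interface and illustrative names as in the Q-learning sketch above:

import random
from collections import defaultdict

def sarsa_learn(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, steps=10000):
    Q = defaultdict(float)

    def pick_next_action(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()
    a = pick_next_action(s)
    for _ in range(steps):
        r, s_next = env.step(a)
        a_next = pick_next_action(s_next)      # the action SARSA will actually take
        # on-policy target: Q of the action actually chosen at s'
        target = r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s_next, a_next
    return Q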
SARSA vs. Q
•SARSA and Q-learning are very similar
•SARSA updates Q(s,a) for the policy it’s actually executing
•Lets the pick_next_action() function choose the action used in the update
•Q-learning updates Q(s,a) for the greedy policy w.r.t. the current Q
•Uses max_a’ to pick the action used in the update
•which might differ from the action it actually executes at s’
•In practice: Q will learn the “true” π*, but SARSA will learn about what it’s actually doing
•Exploration can get Q-learning in trouble...
Radioactive breadcrumbs
•Can now define eligibility traces for SARSA
•In addition to Q(s,a) table, keep an e(s,a) table
•Records “eligibility” (real number) for each state/action pair
•At every step ((s,a,r,s’,a’) tuple):
•Increment e(s,a) for current (s,a) pair by 1
•Update all Q(s’’,a’’) vals in proportion to their e(s’’,a’’)
•Decay all e(s’’,a’’) by factor of λγ
•Leslie Kaelbling calls this the “radioactive breadcrumbs” form of RL
SARSA(λ)-learning alg.
Algorithm: SARSA(λ)_learn
Inputs: S, A, γ (0<=γ<1), α (0<=α<1), λ (0<=λ<1)
Outputs: Q
e(s,a)=0 // for all s, a
s=get_curr_world_st(); a=pick_nxt_act(Q,s);
Repeat {
(r,s')=act_in_world(a)
a'=pick_next_action(Q,s')
δ=r+γ*Q(s',a')-Q(s,a)
e(s,a)+=1
foreach (s'',a'') pair in (S×A) {
Q(s'',a'')=Q(s'',a'')+α*e(s'',a'')*δ
e(s'',a'')*=λγ
}
a=a'; s=s';
} Until (bored)
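A Python sketch of SARSA(λ) with accumulating traces, under the same assumptions as the sketches above; instead of sweeping all of S×A it only visits pairs whose eligibility is nonzero, which gives the same result here:

import random
from collections import defaultdict

def sarsa_lambda_learn(env, actions, gamma=0.9, alpha=0.1, lam=0.9,
                       epsilon=0.1, steps=10000):
    Q = defaultdict(float)
    e = defaultdict(float)                     # eligibility trace e(s,a)

    def pick_next_action(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()
    a = pick_next_action(s)
    for _ in range(steps):
        r, s_next = env.step(a)
        a_next = pick_next_action(s_next)
        delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
        e[(s, a)] += 1.0                       # accumulate trace for the current pair
        for sa in list(e):                     # all pairs with nonzero eligibility
            Q[sa] += alpha * e[sa] * delta     # update in proportion to eligibility
            e[sa] *= lam * gamma               # decay every trace by λγ
        s, a = s_next, a_next
    return Q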
The trail of crumbs
[Figure: the trail of eligibility “crumbs” left along the agent’s path; one frame shows the λ=0 case. Sutton & Barto, Sec 7.5]
Eligibility for a single state
[Figure: e(s_i,a_j) over time, spiking at the 1st visit, 2nd visit, ..., and decaying between visits. Sutton & Barto, Sec 7.5]
Eligibility trace followup
•Eligibility traces allow:
•Tracking where the agent has been
•Backup of rewards over longer periods
•Credit assignment: state/action pairs rewarded for having contributed to getting to the reward
•Why does it work?
The “forward view” of elig.
•Original SARSA did a “one step” backup:
info backed up into Q(s,a): r_t + γ*Q(s_{t+1},a_{t+1}) (rest of trajectory ignored)
The “forward view” of elig.
•Original SARSA did a “one step” backup:
•Could also do a “two step backup”:
info backed up into Q(s,a): r_t + γ*r_{t+1} + γ^2*Q(s_{t+2},a_{t+2}) (rest of trajectory ignored)
The “forward view” of elig.
•Original SARSA did a “one step” backup:
•Could also do a “two step backup”:
•Or even an “n step backup”:
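The n-step target generalizes the one- and two-step targets above; written out (my reconstruction in the notation of the earlier slides, since the slide shows only a diagram):

n-step backup target: r_t + γ*r_{t+1} + γ^2*r_{t+2} + ... + γ^{n-1}*r_{t+n-1} + γ^n*Q(s_{t+n},a_{t+n})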
The “forward view” of elig.
•Small-step backups (n=1, n=2, etc.) are slow and nearsighted
•Large-step backups (n=100, n=1000, n=∞) are expensive and may miss near-term effects
•Want a way to combine them
•Can take a weighted average of different backups
•E.g.:
The “forward view” of elig.
[Figure: example weighted combination of two backups of different lengths, with weights 1/3 and 2/3]
The “forward view” of elig.
•How do you know which number of steps to avg over? And what the weights should be?
•Accumulating eligibility traces are just a clever way to easily avg. over all n:
The “forward view” of elig.
[Figure: the successive n-step backups weighted by λ^0, λ^1, λ^2, ..., λ^{n-1}]
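Written out (the standard Sutton & Barto λ-return, not text from the slide): each n-step backup target R_t^{(n)} gets weight proportional to λ^{n-1}, normalized by (1-λ):

R_t^λ = (1-λ) * Σ_{n=1..∞} λ^{n-1} * R_t^{(n)}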
Replacing traces
•The kind just described are accumulating e-traces
•Every time you go back to a state/action pair, add extra eligibility
•There are also replacing eligibility traces
•Every time you go back to a state/action, reset e(s,a) to 1
•Works better sometimes
Sutton & Barto, Sec 7.8
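In code, the two variants differ only in the trace-increment line of the SARSA(λ) sketch above:

e[(s, a)] += 1.0    # accumulating trace: repeat visits add up
e[(s, a)] = 1.0     # replacing trace: each visit resets eligibility to 1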
Model-free vs. Model-based
What do you know?
•Both Q-learning and SARSA(λ) are model-free methods
•A.k.a., value-based methods
•Learn a Q function
•Never learn T or R explicitly
•At the end of learning, agent knows how to act, but doesn’t explicitly know anything about the environment
•Also, no guarantees about explore/exploit tradeoff
•Sometimes, want one or both of the above
Model-based methods
•Model-based methods, OTOH, do explicitly learn T & R
•At the end of learning, have the entire model M = 〈S,A,T,R〉
•Also have π*
•At least one model-based method also guarantees explore/exploit tradeoff properties
E3
•Efficient Explore & Exploit algorithm
•Kearns & Singh, Machine Learning 49, 2002
•Explicitly keeps a T matrix and an R table
•Plan (policy iter) w/ curr. T & R -> curr. π
•Every state/action entry in T and R:
•Can be marked known or unknown
•Has a #visits counter, nv(s,a)
•After every 〈s,a,r,s'〉 tuple, update T & R (running average)
•When nv(s,a)>NVthresh, mark cell as known & re-plan
•When all states known, done learning & have π*
The E3 algorithm
Algorithm: E3_learn_sketch // only an overview
Inputs: S, A, γ (0<=γ<1), NVthresh, Rmax, Varmax
Outputs: T, R, π*
Initialization:
R(s)=Rmax // for all s
T(s,a,s')=1/|S| // for all s,a,s'
known(s,a)=0; nv(s,a)=0; // for all s, a
π=policy_iter(S,A,T,R)
The E3 algorithm
Algorithm: E3_learn_sketch // con't
Repeat {
s=get_current_world_state()
a=π(s)
(r,s')=act_in_world(a)
T(s,a,s')=(1+T(s,a,s')*nv(s,a))/(nv(s,a)+1)
nv(s,a)++;
if (nv(s,a)>NVthresh) {
known(s,a)=1;
π=policy_iter(S,A,T,R)
}
} Until (all (s,a) known)
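A Python sketch of the bookkeeping in this loop. The planner policy_iter (assumed to return a dict mapping states to actions), the env interface, r_max, and nv_thresh are illustrative assumptions; unlike the slide, this sketch renormalizes the whole T(s,a,·) row so it stays a probability distribution, and it omits the R update and the full explore/exploit machinery of the published E3 algorithm:

def e3_learn_sketch(env, S, A, r_max, policy_iter, nv_thresh=50, max_steps=100000):
    # optimistic initialization: Rmax rewards, uniform transition model
    R = {s: r_max for s in S}
    T = {(s, a, s2): 1.0 / len(S) for s in S for a in A for s2 in S}
    nv = {(s, a): 0 for s in S for a in A}
    known = {(s, a): False for s in S for a in A}
    pi = policy_iter(S, A, T, R)               # plan with the current model

    s = env.reset()
    for _ in range(max_steps):
        a = pi[s]
        r, s_next = env.step(a)
        n = nv[(s, a)]
        # running average of observed transitions out of (s,a)
        for s2 in S:
            hit = 1.0 if s2 == s_next else 0.0
            T[(s, a, s2)] = (hit + T[(s, a, s2)] * n) / (n + 1)
        nv[(s, a)] = n + 1
        if nv[(s, a)] > nv_thresh and not known[(s, a)]:
            known[(s, a)] = True               # mark this cell as known
            pi = policy_iter(S, A, T, R)       # re-plan with the updated model
        if all(known.values()):
            break                              # all (s,a) known: done learning
        s = s_next
    return T, R, pi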