Model-Free vs. Model-Based RL: Q, SARSA, & E3
Administrivia
•Reminder:
•Office hours tomorrow truncated
•9:00-10:15 AM
•Can schedule other times if necessary
•Final projects
•Final presentations Dec 2, 7, 9
•20 min (max) presentations
•3 or 4 per day
•Sign up for presentation slots today!
The Q-learning algorithm
Algorithm: Q_learn
Inputs: State space S; Act. space A
Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
Repeat {
s=get_current_world_state()
a=pick_next_action(Q,s)
(r,s’)=act_in_world(a)
Q(s,a)=Q(s,a)+α*(r+γ*max_a’(Q(s’,a’))-Q(s,a))
} Until (bored)
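A minimal Python sketch of this loop, assuming a hypothetical tabular environment with reset() and step(a) methods (step returning a reward and the next state) and ε-greedy action selection; the names env, actions, epsilon, and steps are illustrative, not from the slide:

import random
from collections import defaultdict

def q_learn(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, steps=10000):
    Q = defaultdict(float)                    # Q(s,a) table, defaults to 0

    def pick_next_action(s):
        # epsilon-greedy: explore with probability epsilon, else act greedily
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()                           # get_current_world_state()
    for _ in range(steps):                    # "Repeat ... Until (bored)"
        a = pick_next_action(s)
        r, s_next = env.step(a)               # act_in_world(a)
        # off-policy target: max over next actions under the current Q
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q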
SARSA-learning algorithm
Algorithm: SARSA_learn
Inputs: State space S; Act. space A
Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
s=get_current_world_state()
a=pick_next_action(Q,s)
Repeat {
(r,s’)=act_in_world(a)
a’=pick_next_action(Q,s’)
Q(s,a)=Q(s,a)+α*(r+γ*Q(s’,a’)-Q(s,a))
a=a’; s=s’;
} Until (bored)
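The same loop as an on-policy Python sketch: the update target uses the action actually selected at s', not the max. Same hypothetical env interface and illustrative names as in the Q-learning sketch above:

import random
from collections import defaultdict

def sarsa_learn(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, steps=10000):
    Q = defaultdict(float)

    def pick_next_action(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()
    a = pick_next_action(s)
    for _ in range(steps):
        r, s_next = env.step(a)
        a_next = pick_next_action(s_next)      # the action SARSA will actually take
        # on-policy target: Q of the action actually chosen at s'
        target = r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s_next, a_next
    return Q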
SARSA vs. Q
•SARSA and Q-learning are very similar
•SARSA updates Q(s,a) for the policy it’s actually executing
•Lets the pick_next_action() function choose the action used in the update
•Q-learning updates Q(s,a) for the greedy policy w.r.t. the current Q
•Uses max_a’ to pick the action used in the update
•which might differ from the action it actually executes at s’
•In practice: Q will learn the “true” π*, but SARSA will learn about what it’s actually doing
•Exploration can get Q-learning in trouble...
Radioactive breadcrumbs
•Can now define eligibility traces for SARSA
•In addition to Q(s,a) table, keep an e(s,a) table
•Records “eligibility” (real number) for each state/action pair
•At every step ((s,a,r,s’,a’) tuple):
•Increment e(s,a) for current (s,a) pair by 1
•Update all Q(s’’,a’’) vals in proportion to their e(s’’,a’’)
•Decay all e(s’’,a’’) by factor of λγ
•Leslie Kaelbling calls this the “radioactive breadcrumbs” form of RL
SARSA(λ)-learning alg.
Algorithm: SARSA(λ)_learn
Inputs: S, A, γ (0<=γ<1), α (0<=α<1), λ (0<=λ<1)
Outputs: Q
e(s,a)=0 // for all s, a
s=get_curr_world_st(); a=pick_nxt_act(Q,s);
Repeat {
(r,s')=act_in_world(a)
a'=pick_next_action(Q,s')
δ=r+γ*Q(s',a')-Q(s,a)
e(s,a)+=1
foreach (s'',a'') pair in (S×A) {
Q(s'',a'')=Q(s'',a'')+α*e(s'',a'')*δ
e(s'',a'')*=λγ
}
a=a'; s=s';
} Until (bored)
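A Python sketch of SARSA(λ) with accumulating traces, under the same assumptions as the sketches above; instead of sweeping all of S×A it only visits pairs whose eligibility is nonzero, which gives the same result here:

import random
from collections import defaultdict

def sarsa_lambda_learn(env, actions, gamma=0.9, alpha=0.1, lam=0.9,
                       epsilon=0.1, steps=10000):
    Q = defaultdict(float)
    e = defaultdict(float)                     # eligibility trace e(s,a)

    def pick_next_action(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()
    a = pick_next_action(s)
    for _ in range(steps):
        r, s_next = env.step(a)
        a_next = pick_next_action(s_next)
        delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
        e[(s, a)] += 1.0                       # accumulate trace for the current pair
        for sa in list(e):                     # all pairs with nonzero eligibility
            Q[sa] += alpha * e[sa] * delta     # update in proportion to eligibility
            e[sa] *= lam * gamma               # decay every trace by λγ
        s, a = s_next, a_next
    return Q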
The trail of crumbs
[Figure: the trail of eligibility “crumbs” left along the agent’s path; one frame shows the λ=0 case. Sutton & Barto, Sec 7.5]
Eligibility for a single state
[Figure: e(s_i,a_j) over time, spiking at the 1st visit, 2nd visit, ..., and decaying between visits. Sutton & Barto, Sec 7.5]
Eligibility trace followup
•Eligibility traces allow:
•Tracking where the agent has been
•Backup of rewards over longer periods
•Credit assignment: state/action pairs rewarded for having contributed to getting to the reward
•Why does it work?
The “forward view” of elig.
•Original SARSA did a “one step” backup:
info backed up into Q(s,a): r_t + γ*Q(s_{t+1},a_{t+1}) (rest of trajectory ignored)
The “forward view” of elig.
•Original SARSA did a “one step” backup:
•Could also do a “two step backup”:
info backed up into Q(s,a): r_t + γ*r_{t+1} + γ^2*Q(s_{t+2},a_{t+2}) (rest of trajectory ignored)
The “forward view” of elig.
•Original SARSA did a “one step” backup:
•Could also do a “two step backup”:
•Or even an “n step backup”:
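The n-step target generalizes the one- and two-step targets above; written out (my reconstruction in the notation of the earlier slides, since the slide shows only a diagram):

n-step backup target: r_t + γ*r_{t+1} + γ^2*r_{t+2} + ... + γ^{n-1}*r_{t+n-1} + γ^n*Q(s_{t+n},a_{t+n})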
The “forward view” of elig.
•Small-step backups (n=1, n=2, etc.) are slow and nearsighted
•Large-step backups (n=100, n=1000, n=∞) are expensive and may miss near-term effects
•Want a way to combine them
•Can take a weighted average of different backups
•E.g.:
The “forward view” of elig.
[Figure: example weighted combination of two backups of different lengths, with weights 1/3 and 2/3]
The “forward view” of elig.
•How do you know which number of steps to avg over? And what the weights should be?
•Accumulating eligibility traces are just a clever way to easily avg. over all n:
The “forward view” of elig.
[Figure: the successive n-step backups weighted by λ^0, λ^1, λ^2, ..., λ^{n-1}]
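Written out (the standard Sutton & Barto λ-return, not text from the slide): each n-step backup target R_t^{(n)} gets weight proportional to λ^{n-1}, normalized by (1-λ):

R_t^λ = (1-λ) * Σ_{n=1..∞} λ^{n-1} * R_t^{(n)}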
Replacing traces
•The kind just described are accumulating e-traces
•Every time you go back to a state/action pair, add extra eligibility
•There are also replacing eligibility traces
•Every time you go back to a state/action, reset e(s,a) to 1
•Works better sometimes
Sutton & Barto, Sec 7.8
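In code, the two variants differ only in the trace-increment line of the SARSA(λ) sketch above:

e[(s, a)] += 1.0    # accumulating trace: repeat visits add up
e[(s, a)] = 1.0     # replacing trace: each visit resets eligibility to 1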
Model-free vs. Model-based
What do you know?
•Both Q-learning and SARSA(λ) are model-free methods
•A.k.a., value-based methods
•Learn a Q function
•Never learn T or R explicitly
•At the end of learning, agent knows how to act, but doesn’t explicitly know anything about the environment
•Also, no guarantees about explore/exploit tradeoff
•Sometimes, want one or both of the above
Model-based methods
•Model-based methods, OTOH, do explicitly learn T & R
•At the end of learning, have the entire model M = 〈S,A,T,R〉
•Also have π*
•At least one model-based method also guarantees explore/exploit tradeoff properties
E3
•Efficient Explore & Exploit algorithm
•Kearns & Singh, Machine Learning 49, 2002
•Explicitly keeps a T matrix and an R table
•Plan (policy iter) w/ curr. T & R -> curr. π
•Every state/action entry in T and R:
•Can be marked known or unknown
•Has a #visits counter, nv(s,a)
•After every 〈s,a,r,s'〉 tuple, update T & R (running average)
•When nv(s,a)>NVthresh, mark cell as known & re-plan
•When all states known, done learning & have π*
The E3 algorithm
Algorithm: E3_learn_sketch // only an overview
Inputs: S, A, γ (0<=γ<1), NVthresh, Rmax, Varmax
Outputs: T, R, π*
Initialization:
R(s)=Rmax // for all s
T(s,a,s')=1/|S| // for all s,a,s'
known(s,a)=0; nv(s,a)=0; // for all s, a
π=policy_iter(S,A,T,R)
The E3 algorithm
Algorithm: E3_learn_sketch // con't
Repeat {
s=get_current_world_state()
a=π(s)
(r,s')=act_in_world(a)
T(s,a,s')=(1+T(s,a,s')*nv(s,a))/(nv(s,a)+1)
nv(s,a)++;
if (nv(s,a)>NVthresh) {
known(s,a)=1;
π=policy_iter(S,A,T,R)
}
} Until (all (s,a) known)
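A Python sketch of the bookkeeping in this loop. The planner policy_iter (assumed to return a dict mapping states to actions), the env interface, r_max, and nv_thresh are illustrative assumptions; unlike the slide, this sketch renormalizes the whole T(s,a,·) row so it stays a probability distribution, and it omits the R update and the full explore/exploit machinery of the published E3 algorithm:

def e3_learn_sketch(env, S, A, r_max, policy_iter, nv_thresh=50, max_steps=100000):
    # optimistic initialization: Rmax rewards, uniform transition model
    R = {s: r_max for s in S}
    T = {(s, a, s2): 1.0 / len(S) for s in S for a in A for s2 in S}
    nv = {(s, a): 0 for s in S for a in A}
    known = {(s, a): False for s in S for a in A}
    pi = policy_iter(S, A, T, R)               # plan with the current model

    s = env.reset()
    for _ in range(max_steps):
        a = pi[s]
        r, s_next = env.step(a)
        n = nv[(s, a)]
        # running average of observed transitions out of (s,a)
        for s2 in S:
            hit = 1.0 if s2 == s_next else 0.0
            T[(s, a, s2)] = (hit + T[(s, a, s2)] * n) / (n + 1)
        nv[(s, a)] = n + 1
        if nv[(s, a)] > nv_thresh and not known[(s, a)]:
            known[(s, a)] = True               # mark this cell as known
            pi = policy_iter(S, A, T, R)       # re-plan with the updated model
        if all(known.values()):
            break                              # all (s,a) known: done learning
        s = s_next
    return T, R, pi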