Model-Free vs. Model-Based RL: Q, SARSA, & E3


Page 1:

Model-Free vs. Model-Based RL: Q, SARSA, & E3

Page 2:

Administrivia

•Reminder:

•Office hours tomorrow truncated

•9:00-10:15 AM

•Can schedule other times if necessary

•Final projects

•Final presentations Dec 2, 7, 9

•20 min (max) presentations

•3 or 4 per day

•Sign up for presentation slots today!

Page 3:

The Q-learning algorithm

Algorithm: Q_learn
Inputs: State space S; Act. space A
        Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
Repeat {
  s=get_current_world_state()
  a=pick_next_action(Q,s)
  (r,s')=act_in_world(a)
  Q(s,a)=Q(s,a)+α*(r+γ*max_a'(Q(s',a'))-Q(s,a))
} Until (bored)
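A minimal Python sketch of the loop above, assuming a small discrete environment with a Gym-style interface; env.reset(), env.step(), env.actions, and pick_eps_greedy are hypothetical names introduced here, not part of the slides:

import random
from collections import defaultdict

def pick_eps_greedy(Q, s, actions, epsilon):
    # pick_next_action(Q, s): greedy w.r.t. Q with prob. 1-epsilon, else random
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learn(env, num_steps=10000, gamma=0.9, alpha=0.1, epsilon=0.1):
    Q = defaultdict(float)                    # Q(s,a) table, defaults to 0
    s = env.reset()
    for _ in range(num_steps):                # "Repeat ... Until (bored)"
        a = pick_eps_greedy(Q, s, env.actions, epsilon)
        s2, r, done = env.step(a)             # act_in_world(a)
        target = r + gamma * max(Q[(s2, a2)] for a2 in env.actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = env.reset() if done else s2
    return Q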

Page 4:

SARSA-learning algorithm

Algorithm: SARSA_learn
Inputs: State space S; Act. space A
        Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
s=get_current_world_state()
a=pick_next_action(Q,s)
Repeat {
  (r,s')=act_in_world(a)
  a'=pick_next_action(Q,s')
  Q(s,a)=Q(s,a)+α*(r+γ*Q(s',a')-Q(s,a))
  a=a'; s=s';
} Until (bored)
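A matching sketch for SARSA, under the same hypothetical env and pick_eps_greedy assumptions as the Q-learning example above; the only substantive change is that the bootstrap target uses the action actually chosen at s' rather than a max over a':

def sarsa_learn(env, num_steps=10000, gamma=0.9, alpha=0.1, epsilon=0.1):
    Q = defaultdict(float)
    s = env.reset()
    a = pick_eps_greedy(Q, s, env.actions, epsilon)       # pick a before the loop
    for _ in range(num_steps):
        s2, r, done = env.step(a)                          # act_in_world(a)
        a2 = pick_eps_greedy(Q, s2, env.actions, epsilon)
        # on-policy target: Q(s',a') for the action the policy will actually take
        Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
        if done:
            s = env.reset()
            a = pick_eps_greedy(Q, s, env.actions, epsilon)
        else:
            s, a = s2, a2
    return Q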

Page 5:

SARSA vs. Q

•SARSA and Q-learning very similar

•SARSA updates Q(s,a) for the policy it’s actually executing

•Lets the pick_next_action() function pick action to update

•Q updates Q(s,a) for greedy policy w.r.t. current Q

•Uses max_a to pick action to update

•might be different from the action it actually executes at s’

•In practice: Q will learn the “true” π*, but SARSA will learn about what it’s actually doing

•Exploration can get Q-learning in trouble...

Page 6:

Radioactive breadcrumbs

•Can now define eligibility traces for SARSA

•In addition to Q(s,a) table, keep an e(s,a) table

•Records “eligibility” (real number) for each state/action pair

•At every step ((s,a,r,s’,a’) tuple):

•Increment e(s,a) for current (s,a) pair by 1

•Update all Q(s’’,a’’) vals in proportion to their e(s’’,a’’)

•Decay all e(s’’,a’’) by factor of λγ

•Leslie Kaelbling calls this the “radioactive breadcrumbs” form of RL

Page 7:

SARSA(λ)-learning alg.

Algorithm: SARSA(λ)_learn
Inputs: S, A, γ (0<=γ<1), α (0<=α<1), λ (0<=λ<1)
Outputs: Q
e(s,a)=0 // for all s,a
s=get_curr_world_st(); a=pick_nxt_act(Q,s);
Repeat {
  (r,s')=act_in_world(a)
  a'=pick_next_action(Q,s')
  δ=r+γ*Q(s',a')-Q(s,a)
  e(s,a)+=1
  foreach (s'',a'') pair in (S x A) {
    Q(s'',a'')=Q(s'',a'')+α*e(s'',a'')*δ
    e(s'',a'')*=λγ
  }
  a=a'; s=s';
} Until (bored)
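A Python sketch of the same loop, using the hypothetical env and pick_eps_greedy helpers from the earlier sketches. Instead of sweeping all of S x A, the dict only holds pairs with nonzero eligibility, which is equivalent for small problems; clearing traces between episodes is an added assumption for episodic tasks (the slide's pseudocode is a continuing loop):

def sarsa_lambda_learn(env, num_steps=10000, gamma=0.9, alpha=0.1,
                       lam=0.9, epsilon=0.1):
    Q = defaultdict(float)
    e = defaultdict(float)          # eligibility e(s,a); only nonzero entries stored
    s = env.reset()
    a = pick_eps_greedy(Q, s, env.actions, epsilon)
    for _ in range(num_steps):
        s2, r, done = env.step(a)
        a2 = pick_eps_greedy(Q, s2, env.actions, epsilon)
        delta = r + gamma * Q[(s2, a2)] - Q[(s, a)]   # TD error δ
        e[(s, a)] += 1.0                              # accumulating trace
        for sa in list(e):
            Q[sa] += alpha * e[sa] * delta            # update in proportion to eligibility
            e[sa] *= lam * gamma                      # decay by λγ
        if done:
            e.clear()                                 # reset traces between episodes
            s = env.reset()
            a = pick_eps_greedy(Q, s, env.actions, epsilon)
        else:
            s, a = s2, a2
    return Q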

Page 8:

The trail of crumbs

Sutton & Barto, Sec 7.5

Page 9:

The trail of crumbs

Sutton & Barto, Sec 7.5

λ=0

Page 10:

The trail of crumbs

Sutton & Barto, Sec 7.5

Page 11:

Eligibility for a single state

[Figure: eligibility e(s_i,a_j) over time, marked at the 1st visit, 2nd visit, ... (Sutton & Barto, Sec 7.5)]

Page 12:

Eligibility trace followup

•Eligibility traces allow:

•Tracking where the agent has been

•Backup of rewards over longer periods

•Credit assignment: state/action pairs rewarded for having contributed to getting to the reward

•Why does it work?

Page 13:

The “forward view” of elig.

•Original SARSA did “one step” backup:

[Figure: info backed up into Q(s,a) comes from r_t and Q(s_{t+1},a_{t+1}); the rest of the trajectory is ignored]

Page 14:

The “forward view” of elig.

•Original SARSA did “one step” backup:

•Could also do a “two step backup”:

[Figure: info backed up into Q(s,a) comes from r_t, r_{t+1}, and Q(s_{t+2},a_{t+2}); the rest of the trajectory is ignored]

Page 15:

The “forward view” of elig.

•Original SARSA did “one step” backup:

•Could also do a “two step backup”:

•Or even an “n step backup”:
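Written out in the notation of these slides (the formula is the standard one, not printed on the slide), the n-step backup target is:

R_t^(n) = r_t + γ*r_{t+1} + γ^2*r_{t+2} + ... + γ^(n-1)*r_{t+n-1} + γ^n*Q(s_{t+n},a_{t+n})

and the update moves Q(s_t,a_t) toward that target: Q(s_t,a_t) += α*(R_t^(n) - Q(s_t,a_t)). Setting n=1 recovers the SARSA update from page 4.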

Page 16:

The “forward view” of elig.

•Small-step backups (n=1, n=2, etc.) are slow and nearsighted

•Large-step backups (n=100, n=1000, n=∞) are expensive and may miss near-term effects

•Want a way to combine them

•Can take a weighted average of different backups

•E.g.:

Page 17:

The “forward view” of elig.

[Figure: example weighted average of backups, with weights 1/3 and 2/3]

Page 18:

The “forward view” of elig.

•How do you know which number of steps to avg over? And what the weights should be?

•Accumulating eligibility traces are just a clever way to easily avg. over all n:
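Concretely (standard form, not spelled out on the slide), the averaged target implied by the traces is the λ-return:

R_t^λ = (1-λ) * Σ_{n>=1} λ^(n-1) * R_t^(n)

i.e., the n-step targets are weighted by λ^0, λ^1, λ^2, ..., and the (1-λ) factor makes the weights sum to 1. λ=0 recovers the one-step backup, while λ approaching 1 approaches a full Monte-Carlo backup of the whole trajectory.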

Page 19:

The “forward view” of elig.

[Figure: successive n-step backups weighted by λ^0, λ^1, λ^2, ..., λ^(n-1)]

Page 20:

Replacing traces

•The kind just described are accumulating e-traces

•Every time you go back to a state/action, add extra eligibility

•There are also replacing eligibility traces

•Every time you go back to a state/action, reset e(s,a) to 1

•Works better sometimes

Sutton & Barto, Sec 7.8
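In code, the difference from the accumulating-trace SARSA(λ) sketch on page 7 is a single line (same hypothetical e table as before):

# accumulating trace (as in the SARSA(λ) sketch): revisits add eligibility
e[(s, a)] += 1.0
# replacing trace: revisits reset eligibility to 1 instead
e[(s, a)] = 1.0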

Page 21:

Model-free vs. Model-based

Page 22:

What do you know?

•Both Q-learning and SARSA(λ) are model-free methods

•A.k.a., value-based methods

•Learn a Q function

•Never learn T or R explicitly

•At the end of learning, agent knows how to act, but doesn’t explicitly know anything about the environment

•Also, no guarantees about explore/exploit tradeoff

•Sometimes, want one or both of the above

Page 23:

Model-based methods

•Model-based methods, OTOH, do explicitly learn T & R

•At end of learning, have entire M = 〈S,A,T,R〉

•Also have π*

•At least one model-based method also guarantees explore/exploit tradeoff properties

Page 24:

E3

•Efficient Explore & Exploit algorithm

•Kearns & Singh, Machine Learning 49, 2002

•Explicitly keeps a T matrix and an R table

•Plan (policy iter) w/ curr. T & R -> curr. π

•Every state/action entry in T and R:

•Can be marked known or unknown

•Has a #visits counter, nv(s,a)

•After every 〈s,a,r,s'〉 tuple, update T & R (running average)

•When nv(s,a)>NVthresh, mark cell as known & re-plan

•When all states known, done learning & have π*

Page 25:

The E3 algorithm

Algorithm: E3_learn_sketch  // only an overview
Inputs: S, A, γ (0<=γ<1), NVthresh, Rmax, Varmax
Outputs: T, R, π*
Initialization:
  R(s)=Rmax                 // for all s
  T(s,a,s')=1/|S|           // for all s,a,s'
  known(s,a)=0; nv(s,a)=0;  // for all s,a
  π=policy_iter(S,A,T,R)

Page 26:

The E3 algorithm

Algorithm: E3_learn_sketch  // con't
Repeat {
  s=get_current_world_state()
  a=π(s)
  (r,s')=act_in_world(a)
  T(s,a,s')=(1+T(s,a,s')*nv(s,a))/(nv(s,a)+1)
  nv(s,a)++;
  if (nv(s,a)>NVthresh) {
    known(s,a)=1;
    π=policy_iter(S,A,T,R)
  }
} Until (all (s,a) known)
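A Python sketch of the same loop, with integer-indexed states and actions. The env interface and the planner argument (standing in for policy_iter) are assumptions introduced here; the reward model is also averaged, which the bullets on page 24 describe but the pseudocode above omits:

import numpy as np

def e3_learn(env, nS, nA, planner, gamma=0.9, nv_thresh=50, r_max=1.0):
    # Optimistic initialization: unknown rewards look like Rmax,
    # unknown transitions look uniform over all states.
    T = np.full((nS, nA, nS), 1.0 / nS)
    R = np.full((nS, nA), r_max)
    nv = np.zeros((nS, nA), dtype=int)         # visit counters nv(s,a)
    known = np.zeros((nS, nA), dtype=bool)
    pi = planner(T, R, gamma)                  # policy_iter: returns array mapping state -> action
    s = env.reset()
    while not known.all():                     # "Until (all (s,a) known)"
        a = pi[s]
        s2, r, done = env.step(a)
        n = nv[s, a]
        # running averages of the transition and reward models
        T[s, a] *= n / (n + 1)
        T[s, a, s2] += 1.0 / (n + 1)           # matches T(s,a,s')=(1+T(s,a,s')*nv)/(nv+1)
        R[s, a] = (R[s, a] * n + r) / (n + 1)
        nv[s, a] = n + 1
        if nv[s, a] > nv_thresh and not known[s, a]:
            known[s, a] = True
            pi = planner(T, R, gamma)          # re-plan with the updated model
        s = env.reset() if done else s2
    return T, R, pi

Note this is only the bookkeeping loop the slides sketch; the full E3 algorithm of Kearns & Singh also reasons explicitly about when to explore versus exploit among the known states, which this overview skips.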