RL - Worksheet (worked exercise)

Ata Kaban ([email protected])
School of Computer Science, University of Birmingham
Exercise

The figure below depicts a 4-state grid world in which state 2 represents the 'gold'. Using the immediate reward values shown in the figure and the Q-learning algorithm, perform anticlockwise circuits over the four states, updating the state-action table as you go.

[Figure: a 2x2 grid world with states laid out as

    1 | 2
    3 | 4

State 2 holds the gold. Immediate rewards: actions entering state 2 earn 50, actions entering state 1 earn -10, and every other action earns -2.]

Note: here, the Q-table will be updated after each cycle.
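The world is small enough to encode directly. A minimal Python sketch, assuming the transition and reward structure read off the figure (the state numbers are the worksheet's; the action labels "up"/"down"/"left"/"right" are our own naming):

```python
# Deterministic 4-state grid world from the worksheet, laid out as
#   1 | 2
#   3 | 4
# Each (state, action) maps to (next_state, immediate_reward).
# Moves into state 2 (the gold) earn 50, moves into state 1 earn -10,
# every other move earns -2.
MODEL = {
    (1, "down"):  (3, -2),
    (1, "right"): (2, 50),
    (2, "down"):  (4, -2),
    (2, "left"):  (1, -10),
    (3, "up"):    (1, -10),
    (3, "right"): (4, -2),
    (4, "up"):    (2, 50),
    (4, "left"):  (3, -2),
}
```

Only two actions are available in each state here, matching the arrows drawn in the figure.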
Solution

Initialise each entry of the table of Q values to zero:

    Q        ↑      ↓      →      ←
    1        0      0      0      0
    2        0      0      0      0
    3        0      0      0      0
    4        0      0      0      0

Then iterate with the deterministic Q-learning update (discount factor γ = 0.9):

    Q(s, a) ← r(s, a) + γ · max{ Q(s_new, a_new) : for all actions a_new }

where s_new is the state reached by taking action a in state s, and the new value overwrites the old entry Q(s, a).
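The update rule above can be sketched as a short Python function. The MODEL dict below is our own hypothetical encoding of the worksheet's transitions and rewards, not part of the original slides:

```python
GAMMA = 0.9  # discount factor used throughout the worksheet

# (state, action) -> (next_state, immediate_reward), read off the figure
MODEL = {
    (1, "down"):  (3, -2),  (1, "right"): (2, 50),
    (2, "down"):  (4, -2),  (2, "left"):  (1, -10),
    (3, "up"):    (1, -10), (3, "right"): (4, -2),
    (4, "up"):    (2, 50),  (4, "left"):  (3, -2),
}

Q = {sa: 0.0 for sa in MODEL}  # initialise every entry to zero

def q_update(Q, s, a):
    """One deterministic Q-learning step:
    Q(s,a) <- r(s,a) + gamma * max over a_new of Q(s_new, a_new)."""
    s_new, r = MODEL[(s, a)]
    best_next = max(q for (st, _), q in Q.items() if st == s_new)
    Q[(s, a)] = r + GAMMA * best_next
    return Q[(s, a)]
```

For example, the first three updates of the first circuit give q_update(Q, 3, "right") = -2, then q_update(Q, 4, "up") = 50, after which updating Q(3, →) again yields -2 + 0.9 · 50 = 43, exactly as on the next slides.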
First circuit:

Q(3, →) = -2 + 0.9 · max{Q(4, ↑), Q(4, ←)} = -2 + 0.9 · max{0, 0} = -2
Q(4, ↑) = 50 + 0.9 · max{Q(2, ↓), Q(2, ←)} = 50 + 0.9 · max{0, 0} = 50
Q(2, ←) = -10 + 0.9 · max{Q(1, ↓), Q(1, →)} = -10 + 0.9 · max{0, 0} = -10
Q(1, ↓) = -2 + 0.9 · max{Q(3, ↑), Q(3, →)} = -2 + 0.9 · max{0, -2} = -2
Q(3, →) = -2 + 0.9 · max{Q(4, ←), Q(4, ↑)} = -2 + 0.9 · max{0, 50} = 43

Q-table after the first circuit:

    Q        ↑      ↓      →      ←
    1        -     -2      0      -
    2        -      0      -    -10
    3        0      -     43      -
    4       50      -      -      0
Second circuit:

Q(4, ↑) = 50 + 0.9 · max{Q(2, ↓), Q(2, ←)} = 50 + 0.9 · max{0, -10} = 50
Q(2, ←) = -10 + 0.9 · max{Q(1, →), Q(1, ↓)} = -10 + 0.9 · max{0, -2} = -10
Q(1, ↓) = -2 + 0.9 · max{Q(3, ↑), Q(3, →)} = -2 + 0.9 · max{0, 43} = 36.7
Q(3, →) = -2 + 0.9 · max{Q(4, ←), Q(4, ↑)} = -2 + 0.9 · max{0, 50} = 43

Immediate rewards r:

    r        ↑      ↓      →      ←
    1        -     -2     50      -
    2        -     -2      -    -10
    3      -10      -     -2      -
    4       50      -      -     -2

Q-table after the second circuit:

    Q        ↑      ↓      →      ←
    1        -   36.7      0      -
    2        -      0      -    -10
    3        0      -     43      -
    4       50      -      -      0
Third circuit:

Q(4, ↑) = 50 + 0.9 · max{Q(2, ↓), Q(2, ←)} = 50 + 0.9 · max{0, -10} = 50
Q(2, ←) = -10 + 0.9 · max{Q(1, →), Q(1, ↓)} = -10 + 0.9 · max{0, 36.7} = 23.03
Q(1, ↓) = -2 + 0.9 · max{Q(3, ↑), Q(3, →)} = -2 + 0.9 · max{0, 43} = 36.7
Q(3, →) = -2 + 0.9 · max{Q(4, ←), Q(4, ↑)} = -2 + 0.9 · max{0, 50} = 43

Q-table after the third circuit:

    Q        ↑      ↓      →      ←
    1        -   36.7      0      -
    2        -      0      -  23.03
    3        0      -     43      -
    4       50      -      -      0
Fourth circuit:

Q(4, ↑) = 50 + 0.9 · max{Q(2, ↓), Q(2, ←)} = 50 + 0.9 · max{0, 23.03} = 70.73
Q(2, ←) = -10 + 0.9 · max{Q(1, →), Q(1, ↓)} = -10 + 0.9 · max{0, 36.7} = 23.03
Q(1, ↓) = -2 + 0.9 · max{Q(3, ↑), Q(3, →)} = -2 + 0.9 · max{0, 43} = 36.7
Q(3, →) = -2 + 0.9 · max{Q(4, ←), Q(4, ↑)} = -2 + 0.9 · max{0, 70.73} = 61.66

Q-table after the fourth circuit:

    Q        ↑      ↓      →      ←
    1        -   36.7      0      -
    2        -      0      -  23.03
    3        0      -  61.66      -
    4    70.73      -      -      0
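The four circuits can be replayed in code. This sketch follows the slides' update order (the lone Q(3, →) update first, then four repetitions of 4 → 2 → 1 → 3); MODEL is our hypothetical encoding of the figure. One caveat: computed without intermediate rounding, the final Q(3, →) is -2 + 0.9 · 70.727 ≈ 61.65, while the slides show 61.66 because they round Q(4, ↑) to 70.73 first.

```python
GAMMA = 0.9

# (state, action) -> (next_state, immediate_reward), read off the figure
MODEL = {
    (1, "down"):  (3, -2),  (1, "right"): (2, 50),
    (2, "down"):  (4, -2),  (2, "left"):  (1, -10),
    (3, "up"):    (1, -10), (3, "right"): (4, -2),
    (4, "up"):    (2, 50),  (4, "left"):  (3, -2),
}

Q = {sa: 0.0 for sa in MODEL}

def q_update(s, a):
    s_new, r = MODEL[(s, a)]
    Q[(s, a)] = r + GAMMA * max(q for (st, _), q in Q.items() if st == s_new)

q_update(3, "right")                       # first step of the first circuit
for _ in range(4):                         # four anticlockwise circuits
    for s, a in [(4, "up"), (2, "left"), (1, "down"), (3, "right")]:
        q_update(s, a)
```

After the loop, Q(4, ↑) ≈ 70.73, Q(2, ←) ≈ 23.03, Q(1, ↓) = 36.7 and Q(3, →) ≈ 61.65, matching the table above up to the rounding noted in the lead-in.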
Optional material: convergence proof of Q-learning

Recall the sketch of the proof:
• Consider the case of a deterministic world in which each pair (s,a) is visited infinitely often.
• Define a full interval as an interval during which each (s,a) is visited.
• Show that during any such interval, the absolute value of the largest error in the Q-table is reduced by a factor of γ.
• Consequently, since γ < 1, after infinitely many updates the largest error converges to zero.
Solution

• Let Q̂_n be the Q-table after n updates, and let e_n be the maximum error in this table:

    e_n = max_{s,a} |Q̂_n(s,a) - Q(s,a)|

• What is the maximum error after the (n+1)-th update? For the entry (s,a) just updated, with s' the state reached from s by action a:

    |Q̂_{n+1}(s,a) - Q(s,a)|
        = |(r + γ max_{a'} Q̂_n(s',a')) - (r + γ max_{a'} Q(s',a'))|
        = γ |max_{a'} Q̂_n(s',a') - max_{a'} Q(s',a')|
        ≤ γ max_{a'} |Q̂_n(s',a') - Q(s',a')|
        ≤ γ max_{s'',a'} |Q̂_n(s'',a') - Q(s'',a')|
        = γ e_n

(The first inequality uses |max_x f(x) - max_x g(x)| ≤ max_x |f(x) - g(x)|.)
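The γ-contraction can be checked numerically on the worksheet's world. The sketch below (same hypothetical MODEL encoding as before) first approximates the true Q by iterating to convergence, then verifies that each full sweep, which visits every (s,a) once, shrinks the maximum error by at least a factor of γ:

```python
GAMMA = 0.9

# (state, action) -> (next_state, immediate_reward), read off the figure
MODEL = {
    (1, "down"):  (3, -2),  (1, "right"): (2, 50),
    (2, "down"):  (4, -2),  (2, "left"):  (1, -10),
    (3, "up"):    (1, -10), (3, "right"): (4, -2),
    (4, "up"):    (2, 50),  (4, "left"):  (3, -2),
}

def sweep(Q):
    """Visit every (s,a) once -- one 'full interval' in the proof's sense."""
    for (s, a), (s_new, r) in MODEL.items():
        Q[(s, a)] = r + GAMMA * max(q for (st, _), q in Q.items() if st == s_new)

# Approximate the true Q by sweeping until numerically converged.
Q_true = {sa: 0.0 for sa in MODEL}
for _ in range(1000):
    sweep(Q_true)

# Fresh learner: record the maximum error e_n after each full sweep.
Q = {sa: 0.0 for sa in MODEL}
errors = []
for _ in range(10):
    sweep(Q)
    errors.append(max(abs(Q[sa] - Q_true[sa]) for sa in MODEL))

# Each consecutive pair satisfies e_{n+1} <= gamma * e_n, as the proof claims.
```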
• Observation: no assumption was made about the action sequence! Thus, Q-learning can learn the Q function (and hence the optimal policy) while training from actions chosen at random, as long as the resulting training sequence visits every (state, action) pair infinitely often.
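This observation can be demonstrated empirically on the worksheet's world. A sketch with the same hypothetical MODEL encoding, training on uniformly random (state, action) choices (the seed and iteration count are arbitrary):

```python
import random

GAMMA = 0.9

# (state, action) -> (next_state, immediate_reward), read off the figure
MODEL = {
    (1, "down"):  (3, -2),  (1, "right"): (2, 50),
    (2, "down"):  (4, -2),  (2, "left"):  (1, -10),
    (3, "up"):    (1, -10), (3, "right"): (4, -2),
    (4, "up"):    (2, 50),  (4, "left"):  (3, -2),
}

random.seed(0)
Q = {sa: 0.0 for sa in MODEL}
pairs = list(MODEL)

# Train on (s,a) pairs chosen completely at random.
for _ in range(10_000):
    s, a = random.choice(pairs)
    s_new, r = MODEL[(s, a)]
    Q[(s, a)] = r + GAMMA * max(q for (st, _), q in Q.items() if st == s_new)

# Greedy policy from the learned Q: every state heads towards the gold (state 2).
policy = {s: max((a for (st, a) in MODEL if st == s), key=lambda a: Q[(s, a)])
          for s in (1, 2, 3, 4)}
```

Despite the random action choices, the greedy policy sends 1 → right and 4 → up (both straight into the gold state), and 2 → down, 3 → right (onto the high-value 2 ↔ 4 cycle).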