RL - Worksheet (worked exercise)

Ata Kaban ([email protected])
School of Computer Science, University of Birmingham
Exercise

The figure below depicts a 4-state grid world in which state 2 represents the 'gold'. Using the immediate reward values shown in the figure and the Q-learning algorithm, perform anticlockwise circuits over the four states, updating the state-action table as you go.

[Figure: a 2x2 grid world with states laid out as

    1 | 2
    3 | 4

State 2 holds the gold. Immediate rewards: actions entering state 2 earn 50, actions entering state 1 earn -10, and every other action earns -2.]

Note: here, the Q-table will be updated after each cycle.
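The world is small enough to encode directly. A minimal Python sketch, assuming the transition and reward structure read off the figure (the state numbers are the worksheet's; the action labels "up"/"down"/"left"/"right" are our own naming):

```python
# Deterministic 4-state grid world from the worksheet, laid out as
#   1 | 2
#   3 | 4
# Each (state, action) maps to (next_state, immediate_reward).
# Moves into state 2 (the gold) earn 50, moves into state 1 earn -10,
# every other move earns -2.
MODEL = {
    (1, "down"):  (3, -2),
    (1, "right"): (2, 50),
    (2, "down"):  (4, -2),
    (2, "left"):  (1, -10),
    (3, "up"):    (1, -10),
    (3, "right"): (4, -2),
    (4, "up"):    (2, 50),
    (4, "left"):  (3, -2),
}
```

Only two actions are available in each state here, matching the arrows drawn in the figure.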
Solution

Initialise each entry of the table of Q values to zero:

    Q        ↑      ↓      →      ←
    1        0      0      0      0
    2        0      0      0      0
    3        0      0      0      0
    4        0      0      0      0

Then iterate with the deterministic Q-learning update (discount factor γ = 0.9):

    Q(s, a) ← r(s, a) + γ · max{ Q(s_new, a_new) : for all actions a_new }

where s_new is the state reached by taking action a in state s, and the new value overwrites the old entry Q(s, a).
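The update rule above can be sketched as a short Python function. The MODEL dict below is our own hypothetical encoding of the worksheet's transitions and rewards, not part of the original slides:

```python
GAMMA = 0.9  # discount factor used throughout the worksheet

# (state, action) -> (next_state, immediate_reward), read off the figure
MODEL = {
    (1, "down"):  (3, -2),  (1, "right"): (2, 50),
    (2, "down"):  (4, -2),  (2, "left"):  (1, -10),
    (3, "up"):    (1, -10), (3, "right"): (4, -2),
    (4, "up"):    (2, 50),  (4, "left"):  (3, -2),
}

Q = {sa: 0.0 for sa in MODEL}  # initialise every entry to zero

def q_update(Q, s, a):
    """One deterministic Q-learning step:
    Q(s,a) <- r(s,a) + gamma * max over a_new of Q(s_new, a_new)."""
    s_new, r = MODEL[(s, a)]
    best_next = max(q for (st, _), q in Q.items() if st == s_new)
    Q[(s, a)] = r + GAMMA * best_next
    return Q[(s, a)]
```

For example, the first three updates of the first circuit give q_update(Q, 3, "right") = -2, then q_update(Q, 4, "up") = 50, after which updating Q(3, →) again yields -2 + 0.9 · 50 = 43, exactly as on the next slides.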
First circuit:

Q(3, →) = -2 + 0.9 · max{Q(4, ↑), Q(4, ←)} = -2 + 0.9 · max{0, 0} = -2
Q(4, ↑) = 50 + 0.9 · max{Q(2, ↓), Q(2, ←)} = 50 + 0.9 · max{0, 0} = 50
Q(2, ←) = -10 + 0.9 · max{Q(1, ↓), Q(1, →)} = -10 + 0.9 · max{0, 0} = -10
Q(1, ↓) = -2 + 0.9 · max{Q(3, ↑), Q(3, →)} = -2 + 0.9 · max{0, -2} = -2
Q(3, →) = -2 + 0.9 · max{Q(4, ←), Q(4, ↑)} = -2 + 0.9 · max{0, 50} = 43

Q-table after the first circuit:

    Q        ↑      ↓      →      ←
    1        -     -2      0      -
    2        -      0      -    -10
    3        0      -     43      -
    4       50      -      -      0
Second circuit:

Q(4, ↑) = 50 + 0.9 · max{Q(2, ↓), Q(2, ←)} = 50 + 0.9 · max{0, -10} = 50
Q(2, ←) = -10 + 0.9 · max{Q(1, →), Q(1, ↓)} = -10 + 0.9 · max{0, -2} = -10
Q(1, ↓) = -2 + 0.9 · max{Q(3, ↑), Q(3, →)} = -2 + 0.9 · max{0, 43} = 36.7
Q(3, →) = -2 + 0.9 · max{Q(4, ←), Q(4, ↑)} = -2 + 0.9 · max{0, 50} = 43

Immediate rewards r:

    r        ↑      ↓      →      ←
    1        -     -2     50      -
    2        -     -2      -    -10
    3      -10      -     -2      -
    4       50      -      -     -2

Q-table after the second circuit:

    Q        ↑      ↓      →      ←
    1        -   36.7      0      -
    2        -      0      -    -10
    3        0      -     43      -
    4       50      -      -      0
Third circuit:

Q(4, ↑) = 50 + 0.9 · max{Q(2, ↓), Q(2, ←)} = 50 + 0.9 · max{0, -10} = 50
Q(2, ←) = -10 + 0.9 · max{Q(1, →), Q(1, ↓)} = -10 + 0.9 · max{0, 36.7} = 23.03
Q(1, ↓) = -2 + 0.9 · max{Q(3, ↑), Q(3, →)} = -2 + 0.9 · max{0, 43} = 36.7
Q(3, →) = -2 + 0.9 · max{Q(4, ←), Q(4, ↑)} = -2 + 0.9 · max{0, 50} = 43

Q-table after the third circuit:

    Q        ↑      ↓      →      ←
    1        -   36.7      0      -
    2        -      0      -  23.03
    3        0      -     43      -
    4       50      -      -      0
Fourth circuit:

Q(4, ↑) = 50 + 0.9 · max{Q(2, ↓), Q(2, ←)} = 50 + 0.9 · max{0, 23.03} = 70.73
Q(2, ←) = -10 + 0.9 · max{Q(1, →), Q(1, ↓)} = -10 + 0.9 · max{0, 36.7} = 23.03
Q(1, ↓) = -2 + 0.9 · max{Q(3, ↑), Q(3, →)} = -2 + 0.9 · max{0, 43} = 36.7
Q(3, →) = -2 + 0.9 · max{Q(4, ←), Q(4, ↑)} = -2 + 0.9 · max{0, 70.73} = 61.66

Q-table after the fourth circuit:

    Q        ↑      ↓      →      ←
    1        -   36.7      0      -
    2        -      0      -  23.03
    3        0      -  61.66      -
    4    70.73      -      -      0
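The four circuits can be replayed in code. This sketch follows the slides' update order (the lone Q(3, →) update first, then four repetitions of 4 → 2 → 1 → 3); MODEL is our hypothetical encoding of the figure. One caveat: computed without intermediate rounding, the final Q(3, →) is -2 + 0.9 · 70.727 ≈ 61.65, while the slides show 61.66 because they round Q(4, ↑) to 70.73 first.

```python
GAMMA = 0.9

# (state, action) -> (next_state, immediate_reward), read off the figure
MODEL = {
    (1, "down"):  (3, -2),  (1, "right"): (2, 50),
    (2, "down"):  (4, -2),  (2, "left"):  (1, -10),
    (3, "up"):    (1, -10), (3, "right"): (4, -2),
    (4, "up"):    (2, 50),  (4, "left"):  (3, -2),
}

Q = {sa: 0.0 for sa in MODEL}

def q_update(s, a):
    s_new, r = MODEL[(s, a)]
    Q[(s, a)] = r + GAMMA * max(q for (st, _), q in Q.items() if st == s_new)

q_update(3, "right")                       # first step of the first circuit
for _ in range(4):                         # four anticlockwise circuits
    for s, a in [(4, "up"), (2, "left"), (1, "down"), (3, "right")]:
        q_update(s, a)
```

After the loop, Q(4, ↑) ≈ 70.73, Q(2, ←) ≈ 23.03, Q(1, ↓) = 36.7 and Q(3, →) ≈ 61.65, matching the table above up to the rounding noted in the lead-in.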
Optional material: convergence proof of Q-learning

Recall the sketch of the proof:
• Consider the case of a deterministic world in which each pair (s,a) is visited infinitely often.
• Define a full interval as an interval during which each (s,a) is visited.
• Show that during any such interval, the absolute value of the largest error in the Q-table is reduced by a factor of γ.
• Consequently, since γ < 1, after infinitely many updates the largest error converges to zero.
Solution

• Let Q̂_n be the Q-table after n updates, and let e_n be the maximum error in this table:

    e_n = max_{s,a} |Q̂_n(s,a) - Q(s,a)|

• What is the maximum error after the (n+1)-th update? For the entry (s,a) just updated, with s' the state reached from s by action a:

    |Q̂_{n+1}(s,a) - Q(s,a)|
        = |(r + γ max_{a'} Q̂_n(s',a')) - (r + γ max_{a'} Q(s',a'))|
        = γ |max_{a'} Q̂_n(s',a') - max_{a'} Q(s',a')|
        ≤ γ max_{a'} |Q̂_n(s',a') - Q(s',a')|
        ≤ γ max_{s'',a'} |Q̂_n(s'',a') - Q(s'',a')|
        = γ e_n

(The first inequality uses |max_x f(x) - max_x g(x)| ≤ max_x |f(x) - g(x)|.)
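The γ-contraction can be checked numerically on the worksheet's world. The sketch below (same hypothetical MODEL encoding as before) first approximates the true Q by iterating to convergence, then verifies that each full sweep, which visits every (s,a) once, shrinks the maximum error by at least a factor of γ:

```python
GAMMA = 0.9

# (state, action) -> (next_state, immediate_reward), read off the figure
MODEL = {
    (1, "down"):  (3, -2),  (1, "right"): (2, 50),
    (2, "down"):  (4, -2),  (2, "left"):  (1, -10),
    (3, "up"):    (1, -10), (3, "right"): (4, -2),
    (4, "up"):    (2, 50),  (4, "left"):  (3, -2),
}

def sweep(Q):
    """Visit every (s,a) once -- one 'full interval' in the proof's sense."""
    for (s, a), (s_new, r) in MODEL.items():
        Q[(s, a)] = r + GAMMA * max(q for (st, _), q in Q.items() if st == s_new)

# Approximate the true Q by sweeping until numerically converged.
Q_true = {sa: 0.0 for sa in MODEL}
for _ in range(1000):
    sweep(Q_true)

# Fresh learner: record the maximum error e_n after each full sweep.
Q = {sa: 0.0 for sa in MODEL}
errors = []
for _ in range(10):
    sweep(Q)
    errors.append(max(abs(Q[sa] - Q_true[sa]) for sa in MODEL))

# Each consecutive pair satisfies e_{n+1} <= gamma * e_n, as the proof claims.
```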
• Observation: no assumption was made about the action sequence! Thus, Q-learning can learn the Q function (and hence the optimal policy) while training from actions chosen at random, as long as the resulting training sequence visits every (state, action) pair infinitely often.
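This observation can be demonstrated empirically on the worksheet's world. A sketch with the same hypothetical MODEL encoding, training on uniformly random (state, action) choices (the seed and iteration count are arbitrary):

```python
import random

GAMMA = 0.9

# (state, action) -> (next_state, immediate_reward), read off the figure
MODEL = {
    (1, "down"):  (3, -2),  (1, "right"): (2, 50),
    (2, "down"):  (4, -2),  (2, "left"):  (1, -10),
    (3, "up"):    (1, -10), (3, "right"): (4, -2),
    (4, "up"):    (2, 50),  (4, "left"):  (3, -2),
}

random.seed(0)
Q = {sa: 0.0 for sa in MODEL}
pairs = list(MODEL)

# Train on (s,a) pairs chosen completely at random.
for _ in range(10_000):
    s, a = random.choice(pairs)
    s_new, r = MODEL[(s, a)]
    Q[(s, a)] = r + GAMMA * max(q for (st, _), q in Q.items() if st == s_new)

# Greedy policy from the learned Q: every state heads towards the gold (state 2).
policy = {s: max((a for (st, a) in MODEL if st == s), key=lambda a: Q[(s, a)])
          for s in (1, 2, 3, 4)}
```

Despite the random action choices, the greedy policy sends 1 → right and 4 → up (both straight into the gold state), and 2 → down, 3 → right (onto the high-value 2 ↔ 4 cycle).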