Computer Science Education Lab, UMass Amherst (based on CMPSCI 689 slides: lec23_rl, Markov Decision Process)
Reinforcement Learning
Kevin Spiteri
April 21, 2015
n-armed bandit

Three arms with true payoff probabilities 0.9, 0.5, and 0.1. For each arm we track an estimate (payoff / attempts), the number of attempts, and the accumulated payoff, all starting at zero. The slides step through the first few pulls, updating each arm's statistics along with the running total of attempts and the average payoff per pull (1.0 after one successful pull, 0.5 after two pulls, 0.67 after three).

Exploration

An occasional pull of a non-greedy arm keeps the estimates of the other arms up to date.
Going on …

True payoff probabilities: 0.9, 0.5, 0.1

           arm 1   arm 2   arm 3
estimate     0.9     0.5     0.1
attempts     280      10      10
payoff       252       5       1

Total attempts: 300; total payoff: 258; average payoff per pull: 0.86.
Changing environment

The true probabilities change to 0.7, 0.8, 0.1, while the learned statistics still reflect the old environment:

           arm 1   arm 2   arm 3
estimate     0.9     0.5     0.1
attempts     280      10      10
payoff       252       5       1

Total attempts: 300; average payoff per pull: 0.86.
Changing environment

True payoff probabilities: 0.7, 0.8, 0.1

           arm 1   arm 2   arm 3
estimate     0.8    0.65     0.1
attempts     560      20      20
payoff       448      13       2

Total attempts: 600; total payoff: 463; average payoff per pull: 0.77.
Changing environment

True payoff probabilities: 0.7, 0.8, 0.1

           arm 1   arm 2   arm 3
estimate    0.74    0.74     0.1
attempts    1400      50      50
payoff      1036      37       5

Total attempts: 1500; total payoff: 1078; average payoff per pull: 0.72.
n-armed bandit

● Optimal payoff (0.82): 0.9 × 300 + 0.8 × 1200 = 1230
● Actual payoff (0.72): 0.9 × 280 + 0.5 × 10 + 0.1 × 10 + 0.7 × 1120 + 0.8 × 40 + 0.1 × 40 = 1078
n-armed bandit
● Evaluation vs instruction.
● Discounting.
● Initial estimates.
● There is no single best or standard way.
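The bandit walkthrough above can be sketched in a few lines. This is a minimal, illustrative implementation (names and parameter values are assumptions, not from the slides): an epsilon-greedy learner over three arms with payoff probabilities 0.9, 0.5, 0.1. The constant step size plays the role of discounting old samples, so the estimates can track a changing environment.

```python
import random

def run_bandit(probs, steps=1000, epsilon=0.1, alpha=0.1, seed=0):
    rng = random.Random(seed)
    estimates = [0.0] * len(probs)   # current payoff estimate per arm
    total_payoff = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            # explore: pull a random arm
            arm = rng.randrange(len(probs))
        else:
            # exploit: pull the arm with the highest current estimate
            arm = max(range(len(probs)), key=lambda i: estimates[i])
        reward = 1 if rng.random() < probs[arm] else 0
        # move the estimate a fraction alpha toward the new sample,
        # rather than averaging over all history
        estimates[arm] += alpha * (reward - estimates[arm])
        total_payoff += reward
    return estimates, total_payoff

estimates, payoff = run_bandit([0.9, 0.5, 0.1])
```

With a small epsilon most pulls go to the best-looking arm, so the average payoff per pull approaches the best arm's probability rather than the optimum.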
Markov Decision Process (MDP)

An MDP is built up from:
● States
● Actions
● Model: transition probabilities for each action (in the diagram, action a leads to one state with probability 0.25 and to another with probability 0.75)
● Reward: a number on each transition (5, -1, and 0 in the diagram)
● Policy: which action to take in each state
Markov Decision Process (MDP)

● States: ball on table (t), in hand (h), in basket (b), on floor (f)
● Actions: a) attempt  b) drop  c) wait
● Model: from the hand, an attempt lands in the basket with probability 0.25 and on the floor with probability 0.75
● Reward: 5 for the basket, -1 for the floor, 0 otherwise

Expected reward per round: 0.25 × 5 + 0.75 × (-1) = 0.5

Later slides change the 0 reward to -1.
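The model and reward components above can be encoded as a small transition table. This is a hedged sketch of the ball MDP, not a transcription of the full slide diagram: only the "attempt" and "wait" transitions from the hand are filled in, as assumptions matching the numbers quoted above.

```python
# model[state][action] = list of (probability, next_state, reward)
model = {
    "hand": {
        "attempt": [(0.25, "basket", 5), (0.75, "floor", -1)],
        "wait":    [(1.0, "hand", 0)],
    },
}

def expected_reward(state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _, r in model[state][action])

print(expected_reward("hand", "attempt"))  # 0.25*5 + 0.75*(-1) = 0.5
```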
Reinforcement Learning Tools
● Dynamic Programming
● Monte Carlo Methods
● Temporal Difference Learning
Grid World

Reward:
● Normal move: -1
● Over obstacle: -10
● Best reward: -15
Optimal Policy
Value Function

-15  -8  -7   0
-14  -9  -6  -1
-13 -10  -5  -2
-12 -11  -4  -3
Initial Policy
Policy Iteration

Value of the initial policy:

-21 -11 -10   0
-22 -12 -11  -1
-23 -13 -12  -2
-24 -14 -13  -3
Policy Iteration
Policy Iteration

After one improvement step, some states switch to a better action:

-21 -11 -10   0
-22 -12 -11  -1
-23 -13 -12  -2
-15 -14  -4  -3
Policy Iteration
Policy Iteration

Converged to the optimal values:

-15  -8  -7   0
-14  -9  -6  -1
-13 -10  -5  -2
-12 -11  -4  -3
Value Iteration

Start from all zeros:

 0   0   0   0
 0   0   0   0
 0   0   0   0
 0   0   0   0
Value Iteration

After one sweep:

-1  -1  -1   0
-1  -1  -1  -1
-1  -1  -1  -1
-1  -1  -1  -1
Value Iteration

After two sweeps:

-2  -2  -2   0
-2  -2  -2  -1
-2  -2  -2  -2
-2  -2  -2  -2
Value Iteration

After three sweeps:

-3  -3  -3   0
-3  -3  -3  -1
-3  -3  -3  -2
-3  -3  -3  -3
Value Iteration

At convergence:

-15  -8  -7   0
-14  -9  -6  -1
-13 -10  -5  -2
-12 -11  -4  -3
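The sweeps above are straightforward to code. This is a hedged sketch: the slides' obstacle layout lives only in the figures, so this version uses a plain 4 × 4 grid with the goal in the top-right corner, step reward -1, and no obstacles; all names and constants are illustrative.

```python
ROWS, COLS = 4, 4
GOAL = (0, 3)          # top-right corner
STEP_REWARD = -1

def neighbors(r, c):
    """The cells reachable in one move (up/down/left/right)."""
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS:
            yield nr, nc

# start from all zeros, then apply synchronous greedy backups
V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
for _ in range(20):    # plenty of sweeps to converge on this small grid
    V = {s: 0.0 if s == GOAL
         else max(STEP_REWARD + V[n] for n in neighbors(*s))
         for s in V}
```

After convergence each cell holds minus its step distance to the goal, mirroring the slides' pattern of values counting down from the goal.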
Stochastic Model

Each move now goes in the intended direction with probability 0.95, and slips sideways with probability 0.025 to either side.
Value Iteration

With the stochastic model (0.95 intended, 0.025 to each side), the values become:

-19.2 -10.4  -9.3    0
-18.1 -12.1  -8.2  -1.5
-17.0 -13.6  -6.7  -2.9
-15.7 -14.7  -5.1  -4.0

E.g. for the cell with cost 13.6, the chosen action satisfies
13.6 = 0.950 × 13.1 + 0.025 × 27.0 + 0.025 × 16.7
while an alternative action would cost
16.6 = 0.950 × 16.7 + 0.025 × 13.1 + 0.025 × 15.7
Richard Bellman
Bellman Equation
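The equation itself appears only as an image on the slide. In its standard textbook form (stated here as general background, not transcribed from the slide), the Bellman optimality equation for state values is:

```latex
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma V^{*}(s') \bigr]
```

with discount factor \(\gamma\). Value iteration repeatedly applies the right-hand side as an update until the values stop changing.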
Reinforcement Learning Tools
● Dynamic Programming
● Monte Carlo Methods
● Temporal Difference Learning
Monte Carlo Methods

The same stochastic model applies: each move goes in the intended direction with probability 0.95 and slips sideways with probability 0.025 to either side.
Monte Carlo Methods

Run a whole episode under the current policy, then update each visited state with the return actually observed from it. In the slides' sample episodes, the returns observed along the path to the goal are values such as -32, -31, -21, -11, -10, 0.
Q-Value

Instead of one value per state, keep a value per state-action pair. In the slide's example, the Q-values of the four actions from one state are -15, -10, -8, and -20, so the greedy choice is the action with Q-value -8. The Bellman equation carries over to Q-values in the same way.
Learning Rate

● We do not replace an old Q-value with a new one.
● We update at a designed learning rate.
● Learning rate too small: slow to converge.
● Learning rate too large: unstable.
● Will Dabney's PhD thesis: Adaptive Step-Sizes for Reinforcement Learning.
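The update rule described above can be shown in one line: rather than replacing an old Q-value with a new sample, move it toward the sample at learning rate alpha. The function name and the sample values are illustrative.

```python
def update(q_old, sample, alpha):
    # blend: alpha = 1.0 would replace, alpha = 0.0 would ignore the sample
    return q_old + alpha * (sample - q_old)

q = 0.0
for sample in [10.0, 10.0, 10.0, 10.0]:
    q = update(q, sample, alpha=0.5)
# q moves toward 10: 5.0, 7.5, 8.75, 9.375
```

A small alpha smooths out noisy samples but converges slowly; a large alpha tracks quickly but can oscillate.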
Reinforcement Learning Tools
● Dynamic Programming
● Monte Carlo Methods
● Temporal Difference Learning
Richard Sutton
Temporal Difference Learning

● Dynamic Programming: learn a guess from other guesses (bootstrapping).
● Monte Carlo Methods: learn without knowing the model.
Temporal Difference Learning

Temporal Difference:
● Learn a guess from other guesses (bootstrapping).
● Learn without knowing the model.
● Works with longer episodes than Monte Carlo methods.
Temporal Difference Learning

Monte Carlo Methods:
● First run through the whole episode.
● Update states at the end.

Temporal Difference Learning:
● Update the state at each step using earlier guesses.
Monte Carlo Methods

(Recap of the sample episode: returns such as -32, -31, -21, -11, -10, 0 observed along the path to the goal.)
Temporal Difference

As the episode is generated, each visited state is updated immediately from the step cost and the current guess for its successor state, instead of waiting for the episode to end.
Temporal Difference

Each update adds the step cost (1 for a normal move, 10 over an obstacle) to the current guess for the next state:

23 = 1 + 22
28 = 10 + 18
21 = 10 + 11
11 = 1 + 10
10 = 10 + 0
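The sums above are a TD update with a step size of one. A hedged sketch in code (state names and initial guesses are illustrative, not from the slides): refresh a state's cost-to-go estimate from the step cost plus the current guess for the next state.

```python
def td0_update(V, s, step_cost, s_next, alpha=1.0):
    """One TD(0)-style backup; alpha=1.0 reproduces sums like 23 = 1 + 22."""
    target = step_cost + V[s_next]   # bootstrapped target
    V[s] += alpha * (target - V[s])
    return V[s]

V = {"s0": 0.0, "s1": 22.0, "goal": 0.0}
td0_update(V, "s0", 1, "s1")         # 23 = 1 + 22, as on the slide
```

In practice alpha would be less than one, blending each new bootstrapped target into the old estimate as in the learning-rate slide.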
Function Approximation

● Most problems have a large state space.
● We can generally design an approximation for the state space.
● Choosing the correct approximation has a large influence on system performance.
Mountain Car Problem

● The car cannot make it to the top directly.
● The car can swing back and forth to gain momentum.
● We know x and ẋ.
● x and ẋ give an infinite state space.
● Random policy – may get to the top in 1000 steps.
● Optimal policy – may get to the top in 102 steps.
Function Approximation

● We can partition the state space into a 200 × 200 grid.
● Coarse coding – different ways of partitioning the state space.
● We can approximate V = wᵀf, e.g. f = (x, ẋ, height, ẋ²)ᵀ.
● We can estimate w to solve the problem.
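The linear approximation above, V ≈ wᵀf, is a dot product between a weight vector and a feature vector. A minimal sketch, assuming the slide's feature choice f = (x, ẋ, height, ẋ²); the hill-height function and the weight values here are illustrative placeholders, not from the slides.

```python
import math

def features(x, x_dot):
    height = math.sin(3 * x)          # assumed hill shape, not from the slides
    return [x, x_dot, height, x_dot ** 2]

def value(w, x, x_dot):
    # V = w . f(x, x_dot)
    return sum(wi * fi for wi, fi in zip(w, features(x, x_dot)))

w = [0.5, 1.0, -2.0, 0.1]             # illustrative weights
v = value(w, x=-0.5, x_dot=0.07)
```

Learning then reduces to estimating the four weights in w (e.g. by gradient TD updates) rather than a value for every one of the infinitely many (x, ẋ) states.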
Problems with Reinforcement Learning

Policy sometimes gets worse:
● Safe Reinforcement Learning (Phil Thomas) guarantees an improved policy over the current policy.

Very specific to the training task:
● Learning Parameterized Skills (Bruno Castro da Silva, PhD thesis).
Checkers
● Arthur Samuel (IBM) 1959
TD-Gammon

● Neural networks and temporal difference.
● Current programs play better than human experts.
● Expert work went into input selection.
Deep Learning: Atari

● Inputs: score and pixels.
● Deep learning used to discover features.
● Some games played at super-human level.
● Some games played at a mediocre level.