Chapter 5: Monte Carlo Methods (UMass Amherst, CS687)
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
Chapter 5: Monte Carlo Methods
- Monte Carlo methods learn from complete sample returns
- Only defined for episodic tasks
- Monte Carlo methods learn directly from experience
  - On-line: No model necessary and still attains optimality
  - Simulated: No need for a full model
Monte Carlo Policy Evaluation
- Goal: learn Vπ(s)
- Given: some number of episodes under π which contain s
- Idea: Average returns observed after visits to s
- Every-Visit MC: average returns for every time s is visited in an episode
- First-visit MC: average returns only for the first time s is visited in an episode
- Both converge asymptotically
First-visit Monte Carlo policy evaluation
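The algorithm box on this slide was lost in extraction; the following is a minimal Python sketch of first-visit MC prediction. The `generate_episode(policy)` callable and the `(state, reward)` episode format are assumptions for illustration, not part of the slides.

```python
from collections import defaultdict

def first_visit_mc(generate_episode, policy, num_episodes, gamma=1.0):
    """First-visit MC policy evaluation (sketch).

    generate_episode(policy) is a hypothetical callable returning one
    episode as a list of (state, reward) pairs, where the reward is
    received on leaving that state.
    """
    returns = defaultdict(list)          # state -> first-visit returns
    V = {}
    for _ in range(num_episodes):
        episode = generate_episode(policy)
        # return following each time step, accumulated backwards
        G, G_t = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = gamma * G + episode[t][1]
            G_t[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:            # first visit to s only
                seen.add(s)
                returns[s].append(G_t[t])
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```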
Blackjack example
- Object: Have your card sum be greater than the dealer's without exceeding 21.
- States (200 of them):
  - current sum (12-21)
  - dealer's showing card (ace-10)
  - do I have a usable ace?
- Reward: +1 for winning, 0 for a draw, -1 for losing
- Actions: stick (stop receiving cards), hit (receive another card)
- Policy: Stick if my sum is 20 or 21, else hit
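As a concrete illustration of this state encoding and fixed policy, here is a small Python sketch; the `(current_sum, dealer_card, usable_ace)` tuple and the `draw_card` helper are hypothetical choices, not from the slides.

```python
import random

def draw_card():
    """Draw one card: 2-9 at face value, 10/J/Q/K count 10, ace counts 1."""
    return min(random.randint(1, 13), 10)

def player_policy(state):
    """The slide's fixed policy: stick on a sum of 20 or 21, else hit.
    A state is a hypothetical (current_sum, dealer_card, usable_ace) tuple."""
    current_sum, dealer_card, usable_ace = state
    return "stick" if current_sum >= 20 else "hit"
```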
Blackjack value functions
Backup diagram for Monte Carlo
- Entire episode included
- Only one choice at each state (unlike DP)
- MC does not bootstrap
- Time required to estimate one state does not depend on the total number of states
The Power of Monte Carlo
e.g., Elastic Membrane (Dirichlet Problem): how do we compute the shape of the membrane or bubble?
Two Approaches
- Relaxation
- Kakutani's algorithm, 1945
Monte Carlo Estimation of Action Values (Q)
- Monte Carlo is most useful when a model is not available
- We want to learn Q*
- Qπ(s,a): average return starting from state s and action a, then following π
- Also converges asymptotically if every state-action pair is visited
- Exploring starts: Every state-action pair has a non-zero probability of being the starting pair
Monte Carlo Control
- MC policy iteration: Policy evaluation using MC methods followed by policy improvement
- Policy improvement step: greedify with respect to value (or action-value) function
Convergence of MC Control
- Greedified policy meets the conditions for policy improvement:

      Q^{π_k}(s, π_{k+1}(s)) = Q^{π_k}(s, argmax_a Q^{π_k}(s, a))
                             = max_a Q^{π_k}(s, a)
                             ≥ Q^{π_k}(s, π_k(s))
                             = V^{π_k}(s)

- And thus π_{k+1} ≥ π_k by the policy improvement theorem
- This assumes exploring starts and an infinite number of episodes for MC policy evaluation
- To solve the latter:
  - update only to a given level of performance
  - alternate between evaluation and improvement per episode
Monte Carlo Exploring Starts
Fixed point is the optimal policy π*
Proof is an open question
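The Monte Carlo ES pseudocode box did not survive extraction; below is a compact Python sketch of the same loop (exploring starts, first-visit Q estimates, greedification). The `generate_episode(s0, a0, policy)` interface and the `(state, action, reward)` episode format are assumptions.

```python
import random
from collections import defaultdict

def mc_exploring_starts(generate_episode, states, actions, num_episodes, gamma=1.0):
    """Monte Carlo ES (sketch).

    generate_episode(s0, a0, policy) is a hypothetical callable that
    starts an episode at the given state-action pair, follows policy
    afterwards, and returns a list of (state, action, reward) triples.
    """
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: random.choice(actions) for s in states}
    for _ in range(num_episodes):
        s0 = random.choice(states)
        a0 = random.choice(actions)          # exploring start
        episode = generate_episode(s0, a0, policy)
        G, G_t = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = gamma * G + episode[t][2]
            G_t[t] = G
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) not in seen:           # first visit to (s, a)
                seen.add((s, a))
                returns[(s, a)].append(G_t[t])
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
        for s in {s for s, _, _ in episode}: # greedify visited states
            policy[s] = max(actions, key=lambda a: Q[(s, a)])
    return Q, policy
```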
Blackjack example continued
- Exploring starts
- Initial policy as described before
On-policy Monte Carlo Control
- On-policy: learn about the policy currently executing
- How do we get rid of exploring starts?
  - Need soft policies: π(s,a) > 0 for all s and a
  - e.g. ε-soft policy:

        π(s,a) = ε/|A(s)|              for each non-max action
        π(s,a) = 1 − ε + ε/|A(s)|      for the greedy action

- Similar to GPI: move policy towards greedy policy (i.e. ε-greedy)
- Converges to best ε-soft policy
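The ε-soft probabilities above can be sketched in a few lines of Python; `epsilon_soft_probs` is a hypothetical helper name, not from the slides.

```python
def epsilon_soft_probs(q_values, actions, epsilon):
    """Probabilities of the slide's epsilon-greedy policy:
    epsilon/|A(s)| for every action, plus an extra 1 - epsilon
    for the greedy action."""
    greedy = max(actions, key=lambda a: q_values[a])
    base = epsilon / len(actions)
    return {a: base + (1.0 - epsilon if a == greedy else 0.0) for a in actions}
```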
On-policy MC Control
Off-policy Monte Carlo control
- Behavior policy generates behavior in the environment
- Estimation policy is the policy being learned about
- Weight returns from the behavior policy by their relative probability under the estimation policy (importance sampling)
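A minimal sketch of this weighting (ordinary importance sampling), assuming policies are given as hypothetical `(state, action) -> probability` dicts:

```python
def importance_weight(episode, pi, mu):
    """Product over the episode of pi(a|s)/mu(a|s): the relative
    probability of the trajectory under the estimation policy pi
    versus the behavior policy mu (both hypothetical dicts mapping
    (state, action) -> probability)."""
    w = 1.0
    for s, a, _ in episode:
        w *= pi[(s, a)] / mu[(s, a)]
    return w

def off_policy_value(episodes, pi, mu, gamma=1.0):
    """Ordinary importance-sampling estimate of the start state's value
    from episodes of (state, action, reward) triples."""
    total = 0.0
    for ep in episodes:
        G = sum(gamma**t * r for t, (_, _, r) in enumerate(ep))
        total += importance_weight(ep, pi, mu) * G
    return total / len(episodes)
```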
Learning about π while following π′
Off-policy MC control
Incremental Implementation
- MC can be implemented incrementally
  - saves memory
- Compute the weighted average of each return:

  non-incremental:

      V_n = ( Σ_{k=1}^{n} w_k R_k ) / ( Σ_{k=1}^{n} w_k )

  incremental equivalent:

      W_{n+1} = W_n + w_{n+1}
      V_{n+1} = V_n + (w_{n+1} / W_{n+1}) [ R_{n+1} − V_n ]
      V_0 = W_0 = 0
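The incremental update can be sketched as a simple fold over (return, weight) pairs; `incremental_weighted_average` is a hypothetical helper name, not slide code.

```python
def incremental_weighted_average(returns_and_weights):
    """Fold the slide's incremental update over (R, w) pairs:
    W_{n+1} = W_n + w_{n+1};  V_{n+1} = V_n + (w_{n+1}/W_{n+1})(R_{n+1} - V_n),
    starting from V_0 = W_0 = 0.  Only V and W are stored, not all returns."""
    V, W = 0.0, 0.0
    for R, w in returns_and_weights:
        W += w
        V += (w / W) * (R - V)
    return V
```

Because only the running value V and total weight W are kept, memory use is constant in the number of returns, which is the point of the incremental form.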
Racetrack Exercise
- States: grid squares, horizontal and vertical velocity
- Rewards: -1 on track, -5 off track
- Actions: +1, -1, 0 to velocity
- 0 < Velocity < 5
- Stochastic: 50% of the time it moves 1 extra square up or right
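One step of these dynamics might be sketched as below, under the assumption that the velocity bound means components clipped to the range 0..4; `racetrack_step` and the `on_track` predicate are hypothetical names for this exercise.

```python
import random

def racetrack_step(pos, vel, accel, on_track):
    """One step of the slide's racetrack dynamics (sketch).

    pos, vel, accel are (x, y) int tuples, accel components in {-1, 0, +1}.
    on_track(pos) is a hypothetical predicate.  Velocity components are
    clipped to 0..4 (one reading of the slide's "0 < Velocity < 5"), and
    with probability 0.5 the car moves one extra square up or right.
    """
    vx = min(4, max(0, vel[0] + accel[0]))
    vy = min(4, max(0, vel[1] + accel[1]))
    x, y = pos[0] + vx, pos[1] + vy
    if random.random() < 0.5:                  # stochastic extra move
        if random.random() < 0.5:
            x += 1
        else:
            y += 1
    reward = -1 if on_track((x, y)) else -5
    return (x, y), (vx, vy), reward
```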
Summary
- MC has several advantages over DP:
  - Can learn directly from interaction with the environment
  - No need for full models
  - No need to learn about ALL states
  - Less harm from violations of the Markov property (later in book)
- MC methods provide an alternate policy evaluation process
- One issue to watch for: maintaining sufficient exploration
  - exploring starts, soft policies
- No bootstrapping (as opposed to DP)