9 Active Reinforcement Learning (courses.cs.vt.edu/cs4804/Fall18, created 9/21/2018)
TRANSCRIPT
Active Reinforcement Learning
Virginia Tech CS4804
Outline
• Active reinforcement learning
• Active adaptive dynamic programming
• Q-learning
• Policy Search
Passive Learning
• Recordings of an agent running a fixed policy
• Observe states, rewards, actions
• Direct utility estimation
• Adaptive dynamic programming (ADP)
• Temporal-difference (TD) learning
Problems with Passive Reinforcement Learning
$\pi(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\, U(s')$
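The greedy policy above can be sketched in Python; the dictionary layout for the model `P` and utilities `U` is an illustrative assumption, not from the slides:

```python
def greedy_policy(state, actions, P, U):
    """Pick the action maximizing expected utility of the next state.

    P[(s, a)] is assumed to map successor states s' to P(s' | s, a);
    U maps states to utility estimates.
    """
    def expected_utility(a):
        return sum(p * U[s2] for s2, p in P[(state, a)].items())
    return max(actions, key=expected_utility)

# Tiny example: "right" reaches the high-utility state more often.
P = {("s", "left"):  {"good": 0.1, "bad": 0.9},
     ("s", "right"): {"good": 0.8, "bad": 0.2}}
U = {"good": 1.0, "bad": -1.0}
print(greedy_policy("s", ["left", "right"], P, U))  # -> right
```

Note that evaluating this policy requires the transition model P, which is exactly what a passive learner may not have; hence the approaches below.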
Exploration
• Naive approach: occasionally choose a random action instead of the greedy one
• shrink probability of random action over time
• Another approach: always act greedily, but overestimate rewards for unexplored states

$U^+(s) \leftarrow R(s) + \gamma \max_a f\!\left(\sum_{s'} P(s' \mid s, a)\, U^+(s'),\; N(s, a)\right)$
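The naive approach can be sketched as ε-greedy selection with a decaying exploration probability; the 1/t decay schedule here is one illustrative choice:

```python
import random

def epsilon_greedy(state, actions, q_value, t):
    """With probability eps(t), pick a random action; else act greedily.

    q_value(state, action) is any estimate of action quality; the
    exploration probability shrinks as 1/t over time.
    """
    eps = 1.0 / t  # decaying probability of a random action
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: q_value(state, a))

# Early on (t = 1) the action is always random; for large t it is
# almost always the greedy one.
q = lambda s, a: {"left": 0.0, "right": 1.0}[a]
action = epsilon_greedy("s", ["left", "right"], q, t=10**9)
```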
$f(u, n) = \begin{cases} R^+ & \text{if } n < N_e \\ u & \text{otherwise} \end{cases}$

$U^+(s) \leftarrow R(s) + \gamma \max_a f\!\left(\sum_{s'} P(s' \mid s, a)\, U^+(s'),\; N(s, a)\right)$
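A minimal sketch of the exploration function f: state-action pairs tried fewer than Ne times are reported with the optimistic reward R+; the values of R+ and Ne below are illustrative:

```python
def exploration_f(u, n, r_plus=1.0, n_e=5):
    """Return the optimistic reward R+ while (s, a) has been tried
    fewer than Ne times, and the actual utility estimate afterwards."""
    return r_plus if n < n_e else u

# An unexplored pair looks maximally attractive...
print(exploration_f(u=-0.5, n=2))  # -> 1.0
# ...until it has been tried Ne times.
print(exploration_f(u=-0.5, n=5))  # -> -0.5
```

This makes the greedy agent seek out under-visited state-action pairs without any explicit randomness.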
Active TD-Learning
• Exactly the same update as passive TD learning:

$U(s) \leftarrow U(s) + \alpha \left( R(s) + \gamma U(s') - U(s) \right)$

$U(s) = R(s) + \gamma\, \mathbb{E}_{s'}[U(s')]$

• Still need estimates of the transition probabilities to choose actions
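The TD update can be sketched as a single function; the values of alpha and gamma are illustrative:

```python
def td_update(U, s, s_next, reward, alpha=0.1, gamma=0.9):
    """Move U(s) toward the one-step sample R(s) + gamma * U(s')."""
    U[s] = U[s] + alpha * (reward + gamma * U[s_next] - U[s])
    return U

U = {"a": 0.0, "b": 1.0}
td_update(U, "a", "b", reward=0.5)
print(U["a"])  # -> approximately 0.14
```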
Action-Utility Functions
• Combine transition probabilities with utilities

$Q(s, a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')$

$U(s) = R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, U(s')$

$U(s) = \max_a Q(s, a)$
Q-Learning

$Q(s, a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')$

$U(s) = R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, U(s')$

$U(s) = \max_a Q(s, a)$

$Q(s, a) \leftarrow Q(s, a) + \alpha \left( R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$

$U(s) \leftarrow U(s) + \alpha \left( R(s) + \gamma U(s') - U(s) \right)$
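Putting the Q-learning update to work gives the following tabular sketch; the toy chain environment, hyperparameters, and pure-random exploration are illustrative choices, not from the slides:

```python
import random
from collections import defaultdict

random.seed(0)

def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s,a) toward R(s) + gamma * max_a' Q(s',a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

# Toy chain: states 0..3; moving "right" from state 2 reaches the
# terminal goal state 3 (reward 1), everything else gives reward 0.
actions = ["left", "right"]
Q = defaultdict(float)
for _ in range(2000):
    s = random.randint(0, 2)
    a = random.choice(actions)  # pure exploration, for the sketch
    s_next = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    reward = 1.0 if s_next == 3 else 0.0
    q_learning_update(Q, s, a, reward, s_next, actions)

# After training, "right" dominates "left" in every non-terminal state,
# so the greedy policy max_a Q(s, a) walks to the goal.
print(all(Q[(s, "right")] > Q[(s, "left")] for s in range(3)))
```

Note that no transition model P is ever estimated: the learned Q values alone determine the greedy policy.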
Approximate Q-Learning

$\hat{Q}(s, a) := g(s, a, \theta) := \theta_1 f_1(s, a) + \theta_2 f_2(s, a) + \ldots + \theta_d f_d(s, a)$

$\theta_i \leftarrow \theta_i + \alpha \left( R(s) + \gamma \max_{a'} \hat{Q}(s', a') - \hat{Q}(s, a) \right) \frac{\partial g}{\partial \theta_i}$

$\theta_i \leftarrow \theta_i + \alpha \left( R(s) + \gamma \max_{a'} \hat{Q}(s', a') - \hat{Q}(s, a) \right) f_i(s, a)$
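For a linear Q function the gradient ∂g/∂θi is just the feature value fi(s, a), so the weight update can be sketched directly; the single feature below is a made-up illustration:

```python
def q_hat(theta, features, s, a):
    """Linear action-value estimate: sum_i theta_i * f_i(s, a)."""
    return sum(t * f(s, a) for t, f in zip(theta, features))

def approx_q_update(theta, features, s, a, reward, s_next, actions,
                    alpha=0.1, gamma=0.9):
    """TD update on the weights: theta_i += alpha * delta * f_i(s, a)."""
    best_next = max(q_hat(theta, features, s_next, a2) for a2 in actions)
    delta = reward + gamma * best_next - q_hat(theta, features, s, a)
    return [t + alpha * delta * f(s, a) for t, f in zip(theta, features)]

# One illustrative feature: 1.0 if the action moves right, else 0.0.
features = [lambda s, a: 1.0 if a == "right" else 0.0]
theta = [0.0]
theta = approx_q_update(theta, features, s=0, a="right", reward=1.0,
                        s_next=1, actions=["left", "right"])
print(theta)  # -> [0.1]
```

The weights generalize across all states sharing features, rather than storing one value per (s, a) pair.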
Loss Minimization for Parameterized Q

$Q(s, a) \leftarrow Q(s, a) + \alpha \left( R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$

Estimation error:

$\ell(s, s') = \ell\!\left( R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$

Compute $\nabla_Q\, \ell(s, s')$ with backpropagation, and do stochastic gradient descent.
Policy Search
• Instead, parameterize the policy as a conditional probability distribution $\pi_\theta(a \mid s)$
• Tune the parameters by approximating the gradient of the expected return with respect to $\theta$
• Gradient descent (on the negated expected return)
• If curious, read https://towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d
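A minimal policy-gradient sketch (REINFORCE-style, as in the linked article) with a softmax policy over two actions; the bandit-style setup and learning rate are illustrative assumptions, not from the slides:

```python
import math
import random

def softmax_policy(theta):
    """pi_theta(a): softmax over per-action preferences theta."""
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [p / total for p in z]

def reinforce_step(theta, rewards, alpha=0.1):
    """Sample an action, then push its log-probability up in proportion
    to the received reward: theta += alpha * R * grad log pi(a)."""
    probs = softmax_policy(theta)
    a = random.choices(range(len(theta)), weights=probs)[0]
    r = rewards[a]
    # Gradient of log softmax: 1 - pi(a) for the chosen action, -pi(b) otherwise.
    return [t + alpha * r * ((1.0 if i == a else 0.0) - probs[i])
            for i, t in enumerate(theta)]

random.seed(0)
theta = [0.0, 0.0]
for _ in range(2000):
    theta = reinforce_step(theta, rewards=[0.0, 1.0])
# The policy concentrates on the rewarding action (index 1).
print(softmax_policy(theta)[1])
```

Unlike Q-learning, no value function is learned at all: the parameters of the policy itself are adjusted directly.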
Summary
• Active ADP: choose random actions, or act greedily and overestimate the utility of unexplored states
• Active TD: exactly the same update as passive TD
• Q-learning: learn the action-utility function Q directly, via TD learning
• Approximate Q-learning uses the derivative of the utility function, easy if linear
• Policy search
Summary

| | Direct Utility Estimation | ADP | TD Learning | Q-Learning | Policy Search (Gradient) |
| --- | --- | --- | --- | --- | --- |
| Learns p(s' \| s, a) | No | Yes, by counting | No | No | No |
| Learns R(s) | No | Yes | No | No | No |
| Learns state utility (U) | U𝝅 | optimal U | optimal U if active, U𝝅 otherwise | No | No |
| Learns state-action utility (Q) | No | No | No | Yes | No |
| Learns 𝝅 | No | Yes | No (need p) | Yes | Yes |
| Strategy | Average observed utility | Solve Bellman w/ estimated p & R | Adjust U based on change from s to s' | TD learning for Q instead of U | Fancy calculus |