CS 7643: Deep Learning
Dhruv Batra Georgia Tech
Topics:
– Review of Classical Reinforcement Learning
– Value-based Deep RL
– Policy-based Deep RL
Types of Learning
• Supervised learning
– Learning from a “teacher”
– Training data includes desired outputs
• Unsupervised learning
– Training data does not include desired outputs
• Reinforcement learning
– Learning to act under evaluative feedback (rewards)
Supervised Learning
Data: (x, y)
x is data, y is label
Goal: Learn a function to map x -> y
Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.
[Figure: image classification example, labeled “Cat”. This image is CC0 public domain.]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Unsupervised Learning
Data: x
Just data, no labels!
Goal: Learn some underlying hidden structure of the data
Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.
[Figures: 2-d density estimation (2-d density images, left and right, are CC0 public domain); 1-d density estimation (figure copyright Ian Goodfellow, 2016; reproduced with permission).]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
What is Reinforcement Learning?
Agent-oriented learning—learning by interacting with an environment to achieve a goal
• more realistic and ambitious than other kinds of machine learning
Learning by trial and error, with only delayed evaluative feedback (reward)
• the kind of machine learning most like natural learning
• learning that can tell for itself when it is right or wrong
Slide Credit: Rich Sutton
[Figure: David Silver, 2015]
Example: Hajime Kimura’s RL Robots
[Video stills: Before / After; Backward; New Robot, Same algorithm]
Slide Credit: Rich Sutton
Signature challenges of RL
Evaluative feedback (reward)
Sequentiality, delayed consequences
Need for trial and error, to explore as well as exploit
Non-stationarity
The fleeting nature of time and online data
Slide Credit: Rich Sutton
RL API
Slide Credit: David Silver
State
Robot Locomotion
Objective: Make the robot move forward
State: Angle and position of the joints
Action: Torques applied on the joints
Reward: 1 at each time step if upright and moving forward
Figures copyright John Schulman et al., 2016. Reproduced with permission.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Atari Games
Objective: Complete the game with the highest score
State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step
Figures copyright Volodymyr Mnih et al., 2013. Reproduced with permission.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Go
Objective: Win the game!
State: Position of all pieces
Action: Where to put the next piece down
Reward: 1 if win at the end of the game, 0 otherwise
This image is CC0 public domain
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Demo
Markov Decision Process
- Mathematical formulation of the RL problem
- Defined by $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)$:
  - $\mathcal{S}$: set of possible states
  - $\mathcal{A}$: set of possible actions
  - $\mathcal{R}$: distribution of reward given a (state, action) pair
  - $\mathbb{P}$: transition probability, i.e. distribution over the next state given a (state, action) pair
  - $\gamma$: discount factor
- Life is a trajectory: $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$
- Markov property: the current state completely characterizes the state of the world
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
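To make the pieces concrete, here is a minimal sketch of an MDP as plain Python data: a hypothetical two-state toy problem (the states, actions, dynamics, and rewards below are invented for illustration, not from the slides).

```python
import random

gamma = 0.9                  # discount factor
states = ["A", "B"]          # set of possible states S
actions = ["stay", "go"]     # set of possible actions A

# P[s][a]: list of (next_state, probability) -- the transition distribution
P = {
    "A": {"stay": [("A", 0.9), ("B", 0.1)], "go": [("B", 1.0)]},
    "B": {"stay": [("B", 1.0)], "go": [("A", 0.8), ("B", 0.2)]},
}

# R[s][a]: reward for taking action a in state s
R = {
    "A": {"stay": 0.0, "go": 1.0},
    "B": {"stay": 0.5, "go": -1.0},
}

def step(s, a):
    """Sample (reward, next_state) from the MDP dynamics."""
    next_states, probs = zip(*P[s][a])
    s_next = random.choices(next_states, weights=probs)[0]
    return R[s][a], s_next

print(step("A", "go"))  # e.g. (1.0, 'B')
```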
Components of an RL Agent
• Policy
– How does an agent behave?
• Value function
– How good is each state and/or state-action pair?
• Model
– Agent’s representation of the environment
Policy
• A policy is how the agent acts
• Formally, a map from states to actions: deterministic, $a = \pi(s)$, or stochastic, $a \sim \pi(a \mid s)$
What’s a good policy?
Maximizes current reward? Sum of all future rewards? Discounted future rewards!

Formally, the optimal policy 𝝿* maximizes the expected discounted sum of rewards:
$$\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \;\middle|\; \pi\right]$$
with $s_0 \sim p(s_0)$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.
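As a quick numeric illustration of discounting (a tiny helper, not from the slides):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t over one sampled trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# The same reward counts for less the later it arrives:
print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 = 2.71
```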
Value Function
• A value function is a prediction of future reward
• “State Value Function” or simply “Value Function”
– How good is a state?
– Am I screwed? Am I winning this game?
• “Action Value Function” or Q-function
– How good is a state-action pair?
– Should I do this now?
Definitions: Value function and Q-value function
Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

How good is a state? The value function at state s is the expected cumulative reward from state s (and following the policy thereafter):
$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \;\middle|\; s_0 = s, \pi\right]$$

How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s (and following the policy thereafter):
$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \;\middle|\; s_0 = s, a_0 = a, \pi\right]$$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
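Since both definitions are expectations over sampled trajectories, they can be estimated by Monte Carlo rollouts. A sketch, assuming the toy MDP’s `step` function from earlier and some `policy` callable (both hypothetical):

```python
def rollout(step, policy, s0, gamma=0.9, horizon=50):
    """Sample one trajectory and return its discounted reward sum.
    Truncating at `horizon` is fine for a sketch: gamma^50 is tiny."""
    s, ret, disc = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        r, s = step(s, a)
        ret += disc * r
        disc *= gamma
    return ret

def mc_value(step, policy, s0, n=1000, gamma=0.9):
    """Monte Carlo estimate of V^pi(s0): average return over n rollouts."""
    return sum(rollout(step, policy, s0, gamma) for _ in range(n)) / n

def mc_qvalue(step, policy, s0, a0, n=1000, gamma=0.9):
    """Q^pi(s0, a0): force the first action to a0, then follow the policy."""
    total = 0.0
    for _ in range(n):
        r, s = step(s0, a0)                               # first action forced
        total += r + gamma * rollout(step, policy, s, gamma)
    return total / n

# e.g. mc_value(step, lambda s: "go", "A")
```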
Model
• Model predicts what the world will do next
Slide Credit: David Silver
Maze Example
Slide Credit: David Silver
Maze Example: Policy
Slide Credit: David Silver
Maze Example: Value
Slide Credit: David Silver
Maze Example: Model
Slide Credit: David Silver
Components of an RL Agent
• Value function
– How good is each state and/or state-action pair?
• Policy
– How does an agent behave?
• Model
– Agent’s representation of the environment
Approaches to RL
• Value-based RL
– Estimate the optimal action-value function $Q^*(s, a)$
• Policy-based RL
– Search directly for the optimal policy $\pi^*$
• Model-based RL
– Build a model of the world (state transitions, reward probabilities)
– Plan (e.g. by look-ahead) using the model
Deep RL
• Value-based RL
– Use neural nets to represent the Q function: $Q(s, a; \theta)$, aiming for $Q(s, a; \theta^*) \approx Q^*(s, a)$
• Policy-based RL
– Use neural nets to represent the policy: $\pi_\theta$, aiming for $\pi_{\theta^*} \approx \pi^*$
• Model
– Use neural nets to represent and learn the model
Approaches to RL
• Value-based RL
– Estimate the optimal action-value function $Q^*(s, a)$
Optimal Value Function
• The optimal Q-function is the maximum achievable value: $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$
• Once we have it, we can act optimally: $\pi^*(s) = \arg\max_a Q^*(s, a)$
• Optimal value maximizes over all future decisions:
$$Q^*(s, a) = r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})$$
• Formally, $Q^*$ satisfies the Bellman Equation:
$$Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a\right]$$
Slide Credit: David Silver
Solving for the optimal policy
Value iteration algorithm: use the Bellman equation as an iterative update:
$$Q_{i+1}(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q_i(s', a') \;\middle|\; s, a\right]$$
$Q_i$ will converge to $Q^*$ as $i \to \infty$.

What’s the problem with this? Not scalable. Must compute Q(s, a) for every state-action pair. If the state is, e.g., the current game state’s pixels, it is computationally infeasible to compute this for the entire state space!

Solution: use a function approximator to estimate Q(s, a), e.g. a neural network!
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
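In tabular form the update is a few lines. A sketch of Q-value iteration on the toy MDP from earlier (reusing the hypothetical `states`, `actions`, `P`, `R` structures; in-place sweeps, which also converge):

```python
def q_value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman backup Q[s][a] <- R[s][a] + gamma * E[max_a' Q[s'][a']]."""
    Q = {s: {a: 0.0 for a in actions} for s in states}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                backup = R[s][a] + gamma * sum(
                    p * max(Q[s2].values()) for s2, p in P[s][a]
                )
                delta = max(delta, abs(backup - Q[s][a]))
                Q[s][a] = backup
        if delta < tol:          # stop when no entry changes noticeably
            return Q

Q = q_value_iteration(states, actions, P, R)
greedy = {s: max(Q[s], key=Q[s].get) for s in states}  # act greedily w.r.t. Q*
print(greedy)
```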
Demo
• http://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_td.html
Deep RL
• Value-based RL
– Use neural nets to represent the Q function: $Q(s, a; \theta)$, aiming for $Q(s, a; \theta^*) \approx Q^*(s, a)$
• Policy-based RL
– Use neural nets to represent the policy: $\pi_\theta$, aiming for $\pi_{\theta^*} \approx \pi^*$
• Model
– Use neural nets to represent and learn the model
Q-Networks
Slide Credit: David Silver
Case Study: Playing Atari Games
Objective: Complete the game with the highest score
State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step
Figures copyright Volodymyr Mnih et al., 2013. Reproduced with permission.
[Mnih et al. NIPS Workshop 2013; Nature 2015]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Q-network Architecture
$Q(s, a; \theta)$: neural network with weights $\theta$

Input: state $s_t$, an 84x84x4 stack of the last 4 frames (after RGB -> grayscale conversion, downsampling, and cropping)
Network: 16 8x8 conv filters, stride 4 -> 32 4x4 conv filters, stride 2 -> FC-256 -> FC-4 (Q-values). Familiar conv layers and FC layers.
The last FC layer has a 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4). The number of actions is between 4 and 18, depending on the Atari game.
A single feedforward pass computes the Q-values for all actions from the current state => efficient!
[Mnih et al. NIPS Workshop 2013; Nature 2015]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
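A PyTorch sketch of the network just described (layer sizes follow the slide; the ReLU placement is an assumption, matching the common DQN setup):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, num_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC-4: one Q-value per action
        )

    def forward(self, x):
        # x: (batch, 4, 84, 84) stack of grayscale frames
        return self.net(x)  # (batch, num_actions)

q_net = QNetwork()
frames = torch.rand(1, 4, 84, 84)
print(q_net(frames).shape)  # torch.Size([1, 4]): one pass, Q-values for all actions
```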
Deep Q-learning
Remember: we want to find a Q-function that satisfies the Bellman Equation:
$$Q^*(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a\right]$$

Forward Pass
Loss function:
$$L_i(\theta_i) = \mathbb{E}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right]$$
where
$$y_i = \mathbb{E}\left[r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a\right]$$
Iteratively try to make the Q-value close to the target value ($y_i$) it should have, if the Q-function corresponds to the optimal $Q^*$ (and the optimal policy 𝝿*).

Backward Pass
Gradient update (with respect to Q-function parameters $\theta$):
$$\nabla_{\theta_i} L_i(\theta_i) \propto -\,\mathbb{E}\left[\left(y_i - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
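A sketch of this forward/backward step in PyTorch, assuming a `QNetwork` as above and batched transition tensors. The single-network target below is the simplest variant; the DQN papers hold the target parameters fixed (the Nature version uses a separate target network):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, s, a, r, s_next, done, gamma=0.99):
    """One Q-learning loss: L = E[(y - Q(s, a; theta))^2].
    s, s_next: (B, 4, 84, 84); a: (B,) long; r, done: (B,) float."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
    with torch.no_grad():  # the target is treated as a fixed label
        y = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)

# Usage: loss = dqn_loss(q_net, s, a, r, s_next, done); loss.backward()
```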
Training the Q-network: Experience Replay

Learning from batches of consecutive samples is problematic:
- Samples are correlated => inefficient learning
- Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

Address these problems using experience replay:
- Continually update a replay memory table of transitions (st, at, rt, st+1) as game (experience) episodes are played
- Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples

[Mnih et al. NIPS Workshop 2013; Nature 2015]
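A minimal replay-memory sketch (a hypothetical helper, not the DeepMind implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def push(self, s, a, r, s_next, done):
        """Record one transition (s_t, a_t, r_t, s_{t+1})."""
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        """Random minibatch: breaks correlation between consecutive samples."""
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))  # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```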
Experience Replay
Slide Credit: David Silver
Video by Károly Zsolnai-Fehér. Reproduced with permission. https://www.youtube.com/watch?v=V1eYniJ0Rnk
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Deep RL
• Value-based RL
– Use neural nets to represent the Q function: $Q(s, a; \theta)$, aiming for $Q(s, a; \theta^*) \approx Q^*(s, a)$
• Policy-based RL
– Use neural nets to represent the policy: $\pi_\theta$, aiming for $\pi_{\theta^*} \approx \pi^*$
• Model
– Use neural nets to represent and learn the model
Policy Gradients
Formally, let’s define a class of parameterized policies: $\Pi = \{\pi_\theta,\ \theta \in \mathbb{R}^m\}$

For each policy, define its value:
$$J(\theta) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \;\middle|\; \pi_\theta\right]$$

We want to find the optimal policy $\theta^* = \arg\max_\theta J(\theta)$. How can we do this?
Gradient ascent on policy parameters!
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
REINFORCE algorithm
Mathematically, we can write the expected reward as
$$J(\theta) = \mathbb{E}_{\tau \sim p(\tau; \theta)}\left[r(\tau)\right] = \int_\tau r(\tau)\, p(\tau; \theta)\, \mathrm{d}\tau$$
where r(𝜏) is the reward of a trajectory 𝜏.

Now let’s differentiate this:
$$\nabla_\theta J(\theta) = \int_\tau r(\tau)\, \nabla_\theta p(\tau; \theta)\, \mathrm{d}\tau$$
Intractable! The expectation of the gradient is problematic when p depends on 𝜃.

However, we can use a nice trick:
$$\nabla_\theta p(\tau; \theta) = p(\tau; \theta)\, \frac{\nabla_\theta p(\tau; \theta)}{p(\tau; \theta)} = p(\tau; \theta)\, \nabla_\theta \log p(\tau; \theta)$$
If we inject this back:
$$\nabla_\theta J(\theta) = \int_\tau \left(r(\tau)\, \nabla_\theta \log p(\tau; \theta)\right) p(\tau; \theta)\, \mathrm{d}\tau = \mathbb{E}_{\tau \sim p(\tau; \theta)}\left[r(\tau)\, \nabla_\theta \log p(\tau; \theta)\right]$$
which we can estimate with Monte Carlo sampling.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
REINFORCE algorithm
Can we compute those quantities without knowing the transition probabilities?

We have:
$$p(\tau; \theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$
Thus:
$$\log p(\tau; \theta) = \sum_{t \ge 0} \left[\log p(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\right]$$
And when differentiating:
$$\nabla_\theta \log p(\tau; \theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Doesn’t depend on transition probabilities!

Therefore, when sampling a trajectory 𝜏, we can estimate $\nabla_\theta J(\theta)$ with
$$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
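In an autograd framework, this estimator is implemented by building the scalar $\sum_t r(\tau) \log \pi_\theta(a_t \mid s_t)$ and letting backpropagation produce the gradient. A PyTorch sketch (assuming a hypothetical `policy` module mapping state tensors to action logits):

```python
import torch

def reinforce_loss(policy, states, actions, traj_reward):
    """Surrogate loss whose gradient is -sum_t r(tau) * grad log pi(a_t | s_t).
    states: (T, ...); actions: (T,) long; traj_reward: float r(tau)."""
    log_probs = torch.log_softmax(policy(states), dim=1)   # log pi_theta(. | s_t)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Negate so that gradient *descent* on the loss performs ascent on J(theta)
    return -(traj_reward * chosen).sum()

# Usage: loss = reinforce_loss(policy, states, actions, r_tau)
#        loss.backward(); optimizer.step(); optimizer.zero_grad()
```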
Intuition
Gradient estimator:
$$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Interpretation:
- If r(𝜏) is high, push up the probabilities of the actions seen
- If r(𝜏) is low, push down the probabilities of the actions seen

Might seem simplistic to say that if a trajectory is good then all its actions were good. But in expectation, it averages out!

However, this also suffers from high variance because credit assignment is really hard. Can we help the estimator?
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
REINFORCE in action: Recurrent Attention Model (RAM)
Objective: Image classification

Take a sequence of “glimpses” selectively focusing on regions of the image, to predict the class
- Inspiration from human perception and eye movements
- Saves computational resources => scalability
- Able to ignore clutter / irrelevant parts of the image

State: Glimpses seen so far
Action: (x, y) coordinates (center of glimpse) of where to look next in the image
Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise

Glimpsing is a non-differentiable operation => learn the policy for how to take glimpse actions using REINFORCE. Given the state of glimpses seen so far, use an RNN to model the state and output the next action.

[Mnih et al. 2014]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
REINFORCE in action: Recurrent Attention Model (RAM)
[Figure: the recurrent network processes the input image one glimpse at a time, emitting glimpse locations (x1, y1) through (x5, y5); after the final glimpse, a softmax over class labels predicts the class, e.g. y = 2.]
[Mnih et al. 2014]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
REINFORCE in action: Recurrent Attention Model (RAM)
[Mnih et al. 2014] Figures copyright Daniel Levy, 2017. Reproduced with permission.
Has also been used in many other tasks including fine-grained image recognition, image captioning, and visual question-answering!
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Variance reduction
Gradient estimator:
$$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

First idea: Push up the probabilities of an action seen, only by the cumulative future reward from that state:
$$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(\sum_{t' \ge t} r_{t'}\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Second idea: Use a discount factor 𝛾 to ignore delayed effects:
$$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(\sum_{t' \ge t} \gamma^{t' - t} r_{t'}\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Variance reduction: Baseline
Problem: The raw value of a trajectory isn’t necessarily meaningful. For example, if rewards are all positive, you keep pushing up the probabilities of actions.

What is important, then? Whether a reward is better or worse than what you expect to get.

Idea: Introduce a baseline function dependent on the state. Concretely, the estimator is now:
$$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(\sum_{t' \ge t} \gamma^{t' - t} r_{t'} - b(s_t)\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
How to choose the baseline?
A simple baseline: constant moving average of rewards experienced so far, from all trajectories.

The variance reduction techniques seen so far are typically used in “Vanilla REINFORCE”.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
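A sketch of that simple baseline (the decay constant is an assumption):

```python
class MovingAverageBaseline:
    """Constant moving average of rewards seen so far, across trajectories."""
    def __init__(self, decay=0.99):
        self.decay, self.value = decay, 0.0

    def update(self, traj_reward):
        self.value = self.decay * self.value + (1 - self.decay) * traj_reward

    def advantage(self, traj_reward):
        # Centered reward: only push probabilities up when we beat the average
        return traj_reward - self.value
```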
How to choose the baseline?
A better baseline: we want to push up the probability of an action from a state if this action was better than the expected value of what we should get from that state.

Q: What does this remind you of?
A: Q-function and value function!

Intuitively, we are happy with an action $a_t$ in a state $s_t$ if $Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$ is large. On the contrary, we are unhappy with an action if it’s small.

Using this, we get the estimator:
$$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Actor-Critic Algorithm
Initialize policy parameters 𝜃, critic parameters 𝜙
For iteration = 1, 2, … do
    Sample m trajectories under the current policy
    For i = 1, …, m do
        For t = 1, …, T do
            Compute the advantage $A_t = \left(\sum_{t' \ge t} \gamma^{t'-t} r_{t'}\right) - V_\phi(s_t)$
            Accumulate the policy gradient: $\Delta\theta \leftarrow \Delta\theta + A_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
            Accumulate the critic update, fitting $V_\phi(s_t)$ to the observed return
        End for
    End for
    Update 𝜃 by gradient ascent and 𝜙 by gradient descent
End for
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
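A compact sketch of these updates for a single sampled trajectory (assuming hypothetical `actor` and `critic` modules; `actor` returns action logits and `critic` returns scalar state values):

```python
import torch

def actor_critic_losses(actor, critic, states, actions, rewards, gamma=0.99):
    """states: (T, ...); actions: (T,) long; rewards: (T,) float."""
    T = rewards.shape[0]
    returns = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):               # discounted returns-to-go
        running = rewards[t] + gamma * running
        returns[t] = running
    values = critic(states).squeeze(-1)        # V_phi(s_t), shape (T,)
    advantages = returns - values.detach()     # A_t = return - baseline
    log_probs = torch.log_softmax(actor(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    actor_loss = -(advantages * chosen).sum()                     # policy gradient step
    critic_loss = torch.nn.functional.mse_loss(values, returns)  # fit the critic
    return actor_loss, critic_loss
```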
Summary
- Policy gradients: very general, but suffer from high variance, so they require a lot of samples. Challenge: sample-efficiency
- Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration

- Guarantees:
  - Policy Gradients: converges to a local optimum of J(𝜃), often good enough!
  - Q-learning: zero guarantees, since you are approximating the Bellman equation with a complicated function approximator
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n