Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound

Lecture 12: Deep Reinforcement Learning


TRANSCRIPT

Page 1: Topics in AI (CPSC 532L)

Lecture 12: Deep Reinforcement Learning

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound

Page 2: Topics in AI (CPSC 532L)

Types of Learning

Supervised training — Learning from the teacher — Training data includes desired output

Unsupervised training — Training data does not include desired output

Reinforcement learning — Learning to act under evaluative feedback (rewards)

* slide from Dhruv Batra

Page 3: Topics in AI (CPSC 532L)

What is Reinforcement Learning

Agent-oriented learning — learning by interacting with an environment to achieve a goal

— More realistic and ambitious than other kinds of machine learning

Learning by trial and error, with only delayed evaluative feedback (reward)

— The kind of machine learning most like natural learning — Learning that can tell for itself when it is right or wrong

* slide from David Silver

Page 4: Topics in AI (CPSC 532L)

Example: Hajime Kimura’s RL Robot

(Video: robot behavior before vs. after learning)

* slide from Rich Sutton

Page 5: Topics in AI (CPSC 532L)

Example: Hajime Kimura’s RL Robot

(Video: robot behavior before vs. after learning)

* slide from Rich Sutton

Page 6: Topics in AI (CPSC 532L)

Example: Hajime Kimura’s RL Robot

(Video: robot behavior before vs. after learning)

* slide from Rich Sutton

Page 7: Topics in AI (CPSC 532L)

Challenges of RL

— Evaluative feedback (reward) — Sequentiality, delayed consequences — Need for trial and error, to explore as well as exploit — Non-stationarity — The fleeting nature of time and online data

* slide from Rich Sutton

Page 8: Topics in AI (CPSC 532L)

How does RL work?

* slide from David Silver

Page 9: Topics in AI (CPSC 532L)

Robot Locomotion

Objective: Make the robot move forward

State: Angle and position of the joints Action: Torques applied on joints Reward: 1 at each time step the robot is upright + forward movement

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 10: Topics in AI (CPSC 532L)

Atari Games

Objective: Complete the game with the highest score

State: Raw pixel inputs of the game state Action: Game controls e.g. Left, Right, Up, Down Reward: Score increase/decrease at each time step

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 11: Topics in AI (CPSC 532L)

Go Game (AlphaGo)

Objective: Win the game!

State: Position of all pieces Action: Where to put the next piece down Reward: 1 if win at the end of the game, 0 otherwise

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 12: Topics in AI (CPSC 532L)

Markov Decision Processes — Mathematical formulation of the RL problem

Defined by (S, A, R, P, γ):
S: set of possible states
A: set of possible actions
R: distribution of reward given (state, action) pair
P: transition probability, i.e. distribution over next state given (state, action) pair
γ: discount factor

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 13: Topics in AI (CPSC 532L)

Markov Decision Processes — Mathematical formulation of the RL problem

— Life is a trajectory: s0, a0, r0, s1, a1, r1, …

Defined by (S, A, R, P, γ):
S: set of possible states
A: set of possible actions
R: distribution of reward given (state, action) pair
P: transition probability, i.e. distribution over next state given (state, action) pair
γ: discount factor

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 14: Topics in AI (CPSC 532L)

Markov Decision Processes — Mathematical formulation of the RL problem

— Life is a trajectory: s0, a0, r0, s1, a1, r1, …

— Markov property: Current state completely characterizes the state of the world

Defined by (S, A, R, P, γ):
S: set of possible states
A: set of possible actions
R: distribution of reward given (state, action) pair
P: transition probability, i.e. distribution over next state given (state, action) pair
γ: discount factor

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford
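To make the (S, A, R, P, γ) tuple concrete, here is a minimal sketch of an MDP written out as plain Python data. The two-state environment, its transition probabilities, and its rewards are invented purely for illustration; they are not an example from the lecture.

```python
import random

# Minimal sketch of the MDP tuple (S, A, R, P, gamma) as plain Python data.
S = ["s0", "s1"]
A = ["stay", "move"]
gamma = 0.9

# P[s][a] -> list of (next_state, probability); R[s][a] -> expected reward
P = {
    "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.8), ("s0", 0.2)]},
    "s1": {"stay": [("s1", 1.0)], "move": [("s0", 1.0)]},
}
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.5, "move": 0.0},
}

def step(s, a):
    """Sample s' ~ P(.|s, a) and return (reward, next_state)."""
    next_states, probs = zip(*P[s][a])
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return R[s][a], s_next

print(step("s0", "move"))
```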

Page 15: Topics in AI (CPSC 532L)

Components of the RL Agent

Policy — How does the agent behave?

Value Function — How good is each state and/or action pair?

Model — Agent’s representation of the environment

* slide from Dhruv Batra

Page 16: Topics in AI (CPSC 532L)

Policy

— The policy is how the agent acts — Formally, map from states to actions:

Deterministic policy: a = π(s). Stochastic policy: π(a | s) = P[A = a | S = s]

* slide from Dhruv Batra

Page 17: Topics in AI (CPSC 532L)

Policy

— The policy is how the agent acts — Formally, map from states to actions:

Deterministic policy: a = π(s). Stochastic policy: π(a | s) = P[A = a | S = s]

* slide from Dhruv Batra
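A minimal sketch of the two kinds of policy, reusing the toy state and action names from the MDP sketch above (illustrative values only): a deterministic policy is a lookup table a = π(s), while a stochastic policy stores a distribution π(a | s) and samples from it.

```python
import random

ACTIONS = ["stay", "move"]

# Deterministic policy: a = pi(s), a plain lookup table (illustrative values).
pi_det = {"s0": "move", "s1": "stay"}

# Stochastic policy: pi(a|s) is a distribution over actions for each state.
pi_stoch = {
    "s0": {"stay": 0.2, "move": 0.8},
    "s1": {"stay": 0.9, "move": 0.1},
}

def act_deterministic(s):
    return pi_det[s]

def act_stochastic(s):
    probs = [pi_stoch[s][a] for a in ACTIONS]
    return random.choices(ACTIONS, weights=probs, k=1)[0]

print(act_deterministic("s0"), act_stochastic("s0"))
```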

Page 18: Topics in AI (CPSC 532L)

The Optimal Policy

What is a good policy?

* slide from Dhruv Batra

Page 19: Topics in AI (CPSC 532L)

The Optimal Policy

What is a good policy?

Maximizes current reward? Sum of all future rewards?

* slide from Dhruv Batra

Page 20: Topics in AI (CPSC 532L)

The Optimal Policy

What is a good policy?

Maximizes current reward? Sum of all future rewards?

Discounted future rewards!

* slide from Dhruv Batra

Page 21: Topics in AI (CPSC 532L)

The Optimal Policy

What is a good policy?

Maximizes current reward? Sum of all future rewards?

Discounted future rewards!

Formally: π* = argmax_π E[ Σ_{t≥0} γ^t r_t | π ]

with s_0 ~ p(s_0), a_t ~ π(· | s_t), s_{t+1} ~ p(· | s_t, a_t)

* slide from Dhruv Batra
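A small worked sketch of the discounted-return objective Σ_{t≥0} γ^t r_t: the reward sequences below are made up, but they show how the discount factor γ weights earlier rewards more heavily than later ones.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum_{t>=0} gamma^t * r_t for one trajectory's reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Illustrative reward sequences: later rewards count for less.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
print(discounted_return([0.0, 0.0, 10.0], gamma=0.9))  # 10 * 0.9^2 = 8.1
```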

Page 22: Topics in AI (CPSC 532L)

Components of the RL Agent

Policy — How does the agent behave?

Value Function — How good is each state and/or action pair?

Model — Agent’s representation of the environment

* slide from Dhruv Batra

Page 23: Topics in AI (CPSC 532L)

Value Function

A value function is a prediction of future reward

“State Value Function” or simply “Value Function” — How good is a state? — Am I screwed? Am I winning this game?

“Action Value Function” or Q-function — How good is a state-action pair? — Should I do this now?

* slide from Dhruv Batra

Page 24: Topics in AI (CPSC 532L)

Value Function and Q-value Function

Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

— The value function (how good is the state) at state s is the expected cumulative reward from state s (and following the policy thereafter): V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 25: Topics in AI (CPSC 532L)

Value Function and Q-value Function

Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

— The value function (how good is the state) at state s is the expected cumulative reward from state s (and following the policy thereafter): V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]

— The Q-value function (how good is a state-action pair) at state s and action a is the expected cumulative reward from taking action a in state s (and following the policy thereafter): Q^π(s, a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford
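Since both definitions are expectations over trajectories sampled by following the policy, they can be estimated by Monte Carlo rollouts. The sketch below does exactly that, assuming the toy step() and act_stochastic() helpers from the earlier sketches (hypothetical names, not lecture code).

```python
def rollout_return(s, policy_fn, step_fn, gamma=0.9, horizon=50):
    """Discounted return sum_t gamma^t r_t of one sampled trajectory from s."""
    G, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy_fn(s)
        r, s = step_fn(s, a)
        G += discount * r
        discount *= gamma
    return G

def mc_value(s, policy_fn, step_fn, n=1000):
    """Monte Carlo estimate of V_pi(s): average return over n sampled rollouts."""
    return sum(rollout_return(s, policy_fn, step_fn) for _ in range(n)) / n

def mc_q_value(s, a, policy_fn, step_fn, n=1000, gamma=0.9):
    """Monte Carlo estimate of Q_pi(s, a): take a first, then follow pi."""
    total = 0.0
    for _ in range(n):
        r, s_next = step_fn(s, a)
        total += r + gamma * rollout_return(s_next, policy_fn, step_fn, gamma)
    return total / n

# Usage with the toy step() and act_stochastic() helpers sketched earlier:
# print(mc_value("s0", act_stochastic, step))
# print(mc_q_value("s0", "move", act_stochastic, step))
```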

Page 26: Topics in AI (CPSC 532L)

Components of the RL Agent

Policy — How does the agent behave?

Value Function — How good is each state and/or action pair?

Model — Agent’s representation of the environment

* slide from Dhruv Batra

Page 27: Topics in AI (CPSC 532L)

Model

Model predicts what the world will do next

* slide from David Silver

Page 28: Topics in AI (CPSC 532L)

Model

Model predicts what the world will do next

* slide from David Silver

Page 29: Topics in AI (CPSC 532L)

Components of the RL Agent

Policy — How does the agent behave?

Value Function — How good is each state and/or action pair?

Model — Agent’s representation of the environment

* slide from Dhruv Batra

Page 30: Topics in AI (CPSC 532L)

Maze Example

Reward: -1 per time-step Actions: N, E, S, W States: Agent’s location

* slide from David Silver

Page 31: Topics in AI (CPSC 532L)

Maze Example: Policy

Arrows represent the policy π(s) for each state s

* slide from David Silver

Page 32: Topics in AI (CPSC 532L)

Maze Example: Value

Numbers represent the value v_π(s) of each state s

* slide from David Silver

Page 33: Topics in AI (CPSC 532L)

Maze Example: Model

Grid layout represents transition model

Numbers represent the immediate reward for each state (same for all states)

* slide from David Silver

Page 34: Topics in AI (CPSC 532L)

Components of the RL Agent

Policy — How does the agent behave?

Value Function — How good is each state and/or action pair?

Model — Agent’s representation of the environment

* slide from Dhruv Batra

Page 35: Topics in AI (CPSC 532L)

Approaches to RL: Taxonomy

Value-based RL — Estimate the optimal action-value function Q*(s, a) — No policy (implicit)

Policy-based RL — Search directly for the optimal policy π* — No value function

Model-based RL — Build a model of the world — Plan (e.g., by look-ahead) using the model

Value-based and policy-based RL are model-free RL (no model of the environment).

* slide from Dhruv Batra

Page 36: Topics in AI (CPSC 532L)

Approaches to RL: Taxonomy

Value-based RL — Estimate the optimal action-value function Q*(s, a) — No policy (implicit)

Policy-based RL — Search directly for the optimal policy π* — No value function

Model-based RL — Build a model of the world — Plan (e.g., by look-ahead) using the model

Actor-critic RL — Value function — Policy function

Value-based and policy-based RL are model-free RL (no model of the environment).

* slide from Dhruv Batra

Page 37: Topics in AI (CPSC 532L)

Deep RL

Value-based RL — Use neural nets to represent Q function

Policy-based RL — Use neural nets to represent the policy

Model-based RL — Use neural nets to represent and learn the model

* slide from Dhruv Batra

Page 38: Topics in AI (CPSC 532L)

Approaches to RL

Value-based RL — Estimate the optimal action-value function Q*(s, a) — No policy (implicit)

* slide from Dhruv Batra

Page 39: Topics in AI (CPSC 532L)

Optimal Value Function

Optimal Q-function is the maximum achievable value: Q*(s, a) = max_π Q^π(s, a)

* slide from David Silver

Page 40: Topics in AI (CPSC 532L)

Optimal Value Function

Optimal Q-function is the maximum achievable value: Q*(s, a) = max_π Q^π(s, a)

Once we have it, we can act optimally: π*(s) = argmax_a Q*(s, a)

* slide from David Silver

Page 41: Topics in AI (CPSC 532L)

Optimal Value Function

Optimal Q-function is the maximum achievable value: Q*(s, a) = max_π Q^π(s, a)

Once we have it, we can act optimally: π*(s) = argmax_a Q*(s, a)

Optimal value maximizes over all future decisions

* slide from David Silver

Page 42: Topics in AI (CPSC 532L)

Optimal Value Function

Optimal Q-function is the maximum achievable value: Q*(s, a) = max_π Q^π(s, a)

Once we have it, we can act optimally: π*(s) = argmax_a Q*(s, a)

Optimal value maximizes over all future decisions

Formally, Q* satisfies the Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

* slide from David Silver

Page 43: Topics in AI (CPSC 532L)

Solving for the Optimal Policy

Value iteration algorithm: Use the Bellman equation as an iterative update: Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]

Q_i will converge to Q* as i → ∞

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 44: Topics in AI (CPSC 532L)

Solving for the Optimal Policy

Value iteration algorithm: Use the Bellman equation as an iterative update: Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]

Q_i will converge to Q* as i → ∞

What’s the problem with this?

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 45: Topics in AI (CPSC 532L)

Solving for the Optimal Policy

Value iteration algorithm: Use the Bellman equation as an iterative update: Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]

Q_i will converge to Q* as i → ∞

Not scalable. Must compute Q(s, a) for every state-action pair. If the state is e.g. game pixels, it is computationally infeasible to compute for the entire state space!

What’s the problem with this?

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 46: Topics in AI (CPSC 532L)

Solving for the Optimal Policy

Value iteration algorithm: Use the Bellman equation as an iterative update: Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]

Q_i will converge to Q* as i → ∞

Not scalable. Must compute Q(s, a) for every state-action pair. If the state is e.g. game pixels, it is computationally infeasible to compute for the entire state space!

What’s the problem with this?

Solution: use a function approximator to estimate Q(s, a). E.g. a neural network!

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford
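Before moving to function approximation, here is what the tabular version of this update looks like in code. It is a sketch written for the toy MDP from the earlier example (an assumption, not the lecture's example): each sweep applies the Bellman backup Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') ] to every state-action pair, which is exactly the part that stops scaling once the state space is huge.

```python
def q_value_iteration(S, A, P, R, gamma=0.9, n_iters=100):
    """Tabular Q-value iteration: repeatedly apply the Bellman backup
    Q_{i+1}(s, a) = E[ r + gamma * max_a' Q_i(s', a') ] to every (s, a)."""
    Q = {s: {a: 0.0 for a in A} for s in S}
    for _ in range(n_iters):
        Q = {
            s: {
                a: R[s][a] + gamma * sum(p * max(Q[s_next].values())
                                         for s_next, p in P[s][a])
                for a in A
            }
            for s in S
        }
    return Q

# On the toy MDP sketched earlier (S, A, P, R, gamma are illustrative):
# Q_star = q_value_iteration(S, A, P, R, gamma)
# Greedy policy: pi_star(s) = max(Q_star[s], key=Q_star[s].get)
```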

Page 47: Topics in AI (CPSC 532L)

Q-Networks

* slide from David Silver

Page 48: Topics in AI (CPSC 532L)

Case Study: Playing Atari Games

Objective: Complete the game with the highest score

State: Raw pixel inputs of the game state Action: Game controls e.g. Left, Right, Up, Down Reward: Score increase/decrease at each time step

[ Mnih et al., 2013; Nature 2015 ]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 49: Topics in AI (CPSC 532L)

Q-Network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

Q(s, a; θ): neural network with weights θ

[ Mnih et al., 2013; Nature 2015 ]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 50: Topics in AI (CPSC 532L)

Q-Network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

Q(s, a; θ): neural network with weights θ

Input: state st

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

[ Mnih et al., 2013; Nature 2015 ]

Page 51: Topics in AI (CPSC 532L)

Q-Network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

Q(s, a; θ): neural network with weights θ

familiar conv and fc layers

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

[ Mnih et al., 2013; Nature 2015 ]

Page 52: Topics in AI (CPSC 532L)

Q-Network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

Q(s, a; θ): neural network with weights θ

Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4)

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

[ Mnih et al., 2013; Nature 2015 ]

Page 53: Topics in AI (CPSC 532L)

Q-Network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

Q(s, a; θ): neural network with weights θ

Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4)

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

[ Mnih et al., 2013; Nature 2015 ]

Number of actions between 4-18 depending on Atari game

Page 54: Topics in AI (CPSC 532L)

Q-Network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

Q(s, a; θ): neural network with weights θ

Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4)

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

[ Mnih et al., 2013; Nature 2015 ]

Number of actions between 4-18 depending on Atari game

A single feedforward pass to compute Q-values for all actions from the current state => efficient!
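A sketch of this architecture in PyTorch, under the assumption of the layer sizes listed on the slide (16 8x8 filters with stride 4, 32 4x4 filters with stride 2, FC-256, FC-n_actions); the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of the DQN architecture described above (Mnih et al., 2013):
    input is a stack of 4 grayscale 84x84 frames; output is one Q-value per action."""

    def __init__(self, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # -> 16 x 20 x 20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # -> 32 x 9 x 9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # FC-n_actions (Q-values)
        )

    def forward(self, state):
        # state: (batch, 4, 84, 84) -> (batch, n_actions), i.e. Q(s, a) for all a
        return self.net(state)

q_net = QNetwork(n_actions=4)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 4])
```

The single forward pass returns Q-values for all actions at once, so acting greedily is just an argmax over the output vector.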

Page 55: Topics in AI (CPSC 532L)

Q-Network Learning

Remember: we want to find a Q-function that satisfies the Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 56: Topics in AI (CPSC 532L)

Q-Network Learning

Remember: we want to find a Q-function that satisfies the Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

Forward Pass: Loss function L_i(θ_i) = E_{s,a}[ (y_i − Q(s, a; θ_i))² ], where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a ]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 57: Topics in AI (CPSC 532L)

Q-Network Learning

Remember: we want to find a Q-function that satisfies the Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

Forward Pass: Loss function L_i(θ_i) = E_{s,a}[ (y_i − Q(s, a; θ_i))² ], where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a ]

Backward Pass: Gradient update (with respect to Q-function parameters θ): ∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 58: Topics in AI (CPSC 532L)

Q-Network Learning

Remember: we want to find a Q-function that satisfies the Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

Forward Pass: Loss function L_i(θ_i) = E_{s,a}[ (y_i − Q(s, a; θ_i))² ], where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a ]

Iteratively try to make the Q-value close to the target value (y_i) it should have, if the Q-function corresponds to the optimal Q* (and optimal policy π*)

Backward Pass: Gradient update (with respect to Q-function parameters θ): ∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford
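A hedged sketch of that forward/backward pass with PyTorch autograd: compute the target y from the older (frozen) parameters, then regress Q(s, a; θ) toward it with a squared-error loss. The batch layout, the `dones` mask for terminal states, and the function names are assumptions for illustration, not the lecture's code.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One DQN regression step: push Q(s, a; theta) toward
    y = r + gamma * max_a' Q(s', a'; theta_old), with theta_old held fixed."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; theta) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y, computed with the older parameters and no gradient flow
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * max_next_q

    return F.mse_loss(q_sa, y)

# Usage sketch (q_net / target_net as in the QNetwork sketch above):
# loss = dqn_loss(q_net, target_net, batch)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```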

Page 59: Topics in AI (CPSC 532L)

Training the Q-Network: Experience Replay

Learning from batches of consecutive samples is problematic:

— Samples are correlated => inefficient learning — Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

Address these problems using experience replay — Continually update a replay memory table of transitions (st, at, rt, st+1) as game (experience) episodes are played — Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford
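A minimal replay-memory sketch along the lines described above; the class name and capacity are arbitrary choices for illustration.

```python
import random
from collections import deque

class ReplayMemory:
    """Minimal replay memory sketch: store (s_t, a_t, r_t, s_t+1) transitions
    and sample random minibatches instead of consecutive, correlated samples."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```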

Page 60: Topics in AI (CPSC 532L)

Experience Replay

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 61: Topics in AI (CPSC 532L)

Experience Replay

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 62: Topics in AI (CPSC 532L)

Example: Atari Playing

Page 63: Topics in AI (CPSC 532L)

Example: Atari Playing

Page 64: Topics in AI (CPSC 532L)

Deep RL

Value-based RL — Use neural nets to represent Q function

Policy-based RL — Use neural nets to represent the policy

Model-based RL — Use neural nets to represent and learn the model

* slide from Dhruv Batra

Page 65: Topics in AI (CPSC 532L)

Deep RL

Value-based RL — Use neural nets to represent Q function

Policy-based RL — Use neural nets to represent the policy

Model-based RL — Use neural nets to represent and learn the model

* slide from Dhruv Batra

Page 66: Topics in AI (CPSC 532L)

Policy Gradients

Formally, let’s define a class of parameterized policies: Π = { π_θ, θ ∈ R^m }

For each policy, define its value: J(θ) = E[ Σ_{t≥0} γ^t r_t | π_θ ]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 67: Topics in AI (CPSC 532L)

Policy Gradients

Formally, let’s define a class of parameterized policies: Π = { π_θ, θ ∈ R^m }

For each policy, define its value: J(θ) = E[ Σ_{t≥0} γ^t r_t | π_θ ]

We want to find the optimal policy θ* = argmax_θ J(θ)

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 68: Topics in AI (CPSC 532L)

Policy Gradients

Formally, let’s define a class of parameterized policies: Π = { π_θ, θ ∈ R^m }

For each policy, define its value: J(θ) = E[ Σ_{t≥0} γ^t r_t | π_θ ]

We want to find the optimal policy θ* = argmax_θ J(θ)

How can we do this?

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 69: Topics in AI (CPSC 532L)

Policy Gradients

Formally, let’s define a class of parameterized policies: Π = { π_θ, θ ∈ R^m }

For each policy, define its value: J(θ) = E[ Σ_{t≥0} γ^t r_t | π_θ ]

We want to find the optimal policy θ* = argmax_θ J(θ)

How can we do this?

Gradient ascent on policy parameters!

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 70: Topics in AI (CPSC 532L)

REINFORCE algorithm

Expected reward: J(θ) = E_{τ∼p(τ;θ)}[ r(τ) ] = ∫ r(τ) p(τ; θ) dτ

where r(τ) is the reward of a trajectory τ = (s0, a0, r0, s1, a1, r1, …)

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 71: Topics in AI (CPSC 532L)

REINFORCE algorithm

Expected reward: J(θ) = E_{τ∼p(τ;θ)}[ r(τ) ] = ∫ r(τ) p(τ; θ) dτ

where r(τ) is the reward of a trajectory

Now let’s differentiate this: ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ; θ) dτ

Intractable! The gradient of an expectation is problematic when p depends on θ

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 72: Topics in AI (CPSC 532L)

REINFORCE algorithm

Expected reward: J(θ) = E_{τ∼p(τ;θ)}[ r(τ) ] = ∫ r(τ) p(τ; θ) dτ

where r(τ) is the reward of a trajectory

Now let’s differentiate this: ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ; θ) dτ

However, we can use a nice trick: ∇_θ p(τ; θ) = p(τ; θ) ∇_θ log p(τ; θ)

If we inject this back: ∇_θ J(θ) = ∫ ( r(τ) ∇_θ log p(τ; θ) ) p(τ; θ) dτ = E_{τ∼p(τ;θ)}[ r(τ) ∇_θ log p(τ; θ) ]

Can estimate with Monte Carlo sampling

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford
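A sketch of how this estimator is used in practice: collect one trajectory, compute r(τ), and minimize the surrogate loss −r(τ) Σ_t log π_θ(a_t | s_t), whose gradient (by autograd) is exactly the Monte Carlo estimator above. The `policy_net` interface and trajectory format are assumptions for illustration, not the lecture's code.

```python
import torch

def reinforce_loss(policy_net, trajectory):
    """Surrogate loss whose gradient is the REINFORCE estimator
    grad_theta J ~= r(tau) * sum_t grad_theta log pi_theta(a_t | s_t).
    `trajectory` is a list of (state, action, reward) tuples; `policy_net`
    maps a state tensor to action logits."""
    log_probs, total_reward = [], 0.0
    for state, action, reward in trajectory:
        logits = policy_net(state)
        log_pi = torch.log_softmax(logits, dim=-1)[action]
        log_probs.append(log_pi)
        total_reward += reward
    # Minimizing this loss performs gradient ascent on J(theta).
    return -total_reward * torch.stack(log_probs).sum()

# Usage sketch:
# loss = reinforce_loss(policy_net, trajectory)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```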

Page 73: Topics in AI (CPSC 532L)

Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)

Interpretation: - If r(τ) is high, push up the probabilities of the actions seen - If r(τ) is low, push down the probabilities of the actions seen

Intuition

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 74: Topics in AI (CPSC 532L)

Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)

Interpretation: - If r(τ) is high, push up the probabilities of the actions seen - If r(τ) is low, push down the probabilities of the actions seen

Might seem simplistic to say that if a trajectory is good then all its actions were good. But in expectation, it averages out!

Intuition

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 75: Topics in AI (CPSC 532L)

Intuition

* slide from Dhruv Batra

Page 76: Topics in AI (CPSC 532L)

Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)

Interpretation: - If r(τ) is high, push up the probabilities of the actions seen - If r(τ) is low, push down the probabilities of the actions seen

Might seem simplistic to say that if a trajectory is good then all its actions were good. But in expectation, it averages out!

Intuition

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

However, this also suffers from high variance because credit assignment is really hard. Can we help the estimator?

Page 77: Topics in AI (CPSC 532L)

REINFORCE in Action: Recurrent Attention Model (RAM)

Objective: Image Classification

Take a sequence of “glimpses” selectively focusing on regions of the image, to predict class

— Inspiration from human perception and eye movements — Saves computational resources => scalability — Able to ignore clutter / irrelevant parts of image

glimpse

[ Mnih et al., 2014 ]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 78: Topics in AI (CPSC 532L)

State: Glimpses seen so far Action: (x,y) coordinates (center of glimpse) of where to look next in image Reward: 1 at the final timestep if image correctly classified, 0 otherwise

REINFORCE in Action: Recurrent Attention Model (RAM)

Objective: Image Classification

Take a sequence of “glimpses” selectively focusing on regions of the image, to predict class

— Inspiration from human perception and eye movements — Saves computational resources => scalability — Able to ignore clutter / irrelevant parts of image

glimpse

[ Mnih et al., 2014 ]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 79: Topics in AI (CPSC 532L)

State: Glimpses seen so far Action: (x,y) coordinates (center of glimpse) of where to look next in image Reward: 1 at the final timestep if image correctly classified, 0 otherwise

REINFORCE in Action: Recurrent Attention Model (RAM)

Objective: Image Classification

Take a sequence of “glimpses” selectively focusing on regions of the image, to predict class

— Inspiration from human perception and eye movements — Saves computational resources => scalability — Able to ignore clutter / irrelevant parts of image

glimpse

Glimpsing is a non-differentiable operation => learn a policy for how to take glimpse actions using REINFORCE. Given the state of glimpses seen so far, use an RNN to model the state and output the next action

[ Mnih et al., 2014 ]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford
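A rough sketch of that setup (the names, sizes, and fixed glimpse variance are assumptions, and for brevity the glimpses are passed in precomputed rather than cropped at the sampled locations as in the actual model): a recurrent core consumes glimpse features, a location head samples where to look next, and a classification head predicts the label after the final glimpse. The location log-probabilities are kept so the REINFORCE term can be applied to them.

```python
import torch
import torch.nn as nn

class GlimpseAgent(nn.Module):
    """Rough sketch of the recurrent attention idea: an RNN consumes glimpse
    features, a location head proposes where to look next (trained with
    REINFORCE, since cropping a glimpse is non-differentiable), and a
    classifier head predicts the label after the final glimpse."""

    def __init__(self, glimpse_dim=64, hidden_dim=128, n_classes=10):
        super().__init__()
        self.encode = nn.Linear(glimpse_dim, hidden_dim)  # glimpse features
        self.core = nn.GRUCell(hidden_dim, hidden_dim)    # recurrent state
        self.locate = nn.Linear(hidden_dim, 2)            # mean of next (x, y)
        self.classify = nn.Linear(hidden_dim, n_classes)  # final prediction

    def forward(self, glimpses):
        # glimpses: (batch, n_steps, glimpse_dim) -> class logits plus the
        # log-probs of the sampled locations (used by the REINFORCE term).
        batch, n_steps, _ = glimpses.shape
        h = glimpses.new_zeros(batch, self.core.hidden_size)
        loc_log_probs = []
        for t in range(n_steps):
            h = self.core(torch.relu(self.encode(glimpses[:, t])), h)
            loc_dist = torch.distributions.Normal(self.locate(h), 0.1)
            loc = loc_dist.sample()  # in the full model, this (x, y) picks the next crop
            loc_log_probs.append(loc_dist.log_prob(loc).sum(dim=-1))
        return self.classify(h), torch.stack(loc_log_probs, dim=1)

# Training sketch: cross-entropy on the class logits + REINFORCE on the
# location log-probs, with reward 1 if the final prediction is correct, else 0.
```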

Page 80: Topics in AI (CPSC 532L)

(Figure: input image and one recurrent step; the network outputs the first glimpse location (x1, y1))

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

[ Mnih et al., 2014 ]

REINFORCE in Action: Recurrent Attention Model (RAM)

Page 81: Topics in AI (CPSC 532L)

(Figure: two recurrent steps; the network outputs glimpse locations (x1, y1) and (x2, y2))

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

[ Mnih et al., 2014 ]

REINFORCE in Action: Recurrent Attention Model (RAM)

Page 82: Topics in AI (CPSC 532L)

(Figure: three recurrent steps; glimpse locations (x1, y1), (x2, y2), (x3, y3))

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

[ Mnih et al., 2014 ]

REINFORCE in Action: Recurrent Attention Model (RAM)

Page 83: Topics in AI (CPSC 532L)

(Figure: five recurrent steps; glimpse locations (x1, y1) through (x5, y5), then a final softmax over the last state predicts the class, here y = 2)

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

[ Mnih et al., 2014 ]

REINFORCE in Action: Recurrent Attention Model (RAM)

Page 84: Topics in AI (CPSC 532L)

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

[ Mnih et al., 2014 ]

REINFORCE in Action: Recurrent Attention Model (RAM)

Has also been used in many other tasks including fine-grained image recognition, image captioning, and visual question-answering!

Page 85: Topics in AI (CPSC 532L)

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

[ Mnih et al., 2014 ]

REINFORCE in Action: Recurrent Attention Model (RAM)

Has also been used in many other tasks including fine-grained image recognition, image captioning, and visual question-answering!

Page 86: Topics in AI (CPSC 532L)

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

[ Mnih et al., 2014 ]

REINFORCE in Action: Recurrent Attention Model (RAM)

Has also been used in many other tasks including fine-grained image recognition, image captioning, and visual question-answering!

Page 87: Topics in AI (CPSC 532L)

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

[ Mnih et al., 2014 ]

REINFORCE in Action: Recurrent Attention Model (RAM)

Has also been used in many other tasks including fine-grained image recognition, image captioning, and visual question-answering!

Page 88: Topics in AI (CPSC 532L)

Summary

Policy gradients: very general, but suffer from high variance, so they require a lot of samples. Challenge: sample efficiency

Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration

Guarantees: — Policy Gradients: Converges to a local maximum of J(θ), often good enough! — Q-learning: Zero guarantees, since you are approximating the Bellman equation with a complicated function approximator

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford