Policy Gradients + Planning
Rajbir Kataria, Zhizhong Li, and Tanmay Gupta
Background
● Action-value function Q(s, a) using parameters θ
● Previously, the policy was generated from Q(s, a)
● We will focus on parameterizing the policy directly: πθ(a | s) = ℙ[a | s, θ]
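As a concrete example (an illustration, not from the slide), one common choice is a softmax policy over state-action features φ(s, a):

$$\pi_\theta(a \mid s) = \frac{\exp\!\big(\phi(s,a)^\top \theta\big)}{\sum_{a'} \exp\!\big(\phi(s,a')^\top \theta\big)}$$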
Overview
● Motivation
● Policy Gradients
○ REINFORCE
■ Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
○ Actor-critic methods: REINFORCE + e.g. Q-learning
■ Asynchronous Advantage Actor-Critic (A3C)
● Model-based learning
○ Planning
■ Value Iteration Networks
● Applications
■ Recurrent Models of Visual Attention
■ End-to-end Learning of Action Detection from Frame Glimpses in Videos
■ AlphaGo
Motivation: Iterated Rock-Paper-Scissors
● Consider value-function-based policies for iterated rock-paper-scissors
● Optimal policy? Uniformly random
Slide from David Silver
Motivation: Aliased Gridworld
● The agent cannot distinguish the grey states
● Optimal deterministic policy?
○ Move left in both grey states, or
○ Move right in both grey states
Slide from David Silver
Motivation: Aliased Gridworld
● An optimal policy will randomly move E or W in the grey states
● Policy-based RL can learn this optimal stochastic policy!
Slide from David Silver
Policy-Based RL
● Advantages:
○ Better convergence properties
○ Effective in high-dimensional or continuous action spaces
○ Can learn stochastic policies, which are useful in POMDP environments
● Disadvantages:
○ Evaluating a policy is typically inefficient and high variance (naive Monte Carlo sampling)
Slide from David Silver
Policy Optimization
● Policy-based reinforcement learning is an optimization problem
● Find the θ that maximizes J(θ)
● Some approaches do not use the gradient:
○ Hill climbing
Slide from David Silver
Evolution Strategies - Hill Climbing
● At every iteration ("generation"):
○ A population of parameter vectors ("genotypes") is perturbed ("mutated")
○ The objective function value ("fitness") of each is evaluated
○ The highest-scoring parameter vectors are then recombined to form the population for the next generation
● Gradient free!
Salimans et al. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv:1703.03864
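A minimal sketch of one way to implement this loop (an illustration, not the authors' code: Gaussian mutations, recombination by averaging the elite, and a toy quadratic fitness standing in for episode return):

```python
import numpy as np

def fitness(theta):
    # Toy objective standing in for the return of the policy
    # parameterized by theta; higher is better.
    return -np.sum(theta ** 2)

rng = np.random.default_rng(0)
pop_size, n_elite, sigma = 50, 10, 0.1
theta = rng.normal(size=8)  # current parameter vector ("parent")

for generation in range(100):
    # Mutate: perturb the parent into a population of genotypes
    population = theta + sigma * rng.normal(size=(pop_size, theta.size))
    # Evaluate the fitness of every genotype
    scores = np.array([fitness(p) for p in population])
    # Recombine: average the highest-scoring genotypes into the next parent
    elite = population[np.argsort(scores)[-n_elite:]]
    theta = elite.mean(axis=0)
```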
Evolution Strategies - Hill Climbing
● Highly parallelizable
Salimans et al. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv:1703.03864
Evolution Strategies - Results
Salimans et al. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv:1703.03864
Policy Optimization
● Policy-based reinforcement learning is an optimization problem
● Find the θ that maximizes J(θ)
● Some approaches do not use the gradient:
○ Genetic algorithms
○ Hill climbing
● Greater efficiency is often possible using the gradient:
○ Quasi-Newton
○ Gradient descent
● From now on, we focus primarily on gradient descent
Slide from David Silver
Policy Gradient
● Let J(θ) be any policy objective function
● Policy gradient algorithms search for a local maximum in J(θ) by ascending the gradient of the policy:
Δθ = α∇θJ(θ)
● where ∇θJ(θ) is the policy gradient and α is a step-size parameter
Slide from David Silver
Policy Gradient Theorem
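The theorem statement itself appeared only as an image on this slide; the standard form (for a differentiable policy πθ, with Q^πθ the action-value function under that policy) is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\; Q^{\pi_\theta}(s, a)\right]$$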
REINFORCE
● Maximizing J is non-trivial
○ It is an expectation over high-dimensional action sequences
Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3):229-256, 1992
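A minimal runnable sketch of episodic REINFORCE under common assumptions (tabular softmax policy, the whole-episode return R as the score, and a toy stand-in environment; none of this is from the slides themselves):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))  # policy parameters, one row per state

def pi(s):
    """Softmax policy over actions in state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def toy_env_step(s, a):
    """Stand-in environment: deterministic chain, reward for taking action 1."""
    return (s + 1) % n_states, float(a), s == n_states - 1

def run_episode():
    trajectory, s, done = [], 0, False
    while not done:
        a = rng.choice(n_actions, p=pi(s))
        s_next, r, done = toy_env_step(s, a)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory

def reinforce_update(trajectory, alpha=0.01):
    R = sum(r for _, _, r in trajectory)  # total episode return
    for s, a, _ in trajectory:
        grad_log_pi = -pi(s)   # gradient of log softmax wrt theta[s] ...
        grad_log_pi[a] += 1.0  # ... is (one-hot of a) minus pi(s)
        theta[s] += alpha * R * grad_log_pi  # ascend E[R]

for _ in range(500):
    reinforce_update(run_episode())
```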
Connection with value learning: Actor-critic methods
Motivation: PG vs value functions
● Q-learning: learns Q(s, a) (action-value function)
● PG: directly learns the policy π(a | s)
○ Pro:
■ Better convergence
■ Can learn stochastic policies
■ Gets the action directly; compact
○ Con: suffers from high variance during training
Motivation: PG vs value functions
● Q-learning: learns Q(s, a) (action-value function)
● PG: directly learns the policy π(a | s)
○ Con: suffers from high variance during training
● ... reduce variance?
(Figure from David Silver: the action-value function links policy-based methods such as REINFORCE and value-based methods such as Q-learning)
Method outline
● Use Q to reduce variance
○ Recall the gradient ascent step in PG: the future return from experience (real-world samples) is replaced by Qw(s, a) -- also a future return, but learned along with the policy
○ Many equivalent forms
Slide from David Silver
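In symbols, this substitution (the standard actor-critic form; w denotes the critic's parameters) turns the REINFORCE gradient into:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\; Q_w(s, a)\right]$$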
Example (1): A3C (Asynchronous Advantage Actor-Critic)
● Bias from Q actor-critic
○ Encourages an action if Q(s, a) is large
○ Should encourage good actions on a state (not just random actions that happen to occur in good states)
Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016
Example (1): A3C (Asynchronous Advantage Actor-Critic)
● Advantage actor-critic
○ Only counts the advantage (return minus baseline): encourages doing better than the "baseline" V(s)
○ Reduces variance
○ Learn π and V normally; estimate the advantage by replacing Q(s, a) with the sampled return (in practice)
Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016
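Concretely, using the standard advantage definition (consistent with the A3C paper):

$$A(s, a) = Q(s, a) - V(s), \qquad \nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\; A(s, a)\right]$$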
Example (1): A3C (Asynchronous Advantage Actor-Critic)
● Deep Q Network: a single training environment feeds a replay buffer of experience, from which minibatches are drawn (with GPU parallelism) for updates
○ (to reduce correlation in training data -- crucial for DQN)
● Experience comes from past policies
○ Applies to off-policy learning only
○ Cannot be applied to e.g. actor-critic!
Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016
Example (1): A3CAsynchronous Advantage Actor-Critic
● Asynchronous RL:
○ (Also reduces correlation in training data!)
● Experience is on-policy
TrainingTrainingTraining
Environment (1)Environment (2)Environment (n)
Asynchronously update and
CPU parallelismexperience (1)experience (2)experience (n)
ReplayBuffer
(experience)
Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016
* cf. T. Salimans et al. Evolution Strategies as a Scalable Alternative to RL
Example (1): A3C
● Implementation details
○ Use a k-step estimate of the advantage: the rewards obtained over k steps, plus a value estimate at the future time step, minus the baseline return
○ Actor/critic share some layers
○ Entropy regularization
○ Asynchronous RMSProp
Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016
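Written out, the k-step advantage estimate is:

$$\hat{A}(s_t, a_t) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}) - V(s_t)$$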
Example (1): A3C
● Results on Atari games (averaged): human-normalized scores
Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016
Example (1): A3C
● Results: score vs. training time (hours)
○ Note: hyperparameter fiddling may be at play
Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016
Example (2): Continuous control
● Before: model Q(s, a) or π(s, a) by enumerating actions, i.e. modeling Q(s, a1), ..., Q(s, an)
● When the action a is continuous (a ∈ ℝᵈ)...
○ Actor-critic!
○ Critic Q(s, a): fit normally; Actor π(s): update to maximize the critic's expected return Q(s, π(s))
Lillicrap et al. Continuous Control with Deep Reinforcement Learning. ICLR 2016
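For a deterministic actor πθ(s), this update is the deterministic policy gradient used in the cited paper (DDPG): differentiate the critic with respect to the action, then chain through the actor:

$$\nabla_\theta J \approx \mathbb{E}_{s}\!\left[\left.\nabla_a Q_w(s, a)\right|_{a = \pi_\theta(s)} \nabla_\theta \pi_\theta(s)\right]$$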
Planning
The story so far
● Model-free RL
○ Q-Learning / SARSA: learn the action-value function directly from experience
○ Policy Gradient: learn the policy directly from experience
The story so far
● Model-based RL (Planning)
○ Learn a model of the environment
○ Use the model to learn the policy/value function
Why Plan?
● Simulation is cheaper than real interaction
● Speed up learning
● Generalize to new environments
● Predict a future event
Why Plan?
● Simulation is cheaper than real interaction
○ Sampling-based planning with Q-learning
● Speed up learning
○ Dyna-Q
● Generalize to new environments
○ Value Iteration Networks
● Predict a future event
○ The Predictron
Simplest Model-based RL
● An MDP with unknown:
○ Rewards
○ Transition probabilities
Simplest Model-based RL
● Solution:
○ Gain experience
○ Estimate the model (empirical rewards and transition probabilities)
Simplest Model-based RL
● Use the estimated MDP to get the optimal policy/value function:
○ Value Iteration
○ Policy Iteration
Sampling-based Planning with Q-Learning
● Given: an estimated MDP
● Algorithm:
1. Randomly sample a state s and an action a
2. Sample the next state s′ and reward r from the estimated model
3. Update the Q function
4. Repeat
● Learning from simulated experience
● What if the model is incorrect?
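A minimal sketch of this loop under the obvious assumptions (tabular Q; the estimated model is a hypothetical sampler standing in for empirical transition counts and mean rewards):

```python
import numpy as np

n_states, n_actions, gamma, alpha = 10, 4, 0.95, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def model(s, a):
    """Hypothetical estimated MDP: returns a sampled (next_state, reward).
    In practice this would be built from real experience."""
    return rng.integers(n_states), rng.normal()

for _ in range(10_000):
    s = rng.integers(n_states)       # 1. randomly sample a state and action
    a = rng.integers(n_actions)
    s_next, r = model(s, a)          # 2. sample from the estimated model
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])  # 3. Q-learning update
```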
Dyna-Q
● Learning from both real and simulated experience
D. Silver. RL course, Lecture 8. http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
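A minimal tabular Dyna-Q sketch in the standard Sutton & Barto form (`env_step` is a hypothetical real-environment function, here a random stand-in):

```python
import numpy as np

n_states, n_actions, gamma, alpha, n_planning = 10, 4, 0.95, 0.1, 20
Q = np.zeros((n_states, n_actions))
model = {}                        # (s, a) -> (r, s') from real experience
rng = np.random.default_rng(0)

def env_step(s, a):
    """Hypothetical real environment; returns (reward, next_state)."""
    return rng.normal(), rng.integers(n_states)

s = 0
for _ in range(1000):
    # Act in the real world (epsilon-greedy) and learn from real experience
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[s].argmax())
    r, s_next = env_step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    model[(s, a)] = (r, s_next)   # update the (deterministic) model
    # Planning: learn from simulated experience replayed from the model
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = list(model.items())[rng.integers(len(model))]
        Q[ps, pa] += alpha * (pr + gamma * Q[ps_next].max() - Q[ps, pa])
    s = s_next
```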
Generalization to novel environments
(Diagram: for each environment, the optimal policy / value function is learned separately)
D. Silver. RL course, Lecture 8. http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
Generalization to novel environments
● Policies trained using traditional CNNs are reactive: a Q network / policy network maps the state representation directly to a policy
● Learning to react vs. learning to plan
Value Iteration Networks (Best Paper, NIPS 2016)
Tamar, Aviv, et al. "Value Iteration Networks." NIPS 2016.
Value Iteration Networks
(Diagram: a policy network is trained toward the optimal policy)
Tamar, Aviv, et al. "Value Iteration Networks." NIPS 2016.
Value Iteration Networks
● From the "state observation", predict the rewards & transition probabilities of an MDP (an estimate of the real MDP)
● Solve the MDP using value iteration -- planning!
● Select the relevant information from the solution (attention)
Tamar, Aviv, et al. "Value Iteration Networks." NIPS 2016.
Value Iteration Networks
● Same pipeline -- predict the MDP, solve it with value iteration (planning!), select the relevant information (attention)
● Make it end-to-end differentiable
● Questions?
Tamar, Aviv, et al. "Value Iteration Networks." NIPS 2016.
Value Iteration Networks
● MDP rewards
● Transition probabilities (the same for all maps!)
Tamar, Aviv, et al. "Value Iteration Networks." NIPS 2016.
Value Iteration Networks
● Solve the MDP using value iteration -- planning!
Tamar, Aviv, et al. "Value Iteration Networks." NIPS 2016.
Value Iteration Networks
● Solve the MDP using value iteration -- planning!
● The learned transition model acts as a conv kernel
Tamar, Aviv, et al. "Value Iteration Networks." NIPS 2016.
Value Iteration Networks
● Solve the MDP using value iteration -- planning!
● One value-iteration step = convolution, then max-pooling over the action channel: [R, V] --Conv--> Q --Max Pool--> V
○ Shapes: R: m×n×1, V: m×n×1, Q: m×n×a, Conv kernel: 3×3×a
● Questions?
Tamar, Aviv, et al. "Value Iteration Networks." NIPS 2016.
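A sketch of the VI module's wiring under these shape conventions (NumPy; the 3×3 kernels would be learned in the real network, here they are random placeholders):

```python
import numpy as np
from scipy.signal import convolve2d

m = n = 8
a = 4                                # number of actions
rng = np.random.default_rng(0)
R = rng.normal(size=(m, n))          # predicted reward map, m×n×1
V = np.zeros((m, n))                 # value map, m×n×1
W_r = rng.normal(size=(a, 3, 3))     # 3×3 kernels applied to R, one per action
W_v = rng.normal(size=(a, 3, 3))     # 3×3 kernels applied to V, one per action

for _ in range(20):                  # k iterations of value iteration
    # Conv: the Q channel for each action mixes R and V through its kernels
    Q = np.stack([convolve2d(R, W_r[i], mode="same")
                  + convolve2d(V, W_v[i], mode="same") for i in range(a)])
    # Max-pool over the action channel: V = max_a Q
    V = Q.max(axis=0)
```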
Value Iteration Networks
● Solve using value iteration, then select the relevant information (attention)
○ Attention: select the Q values at the current state
Tamar, Aviv, et al. "Value Iteration Networks." NIPS 2016.
Grid World Experiment

Success rate:

| Domain | VIN | CNN (DQN) | FCN (dense pixelwise classification) |
|--------|-------|-------|-------|
| 8×8    | 99.6% | 97.9% | 97.3% |
| 16×16  | 99.3% | 87.6% | 88.3% |
| 28×28  | 97.0% | 74.2% | 76.6% |

Tamar, Aviv, et al. "Value Iteration Networks." NIPS 2016.
Mars Rover Experiment
● The rover needs to avoid elevation angles greater than 10 degrees; the elevation must be inferred from the input image
● (Figure: VIN-planned trajectory vs. ground truth (GT))
Tamar, Aviv, et al. "Value Iteration Networks." NIPS 2016.
The Predictron: End-to-End Learning and Planning
David Silver et al.
Silver, David, et al. "The Predictron: End-to-end Learning and Planning." arXiv:1612.08810 (2016).
Motivation
● Current deep classification/regression nets cannot unfold into the future to make predictions
Silver, David, et al. "The Predictron: End-to-end Learning and Planning." arXiv:1612.08810 (2016).
Motivation
● Predictron: an architecture for prediction tasks with inbuilt planning computation
Silver, David, et al. "The Predictron: End-to-end Learning and Planning." arXiv:1612.08810 (2016).
Architecture motivated by MRP
● Imagine a Markov Reward Process with:
1. The initial state s⁰ set from the input
2. A network v for the value of a state
3. A network m for state transitions
● 1-step preturn: g¹ = r¹ + γ¹v¹
Silver, David, et al. "The Predictron: End-to-end Learning and Planning." arXiv:1612.08810 (2016).
Architecture motivated by MRP
● Imagine a Markov Reward Process with:
1. The initial state s⁰ set from the input
2. A network v for the value of a state
3. A network m for state transitions
● 2-step preturn: g² = r¹ + γ¹(r² + γ²v²)
Silver, David, et al. "The Predictron: End-to-end Learning and Planning." arXiv:1612.08810 (2016).
Inference
● The state transition network m maps s⁰ to (s¹, r¹, γ¹)
● The value network v maps s¹ to v¹
● 1-step preturn: g¹ = r¹ + γ¹v¹
Inference
● Applying the state transition network twice: m maps s⁰ to (s¹, r¹, γ¹), then s¹ to (s², r², γ²)
● The value network v maps s² to v²
● 2-step preturn: g² = r¹ + γ¹(r² + γ²v²)
Inference
● The k-step predictron output is a Monte-Carlo estimate of the expected k-step preturn
Silver, David, et al. "The Predictron: End-to-end Learning and Planning." arXiv:1612.08810 (2016).
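In general, following the paper's notation (internal rewards r^i, internal discounts γ^i, and the value estimate v^k produced by the networks above):

$$g^k = r^1 + \gamma^1\!\left(r^2 + \gamma^2\!\left(\cdots + \gamma^{k-1}\!\left(r^k + \gamma^k v^k\right)\cdots\right)\right)$$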
Learning
● The preturns computed by the predictron are trained to match returns sampled from the real environment
Silver, David, et al. "The Predictron: End-to-end Learning and Planning." arXiv:1612.08810 (2016).
Experiments
Silver, David, et al. "The Predictron: End-to-end Learning and Planning." arXiv:1612.08810 (2016).
Summary
● Policy Gradients
○ Stochastic policies; better convergence properties; but high variance in training
○ Maximize expected returns
○ Actor-critic methods: use value networks to reduce variance
○ A3C: parallel environments decorrelate training data
● Model-based learning
○ Planning helps learning by modeling the environment
○ Dyna: new (simulated) data from the model
○ Value Iteration Networks: generalization to novel environments
○ Predictron: reasoning about the future
Applications
Recurrent Models of Visual Attention
Volodymyr Mnih, Nicolas Heess, Alex Graves, Koray Kavukcuoglu
Motivation
● Task: classify digits in MNIST
● Motivation: full-image convolution is expensive!
● Humans focus attention selectively on parts of an image
● Combine information from different fixations over time
Overview
● The agent needs to learn a stochastic policy
○ The policy π is defined by the location network in the RNN
● The true state of the environment is unobserved
○ Glimpses can be seen as a partial view of the state
● State: ht = fh(ht−1, gt; θh)
● Actions:
○ Location: lt ~ p(·|fl(ht; θl))
○ An environment action: at ~ p(·|fa(ht; θa))
● Reward: cross-entropy loss (for the classification action)
Glimpse
● Retina-like representation ρ(xt, lt−1)
○ Contains multiple resolution patches
● Centered at location lt−1 of image xt
Glimpse Network
● ρ(xt, lt−1) and lt−1 are mapped into a hidden space
Model Architecture
(Diagram: Glimpse Network → Internal State (RNN) → Environment Action + Location Action; reward follows (at+1, gt))
Training
● Parameters of the agent: θ = {θg, θh, θa}
○ Can be trained using standard backpropagation
● RL objective: maximize the reward, J(θ) = E[R]
○ Can maximize J(θ) using REINFORCE
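The resulting gradient estimate, as in the paper, is a sample approximation over M episodes with a baseline b_t to reduce variance:

$$\nabla_\theta J \approx \frac{1}{M}\sum_{i=1}^{M}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_{1:t}^i\right)\!\left(R^i - b_t\right)$$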
Results
End-to-end Learning of Action Detection from Frame Glimpses in Videos
Serena Yeung, Olga Russakovsky, Greg Mori, Li Fei-Fei
Motivation
● Task: detect and classify action moments in an untrimmed video
● Motivation: looking at all frames in a video is slow!
● The process of detecting actions is one of observation and refinement
Overview
● The agent needs to learn a stochastic policy
○ The policy π is defined by the location network in the RNN
● The true state of the environment is unobserved
○ The observation network's output can be seen as a partial view of the state
● State: hn = fh(hn−1, on; θh)
● Actions:
○ Candidate detection: dn = fd(hn; θd)
○ Binary indicator: pn = fp(hn; θp)
○ Temporal location: ln+1 = fl(hn; θl)
● Reward: based on the quality of the emitted detections
Observation Network
● Observes a single video frame at each timestep and encodes the frame and its location into a feature vector on
○ Inspired by the glimpse network
Model Architecture
(Diagram: Observation Network → Internal State (RNN) → Environment Action (candidate detection), Binary Indicator Action, and Location Action; Reward)
Training
● RL objective: maximize the reward, J(θ) = E[R]
○ Can maximize J(θ) using REINFORCE
● Parameters of the agent: θ = {θo, θh, θd}
○ Can be trained using standard backpropagation
Results - I
● THUMOS '14 dataset
○ Correct predictions
Results - II
● Key takeaways:
○ Accuracy is comparable to the state of the art
○ Fewer frames are observed
AlphaGo: A bit of everything
(but mostly plain PG + planning)
https://www.youtube.com/watch?v=4D5yGiYe8p4
Thanks!
AlphaGo slides
Background: Monte-Carlo Tree Search
● Another planning method
● Sample future paths using a stochastic policy π
○ Biased towards reasonable moves
○ The Predictron paper could do this too, had it modeled the environment ℙ(s′ | s, a)
(talk) D. Silver. Mastering the Game of Go with Deep Neural Networks and Tree Search. ICML Workshop 2016
Background: Monte-Carlo Tree Search (deterministic environment version)
1. Select a path according to Q plus an exploration bonus
2. Expand the leaf node (compute its children and their priors ℙ(·))
3. Evaluate V(s) by rolling out (playing till the end)
4. Backup: update Q(s, a) along the path (and the visit counts)
D. Silver. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, vol. 529, issue 7587
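The selection rule in step 1, as specified in the Nature paper, picks the action maximizing Q plus a prior-weighted exploration bonus u that decays with the visit count:

$$a_t = \operatorname*{arg\,max}_a \left(Q(s_t, a) + u(s_t, a)\right), \qquad u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}$$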
AlphaGo models overview
(Diagram: supervised policy networks, an RL policy network trained by policy gradient, and a value network)
D. Silver. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, vol. 529, issue 7587
AlphaGo models
● Supervised learning, on human expert moves
○ One small policy network: a 1-layer network (fast!), for very fast rollouts
■ 2 μs per move; 24.2% accuracy
○ One deeper policy network: a 13-layer CNN
■ 57% accuracy with handcrafted features
■ 55.7% using only the raw board + past moves
D. Silver. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, vol. 529, issue 7587
AlphaGo models
● Importance of classification accuracy: win rate against the final AlphaGo, for the 1-layer network (fast!) vs. the 13-layer CNN
D. Silver. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, vol. 529, issue 7587
AlphaGo models
● Policy gradient
○ Improve the SL policy into an RL policy
■ By playing against its own past iterations (less overfitting)
○ Training: PG without discount (rewards: win = +1, lose = −1)
● Wins 80% of games against the SL policy
○ 85% against Pachi (open-source state of the art)
○ Ranks around 3 amateur dan
D. Silver. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, vol. 529, issue 7587
AlphaGo models
● Value network: evaluates the win rate of a state
○ Uses self-play instead of human moves (less overfitting)
○ Under the "optimal policy" (the RL one)
○ David: "perhaps the key of AlphaGo dev."
■ (the first strong state evaluator)
D. Silver. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, vol. 529, issue 7587
D. Silver. Mastering the game of Go with Deep Neural Networks and Tree Search. Nature, vol. 529 issue 7587
AlphaGo models recap
Policy gradient
Fast policy
Human-like policy
"Optimal" policy
Value according to "optimal" policy
Any policy can play go
Putting everything together with MCTS (deterministic environment version)
1. Select a path by maximizing the estimated Q plus the exploration bonus u (both estimated by MCTS)
2. Expand the leaf node (compute the children's priors ℙ(·) using the human-like policy)
3. Evaluate V(s) using a linear combination of a rollout with the fast policy and the value network
4. Backup: update Q(s, a) along the path (using the visit counts)
D. Silver. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, vol. 529, issue 7587
AlphaGo results
● AlphaGo vs. humans and related work
● Ablation study of the components
● Ablation study of distributed MCTS
D. Silver. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, vol. 529, issue 7587