Soft Actor-Critic Algorithms and Applications
TRANSCRIPT
CS391R: Robot Learning (Fall 2021) 1
Soft Actor-Critic Algorithms and Applications
Presenter: Zizhao Wang
10/12/2021
CS391R: Robot Learning (Fall 2021) 2
Motivation: Sample Efficiency
Reinforcement Learning (RL) can solve complex decision making and control problems, but
it is notoriously sample-inefficient.
Vinyals et al., 2019
“During training, each agent experienced up to 200 years of real-time StarCraft play.”
Grandmaster level in StarCraft II using multi-agent reinforcement learning
CS391R: Robot Learning (Fall 2021) 3
Motivation: Sample Efficiency
Reinforcement Learning (RL) can solve complex decision making and control problems, but
it is notoriously sample-inefficient.
Levine et al., 2016
● 14 robot arms learning to grasp in parallel
● Observing over 800,000 grasp attempts (3000 robot-hours of practice), we can see the beginnings of intelligent reactive behaviors.
Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection
CS391R: Robot Learning (Fall 2021) 4
Motivation: Robustness to Hyperparameters
RL performance is brittle to hyperparameters, which require laborious tuning.
Henderson et al., 2017
Reward scaling has a significant impact on DDPG performance. (DDPG will be introduced soon.)
Deep Reinforcement Learning that Matters
CS391R: Robot Learning (Fall 2021) 5
Problem Setting: Markov Decision Process (MDP)
State Space $\mathcal{S}$
Action Space $\mathcal{A}$
Transition Probability $p(s_{t+1} \mid s_t, a_t)$
Reward $r(s_t, a_t)$
Policy $\pi(a_t \mid s_t)$
Trajectory $\tau = (s_0, a_0, s_1, a_1, \dots)$
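The definitions above can be made concrete with a toy rollout. Everything below (the 2-state MDP, its reward, the uniform policy) is made up purely for illustration:

```python
import random

# A toy MDP, invented for illustration: 2 states, 2 actions.
STATES = [0, 1]
ACTIONS = [0, 1]

def transition(s, a):
    """p(s'|s, a): action 1 flips the state with probability 0.9."""
    if a == 1 and random.random() < 0.9:
        return 1 - s
    return s

def reward(s, a):
    """r(s, a): reward 1 for being in state 1, minus a small action cost."""
    return float(s == 1) - 0.1 * a

def policy(s):
    """pi(a|s): a uniform random policy."""
    return random.choice(ACTIONS)

def rollout(s0, horizon):
    """Collect a trajectory tau = (s_0, a_0, r_0, s_1, a_1, r_1, ...)."""
    tau, s = [], s0
    for _ in range(horizon):
        a = policy(s)
        r = reward(s, a)
        tau.append((s, a, r))
        s = transition(s, a)
    return tau

traj = rollout(s0=0, horizon=5)
```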
CS391R: Robot Learning (Fall 2021) 8
Context: RL Ranking on Sample Efficiency
(ranked from better to worse sample efficiency)
● model-based
  learns a model of the environment and uses the learned model to imagine interactions
● model-free
  ○ off-policy
    can learn from data generated by a different policy
  ○ on-policy
    must learn from data generated by the current policy
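Off-policy reuse of old data is typically implemented with a replay buffer. A minimal sketch (the class name and sizes are illustrative, not from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions so an off-policy learner can reuse data
    generated by older policies (sketch; a real buffer holds tensors)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive transitions.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=1000)
for t in range(100):          # transitions from any (old) policy
    buf.add(s=t, a=0, r=1.0, s_next=t + 1, done=False)
batch = buf.sample(32)        # reused later for gradient updates
```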
CS391R: Robot Learning (Fall 2021) 10
Context: Actor-Critic
● Critic: estimates the future return of a state-action pair
● Actor: adjusts the policy to maximize the critic's estimated future return
CS391R: Robot Learning (Fall 2021) 12
Context: Actor-Critic
● Critic: $Q_\theta(s_t, a_t)$, trained to estimate the future return
● Actor: a policy trained to maximize the critic's estimate
Deep Deterministic Policy Gradient (DDPG)
● Critic: $Q_\theta(s_t, a_t)$, trained by minimizing the Bellman error
● Actor: a deterministic policy $\mu_\phi(s_t)$, trained to maximize $Q_\theta(s_t, \mu_\phi(s_t))$
● The deterministic policy makes training unstable and brittle to hyperparameters.
CS391R: Robot Learning (Fall 2021) 14
Context: Standard RL vs. Maximum Entropy RL
Standard RL reward: $r(s_t, a_t)$
Entropy $\mathcal{H}(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a \mid s_t)]$: a measure of uncertainty
Maximum entropy RL reward: $r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))$
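A quick numeric illustration of the entropy term and the augmented reward (the distributions, $\alpha$, and $r$ below are made-up values):

```python
import math

def entropy(probs):
    """H(pi) = -sum_a pi(a) log pi(a), a measure of uncertainty."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain
peaked  = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic

# Maximum-entropy RL augments the reward: r + alpha * H(pi(.|s))
alpha, r = 0.2, 1.0
augmented = r + alpha * entropy(uniform)
```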
CS391R: Robot Learning (Fall 2021) 16
Context: Standard RL vs. Maximum Entropy RL
Standard RL
reward: $r(s_t, a_t)$
optimal policy: deterministic, $a_t = \arg\max_a Q(s_t, a)$
Maximum Entropy RL
reward: $r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))$
optimal policy: stochastic, $\pi^*(a_t \mid s_t) \propto \exp(Q(s_t, a_t)/\alpha)$
Why is it helpful?
● encourages exploration
● enables multi-modal action selection
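The multi-modal benefit can be seen by computing the max-entropy optimal policy $\pi(a \mid s) \propto \exp(Q(s,a)/\alpha)$ on made-up Q-values (no max-subtraction for numerical stability here, since the values are tiny):

```python
import math

def softmax_policy(qs, alpha):
    """Max-entropy optimal policy: pi(a|s) proportional to exp(Q(s,a)/alpha)."""
    exps = [math.exp(q / alpha) for q in qs]
    z = sum(exps)
    return [e / z for e in exps]

# Two equally good action modes (made-up Q-values).
q_values = [1.0, -2.0, 1.0]
pi = softmax_policy(q_values, alpha=0.5)
# Both good actions keep substantial probability: multi-modal action selection.
# A greedy (standard-RL) policy would commit to only one of them.
```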
CS391R: Robot Learning (Fall 2021) 20
Method: Soft Actor-Critic vs. Actor-Critic (DDPG)
Actor-Critic (DDPG)
reward: $r(s_t, a_t)$
Q-value: $Q(s_t, a_t) = r(s_t, a_t) + \mathbb{E}\big[\sum_{l \ge 1} \gamma^l r(s_{t+l}, a_{t+l})\big]$
Q-value update: $Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma\, Q(s_{t+1}, \mu(s_{t+1}))$
policy improvement: $\max_\phi Q(s_t, \mu_\phi(s_t))$
Soft Actor-Critic
reward: $r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))$
Q-value (soft, with the entropy bonus of future steps): $Q(s_t, a_t) = r(s_t, a_t) + \mathbb{E}\big[\sum_{l \ge 1} \gamma^l \big(r(s_{t+l}, a_{t+l}) + \alpha \mathcal{H}(\pi(\cdot \mid s_{t+l}))\big)\big]$
Q-value update: $Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\big[Q(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1})\big]$
policy improvement: $\pi_{\text{new}} = \arg\min_{\pi'} D_{\mathrm{KL}}\big(\pi'(\cdot \mid s_t) \,\big\|\, \exp(Q^{\pi_{\text{old}}}(s_t, \cdot)/\alpha)/Z(s_t)\big)$
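The difference between the two Q-value targets, on one made-up transition (all numbers below are illustrative):

```python
gamma, alpha = 0.99, 0.2

def ddpg_target(r, q_next):
    """DDPG: y = r + gamma * Q(s', mu(s')) with a deterministic next action."""
    return r + gamma * q_next

def sac_target(r, q_next, logp_next):
    """SAC: y = r + gamma * (Q(s', a') - alpha * log pi(a'|s')), a' ~ pi(.|s').
    The -alpha * log pi term is the entropy bonus of the soft Q-value."""
    return r + gamma * (q_next - alpha * logp_next)

# Made-up numbers for a single transition:
y_ddpg = ddpg_target(r=1.0, q_next=5.0)
y_sac = sac_target(r=1.0, q_next=5.0, logp_next=-1.5)  # stochastic a', log pi = -1.5
```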
CS391R: Robot Learning (Fall 2021) 27
Method: Proof of Convergence: Are the assumptions realistic?
Assumption: finite state and action space
1. Soft policy evaluation: if we repeatedly update the Q-value as $Q^{k+1}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\big[Q^k(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1})\big]$, it will converge to the soft Q-value $Q^\pi$ as $k \to \infty$.
2. Soft policy improvement: if we update the policy as $\pi_{\text{new}} = \arg\min_{\pi'} D_{\mathrm{KL}}\big(\pi'(\cdot \mid s_t) \,\big\|\, \exp(Q^{\pi_{\text{old}}}(s_t, \cdot)/\alpha)/Z(s_t)\big)$, then $Q^{\pi_{\text{new}}}(s_t, a_t) \ge Q^{\pi_{\text{old}}}(s_t, a_t)$.
3. If we repeat steps 1 and 2, we will find the optimal policy $\pi^*$ such that $Q^{\pi^*}(s_t, a_t) \ge Q^{\pi}(s_t, a_t)$ for any policy $\pi$ and any $(s_t, a_t)$. For how many steps? Possibly many.
Not applicable to many robotics problems (continuous state/action spaces) and not computationally tractable.
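Steps 1-3 can be checked directly on a tiny finite MDP. The MDP below is invented for illustration; the updates are exactly the soft policy evaluation and soft policy improvement stated above:

```python
import math

# Tiny deterministic MDP (made up): 2 states, 2 actions.
# Action 1 moves to the other state; reward 1 only for acting in state 1.
S, A, gamma, alpha = [0, 1], [0, 1], 0.9, 0.5
def nxt(s, a): return 1 - s if a == 1 else s
def rew(s, a): return float(s == 1)

def soft_value(Q, pi, s):
    """V(s) = E_a[Q(s,a) - alpha * log pi(a|s)] under pi."""
    return sum(pi[s][a] * (Q[s][a] - alpha * math.log(pi[s][a])) for a in A)

def policy_evaluation(pi, sweeps=500):
    """Step 1: iterate Q <- r + gamma * E[Q(s',a') - alpha log pi] to convergence."""
    Q = [[0.0, 0.0] for _ in S]
    for _ in range(sweeps):
        Q = [[rew(s, a) + gamma * soft_value(Q, pi, nxt(s, a)) for a in A] for s in S]
    return Q

def policy_improvement(Q):
    """Step 2: pi_new(a|s) proportional to exp(Q(s,a)/alpha), the KL minimizer."""
    pi = []
    for s in S:
        w = [math.exp(Q[s][a] / alpha) for a in A]
        z = sum(w)
        pi.append([x / z for x in w])
    return pi

pi = [[0.5, 0.5] for _ in S]            # start from a uniform policy
Q_old = policy_evaluation(pi)
for _ in range(5):                      # step 3: alternate steps 1 and 2
    pi = policy_improvement(policy_evaluation(pi))
Q_new = policy_evaluation(pi)
# Soft policy improvement guarantees Q_new >= Q_old elementwise.
```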
CS391R: Robot Learning (Fall 2021) 31
Method: Implementation (Practical Approximation)
1. critic training
● convergence → gradient descent: minimize $J_Q(\theta) = \mathbb{E}\big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - (r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}[Q_{\bar\theta}(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1})])\big)^2\big]$
● remove expectations by sampling
  ○ next action $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$
  ○ entropy: use the single-sample estimate $-\log \pi(a_{t+1} \mid s_{t+1})$
  ○ next state $s_{t+1}$ from the replay buffer
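A tabular sketch of the resulting single-sample critic update (SAC uses a neural network $Q_\theta$ with a target network; the tabular form and all numbers here are illustrative):

```python
gamma, alpha, lr = 0.99, 0.2, 0.5

# Tabular critic for illustration; in SAC this is a neural network Q_theta
# and the update below becomes a gradient step on the same squared error.
Q = {(s, a): 0.0 for s in range(2) for a in range(2)}

def critic_update(s, a, r, s_next, a_next, logp_next):
    """One stochastic gradient step on
    J_Q = 1/2 * (Q(s,a) - (r + gamma * (Q(s',a') - alpha * log pi(a'|s'))))^2,
    with (s, a, r, s') from the replay buffer and a' sampled from the policy."""
    target = r + gamma * (Q[(s_next, a_next)] - alpha * logp_next)
    td_error = Q[(s, a)] - target
    Q[(s, a)] -= lr * td_error
    return td_error

# Repeated updates on one made-up transition drive the TD error toward zero.
for _ in range(50):
    err = critic_update(s=0, a=1, r=1.0, s_next=1, a_next=0, logp_next=-0.7)
```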
CS391R: Robot Learning (Fall 2021) 35
Method: Implementation (Practical Approximation)
2. actor training
● convergence → gradient descent
● rewrite the KL divergence: $D_{\mathrm{KL}}\big(\pi_\phi \,\|\, \exp(Q/\alpha)/Z\big) = \mathbb{E}_{a \sim \pi_\phi}\big[\log \pi_\phi(a \mid s_t) - Q(s_t, a)/\alpha\big] + \log Z(s_t)$
● scale the loss by $\alpha$ and omit the constant $\log Z$: $J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_\phi}\big[\alpha \log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t)\big]$
● remove the expectation over actions with a single sample
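A single-sample estimate of the actor loss under these approximations, with a hand-rolled Gaussian policy and a stand-in critic (both invented for illustration):

```python
import math, random

alpha = 0.2

def gaussian_logpdf(x, mu, sigma):
    """log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def q_value(s, a):
    """Stand-in critic (made up): prefers actions near 1.0."""
    return -(a - 1.0) ** 2

def actor_loss_sample(s, mu, sigma):
    """Single-sample estimate of J_pi = E[alpha * log pi(a|s) - Q(s,a)],
    with a drawn via the reparameterization a = mu + sigma * eps, eps ~ N(0,1),
    so that gradients can flow through mu and sigma in the network version."""
    eps = random.gauss(0.0, 1.0)
    a = mu + sigma * eps
    return alpha * gaussian_logpdf(a, mu, sigma) - q_value(s, a)

random.seed(0)
good = sum(actor_loss_sample(s=0, mu=1.0, sigma=0.1) for _ in range(1000)) / 1000
bad = sum(actor_loss_sample(s=0, mu=-1.0, sigma=0.1) for _ in range(1000)) / 1000
# The loss is lower for a policy centered on the high-Q action.
```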
CS391R: Robot Learning (Fall 2021) 36
Method: Implementation (Practical Approximation)
2. actor training: gradient descent with the reparameterization trick $a_t = f_\phi(\epsilon_t; s_t)$
design choice: policy as a normal distribution
● but a normal distribution is unimodal, so the policy loses the declared multi-modal advantage.
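In the paper's implementation, the Gaussian sample is additionally squashed through tanh to keep actions bounded, with a change-of-variables correction to the log-probability. A sketch (the 1e-8 stabilizer is a common implementation choice, not from the paper):

```python
import math, random

def squashed_sample(mu, sigma):
    """Sample a bounded action a = tanh(u), u ~ N(mu, sigma^2), and compute
    log pi(a|s) with the change-of-variables correction used by SAC:
    log pi(a) = log N(u; mu, sigma) - log(1 - tanh(u)^2)."""
    u = random.gauss(mu, sigma)
    a = math.tanh(u)
    log_normal = -0.5 * math.log(2 * math.pi * sigma**2) - (u - mu)**2 / (2 * sigma**2)
    log_pi = log_normal - math.log(1.0 - a**2 + 1e-8)  # epsilon for stability
    return a, log_pi

random.seed(0)
samples = [squashed_sample(mu=0.0, sigma=1.0)[0] for _ in range(1000)]
# All actions are squashed into (-1, 1), matching e.g. torque limits.
```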
CS391R: Robot Learning (Fall 2021) 40
Method: Automatic Entropy Adjustment
Choosing the optimal temperature $\alpha$ is not trivial
● Recall that for maximum entropy RL, the reward is $r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))$.
● So the best $\alpha$ depends on the magnitude of $r$, but $r$ can vary a lot across
  ○ different tasks
  ○ different policies as the policy improves
● The ideal policy entropy should be
  ○ stochastic enough for exploration in uncertain regions
  ○ deterministic enough for performance in well-learned regions
● Solution: only constrain the average entropy across states
CS391R: Robot Learning (Fall 2021) 41
Method: Automatic Entropy Adjustment
Only constrain the average entropy across states: $\max_\pi \mathbb{E}\big[\sum_t r(s_t, a_t)\big]$ s.t. $\mathbb{E}\big[\mathcal{H}(\pi(\cdot \mid s_t))\big] \ge \bar{\mathcal{H}}$
● $\bar{\mathcal{H}}$: target entropy; the temperature $\alpha$ is always > 0
● increase $\alpha$ if the average entropy falls below $\bar{\mathcal{H}}$, decrease it otherwise.
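The temperature update can be sketched as a gradient step in log-space, which keeps $\alpha$ positive (the target entropy and learning rate below are illustrative values):

```python
import math

target_entropy = 1.0   # H_bar: desired average policy entropy (illustrative)
lr = 0.1

def update_log_alpha(log_alpha, avg_entropy):
    """One gradient-descent step on J(alpha) = alpha * (avg_entropy - H_bar),
    done in log-space so that alpha = exp(log_alpha) is always > 0.
    d J / d log_alpha = alpha * (avg_entropy - H_bar)."""
    a = math.exp(log_alpha)
    return log_alpha - lr * a * (avg_entropy - target_entropy)

log_alpha = 0.0
# Policy too deterministic (entropy below target): alpha goes up.
up = update_log_alpha(log_alpha, avg_entropy=0.5)
# Policy too random (entropy above target): alpha goes down.
down = update_log_alpha(log_alpha, avg_entropy=1.5)
```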
CS391R: Robot Learning (Fall 2021) 42
Method: Overall Algorithm
Alternate environment steps (store transitions in the replay buffer) with gradient steps on the critic $Q_\theta$, the actor $\pi_\phi$, and the temperature $\alpha$.
CS391R: Robot Learning (Fall 2021) 43
Experiment: Simulated Benchmarks
State: joint values
Action: joint torques
Metric: average return
Tasks: Ant, Half Cheetah, Walker, Hopper, Humanoid
CS391R: Robot Learning (Fall 2021) 45
Experiment: Baselines and Hypotheses
Baselines
● SAC with learned temperature $\alpha$
● SAC with fixed temperature $\alpha$
● DDPG (off-policy, deterministic policy)
● TD3 (DDPG with engineering improvements)
● PPO (on-policy)
Hypotheses
Compared to the baselines, SAC has better
● sample efficiency: learning speed + final performance
● stability: performance on hard tasks where hyperparameter tuning is challenging
CS391R: Robot Learning (Fall 2021) 47
Experiment: Simulated Benchmarks
● Easy tasks (hopper, walker)
  ○ all algorithms perform comparably except for DDPG
(Plot legend: SAC learned temperature, SAC fixed temperature, DDPG, TD3, PPO)
CS391R: Robot Learning (Fall 2021) 48
Experiment: Simulated Benchmarks
● Normal tasks (half cheetah, ant)
  ○ SAC > TD3 > DDPG & PPO in both learning speed and final performance
CS391R: Robot Learning (Fall 2021) 49
Experiment: Simulated Benchmarks
● Hard tasks (humanoid)
  ○ DDPG and its variant TD3 fail to learn
  ○ SAC learns much faster than PPO
CS391R: Robot Learning (Fall 2021) 50
Experiment: Simulated Benchmarks
● Larger variance with the automatic temperature adjustment
CS391R: Robot Learning (Fall 2021) 51
Experiment: Real World Quadrupedal Locomotion
State: low-dimensional observations
Challenges: sample efficiency & generalization to unseen environments
Training: 160k steps (2 hours)
CS391R: Robot Learning (Fall 2021) 52
Experiment: Real World Quadrupedal Locomotion
Train on Flat
Test on Slope
Test with Obstacles
Test with Stairs
CS391R: Robot Learning (Fall 2021) 53
Experiment: Real World Manipulation
State: hand joint angles + image / ground-truth valve position
Challenge: perceiving the valve position from images
Comparison (using the ground-truth valve position): SAC (3 hrs) vs. PPO (7.4 hrs)
CS391R: Robot Learning (Fall 2021) 54
Limitations
● Multi-modal policy
  ○ Although maximum entropy RL is claimed to benefit from multi-modal policies, SAC chooses a unimodal policy (normal distribution).
  ○ None of the experiments requires multi-modal behaviors to solve.
● Hyperparameter tuning
  ○ The target entropy introduces a new hyperparameter to tune.
  ○ The average-entropy constraint does not always provide the desired balance between exploration and exploitation.
  ○ For manipulation tasks that require accurate control, even average entropy regularization can still hurt performance.
CS391R: Robot Learning (Fall 2021) 55
Future Work and Extended Readings
● Learn a multi-modal policy rather than a unimodal one
  ○ Reinforcement Learning with Deep Energy-Based Policies
● Improve sample efficiency with auxiliary tasks
  ○ Improving Sample Efficiency in Model-Free Reinforcement Learning from Images
  ○ CURL: Contrastive Unsupervised Representations for Reinforcement Learning
● Improve generalization by learning task-relevant features
  ○ Learning Invariant Representations for Reinforcement Learning without Reconstruction
  ○ Learning Task Informed Abstractions
CS391R: Robot Learning (Fall 2021) 56
Summary
● Problem: sample-efficient RL with automatic hyperparameter tuning
● Limitations of prior work:
  ○ On-policy RL is sample-inefficient
  ○ DDPG, which uses a deterministic policy, is brittle to hyperparameters
● Key insights of the proposed work:
  ○ Maximum entropy RL encourages exploration and is robust to environments & hyperparameters.
  ○ Constrain the average entropy across states as regularization.
● Result: state-of-the-art sample efficiency and generalization on simulated and real-world tasks.