Sim2Real
Deep Reinforcement Learning and Control
Katerina Fragkiadaki
Carnegie Mellon School of Computer Science
So far
The requirement of a large number of samples for RL can only be met in simulation; we cannot (as of today) rely solely on interaction in the real world, which renders RL a model-based framework.
• In the real world, we usually finetune models and policies learned in simulation.
Physics Simulators
MuJoCo, Bullet, Gazebo, etc.
Pros of Simulation
• We can afford many more samples!
• Safety
• Avoids wear and tear of the robot
• Good at rigid multibody dynamics
Cons of Simulation
• Under-modeling: many physical events are not modeled.
• Wrong parameters: even if our physical equations were correct, we would need to estimate the right parameters, e.g., inertia and friction coefficients (system identification).
• Systematic discrepancy w.r.t. the real world regarding:
• observations
• dynamics
As a result, policies learned in simulation do not transfer to the real world.
• Hard to simulate deformable objects (finite element methods are very computationally intensive).
What has shown to work
• Domain randomization (dynamics, images): with enough variability in the simulator, "the real world may appear to the model as just another variation" (a sketch follows this list).
• Learning not from pixels but from label maps: semantic maps are closer between simulation and the real world than textures are.
• Learning higher-level policies rather than low-level controllers, since the low-level dynamics differ substantially between SIM and REAL.
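A minimal sketch of what per-episode randomization can look like; the parameter names and ranges below are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sim_params():
    """Draw one randomized simulator configuration (illustrative ranges)."""
    return {
        "texture_id":    int(rng.integers(0, 1000)),      # random object/table textures
        "light_pos":     rng.uniform(-1.0, 1.0, size=3),  # random light placement
        "camera_jitter": rng.normal(0.0, 0.02, size=3),   # small camera pose noise
        "mass_scale":    rng.uniform(0.8, 1.2),           # dynamics randomization
        "friction":      rng.uniform(0.5, 1.5),
    }

# Each episode is simulated/rendered under a fresh configuration, so the
# real world should look to the model like just another variation.
for episode in range(3):
    params = sample_sim_params()
    print(f"episode {episode}: mass_scale={params['mass_scale']:.2f}, "
          f"friction={params['friction']:.2f}")
    # env = make_env(**params); rollout(policy, env)  # hypothetical simulator hooks
```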
Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World
Tobin et al., 2017 arXiv:1703.06907
Domain randomization for detecting and grasping objects
Cuboid Pose Estimation
Let’s try a more fine-grained task
Data Generation
Model Output - Belief Maps
Regressing to vertices: the model outputs one belief map per cuboid vertex (the figure shows maps indexed 0–6).
Baxter’s camera
SIM2REAL
Data - Contrast and Brightness
Surprising Result
SIM2REAL
Car detection
Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization, NVIDIA
(Figure: car detection performance comparing training on VKITTI with domain-randomized data generation.)
Dynamics randomization
Ideas:
• Consider a distribution over simulation models instead of a single one, to learn policies robust to modeling errors that work well under many "worlds" (hard model mining).
• Progressively bring the simulation model distribution closer to the real world.
Policy Search under model distribution
p: simulator parameters
Learn a policy that performs best in expectation over MDPs in the source domain distribution:
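In symbols (a reconstruction; $\rho$ is the source distribution over simulator parameters $p$, and $\mathcal{M}_p$ the MDP induced by $p$):

$\pi^* = \arg\max_\pi \ \mathbb{E}_{p \sim \rho}\,\mathbb{E}_{\pi, \mathcal{M}_p}\!\left[\sum_t R(s_t, a_t)\right]$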
Learn a policy that performs best in expectation over the worst ε-percentile of MDPs in the source domain distribution:
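In symbols (a reconstruction, EPOpt-style; $R(\pi; p)$ is the return of $\pi$ in the world with parameters $p$):

$\pi^* = \arg\max_\pi \ \mathbb{E}_{p \sim \rho}\!\left[R(\pi; p) \;\middle|\; R(\pi; p) \le F_\pi^{-1}(\epsilon)\right]$

where $F_\pi^{-1}(\epsilon)$ is the $\epsilon$-quantile of returns under $p \sim \rho$.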
Hard world model mining
Hard model mining
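A minimal sketch of the hard-model-mining data-collection loop (EPOpt-style; `sample_params` and `rollout` are assumed helpers, not from the slides):

```python
import numpy as np

def hard_model_mining_batch(sample_params, rollout, policy, n_worlds=100, eps=0.1):
    """Collect rollouts across sampled worlds, keep only the worst eps-fraction.

    sample_params() draws simulator parameters; rollout(policy, p) returns
    (trajectory, total_reward) in the world those parameters define.
    """
    trajs, returns = [], []
    for _ in range(n_worlds):
        p = sample_params()
        traj, ret = rollout(policy, p)
        trajs.append(traj)
        returns.append(ret)
    cutoff = np.quantile(returns, eps)            # eps-percentile of returns
    hard = [t for t, r in zip(trajs, returns) if r <= cutoff]
    return hard                                    # update the policy on these only
```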
Hard model mining results
Hard world mining results in policies with high reward over a wider range of parameters.
Adapting the source domain distribution
Sample a set of simulation parameters from a sampling distribution S. Posterior of parameters p_i: how probable an observed target state-action trajectory is under the simulator with parameters p_i; the more probable the trajectory, the more we prefer that simulation model.
Fit a Gaussian model over simulator parameters based on the posterior weights of the samples (a sketch follows).
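A minimal sketch of the refit step, assuming we already scored each parameter sample $p_i$ by the log-likelihood of the observed real-world trajectory (the exact weighting in the paper may differ):

```python
import numpy as np

def refit_source_distribution(params, log_likelihoods):
    """Weighted Gaussian fit over simulator parameters.

    params:          (N, D) sampled simulator parameters p_i
    log_likelihoods: (N,) log p(real trajectory | simulator with p_i)
    """
    w = np.exp(log_likelihoods - np.max(log_likelihoods))  # stabilized posterior weights
    w /= w.sum()
    mean = w @ params                                       # weighted mean
    centered = params - mean
    cov = (w[:, None] * centered).T @ centered              # weighted covariance
    return mean, cov
```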
Source Distribution Adaptation
Performance of hopper policies: trained on a Gaussian distribution over mass (mean 6, standard deviation 1.5) vs. trained on single source domains.
Idea: the driving policy is not directly exposed to raw perceptual input or low-level vehicle dynamics.
Main idea
• Pixels-to-steering-wheel learning is not SIM2REAL transferable: textures and car dynamics mismatch between the domains.
• Label-maps-to-waypoints learning is SIM2REAL transferable: label maps are similar between SIM and REAL, and a low-level controller will take the car from waypoint to waypoint (a sketch of the modular pipeline follows).
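A sketch of the modular pipeline this implies (all component names are ours, not from the paper; only the middle module is trained in simulation):

```python
def drive_step(image, goal, segmenter, driving_policy, controller, car_state):
    """One control step of a modular SIM2REAL driving stack.

    Label maps look alike in SIM and REAL, so the sim-trained middle
    module transfers; perception and low-level control stay domain-specific.
    """
    label_map = segmenter(image)                   # pixels -> semantic label map
    waypoints = driving_policy(label_map, goal)    # label map -> waypoints (sim-trained)
    action = controller(waypoints, car_state)      # waypoints -> steering/throttle
    return action
```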
Maximum Entropy Reinforcement Learning
Deep Reinforcement Learning and Control
Katerina Fragkiadaki
Carnegie Mellon School of Computer Science
CMU 10703
Parts of slides borrowed from Russ Salakhutdinov, Rich Sutton, David Silver
RL objective
$\pi^* = \arg\max_\pi \ \mathbb{E}_\pi\!\left[\sum_t R(s_t, a_t)\right]$
MaxEntRL objective
$\pi^* = \arg\max_\pi \ \mathbb{E}_\pi\!\left[\sum_{t=1}^{T} \underbrace{R(s_t, a_t)}_{\text{reward}} + \alpha\,\underbrace{H(\pi(\cdot \mid s_t))}_{\text{entropy}}\right]$
Why?
• Better exploration
• Learning alternative ways of accomplishing the task
• Better generalization, e.g., in the presence of obstacles a stochastic policy may still succeed.
Promoting stochastic policies
Principle of Maximum Entropy
Haarnoja et al., Reinforcement Learning with Deep Energy-Based Policies
Policies that generate similar rewards should be equally probable.
We do not want to commit to one policy over the other.
$d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi(a_i \mid s_i; \theta')\,\big(R - V(s_i; \theta'_v)\big) + \beta\,\nabla_{\theta'} H(\pi(s_i; \theta'))$
Mnih et al., Asynchronous Methods for Deep Reinforcement Learning
“We also found that adding the entropy of the policy π to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies. This technique was originally proposed by (Williams & Peng, 1991)”
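A sketch of this entropy-regularized loss (PyTorch-style; names are ours):

```python
import torch

def a3c_policy_loss(log_probs, entropies, returns, values, beta=0.01):
    """Policy-gradient loss with an entropy bonus (sketch of the A3C term).

    log_probs: log pi(a_i|s_i) for the taken actions
    entropies: H(pi(.|s_i)) at each visited state
    returns:   n-step returns R
    values:    V(s_i) baseline (detached: the critic is trained separately)
    """
    advantage = returns - values.detach()
    # Maximize log pi * advantage plus the per-step entropy; minimize the negation.
    return -(log_probs * advantage + beta * entropies).mean()
```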
$d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi(a_i \mid s_i; \theta')\,\big(R - V(s_i; \theta'_v)\big) + \beta\,\nabla_{\theta'} H(\pi(s_i; \theta'))$
Mnih et al., Asynchronous Methods for Deep Reinforcement Learning
This is just a regularization: this gradient only maximizes the entropy of the policy at the current time step, not at future time steps.
MaxEntRL objective
$\pi^* = \arg\max_\pi \ \mathbb{E}_\pi\!\left[\sum_{t=1}^{T} \underbrace{R(s_t, a_t)}_{\text{reward}} + \alpha\,\underbrace{H(\pi(\cdot \mid s_t))}_{\text{entropy}}\right]$
How can we maximize such an objective?
Promoting stochastic policies
Recall: Back-up Diagrams

$q_\pi(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')$
Back-up Diagrams for MaxEnt Objective
$H(\pi(\cdot \mid s')) = -\,\mathbb{E}_{a'} \log \pi(a' \mid s')$

$\pi^* = \arg\max_\pi \ \mathbb{E}_\pi\!\left[\sum_{t=1}^{T} \underbrace{R(s_t, a_t)}_{\text{reward}} + \alpha\,\underbrace{H(\pi(\cdot \mid s_t))}_{\text{entropy}}\right]$
Back-up Diagrams for MaxEnt Objective
Each next action in the backup diagram now carries an extra $-\log \pi(a' \mid s')$ entropy bonus:

$q_\pi(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\,\big(q_\pi(s', a') - \log \pi(a' \mid s')\big)$

$\pi^* = \arg\max_\pi \ \mathbb{E}_\pi\!\left[\sum_{t=1}^{T} \underbrace{R(s_t, a_t)}_{\text{reward}} + \alpha\,\underbrace{H(\pi(\cdot \mid s_t))}_{\text{entropy}}\right]$
(Soft) policy evaluation
Bellman backup equation:
$q_\pi(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')$

Soft Bellman backup equation:
$q_\pi(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\,\big(q_\pi(s', a') - \log \pi(a' \mid s')\big)$

Bellman backup update operator (unknown dynamics):
$Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma\,\mathbb{E}_{s_{t+1}, a_{t+1}}\big[Q(s_{t+1}, a_{t+1})\big]$

Soft Bellman backup update operator (unknown dynamics):
$Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma\,\mathbb{E}_{s_{t+1}, a_{t+1}}\big[Q(s_{t+1}, a_{t+1}) - \log \pi(a_{t+1} \mid s_{t+1})\big]$
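A minimal tabular sketch of soft policy evaluation with known dynamics (NumPy; the shapes are our convention):

```python
import numpy as np

def soft_policy_evaluation(R, T, pi, gamma=0.99, iters=1000):
    """Iterate the soft Bellman backup for a finite MDP until convergence.

    R:  (S, A) rewards;  T: (S, A, S) transition probabilities;  pi: (S, A) policy.
    """
    S, A = R.shape
    q = np.zeros((S, A))
    for _ in range(iters):
        # Soft state value under pi: E_a'[q(s', a') - log pi(a'|s')]
        v = (pi * (q - np.log(pi + 1e-12))).sum(axis=1)   # (S,)
        q_new = R + gamma * T @ v                          # soft Bellman backup
        if np.max(np.abs(q_new - q)) < 1e-8:
            break
        q = q_new
    return q
```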
Soft Bellman backup update operator is a contraction
$Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma\,\mathbb{E}_{s_{t+1} \sim \rho}\big[\mathbb{E}_{a_{t+1} \sim \pi}[Q(s_{t+1}, a_{t+1}) - \log \pi(a_{t+1} \mid s_{t+1})]\big]$
$= r(s_t, a_t) + \gamma\,\mathbb{E}_{s_{t+1} \sim \rho,\, a_{t+1} \sim \pi}\,Q(s_{t+1}, a_{t+1}) + \gamma\,\mathbb{E}_{s_{t+1} \sim \rho}\,\mathbb{E}_{a_{t+1} \sim \pi}[-\log \pi(a_{t+1} \mid s_{t+1})]$
$= r(s_t, a_t) + \gamma\,\mathbb{E}_{s_{t+1} \sim \rho,\, a_{t+1} \sim \pi}\,Q(s_{t+1}, a_{t+1}) + \gamma\,\mathbb{E}_{s_{t+1} \sim \rho}\,H(\pi(\cdot \mid s_{t+1}))$

Rewrite the reward as:
$r_{\text{soft}}(s_t, a_t) = r(s_t, a_t) + \gamma\,\mathbb{E}_{s_{t+1} \sim \rho}\,H(\pi(\cdot \mid s_{t+1}))$

Then we get the old Bellman operator, which we know is a contraction.
Soft Bellman backup update operator:
$Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma\,\mathbb{E}_{s_{t+1}, a_{t+1}}\big[Q(s_{t+1}, a_{t+1}) - \log \pi(a_{t+1} \mid s_{t+1})\big]$

We know that:
$V(s_t) = \mathbb{E}_{a_t \sim \pi}\big[Q(s_t, a_t) - \log \pi(a_t \mid s_t)\big]$

Which means that:
$Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma\,\mathbb{E}_{s_{t+1} \sim \rho}\big[V(s_{t+1})\big]$
Soft Policy Iteration
Soft policy iteration iterates between two steps:
1. Soft policy evaluation: fix the policy and apply the soft Bellman backup operator until convergence. This converges to $q_\pi$:
$q_\pi(s, a) = r(s, a) + \gamma\,\mathbb{E}_{s', a'}\big[q_\pi(s', a') - \log \pi(a' \mid s')\big]$

2. Soft policy improvement: update the policy:
$\pi' = \arg\min_{\pi_k \in \Pi} D_{\mathrm{KL}}\!\left(\pi_k(\cdot \mid s_t)\;\Big\|\;\frac{\exp(Q^\pi(s_t, \cdot))}{Z^\pi(s_t)}\right)$

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al.
SoftMax
Soft Policy Iteration
Soft policy iteration (the two steps above) leads to a sequence of policies with monotonically increasing soft Q-values.

This so far concerns tabular methods. Next we will use function approximation for the policy and the value functions.
Soft Policy Iteration - Approximation
Use function approximation for the policy, the state value function, and the action value function: $V_\psi(s_t)$, $Q_\theta(s_t, a_t)$, $\pi_\phi(a_t \mid s_t)$.

1. Learning the state value function:
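From the SAC paper (Haarnoja et al., 2018), rendered in our notation: the value network regresses onto the entropy-augmented value of the current policy,

$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\!\left[\tfrac{1}{2}\Big(V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\big[Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t)\big]\Big)^2\right]$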
2. Learning the state-action value function:
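From the same paper, the Q network regresses onto a one-step soft Bellman target, with $V_{\bar\psi}$ a target value network:

$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\!\left[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t)\big)^2\right], \qquad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma\,\mathbb{E}_{s_{t+1} \sim \rho}\big[V_{\bar\psi}(s_{t+1})\big]$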
3. Learning the policy:

$\nabla_\phi J_\pi(\phi) = \nabla_\phi\,\mathbb{E}_{s_t \sim \mathcal{D}}\,\mathbb{E}_{a \sim \pi_\phi(\cdot \mid s_t)}\!\left[\log \frac{\pi_\phi(\cdot \mid s_t)}{\exp(Q^\pi(s_t, \cdot)) / Z^\pi(s_t)}\right]$

Difficulty: the variable with respect to which we take the gradient parametrizes the distribution inside the expectation.

Substituting the function approximators $Q_\theta$ and $Z_\theta$:

$\nabla_\phi J_\pi(\phi) = \nabla_\phi\,\mathbb{E}_{s_t \sim \mathcal{D}}\,\mathbb{E}_{a_t \sim \pi_\phi(\cdot \mid s_t)}\!\left[\log \frac{\pi_\phi(a_t \mid s_t)}{\exp(Q_\theta(s_t, a_t)) / Z_\theta(s_t)}\right]$

where $Z_\theta(s_t) = \int_{\mathcal{A}} \exp(Q_\theta(s_t, a_t))\,da_t$ is independent of $\phi$ and can be dropped.
Reparametrization trick: the policy becomes a deterministic function of the state and a fixed Gaussian random variable:

$a_t = f_\phi(s_t, \epsilon) = \mu_\phi(s_t) + \epsilon\,\Sigma_\phi(s_t), \qquad \epsilon \sim \mathcal{N}(0, I)$

The sampling distribution no longer depends on $\phi$, so the gradient passes through the expectation:

$\nabla_\phi J_\pi(\phi) = \nabla_\phi\,\mathbb{E}_{s_t \sim \mathcal{D},\,\epsilon \sim \mathcal{N}(0, I)}\!\big[\log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t)\big], \qquad a_t = f_\phi(s_t, \epsilon)$
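A minimal PyTorch sketch of this reparametrized policy update (names are ours; SAC additionally squashes actions with tanh, omitted here):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Reparametrized Gaussian policy: a = mu(s) + eps * sigma(s), eps ~ N(0, I)."""
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, a_dim)
        self.log_std = nn.Linear(hidden, a_dim)

    def forward(self, s):
        h = self.body(s)
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        dist = torch.distributions.Normal(self.mu(h), std)
        a = dist.rsample()                    # reparametrized: gradients flow into mu, std
        log_pi = dist.log_prob(a).sum(-1)     # log pi(a|s), summed over action dims
        return a, log_pi

def policy_loss(policy, q_fn, states):
    """Monte-Carlo estimate of E[log pi(a|s) - Q(s,a)] using rsample()."""
    a, log_pi = policy(states)
    return (log_pi - q_fn(states, a)).mean()
```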
Soft Policy Iteration - Approximation
Composability of Maximum Entropy Policies
Composable Deep Reinforcement Learning for Robotic Manipulation, Haarnoja et al.
Imagine we want to satisfy two objectives at the same time, e.g., pick an object up while avoiding an obstacle. We would learn a policy that maximizes the sum of the corresponding reward functions.
MaxEnt policies let us obtain (approximately) the resulting policy's optimal Q by simply adding the constituent Qs:
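In symbols (a reconstruction following Haarnoja et al.): for a combined task with reward $r_{\text{comb}}(s, a) = r_1(s, a) + r_2(s, a)$,

$Q_{\text{comb}}(s, a) \approx Q_1(s, a) + Q_2(s, a)$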
We can theoretically bound the suboptimality of the resulting policy w.r.t. the policy trained under the addition of rewards. We cannot do this for deterministic policies.
https://youtu.be/wdexoLS2cWU