Maximum Margin Planning - Cornell University
Presenters:
Ashesh Jain, Michael Hu
CS6784 Class Presentation
Nathan Ratliff, Drew Bagnell and Martin Zinkevich
Jain, Hu
Theme
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
– Reward based learning
Inverse Reinforcement Learning
Motivating Example (Abbeel et al.)
Outline
Basics of Reinforcement Learning
– State and action space
– Activity
Introducing Inverse RL
– Max-margin formulation
– Optimization algorithms
– Activity
Conclusion
Reward-based Learning
Markov Decision Process (MDP)
MDP is defined by (X,A, 𝑝, 𝑅)
X State space
A Action space
𝑝 Dynamics model 𝑝(𝑥′|𝑥, 𝑎)
𝑅 Reward function
𝑅(𝑥) or 𝑅(𝑥, 𝑎)
[Grid-world figure: X/Y grid with stochastic transitions (probabilities 0.7, 0.1, 0.1, 0.1) and rewards of 100 at the goal and −1 elsewhere]
Markov Decision Process (MDP)
Policy
𝜋:X →A
Example policy
Graphical representation of MDP: a chain of states $x_t \to x_{t+1}$ with actions $a_t, a_{t+1}$ and rewards $R_t, R_{t+1}$.

Policy $\pi: \mathcal{X} \to \mathcal{A}$

Goal: learn an optimal policy
$$\pi^* = \arg\max_{\pi} \, E\left[\sum_{t=0}^{\infty} \gamma^t R_t(x_t, \pi(x_t))\right]$$

Executing the policy traces out a path $x_t, \pi(x_t) \to x_{t+1}, \pi(x_{t+1}) \to \cdots$
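As an illustrative sketch (not from the slides), value iteration on a tiny deterministic chain recovers an optimal policy for exactly this objective; the 100/−1 rewards mirror the grid example above.

```python
import numpy as np

# Tiny 1x4 chain: states 0..3, actions left(-1)/right(+1).
# Reaching state 3 yields reward 100; every other step costs -1.
n_states, gamma = 4, 0.9
actions = [-1, +1]

def step(x, a):
    return min(max(x + a, 0), n_states - 1)  # deterministic dynamics

def reward(x):
    return 100.0 if x == 3 else -1.0

V = np.zeros(n_states)
for _ in range(100):  # value iteration: V(x) = max_a [R(x') + gamma V(x')]
    V = np.array([max(reward(step(x, a)) + gamma * V[step(x, a)]
                      for a in actions) for x in range(n_states)])

policy = [max(actions, key=lambda a: reward(step(x, a)) + gamma * V[step(x, a)])
          for x in range(n_states)]
print(policy)  # [1, 1, 1, 1]: every state moves right, toward the +100 goal
```

With the goal absorbing, $V(3)$ converges to $100/(1-\gamma) = 1000$.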
Activity – Very Easy!!!
Draw the optimal Policy!
Deterministic Dynamics
[Grid figure: reward 1 at the goal cell, 0 elsewhere]
Activity – Very Easy!!!
Activity – Solution
Outline
Basics of Reinforcement Learning
– State and action space
– Activity
Introducing Inverse RL
– Max-margin formulation
– Optimization algorithms
– Activities
Conclusion
RL to Inverse-RL (IRL)

Forward RL: given the state space $\mathcal{X}$, action space $\mathcal{A}$, dynamics model $p(x'|x,a)$, and reward function $R(x,a)$, an RL algorithm outputs the optimal policy $\pi^*: x \to a$.

Inverse problem: given the optimal policy $\pi^*$, or samples drawn from it, recover the reward $R(x,a)$.
Why Inverse-RL?

Specifying $R(x,a)$ by hand is hard, but samples from $\pi^*$ are easily available (Abbeel et al., Ratliff et al., Kober et al.). IRL recovers $R(x,a)$ from those samples; an RL algorithm over the state space $\mathcal{X}$ and action space $\mathcal{A}$ can then turn the learned reward into a policy.
IRL framework
Expert
𝜋∗: 𝑥 → 𝑎
Interacts
Demonstration
$y^* = (x_1, a_1) \to (x_2, a_2) \to (x_3, a_3) \to \cdots \to (x_n, a_n)$

Reward of the demonstration is linear in the per-step features:
$$w^T f(y^*) = w^T f(x_1, a_1) + w^T f(x_2, a_2) + \cdots + w^T f(x_n, a_n)$$
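A minimal sketch of this decomposition, with made-up 3-dimensional features per (state, action) pair: the demonstration's cumulative feature vector is the sum of per-step features, so its reward is linear in $w$.

```python
import numpy as np

# Hypothetical per-(state, action) feature vectors, invented for illustration.
features = {
    (0, 1): np.array([1.0, 0.0, 0.5]),
    (1, 1): np.array([0.0, 1.0, 0.5]),
    (2, 0): np.array([1.0, 1.0, 0.0]),
}
demonstration = [(0, 1), (1, 1), (2, 0)]   # (x_t, a_t) pairs of the expert path
w = np.array([2.0, -1.0, 4.0])             # reward weights

# f(y*) = sum_t f(x_t, a_t): cumulative feature counts of the demonstration
f_y = sum(features[xa] for xa in demonstration)
print(f_y, w @ f_y)  # [2. 2. 1.] 6.0
```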
Paper contributions
1. Formulated IRL as structured prediction
Maximum margin approach
2. Two optimization algorithms
Problem specific solver
Sub-gradient method
3. Robotic application: 2D navigation etc.
Outline
Basics of Reinforcement Learning
– State and action space
– Activity
Introducing Inverse RL
– Max-margin formulation
– Optimization algorithms
– Activities
Conclusion
Formulation
Input $\{\mathcal{X}_i, \mathcal{A}_i, p_i, f_i, y_i, \mathcal{L}_i\}_{i=1}^{n}$
$\mathcal{X}_i$ State space
$\mathcal{A}_i$ Action space
$p_i$ Dynamics model
$y_i$ Expert demonstration
$f_i(\cdot)$ Feature function
$\mathcal{L}_i(\cdot, y_i)$ Loss function
[Grid-world figure: stochastic transitions with probabilities 0.7, 0.1, 0.1, 0.1]
Max-margin formulation

$$\min_{w,\xi_i} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_i \beta_i \xi_i$$
$$\text{s.t.}\;\; \forall i \quad w^T f_i(y_i) \;\ge\; \max_{y \in \mathcal{Y}_i}\left[\, w^T f_i(y) + \mathcal{L}(y, y_i)\,\right] - \xi_i$$

Features linearly decompose over the path: $f_i(y) = F_i \mu$, where $F_i \in \mathbb{R}^{d \times |\mathcal{X}||\mathcal{A}|}$ is the feature map and $\mu$ is the vector of state-action frequency counts.
Max-margin formulation

$$\min_{w,\xi_i} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_i \beta_i \xi_i$$
$$\text{s.t.}\;\; \forall i \quad w^T F_i \mu_i \;\ge\; \max_{\mu \in \mathcal{G}_i}\left[\, w^T F_i \mu + l_i^T \mu\,\right] - \xi_i$$

where $\mathcal{G}_i$ is the set of $\mu$ satisfying the Bellman-flow constraints:
$$\forall x' \quad \underbrace{\sum_a \mu_{x',a}}_{\text{Outflow}} \;=\; \underbrace{s_i^{x'} + \sum_{x,a} \mu_{x,a}\, p_i(x'|x,a)}_{\text{Inflow}}$$
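The flow constraints can be checked numerically. A sketch, assuming a discounted flow with factor $\gamma$ (the slides' undiscounted form is the $\gamma \to 1$ case with absorbing states); the policy, dynamics, and start distribution here are arbitrary toy values.

```python
import numpy as np

n_x, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

P = rng.random((n_x, n_a, n_x))          # p(x'|x,a)
P /= P.sum(axis=2, keepdims=True)
pi = np.full((n_x, n_a), 1.0 / n_a)      # uniform random policy
s = np.array([1.0, 0.0, 0.0])            # start-state distribution

# State occupancies d solve d = s + gamma * P_pi^T d
P_pi = np.einsum('xa,xay->xy', pi, P)    # state-to-state transitions under pi
d = np.linalg.solve(np.eye(n_x) - gamma * P_pi.T, s)
mu = pi * d[:, None]                     # mu(x,a) = pi(a|x) d(x)

# Bellman-flow check: outflow sum_a mu(x',a) equals
# inflow s(x') + gamma * sum_{x,a} mu(x,a) p(x'|x,a)
outflow = mu.sum(axis=1)
inflow = s + gamma * np.einsum('xa,xay->y', mu, P)
print(np.allclose(outflow, inflow))  # True
```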
Outline
Basics of Reinforcement Learning
– State and action space
– Activity
Introducing Inverse RL
– Max-margin formulation
– Optimization algorithms
• Problem specific QP-formulation
• Sub-gradient method
– Activities
Conclusion
Problem specific QP-formulation

Replacing the inner maximization over $\mu \in \mathcal{G}_i$ (the Bellman-flow polytope) by its LP dual, with value variables $v_i$, yields a single quadratic program:

$$\min_{w,\xi_i,v_i} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_i \beta_i \xi_i$$
$$\text{s.t.}\;\; \forall i \quad w^T F_i \mu_i \;\ge\; s_i^T v_i - \xi_i$$
$$\forall i, x, a \quad v_i^{x} \;\ge\; \left(w^T F_i + l_i^T\right)^{x,a} + \sum_{x'} p_i(x'|x,a)\, v_i^{x'}$$

Can be optimized using off-the-shelf QP solvers.
Outline
Basics of Reinforcement Learning
– State and action space
– Activity
Introducing Inverse RL
– Max-margin formulation
– Optimization algorithms
• Problem specific QP-formulation
• Sub-gradient method
– Activities
Conclusion
Optimizing via Sub-gradient

$$\min_{w,\xi_i} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_i \beta_i \xi_i$$
$$\text{s.t.}\;\; \forall i \quad w^T F_i \mu_i \;\ge\; \max_{\mu \in \mathcal{G}_i}\left[\, w^T F_i \mu + l_i^T \mu\,\right] - \xi_i, \quad \mu \text{ satisfying the Bellman-flow constraints}$$

Re-writing the constraint:
$$\xi_i \;\ge\; \max_{\mu \in \mathcal{G}_i}\left[\, w^T F_i \mu + l_i^T \mu\,\right] - w^T F_i \mu_i \;\ge\; 0$$

It is tight at optimality:
$$\xi_i \;=\; \max_{\mu \in \mathcal{G}_i}\left[\, w^T F_i \mu + l_i^T \mu\,\right] - w^T F_i \mu_i$$
Optimizing via Sub-gradient

Substituting the tight slack into the objective turns the constrained QP into an unconstrained (but non-differentiable) problem:
$$c(w) \;=\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_i \beta_i \left( \max_{\mu \in \mathcal{G}_i}\left[\, w^T F_i \mu + l_i^T \mu\,\right] - w^T F_i \mu_i \right)$$

Ratliff, PhD Thesis
Weight update via Sub-gradient

$$\min_{w} \; c(w) \;=\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_i \beta_i \left( \max_{\mu \in \mathcal{G}_i}\left[\, w^T F_i \mu + l_i^T \mu\,\right] - w^T F_i \mu_i \right)$$

The standard gradient descent update $w_{t+1} \leftarrow w_t - \alpha \nabla_{w} c(w_t)$ does not apply directly because of the max term, so we step along a sub-gradient instead:
$$w_{t+1} \leftarrow w_t - \alpha g_{w_t}, \qquad g_{w_t} \;=\; \lambda w_t + \frac{1}{n}\sum_i \beta_i F_i(\mu^* - \mu_i)$$

where $\mu^* = \arg\max_{\mu \in \mathcal{G}_i} \left(w_t^T F_i + l_i^T\right)\mu$ is the optimal path under the loss-augmented reward.
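A toy sketch of this update loop (one training example, $\beta_i = 1$). Loss-augmented inference here is a max over a small hand-made candidate set of frequency-count vectors rather than a real planner, and the features, loss, and constants are invented for illustration.

```python
import numpy as np

d = 3
F = np.array([[1.0, 0.0, 1.0],           # hypothetical feature map F_i
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])
mu_expert = np.array([1.0, 0.0, 1.0])    # expert's frequency counts mu_i
candidates = [mu_expert,                 # stand-in for the set G_i
              np.array([0.0, 1.0, 1.0]),
              np.array([1.0, 1.0, 0.0])]
loss = lambda mu: np.abs(mu - mu_expert).sum()  # surrogate for l_i^T mu

lam, alpha = 0.1, 0.1
w = np.zeros(d)
for t in range(200):
    # loss-augmented inference: mu* = argmax_mu w^T F mu + L(mu)
    mu_star = max(candidates, key=lambda mu: w @ F @ mu + loss(mu))
    g = lam * w + F @ (mu_star - mu_expert)     # subgradient g_w (n=1, beta=1)
    w = w - alpha * g

margin = w @ F @ mu_expert - max(w @ F @ m for m in candidates[1:])
print(margin > 0)  # True: the expert's path now scores highest
```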
Convergence summary

$$c(w) \;=\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} r_i(w), \qquad w \leftarrow w - \alpha g_w$$

How to choose $\alpha$? If $\forall w\; \|g_w\| \le G$ and $0 < \alpha \le \frac{1}{\lambda}$, then a constant step size gives linear convergence to a region around the minimum:
$$\|w_{t+1} - w^*\|^2 \;\le\; \kappa^{t+1}\,\|w_0 - w^*\|^2 + \frac{\alpha G^2}{\lambda}, \qquad 0 < \kappa < 1$$

The residual term $\frac{\alpha G^2}{\lambda}$ is the optimization error.
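A one-dimensional illustration (not from the slides): $c(w) = \frac{\lambda}{2}w^2 + |w-1|$ is $\lambda$-strongly convex with bounded subgradients on the region visited, and a constant step $\alpha \le 1/\lambda$ drives $w$ into a small band around the minimizer $w^* = 1$ rather than converging exactly.

```python
import numpy as np

# c(w) = (lam/2) w^2 + |w - 1|; w* = 1 since 0 is a subgradient of c there.
lam, alpha = 0.5, 0.1          # alpha <= 1/lam, as the slide requires
w, w_star = 5.0, 1.0
for t in range(100):
    g = lam * w + np.sign(w - 1.0)   # a subgradient of c at w
    w -= alpha * g                   # constant-step subgradient descent
print(abs(w - w_star))  # small: the iterate hovers in a band around w*
```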
Outline
Basics of Reinforcement Learning
– State and action space
– Activity
Introducing Inverse RL
– Max-margin formulation
– Optimization algorithms
– Activity
Conclusion
Convergence proof

$$\|w_{t+1} - w^*\|^2 \;=\; \|w_t - \alpha g_t - w^*\|^2 \;=\; \|w_t - w^*\|^2 + \alpha^2 \|g_t\|^2 - 2\alpha\, g_t^T (w_t - w^*)$$

We know $\forall w\; \|g_w\| \le G$, and by $\lambda$-strong convexity of $c(w)$:
$$c(w) \;\ge\; c(w') + g_{w'}^T (w - w') + \frac{\lambda}{2}\|w - w'\|^2$$

Setting $w \leftarrow w^*$ and $w' \leftarrow w_t$, and using $c(w^*) \le c(w_t)$, gives $\frac{\lambda}{2}\|w_t - w^*\|^2 \le g_t^T (w_t - w^*)$. Substituting both bounds:
$$\|w_{t+1} - w^*\|^2 \;\le\; (1 - \alpha\lambda)\,\|w_t - w^*\|^2 + \alpha^2 G^2$$
Outline
Basics of Reinforcement Learning
– State and action space
– Activity
Introducing Inverse RL
– Max-margin formulation
– Optimization algorithms
– Activity
Conclusion
2D path planning
Expert’s Demonstration
Learned Reward
Planning in a new Environment
Car driving (Abbeel et al.)
Bad driver (Abbeel et al.)
Recent works
1. N. Ratliff, D. Bradley, D. Bagnell, J. Chestnutt. Boosting Structured Prediction for Imitation Learning. In NIPS 2007
2. B. Ziebart, A. Maas, D. Bagnell, A. K. Dey. Maximum Entropy Inverse Reinforcement Learning. In AAAI 2008
3. P. Abbeel, A. Coates, A. Ng. Autonomous Helicopter Aerobatics through Apprenticeship Learning. In IJRR 2010
4. S. Ross, G. Gordon, D. Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In AISTATS 2011
5. B. Akgun, M. Cakmak, K. Jiang, A. Thomaz. Keyframe-based Learning from Demonstration. In IJSR 2012
6. A. Jain, B. Wojcik, T. Joachims, A. Saxena. Learning Trajectory Preferences for Manipulators via Iterative Improvement. In NIPS 2013
Questions
• Summary of key points:
– Reward-based learning can be modeled by the MDP framework.
– Inverse Reinforcement Learning takes perception features as input and outputs a reward function.
– The max-margin planning problem can be formulated as a tractable QP.
– Sub-gradient descent solves an equivalent unconstrained problem with a provable convergence bound.