Algorithms for Inverse Reinforcement Learning
Andrew Ng and Stuart Russell
(slide presentation transcript)
1. Algorithms for Inverse Reinforcement Learning
2. Apprenticeship learning via Inverse Reinforcement Learning
Algorithms for Inverse Reinforcement Learning
Andrew Ng and Stuart Russell
Motivation
● Given: (1) measurements of an agent's behavior over time, in a variety of circumstances; (2) if needed, measurements of the sensory inputs to that agent; (3) if available, a model of the environment.
● Determine: the reward function being optimized.
Why?
● Reason #1: Computational models for animal and human learning.
● “In examining animal and human behavior we must consider the reward function as an unknown to be ascertained through empirical investigation.”
● Particularly true of multi-attribute reward functions (e.g. bee foraging: amount of nectar vs. flight time vs. risk from wind/predators)
Why?
● Reason #2: Agent construction.
● “An agent designer [...] may only have a very rough idea of the reward function whose optimization would generate 'desirable' behavior.”
● e.g. “Driving well”
● Apprenticeship learning: is recovering the expert's underlying reward function more “parsimonious” than learning the expert's policy?
Possible applications in multi-agent systems
● In multi-agent adversarial games, learning the reward functions that guide opponents’ actions lets us devise strategies against them.
● In mechanism design, learning each agent’s reward function from histories of play makes it possible to influence its actions.
● and more?
Inverse Reinforcement Learning (1) – MDP Recap
• An MDP is represented as a tuple (S, A, {Psa}, γ, R)
  Note: R is bounded by Rmax
• Value function for policy π:
  V^π(s1) = E[ R(s1) + γR(s2) + γ^2 R(s3) + ... | π ]
• Q-function:
  Q^π(s, a) = R(s) + γ E_{s'~Psa(·)}[ V^π(s') ]
Inverse Reinforcement Learning (1) – MDP Recap
• Bellman Equation: for all s,
  V^π(s) = R(s) + γ Σ_{s'} P_{s,π(s)}(s') V^π(s')
  Q^π(s, a) = R(s) + γ Σ_{s'} P_{s,a}(s') V^π(s')
• Bellman Optimality: π is optimal if and only if, for all s,
  π(s) ∈ argmax_{a∈A} Q^π(s, a)
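For a finite MDP the Bellman equation can be solved in closed form, V^π = (I − γP_π)^{-1} R. A minimal numpy sketch (the 3-state MDP below is made up for illustration):

```python
import numpy as np

# Minimal sketch: solve the Bellman equation V = R + gamma * P_pi V in
# closed form for a made-up 3-state MDP under a fixed policy pi.
gamma = 0.9
R = np.array([0.0, 0.0, 1.0])             # reward per state
P_pi = np.array([[0.5, 0.5, 0.0],         # row s: P_{s, pi(s)}(s')
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
V = np.linalg.solve(np.eye(3) - gamma * P_pi, R)  # V = (I - gamma P_pi)^{-1} R
assert np.allclose(V, R + gamma * P_pi @ V)       # Bellman equation holds
```

Solving the linear system directly sidesteps iterative value iteration for small state spaces; the same closed form reappears in the finite-state IRL derivation below.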
Inverse Reinforcement Learning (2) – Finite State Space

• Reward function solution set (a1 = π(s) is the optimal action):
  π(s) = argmax_{a∈A} Q^π(s, a) = a1
  V^π = R + γ Pa1 V^π  (Bellman equation in matrix form)
  ⇒ V^π = (I − γPa1)^{-1} R
  a1 optimal ⟺ Pa1 V^π ⪰ Pa V^π                          for all a ∈ A\a1
             ⟺ Pa1 (I − γPa1)^{-1} R ⪰ Pa (I − γPa1)^{-1} R
             ⟺ (Pa1 − Pa)(I − γPa1)^{-1} R ⪰ 0           for all a ∈ A\a1
  (⪰ denotes componentwise inequality)
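The condition (Pa1 − Pa)(I − γPa1)^{-1} R ⪰ 0 is easy to test numerically. A sketch (the helper name and the toy 2-state MDP are my own, for illustration only):

```python
import numpy as np

def a1_is_optimal(P_a1, P_a_list, R, gamma=0.9):
    """Check the finite-state condition:
    (P_a1 - P_a)(I - gamma P_a1)^{-1} R >= 0 for every other action a."""
    n = len(R)
    V = np.linalg.solve(np.eye(n) - gamma * P_a1, R)  # (I - gamma P_a1)^{-1} R
    return all(np.all((P_a1 - P_a) @ V >= -1e-9) for P_a in P_a_list)

# Toy 2-state example: a1 moves toward the rewarded state, a2 away from it.
P_a1 = np.array([[0.0, 1.0], [0.0, 1.0]])
P_a2 = np.array([[1.0, 0.0], [1.0, 0.0]])
R = np.array([0.0, 1.0])
```

Here `a1_is_optimal(P_a1, [P_a2], R)` is True, while swapping the roles of the two actions makes it False, matching the intuition that the observed action must steer toward the reward.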
Inverse Reinforcement Learning (2) – Finite State Space

There are many choices of R that satisfy the inequality (e.g. R = 0). Which one might be the best solution?

1. Make deviation from a1 as costly as possible: maximize
   Σ_{s∈S} ( Q^π(s, a1) − max_{a∈A\a1} Q^π(s, a) )
2. Make the reward function as simple as possible, e.g. with an added penalty −λ||R||1.
Inverse Reinforcement Learning (2) – Finite State Space

• Linear Programming Formulation:

  maximize Σ_{i=1}^N min_{a∈{a2,...,ak}} { (Pa1(i) − Pa(i)) (I − γPa1)^{-1} R } − λ||R||1

  s.t. (Pa1(i) − Pa(i)) (I − γPa1)^{-1} R ≥ 0   for all a ∈ A\a1, i = 1, ..., N
       |Ri| ≤ Rmax,  i = 1, ..., N

  where Pa(i) denotes the i-th row of Pa.

[Figure: given observed actions a1, a2, ..., an, which R makes the value of a1 maximized over the values of the other actions?]
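One way to realize this LP concretely is with scipy's `linprog`, introducing auxiliary variables t for the inner min and u for |R|. This is a sketch under assumed conventions (the function name, variable layout, and toy MDP are illustrative, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import linprog

def irl_lp(P, a1, gamma=0.9, lam=1.0, r_max=1.0):
    """Sketch of the finite-state IRL linear program.
    P maps each action to its (n x n) transition matrix; a1 is the observed
    optimal action.  LP variables: R (n rewards), t (n slacks for the inner
    min), u (n bounds on |R|)."""
    n = P[a1].shape[0]
    inv = np.linalg.inv(np.eye(n) - gamma * P[a1])
    Ms = [(P[a1] - P[a]) @ inv for a in P if a != a1]  # (Pa1 - Pa)(I - g Pa1)^-1

    # maximize sum(t) - lam * sum(u)  ==  minimize -sum(t) + lam * sum(u)
    c = np.concatenate([np.zeros(n), -np.ones(n), lam * np.ones(n)])
    A_ub, b_ub = [], []
    Z = np.zeros((n, n))
    for M in Ms:
        A_ub += [np.hstack([-M, np.eye(n), Z]),   # t_i <= (M R)_i
                 np.hstack([-M, Z, Z])]           # (M R)_i >= 0
        b_ub += [np.zeros(n), np.zeros(n)]
    A_ub += [np.hstack([np.eye(n), Z, -np.eye(n)]),    # R <= u
             np.hstack([-np.eye(n), Z, -np.eye(n)])]   # -R <= u
    b_ub += [np.zeros(n), np.zeros(n)]
    bounds = [(-r_max, r_max)] * n + [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=bounds)
    return res.x[:n]

# Toy 2-state MDP: under a1 both states lead to state 1, under a2 to state 0.
P = {"a1": np.array([[0.0, 1.0], [0.0, 1.0]]),
     "a2": np.array([[1.0, 0.0], [1.0, 0.0]])}
R = irl_lp(P, "a1")
# The LP puts the reward on the state that a1 drives toward.
```

The auxiliary t variables turn the non-linear min into linear constraints (t_i below every candidate), and u linearizes the ||R||1 penalty; both are standard LP reformulations.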
Inverse Reinforcement Learning (3) – Large State Spaces

• Linear approximation of the reward function (in the driving example, basis functions can be collision, staying in the right lane, ..., etc.):
  R(s) = α1 φ1(s) + α2 φ2(s) + ... + αd φd(s)
• Let Vi^π be the value function of policy π when the reward is R = φi. By linearity of expectation,
  V^π = α1 V1^π + α2 V2^π + ... + αd Vd^π
• For R to make a1 = π(s) optimal:
  E_{s'~Ps,a1}[ V^π(s') ] ≥ E_{s'~Ps,a}[ V^π(s') ]   for all a ∈ A\a1
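The linearity claim V^π = Σi αi Vi^π is straightforward to verify numerically. A minimal sketch with a made-up 3-state MDP and two basis rewards:

```python
import numpy as np

# Sketch: with R = sum_i alpha_i * phi_i, the value function is linear in the
# alphas: V^pi = sum_i alpha_i * V_i^pi, where V_i^pi uses reward phi_i alone.
gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [1.0, 0.0, 0.0]])
phi = np.array([[1.0, 0.0, 0.0],     # phi_1 evaluated at each state
                [0.0, 0.0, 1.0]])    # phi_2 evaluated at each state
alpha = np.array([0.3, 0.7])
A = np.eye(3) - gamma * P_pi
V_basis = np.array([np.linalg.solve(A, f) for f in phi])  # one V_i^pi per basis
V_direct = np.linalg.solve(A, alpha @ phi)                # V^pi for combined R
assert np.allclose(alpha @ V_basis, V_direct)
```

This is what makes the LP below tractable: the d basis value functions can be computed once, and every candidate α is then evaluated with dot products only.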
Inverse Reinforcement Learning (3) – Large State Spaces

• With an infinite or very large state space, it is usually not possible to check all constraints:
  E_{s'~Ps,a1}[ V^π(s') ] ≥ E_{s'~Ps,a}[ V^π(s') ]
• Choose a finite subset S0 of the states.
• Linear Programming formulation: find αi that
  maximize Σ_{s∈S0} min_{a∈{a2,...,ak}} p( E_{s'~Ps,a1}[ V^π(s') ] − E_{s'~Ps,a}[ V^π(s') ] )
  s.t. |αi| ≤ 1,  i = 1, ..., d
  where p(x) = x if x ≥ 0, and p(x) = 2x otherwise.
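The penalty p and the sampled-constraint objective can be written down directly. A small sketch (the helper names and array layout are assumptions for illustration):

```python
import numpy as np

def p(x):
    """Penalty p(x) = x if x >= 0, else 2x, so constraint violations are
    penalized twice as heavily as satisfied constraints are rewarded."""
    return np.where(x >= 0, x, 2 * x)

def irl_objective(alpha, V_a1, V_others):
    """Illustrative helper: objective over the sampled states S0.
    V_a1[i, s]        = E_{s'~Ps,a1}[V_i^pi(s')] for basis reward phi_i,
    V_others[a][i, s] = the same expectation under a non-expert action a."""
    lhs = alpha @ V_a1                                    # expert action's value
    worst = np.min([alpha @ V for V in V_others], axis=0) # inner min over actions
    return np.sum(p(lhs - worst))                         # sum over states in S0
```

Because p is piecewise linear and concave, maximizing this objective over the box |αi| ≤ 1 remains a linear program after the usual slack-variable reformulation.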
Inverse Reinforcement Learning (4) – IRL from Sampled Trajectories

• If π is only accessible through a set of sampled trajectories (e.g. the driving demo in the 2nd paper).
• Assume we start from a dummy state s0 (whose next-state distribution is according to D).
• Given a trajectory with state sequence (s0, s1, s2, ...), estimate each basis value by the discounted sum
  V̂i^π(s0) = φi(s0) + γ φi(s1) + γ^2 φi(s2) + ...
  so that
  V̂^π(s0) = α1 V̂1^π(s0) + α2 V̂2^π(s0) + ... + αd V̂d^π(s0)
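The discounted sums V̂i^π(s0) are simple to estimate from one trajectory of feature vectors. A minimal sketch (the array layout is assumed):

```python
import numpy as np

def empirical_basis_values(trajectory_features, gamma=0.9):
    """V^hat_i(s0) = phi_i(s0) + gamma*phi_i(s1) + gamma^2*phi_i(s2) + ...
    trajectory_features: (T x d) array, row t = phi(s_t) along one trajectory."""
    T = trajectory_features.shape[0]
    discounts = gamma ** np.arange(T)
    return discounts @ trajectory_features  # one V^hat_i per basis function

feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 1.0]])
v = empirical_basis_values(feats, gamma=0.5)
```

In practice one averages these sums over many sampled trajectories; the linearity in α then lets the LP re-score every candidate reward without re-simulating.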
Inverse Reinforcement Learning (4) – IRL from Sampled Trajectories

• Assume we have some set of policies {π1, π2, ..., πk}.
• Linear Programming formulation:
  maximize Σ_{i=1}^k p( V̂^π*(s0) − V̂^πi(s0) )
  s.t. |αi| ≤ 1,  i = 1, ..., d
• The above optimization gives a new reward R; we then compute πk+1 based on R and add it to the set of policies.
• Reiterate.
Discrete Gridworld Experiment
● 5x5 grid world
● Agent starts in bottom-left square.
● Reward of 1 in the upper-right square.
● Actions = N, W, S, E (30% chance of a random action)
Discrete Gridworld Results
Mountain Car Experiment #1
● Car starts in valley, goal is at the top of the hill
● Reward is -1 per “step” until the goal is reached
● State = car's x-position & velocity (continuous!)
● Function approx. class: all linear combinations of 26 evenly spaced Gaussian-shaped basis functions
Mountain Car Experiment #2
● Goal is in the bottom of the valley
● Car starts... not sure. Top of the hill?
● Reward is 1 in the goal area, 0 elsewhere
● γ = 0.99
● State = car's x-position & velocity (continuous!)
● Function approx. class: all linear combinations of 26 evenly spaced Gaussian-shaped basis functions
Mountain Car Results
[Results figures for experiments #1 and #2]
Continuous Gridworld Experiment
● State space is now the continuous grid [0,1] x [0,1]
● Actions: 0.2 movement in any direction, plus noise in the x and y coordinates of [-0.1, 0.1]
● Reward 1 in the region [0.8,1] x [0.8,1], 0 elsewhere
● γ = 0.9
● Function approx. class: all linear combinations of a 15x15 array of 2-D Gaussian-shaped basis functions
● m = 5000 trajectories of 30 steps each per policy
Continuous Gridworld Results
3%-10% error when comparing fitted reward's optimal policy with the true optimal policy
However, no significant difference in quality of policy (measured using true reward function)
Apprenticeship Learning via Inverse Reinforcement Learning
Pieter Abbeel & Andrew Y. Ng
Algorithm
● For t = 1,2,…
Inverse RL step: Estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {πi}.

RL step: Compute the optimal policy πt for the estimated reward w.
Courtesy of Pieter Abbeel
Algorithm: IRL step
● Maximize γ over w: ||w||2 ≤ 1
● s.t. Vw(πE) ≥ Vw(πi) + γ,  i = 1, ..., t-1
● γ = margin of the expert's performance over the performance of previously found policies.
● Vw(π) = E[ Σt γ^t R(st) | π ] = E[ Σt γ^t w^T φ(st) | π ]
        = w^T E[ Σt γ^t φ(st) | π ]
        = w^T μ(π)
● μ(π) = E[ Σt γ^t φ(st) | π ] are the “feature expectations”
Courtesy of Pieter Abbeel
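Feature expectations μ(π) are estimated as discounted sums of feature vectors, averaged over sampled trajectories. A minimal sketch (the array layout is assumed):

```python
import numpy as np

def feature_expectations(trajectories, gamma=0.9):
    """mu(pi) ~= average over trajectories of sum_t gamma^t phi(s_t).
    trajectories: list of (T x d) arrays of per-step feature vectors."""
    mus = []
    for feats in trajectories:
        discounts = gamma ** np.arange(feats.shape[0])
        mus.append(discounts @ feats)
    return np.mean(mus, axis=0)

# With mu in hand, any linear reward's value is a dot product:
# V_w(pi) = w @ mu(pi).
traj = np.array([[1.0, 0.0], [0.0, 1.0]])
mu = feature_expectations([traj], gamma=0.5)
```

Once each policy is summarized by its μ vector, comparing policies under any candidate w costs one dot product, which is what makes the iterative max-margin loop cheap.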
Feature Expectation Closeness and Performance
● If we can find a policy π such that
  ||μ(πE) − μ(π)||2 ≤ ε,
● then for any underlying reward R*(s) = w*^T φ(s) with ||w*||2 ≤ 1,
● we have that
  |Vw*(πE) − Vw*(π)| = |w*^T μ(πE) − w*^T μ(π)|
                     ≤ ||w*||2 · ||μ(πE) − μ(π)||2
                     ≤ ε.
Courtesy of Pieter Abbeel
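The chain of inequalities is just Cauchy-Schwarz, which a quick numeric check confirms (the vectors below are random, for illustration only):

```python
import numpy as np

# Numeric check of the bound: for any ||w||_2 <= 1,
# |w.mu(pi_E) - w.mu(pi)| <= ||mu(pi_E) - mu(pi)||_2  (Cauchy-Schwarz).
rng = np.random.default_rng(0)
mu_E = rng.normal(size=5)      # made-up feature expectations
mu_pi = rng.normal(size=5)
w = rng.normal(size=5)
w /= np.linalg.norm(w)         # enforce ||w||_2 = 1
gap = abs(w @ mu_E - w @ mu_pi)
bound = np.linalg.norm(mu_E - mu_pi)
assert gap <= bound + 1e-12
```

So matching feature expectations to within ε guarantees near-expert performance under every reward in the class, not just the recovered one.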
IRL step as Support Vector Machine
maximum margin hyperplane separating two sets of points:
μ(πE) on one side, the μ(πi) of previously found policies on the other.

|w*^T μ(πE) − w*^T μ(π)| = |Vw*(πE) − Vw*(π)| = maximal difference between the expert policy's value function and the second-best policy's value function
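For intuition, with a single candidate policy the margin-maximizing w under ||w||2 ≤ 1 has a closed form: the unit vector pointing from μ(π) toward μ(πE). A sketch with made-up feature expectations:

```python
import numpy as np

# Sketch: with one candidate policy, the margin-maximizing w under
# ||w||_2 <= 1 is the unit vector along mu(pi_E) - mu(pi).
mu_E = np.array([1.0, 0.5])
mu_pi = np.array([0.2, 0.4])
w = (mu_E - mu_pi) / np.linalg.norm(mu_E - mu_pi)
margin = w @ mu_E - w @ mu_pi
# Cauchy-Schwarz holds with equality here, so the margin equals the distance:
assert np.isclose(margin, np.linalg.norm(mu_E - mu_pi))
```

With several candidate policies the same idea generalizes to the SVM-style quadratic program sketched on this slide.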
[Figure: successive max-margin iterations in the plane of feature expectations (μ1, μ2). Weight vectors w(1), w(2), w(3) separate μ(πE) from the feature expectations μ(π0), μ(π1), μ(π2) of previously found policies; Uw(π) = w^T μ(π).]
Courtesy of Pieter Abbeel
Gridworld Experiment
● 128 x 128 grid world divided into 64 regions, each of size 16 x 16 (“macrocells”).
● A small number of macrocells have positive rewards.
● For each macrocell, there is one feature Φi(s) indicating whether state s is in macrocell i.
● The algorithm was also run on the subset of features Φi(s) that correspond to non-zero rewards.
Gridworld Results
Performance vs. # trajectories / Distance to expert vs. # iterations
Car Driving Experiment
● No explicit reward function at all!
● Expert demonstrates the proper policy via 2 min. of driving time on a simulator (1200 data points).
● 5 different “driver types” tried.
● Features: which lane the car is in, distance to the closest car in the current lane.
● Algorithm run for 30 iterations, policy hand-picked.
● Movie Time! (Expert left, IRL right)
Demo-1 Nice
Demo-2 Right Lane Nasty
Car Driving Results