Computing near-optimal policies from trajectories by solving a sequence of standard supervised learning problems
Damien Ernst
Hybrid Systems Control Group - Supelec
TU Delft, November 16, 2006
From trajectories to optimal policies: control community versus computer science community
The problem of inferring (near) optimal policies from trajectories has been studied by the control community and the computer science community.

The control community favors a two-stage approach: identification of an analytical model and derivation from it of an optimal policy.

The computer science community favors approaches which compute optimal policies directly from the trajectories, without relying on the identification of an analytical model.

Learning optimal policies from trajectories is known in the computer science community as reinforcement learning.
Reinforcement learning
Classical approach: to infer from trajectories state-action value functions.

Discrete (and not too large) state and action spaces: state-action value functions can be represented in tabular form.

Continuous or large state and action spaces: function approximators need to be used. Also, learning must be done from finite and generally very sparse sets of trajectories.

Parametric function approximators with gradient-descent-like methods: very popular techniques, but they do not generalize well enough to move from the academic to the real world !!!
Using supervised learning to solve the generalization problem in reinforcement learning
Problem of generalization over an information space. Occurs also in supervised learning (SL).

What is SL? To infer from a sample of input-output pairs (input = information state; output = class label or real number) a model which explains "at best" these input-output pairs.

Supervised learning is highly successful: state-of-the-art SL algorithms have been successfully applied to problems where the information state has thousands of components.

What we want: to use the generalization capabilities of supervised learning in reinforcement learning.

Our answer: an algorithm named fitted Q iteration which infers (near) optimal policies from trajectories by solving a sequence of standard supervised learning problems.
Learning from a sample of trajectories: the RL approach
Problem formulation (deterministic version)
Discrete-time dynamics: $x_{t+1} = f(x_t, u_t)$, $t = 0, 1, \ldots$, where $x_t \in X$ and $u_t \in U$.

Cost function: $c(x, u) : X \times U \to \mathbb{R}$, with $c(x, u)$ bounded by $B_c$. Instantaneous cost: $c_t = c(x_t, u_t)$.

Discounted infinite-horizon cost associated to a stationary policy $\mu : X \to U$: $J^{\mu}(x) = \lim_{N \to \infty} \sum_{t=0}^{N-1} \gamma^t c(x_t, \mu(x_t))$, where $\gamma \in [0, 1[$.

Optimal stationary policy $\mu^*$: policy that minimizes $J^{\mu}$ for all $x$.

Objective: find an optimal policy $\mu^*$.

We do not know: the discrete-time dynamics and the cost function.

We know instead: a set of trajectories $\{(x_0, u_0, c_0, x_1, \ldots, u_{T-1}, c_{T-1}, x_T)^i\}_{i=1}^{nbTraj}$.
Some dynamic programming results
The sequence of state-action value functions $Q_N : X \times U \to \mathbb{R}$,
$$Q_N(x, u) = c(x, u) + \gamma \min_{u' \in U} Q_{N-1}(f(x, u), u'), \quad \forall N > 1,$$
with $Q_1(x, u) \equiv c(x, u)$, converges to the Q-function, unique solution of the Bellman equation:
$$Q(x, u) = c(x, u) + \gamma \min_{u' \in U} Q(f(x, u), u').$$

Necessary and sufficient optimality condition:
$$\mu^*(x) \in \arg\min_{u \in U} Q(x, u)$$

Suboptimal stationary policy $\mu^*_N$:
$$\mu^*_N(x) \in \arg\min_{u \in U} Q_N(x, u).$$

Bound on $\mu^*_N$:
$$\|J^{\mu^*_N} - J^{\mu^*}\|_{\infty} \leq \frac{2 \gamma^N B_c}{(1 - \gamma)^2}.$$
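To make the recursion concrete, here is a minimal sketch (not from the talk) that computes the $Q_N$ iterates exactly on a toy deterministic MDP; the transition table, cost table and all variable names below are invented for illustration.

```python
import numpy as np

# Toy deterministic MDP: 3 states, 2 actions (all values illustrative).
gamma = 0.9
f_table = np.array([[1, 2], [2, 0], [0, 1]])              # f(x, u): next state
c_table = np.array([[1.0, 0.5], [0.2, 1.0], [0.0, 0.3]])  # c(x, u)

Q = c_table.copy()                       # Q_1(x, u) = c(x, u)
for N in range(2, 100):
    # Q_N(x, u) = c(x, u) + gamma * min_{u'} Q_{N-1}(f(x, u), u')
    Q = c_table + gamma * Q[f_table].min(axis=2)

mu_N = Q.argmin(axis=1)                  # mu*_N(x) in argmin_u Q_N(x, u)
print(Q)
print(mu_N)
```

For $N$ large enough, the greedy policy extracted from $Q_N$ stabilizes, in line with the $2\gamma^N B_c / (1-\gamma)^2$ bound above.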
Fitted Q iteration: the algorithm
The set of trajectories $\{(x_0, u_0, c_0, x_1, \ldots, u_{T-1}, c_{T-1}, x_T)^i\}_{i=1}^{nbTraj}$ is transformed into a set of system transitions $F = \{(x_t^l, u_t^l, c_t^l, x_{t+1}^l)\}_{l=1}^{\#F}$.

Fitted Q iteration computes from $F$ the functions $\hat{Q}_1, \hat{Q}_2, \ldots, \hat{Q}_N$, approximations of $Q_1, Q_2, \ldots, Q_N$.

The computation is done iteratively by solving a sequence of standard supervised learning (SL) problems. The training sample for the $k$th ($k \geq 1$) problem is
$$\left\{ \left( (x_t^l, u_t^l),\; c_t^l + \gamma \min_{u \in U} \hat{Q}_{k-1}(x_{t+1}^l, u) \right) \right\}_{l=1}^{\#F}$$
with $\hat{Q}_0(x, u) \equiv 0$. From the $k$th training sample, the supervised learning algorithm outputs $\hat{Q}_k$.

$\hat{\mu}^*_N(x) \in \arg\min_{u \in U} \hat{Q}_N(x, u)$ is taken as an approximation of $\mu^*(x)$.
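The following is a minimal sketch of the algorithm, assuming a finite action set, states given as 1-D arrays, and scikit-learn's ExtraTreesRegressor as a modern stand-in for the ensembles of regression trees used in the original work; all names (fitted_q_iteration, F, actions) are illustrative.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(F, actions, N, gamma=0.95):
    # F: list of system transitions (x, u, c, x_next), with x a 1-D array.
    X = np.array([np.append(x, u) for x, u, ci, xn in F])   # SL inputs (x, u)
    c = np.array([ci for x, u, ci, xn in F])
    x_next = np.array([xn for x, u, ci, xn in F])

    model = None                                            # Q_0(x, u) = 0
    for k in range(1, N + 1):
        if model is None:
            y = c                                           # targets: c + gamma * 0
        else:
            # min_u Q_{k-1}(x_next, u), evaluated one action at a time
            q_next = np.column_stack([
                model.predict(np.column_stack([x_next, np.full(len(F), a)]))
                for a in actions])
            y = c + gamma * q_next.min(axis=1)
        model = ExtraTreesRegressor(n_estimators=50).fit(X, y)  # outputs Q_k
    return model
```

The greedy policy is then read off the final model: $\hat{\mu}^*_N(x)$ is the action $a$ minimizing model.predict on the pair $(x, a)$.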
Fitted Q iteration: some remarks
The performance of the algorithm depends on the supervised learning (SL) method chosen.

Excellent performance has been observed when the algorithm is combined with supervised learning methods based on ensembles of regression trees.

Works also for stochastic systems.

Consistency can be ensured under appropriate assumptions on the SL method, the sampling process, the system dynamics and the cost function.
Illustration I: Bicycle balancing and riding
[Figure: schematic of the bicycle. Labels: tilt angle ω, handlebar angle θ, applied torque T, center of mass CM, gravity force Mg, centrifugal force F_cen, height h, displacement d + w, contact point of the back wheel with the ground (x_b, y_b), frame of the bike, x-axis, and the angle ψ_goal between the bike and the center of the goal (coord. = (x_goal, y_goal)).]
- Stochastic problem
- Seven state variables and two control actions
- Time between t and t + 1 = 0.01 s
- Cost function of the type:
$$c(x_t, u_t) = \begin{cases} 1 & \text{if the bicycle has fallen} \\ coeff \cdot \left( |\psi_{goal, t+1}| - |\psi_{goal, t}| \right) & \text{otherwise} \end{cases}$$
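As a sketch (illustrative only; the fall test and the default value of coeff below are assumptions, not given on this slide), the cost can be written:

```python
def cost(fallen, psi_goal_t, psi_goal_next, coeff=0.1):
    # Unit cost when the bicycle has fallen; otherwise a shaping term that
    # penalizes any increase of |psi_goal| between t and t+1. The value of
    # coeff (0.1 here) is an assumption for illustration.
    if fallen:
        return 1.0
    return coeff * (abs(psi_goal_next) - abs(psi_goal_t))
```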
Illustration I: Results
Trajectory generation: actions taken at random, bicycle initially far from the goal; a trajectory ends when the bicycle falls.
[Figure: left, trajectories of the bicycle in the (x_b, y_b) plane (x_b from 0 to 1000, y_b from −150 to 100) for initial angles ψ_0 ∈ {−π, −3π/4, −π/2, −π/4, 0, π/4, π/2, 3π/4, π}, with the goal marked; right, success percentage (from zero to one hundred) versus the number of trajectories (100 to 900).]
Q-learning with neural networks: on the order of 100,000 trajectories needed to compute a policy that drives the bicycle to the goal !!!!
Illustration II: Navigation from visual percepts
[Figure: navigation problem on a 100 × 100 grid, with the agent in position p_t = (p(0), p(1)). The possible actions move the agent by 25 units, e.g. u_t = go up gives p_{t+1}(1) = min(p_t(1) + 25, 100); the costs are c(p, u) = 0, c(p, u) = −1 or c(p, u) = −2 depending on the region of the grid. The agent does not observe p_t directly: the observation image is a 30 × 30 pixel patch with grey level ∈ {0, 1, ..., 255}, and pixels(p_t) is the 30 * 30 = 900 element vector such that the grey level of the pixel located at the ith line and the jth column of the observation image is the (30 * i + j)th element of this vector.]
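For instance, the flattening convention just described can be written as follows (a sketch; the function name and array input are assumptions for illustration):

```python
import numpy as np

def pixels(obs_image):
    # obs_image: 30 x 30 array of grey levels in {0, ..., 255}; element
    # 30*i + j of the result is the pixel at line i, column j (row-major).
    obs = np.asarray(obs_image, dtype=np.uint8)
    assert obs.shape == (30, 30)
    return obs.reshape(900)
```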
[Figure: policies computed by fitted Q iteration on the navigation problem. Panels (a), (b), (c): $\hat{\mu}^*_{10}$ computed from 500, 2000 and 8000 system transitions, shown on the 100 × 100 grid. Panel (d): score $J^{\hat{\mu}^*_{10}}$ (from 0.4 to 1.0) versus the number of system transitions $\#F$ (1000 to 9000), with p used as state input and with pixels(p) used as state input, compared to the optimal score $J^{\mu^*}$.]
Illustration III: Computation of Structured Treatment Interruption (STI) for HIV+ patients
- STI for HIV: to cycle the patient on and off drug therapy
- In some remarkable cases, STI strategies have enabled the patients to maintain immune control over the virus in the absence of treatment
- STIs offer patients periods of relief from treatment
- We want to compute optimal STI strategies.
Illustration III: What can reinforcement learning techniques offer?
- Have the potential to infer from clinical data good STI strategies, without modeling the HIV infection dynamics.
- Clinical data: time evolution of the patient's state (CD4+ T cell count, systemic costs of the drugs, etc.) recorded at discrete-time instants, and the sequence of drugs administered.
- Clinical data can be seen as trajectories of the immune system responding to treatment.
[Figure: A pool of HIV-infected patients follow some (possibly suboptimal) STI protocols and are monitored at regular intervals. The monitoring of each patient generates a trajectory for the optimal STI problem, which typically contains the following information: state of the patient at time t_0; drugs taken by the patient between t_0 and t_1 = t_0 + n days; state of the patient at time t_1; drugs taken by the patient between t_1 and t_2 = t_1 + n days; state of the patient at time t_2; drugs taken by the patient between t_2 and t_3 = t_2 + n days; and so on. The trajectories are processed by using reinforcement learning techniques. Processing of the trajectories gives some (near) optimal STI strategies, often under the form of a mapping between the state of the patient at a given time and the drugs he has to take till the next time his state is monitored.]

Figure: Determination of optimal STI strategies from clinical data by using reinforcement learning algorithms: the overall principle.
Illustration III: state of the research
- Promising results have been obtained by using "fitted Q iteration" to analyze some artificially generated clinical data.
- Next step: to analyze real-life clinical data generated by the SMART study (Smart Management of Anti-Retroviral Therapies).
Figure: Taken from http://www.cpcra.org/docs/pubs/2006/croi2006-smart.pdf
Conclusions
- The fitted Q iteration algorithm (combined with ensembles of regression trees) has been evaluated on several problems and has consistently performed much better than other reinforcement learning algorithms.
- Are its performances sufficient to lead to many successful real-life applications? I think YES !!!
- Why has this algorithm not been proposed before?
Fitted Q iteration: future research
- Computational burdens grow with the number of trajectories ⇒ problems for on-line applications. Possible solution: to keep only the most informative trajectories.
- What really are the best supervised learning algorithms to use in the inner loop of the fitted Q iteration process? Several criteria need to be considered: distribution of the data, computational burdens, numerical stability of the algorithm, etc.
- Customization of SL methods (e.g. split criteria in trees other than variance reduction, etc.)
Supervised learning in dynamic programming: general view
- Dynamic programming: resolution of optimal control problems by iteratively extending the optimization horizon.
- Two main algorithms: value iteration and policy iteration.
- Fitted Q iteration: based on the value iteration algorithm. Fitted Q iteration can be extended to the case where the dynamics and cost function are known, to become an Approximate Value Iteration (AVI) algorithm.
- Approximate Policy Iteration (API) algorithms based on SL have recently been proposed. They can also be adapted to the case where only trajectories are available.
- For both SL-based AVI and API, the problem of generating the right trajectories is extremely important.
- How to put all these works into a unified framework?
Beyond dynamic programming...
- Standard Model Predictive Control (MPC) formulation: solve in a receding-time manner a sequence of open-loop deterministic optimal control problems by relying on some classical optimization algorithms (sequential quadratic programming, interior point methods, etc.); see the sketch after this list.
- Could SL-based dynamic programming methods be good optimizers for MPC? They have at least the advantage of not being intrinsically suboptimal when the system is stochastic !!! But there is a problem with the computation of $\max_{u \in U} model(x, u)$ when dealing with large U.
- How to extend MPC-like formulations to stochastic systems? Solutions have been proposed for problems with discrete disturbance spaces, but the dimension of the search space for the optimization algorithm is $O(\text{number of disturbances}^{\text{optimization horizon}})$. Research direction: selection of a subset of relevant disturbances.
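A minimal sketch of the receding-horizon loop from the first item, assuming a known model f; solve_open_loop stands in for a classical optimizer (SQP, interior point, ...), and its name and signature are illustrative assumptions:

```python
def mpc_loop(x0, f, solve_open_loop, horizon, n_steps):
    # At each step: solve an open-loop deterministic optimal control problem
    # over a finite horizon, apply only the first planned action, then
    # re-solve from the newly observed state (receding-time manner).
    x, applied = x0, []
    for _ in range(n_steps):
        u_seq = solve_open_loop(x, f, horizon)  # plan u_0, ..., u_{H-1}
        x = f(x, u_seq[0])                      # apply first action only
        applied.append(u_seq[0])
    return applied
```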
Planning under uncertainties: an example
Development of MPC-like algorithms for scheduling production under uncertainties through selection of "interesting disturbance scenarios".
[Figure: A typical hydro power plant production scheduling problem. Labels: quantity produced u, state x, disturbance w (clouds), hydraulic production, thermal production, customers, demand − u, cost.]
Additional readings

"Tree-based batch mode reinforcement learning". D. Ernst, P. Geurts and L. Wehenkel. Journal of Machine Learning Research, Volume 6, pages 503-556, April 2005.

"Selecting concise sets of samples for a reinforcement learning agent". D. Ernst. Proceedings of CIRAS 2005, 14-16 December 2005, Singapore. (6 pages)

"Clinical data based optimal STI strategies for HIV: a reinforcement learning approach". D. Ernst, G.B. Stan, J. Goncalves and L. Wehenkel. Proceedings of Benelearn 2006, 11-12 May 2006, Ghent, Belgium. (8 pages)

"Reinforcement learning with raw image pixels as state input". D. Ernst, R. Maree and L. Wehenkel. International Workshop on Intelligent Computing in Pattern Analysis/Synthesis (IWICPAS), LNCS Volume 4153, pages 446-454, August 2006.

"Reinforcement learning versus model predictive control: a comparison". D. Ernst, M. Glavic, F. Capitanescu and L. Wehenkel. Submitted.

"Planning under uncertainties by solving non-linear optimization problems: a complexity analysis". D. Ernst, B. Defourny and L. Wehenkel. In preparation.