
Page 1: Computing near-optimal policies from trajectories by solving a sequence of standard supervised learning problems

Damien Ernst
Hybrid Systems Control Group - Supélec

TU Delft, November 16, 2006

Page 2: From trajectories to optimal policies: control community versus computer science community

The problem of inferring (near) optimal policies from trajectories has been studied by both the control community and the computer science community.

The control community favors a two-stage approach: identification of an analytical model, then derivation of an optimal policy from it.

The computer science community favors approaches which compute optimal policies directly from the trajectories, without relying on the identification of an analytical model.

Learning optimal policies from trajectories is known in the computer science community as reinforcement learning.

Page 3: Reinforcement learning

Classical approach: to infer state-action value functions from trajectories.

Discrete (and not too large) state and action spaces: state-action value functions can be represented in tabular form.

Continuous or large state and action spaces: function approximators need to be used. Also, learning must be done from finite and generally very sparse sets of trajectories.

Parametric function approximators with gradient-descent-like methods: very popular techniques, but they do not generalize well enough to move from academic problems to real-world ones !!!

Page 4: Using supervised learning to solve the generalization problem in reinforcement learning

Problem of generalization over an information space. Occurs also in supervised learning (SL).

What is SL ? To infer, from a sample of input-output pairs (input = information state; output = class label or real number), a model which explains "at best" these input-output pairs.

Supervised learning is highly successful: state-of-the-art SL algorithms have been successfully applied to problems where the information state has thousands of components.

What we want: to use the generalization capabilities of supervised learning in reinforcement learning.

Our answer: an algorithm named fitted Q iteration which infers (near) optimal policies from trajectories by solving a sequence of standard supervised learning problems.

Page 5: Learning from a sample of trajectories: the RL approach

Problem formulation (deterministic version)

Discrete-time dynamics: x_{t+1} = f(x_t, u_t), t = 0, 1, ..., where x_t ∈ X and u_t ∈ U.

Cost function: c(x, u) : X × U → R, with c(x, u) bounded by B_c. Instantaneous cost: c_t = c(x_t, u_t).

Discounted infinite-horizon cost associated to a stationary policy µ : X → U: J^µ(x) = lim_{N→∞} Σ_{t=0}^{N−1} γ^t c(x_t, µ(x_t)), where γ ∈ [0, 1[.

Optimal stationary policy µ*: policy that minimizes J^µ for all x.

Objective: find an optimal policy µ*.

We do not know: the discrete-time dynamics and the cost function.

We know instead: a set of trajectories {(x_0, u_0, c_0, x_1, ..., u_{T−1}, c_{T−1}, x_T)_i}_{i=1}^{nbTraj}.

Page 6: Some dynamic programming results

The sequence of state-action value functions Q_N : X × U → R defined by

Q_N(x, u) = c(x, u) + γ min_{u′∈U} Q_{N−1}(f(x, u), u′), ∀N > 1,

with Q_1(x, u) ≡ c(x, u), converges to the Q-function, unique solution of the Bellman equation:

Q(x, u) = c(x, u) + γ min_{u′∈U} Q(f(x, u), u′).

Necessary and sufficient optimality condition:

µ*(x) ∈ arg min_{u∈U} Q(x, u)

Suboptimal stationary policy µ*_N:

µ*_N(x) ∈ arg min_{u∈U} Q_N(x, u).

Bound on µ*_N:

J^{µ*_N} − J^{µ*} ≤ 2 γ^N B_c / (1 − γ)^2.
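As a minimal illustration of this recursion, here is a tabular sketch in Python for a tiny, hypothetical deterministic MDP (the transition and cost tables are made up for the example):

```python
import numpy as np

# Tiny deterministic MDP (hypothetical tables, for illustration only).
gamma = 0.9
f = np.array([[1, 2], [2, 0], [0, 1]])              # f[x, u]: next state
c = np.array([[1.0, 0.5], [0.2, 1.0], [0.0, 0.3]])  # c[x, u]: instantaneous cost

Q = c.copy()                            # Q_1(x, u) = c(x, u)
for _ in range(100):                    # Q_N converges to the Bellman fixed point
    # Q_N(x, u) = c(x, u) + gamma * min_u' Q_{N-1}(f(x, u), u')
    Q = c + gamma * Q[f].min(axis=2)

mu_star = Q.argmin(axis=1)              # mu*(x) in argmin_u Q(x, u)
print(Q, mu_star)
```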

Page 7: Fitted Q iteration: the algorithm

The set of trajectories {(x_0, u_0, c_0, x_1, ..., u_{T−1}, c_{T−1}, x_T)_i}_{i=1}^{nbTraj} is transformed into a set of system transitions F = {(x_t^l, u_t^l, c_t^l, x_{t+1}^l)}_{l=1}^{#F}.
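As a small sketch of this transformation (the data layout is an assumption, not taken from the slides), each trajectory is represented here as a (states, actions, costs) triple:

```python
def trajectories_to_transitions(trajectories):
    """Build F = [(x_t, u_t, c_t, x_t+1), ...] from trajectories.

    Each trajectory is assumed to be a (states, actions, costs) triple with
    len(states) == len(actions) + 1 == len(costs) + 1.
    """
    F = []
    for states, actions, costs in trajectories:
        for t in range(len(actions)):
            F.append((states[t], actions[t], costs[t], states[t + 1]))
    return F
```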

Fitted Q iteration computes from F the functions Q̂_1, Q̂_2, ..., Q̂_N, approximations of Q_1, Q_2, ..., Q_N.

The computation is done iteratively by solving a sequence of standard supervised learning (SL) problems. The training sample for the kth (k ≥ 1) problem is

{((x_t^l, u_t^l), c_t^l + γ min_{u∈U} Q̂_{k−1}(x_{t+1}^l, u))}_{l=1}^{#F}

with Q̂_0(x, u) ≡ 0. From the kth training sample, the supervised learning algorithm outputs Q̂_k.

µ̂*_N(x) ∈ arg min_{u∈U} Q̂_N(x, u) is taken as an approximation of µ*(x).
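A minimal sketch of the iteration, assuming a finite set of scalar actions, transitions F given as tuples (x, u, c, x_next), and scikit-learn's ExtraTreesRegressor as the SL method (in line with the tree-based ensembles mentioned in the remarks); the names are illustrative, not the authors' implementation:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(F, n_iterations, actions, gamma=0.95):
    """Sketch of fitted Q iteration on transitions F = [(x, u, c, x_next), ...]."""
    X = np.array([np.append(x, u) for x, u, _, _ in F])        # SL inputs: (x, u) pairs
    costs = np.array([c for _, _, c, _ in F])                  # SL outputs at iteration 1
    next_states = np.array([x_next for _, _, _, x_next in F])

    q_hat = None                                               # Q_hat_0(x, u) = 0
    for _ in range(n_iterations):
        if q_hat is None:
            targets = costs                                    # Q_hat_1 fitted on c only
        else:
            # c + gamma * min_u' Q_hat_{k-1}(x_next, u') for every transition
            q_next = np.column_stack([
                q_hat.predict(np.column_stack([next_states, np.full(len(F), a)]))
                for a in actions])
            targets = costs + gamma * q_next.min(axis=1)
        q_hat = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q_hat

def mu_hat(q_hat, x, actions):
    """Greedy policy: mu_hat_N(x) in argmin_u Q_hat_N(x, u)."""
    values = [q_hat.predict(np.append(x, a).reshape(1, -1))[0] for a in actions]
    return actions[int(np.argmin(values))]
```

With F built from the trajectories and a list of admissible actions, mu_hat(q_hat, x, actions) then plays the role of µ̂*_N.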

Page 8: Fitted Q iteration: some remarks

The performance of the algorithm depends on the supervised learning (SL) method chosen.

Excellent performance has been observed when the algorithm is combined with supervised learning methods based on ensembles of regression trees.

It also works for stochastic systems.

Consistency can be ensured under appropriate assumptions on the SL method, the sampling process, the system dynamics and the cost function.

Page 9: Illustration I: Bicycle balancing and riding

Figure: schematic of the bicycle, showing the roll angle ω, the handlebar angle θ, the frame of the bike, the back wheel-ground contact point (x_b, y_b), and the angle ψ_goal between the bicycle and the center of the goal (coordinates (x_goal, y_goal)).

I Stochastic problem
I Seven state variables and two control actions
I Time between t and t + 1 = 0.01 s
I Cost function of the type (sketched below):

c(x_t, u_t) = 1 if the bicycle has fallen, and coeff * (|ψ_{goal,t+1}| − |ψ_{goal,t}|) otherwise
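For concreteness, a one-line sketch of a cost of this type (the coefficient is left as a free parameter; its value is not specified on this slide):

```python
def bicycle_cost(fallen, psi_goal_t, psi_goal_next, coeff):
    """Cost of the type above: 1 if the bicycle has fallen,
    otherwise coeff * (|psi_goal_{t+1}| - |psi_goal_t|)."""
    return 1.0 if fallen else coeff * (abs(psi_goal_next) - abs(psi_goal_t))
```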

Page 10: Illustration I: Results

Trajectory generation: actions taken at random, the bicycle starts far from the goal, and a trajectory ends when the bicycle falls.

Figure: (left) trajectories of the bicycle in the (x_b, y_b) plane towards the goal for nine initial angles ψ_0 ∈ {−π, −3π/4, −π/2, −π/4, 0, π/4, π/2, 3π/4, π}; (right) success percentage as a function of the number of trajectories used for learning (100 to 900).

For comparison, Q-learning with neural networks needed about 100,000 trajectories to compute a policy that drives the bicycle to the goal !!!

Page 11: Illustration II: Navigation from visual percepts

Figure: navigation domain. The agent position p_t lies in [0, 100] × [0, 100] (coordinates p(0) and p(1)); the action "go up", for instance, yields p_{t+1}(1) = min(p_t(1) + 25, 100). The cost is c(p, u) = 0, −1 or −2 depending on the region reached. The agent does not observe p_t directly but a 30 × 30 grey-level observation image of its surroundings: pixels(p_t) is the 30*30-element vector whose (30*i + j)th element is the grey level (∈ {0, 1, ..., 255}) of the pixel located at the ith line and the jth column of the observation image when the agent is in position p_t.
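The indexing convention for pixels(p_t) is simply a row-major flattening of the 30 × 30 observation image; a small sketch, with a random image standing in for the real observation:

```python
import numpy as np

observation = np.random.randint(0, 256, size=(30, 30))  # hypothetical 30x30 grey-level image
pixels_pt = observation.reshape(-1)                      # element 30*i + j = pixel at line i, column j
assert pixels_pt[30 * 4 + 7] == observation[4, 7]        # check the indexing convention
```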

Page 12: Illustration II: Results

Figure: (a) policy µ̂*_10 computed from 500 system transitions; (b) from 2000 system transitions; (c) from 8000 system transitions, all shown in the (p(0), p(1)) plane; (d) score J^{µ̂*_10} versus the number of system transitions #F (1000 to 9000), with p used as state input and with pixels(p) used as state input, compared to the optimal score J^{µ*}.

Page 13: Illustration III: Computation of Structured Treatment Interruption (STI) for HIV+ patients

I STI for HIV: to cycle the patient on and off drug therapy

I In some remarkable cases, STI strategies have enabled the patients to maintain immune control over the virus in the absence of treatment

I STIs offer patients periods of relief from treatment

I We want to compute optimal STI strategies.

Page 14: Illustration III: What can reinforcement learning techniques offer ?

I Have the potential to infer good STI strategies from clinical data, without modeling the HIV infection dynamics.

I Clinical data: time evolution of the patient’s state (CD4+ T cell count, systemic costs of the drugs, etc.) recorded at discrete time instants, together with the sequence of drugs administered.

I Clinical data can be seen as trajectories of the immune system responding to treatment.

Page 15: Illustration III: the overall principle

Figure: Determination of optimal STI strategies from clinical data by using reinforcement learning algorithms: the overall principle. A pool of HIV-infected patients follow some (possibly suboptimal) STI protocols and are monitored at regular intervals. The monitoring of each patient generates a trajectory for the optimal STI problem, which typically contains the following information: state of the patient at time t_0; drugs taken by the patient between t_0 and t_1 = t_0 + n days; state of the patient at time t_1; drugs taken between t_1 and t_2 = t_1 + n days; state of the patient at time t_2; drugs taken between t_2 and t_3 = t_2 + n days; and so on. The trajectories are processed by using reinforcement learning techniques. Processing of the trajectories gives some (near) optimal STI strategies, often under the form of a mapping between the state of the patient at a given time and the drugs he has to take till the next time his state is monitored.
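A small sketch of how one monitored patient could be turned into a trajectory usable by fitted Q iteration (the field names and the cost function are placeholders, not taken from the slides):

```python
def patient_to_transitions(states, drugs, cost):
    """states[k]: patient state at monitoring time t_k; drugs[k]: drugs taken
    between t_k and t_{k+1} = t_k + n days; cost: user-supplied cost function."""
    return [(states[k], drugs[k], cost(states[k], drugs[k]), states[k + 1])
            for k in range(len(drugs))]
```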

Page 16: Illustration III: state of the research

I Promising results have been obtained by using “fitted Q iteration” to analyze some artificially generated clinical data.

I Next step: to analyze real-life clinical data generated by the SMART study (Smart Management of Anti-Retroviral Therapies).

Figure: Taken from http://www.cpcra.org/docs/pubs/2006/croi2006-smart.pdf

Page 17: Conclusions

I The fitted Q iteration algorithm (combined with ensembles of regression trees) has been evaluated on several problems and consistently performed much better than other reinforcement learning algorithms.

I Is its performance sufficient to lead to many successful real-life applications ? I think YES !!!

I Why has this algorithm not been proposed before ?

Page 18: Fitted Q iteration: future research

I Computational burdens grow with the number of trajectories ⇒ problems with on-line applications. Possible solution: keep only the most informative trajectories.

I What are really the best supervised learning algorithms to use in the inner loop of the fitted Q iteration process ? Several criteria need to be considered: distribution of the data, computational burden, numerical stability of the algorithm, etc.

I Customization of SL methods (e.g. split criteria in trees other than variance reduction, etc.)

Page 19: Supervised learning in dynamic programming: general view

I Dynamic programming: resolution of optimal control problems by iteratively extending the optimization horizon.

I Two main algorithms: value iteration and policy iteration.

I Fitted Q iteration: based on the value iteration algorithm. Fitted Q iteration can be extended to the case where the dynamics and cost function are known, to become an Approximate Value Iteration (AVI) algorithm.

I Approximate Policy Iteration (API) algorithms based on SL have recently been proposed. They can also be adapted to the case where only trajectories are available.

I For both SL-based AVI and API, the problem of generating the right trajectories is extremely important.

I How to put all these works into a unified framework ?

Page 20: Beyond dynamic programming...

I Standard Model Predictive Control (MPC) formulation: solve, in a receding-time manner, a sequence of open-loop deterministic optimal control problems by relying on some classical optimization algorithms (sequential quadratic programming, interior point methods, etc.)

I Could SL-based dynamic programming be a good optimizer for MPC ? It has at least the advantage of not being intrinsically suboptimal when the system is stochastic !!! But the optimization over u ∈ U of the learned model becomes problematic when dealing with a large U.

I How to extend MPC-like formulations to stochastic systems ? Solutions have been proposed for problems with discrete disturbance spaces, but the dimension of the search space for the optimization algorithm is O((number of disturbances)^(optimization horizon)); for example, 3 possible disturbances over a horizon of 10 steps already give 3^10 = 59,049 scenarios. Research direction: selection of a subset of relevant disturbances.

Page 21: Planning under uncertainties: an example

Development of MPC-like algorithms for scheduling production under uncertainties through the selection of “interesting disturbance scenarios”.

Figure: A typical hydro power plant production scheduling problem. The diagram shows the clouds (disturbance w) feeding the reservoir (state x), the hydraulic production (quantity produced u) complemented by thermal production to serve the customers, and a cost depending on demand − u.

Page 22: Additional readings

“Tree-based batch mode reinforcement learning”. D. Ernst, P. Geurts and L. Wehenkel. In Journal of Machine Learning Research, April 2005, Volume 6, pages 503-556.

“Selecting concise sets of samples for a reinforcement learning agent”. D. Ernst. In Proceedings of CIRAS 2005, 14-16 December 2005, Singapore. (6 pages)

“Clinical data based optimal STI strategies for HIV: a reinforcement learning approach”. D. Ernst, G.B. Stan, J. Goncalves and L. Wehenkel. In Proceedings of Benelearn 2006, 11-12 May 2006, Ghent, Belgium. (8 pages)

“Reinforcement learning with raw image pixels as state input”. D. Ernst, R. Maree and L. Wehenkel. International Workshop on Intelligent Computing in Pattern Analysis/Synthesis (IWICPAS). Proceedings series: LNCS, Volume 4153, pages 446-454, August 2006.

“Reinforcement learning versus model predictive control: a comparison”. D. Ernst, M. Glavic, F. Capitanescu and L. Wehenkel. Submitted.

“Planning under uncertainties by solving non-linear optimization problems: a complexity analysis”. D. Ernst, B. Defourny and L. Wehenkel. In preparation.
