Model-free Monte Carlo-like Policy Evaluation
University of Liège – University of Michigan
Model-free Monte Carlo-like Policy Evaluation
Raphaël Fonteneau, Susan Murphy, Louis Wehenkel, Damien Ernst
Journées MAS, Bordeaux, France
(August 31st – September 3rd 2010)
The material of this talk is based on a presentation given by Raphaël Fonteneau at Cap 2010.
Introduction
● Discrete-time stochastic optimal control problems arise in many fields (finance, medicine, engineering, ...)
● Many techniques for solving such problems use an oracle that evaluates the performance of any given policy in order to determine a (near-)optimal control policy
● When the system is accessible to experimentation, such an oracle can be based on a Monte Carlo (MC) approach
● In this paper, the only information is contained in a sample of one-step transitions of the system
● In this context, we propose a Model-Free Monte Carlo (MFMC) estimator of the performance of a given policy that mimics, in some way, the Monte Carlo estimator.
Problem statement
● We consider a discrete-time system whose dynamics over T stages is given by
x_{t+1} = f(x_t, u_t, w_t),  t = 0, ..., T−1
● All x_t lie in a normed state space X, all u_t lie in a normed action space U, and the w_t are i.i.d. according to a probability distribution p_W(·)
● An instantaneous reward r_t = ρ(x_t, u_t, w_t) is associated with taking action u_t while being in state x_t
● A policy h: {0, ..., T−1} × X → U is given, and we want to evaluate its performance.
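As a concrete sketch, the cumulated return of a single trajectory under h can be computed as follows. The functions f, rho, h and draw_w here are illustrative placeholders (in the model-free setting of this talk, f, ρ and p_W are of course unknown):

```python
def cumulated_return(f, rho, h, x0, T, draw_w, rng):
    """Simulate one trajectory of length T under policy h and
    return the sum of the T instantaneous rewards."""
    x, ret = x0, 0.0
    for t in range(T):
        u = h(t, x)          # action prescribed by the policy
        w = draw_w(rng)      # disturbance w_t ~ p_W(.)
        ret += rho(x, u, w)  # instantaneous reward r_t
        x = f(x, u, w)       # next state x_{t+1}
    return ret
```
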
● The expected return of the policy h when starting from an initial state x_0 is given by
J^h(x_0) = E_{w_0, ..., w_{T−1}} [ Σ_{t=0}^{T−1} ρ(x_t, h(t, x_t), w_t) ]
where x_{t+1} = f(x_t, h(t, x_t), w_t)
[Figure: one trajectory x_0 → x_1 → ... → x_T driven by the disturbances w_0, ..., w_{T−1}, collecting the rewards r_0, ..., r_{T−1}]
● Problem: the functions f, ρ and the distribution p_W(·) are unknown
● They are replaced by a sample of n one-step system transitions
F_n = {(x^l, u^l, r^l, y^l)}_{l=1}^n
where the pairs (x^l, u^l) are arbitrarily chosen, and the pairs (r^l, y^l) are determined by r^l = ρ(x^l, u^l, w^l) and y^l = f(x^l, u^l, w^l), where w^l is drawn according to p_W(·)
● How to evaluate J^h(x_0) in this context?
The Monte Carlo estimator
● We define the Monte Carlo estimator of the expected return of h when starting from the initial state x_0:
M^h_p(x_0) = (1/p) Σ_{i=1}^p Σ_{t=0}^{T−1} r_t^i
with r_t^i the rewards collected along p trajectories simulated independently with the policy h
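A minimal sketch of the MC estimator on a hypothetical toy system (the dynamics, reward and policy below are invented for illustration; note that MC estimation requires simulator access to f and ρ):

```python
import numpy as np

def mc_estimator(rollout, x0, p, rng):
    """Monte Carlo estimate: average the cumulated returns of p
    independent trajectories simulated with the policy."""
    return float(np.mean([rollout(x0, rng) for _ in range(p)]))

# Hypothetical toy system: x_{t+1} = x_t + u_t + w_t,
# reward rho = -x_t^2, policy u_t = -x_t, horizon T = 5.
def toy_rollout(x0, rng, T=5):
    x, ret = x0, 0.0
    for t in range(T):
        u = -x                      # policy h(t, x)
        w = rng.normal(0.0, 0.1)    # disturbance w_t ~ p_W(.)
        ret += -x**2                # instantaneous reward
        x = x + u + w               # next state
    return ret

rng = np.random.default_rng(0)
estimate = mc_estimator(toy_rollout, x0=1.0, p=1000, rng=rng)
```
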
The Monte Carlo estimator
[Figure: p independent trajectories simulated with h from x_0, each driven by i.i.d. disturbances w_t^i ~ p_W(·) and yielding rewards r_t^i; the MC estimate averages the p cumulated returns: (1/p) Σ_{i=1}^p Σ_{t=0}^{T−1} r_t^i]
● We assume that the random variable R^h(x_0) (the cumulated return of one trajectory) admits a finite variance σ²_{R^h(x_0)}
● The bias and variance of the Monte Carlo estimator are
E[M^h_p(x_0)] − J^h(x_0) = 0,   Var[M^h_p(x_0)] = σ²_{R^h(x_0)} / p
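A quick numerical sanity check of the 1/p variance scaling, treating the return R^h(x_0) as a toy Gaussian random variable (purely illustrative; the distribution and its parameters are invented):

```python
import numpy as np

# Treat R^h(x_0) as a toy random variable R ~ N(0, 2^2), so that
# sigma^2 = 4. The variance of the p-sample MC average should then
# be close to sigma^2 / p.
rng = np.random.default_rng(0)
sigma2 = 4.0
p = 50
estimates = [rng.normal(0.0, 2.0, size=p).mean() for _ in range(20000)]
empirical_var = float(np.var(estimates))
# empirical_var should be close to sigma2 / p = 0.08
```
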
The Model-free Monte Carlo estimator
● Here, the MC approach is not feasible, since the system is unknown
● We introduce the Model-Free Monte Carlo (MFMC) estimator
● From the sample of transitions, we build p sequences of distinct transitions of length T, called ``broken trajectories''
● These broken trajectories are built so as to minimize the discrepancy (using a distance metric ∆) with a classical MC sample that could be obtained by simulating the system with the policy h
● We average the cumulated returns over the p broken trajectories to compute an estimate of the expected return of h
● The algorithm has complexity O(npT).
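The broken-trajectory construction can be sketched as follows, for a one-dimensional toy setting. The flat tuple layout (x_l, u_l, r_l, y_l), the greedy nearest-neighbor selection, and the L1 distance standing in for ∆ are simplifying assumptions of this sketch:

```python
import numpy as np

def mfmc_estimator(sample, h, x0, T, p):
    """Model-Free Monte Carlo estimator (sketch).

    sample: list of one-step transitions (x_l, u_l, r_l, y_l), scalars
    here for a 1-D toy setting. Builds p 'broken trajectories' of
    length T by greedily picking, at each step, the not-yet-used
    transition closest (w.r.t. a distance Delta, here L1 on (x, u))
    to the current state-action pair, then averages the p cumulated
    returns of the broken trajectories.
    """
    used = set()
    returns = []
    for _ in range(p):
        x, ret = x0, 0.0
        for t in range(T):
            u = h(t, x)
            # nearest unused transition w.r.t. Delta((x,u), (x_l,u_l))
            best = min(
                (l for l in range(len(sample)) if l not in used),
                key=lambda l: abs(sample[l][0] - x) + abs(sample[l][1] - u),
            )
            used.add(best)           # each transition is used at most once
            ret += sample[best][2]   # collect its reward r_l
            x = sample[best][3]      # jump to its successor state y_l
        returns.append(ret)
    return float(np.mean(returns))
```

On a dense sample of a deterministic system, the greedy matching simply reassembles the true trajectory, so the estimate coincides with the exact return.
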
MFMC estimator: analysis
● The only information available on the system is gathered in a sample of n one-step transitions F_n = {(x^l, u^l, r^l, y^l)}_{l=1}^n
● We define the random set F̃_n as follows: the set of pairs (x^l, u^l) is arbitrarily chosen, whereas the pairs (r^l, y^l) are determined by r^l = ρ(x^l, u^l, w^l) and y^l = f(x^l, u^l, w^l), where w^l is drawn according to p_W(·)
● F_n is a realization of the random set F̃_n.
● A distance metric ∆ on X × U is chosen
● k-sparsity: based on the distance of a pair (x, u) to its k-th nearest neighbor (using the distance ∆) in the sample F_n
[Figure: the k-sparsity can be seen as the smallest radius γ such that all ∆-balls in X × U of radius γ contain at least k elements from F_n]
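A sketch of the k-th nearest-neighbor distance underlying the k-sparsity, again using an L1 metric as a placeholder for ∆:

```python
def kth_nn_distance(sample_xu, x, u, k):
    """Distance of (x, u) to its k-th nearest neighbour among the
    state-action pairs in the sample, using
    Delta((x,u),(x',u')) = |x - x'| + |u - u'| as a placeholder metric."""
    dists = sorted(abs(xl - x) + abs(ul - u) for xl, ul in sample_xu)
    return dists[k - 1]
```

The k-sparsity is then the largest such distance over all (x, u) in X × U; intuitively, it shrinks as the sample covers the state-action space more densely.
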
Illustration
● Simulations for p = 10, n = 100 … 10 000, uniform grid
[Figure: estimated return as a function of n, for the Monte Carlo estimator and the Model-free Monte Carlo estimator]
● Simulations for p = 1 … 100, n = 10 000, uniform grid
[Figure: estimated return as a function of p, for the Monte Carlo estimator and the Model-free Monte Carlo estimator]
Conclusions and Future work
Conclusions
● We have proposed in this paper an estimator of the expected return of a policy in a model-free setting, the MFMC estimator
● We have provided bounds on the bias and variance of the MFMC estimator
● The bias and variance of the MFMC estimator converge to the bias and variance of the MC estimator
Future work
● MFMC estimator in a direct policy search framework
● One could extend this approach to evaluate return distributions (and not only expected values). This could make it possible to develop ``safe'' policy search techniques based on Value at Risk (VaR) criteria.