
From Perturbation Analysis to a New Paradigm of Optimization

Xi-Ren Cao, Shanghai Jiao Tong University

The Problem in Optimization

Policy space $\mathcal{D}$: which policy is best?

Policy space too large for exhaustive search (100 states with 2 actions each give $2^{100} \approx 10^{30}$ policies; counting them at 10 GHz would take about $10^{12}$ years).

State space too large: we cannot analyze every policy.
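The size estimate above can be checked with a few lines of arithmetic. This is only a sketch: it assumes one policy is counted per clock cycle at 10 GHz and a 365.25-day year, neither of which is stated on the slide.

```python
# Count deterministic stationary policies: one action choice per state.
num_states = 100
num_actions = 2
num_policies = num_actions ** num_states          # 2**100

# Assumption: one policy is counted per cycle of a 10 GHz clock.
policies_per_second = 10 * 10**9
seconds_per_year = 365.25 * 24 * 3600             # assumed year length

years = num_policies / policies_per_second / seconds_per_year
print(f"{num_policies:.3e} policies, about {years:.1e} years to count")
```

The count is on the order of $10^{30}$ and the time on the order of $10^{12}$ years, matching the slide's orders of magnitude.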

Perturbation Analysis (PA): a gradient-based approach

With special structure, by analyzing one policy we obtain the performance of its neighboring policies, i.e., the performance gradient.

Queueing networks, Markov processes

[Figure: gradient hill climbing from $\theta$ to $\theta + \Delta\theta$]
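As an illustration of "analyzing one policy gives the gradient", the sketch below uses a standard derivative formula for ergodic Markov chains, $d\eta/d\theta = \pi (P' - P) g$ along $P(\theta) = P + \theta (P' - P)$, where $g$ solves the Poisson equation $(I - P)g + \eta e = f$. The formula is not spelled out on this slide, and the matrices, rewards, and helper names are made up for the example. Only quantities of the nominal chain $P$ are used for the derivative, which is then compared with a finite difference.

```python
def vec_mat(v, P):
    """Row vector times matrix."""
    return [sum(v[i] * P[i][j] for i in range(len(v))) for j in range(len(P[0]))]

def mat_vec(P, v):
    """Matrix times column vector."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]

def stationary(P, iters=500):
    """Steady-state distribution by power iteration (ergodic chain assumed)."""
    pi = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        pi = vec_mat(pi, P)
    return pi

def average_and_potential(P, f, iters=500):
    """Return (eta, g) with (I - P) g + eta e = f, via g = sum_k (P^k f - eta e)."""
    pi = stationary(P)
    eta = sum(p * x for p, x in zip(pi, f))
    g = [0.0] * len(P)
    v = list(f)
    for _ in range(iters):
        g = [gi + vi - eta for gi, vi in zip(g, v)]
        v = mat_vec(P, v)
    return eta, g

# Hypothetical two-state example.
P = [[0.5, 0.5], [0.2, 0.8]]
P_prime = [[0.9, 0.1], [0.4, 0.6]]
f = [1.0, 3.0]
Q = [[a - b for a, b in zip(r1, r0)] for r1, r0 in zip(P_prime, P)]

# Derivative from the nominal chain only: pi (P' - P) g.
eta, g = average_and_potential(P, f)
pi = stationary(P)
pa_derivative = sum(p * x for p, x in zip(pi, mat_vec(Q, g)))

# Finite-difference check: re-analyze a slightly perturbed chain.
delta = 1e-6
P_delta = [[p + delta * q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]
eta_delta, _ = average_and_potential(P_delta, f)
finite_difference = (eta_delta - eta) / delta

print(pa_derivative, finite_difference)
```

The two numbers agree up to the finite-difference error, while the first required no analysis of the perturbed chain.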

Policies at a Distance?

With special structure, by analyzing one policy we can find a better policy at a distance.

Policy Iteration (PI): the discrete version of PA

Continuous: Perturbation Analysis (PA)

➢ Performance derivatives $d\eta/d\theta$
➢ Find the best direction
➢ Hill climbing
➢ Gradient $\le 0$: local optimal

Discrete: Relative Optimization (RO)

➢ Performance difference $\eta' - \eta$
➢ Find a better policy
➢ Policy iteration
➢ No better policy: global optimal

PA and RO are probably the only ways to overcome the difficulties of exhaustive search.

A Sensitivity-Based View of Optimization

▪ Continuous space: $\theta \to \theta + \Delta\theta$ (perturbation analysis)

▪ Discrete policy space: performance difference formula, PDF (policy iteration)

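Dynamic Programming

Working locally in time and states: optimal policy at k+1 → optimal policy at k

$$\eta_k^d(x) = \sum_{i=k}^{K-1} f(X_i^d) + F(X_K^d), \qquad X_k^d = x$$

[Figure: states X(k) (1 to 6) versus time k (0 to 3), with actions $\alpha_1$, $\alpha_2$, $\alpha_3$ and optimal values $\eta_2^*(5)$, $\eta_2^*(2)$, $\eta_2^*(4)$ at k = 2]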

Problem: under-selectivity for time non-homogeneous systems

Long-run average:

$$\eta(x) = \lim_{K \to \infty} \frac{1}{K} E\Big\{ \sum_{k=0}^{K-1} f_k(X_k) \,\Big|\, X_0 = x \Big\}$$

It does not depend on the policies used in any finite period.
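A small numerical sketch of under-selectivity, under assumptions of my own: a two-state chain with made-up transition matrices, where a "bad" matrix is used for the first 50 steps and a "good" one afterwards. The K-step average of the expected reward is compared with the same chain using the good matrix from the start.

```python
def step(dist, P):
    """Propagate a state distribution one step through transition matrix P."""
    return [sum(dist[i] * P[i][j] for i in range(len(dist))) for j in range(len(P))]

def average_reward(K, switch_time, P_early, P_late, f, start=0):
    """Expected average reward over K steps; P_early is used for k < switch_time."""
    dist = [0.0] * len(f)
    dist[start] = 1.0
    total = 0.0
    for k in range(K):
        total += sum(d * r for d, r in zip(dist, f))
        dist = step(dist, P_early if k < switch_time else P_late)
    return total / K

# Hypothetical data: the long-run average under P_late alone is 17/7.
P_late = [[0.5, 0.5], [0.2, 0.8]]
P_early = [[0.9, 0.1], [0.4, 0.6]]
f = [1.0, 3.0]

for K in (100, 1000, 20000):
    with_transient = average_reward(K, 50, P_early, P_late, f)
    without_transient = average_reward(K, 0, P_early, P_late, f)
    print(K, with_transient, without_transient)
```

For short horizons the two averages differ noticeably; as K grows both approach the same value, so the long-run average cannot distinguish what was done in the first 50 steps.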

Stochastic Control

$$dX(t) = \mu^d[X(t)]\,dt + \sigma^d[X(t)]\,dW(t)$$

[Figure: sample path $X(t)$ between $t$ and $t + \Delta t$]

Problem: non-smooth value function

The local property leads to a differential equation, which does not work for a non-smooth value function (viscosity solution).

Dynamic Programming:

➢ Works backwards in time: $t + \Delta t \to t$
➢ Local information

Any Weakness?

➢ Not convenient for long-run average
➢ Not for non-smooth value functions (viscosity solution)
➢ Degenerate processes not well explored
➢ Not necessary: under-selectivity issue (long-run average does not depend on transient actions)

Sensitivity Based – PA and RO

PA: Given a sample path X → perturbed sample path X(d)

[Figure: sample path X and perturbed sample path X(d) over [0, T]]

Relative Optimization: Comparing two policies

Two policies: $d$ and $d'$, with $X^d(0) = X^{d'}(0) = x$

Two performance measures:
Total reward on ABCDEF: $\eta_0^d(x)$
Total reward on AGHIJK: $\eta_0^{d'}(x)$

[Figure: sample paths ABCDEF (policy d) and AGHIJK (policy d') for k = 0, 1, ..., 5]

Relative Optimization

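Time non-homogeneous Markov chains:

Transition probability matrices: $P_k = [P_k(j|i)]_{i,j \in S}$, $k = 0, 1, 2, \ldots$
Reward: $f_k := (f_k(1), \ldots, f_k(S))^T$, with components $f_k(i)$
Long-run average: $\eta$

Two policies:

$P_1, P_2, \ldots, P_k, \ldots$; $f_1, f_2, \ldots, f_k, \ldots$; value function $g_1, g_2, \ldots, g_k, \ldots$; long-run average $\eta$
$P'_1, P'_2, \ldots, P'_k, \ldots$; $f'_1, f'_2, \ldots, f'_k, \ldots$; value function $g'_1, g'_2, \ldots, g'_k, \ldots$; long-run average $\eta'$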

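Relative Optimization

Performance difference formula:

$$\eta' - \eta = \lim_{K \to \infty} \frac{1}{K} E'\Big\{ \sum_{k=0}^{K-1} [(P'_k - I) g_k + f'_k](X'_k) \,\Big|\, X'_0 = x \Big\} - \lim_{K \to \infty} \frac{1}{K} E'\Big\{ \sum_{k=0}^{K-1} [(P_k - I) g_k + f_k](X'_k) \,\Big|\, X'_0 = x \Big\}$$

$\eta' \ge \eta$ if

$$[(P'_k - I) g_k + f'_k](x) \ge [(P_k - I) g_k + f_k](x)$$

for all $x \in S$ and all $k = 0, 1, 2, \ldots$, except for a finite period, or on a subsequence $k_1, k_2, \ldots$ with $\lim_{n \to \infty} n/k_n = 0$.

HJB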

Stochastic Control

$$dX(t) = \mu^d[X(t)]\,dt + \sigma^d[X(t)]\,dW(t)$$

Finite horizon optimization problem (stationary):

$$\eta^d(x) = E\Big\{ \int_0^T f(X^d(s))\,ds + F(X^d(T)) \,\Big|\, X^d(0) = x \Big\}$$

Goal: $\eta^*(x) = \max_d \{\eta^d(x)\}$, $\forall x$.

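Ito formula: for a smooth function $\eta(x)$,

$$\frac{d}{dt} E\{\eta[X(t)] \mid X(t) = x\} = \mu(x)\dot\eta(x) + \frac{1}{2}\sigma^2(x)\ddot\eta(x).$$

Dynamic programming → HJB equation

Ito-Tanaka formula: for a non-smooth function $\eta(x)$,

$$E\{d\eta[X(t)] \mid X(t) = x\} = \Big[\mu(x)\dot\eta(x) + \frac{1}{2}\sigma^2(x)\ddot\eta(x)\Big]\,dt$$

at smooth points $x$, plus the local-time term

$$[\dot\eta(z+) - \dot\eta(z-)]\, E'\{L_z^{X'}(T) \mid X'(0) = x\},$$

where $z$ is the non-smooth point, $\dot\eta(z+)$ and $\dot\eta(z-)$ are the right-sided and left-sided derivatives, and $L_z^{X'}(T) \ge 0$ is the local time at $z$.

The local time increment $E[dL_z^X(t) \mid X(t) = z]$ is of order $\sqrt{dt}$, and $\lim_{dt \to 0} \sqrt{dt}/dt = \infty$.

Derivatives??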

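Relative Optimization: Based on Comparison

PDF for a non-smooth value function $\eta(x)$:

$$\eta'(x) - \eta(x) = E'\Big\{ \int_0^T \Big[\mu'\dot\eta + \frac{1}{2}\sigma'^2\ddot\eta + f'\Big](X'(t))\,dt \,\Big|\, X'(0) = x \Big\} + [\dot\eta(z+) - \dot\eta(z-)]\, E'\{L_z^{X'}(T) \mid X'(0) = x\}$$

In addition to the HJB equation at smooth points, we need at the non-smooth points

$$\dot\eta(z+) \le \dot\eta(z-)$$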

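1. No viscosity solution is needed!

2. The order in $dt$ is at $\sqrt{dt}$: the local time increment $E[dL_z^X(t) \mid X(t) = z]$ is of order $\sqrt{dt}$, and $\lim_{dt \to 0} \sqrt{dt}/dt = \infty$.

3. $X(t)$ hits the non-smooth point $z$ rarely, but each time it hits it, the effect in $dt$ is infinite. This cannot be captured by derivatives.

Example: $X(t) = W(t)$

1. $\eta(x) = x$: $E[d\eta(t) \mid X(0) = 0] = E[dW(t) \mid X(0) = 0] = 0.$

2. $\eta(x) = |x|$: $E[d\eta(t) \mid X(0) = 0] = E[d|W(t)| \mid X(0) = 0] = \sqrt{2\,dt/\pi}.$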
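The second case of the example can be checked numerically. The sketch below integrates $|x|$ against the $N(0, dt)$ density with a midpoint rule (the function name and grid sizes are my own choices) and compares the result with $\sqrt{2\,dt/\pi}$; it also prints the ratio to $dt$, which grows as $dt$ shrinks.

```python
import math

def expected_abs_increment(dt, n=20000, width=10.0):
    """E|W(dt)| for W(0) = 0, by midpoint-rule integration of |x| against N(0, dt)."""
    h = width * math.sqrt(dt) / n        # grid step on [0, width * sqrt(dt)]
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += x * math.exp(-x * x / (2.0 * dt)) * h
    # Factor 2 accounts for the symmetric negative half-line.
    return 2.0 * total / math.sqrt(2.0 * math.pi * dt)

for dt in (1e-2, 1e-4, 1e-6):
    numeric = expected_abs_increment(dt)
    closed_form = math.sqrt(2.0 * dt / math.pi)
    print(dt, numeric, closed_form, numeric / dt)
```

The last column (expected increment divided by $dt$) blows up as $dt \to 0$, which is the sense in which the effect at the non-smooth point is not captured by a derivative.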

Other Applications???

Relative Optimization (based on comparison of the performance of any two policies):

➢ Global information in the entire [0, T] or [0, ∞); addresses under-selectivity and non-smoothness
➢ No viscosity solution needed
➢ Degenerate processes explored in detail
➢ Long-run average: state classification, bias optimality, multi-class optimization
➢ Insights for further research on control and stochastic processes: local times on curves

[Diagram: Performance Optimization; Dynamic Prog. and Relative Opt.; HJB, etc.; Solutions]

THANKS!

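Example: Long-run Average

Two policies with finite states:

Transition probability matrices $P$, $P'$ ($n \times n$)
Long-run averages $\eta$, $\eta'$
Steady-state probabilities $\pi$, $\pi'$ ($n$-dimensional row vectors)
Reward function $f$ ($n$-dimensional column vector)

Poisson equation: $(I - P)g + \eta e = f$   (1)

$g$: potential, an $n$-dimensional column vector; $e = (1, 1, \ldots, 1)^T$, an $n$-dimensional column vector

Noting $\eta' = \pi' f$, $\pi' e = 1$, and $\pi' P' = \pi'$, left-multiplying (1) by $\pi'$ yields the PDF:

$$\eta' - \eta = \pi'(P' - P)g$$

➢ $\eta' > \eta$ if $P'g > Pg$: policy iteration

➢ $P^*$ is optimal if $P^* g^* \ge P g^*$ for all $P$: HJB equation!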
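The PDF for the long-run average can be verified numerically on a small example. The sketch below uses made-up 3-state matrices and rewards; the potential $g$ is computed from the series $g = \sum_k (P^k f - \eta e)$, which is one convenient solution of the Poisson equation for an ergodic chain (an implementation choice, not something prescribed by the slide).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def vec_mat(v, P):
    """Row vector times matrix."""
    return [sum(v[i] * P[i][j] for i in range(len(v))) for j in range(len(P[0]))]

def mat_vec(P, v):
    """Matrix times column vector."""
    return [dot(row, v) for row in P]

def stationary(P, iters=500):
    """Steady-state distribution by power iteration (ergodic chain assumed)."""
    pi = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        pi = vec_mat(pi, P)
    return pi

def average_and_potential(P, f, iters=500):
    """Return (eta, g) with (I - P) g + eta e = f."""
    eta = dot(stationary(P), f)
    g = [0.0] * len(P)
    v = list(f)
    for _ in range(iters):
        g = [gi + vi - eta for gi, vi in zip(g, v)]
        v = mat_vec(P, v)
    return eta, g

# Hypothetical data for two policies sharing the same reward vector f.
P = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
P_prime = [[0.2, 0.5, 0.3], [0.3, 0.3, 0.4], [0.1, 0.1, 0.8]]
f = [1.0, 2.0, 4.0]

eta, g = average_and_potential(P, f)
eta_prime, _ = average_and_potential(P_prime, f)
pi_prime = stationary(P_prime)

# Right-hand side of the PDF: pi' (P' - P) g, using g of the first policy only.
Pg = mat_vec(P, g)
P_prime_g = mat_vec(P_prime, g)
pdf_rhs = dot(pi_prime, [a - b for a, b in zip(P_prime_g, Pg)])

# Residual of the Poisson equation (1), as a sanity check on g.
poisson_residual = max(abs(gi - pg + eta - fi) for gi, pg, fi in zip(g, Pg, f))

print(eta_prime - eta, pdf_rhs, poisson_residual)
```

The first two printed numbers coincide up to rounding. Note that the comparison $P'g$ versus $Pg$ needs only the potential of the current policy, which is what makes policy iteration possible.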

Markov Decision Processes (MDPs) & Policy Iteration

$$\eta' - \eta = \pi' Q g = \pi'(P' - P)g$$

1. $\eta' > \eta$ if $P'g \ge Pg$, with $>$ for at least one component

2. Policy iteration: at any state, find a policy $P'$ with $P'g \ge Pg$

3. Improve performance iteratively; stop when no improvement can be made

4. Optimality equations: $\hat{P}$ is optimal ⟺ $\hat{P}\hat{g} \ge P\hat{g}$ for all $P$
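Steps 1 to 4 can be sketched as a small program. Everything concrete here is an assumption for illustration: a 3-state, 2-action MDP with made-up transition rows, a reward that depends on the state only (as in the long-run average example), and a brute-force enumeration of all 8 policies used purely as a check.

```python
from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def stationary(P, iters=500):
    """Steady-state distribution by power iteration (ergodic chain assumed)."""
    pi = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]
    return pi

def average_and_potential(P, f, iters=500):
    """Return (eta, g) with (I - P) g + eta e = f."""
    eta = dot(stationary(P), f)
    g = [0.0] * len(P)
    v = list(f)
    for _ in range(iters):
        g = [gi + vi - eta for gi, vi in zip(g, v)]
        v = [dot(row, v) for row in P]
    return eta, g

# ROWS[s][a]: transition row out of state s under action a (hypothetical numbers).
ROWS = [
    [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]],
    [[0.1, 0.6, 0.3], [0.3, 0.3, 0.4]],
    [[0.2, 0.2, 0.6], [0.1, 0.1, 0.8]],
]
REWARD = [1.0, 2.0, 4.0]

def matrix_of(policy):
    return [ROWS[s][a] for s, a in enumerate(policy)]

def policy_iteration(policy):
    policy = list(policy)
    while True:
        eta, g = average_and_potential(matrix_of(policy), REWARD)
        improved = list(policy)
        for s in range(len(REWARD)):
            best = dot(ROWS[s][policy[s]], g)
            for a in range(len(ROWS[s])):
                # Switch only on a clear improvement of (P' g)(s) over (P g)(s).
                if dot(ROWS[s][a], g) > best + 1e-9:
                    best = dot(ROWS[s][a], g)
                    improved[s] = a
        if improved == policy:          # no better policy: stop
            return tuple(policy), eta
        policy = improved

best_policy, best_eta = policy_iteration((0, 0, 0))

# Brute-force check over all 2**3 policies (feasible only for tiny examples).
brute_force_eta = max(
    average_and_potential(matrix_of(p), REWARD)[0] for p in product(range(2), repeat=3)
)
print(best_policy, best_eta, brute_force_eta)
```

On this example the iteration stops at a policy whose long-run average matches the exhaustive maximum, while evaluating far fewer policies than the enumeration would in a realistic problem.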

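The Optimization Problem

Action: $\alpha \in \{\alpha_1, \alpha_2, \ldots, \alpha_N\}$

Transition:
Deterministic: $X(k+1) = \alpha[k, X(k)]$
Stochastic: distribution of $X(k+1)$

Reward: $f^\alpha(k, x)$

Terminating: $F(x)$

A policy $d$: $\alpha = d(k, X(k))$

Under policy $d$: reward $f^d(k, x)$, transition $X(k+1) = \alpha^d[k, X(k)]$

[Figure: states 1 to 8 versus time k, ..., K+1, with actions $\alpha_1$, ..., $\alpha_4$]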

A (Deterministic) Policy d

[Figure: sample paths ABCDEF and RSUV for k = 0, 1, ..., 5]

Sample paths: ABCDEF, RSUV; states: $X^d(k)$, $k = 0, 1, \ldots, 5$

Total reward from $X^d(k) = x$ to $X^d(K)$:

$$\eta_k^d(x) = \sum_{i=k}^{K-1} f^d(i, X^d(i)) + F[X^d(K)], \qquad X^d(k) = x.$$

Optimization: $\eta_k^*(x) = \max_d \{\eta_k^d(x)\}$, for all $k$ and $x$.

Dynamic Programming

Working backwards in time horizontally: optimal policy at k+1 → optimal policy at k

[Figure: states X(k) (1 to 8) versus time k (0 to 5), with policies d1, d2, d3 and terminal rewards F(6), F(5), F(3)]

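The Optimality Condition:

$$f^{d^*}(k, x) + \eta_{k+1}^{d^*}(\alpha^{d^*}(k, x)) \ge f^\alpha(k, x) + \eta_{k+1}^{d^*}(\alpha(k, x)), \qquad \forall \alpha,\ \forall x,\ k = 0, 1, \ldots, K-1. \qquad (**)$$

For stochastic systems, the mapping $\alpha$ is replaced by a transition probability $P^\alpha(y|x)$, and the performance is replaced by its mean:

$$f^{d^*}(k, x) + \sum_y P^{d^*}(y|x)\,\eta_{k+1}^{d^*}(y) \ge f^\alpha(k, x) + \sum_y P^\alpha(y|x)\,\eta_{k+1}^{d^*}(y), \qquad \forall \alpha,\ \forall x,\ k = 0, 1, \ldots, K-1.$$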
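The backward recursion behind this optimality condition can be sketched for the stochastic case as follows. The model is hypothetical (2 states, 2 actions, horizon K = 3, made-up rewards and transition probabilities); the exhaustive comparison over all 64 deterministic policies is included only to confirm that the recursion attains the maximum.

```python
from itertools import product

K = 3                                   # decisions at k = 0, ..., K-1
STATES = range(2)
ACTIONS = range(2)
# P[a][x][y]: probability of moving from x to y under action a (made-up numbers).
P = [
    [[0.9, 0.1], [0.6, 0.4]],
    [[0.3, 0.7], [0.2, 0.8]],
]
RUNNING = [[1.0, 0.0], [2.0, 1.5]]      # RUNNING[x][a], hypothetical reward f
TERMINAL = [0.0, 3.0]                   # TERMINAL[x], hypothetical reward F

def q_value(k, x, a, eta_next):
    """f(k, x, a) + sum_y P^a(y|x) eta_{k+1}(y); k is unused since f is stationary here."""
    return RUNNING[x][a] + sum(P[a][x][y] * eta_next[y] for y in STATES)

def evaluate(policy):
    """Value functions eta[k][x] of a policy given as policy[k][x] -> action."""
    eta = [[0.0] * len(STATES) for _ in range(K)] + [list(TERMINAL)]
    for k in reversed(range(K)):
        for x in STATES:
            eta[k][x] = q_value(k, x, policy[k][x], eta[k + 1])
    return eta

def dynamic_programming():
    """Work backwards in time: optimal values at k+1 give the optimal action at k."""
    eta = [[0.0] * len(STATES) for _ in range(K)] + [list(TERMINAL)]
    policy = [[0] * len(STATES) for _ in range(K)]
    for k in reversed(range(K)):
        for x in STATES:
            values = [q_value(k, x, a, eta[k + 1]) for a in ACTIONS]
            policy[k][x] = max(ACTIONS, key=lambda a: values[a])
            eta[k][x] = values[policy[k][x]]
    return eta, policy

dp_eta, dp_policy = dynamic_programming()

# Exhaustive check over all deterministic policies (2 ** (K * 2) = 64 of them).
brute = [float("-inf")] * len(STATES)
for flat in product(ACTIONS, repeat=K * len(STATES)):
    policy = [list(flat[2 * k: 2 * k + 2]) for k in range(K)]
    eta = evaluate(policy)
    brute = [max(b, eta[0][x]) for x, b in zip(STATES, brute)]

print(dp_eta[0], brute, dp_policy)
```

The recursion visits each (k, x, action) triple once, whereas the enumeration grows exponentially with the horizon and the number of states.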

[Figure: sample paths under d (ABCDEF) and d' (AGHIJK) for k = 0, 1, ..., 5, over states 1 to 8]

Adding auxiliary paths starting from the sample path of d' at each time k, but following policy d.

Every such sample path has a total reward $\eta_k^d[X^{d'}(k)]$.

[Figure: paths ABCDEF (policy d) and AGHIJK (policy d') for k = 0, 1, ..., 5, with auxiliary paths JM, ILQ, HOPQ, and GRSUV following policy d]

$\eta^{d'} - \eta^d$ = AGHIJK − ABCDEF

= (AGHIJK − AGHIJM) + (AGHIJM − AGHILQ) + (AGHILQ − AGHOPQ) + (AGHOPQ − AGRSUV) + (AGRSUV − ABCDEF)

= (JK − JM) + (IJM − ILQ) + (HILQ − HOPQ) + (GHOPQ − GRSUV) + (AGRSUV − ABCDEF)

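[Figure: paths ABCDEF (policy d) and AGHIJK (policy d') for k = 0, 1, ..., 5, with auxiliary paths JM, ILQ, HOPQ, and GRSUV following policy d]

$\eta^{d'} - \eta^d$ = (JK − JM) + (IJM − ILQ) + (HILQ − HOPQ) + (GHOPQ − GRSUV) + (AGRSUV − ABCDEF)

We have

$$\mathrm{JK} - \mathrm{JM} = [f'(4, J) + F(K)] - [f(4, J) + F(M)] = \{f'(4, J) + \eta_K[\alpha'(4, J)]\} - \{f(4, J) + \eta_K[\alpha(4, J)]\}$$

$$\mathrm{GHOPQ} - \mathrm{GRSUV} = \{\mathrm{GH} + \mathrm{HOPQ}\} - \{\mathrm{GR} + \mathrm{RSUV}\} = \{f'(1, X'(1)) + \eta_2^d[\alpha'(1, X'(1))]\} - \{f(1, X'(1)) + \eta_2^d[\alpha(1, X'(1))]\}$$

……

$$\{f'(4, X'(4)) + \eta_K[\alpha'(4, X'(4))]\} - \{f(4, X'(4)) + \eta_K[\alpha(4, X'(4))]\}$$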

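Thus, we get the performance difference formula (PDF):

$$\eta' - \eta = \mathrm{AGHIJK} - \mathrm{ABCDEF} = \sum_{k=0}^{K-1} \Big[ \{f'(k, X'(k)) + \eta_{k+1}[\alpha'(k, X'(k))]\} - \{f(k, X'(k)) + \eta_{k+1}[\alpha(k, X'(k))]\} \Big]$$

where $\eta_{k+1}$ is the value function of policy $d$.

Optimality condition:

$$f(k, x) + \eta_{k+1}[\alpha(k, x)] \ge f'(k, x) + \eta_{k+1}[\alpha'(k, x)], \qquad \forall d',\ \forall x,\ k = 0, 1, \ldots, K-1. \qquad (**)$$

This is the same as the DP equation.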
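The finite-horizon PDF can be checked numerically in its stochastic (expected-value) form, where the mapping $\alpha$ is replaced by transition probabilities and the sum is weighted by the state distribution under $d'$. The model and the two policies below are made up for the check; only the value function of policy $d$ appears on the right-hand side.

```python
K = 3
STATES = range(2)
# P[a][x][y]: transition probabilities under action a (hypothetical numbers).
P = [
    [[0.9, 0.1], [0.6, 0.4]],
    [[0.3, 0.7], [0.2, 0.8]],
]
RUNNING = [[1.0, 0.0], [2.0, 1.5]]      # RUNNING[x][a], hypothetical reward f
TERMINAL = [0.0, 3.0]                   # TERMINAL[x], hypothetical reward F

def q_value(x, a, eta_next):
    """f(x, a) + sum_y P^a(y|x) eta_{k+1}(y)."""
    return RUNNING[x][a] + sum(P[a][x][y] * eta_next[y] for y in STATES)

def evaluate(policy):
    """Value functions eta[k][x] of a policy given as policy[k][x] -> action."""
    eta = [[0.0] * len(STATES) for _ in range(K)] + [list(TERMINAL)]
    for k in reversed(range(K)):
        for x in STATES:
            eta[k][x] = q_value(x, policy[k][x], eta[k + 1])
    return eta

# Two arbitrary time-dependent policies: d[k][x] and d_prime[k][x].
d = [[0, 0], [1, 0], [0, 1]]
d_prime = [[1, 1], [0, 1], [1, 0]]
x0 = 0

eta_d = evaluate(d)
eta_d_prime = evaluate(d_prime)
lhs = eta_d_prime[0][x0] - eta_d[0][x0]

# Right-hand side: follow d' forward, and at each step compare one step of d'
# with one step of d, both continued with the value function of d.
dist = [1.0 if x == x0 else 0.0 for x in STATES]
rhs = 0.0
for k in range(K):
    for x in STATES:
        gain = q_value(x, d_prime[k][x], eta_d[k + 1]) - q_value(x, d[k][x], eta_d[k + 1])
        rhs += dist[x] * gain
    dist = [sum(dist[x] * P[d_prime[k][x]][x][y] for x in STATES) for y in STATES]

print(lhs, rhs)
```

Both sides agree up to rounding, which mirrors the telescoping decomposition along the auxiliary paths: each term compares a single step of $d'$ against a single step of $d$.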

Comparison

DP: ~ Riemann integration. Local information at time k; derivative in continuous time.

DC (PDF): ~ Lebesgue integration. Global information in the entire horizon [0, K]; more than a derivative.

Application to Stochastic Control
