Direct Policy Search

DESCRIPTION

Direct Policy Search (in short) + discussion of applications to stock problems

TRANSCRIPT
0. What is Direct Policy Search?
1. Direct Policy Search: Parametric Policies for Financial Applications
2. Parametric Bellman values for Stock Problems
3. Direct Policy Search: Optimization Tools
DIRECT POLICY SEARCH
First, you need to know what direct policy search (DPS) is.
Principle of DPS:
(1) Define a parametric policy Pi with parameters t1,...,tk.
(2) Maximize, over (t1,...,tk), the average reward obtained when applying policy Pi(t1,...,tk) on the problem.
==> You must define Pi
==> You must choose a noisy optimization algorithm
==> There is a default Pi (an actor neural network), but it's only a default solution (overload it)
(A minimal sketch of this loop is given below.)
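A minimal sketch of this loop, on a toy one-stock problem (everything below is illustrative, not the actual framework): the policy is parametric, the objective is the average reward over simulated scenarios, and that noisy objective is what the optimizer maximizes.

#include <random>
#include <vector>

// Toy one-stock problem (purely illustrative, not the real simulator):
// at each step we see a random price, must serve a demand of 1 kWh,
// and the parametric policy pi(theta) buys theta[1] kWh whenever
// price < theta[0].  Reward = -(purchase cost + shortage penalties).
double simulateEpisode(const std::vector<double>& theta, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> price(10.0, 50.0);
    double stock = 0.0, reward = 0.0;
    for (int t = 0; t < 100; ++t) {
        double p = price(gen);
        if (p < theta[0]) {                  // policy: buy when cheap
            stock += theta[1];
            reward -= p * theta[1];
        }
        if (stock >= 1.0) stock -= 1.0;      // serve the demand
        else reward -= 100.0;                // shortage penalty
    }
    return reward;
}

// Noisy objective for DPS: average reward of pi(theta) over scenarios.
double averageReward(const std::vector<double>& theta, int nbScenarios) {
    double total = 0.0;
    for (int i = 0; i < nbScenarios; ++i)
        total += simulateEpisode(theta, 1000u + i);
    return total / nbScenarios;
}

// DPS then maximizes averageReward(theta) with a noisy optimization
// algorithm (evolution strategies, cross-entropy, NewUOA, ...).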
Strengths of DPS:
- Good warm start: if I have a solution for problem A, and if I switch to a problem B close to A, then I quickly get good results.
- Benefits from expert knowledge on the structure
- No constraint on the structure of the objective function
- Anytime (i.e. not that bad in restricted time)
Drawbacks:
- needs structured direct policy search
- not directly applicable to partial observation
virtual MashDecision computeDecision(MashState& state, const Vector<double>& params);
==> “params” = t1,...,tk
==> returns the decision pi(t1,...,tk, state)
Does it make sense?
Overload this function, and DPS is ready to work.
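For instance, a hypothetical overload could look like this (the Mash types below are placeholder stand-ins, since the real interface is not shown here); it implements a two-parameter threshold policy:

#include <vector>

// Placeholder stand-ins for the real Mash types (not the actual interface).
struct MashState    { double stock; double price; };
struct MashDecision { double quantityToBuy; };
template <class T> using Vector = std::vector<T>;

struct Policy {
    virtual MashDecision computeDecision(MashState& state,
                                         const Vector<double>& params) = 0;
    virtual ~Policy() = default;
};

// Hypothetical overload: params[0] = price threshold,
// params[1] = quantity bought when the price is below the threshold.
struct ThresholdPolicy : Policy {
    MashDecision computeDecision(MashState& state,
                                 const Vector<double>& params) override {
        return MashDecision{ state.price < params[0] ? params[1] : 0.0 };
    }
};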
Well, DPS (somewhere between alpha and beta) might be full of bugs :-)
Direct Policy Search: Parametric Policies for Financial Applications
Bengio et al.'s papers on DPS for financial applications
Stocks (various assets) + Cash
decision = tradingUnit(A, prevision(B, data))
Where:
- tradingUnit is designed by human experts
- prevision's outputs are chosen by human experts
- prevision is a neural network
- A and B are parameters
Then: B optimized by LMS (a prediction criterion) ==> poor results, little correlation between LMS and financial performance.
A and B optimized on the expected return (by DPS) ==> much better. (A minimal sketch of this composition follows below.)
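A minimal sketch of this composition (function shapes and names are my guesses from the description, not Bengio et al.'s actual code): prevision and tradingUnit carry the parameter blocks B and A, and DPS maximizes the average return of the composed policy on historical scenarios.

#include <cstddef>
#include <vector>

using Params = std::vector<double>;
using Series = std::vector<double>;

// prevision(B, data): here a linear predictor over recent prices
// (the papers use a neural network instead).
double prevision(const Params& B, const Series& recentPrices) {
    double out = B[0];
    for (std::size_t i = 0; i + 1 < B.size() && i < recentPrices.size(); ++i)
        out += B[i + 1] * recentPrices[i];
    return out;
}

// tradingUnit(A, forecast): an expert-designed rule, here a simple
// thresholded position size.
double tradingUnit(const Params& A, double forecast) {
    if (forecast >  A[0]) return  A[1];   // go long
    if (forecast < -A[0]) return -A[1];   // go short
    return 0.0;                           // stay in cash
}

// DPS objective: average return of the composed policy on historical data
// (no simulator needed, since the policy has no impact on prices).
double averageReturn(const Params& A, const Params& B,
                     const std::vector<Series>& histories,
                     const std::vector<double>& nextReturns) {
    double total = 0.0;
    for (std::size_t t = 0; t < histories.size(); ++t)
        total += tradingUnit(A, prevision(B, histories[t])) * nextReturns[t];
    return total / histories.size();
}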
- Can be applied on data sets (no simulator, no elasticity model) because the policy has no impact on prices
- 22 params in the first paper
- reduced weight sharing in the other paper ==> ~800 parameters (if I understand correctly)
- there exist much bigger DPS (Sigaud et al., 27,000 parameters)
- NB: this is noisy optimization
An Alternate Solution: Parametric Bellman Values for Stock Problems
What is a Bellman function?
V(s): expected future benefit if playing optimally from state s.
V(s) is useful for playing optimally.
Rule for an optimal decision:
d(s) = argmax_d [ r(s,d) + V(s') ]
where:
- s' = nextState(s,d)
- d(s): optimal decision in state s
- V(s'): Bellman value in state s'
- r(s,d): reward associated to decision d in state s
Remark 1: V(s) known up to an additive constant is enough
Remark 2: dV(s)/ds_i is the price of stock i
Example with one stock, soon.
Q-rule for an optimal decision:
d(s) = argmax_d Q(s,d)
- d(s): optimal decision in state s
- Q(s,d): optimal future reward if decision = d in state s
==> approximate Q instead of V
==> we don't need r(s,d) nor newState(s,d)
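For a finite set of candidate decisions, the two rules above can be written as follows (a generic sketch; the types and helper functions are placeholders, not an existing API):

#include <functional>
#include <limits>
#include <vector>

// V-rule: d(s) = argmax_d [ r(s,d) + V(nextState(s,d)) ]
// (needs the model: r(s,d) and nextState(s,d)).
template <class State, class Decision>
Decision decideWithV(const State& s,
                     const std::vector<Decision>& decisions,
                     std::function<double(const State&, const Decision&)> r,
                     std::function<State(const State&, const Decision&)> nextState,
                     std::function<double(const State&)> V) {
    Decision best = decisions.front();
    double bestScore = -std::numeric_limits<double>::infinity();
    for (const Decision& d : decisions) {
        double score = r(s, d) + V(nextState(s, d));
        if (score > bestScore) { bestScore = score; best = d; }
    }
    return best;
}

// Q-rule: d(s) = argmax_d Q(s,d) -- no model needed.
template <class State, class Decision>
Decision decideWithQ(const State& s,
                     const std::vector<Decision>& decisions,
                     std::function<double(const State&, const Decision&)> Q) {
    Decision best = decisions.front();
    double bestScore = -std::numeric_limits<double>::infinity();
    for (const Decision& d : decisions) {
        double score = Q(s, d);
        if (score > bestScore) { bestScore = score; best = d; }
    }
    return best;
}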
[Figure: V(stock) (in euros) as a function of stock (in kWh). Low stock: "I need a lot of stock! I accept to pay a lot." High stock: "I have enough stock; I pay only if it's cheap." Slope = marginal price (euros/kWh).]
Examples:
For one stock:
- very simple: constant price
- piecewise linear (can ensure convexity)
- "tanh" function
- neural network, SVM, sum of Gaussians...
For several stocks:
- each stock separately
- 2-dimensional: V(s1,s2,s3) = V'(s1,S) + V''(s2,S) + V'''(s3,S), where S = a1.s1 + a2.s2 + a3.s3
- neural network, SVM, sum of Gaussians...
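As an illustration, here is a minimal sketch of a piecewise-linear V for one stock (the parameter layout is my own choice, not from the talk): the parameters are the slopes of successive segments, i.e. the marginal prices of the figure above.

#include <algorithm>
#include <vector>

// Piecewise-linear V(stock) for one stock: params = slopes on successive
// segments of width `step` (slope i = marginal price on segment i, in euros/kWh).
double piecewiseLinearV(double stock, const std::vector<double>& slopes,
                        double step) {
    double value = 0.0, remaining = stock;
    for (double slope : slopes) {
        double used = std::min(remaining, step);
        value += slope * used;
        remaining -= used;
        if (remaining <= 0.0) break;
    }
    return value;
}

// Constraining the slopes to be monotone enforces the convexity/concavity
// of V (the marginal price changes monotonically with the stock level).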
How to choose the coefficients?
- dynamic programming: robust, but slow in high dimension
- direct policy search:
  - initializing coefficients from expert advice
  - or: supervised machine learning for approximating an expert advice
  ==> and then optimize
Conclusions:
V: very convenient representation of a policy: we can view prices.
Q: some advantages (model-free).
Yet, less readable than direct rules.
And expensive: we need one optimization to make the decision, at each time step of a simulation.
==> but this optimization can be a simple sort (as a first approximation).
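One possible reading of that remark (an assumption on my part, not stated in the talk): when the slopes of V are marginal prices, the per-step decision can be approximated by a merit-order rule, i.e. sorting the candidate resources by marginal cost and using the cheapest first.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Unit { double marginalCost; double capacity; };  // euros/kWh, kWh

// Merit-order dispatch as the "simple sort": sort candidate resources
// (including "use stock i", whose marginal cost is the slope dV/ds_i)
// by marginal cost, then serve the demand cheapest-first.
std::vector<double> dispatch(std::vector<Unit> units, double demand) {
    std::sort(units.begin(), units.end(),
              [](const Unit& a, const Unit& b) {
                  return a.marginalCost < b.marginalCost;
              });
    std::vector<double> used(units.size(), 0.0);
    for (std::size_t i = 0; i < units.size() && demand > 0.0; ++i) {
        used[i] = std::min(units[i].capacity, demand);
        demand -= used[i];
    }
    return used;  // kWh taken from each unit, in merit order
}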
Simpler? Adrien has a parametric strategy for stocks
==> we should see how to generalize it
==> transformation "constants → parameters" ==> DPS
Questions (strategic decisions for the DPS):
- start with Adrien's policy, improve it, generalize it, parametrize it? interface with ARM?
- or another strategy?
- or a parametric V function, assuming we have r(s,d) and newState(s,d) (often true)?
- or a parametric Q function? (more generic, unusual but appealing, but neglects some existing knowledge: r(s,d) and newState(s,d))
Further work:
- finish the validation of Adrien's policy on stock (better than random as a policy; better than random as a UCT-Monte-Carlo)
- generalize? variants?
- introduce into DPS, compare to the baseline (neural net)
- introduce DPS's result into MCTS
Direct Policy Search: Optimization Tools & Optimization Tricks
- Classical tools: Evolution Strategies, Cross-Entropy, PSO, ...
==> more or less supposed to be robust to local minima
==> no gradient needed
==> robust to noisy objective functions
==> weak in high dimension (but: see locality, next slide)
- Hopefully: good initialization (nearly convex), random seeds (no noise)
==> NewUOA is my favorite choice:
- no gradient needed
- can "really" work in high dimension
- update rule surprisingly fast
- people who try to show that their algorithm is better than NewUOA suffer a lot in the noise-free case
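Not NewUOA itself (Powell's derivative-free method, whose internals are beyond this sketch), but here is a minimal example of the kind of gradient-free, noise-tolerant optimizer DPS plugs into: a (1+1) evolution strategy with resampling. The averageReward function sketched earlier could be passed as noisyObjective.

#include <functional>
#include <random>
#include <vector>

// Minimal (1+1) evolution strategy for a noisy objective (to be maximized).
// Each candidate is re-evaluated on several scenarios to average out noise.
std::vector<double> onePlusOneES(
        std::function<double(const std::vector<double>&)> noisyObjective,
        std::vector<double> theta, double sigma, int iterations) {
    std::mt19937 gen(0);
    std::normal_distribution<double> gauss(0.0, 1.0);
    auto score = [&](const std::vector<double>& x) {
        double s = 0.0;
        for (int i = 0; i < 10; ++i) s += noisyObjective(x);  // resampling
        return s / 10.0;
    };
    double best = score(theta);
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> candidate = theta;
        for (double& x : candidate) x += sigma * gauss(gen);   // Gaussian mutation
        double v = score(candidate);
        if (v >= best) { best = v; theta = candidate; sigma *= 1.5; }
        else           { sigma *= 0.9; }   // step-size adaptation (1/5th-rule-like)
    }
    return theta;
}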
Improvements of optimization algorithms:
- active learning: when optimizing on scenarios, choose "good" scenarios
==> maybe "quasi-randomization"? Just choosing a representative sample of scenarios. ==> simple, robust...
- local improvement: when a gradient step/update is performed, only update the variables concerned by the simulation you've used for generating the update
==> difficult to use in NewUOA
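A minimal illustration of the "representative sample of scenarios" idea (the index-selection rule here is my own simplification, not the actual active-learning scheme): reuse the same small, evenly spread subset of scenarios for every candidate policy, so that candidates are compared without evaluation noise.

#include <cstddef>
#include <vector>

// Pick a small, evenly spread ("quasi-random") subset of scenario indices,
// reused for every candidate policy so that comparisons are noise-free.
std::vector<std::size_t> representativeSubset(std::size_t nbScenarios,
                                              std::size_t sampleSize) {
    std::vector<std::size_t> subset;
    for (std::size_t i = 0; i < sampleSize; ++i)
        subset.push_back((i * nbScenarios) / sampleSize);  // evenly spaced
    return subset;
}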
Roadmap:
- default policy for energy management problems: test, generalize, formalize, simplify...
- this default policy ==> a parametric policy
- test in DPS: strategy A
- interface DPS with NewUOA and/or others (openDP opt?)
- Strategy A: test inside MCTS ==> Strategy B
==> IMHO, strategy A = good tool for fast, readable, non-myopic results
==> IMHO, strategy B = good for combining A with the efficiency of MCTS for short-term combinatorial effects
- Also, validating the partial observation (sounds good).