Direct policy search

DESCRIPTION

Direct Policy Search (in short) + discussions of applications to stock problems

TRANSCRIPT

Page 1: Direct policy search

0. What is Direct Policy Search ?

1. Direct Policy Search: Parametric Policies for Financial Applications

2. Parametric Bellman values for Stock Problems

3. Direct Policy Search: Optimization Tools

DIRECT POLICY SEARCH

Page 2: Direct policy search

First, you need to know what direct policy search (DPS) is.

Principle of DPS:

(1) Define a parametric policy Pi with parameters t1,...,tk.

(2) Maximize, over (t1,...,tk), the average reward obtained when applying the policy Pi(t1,...,tk) to the problem.

==> You must define Pi.
==> You must choose a noisy optimization algorithm.

==> There is a default Pi (an actor neural network), but it is only a default solution (overload it).
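
To make the principle concrete, here is a minimal C++ sketch (the toy stock problem, the linear policy and the random search are all hypothetical stand-ins, not the actual framework): a policy Pi parametrized by (t1, t2), an objective equal to the average reward over noisy simulations, and a crude random search standing in for a real noisy optimization algorithm.

#include <cstdlib>
#include <iostream>
#include <vector>

// Toy problem (hypothetical): keep a stock close to zero under noisy demand.
// The parametric policy is pi(t, stock) = t[0] + t[1] * stock.
double simulateOnce(const std::vector<double>& t) {
  double stock = 10.0, reward = 0.0;
  for (int step = 0; step < 50; ++step) {
    double demand = 1.0 + 0.5 * ((double)std::rand() / RAND_MAX);  // noise
    double decision = t[0] + t[1] * stock;                         // pi(t, state)
    stock += decision - demand;
    reward -= stock * stock + 0.1 * decision * decision;           // cost
  }
  return reward;
}

// The DPS objective: average reward of pi(t) over many noisy simulations.
double averageReward(const std::vector<double>& t, int nSims = 100) {
  double sum = 0.0;
  for (int i = 0; i < nSims; ++i) sum += simulateOnce(t);
  return sum / nSims;
}

int main() {
  // Crude random search standing in for a real noisy optimizer
  // (Evolution Strategies, NewUoa, ...).
  std::vector<double> best(2, 0.0);
  double bestScore = averageReward(best);
  for (int iter = 0; iter < 200; ++iter) {
    std::vector<double> cand = best;
    for (double& x : cand) x += 0.1 * (2.0 * std::rand() / RAND_MAX - 1.0);
    double score = averageReward(cand);
    if (score > bestScore) { best = cand; bestScore = score; }
  }
  std::cout << "best parameters: " << best[0] << " " << best[1]
            << ", average reward: " << bestScore << std::endl;
  return 0;
}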

Page 3: Direct policy search

Strengths of DPS:

- Good warm start: if I have a solution for problem A and I switch to a problem B close to A, then I quickly get good results.

- Benefits from expert knowledge on the structure

- No constraint on the structure of the objective function
- Anytime (i.e. not that bad in restricted time)

Drawbacks:
- needs a structured direct policy search

- not directly applicable to partial observation

Page 4: Direct policy search

virtual MashDecision computeDecision(MashState & state, const Vector<double> params)

==> “params” = t1,...,tk
==> returns the decision pi(t1,...,tk, state)

Does it make sense ?

Overload this function, and DPS is ready to work.

Well, DPS (somewhere between alpha and beta) might be full of bugs :-)
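
For illustration, here is a sketch of what such an overload could compute for a one-stock problem. Only the method signature comes from the slide above; MashState, MashDecision and Vector are replaced by hypothetical stand-ins so the snippet is self-contained, and the threshold rule itself is an arbitrary example.

#include <algorithm>
#include <vector>

// Hypothetical stand-ins for the real framework types.
struct MashState { double stock; double price; };
typedef double MashDecision;                      // quantity to buy (>= 0)
template <class T> using Vector = std::vector<T>; // stand-in for the framework's Vector

// Example overload: a two-parameter threshold policy.
//   params[0] = target stock level   (t1)
//   params[1] = maximum buying price (t2)
MashDecision computeDecision(MashState& state, const Vector<double> params) {
  if (state.price <= params[1])
    return std::max(0.0, params[0] - state.stock);  // refill up to the target when cheap
  return 0.0;                                       // otherwise, do nothing
}

DPS then optimizes params[0] and params[1] on the average simulated reward, exactly as on the previous slide.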

Page 5: Direct policy search

Direct Policy Search:
Parametric Policies for Financial Applications

Page 6: Direct policy search

Bengio et al.'s papers on DPS for financial applications

Stocks (various assets) + Cash

decision = tradingUnit(A, prevision(B, data))

Where:
- tradingUnit is designed by human experts
- prevision's outputs are chosen by human experts
- prevision is a neural network
- A and B are parameters

Then:
- B optimized by LMS (the prevision criterion) ==> poor results; little correlation between the LMS error and financial performance
- A and B optimized on the expected return (by DPS) ==> much better
(a toy sketch of this composition follows at the end of this slide)

- Can be applied on data sets (no simulator, no elasticity model) because the policy has no impact on prices

- 22 parameters in the first paper

- reduced weight sharing in the other paper ==> ~800 parameters

(if I understand correctly)

- there exist much bigger DPS instances (Sigaud et al., 27,000 parameters)

- NB: noisy optimization
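
A toy sketch of the decision = tradingUnit(A, prevision(B, data)) composition above. The structure (an expert-designed trading unit fed by a learned prevision module) comes from the slide; the concrete bodies below, a single tanh unit and a threshold rule, are simplified stand-ins (the papers use a neural network for prevision).

#include <cmath>
#include <cstddef>
#include <vector>

// prevision: parametrized by B; one linear unit + tanh stands in for the network.
// Convention (hypothetical): B holds data.size() weights followed by one bias term.
double prevision(const std::vector<double>& B, const std::vector<double>& data) {
  double out = B.back();                                   // bias
  for (std::size_t i = 0; i < data.size(); ++i) out += B[i] * data[i];
  return std::tanh(out);                                   // signal in [-1, 1]
}

// tradingUnit: expert-designed rule parametrized by A.
// Convention (hypothetical): A[0] = position size, A[1] = entry threshold.
double tradingUnit(const std::vector<double>& A, double signal) {
  if (signal >  A[1]) return  A[0];   // long position
  if (signal < -A[1]) return -A[0];   // short position
  return 0.0;                         // stay out of the market
}

double decision(const std::vector<double>& A, const std::vector<double>& B,
                const std::vector<double>& data) {
  return tradingUnit(A, prevision(B, data));
}

In the DPS version, A and B are concatenated into one parameter vector and optimized jointly on the expected return; the LMS version instead fits B on a prediction criterion first.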

Page 7: Direct policy search

An alternate solution:

parametric Bellman values

for Stock Problems

Page 8: Direct policy search

What is a Bellman function ?

V(s): expected future benefit when playing optimally from state s.

V(s) is useful for playing optimally.

Page 9: Direct policy search

Rule for an optimal decision:

d(s) = argmax_d [ V(s') + r(s,d) ]

where:
- s' = nextState(s,d)
- d(s): optimal decision in state s
- V(s'): Bellman value in state s'
- r(s,d): reward associated with decision d in state s
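
A minimal sketch of this rule over a finite set of candidate decisions. States and decisions are reduced to doubles, and nextState, r, V are passed in as functions; all of this is an illustrative assumption, only the argmax structure comes from the slide.

#include <functional>
#include <limits>
#include <vector>

// d(s) = argmax_d [ V(nextState(s, d)) + r(s, d) ]
double greedyDecision(double s,
                      const std::vector<double>& candidates,  // candidate decisions
                      const std::function<double(double, double)>& nextState,
                      const std::function<double(double, double)>& r,
                      const std::function<double(double)>& V) {
  double bestD = candidates.front();
  double bestValue = -std::numeric_limits<double>::infinity();
  for (double d : candidates) {
    double value = V(nextState(s, d)) + r(s, d);
    if (value > bestValue) { bestValue = value; bestD = d; }
  }
  return bestD;
}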

Page 10: Direct policy search

Remark 1: V(s) known up to an additive constant is enough

Remark 2: dV(s)/d(s_i) is the price of stock i.

Example with one stock, soon.

Page 11: Direct policy search

Q-rule for an optimal decision:

d(s) = argmax_d Q(s,d)

- d(s): optimal decision in state s
- Q(s,d): optimal future reward if decision d is taken in state s

==> approximate Q instead of V
==> we need neither r(s,d) nor newState(s,d)
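
The Q-based rule is the same loop with the scored term replaced by Q(s,d); note that neither r(s,d) nor the transition function appears (again doubles stand in for states and decisions, an assumption of the sketch):

#include <functional>
#include <limits>
#include <vector>

// d(s) = argmax_d Q(s, d): no reward model, no transition model needed.
double greedyDecisionQ(double s,
                       const std::vector<double>& candidates,
                       const std::function<double(double, double)>& Q) {
  double bestD = candidates.front();
  double bestValue = -std::numeric_limits<double>::infinity();
  for (double d : candidates) {
    double value = Q(s, d);
    if (value > bestValue) { bestValue = value; bestD = d; }
  }
  return bestD;
}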

Page 12: Direct policy search

[Figure: V(stock), in euros, as a function of stock (in kWh). At low stock: “I need a lot of stock! I accept to pay a lot.” At high stock: “I have enough stock; I pay only if it's cheap.” The slope is the marginal price (euros/kWh).]

Page 13: Direct policy search

Examples:

For one stock:
- very simple: constant price
- piecewise linear (can ensure convexity)
- “tanh” function
- neural network, SVM, sum of Gaussians...

For several stocks:
- each stock separately
- 2-dimensional terms: V(s1,s2,s3) = V'(s1,S) + V''(s2,S) + V'''(s3,S), where S = a1.s1 + a2.s2 + a3.s3
- neural network, SVM, sum of Gaussians...
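
Two of the forms above as minimal sketches (the parameter layouts are arbitrary illustrations): a piecewise-linear V for one stock, and the several-stock decomposition V(s1,...,sn) = sum_i V_i(s_i, S) with S = a1.s1 + ... + an.sn, each two-dimensional term being a toy tanh unit here.

#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// One stock: piecewise-linear V defined by sorted breakpoints and their values.
// Choosing monotone slopes enforces the convexity mentioned on the slide.
double piecewiseLinearV(const std::vector<double>& breakpoints,
                        const std::vector<double>& values,
                        double stock) {
  if (stock <= breakpoints.front()) return values.front();
  for (std::size_t i = 1; i < breakpoints.size(); ++i) {
    if (stock <= breakpoints[i]) {
      double w = (stock - breakpoints[i - 1]) / (breakpoints[i] - breakpoints[i - 1]);
      return (1.0 - w) * values[i - 1] + w * values[i];
    }
  }
  return values.back();
}

// Several stocks: V(s1,...,sn) = sum_i V_i(s_i, S), with S = a1*s1 + ... + an*sn.
// Each 2-dimensional term V_i is a toy tanh unit with 3 parameters (arbitrary choice).
double decomposedV(const std::vector<double>& s,       // stock levels
                   const std::vector<double>& a,       // coefficients defining S
                   const std::vector<double>& params)  // 3 parameters per term
{
  double S = std::inner_product(s.begin(), s.end(), a.begin(), 0.0);
  double total = 0.0;
  for (std::size_t i = 0; i < s.size(); ++i)
    total += params[3 * i] * std::tanh(params[3 * i + 1] * s[i] + params[3 * i + 2] * S);
  return total;
}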

Page 14: Direct policy search

How to choose coefficients ?

- dynamic programming: robust, but slow in high dimension
- direct policy search:
  - initializing coefficients from expert advice
  - or: supervised machine learning for approximating expert advice
  ==> and then optimize
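
A sketch of the warm-start idea (the linear V and the least-squares loop are illustrative assumptions): fit the parameters of a simple V(stock) = p0 + p1 * stock to expert-provided (stock, value) pairs, then give the fitted parameters to the noisy optimizer as its starting point.

#include <cstddef>
#include <vector>

// Least-squares fit of V(stock) = p[0] + p[1] * stock to expert (stock, value)
// pairs, by simple gradient steps; the result initializes the DPS optimization.
std::vector<double> fitToExpert(const std::vector<double>& stocks,
                                const std::vector<double>& expertValues,
                                int iterations = 1000, double learningRate = 1e-3) {
  std::vector<double> p(2, 0.0);
  for (int it = 0; it < iterations; ++it) {
    for (std::size_t i = 0; i < stocks.size(); ++i) {
      double error = p[0] + p[1] * stocks[i] - expertValues[i];
      p[0] -= learningRate * error;
      p[1] -= learningRate * error * stocks[i];
    }
  }
  return p;
}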

Page 15: Direct policy search

Conclusions:

V: a very convenient representation of the policy: we can visualize prices.

Q: some advantages (model-free).

Yet, less readable than direct rules.

And expensive: we need one optimization to make the decision at each time step of a simulation ==> but this optimization can be a simple sort (as a first approximation).

Simpler? Adrien has a parametric strategy for stocks
==> we should see how to generalize it
==> transformation “constants → parameters” ==> DPS

Page 16: Direct policy search

Questions (strategic decisions for the DPS):
- start with Adrien's policy, improve it, generalize it, parametrize it? interface with ARM?
- or another strategy?
- or a parametric V function, assuming we have r(s,d) and newState(s,d) (often true)?
- or a parametric Q function? (more generic, unusual but appealing, but it neglects some existing knowledge: r(s,d) and newState(s,d))

Further work:
- finish the validation of Adrien's policy on stocks (better than random as a policy; better than random as a UCT Monte-Carlo)
- generalize? variants?
- introduce it into DPS, compare to the baseline (neural net)
- introduce DPS's result into MCTS


Page 18: Direct policy search

Direct Policy Search:

Optimization Tools

& Optimization Tricks

Page 19: Direct policy search

- Classical tools: Evolution Strategies, Cross-Entropy, PSO, ...
  ==> more or less supposed to be robust to local minima
  ==> no gradient
  ==> robust to a noisy objective function
  ==> weak for high dimension (but: see locality, next slide)

- Hopefully:
  - good initialization: nearly convex
  - random seeds: no noise (sketch below)

==> NewUoa is my favorite choice:
  - no gradient
  - can “really” work in high dimension
  - update rule surprisingly fast
  - people who try to show that their algorithm is better than NewUoa suffer a lot in the noise-free case
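
The “random seeds: no noise” line can be read as the common-random-numbers trick, sketched below (simulate and its signature are assumptions of the sketch): every candidate parameter vector is scored on the same fixed list of seeds, so the objective becomes deterministic and a noise-free optimizer such as NewUoa applies.

#include <random>
#include <vector>

// Deterministic DPS objective: reuse the same seeds for every candidate.
// simulate(params, rng) is assumed to run one episode and return its reward.
double deterministicObjective(const std::vector<double>& params,
                              double (*simulate)(const std::vector<double>&, std::mt19937&),
                              const std::vector<unsigned>& seeds) {
  double sum = 0.0;
  for (unsigned seed : seeds) {
    std::mt19937 rng(seed);        // identical scenarios for every candidate
    sum += simulate(params, rng);
  }
  return sum / seeds.size();
}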

Page 20: Direct policy search

Improvements of optimization algorithms:

- active learning: when optimizing on scenarios, choose “good” scenarios
  ==> maybe “quasi-randomization”? Just choosing a representative sample of scenarios (see the sketch after this slide).
  ==> simple, robust...

- local improvement: when a gradient step/update is performed, only update the variables concerned by the simulation you used for generating the update
  ==> difficult to use in NewUoa
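
One possible reading of the “quasi-randomization” point above, as a sketch (the normal scenario distribution and the bisection inversion are assumptions): pick scenarios at evenly spaced quantiles rather than i.i.d., and reuse this small representative sample for every evaluation of the objective.

#include <cmath>
#include <vector>

// Representative scenarios: one value per evenly spaced quantile of a
// normal(mean, stddev) scenario distribution, instead of i.i.d. draws.
std::vector<double> representativeScenarios(int n, double mean, double stddev) {
  std::vector<double> scenarios;
  for (int i = 0; i < n; ++i) {
    double u = (i + 0.5) / n;                       // midpoint of the i-th stratum
    double lo = mean - 10.0 * stddev, hi = mean + 10.0 * stddev;
    for (int it = 0; it < 60; ++it) {               // invert the normal CDF by bisection
      double mid = 0.5 * (lo + hi);
      double cdf = 0.5 * std::erfc(-(mid - mean) / (stddev * std::sqrt(2.0)));
      (cdf < u ? lo : hi) = mid;
    }
    scenarios.push_back(0.5 * (lo + hi));
  }
  return scenarios;
}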

Page 21: Direct policy search

Roadmap:

- default policy for energy management problems: test, generalize, formalize, simplify...

- this default policy ==> a parametric policy

- test in DPS: strategy A

- interface DPS with NewUoa and/or others (openDP opt?)

- Strategy A: test into MCTS ==> Strategy B

==> IMHO, strategy A = a good tool for fast, readable, non-myopic results

==> IMHO, strategy B = good for combining A with the efficiency of MCTS for short-term combinatorial effects.

- Also, validating the partial observation (sounds good).