Direct Policy Search

DESCRIPTION

Direct Policy Search (in short) + discussion of applications to stock problems

TRANSCRIPT
0. What is Direct Policy Search?
1. Direct Policy Search: Parametric Policies for Financial Applications
2. Parametric Bellman values for Stock Problems
3. Direct Policy Search: Optimization Tools
DIRECT POLICY SEARCH
First, you need to know what direct policy search (DPS) is.
Principle of DPS:
(1) Define a parametric policy Pi with parameters t1,...,tk.
(2) Maximize, over (t1,...,tk), the average reward obtained when applying policy Pi(t1,...,tk) on the problem.
==> You must define Pi
==> You must choose a noisy optimization algorithm
==> There is a default Pi (an actor neural network), but it's only a default solution (overload it)
(A minimal sketch of this loop is given below.)
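A minimal sketch of this loop, on a toy one-stock problem (everything below is illustrative, not the actual framework): the policy is parametric, the objective is the average reward over simulated scenarios, and that noisy objective is what the optimizer maximizes.

#include <random>
#include <vector>

// Toy one-stock problem (purely illustrative, not the real simulator):
// at each step we see a random price, must serve a demand of 1 kWh,
// and the parametric policy pi(theta) buys theta[1] kWh whenever
// price < theta[0].  Reward = -(purchase cost + shortage penalties).
double simulateEpisode(const std::vector<double>& theta, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> price(10.0, 50.0);
    double stock = 0.0, reward = 0.0;
    for (int t = 0; t < 100; ++t) {
        double p = price(gen);
        if (p < theta[0]) {                  // policy: buy when cheap
            stock += theta[1];
            reward -= p * theta[1];
        }
        if (stock >= 1.0) stock -= 1.0;      // serve the demand
        else reward -= 100.0;                // shortage penalty
    }
    return reward;
}

// Noisy objective for DPS: average reward of pi(theta) over scenarios.
double averageReward(const std::vector<double>& theta, int nbScenarios) {
    double total = 0.0;
    for (int i = 0; i < nbScenarios; ++i)
        total += simulateEpisode(theta, 1000u + i);
    return total / nbScenarios;
}

// DPS then maximizes averageReward(theta) with a noisy optimization
// algorithm (evolution strategies, cross-entropy, NewUOA, ...).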
Strengths of DPS:
- Good warm start: if I have a solution for problem A, and if I switch to a problem B close to A, then I quickly get good results.
- Benefits from expert knowledge on the structure
- No constraint on the structure of the objective function
- Anytime (i.e. not that bad in restricted time)
Drawbacks:
- needs structured direct policy search
- not directly applicable to partial observation
virtual MashDecision computeDecision(MashState& state, const Vector<double>& params);
==> “params” = t1,...,tk
==> returns the decision pi(t1,...,tk, state)
Does it make sense?
Overload this function, and DPS is ready to work.
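For instance, a hypothetical overload could look like this (the Mash types below are placeholder stand-ins, since the real interface is not shown here); it implements a two-parameter threshold policy:

#include <vector>

// Placeholder stand-ins for the real Mash types (not the actual interface).
struct MashState    { double stock; double price; };
struct MashDecision { double quantityToBuy; };
template <class T> using Vector = std::vector<T>;

struct Policy {
    virtual MashDecision computeDecision(MashState& state,
                                         const Vector<double>& params) = 0;
    virtual ~Policy() = default;
};

// Hypothetical overload: params[0] = price threshold,
// params[1] = quantity bought when the price is below the threshold.
struct ThresholdPolicy : Policy {
    MashDecision computeDecision(MashState& state,
                                 const Vector<double>& params) override {
        return MashDecision{ state.price < params[0] ? params[1] : 0.0 };
    }
};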
Well, DPS (somewhere between alpha and beta) might be full of bugs :-)
Direct Policy Search: Parametric Policies for Financial Applications
Bengio et al.'s papers on DPS for financial applications
Stocks (various assets) + Cash
decision = tradingUnit(A, prevision(B, data))
Where:
- tradingUnit is designed by human experts
- prevision's outputs are chosen by human experts
- prevision is a neural network
- A and B are parameters
Then: B optimized by LMS (a prediction criterion) ==> poor results, little correlation between LMS and financial performance.
A and B optimized on the expected return (by DPS) ==> much better. (A minimal sketch of this composition follows below.)
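A minimal sketch of this composition (function shapes and names are my guesses from the description, not Bengio et al.'s actual code): prevision and tradingUnit carry the parameter blocks B and A, and DPS maximizes the average return of the composed policy on historical scenarios.

#include <cstddef>
#include <vector>

using Params = std::vector<double>;
using Series = std::vector<double>;

// prevision(B, data): here a linear predictor over recent prices
// (the papers use a neural network instead).
double prevision(const Params& B, const Series& recentPrices) {
    double out = B[0];
    for (std::size_t i = 0; i + 1 < B.size() && i < recentPrices.size(); ++i)
        out += B[i + 1] * recentPrices[i];
    return out;
}

// tradingUnit(A, forecast): an expert-designed rule, here a simple
// thresholded position size.
double tradingUnit(const Params& A, double forecast) {
    if (forecast >  A[0]) return  A[1];   // go long
    if (forecast < -A[0]) return -A[1];   // go short
    return 0.0;                           // stay in cash
}

// DPS objective: average return of the composed policy on historical data
// (no simulator needed, since the policy has no impact on prices).
double averageReturn(const Params& A, const Params& B,
                     const std::vector<Series>& histories,
                     const std::vector<double>& nextReturns) {
    double total = 0.0;
    for (std::size_t t = 0; t < histories.size(); ++t)
        total += tradingUnit(A, prevision(B, histories[t])) * nextReturns[t];
    return total / histories.size();
}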
- Can be applied on data sets (no simulator, no elasticity model) because the policy has no impact on prices
- 22 params in the first paper
- reduced weight sharing in the other paper ==> ~800 parameters (if I understand correctly)
- there exist much bigger DPS (Sigaud et al., 27,000 parameters)
- NB: this is noisy optimization
An Alternate Solution: Parametric Bellman Values for Stock Problems
What is a Bellman function?
V(s): expected future benefit if playing optimally from state s.
V(s) is useful for playing optimally.
Rule for an optimal decision:
d(s) = argmax_d [ r(s,d) + V(s') ]
where:
- s' = nextState(s,d)
- d(s): optimal decision in state s
- V(s'): Bellman value in state s'
- r(s,d): reward associated to decision d in state s
Remark 1: V(s) known up to an additive constant is enough
Remark 2: dV(s)/ds_i is the price of stock i
Example with one stock, soon.
Q-rule for an optimal decision:
d(s) = argmax_d Q(s,d)
- d(s): optimal decision in state s
- Q(s,d): optimal future reward if decision = d in state s
==> approximate Q instead of V
==> we don't need r(s,d) nor newState(s,d)
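For a finite set of candidate decisions, the two rules above can be written as follows (a generic sketch; the types and helper functions are placeholders, not an existing API):

#include <functional>
#include <limits>
#include <vector>

// V-rule: d(s) = argmax_d [ r(s,d) + V(nextState(s,d)) ]
// (needs the model: r(s,d) and nextState(s,d)).
template <class State, class Decision>
Decision decideWithV(const State& s,
                     const std::vector<Decision>& decisions,
                     std::function<double(const State&, const Decision&)> r,
                     std::function<State(const State&, const Decision&)> nextState,
                     std::function<double(const State&)> V) {
    Decision best = decisions.front();
    double bestScore = -std::numeric_limits<double>::infinity();
    for (const Decision& d : decisions) {
        double score = r(s, d) + V(nextState(s, d));
        if (score > bestScore) { bestScore = score; best = d; }
    }
    return best;
}

// Q-rule: d(s) = argmax_d Q(s,d) -- no model needed.
template <class State, class Decision>
Decision decideWithQ(const State& s,
                     const std::vector<Decision>& decisions,
                     std::function<double(const State&, const Decision&)> Q) {
    Decision best = decisions.front();
    double bestScore = -std::numeric_limits<double>::infinity();
    for (const Decision& d : decisions) {
        double score = Q(s, d);
        if (score > bestScore) { bestScore = score; best = d; }
    }
    return best;
}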
[Figure: V(stock) (in euros) as a function of stock (in kWh). Low stock: "I need a lot of stock! I accept to pay a lot." High stock: "I have enough stock; I pay only if it's cheap." Slope = marginal price (euros/kWh).]
Examples:
For one stock:
- very simple: constant price
- piecewise linear (can ensure convexity)
- "tanh" function
- neural network, SVM, sum of Gaussians...
For several stocks:
- each stock separately
- 2-dimensional: V(s1,s2,s3) = V'(s1,S) + V''(s2,S) + V'''(s3,S), where S = a1.s1 + a2.s2 + a3.s3
- neural network, SVM, sum of Gaussians...
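As an illustration, here is a minimal sketch of a piecewise-linear V for one stock (the parameter layout is my own choice, not from the talk): the parameters are the slopes of successive segments, i.e. the marginal prices of the figure above.

#include <algorithm>
#include <vector>

// Piecewise-linear V(stock) for one stock: params = slopes on successive
// segments of width `step` (slope i = marginal price on segment i, in euros/kWh).
double piecewiseLinearV(double stock, const std::vector<double>& slopes,
                        double step) {
    double value = 0.0, remaining = stock;
    for (double slope : slopes) {
        double used = std::min(remaining, step);
        value += slope * used;
        remaining -= used;
        if (remaining <= 0.0) break;
    }
    return value;
}

// Constraining the slopes to be monotone enforces the convexity/concavity
// of V (the marginal price changes monotonically with the stock level).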
How to choose the coefficients?
- dynamic programming: robust, but slow in high dimension
- direct policy search:
  - initializing coefficients from expert advice
  - or: supervised machine learning for approximating an expert advice
  ==> and then optimize
Conclusions:
V: very convenient representation of a policy: we can view prices.
Q: some advantages (model-free).
Yet, less readable than direct rules.
And expensive: we need one optimization to make the decision, at each time step of a simulation.
==> but this optimization can be a simple sort (as a first approximation).
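One possible reading of that remark (an assumption on my part, not stated in the talk): when the slopes of V are marginal prices, the per-step decision can be approximated by a merit-order rule, i.e. sorting the candidate resources by marginal cost and using the cheapest first.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Unit { double marginalCost; double capacity; };  // euros/kWh, kWh

// Merit-order dispatch as the "simple sort": sort candidate resources
// (including "use stock i", whose marginal cost is the slope dV/ds_i)
// by marginal cost, then serve the demand cheapest-first.
std::vector<double> dispatch(std::vector<Unit> units, double demand) {
    std::sort(units.begin(), units.end(),
              [](const Unit& a, const Unit& b) {
                  return a.marginalCost < b.marginalCost;
              });
    std::vector<double> used(units.size(), 0.0);
    for (std::size_t i = 0; i < units.size() && demand > 0.0; ++i) {
        used[i] = std::min(units[i].capacity, demand);
        demand -= used[i];
    }
    return used;  // kWh taken from each unit, in merit order
}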
Simpler? Adrien has a parametric strategy for stocks
==> we should see how to generalize it
==> transformation "constants → parameters" ==> DPS
Questions (strategic decisions for the DPS):
- start with Adrien's policy, improve it, generalize it, parametrize it? interface with ARM?
- or another strategy?
- or a parametric V function, assuming we have r(s,d) and newState(s,d) (often true)?
- or a parametric Q function? (more generic, unusual but appealing, but neglects some existing knowledge: r(s,d) and newState(s,d))
Further work:
- finish the validation of Adrien's policy on stock (better than random as a policy; better than random as a UCT-Monte-Carlo)
- generalize? variants?
- introduce into DPS, compare to the baseline (neural net)
- introduce DPS's result into MCTS
Direct Policy Search: Optimization Tools & Optimization Tricks
- Classical tools: Evolution Strategies, Cross-Entropy, PSO, ...
==> more or less supposed to be robust to local minima
==> no gradient needed
==> robust to noisy objective functions
==> weak in high dimension (but: see locality, next slide)
- Hopefully: good initialization (nearly convex), random seeds (no noise)
==> NewUOA is my favorite choice:
- no gradient needed
- can "really" work in high dimension
- update rule surprisingly fast
- people who try to show that their algorithm is better than NewUOA suffer a lot in the noise-free case
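Not NewUOA itself (Powell's derivative-free method, whose internals are beyond this sketch), but here is a minimal example of the kind of gradient-free, noise-tolerant optimizer DPS plugs into: a (1+1) evolution strategy with resampling. The averageReward function sketched earlier could be passed as noisyObjective.

#include <functional>
#include <random>
#include <vector>

// Minimal (1+1) evolution strategy for a noisy objective (to be maximized).
// Each candidate is re-evaluated on several scenarios to average out noise.
std::vector<double> onePlusOneES(
        std::function<double(const std::vector<double>&)> noisyObjective,
        std::vector<double> theta, double sigma, int iterations) {
    std::mt19937 gen(0);
    std::normal_distribution<double> gauss(0.0, 1.0);
    auto score = [&](const std::vector<double>& x) {
        double s = 0.0;
        for (int i = 0; i < 10; ++i) s += noisyObjective(x);  // resampling
        return s / 10.0;
    };
    double best = score(theta);
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> candidate = theta;
        for (double& x : candidate) x += sigma * gauss(gen);   // Gaussian mutation
        double v = score(candidate);
        if (v >= best) { best = v; theta = candidate; sigma *= 1.5; }
        else           { sigma *= 0.9; }   // step-size adaptation (1/5th-rule-like)
    }
    return theta;
}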
Improvements of optimization algorithms:
- active learning: when optimizing on scenarios, choose "good" scenarios
==> maybe "quasi-randomization"? Just choosing a representative sample of scenarios. ==> simple, robust...
- local improvement: when a gradient step/update is performed, only update the variables concerned by the simulation you've used for generating the update
==> difficult to use in NewUOA
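A minimal illustration of the "representative sample of scenarios" idea (the index-selection rule here is my own simplification, not the actual active-learning scheme): reuse the same small, evenly spread subset of scenarios for every candidate policy, so that candidates are compared without evaluation noise.

#include <cstddef>
#include <vector>

// Pick a small, evenly spread ("quasi-random") subset of scenario indices,
// reused for every candidate policy so that comparisons are noise-free.
std::vector<std::size_t> representativeSubset(std::size_t nbScenarios,
                                              std::size_t sampleSize) {
    std::vector<std::size_t> subset;
    for (std::size_t i = 0; i < sampleSize; ++i)
        subset.push_back((i * nbScenarios) / sampleSize);  // evenly spaced
    return subset;
}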
Roadmap:
- default policy for energy management problems: test, generalize, formalize, simplify...
- this default policy ==> a parametric policy
- test in DPS: strategy A
- interface DPS with NewUOA and/or others (openDP opt?)
- Strategy A: test inside MCTS ==> Strategy B
==> IMHO, strategy A = good tool for fast, readable, non-myopic results
==> IMHO, strategy B = good for combining A with the efficiency of MCTS for short-term combinatorial effects
- Also, validating the partial observation (sounds good).