Continuous-time mean-variance portfolio selection: A reinforcement learning framework


Page 1:

Continuous-time mean-variance portfolio selection: A reinforcement learning framework

(Joint work with X.-Y. Zhou (Columbia))

Workshop on Fintech and Machine Learning, IMS/RMI, NUS

August 2019

Haoran Wang, Columbia University

Page 2:

Related Works

- Exploration versus exploitation in reinforcement learning: A stochastic control approach (with T. Zariphopoulou and X.-Y. Zhou), arXiv, submitted, 2019.
- Continuous-time mean-variance portfolio selection: A reinforcement learning framework (with X.-Y. Zhou), arXiv, submitted, 2019.
- Large scale continuous-time mean-variance portfolio allocation via reinforcement learning, arXiv, 2019.

Page 3:

Motivations

Formulation

The Exploratory Mean-Variance (MV) problem

Effect and Cost of Exploration

The EMV Algorithm

Large Scale Empirical Tests

Conclusions


Page 4:

Motivations

Reinforcement Learning meets quantitative finance

- Reinforcement learning (RL): an active and fast developing subarea of machine learning
- An RL agent does not pre-specify a structural model; she learns the best strategies based on trial and error, through interactions with the black-box environment (e.g. the market)
- This is in direct contrast with the econometric methods and supervised/unsupervised learning methods commonly used in quantitative finance research
- The agent's actions (controls) serve both as a means to explore (learn) and as a way to exploit (optimize)
- A natural and crucial question: the trade-off between exploration of uncharted territory and exploitation of existing knowledge


Page 9:

Motivations

Literature

- Extensive studies on trading off exploitation and exploration:
  - multi-armed bandit problems: Gittins index (Gittins (1974)), Thompson sampling (Thompson (1933)), theoretical optimality (Russo and Van Roy (2013, 2014)), ...
  - general RL problems: Brafman and Tennenholtz (2002), Strehl and Littman (2008), Strehl et al. (2009), ...
- Most existing works do not include exploration in the optimization objective
- Entropy-regularized RL formulations in discrete time explicitly incorporate exploration into the optimization objective, with a trade-off weight on the entropy of the exploration strategy (Ziebart et al. (2008), Nachum et al. (2017a), Fox et al. (2015), ...)
- Applications of RL to quantitative finance mostly focus on risk-neutral decision making, including optimal execution (Nevmyvaka et al. (2006), Hendricks and Wilcox (2014)) and portfolio management (Moody and Saffell (2001), Moody et al. (1998)), ...

Page 10:

Motivations

This Paper...

- Study the trade-off between exploration and exploitation for RL in a continuous-time mean-variance (MV) portfolio optimization setting
- Continuous control (action) and state (feature) spaces
- Model situations in which agents can interact with markets at ultra-high frequency, aided by modern computing resources (e.g. high frequency trading)
- Elegant and insightful results are possible once cast in continuous time, thanks to the tools of stochastic calculus, differential equations and stochastic control
- Design an RL algorithm based on theoretical foundations and maintain interpretability
- Compare our algorithm with other state-of-the-art methods for solving the MV problem


Page 16:

Formulation

Classical Stochastic Control

- A filtered probability space (Ω, F, P; {F_t}_{t≥0}) and an {F_t}_{t≥0}-Brownian motion W = {W_t, t ≥ 0}
- Action space U: representing constraints on an agent's decisions ("controls" or "actions")
- Admissible control u = {u_t, t ≥ 0}: an {F_t}_{t≥0}-adapted measurable process taking values in U
- 𝒰: the set of all admissible controls
- State (or "feature") dynamics

  dx^u_t = b(x^u_t, u_t) dt + σ(x^u_t, u_t) dW_t,  t > 0   (1)

- Objective: achieve the maximum expected total discounted reward, represented by the value function

  w(x) := sup_{u ∈ 𝒰} E[ ∫_0^∞ e^{-ρt} r(x^u_t, u_t) dt | x^u_0 = x ],   (2)

  where r is the reward function and ρ > 0 is the discount rate
- Dynamic programming applies when the model is fully known (Fleming and Soner (1992), Yong and Zhou (1998)); a simulation sketch of (1)-(2) follows this slide
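To make (1)-(2) concrete, here is a minimal simulation sketch (mine, not from the slides): it discretizes the state SDE with an Euler-Maruyama scheme and Monte Carlo estimates the discounted reward for one fixed, hand-picked control rule. The drift b, volatility σ, reward r, discount rate and the control rule are all illustrative assumptions, not the model used later in the talk.

# Minimal illustration of (1)-(2): Euler-Maruyama for dx = b dt + sigma dW and a
# Monte Carlo estimate of E[ int_0^inf e^{-rho t} r(x_t, u_t) dt ] for ONE fixed
# (non-optimal) control rule. All model choices are illustrative assumptions.
import math, random

rho = 0.5                                  # discount rate
dt, horizon, n_paths = 0.01, 10.0, 500     # truncate the infinite horizon at `horizon`

def b(x, u):        # drift, illustrative
    return -0.5 * x + u

def sigma(x, u):    # volatility, illustrative
    return 0.2

def reward(x, u):   # running reward, illustrative
    return -(x ** 2) - 0.1 * (u ** 2)

def control(x):     # a fixed feedback rule, illustrative
    return 0.3 * x

def discounted_reward(x0):
    x, t, total = x0, 0.0, 0.0
    while t < horizon:
        u = control(x)
        total += math.exp(-rho * t) * reward(x, u) * dt
        x += b(x, u) * dt + sigma(x, u) * math.sqrt(dt) * random.gauss(0.0, 1.0)
        t += dt
    return total

estimate = sum(discounted_reward(1.0) for _ in range(n_paths)) / n_paths
print("Monte Carlo estimate of the objective (2) for this control:", round(estimate, 4))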

Page 17:

Formulation

Exploration in RL

- In the RL setting, when the model is not known, the agent engages in exploration to interact with and learn the unknown environment through trial and error
- This exploration is modeled by a distribution of controls π = {π_t(u), t ≥ 0} over the control space U, from which each "trial" is sampled
- The notion of controls is thus extended to distributions
- The agent executes a control for N rounds; at each round, a classical control is sampled from the distribution π
- The reward estimate of such a policy becomes accurate enough when N is large
- Policy evaluation (Sutton and Barto (2018))


Page 23:

Formulation

Policy Evaluation

- Let us explain for the special case r(x^u_t, u_t) = r(u_t) (known as the continuous-armed bandit problem)
- Consider N identical, independent rounds of the control problem: at round i, i = 1, 2, ..., N, a control u^i is sampled under π and executed for its corresponding copy of the control problem (1)-(2)
- At each t, by the law of large numbers (and under certain mild technical conditions), the average reward over [t, t + ∆t] satisfies

  (1/N) Σ_{i=1}^N e^{-ρt} r(u^i_t) ∆t  →  E[ e^{-ρt} ∫_U r(u) π_t(u) du ∆t ]  a.s., as N → ∞

- This suggests that the reward function under exploration should be revised to

  E[ e^{-ρt} ∫_U r(u) π_t(u) du ]

  (a small Monte Carlo sketch of this limit follows this slide)
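A quick numerical illustration of the law-of-large-numbers limit on this slide (a toy example of my own, not from the talk): rewards of controls sampled from a fixed Gaussian π are averaged and compared with ∫_U r(u) π_t(u) du computed by a simple Riemann sum. The reward r and the policy parameters are illustrative assumptions.

# Sample-average reward under a fixed policy pi vs. the integral int_U r(u) pi(u) du.
# r and pi are illustrative choices; the point is only the N -> infinity limit.
import math, random

def r(u):                       # reward of pulling "arm" u, illustrative
    return 1.0 - (u - 0.4) ** 2

mu, sd = 0.0, 0.5               # a fixed Gaussian exploration distribution pi

def pi_density(u):
    return math.exp(-0.5 * ((u - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# "Exact" value of int r(u) pi(u) du by a Riemann sum on [-4, 4]
n_grid = 20000
exact = sum(r(-4.0 + 8.0 * (k + 0.5) / n_grid) * pi_density(-4.0 + 8.0 * (k + 0.5) / n_grid)
            for k in range(n_grid)) * (8.0 / n_grid)

for N in (10, 1000, 100000):
    avg = sum(r(random.gauss(mu, sd)) for _ in range(N)) / N
    print(f"N = {N:6d}:  sample average = {avg:.4f},  integral = {exact:.4f}")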


Page 27:

Formulation

Exploratory Control Formulation (W., Zariphopoulou, Zhou, 2019)

- In a similar fashion, we propose the exploratory state dynamics, a controlled stochastic differential equation (SDE)

  dX^π_t = b̃(X^π_t, π_t) dt + σ̃(X^π_t, π_t) dW_t,  t > 0;  X^π_0 = x,   (3)

  where

  b̃(X^π_t, π_t) := ∫_U b(X^π_t, u) π_t(u) du,   (4)

  and

  σ̃(X^π_t, π_t) := sqrt( ∫_U σ²(X^π_t, u) π_t(u) du )   (5)

- Exploratory reward function

  r̃(X^π_t, π_t) := ∫_U r(X^π_t, u) π_t(u) du   (6)

- This exploratory formulation coincides with relaxed control in the control literature: El Karoui et al. (1987), Kurtz and Stockbridge (1998, 2001), Yong and Zhou (1998), ...
- The resurgence of relaxed control here is motivated by exploration in RL (a numerical sketch of (4)-(5) follows this slide)
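The following sketch (my own illustration, under assumed coefficients) computes the exploratory drift and volatility in (4)-(5) for a fixed state x and a Gaussian policy, both by quadrature and by sampling controls from the policy; b and σ below are illustrative and not the MV specification used later in the talk.

# Numerical illustration of the exploratory coefficients (4)-(5): for a fixed state x
# and a Gaussian policy pi, compute b_tilde = int b(x,u) pi(u) du and
# sigma_tilde = sqrt( int sigma(x,u)^2 pi(u) du ) by quadrature and by sampling.
import math, random

def b(x, u):                      # illustrative drift coefficient
    return 0.1 * x + 0.5 * u

def sig(x, u):                    # illustrative volatility coefficient
    return 0.2 * (1.0 + abs(u))

x, mu, sd = 1.0, 0.3, 0.4         # state and Gaussian policy parameters (illustrative)

def pi_density(u):
    return math.exp(-0.5 * ((u - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

lo, hi, n = mu - 8 * sd, mu + 8 * sd, 40000
du = (hi - lo) / n
us = [lo + (k + 0.5) * du for k in range(n)]
b_tilde = sum(b(x, u) * pi_density(u) for u in us) * du
var_tilde = sum(sig(x, u) ** 2 * pi_density(u) for u in us) * du

samples = [random.gauss(mu, sd) for _ in range(200000)]
b_mc = sum(b(x, u) for u in samples) / len(samples)
var_mc = sum(sig(x, u) ** 2 for u in samples) / len(samples)

print("b_tilde    : quadrature %.4f  vs  Monte Carlo %.4f" % (b_tilde, b_mc))
print("sigma_tilde: quadrature %.4f  vs  Monte Carlo %.4f"
      % (math.sqrt(var_tilde), math.sqrt(var_mc)))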


Page 31:

Formulation

Entropy-Regularized Stochastic Control

- If the model is fully known, then exploration is not needed and the optimal control distribution degenerates into a Dirac measure under Roxin's condition
- In the RL context we need to add a "regularization term" to encourage exploration
- Use Shannon's differential entropy to measure the degree of exploration:

  H(π_t) := −∫_U π_t(u) ln π_t(u) du,

  where π_t is a control distribution
- Entropy-regularized value function

  V(x) := sup_{π ∈ A(x)} E[ ∫_0^∞ e^{-ρt} ( ∫_U r(X^π_t, u) π_t(u) du − λ ∫_U π_t(u) ln π_t(u) du ) dt | X^π_0 = x ],   (7)

  where λ > 0 is an exogenous temperature parameter
- A(x): the set of admissible control distributions (a short entropy computation follows this slide)
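As a small self-contained check of the regularizer in (7) (my own illustration), the code below evaluates the differential entropy H(π) of a Gaussian policy by quadrature and compares it with the closed form ln(sd·√(2πe)), which is the value used later when computing the cost of exploration. The standard deviations tried are arbitrary.

# Differential entropy H(pi) = -int pi ln pi for a Gaussian, by quadrature vs. the
# closed form ln(sd * sqrt(2*pi*e)); this is the quantity rewarded (weight lambda) in (7).
import math

def gaussian_entropy_numeric(sd, n=200000, width=10.0):
    lo, hi = -width * sd, width * sd
    du = (hi - lo) / n
    h = 0.0
    for k in range(n):
        u = lo + (k + 0.5) * du
        p = math.exp(-0.5 * (u / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        if p > 0.0:
            h -= p * math.log(p) * du
    return h

for sd in (0.1, 0.5, 2.0):
    closed = math.log(sd * math.sqrt(2.0 * math.pi * math.e))
    print(f"sd = {sd}: numeric H = {gaussian_entropy_numeric(sd):.4f}, closed form = {closed:.4f}")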


Page 36:

Formulation

Admissible Control Distributions

- B(U): the Borel σ-algebra on U
- P(U): the set of probability measures on U that are absolutely continuous with respect to the Lebesgue measure
- The admissible set A(x) contains all measure-valued processes π = {π_t, t ≥ 0} satisfying:
  i) for each t ≥ 0, π_t ∈ P(U) a.s.;
  ii) for each A ∈ B(U), {π_t(A), t ≥ 0} is F_t-progressively measurable;
  iii) the stochastic differential equation (3) has a unique solution X^π = {X^π_t, t ≥ 0} if π is applied;
  iv) the expectation on the right-hand side of (7) is finite

Page 37:

The Exploratory Mean-Variance (MV) problem

A special case: the MV problem

- We focus on the 1-d case for notational simplicity, i.e., one risky asset and one riskless asset
- The price of the risky asset follows the geometric Brownian motion

  dS_t = S_t (µ dt + σ dW_t),  0 ≤ t ≤ T,

  with S_0 = s > 0 being the initial price at t = 0, and µ ∈ R, σ > 0
- The riskless asset has interest rate r > 0
- In practice, the mean µ, the volatility σ and the Sharpe ratio ρ = (µ − r)/σ may be slowly time-varying, unknown stochastic processes for the RL algorithm to be designed

Page 38:

The Exploratory Mean-Variance (MV) problem

The classical MV problem

- Denote by {x^u_t, 0 ≤ t ≤ T} the discounted wealth process
- Denote by u = {u_t, 0 ≤ t ≤ T} the discounted dollar value put in the risky asset at time t
- Under the self-financing condition, the wealth process satisfies

  dx^u_t = σ u_t (ρ dt + dW_t),  0 ≤ t ≤ T,

  with initial endowment x^u_0 = x ∈ R
- The classical continuous-time MV problem is to solve

  min_u Var[x^u_T]
  s.t. E[x^u_T] = z,

  with z being the targeted return chosen at t = 0 (a simulation sketch of this wealth dynamics follows this slide)
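The sketch below (my own illustration, with assumed market parameters) simulates the wealth SDE dx_t = σ u_t (ρ dt + dW_t) under a few constant dollar allocations u and reports the sample mean and variance of x_T, which is exactly the trade-off the MV criterion balances.

# Simulate dx_t = sigma * u * (rho dt + dW_t) under constant allocations u and report
# mean/variance of x_T. Market and allocation parameters are illustrative only.
import math, random, statistics

rho, sigma = 0.4, 0.2            # Sharpe ratio and volatility (assumed)
T, dt, x0 = 1.0, 1.0 / 250, 1.0
n_paths = 2000

def terminal_wealth(u):
    x = x0
    for _ in range(int(T / dt)):
        x += sigma * u * (rho * dt + math.sqrt(dt) * random.gauss(0.0, 1.0))
    return x

for u in (0.5, 1.0, 2.0):
    xs = [terminal_wealth(u) for _ in range(n_paths)]
    # theory: E[x_T] = x0 + sigma*u*rho*T, Var[x_T] = sigma^2 * u^2 * T
    print(f"u = {u}: mean(x_T) = {statistics.mean(xs):.4f}, var(x_T) = {statistics.variance(xs):.4f}")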

Page 39:

The Exploratory Mean-Variance (MV) problem

The solution technique

- By introducing a Lagrange multiplier 2w, the problem turns into solving

  min_u E[(x^u_T)²] − z² − 2w (E[x^u_T] − z)
  = min_u E[(x^u_T − w)²] − (w − z)²

  (the algebra behind this identity is spelled out after this slide)
- The constraint E[x^{u*}_T] = z determines the value of w
- In practice, implementing the MV optimal allocation {u*_t, t ∈ [0, T]} requires (real-time) estimation of µ and σ
- It is, however, challenging to obtain accurate estimates of the mean return vector (aka the mean-blur problem)
- The allocation is often sensitive to the input estimators, due to the inversion of the ill-conditioned variance-covariance matrix
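For the record, the identity used on this slide is just completing the square: expanding (x^u_T − w)² and using −(w − z)² = −w² + 2wz − z² gives

  E[(x^u_T − w)²] − (w − z)²
    = E[(x^u_T)²] − 2w E[x^u_T] + w² − (w² − 2wz + z²)
    = E[(x^u_T)²] − z² − 2w (E[x^u_T] − z),

so the two minimization problems differ only by terms that do not depend on the control u.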

Page 40:

The Exploratory Mean-Variance (MV) problem

The exploratory MV problem

- The exploratory state dynamics follows from the previous discussion as

  dX^π_t = ρσ µ_t dt + σ sqrt(µ²_t + σ²_t) dW_t,

  with X^π_0 = x, where

  µ_t := ∫_R u π_t(u) du  and  σ²_t := ∫_R u² π_t(u) du − µ²_t

  are the mean and variance processes of the control distribution {π_t, t ∈ [0, T]}
- The objective is to solve

  inf_π E[ (X^π_T − w)² + λ ∫_0^T ∫_R π_t(u) ln π_t(u) du dt | X^π_0 = x ] − (w − z)²,

  subject to the constraint E[X^{π*}_T] = z

  (an Euler-Maruyama sketch of these dynamics follows this slide)
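Here is a small Euler-Maruyama sketch (my own illustration) of the exploratory wealth dynamics above under a fixed Gaussian policy with constant mean m and variance s²; the market and policy parameters are assumed, and the optimal time-varying policy only appears on the later slides.

# Euler-Maruyama for dX_t = rho*sigma*mu_t dt + sigma*sqrt(mu_t^2 + s_t^2) dW_t under a
# FIXED Gaussian policy (mu_t = m, s_t^2 = s2). Parameters are illustrative assumptions.
import math, random, statistics

rho, sigma = 0.4, 0.2
T, dt, x0 = 1.0, 1.0 / 250, 1.0
m, s2 = 1.0, 0.25                 # fixed policy mean and variance (assumed)
n_paths = 4000

def terminal_X():
    X = x0
    drift = rho * sigma * m
    vol = sigma * math.sqrt(m * m + s2)
    for _ in range(int(T / dt)):
        X += drift * dt + vol * math.sqrt(dt) * random.gauss(0.0, 1.0)
    return X

Xs = [terminal_X() for _ in range(n_paths)]
print("E[X_T]  ~", round(statistics.mean(Xs), 4),
      " (theory:", round(x0 + rho * sigma * m * T, 4), ")")
print("Var[X_T] ~", round(statistics.variance(Xs), 4),
      " (theory:", round(sigma ** 2 * (m * m + s2) * T, 4), ")")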

Page 41:

The Exploratory Mean-Variance (MV) problem

Dynamic Programming and HJB Equation

- Bellman's principle of optimality:

  V(t, x; w) = inf_{π ∈ A(t,x)} E[ V(s, X^π_s; w) + λ ∫_t^s ∫_R π_v(u) ln π_v(u) du dv | X^π_t = x ]

  for x ∈ R and 0 ≤ t < s ≤ T
- V satisfies the Hamilton-Jacobi-Bellman (HJB) equation

  v_t(t, x; w) + min_{π ∈ P(R)} ∫_R ( ½ σ²u² v_xx(t, x; w) + ρσu v_x(t, x; w) + λ ln π(u) ) π(u) du = 0,   (8)

  with terminal condition v(T, x; w) = (x − w)² − (w − z)²

  (the pointwise minimization in (8) is worked out after this slide)
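The Gaussian optimizer announced on the next slide comes from the pointwise minimization inside (8); here is a sketch of the standard variational argument, stated for completeness. Write f(u) := ½ σ²u² v_xx + ρσu v_x and minimize ∫_R (f(u) + λ ln π(u)) π(u) du over densities π. The first-order condition with a multiplier η for the constraint ∫_R π(u) du = 1 reads

  f(u) + λ ln π(u) + λ + η = 0,  hence  π(u) ∝ exp(−f(u)/λ).

When v_xx > 0, completing the square in the exponent,

  ½ σ²v_xx u² + ρσ v_x u = ½ σ²v_xx ( u + (ρ/σ) v_x/v_xx )² − ρ² v_x² / (2 v_xx),

shows that this π is the Gaussian with mean −(ρ/σ) v_x/v_xx and variance λ/(σ² v_xx), which is exactly (9) on the next slide.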


Page 43:

The Exploratory Mean-Variance (MV) problem

Optimal Control Distribution

- The optimal control is obtained by minimizing the integral term in the HJB equation (8)
- Note that π ∈ P(U) if and only if

  ∫_U π(u) du = 1  and  π(u) ≥ 0 a.e. on U

- Solving this (constrained) minimization problem in the HJB equation yields the "feedback-type" optimizer

  π*(u; t, x, w) = exp( −(1/λ)( ½ σ²u² v_xx(t, x; w) + ρσu v_x(t, x; w) ) ) / ∫_R exp( −(1/λ)( ½ σ²u² v_xx(t, x; w) + ρσu v_x(t, x; w) ) ) du
                 = N( u | −(ρ/σ) v_x(t, x; w)/v_xx(t, x; w), λ/(σ² v_xx(t, x; w)) ),   (9)

  where N(u | α, β) denotes the Gaussian density function with mean α ∈ R and variance β > 0 (a numerical check of (9) follows this slide)
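A quick numerical sanity check of (9) (my own, with made-up values of v_x, v_xx and the market parameters): restrict to Gaussian candidates N(mean, var), for which the HJB integrand has the closed form ½σ²(mean² + var)v_xx + ρσ·mean·v_x − λ·½ln(2πe·var), and confirm that a grid search over (mean, var) lands on the closed-form minimizer.

# Grid check that, among Gaussians, the HJB integrand is minimized at
# mean = -(rho/sigma)*v_x/v_xx and var = lambda/(sigma^2*v_xx). Values are illustrative.
import math

rho, sigma, lam = 0.4, 0.2, 1.0
v_x, v_xx = -1.5, 2.0            # illustrative derivatives with v_xx > 0

def G(mean, var):
    expected_f = 0.5 * sigma ** 2 * (mean ** 2 + var) * v_xx + rho * sigma * mean * v_x
    neg_entropy = -0.5 * math.log(2.0 * math.pi * math.e * var)   # = int pi ln pi
    return expected_f + lam * neg_entropy

m_star = -(rho / sigma) * v_x / v_xx
var_star = lam / (sigma ** 2 * v_xx)

best = min(((G(m, v), m, v)
            for m in [m_star + 0.05 * k for k in range(-40, 41)]
            for v in [var_star * (1.0 + 0.05 * k) for k in range(-15, 41)]),
           key=lambda t: t[0])
print("grid minimizer  : mean = %.3f, var = %.3f" % (best[1], best[2]))
print("closed form (9) : mean = %.3f, var = %.3f" % (m_star, var_star))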


Page 46:

The Exploratory Mean-Variance (MV) problem

Verification

- By standard verification arguments, the value function is

  V(t, x; w) = (x − w)² e^{−ρ²(T−t)} + (λρ²/4)(T² − t²) − (λ/2)(ρ²T − ln(σ²/(πλ)))(T − t) − (w − z)²

- The optimal control distribution is Gaussian, with density

  π*(u; t, x, w) = N( u | −(ρ/σ)(x − w), (λ/(2σ²)) e^{ρ²(T−t)} )

- The optimal wealth process is the unique solution of the SDE

  dX*_t = −ρ²(X*_t − w) dt + sqrt( ρ²(X*_t − w)² + (λ/2) e^{ρ²(T−t)} ) dW_t,  X*_0 = x

- The Lagrange multiplier is w = (z e^{ρ²T} − x)/(e^{ρ²T} − 1)

  (a Monte Carlo check that E[X*_T] ≈ z under this w follows this slide)
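The following sketch (mine, with assumed parameter values) simulates the optimal wealth SDE above with the stated Lagrange multiplier and checks that the sample mean of X*_T is close to the target z, as the constraint E[X*_T] = z requires.

# Monte Carlo check of the closed-form solution: simulate
# dX*_t = -rho^2 (X*_t - w) dt + sqrt( rho^2 (X*_t - w)^2 + (lam/2) e^{rho^2 (T-t)} ) dW_t
# with w = (z e^{rho^2 T} - x) / (e^{rho^2 T} - 1). Parameter values are illustrative.
import math, random, statistics

rho, lam = 0.4, 0.1
T, dt, x0, z = 1.0, 1.0 / 250, 1.0, 1.1
w = (z * math.exp(rho ** 2 * T) - x0) / (math.exp(rho ** 2 * T) - 1.0)

def terminal_X():
    X, t = x0, 0.0
    for _ in range(int(T / dt)):
        diff2 = rho ** 2 * (X - w) ** 2 + 0.5 * lam * math.exp(rho ** 2 * (T - t))
        X += -rho ** 2 * (X - w) * dt + math.sqrt(diff2 * dt) * random.gauss(0.0, 1.0)
        t += dt
    return X

Xs = [terminal_X() for _ in range(10000)]
print("E[X*_T] ~ %.4f  (target z = %.4f, Lagrange multiplier w = %.4f)"
      % (statistics.mean(Xs), z, w))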

Page 47:

The Exploratory Mean-Variance (MV) problem

Insights

- The best control distribution to balance exploration and exploitation is Gaussian, similar to the infinite-horizon linear-quadratic problem studied in our previous work Wang et al. (2019)
- The degree of exploration, characterized by the variance of the Gaussian, (λ/(2σ²)) e^{ρ²(T−t)}, decreases as t → T (tabulated numerically after this slide)
- This indicates that exploitation gradually dominates exploration as t → T, consistent with the original objective having a single cost at T
- At fixed t ∈ [0, T], the exploration variance decreases as σ increases; exploration is less necessary in a more random market environment
- There is a perfect separation between exploration and exploitation:
  - exploration is solely reflected in the variance of the Gaussian
  - exploitation is solely reflected in the mean of the Gaussian
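A tiny tabulation of the exploration variance for illustrative parameter values, showing both monotonicities claimed on this slide (in t and in σ):

# Tabulate (lam / (2*sigma^2)) * exp(rho^2 * (T - t)) on a time grid for two volatility
# levels; it shrinks as t -> T and as sigma grows. Parameter values are illustrative.
import math

lam, rho, T = 1.0, 0.4, 1.0
for sigma in (0.1, 0.3):
    row = ["sigma = %.1f:" % sigma]
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        var = lam / (2.0 * sigma ** 2) * math.exp(rho ** 2 * (T - t))
        row.append("t=%.2f -> %.2f" % (t, var))
    print("  ".join(row))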


Page 54:

Effect and Cost of Exploration

Equivalence of Solvability

- Equivalence between the solvability of the exploratory MV problem and that of its classical counterpart ...
- ... in the sense that the value function and optimal control of one problem lead immediately to those of the other
- The following two statements are equivalent:

  (a) The function

      V(t, x; w) = (x − w)² e^{−ρ²(T−t)} + (λρ²/4)(T² − t²) − (λ/2)(ρ²T − ln(σ²/(πλ)))(T − t) − (w − z)²,

      (t, x) ∈ [0, T] × R, is the value function of the exploratory MV problem, and the corresponding optimal feedback control is

      π*(u; t, x, w) = N( u | −(ρ/σ)(x − w), (λ/(2σ²)) e^{ρ²(T−t)} ).

  (b) The function V^cl(t, x; w) = (x − w)² e^{−ρ²(T−t)} − (w − z)², (t, x) ∈ [0, T] × R, is the value function of the classical MV problem, and the corresponding optimal feedback control is

      u*(t, x; w) = −(ρ/σ)(x − w).

- Moreover, the two problems have the same Lagrange multiplier w = (z e^{ρ²T} − x)/(e^{ρ²T} − 1).


Page 58:

Effect and Cost of Exploration

Cost of Exploration

- Define the exploration cost as

  C^{u*,π*}(0, x; w) := ( V(0, x; w) − λ E[ ∫_0^T ∫_R π*_t(u) ln π*_t(u) du dt | X^{π*}_0 = x ] ) − V^cl(0, x; w)

- The entropy of the Gaussian N(· | µ, σ²) is ln(σ√(2πe))
- Hence, for the MV problem,

  C^{u*,π*}(0, x; w) = V(0, x; w) + (λ/2) ∫_0^T ln( (πeλ/σ²) e^{ρ²(T−t)} ) dt − V^cl(0, x; w) = λT/2,  x ∈ R, w ∈ R

- The cost is independent of the specific (unknown) wealth dynamics, the initial wealth, and the expected target z

  (a numerical check of the λT/2 identity follows this slide)
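The λT/2 identity can be checked numerically from the closed forms on the previous slides (V from the Verification slide, V^cl from the Equivalence slide, and the Gaussian entropy of π*_t); parameter values below are illustrative.

# Check the lambda*T/2 formula: cost = V(0,x;w) + lambda * int_0^T H(pi*_t) dt - V_cl(0,x;w),
# with H(pi*_t) = 0.5 * ln( 2*pi*e * (lam/(2 sigma^2)) * exp(rho^2 (T-t)) ).
import math

rho, sigma, lam, T = 0.4, 0.2, 1.0, 2.0
x, z = 1.0, 1.2
w = (z * math.exp(rho ** 2 * T) - x) / (math.exp(rho ** 2 * T) - 1.0)

V0 = ((x - w) ** 2 * math.exp(-rho ** 2 * T)
      + lam * rho ** 2 / 4.0 * T ** 2
      - lam / 2.0 * (rho ** 2 * T - math.log(sigma ** 2 / (math.pi * lam))) * T
      - (w - z) ** 2)
V_cl0 = (x - w) ** 2 * math.exp(-rho ** 2 * T) - (w - z) ** 2

n = 200000
entropy_integral = sum(
    0.5 * math.log(2.0 * math.pi * math.e * lam / (2.0 * sigma ** 2)
                   * math.exp(rho ** 2 * (T - (k + 0.5) * T / n)))
    for k in range(n)) * (T / n)

print("numerical cost :", V0 + lam * entropy_integral - V_cl0)
print("lambda*T/2     :", lam * T / 2.0)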


Page 62: Continuous-time mean-variance portfolio selection: …ims.nus.edu.sg/events/2019/qfinance/files/haoran.pdfRelated Works I Exploration versus exploitation in reinforcement learning:

Effect and Cost of Exploration

Vanishing Exploration

I Assume that either the exploratory or the classical MV problem is solvable
I Then, for each (t, x, w) ∈ [0, T] × R × R,

lim_{λ→0} π∗(·; t, x; w) = δ_{u∗(t,x;w)}(·) weakly

I Moreover (illustrated numerically below),

lim_{λ→0} |V(t, x; w) − V^cl(t, x; w)| = 0
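
A small numerical illustration of the weak-convergence statement, under placeholder parameter values: as λ shrinks, the optimal exploratory policy puts vanishing mass outside any ε-ball around the classical control u∗(t, x; w).

```python
import numpy as np
from scipy.stats import norm

# Placeholder values; u_star is the classical optimal control at (t, x).
rho, sigma, T, t = 0.4, 0.2, 1.0, 0.0
x, w, eps = 1.0, 2.0, 1e-2
u_star = -(rho / sigma) * (x - w)

for lam in [1.0, 1e-1, 1e-2, 1e-4]:
    std = np.sqrt(lam / (2.0 * sigma**2) * np.exp(rho**2 * (T - t)))
    # Mass that pi* puts outside the eps-ball around its mean u*(t, x; w);
    # it vanishes as lam -> 0, i.e. pi* converges weakly to the Dirac at u*.
    tail_mass = 2.0 * norm.sf(eps / std)
    print(lam, u_star, tail_mass)
```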

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 23 / 42


The EMV Algorithm

RL Algorithm for MV

I We design an implementable reinforcement learning algorithm, the EMV algorithm, for solving the exploratory MV problem
I The EMV algorithm consists of three concurrently ongoing processes
    I policy evaluation
    I policy improvement
    I a self-correcting scheme for learning the Lagrange multiplier w based on stochastic approximation
I The policy improvement process relies on a provable Policy Improvement Theorem (PIT) for the continuous-time relaxed stochastic control problem
I A convergence result will also be provided

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 24 / 42


The EMV Algorithm

Policy Improvement Theorem

I A policy improvement theorem is an essential component of most RL algorithms, as it indicates how one should update the policy based on the currently learned value functions
I It has been proved for discrete-time RL problems with and without entropy regularization, and for continuous-time classical stochastic control problems
I We provide a PIT for the continuous-time MV problem with both control relaxation and entropy regularization
I Let π(·; t, x, w) be an arbitrary admissible feedback policy
I Define the value function under π as

V^π(s, y; w) = E[(X^π_T − w)² + λ ∫_s^T ∫_R π_t(u) ln π_t(u) du dt | X^π_s = y] − (w − z)²,

for (s, y) ∈ [0, T) × R

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 25 / 42


The EMV Algorithm

Policy Improvement Theorem (cont’d)

I Suppose that the following conditions hold
    I V^π(t, x; w) ∈ C^{2,1}([0, T) × R) ∩ C^0([0, T] × R)
    I V^π_{xx}(t, x; w) > 0 for any (t, x) ∈ [0, T) × R
    I the feedback policy

    π̃(u; t, x, w) = N(u | −(ρ/σ) V^π_x(t, x; w) / V^π_{xx}(t, x; w), λ / (σ² V^π_{xx}(t, x; w)))

    is admissible
I Then the improved policy π̃ satisfies

V^{π̃}(t, x; w) ≤ V^π(t, x; w), (t, x) ∈ [0, T] × R

I The PIT indicates that one can focus on the Gaussian family of policies w.l.o.g.
I The Gaussian family is closed under the policy improvement update (see the sketch below)

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 26 / 42
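
A minimal sketch of the policy-improvement map in the theorem above. The function name and the numerical inputs are hypothetical; as a sanity check, feeding in the derivatives of V^cl(t, x; w) = (x − w)² e^{−ρ²(T−t)} − (w − z)² recovers the optimal exploratory policy from the earlier theorem, illustrating that the Gaussian family is closed under the update.

```python
import numpy as np

# Policy-improvement map from the theorem: given V_x and V_xx of the current
# value function at (t, x), return the mean and variance of the improved
# Gaussian policy. Here rho, sigma, lam are treated as known inputs; the
# function name and the numbers below are illustrative only.
def improved_policy(V_x, V_xx, rho, sigma, lam):
    assert V_xx > 0.0, "the PIT requires V_xx > 0"
    mean = -(rho / sigma) * V_x / V_xx
    var = lam / (sigma**2 * V_xx)
    return mean, var

# Sanity check: plug in the derivatives of V^cl(t, x; w).
rho, sigma, lam, T, t = 0.4, 0.2, 2.0, 1.0, 0.0
x, w = 1.0, 2.0
V_x = 2.0 * (x - w) * np.exp(-rho**2 * (T - t))
V_xx = 2.0 * np.exp(-rho**2 * (T - t))
print(improved_policy(V_x, V_xx, rho, sigma, lam))
# mean = -(rho/sigma)(x - w), var = lam/(2 sigma^2) e^{rho^2 (T - t)}: the
# optimal exploratory policy, so the Gaussian family is closed under the update.
```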


The EMV Algorithm

Convergence

I A convergence result is provable if one
    I applies the PIT to update the policy
    I selects the initial feedback policy π_0 wisely
I Let π_0(u; t, x, w) = N(u | a(x − w), c_1 e^{c_2(T−t)}), with a, c_2 ∈ R and c_1 > 0
I Denote by π_n(u; t, x, w), n ≥ 1, (t, x) ∈ [0, T] × R, the sequence of feedback policies updated by the PIT
I Denote by V^{π_n}(t, x; w), n ≥ 1, (t, x) ∈ [0, T] × R, the sequence of the corresponding value functions
I Then

lim_{n→∞} π_n(·; t, x, w) = π∗(·; t, x, w) weakly,

and

lim_{n→∞} V^{π_n}(t, x; w) = V(t, x; w),

for any (t, x) ∈ [0, T] × R

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 27 / 42


The EMV Algorithm

EMV Algorithm

I The theoretical results provide a good starting point in the policy space, and an update scheme for later policies
I Policy evaluation amounts to estimating the value function of a given feedback policy
I We follow Doya (2000) and minimize the (squared) continuous-time Bellman's error (a.k.a. temporal-difference error)

δ_t := V̇^π_t + λ ∫_R π_t(u) ln π_t(u) du,

where V̇^π_t = (V^π(X_{t+∆t}, t+∆t) − V^π(X_t, t)) / ∆t is the total derivative and ∆t is the discretization step for the learning algorithm (see the sketch below).
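
A sketch of the discretized Bellman error on a single observed transition, assuming a placeholder candidate value function and a known policy entropy (for a Gaussian policy, ∫ π_t(u) ln π_t(u) du = −H(π_t)). Function and variable names are illustrative, not from the paper.

```python
import numpy as np

# Discretized Bellman (temporal-difference) error on one transition.
def td_error(V, t, x_t, x_next, dt, policy_entropy_t, lam):
    V_dot = (V(x_next, t + dt) - V(x_t, t)) / dt   # estimate of the total derivative
    return V_dot - lam * policy_entropy_t          # delta_t = V_dot + lam * int pi ln pi

T, w = 1.0, 2.0
V = lambda x, t: (x - w)**2 * np.exp(-0.16 * (T - t))   # placeholder candidate value function
H_t = 0.5 + 0.2 * T                                     # placeholder policy entropy at t = 0
print(td_error(V, t=0.0, x_t=1.0, x_next=1.01, dt=1.0 / 252,
               policy_entropy_t=H_t, lam=2.0))
```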

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 28 / 42


The EMV Algorithm

EMV Algorithm (cont’d)

I Using samples, we aim to minimize

C(θ, φ) = (1/2) ∑_{(t_i, x_i) ∈ D} (V̇^θ(t_i, x_i) + λ ∫_R π^φ_{t_i}(u) ln π^φ_{t_i}(u) du)² ∆t

for parametrized V^θ(t, x) and π^φ_t(u)
I The policy π^φ is Gaussian, with variance of the form c_1 e^{c_2(T−t)} and entropy H(π^φ_t) = φ_1 + φ_2(T − t)
I The value function takes the form

V^θ(t, x) = (x − w)² e^{−θ_3(T−t)} + θ_2 t² + θ_1 t + θ_0

I Apply stochastic gradient descent, with easily computable gradients ∇_θ C and ∇_φ C (a sample-based sketch of C follows below)
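
A sample-based sketch of C(θ, φ) under the parametrizations above, computed along one placeholder wealth path D. In the actual EMV algorithm the gradients ∇_θ C and ∇_φ C are computed analytically; this function only illustrates how the objective is assembled.

```python
import numpy as np

# V^theta(t, x) = (x - w)^2 e^{-theta3 (T - t)} + theta2 t^2 + theta1 t + theta0,
# entropy H(pi^phi_t) = phi1 + phi2 (T - t), hence int pi ln pi du = -(phi1 + phi2 (T - t)).
# D is one sampled wealth path of (t_i, x_i) pairs; the path below is fake data.
def C(theta, phi, D, lam, T, w, dt):
    theta0, theta1, theta2, theta3 = theta
    phi1, phi2 = phi
    V = lambda t, x: ((x - w)**2 * np.exp(-theta3 * (T - t))
                      + theta2 * t**2 + theta1 * t + theta0)
    total = 0.0
    for (t_i, x_i), (t_j, x_j) in zip(D[:-1], D[1:]):
        V_dot = (V(t_j, x_j) - V(t_i, x_i)) / dt
        bellman_err = V_dot - lam * (phi1 + phi2 * (T - t_i))
        total += 0.5 * bellman_err**2 * dt
    return total

dt, T, w, lam = 1.0 / 252, 1.0, 2.0, 2.0
D = [(k * dt, 1.0 + 0.001 * k) for k in range(253)]   # placeholder path
print(C((0.0, 0.1, 0.05, 0.16), (0.5, 0.2), D, lam, T, w, dt))
```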

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 29 / 42


The EMV Algorithm

EMV Algorithm (cont’d)

I The PIT indicates the policy improvement as

π(u; t, x, w) = N(u | −(ρ/σ)(x − w), (λ/(2σ²)) e^{θ_3(T−t)})
             = N(u | −√(2φ_2/(λπ)) e^{(2φ_1−1)/2} (x − w), (1/(2π)) e^{2φ_2(T−t) + 2φ_1 − 1})

I Finally, we provide a self-correcting scheme for learning the Lagrange multiplier w based on stochastic approximation:

w_{n+1} = w_n − α_n ((1/N) ∑_{k=1}^N x^k_T − z),

with α_n > 0, n ≥ 1, being the learning rate (both updates are sketched below)
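
Minimal sketches of the two updates on this slide: the improved Gaussian policy expressed through the learned (φ_1, φ_2), and the stochastic-approximation step for w. Numerical values and the surrounding loop are placeholders.

```python
import numpy as np

def policy_from_phi(phi1, phi2, t, x, w, lam, T):
    """Improved Gaussian policy written in terms of the learned (phi1, phi2)."""
    mean = -np.sqrt(2.0 * phi2 / (lam * np.pi)) * np.exp(phi1 - 0.5) * (x - w)
    var = np.exp(2.0 * phi2 * (T - t) + 2.0 * phi1 - 1.0) / (2.0 * np.pi)
    return mean, var

def update_w(w, terminal_wealths, z, alpha):
    """Self-correcting stochastic-approximation step for the Lagrange multiplier."""
    return w - alpha * (np.mean(terminal_wealths) - z)

# Hypothetical usage inside a training loop
lam, T, z = 2.0, 1.0, 1.4
w, alpha = 2.0, 0.05
print(policy_from_phi(phi1=0.5, phi2=0.2, t=0.0, x=1.0, w=w, lam=lam, T=T))
terminal_wealths = np.random.normal(1.3, 0.2, size=10)   # placeholder X_T samples
print(update_w(w, terminal_wealths, z, alpha))
```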

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 30 / 42


Large Scale Empirical Tests

Large scale empirical study

I Data and methods
    I Portfolio consists of a large number of stocks (d = 20, 50, 75, 100, etc.)
    I Use price data of all S&P 500 stocks
    I Compare EMV with econometric methods (Black-Litterman, Fama-French, Markowitz, etc.) and a deep learning based method – deep deterministic policy gradient (DDPG)
I Our tests include
    I two training and testing methods — batch and universal methods
    I long and medium investment horizons — 10 years and 1 year
    I monthly and daily rebalancing frequencies
    I with and without leverage constraints

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 31 / 42


Large Scale Empirical Tests

Training methods

I Seed composition: randomly select d stocks from the S&P 500 (with repetition) to compose 100 seeds (a sketch follows below)
    I Each seed consists of training data, followed by testing data
I Universal training and testing
    I Randomly select 1 seed for each episode during training
    I Randomly select 100 seeds for testing, then average
    I This method artificially generates randomness and tests the universal applicability of the algorithms
I Batch (off-line) RL training and testing
    I Use 1 seed for all episodes during training
    I Use the same seed for testing
    I Repeat over 100 seeds, then average
I Rolling-horizon based training and testing
    I One-period-ahead testing data is added to the training set progressively
    I The most obsolete training data is discarded
    I Adopted only by the econometric methods, for competitive performance
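
A rough sketch of the seed-composition step, assuming a placeholder ticker universe; the actual data pipeline (price loading, train/test splitting) is not shown on the slides.

```python
import random

# Each seed is a portfolio of d tickers drawn with repetition; its price
# history would then be split into training data followed by testing data.
def make_seeds(sp500_tickers, d, n_seeds=100, rng_seed=0):
    rng = random.Random(rng_seed)
    return [rng.choices(sp500_tickers, k=d) for _ in range(n_seeds)]

def pick_training_seed(seeds, universal, rng):
    # Universal training: a fresh randomly chosen seed every episode.
    # Batch (off-line) training: the same single seed for all episodes.
    return rng.choice(seeds) if universal else seeds[0]

# Hypothetical usage
tickers = ["AAPL", "MSFT", "XOM", "JPM", "PG"]          # placeholder universe
seeds = make_seeds(tickers, d=3, n_seeds=100)
episode_seed = pick_training_seed(seeds, universal=True, rng=random.Random(1))
print(episode_seed)
```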

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 32 / 42


Large Scale Empirical Tests

Test I: monthly rebalancing

I Training data: monthly price data over 08-31-1990 to 08-31-2000
I Testing data: monthly price data over 09-29-2000 to 09-30-2010
I Investment horizon: 10 years
I Trading frequency: monthly
I Normalized initial wealth: x_0 = 1
I Target return: 23% annually (i.e. z = 8)
I Gross leverage constraints
    I L = ∞, 100%, 150%, 200% for EMV
    I L = 200% for DDPG
    I L = ∞ for all the other econometric methods

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 33 / 42


Large Scale Empirical Tests

Test I: monthly rebalancing – Universal method

Figure 1: Investment performance comparison for a 10-year horizon with monthly rebalancing (d = 20).

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 34 / 42


Large Scale Empirical Tests

Test I: monthly rebalancing – Batch RL method

Figure 2: Investment performance comparison for a 10-year horizon with monthly rebalancing (d = 20).

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 35 / 42


Large Scale Empirical Tests

Test I: monthly rebalancing – Universal vs. Batch RL

Figure 3: Investment performance comparison for a 10-year horizon with monthly rebalancing (d = 20).

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 36 / 42


Large Scale Empirical Tests

Test II: daily rebalancing

I Training data: daily price data over 01-09-2017 to 01-08-2018
I Testing data: daily price data over 01-09-2018 to 01-09-2019
I Investment horizon: 1 year
I Trading frequency: daily
I Normalized initial wealth: x_0 = 1
I Target return: 40% annually
I Gross leverage constraints
    I L = 100%, 150%, 200% for EMV
    I L = 200% for DDPG

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 37 / 42


Large Scale Empirical Tests

Test II: daily rebalancing (cont’d)

Figure 4: Investment performance comparison for a 1-year horizon with daily rebalancing (d = 50).

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 38 / 42


Large Scale Empirical Tests

Scaling up d – monthly rebalancing

Table 1: Annualized return (with Sharpe ratio) and training time corresponding to different d for a 10-year horizon with monthly rebalancing.

                               d = 20                         d = 60                         d = 100
EMV (L=200%)                   10.8% (0.797), 0.31 hrs        11.2% (1.323), 4.34 hrs        6.3% (1.627), 1.53 hrs [1]
DDPG (L=200%, unannualized)    −300.1% (−0.411), 4.23 hrs     476.3% (0.359), 5.32 hrs       −653.4% (−0.432), 6.68 hrs

[1] Only 1000 episodes were trained, while the training episodes for the other experiments were 20000.

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 39 / 42


Large Scale Empirical Tests

Scaling up d – daily rebalancing

Table 2: Annualized return (with Sharpe ratio) and training time corresponding to different d for a 1-year horizon with daily rebalancing.

                    d = 50                         d = 75                         d = 100
EMV (L=200%)        44.9% (1.347), 1.36 hrs        33.0% (1.370), 5.54 hrs        17.9% (1.124), 1.63 hrs [2]
DDPG (L=200%)       −189.6% (−0.096), 6.20 hrs     −27.9% (−0.012), 8.45 hrs      −640.6% (−0.219), 14.42 hrs

[2] Only 1000 episodes were trained, while the training episodes for the other experiments were 20000.

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 40 / 42


Conclusions

Main Contributions

I An entropy-regularized reward function, and an "exploratory formulation" for the state dynamics
I A complete analysis of the MV problem in continuous time
I The optimal control distribution for balancing exploitation and exploration is Gaussian, with time-decaying variance
I Perfect separation between exploitation and exploration
I A more random environment contains more learning opportunities
I The cost of exploration is characterized
I Interpretability based on a provable policy improvement theorem
I The EMV algorithm outperforms both the econometric methods and the deep learning method by large margins

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 41 / 42


Conclusions

Thank you.

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 42 / 42