Continuous-time mean-variance portfolio selection: A reinforcement learning framework


Page 1:

Continuous-time mean-variance portfolio selection: A reinforcement learning framework

(Joint work with X.-Y. Zhou (Columbia))

Workshop on Fintech and Machine Learning, IMS/RMI, NUS

August 2019

Haoran Wang, Columbia University

Page 2:

Related Works

- Exploration versus exploitation in reinforcement learning: A stochastic control approach (with T. Zariphopoulou and X.-Y. Zhou), arXiv, submitted, 2019.
- Continuous-time mean-variance portfolio selection: A reinforcement learning framework (with X.-Y. Zhou), arXiv, submitted, 2019.
- Large scale continuous-time mean-variance portfolio allocation via reinforcement learning, arXiv, 2019.

Page 3:

Motivations

Formulation

The Exploratory Mean-Variance (MV) problem

Effect and Cost of Exploration

The EMV Algorithm

Large Scale Empirical Tests

Conclusions


Page 4:

Motivations

Reinforcement Learning meets quantitative finance

- Reinforcement learning (RL): an active and fast developing subarea of machine learning
- An RL agent does not pre-specify a structural model; she learns the best strategies based on trial and error, through interactions with the black-box environment (e.g. the market)
- This is in direct contrast with the econometric methods and supervised/unsupervised learning methods commonly used in quantitative finance research
- The agent's actions (controls) serve both as a means to explore (learn) and as a way to exploit (optimize)
- A natural and crucial question: the trade-off between exploration of uncharted territory and exploitation of existing knowledge


Page 9:

Motivations

Literature

- Extensive studies on trading off exploitation and exploration:
  - multi-armed bandit problems: Gittins index (Gittins (1974)), Thompson sampling (Thompson (1933)), theoretical optimality (Russo and Van Roy (2013, 2014)), ...
  - general RL problems: Brafman and Tennenholtz (2002), Strehl and Littman (2008), Strehl et al. (2009), ...
- Most existing works do not include exploration in the optimization objective
- Entropy-regularized RL formulations in discrete time explicitly incorporate exploration into the optimization objective, with a trade-off weight on the entropy of the exploration strategy (Ziebart et al. (2008), Nachum et al. (2017a), Fox et al. (2015), ...)
- Applications of RL to quantitative finance mostly focus on risk-neutral decision making, including optimal execution (Nevmyvaka et al. (2006), Hendricks and Wilcox (2014)) and portfolio management (Moody and Saffell (2001), Moody et al. (1998)), ...

Page 10:

Motivations

This Paper...

- Study the trade-off between exploration and exploitation for RL in a continuous-time mean-variance (MV) portfolio optimization setting
- Continuous control (action) and state (feature) spaces
- Model situations in which agents can interact with markets at ultra-high frequency, aided by modern computing resources (e.g. high frequency trading)
- Elegant and insightful results are possible once cast in continuous time, thanks to the tools of stochastic calculus, differential equations and stochastic control
- Design an RL algorithm based on theoretical foundations and maintain interpretability
- Compare our algorithm with other state-of-the-art methods for solving the MV problem


Page 16:

Formulation

Classical Stochastic Control

- A filtered probability space (Ω, F, P; {F_t}_{t≥0}) and an {F_t}_{t≥0}-Brownian motion W = {W_t, t ≥ 0}
- Action space U: representing constraints on an agent's decisions ("controls" or "actions")
- Admissible control u = {u_t, t ≥ 0}: an {F_t}_{t≥0}-adapted measurable process taking values in U
- 𝒰: the set of all admissible controls
- State (or "feature") dynamics

  dx^u_t = b(x^u_t, u_t) dt + σ(x^u_t, u_t) dW_t,  t > 0   (1)

- Objective: achieve the maximum expected total discounted reward, represented by the value function

  w(x) := sup_{u ∈ 𝒰} E[ ∫_0^∞ e^{-ρt} r(x^u_t, u_t) dt | x^u_0 = x ],   (2)

  where r is the reward function and ρ > 0 is the discount rate
- Dynamic programming applies when the model is fully known (Fleming and Soner (1992), Yong and Zhou (1998)); a simulation sketch of (1)-(2) follows this slide
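To make (1)-(2) concrete, here is a minimal simulation sketch (mine, not from the slides): it discretizes the state SDE with an Euler-Maruyama scheme and Monte Carlo estimates the discounted reward for one fixed, hand-picked control rule. The drift b, volatility σ, reward r, discount rate and the control rule are all illustrative assumptions, not the model used later in the talk.

# Minimal illustration of (1)-(2): Euler-Maruyama for dx = b dt + sigma dW and a
# Monte Carlo estimate of E[ int_0^inf e^{-rho t} r(x_t, u_t) dt ] for ONE fixed
# (non-optimal) control rule. All model choices are illustrative assumptions.
import math, random

rho = 0.5                                  # discount rate
dt, horizon, n_paths = 0.01, 10.0, 500     # truncate the infinite horizon at `horizon`

def b(x, u):        # drift, illustrative
    return -0.5 * x + u

def sigma(x, u):    # volatility, illustrative
    return 0.2

def reward(x, u):   # running reward, illustrative
    return -(x ** 2) - 0.1 * (u ** 2)

def control(x):     # a fixed feedback rule, illustrative
    return 0.3 * x

def discounted_reward(x0):
    x, t, total = x0, 0.0, 0.0
    while t < horizon:
        u = control(x)
        total += math.exp(-rho * t) * reward(x, u) * dt
        x += b(x, u) * dt + sigma(x, u) * math.sqrt(dt) * random.gauss(0.0, 1.0)
        t += dt
    return total

estimate = sum(discounted_reward(1.0) for _ in range(n_paths)) / n_paths
print("Monte Carlo estimate of the objective (2) for this control:", round(estimate, 4))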

Page 17:

Formulation

Exploration in RL

- In the RL setting, when the model is not known, the agent engages in exploration to interact with and learn the unknown environment through trial and error
- This exploration is modeled by a distribution of controls π = {π_t(u), t ≥ 0} over the control space U, from which each "trial" is sampled
- The notion of controls is thus extended to distributions
- The agent executes a control for N rounds; at each round, a classical control is sampled from the distribution π
- The reward estimate of such a policy becomes accurate enough when N is large
- Policy evaluation (Sutton and Barto (2018))


Page 23:

Formulation

Policy Evaluation

- Let us explain for the special case r(x^u_t, u_t) = r(u_t) (known as the continuous-armed bandit problem)
- Consider N identical, independent rounds of the control problem: at round i, i = 1, 2, ..., N, a control u^i is sampled under π and executed for its corresponding copy of the control problem (1)-(2)
- At each t, by the law of large numbers (and under certain mild technical conditions), the average reward over [t, t + ∆t] satisfies

  (1/N) Σ_{i=1}^N e^{-ρt} r(u^i_t) ∆t  →  E[ e^{-ρt} ∫_U r(u) π_t(u) du ∆t ]  a.s., as N → ∞

- This suggests that the reward function under exploration should be revised to

  E[ e^{-ρt} ∫_U r(u) π_t(u) du ]

  (a small Monte Carlo sketch of this limit follows this slide)
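A quick numerical illustration of the law-of-large-numbers limit on this slide (a toy example of my own, not from the talk): rewards of controls sampled from a fixed Gaussian π are averaged and compared with ∫_U r(u) π_t(u) du computed by a simple Riemann sum. The reward r and the policy parameters are illustrative assumptions.

# Sample-average reward under a fixed policy pi vs. the integral int_U r(u) pi(u) du.
# r and pi are illustrative choices; the point is only the N -> infinity limit.
import math, random

def r(u):                       # reward of pulling "arm" u, illustrative
    return 1.0 - (u - 0.4) ** 2

mu, sd = 0.0, 0.5               # a fixed Gaussian exploration distribution pi

def pi_density(u):
    return math.exp(-0.5 * ((u - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# "Exact" value of int r(u) pi(u) du by a Riemann sum on [-4, 4]
n_grid = 20000
exact = sum(r(-4.0 + 8.0 * (k + 0.5) / n_grid) * pi_density(-4.0 + 8.0 * (k + 0.5) / n_grid)
            for k in range(n_grid)) * (8.0 / n_grid)

for N in (10, 1000, 100000):
    avg = sum(r(random.gauss(mu, sd)) for _ in range(N)) / N
    print(f"N = {N:6d}:  sample average = {avg:.4f},  integral = {exact:.4f}")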


Page 27:

Formulation

Exploratory Control Formulation (W., Zariphopoulou, Zhou, 2019)

- In a similar fashion, we propose the exploratory state dynamics, a controlled stochastic differential equation (SDE)

  dX^π_t = b̃(X^π_t, π_t) dt + σ̃(X^π_t, π_t) dW_t,  t > 0;  X^π_0 = x,   (3)

  where

  b̃(X^π_t, π_t) := ∫_U b(X^π_t, u) π_t(u) du,   (4)

  and

  σ̃(X^π_t, π_t) := sqrt( ∫_U σ²(X^π_t, u) π_t(u) du )   (5)

- Exploratory reward function

  r̃(X^π_t, π_t) := ∫_U r(X^π_t, u) π_t(u) du   (6)

- This exploratory formulation coincides with relaxed control in the control literature: El Karoui et al. (1987), Kurtz and Stockbridge (1998, 2001), Yong and Zhou (1998), ...
- The resurgence of relaxed control here is motivated by exploration in RL (a numerical sketch of (4)-(5) follows this slide)
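The following sketch (my own illustration, under assumed coefficients) computes the exploratory drift and volatility in (4)-(5) for a fixed state x and a Gaussian policy, both by quadrature and by sampling controls from the policy; b and σ below are illustrative and not the MV specification used later in the talk.

# Numerical illustration of the exploratory coefficients (4)-(5): for a fixed state x
# and a Gaussian policy pi, compute b_tilde = int b(x,u) pi(u) du and
# sigma_tilde = sqrt( int sigma(x,u)^2 pi(u) du ) by quadrature and by sampling.
import math, random

def b(x, u):                      # illustrative drift coefficient
    return 0.1 * x + 0.5 * u

def sig(x, u):                    # illustrative volatility coefficient
    return 0.2 * (1.0 + abs(u))

x, mu, sd = 1.0, 0.3, 0.4         # state and Gaussian policy parameters (illustrative)

def pi_density(u):
    return math.exp(-0.5 * ((u - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

lo, hi, n = mu - 8 * sd, mu + 8 * sd, 40000
du = (hi - lo) / n
us = [lo + (k + 0.5) * du for k in range(n)]
b_tilde = sum(b(x, u) * pi_density(u) for u in us) * du
var_tilde = sum(sig(x, u) ** 2 * pi_density(u) for u in us) * du

samples = [random.gauss(mu, sd) for _ in range(200000)]
b_mc = sum(b(x, u) for u in samples) / len(samples)
var_mc = sum(sig(x, u) ** 2 for u in samples) / len(samples)

print("b_tilde    : quadrature %.4f  vs  Monte Carlo %.4f" % (b_tilde, b_mc))
print("sigma_tilde: quadrature %.4f  vs  Monte Carlo %.4f"
      % (math.sqrt(var_tilde), math.sqrt(var_mc)))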


Page 31:

Formulation

Entropy-Regularized Stochastic Control

- If the model is fully known, then exploration is not needed and the optimal control distribution degenerates into a Dirac measure under Roxin's condition
- In the RL context we need to add a "regularization term" to encourage exploration
- Use Shannon's differential entropy to measure the degree of exploration:

  H(π_t) := −∫_U π_t(u) ln π_t(u) du,

  where π_t is a control distribution
- Entropy-regularized value function

  V(x) := sup_{π ∈ A(x)} E[ ∫_0^∞ e^{-ρt} ( ∫_U r(X^π_t, u) π_t(u) du − λ ∫_U π_t(u) ln π_t(u) du ) dt | X^π_0 = x ],   (7)

  where λ > 0 is an exogenous temperature parameter
- A(x): the set of admissible control distributions (a short entropy computation follows this slide)
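As a small self-contained check of the regularizer in (7) (my own illustration), the code below evaluates the differential entropy H(π) of a Gaussian policy by quadrature and compares it with the closed form ln(sd·√(2πe)), which is the value used later when computing the cost of exploration. The standard deviations tried are arbitrary.

# Differential entropy H(pi) = -int pi ln pi for a Gaussian, by quadrature vs. the
# closed form ln(sd * sqrt(2*pi*e)); this is the quantity rewarded (weight lambda) in (7).
import math

def gaussian_entropy_numeric(sd, n=200000, width=10.0):
    lo, hi = -width * sd, width * sd
    du = (hi - lo) / n
    h = 0.0
    for k in range(n):
        u = lo + (k + 0.5) * du
        p = math.exp(-0.5 * (u / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        if p > 0.0:
            h -= p * math.log(p) * du
    return h

for sd in (0.1, 0.5, 2.0):
    closed = math.log(sd * math.sqrt(2.0 * math.pi * math.e))
    print(f"sd = {sd}: numeric H = {gaussian_entropy_numeric(sd):.4f}, closed form = {closed:.4f}")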


Page 36:

Formulation

Admissible Control Distributions

- B(U): the Borel σ-algebra on U
- P(U): the set of probability measures on U that are absolutely continuous with respect to the Lebesgue measure
- The admissible set A(x) contains all measure-valued processes π = {π_t, t ≥ 0} satisfying:
  i) for each t ≥ 0, π_t ∈ P(U) a.s.;
  ii) for each A ∈ B(U), {π_t(A), t ≥ 0} is F_t-progressively measurable;
  iii) the stochastic differential equation (3) has a unique solution X^π = {X^π_t, t ≥ 0} if π is applied;
  iv) the expectation on the right-hand side of (7) is finite

Page 37:

The Exploratory Mean-Variance (MV) problem

A special case: the MV problem

- We focus on the 1-d case for notational simplicity, i.e., one risky asset and one riskless asset
- The price of the risky asset follows the geometric Brownian motion

  dS_t = S_t (µ dt + σ dW_t),  0 ≤ t ≤ T,

  with S_0 = s > 0 being the initial price at t = 0, and µ ∈ R, σ > 0
- The riskless asset has interest rate r > 0
- In practice, the mean µ, the volatility σ and the Sharpe ratio ρ = (µ − r)/σ may be slowly time-varying, unknown stochastic processes for the RL algorithm to be designed

Page 38:

The Exploratory Mean-Variance (MV) problem

The classical MV problem

- Denote by {x^u_t, 0 ≤ t ≤ T} the discounted wealth process
- Denote by u = {u_t, 0 ≤ t ≤ T} the discounted dollar value put in the risky asset at time t
- Under the self-financing condition, the wealth process satisfies

  dx^u_t = σ u_t (ρ dt + dW_t),  0 ≤ t ≤ T,

  with initial endowment x^u_0 = x ∈ R
- The classical continuous-time MV problem is to solve

  min_u Var[x^u_T]
  s.t. E[x^u_T] = z,

  with z being the targeted return chosen at t = 0 (a simulation sketch of this wealth dynamics follows this slide)
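The sketch below (my own illustration, with assumed market parameters) simulates the wealth SDE dx_t = σ u_t (ρ dt + dW_t) under a few constant dollar allocations u and reports the sample mean and variance of x_T, which is exactly the trade-off the MV criterion balances.

# Simulate dx_t = sigma * u * (rho dt + dW_t) under constant allocations u and report
# mean/variance of x_T. Market and allocation parameters are illustrative only.
import math, random, statistics

rho, sigma = 0.4, 0.2            # Sharpe ratio and volatility (assumed)
T, dt, x0 = 1.0, 1.0 / 250, 1.0
n_paths = 2000

def terminal_wealth(u):
    x = x0
    for _ in range(int(T / dt)):
        x += sigma * u * (rho * dt + math.sqrt(dt) * random.gauss(0.0, 1.0))
    return x

for u in (0.5, 1.0, 2.0):
    xs = [terminal_wealth(u) for _ in range(n_paths)]
    # theory: E[x_T] = x0 + sigma*u*rho*T, Var[x_T] = sigma^2 * u^2 * T
    print(f"u = {u}: mean(x_T) = {statistics.mean(xs):.4f}, var(x_T) = {statistics.variance(xs):.4f}")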

Page 39:

The Exploratory Mean-Variance (MV) problem

The solution technique

- By introducing a Lagrange multiplier 2w, the problem turns into solving

  min_u E[(x^u_T)²] − z² − 2w (E[x^u_T] − z)
  = min_u E[(x^u_T − w)²] − (w − z)²

  (the algebra behind this identity is spelled out after this slide)
- The constraint E[x^{u*}_T] = z determines the value of w
- In practice, implementing the MV optimal allocation {u*_t, t ∈ [0, T]} requires (real-time) estimation of µ and σ
- It is, however, challenging to obtain accurate estimates of the mean return vector (aka the mean-blur problem)
- The allocation is often sensitive to the input estimators, due to the inversion of the ill-conditioned variance-covariance matrix
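For the record, the identity used on this slide is just completing the square: expanding (x^u_T − w)² and using −(w − z)² = −w² + 2wz − z² gives

  E[(x^u_T − w)²] − (w − z)²
    = E[(x^u_T)²] − 2w E[x^u_T] + w² − (w² − 2wz + z²)
    = E[(x^u_T)²] − z² − 2w (E[x^u_T] − z),

so the two minimization problems differ only by terms that do not depend on the control u.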

Page 40:

The Exploratory Mean-Variance (MV) problem

The exploratory MV problem

- The exploratory state dynamics follows from the previous discussion as

  dX^π_t = ρσ µ_t dt + σ sqrt(µ²_t + σ²_t) dW_t,

  with X^π_0 = x, where

  µ_t := ∫_R u π_t(u) du  and  σ²_t := ∫_R u² π_t(u) du − µ²_t

  are the mean and variance processes of the control distribution {π_t, t ∈ [0, T]}
- The objective is to solve

  inf_π E[ (X^π_T − w)² + λ ∫_0^T ∫_R π_t(u) ln π_t(u) du dt | X^π_0 = x ] − (w − z)²,

  subject to the constraint E[X^{π*}_T] = z

  (an Euler-Maruyama sketch of these dynamics follows this slide)
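Here is a small Euler-Maruyama sketch (my own illustration) of the exploratory wealth dynamics above under a fixed Gaussian policy with constant mean m and variance s²; the market and policy parameters are assumed, and the optimal time-varying policy only appears on the later slides.

# Euler-Maruyama for dX_t = rho*sigma*mu_t dt + sigma*sqrt(mu_t^2 + s_t^2) dW_t under a
# FIXED Gaussian policy (mu_t = m, s_t^2 = s2). Parameters are illustrative assumptions.
import math, random, statistics

rho, sigma = 0.4, 0.2
T, dt, x0 = 1.0, 1.0 / 250, 1.0
m, s2 = 1.0, 0.25                 # fixed policy mean and variance (assumed)
n_paths = 4000

def terminal_X():
    X = x0
    drift = rho * sigma * m
    vol = sigma * math.sqrt(m * m + s2)
    for _ in range(int(T / dt)):
        X += drift * dt + vol * math.sqrt(dt) * random.gauss(0.0, 1.0)
    return X

Xs = [terminal_X() for _ in range(n_paths)]
print("E[X_T]  ~", round(statistics.mean(Xs), 4),
      " (theory:", round(x0 + rho * sigma * m * T, 4), ")")
print("Var[X_T] ~", round(statistics.variance(Xs), 4),
      " (theory:", round(sigma ** 2 * (m * m + s2) * T, 4), ")")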

Page 41:

The Exploratory Mean-Variance (MV) problem

Dynamic Programming and HJB Equation

- Bellman's principle of optimality:

  V(t, x; w) = inf_{π ∈ A(t,x)} E[ V(s, X^π_s; w) + λ ∫_t^s ∫_R π_v(u) ln π_v(u) du dv | X^π_t = x ]

  for x ∈ R and 0 ≤ t < s ≤ T
- V satisfies the Hamilton-Jacobi-Bellman (HJB) equation

  v_t(t, x; w) + min_{π ∈ P(R)} ∫_R ( ½ σ²u² v_xx(t, x; w) + ρσu v_x(t, x; w) + λ ln π(u) ) π(u) du = 0,   (8)

  with terminal condition v(T, x; w) = (x − w)² − (w − z)²

  (the pointwise minimization in (8) is worked out after this slide)
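The Gaussian optimizer announced on the next slide comes from the pointwise minimization inside (8); here is a sketch of the standard variational argument, stated for completeness. Write f(u) := ½ σ²u² v_xx + ρσu v_x and minimize ∫_R (f(u) + λ ln π(u)) π(u) du over densities π. The first-order condition with a multiplier η for the constraint ∫_R π(u) du = 1 reads

  f(u) + λ ln π(u) + λ + η = 0,  hence  π(u) ∝ exp(−f(u)/λ).

When v_xx > 0, completing the square in the exponent,

  ½ σ²v_xx u² + ρσ v_x u = ½ σ²v_xx ( u + (ρ/σ) v_x/v_xx )² − ρ² v_x² / (2 v_xx),

shows that this π is the Gaussian with mean −(ρ/σ) v_x/v_xx and variance λ/(σ² v_xx), which is exactly (9) on the next slide.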


Page 43:

The Exploratory Mean-Variance (MV) problem

Optimal Control Distribution

- The optimal control is obtained by minimizing the integral term in the HJB equation (8)
- Note that π ∈ P(U) if and only if

  ∫_U π(u) du = 1  and  π(u) ≥ 0 a.e. on U

- Solving this (constrained) minimization problem in the HJB equation yields the "feedback-type" optimizer

  π*(u; t, x, w) = exp( −(1/λ)( ½ σ²u² v_xx(t, x; w) + ρσu v_x(t, x; w) ) ) / ∫_R exp( −(1/λ)( ½ σ²u² v_xx(t, x; w) + ρσu v_x(t, x; w) ) ) du
                 = N( u | −(ρ/σ) v_x(t, x; w)/v_xx(t, x; w), λ/(σ² v_xx(t, x; w)) ),   (9)

  where N(u | α, β) denotes the Gaussian density function with mean α ∈ R and variance β > 0 (a numerical check of (9) follows this slide)
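A quick numerical sanity check of (9) (my own, with made-up values of v_x, v_xx and the market parameters): restrict to Gaussian candidates N(mean, var), for which the HJB integrand has the closed form ½σ²(mean² + var)v_xx + ρσ·mean·v_x − λ·½ln(2πe·var), and confirm that a grid search over (mean, var) lands on the closed-form minimizer.

# Grid check that, among Gaussians, the HJB integrand is minimized at
# mean = -(rho/sigma)*v_x/v_xx and var = lambda/(sigma^2*v_xx). Values are illustrative.
import math

rho, sigma, lam = 0.4, 0.2, 1.0
v_x, v_xx = -1.5, 2.0            # illustrative derivatives with v_xx > 0

def G(mean, var):
    expected_f = 0.5 * sigma ** 2 * (mean ** 2 + var) * v_xx + rho * sigma * mean * v_x
    neg_entropy = -0.5 * math.log(2.0 * math.pi * math.e * var)   # = int pi ln pi
    return expected_f + lam * neg_entropy

m_star = -(rho / sigma) * v_x / v_xx
var_star = lam / (sigma ** 2 * v_xx)

best = min(((G(m, v), m, v)
            for m in [m_star + 0.05 * k for k in range(-40, 41)]
            for v in [var_star * (1.0 + 0.05 * k) for k in range(-15, 41)]),
           key=lambda t: t[0])
print("grid minimizer  : mean = %.3f, var = %.3f" % (best[1], best[2]))
print("closed form (9) : mean = %.3f, var = %.3f" % (m_star, var_star))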


Page 46:

The Exploratory Mean-Variance (MV) problem

Verification

- By standard verification arguments, the value function is

  V(t, x; w) = (x − w)² e^{−ρ²(T−t)} + (λρ²/4)(T² − t²) − (λ/2)(ρ²T − ln(σ²/(πλ)))(T − t) − (w − z)²

- The optimal control distribution is Gaussian, with density

  π*(u; t, x, w) = N( u | −(ρ/σ)(x − w), (λ/(2σ²)) e^{ρ²(T−t)} )

- The optimal wealth process is the unique solution of the SDE

  dX*_t = −ρ²(X*_t − w) dt + sqrt( ρ²(X*_t − w)² + (λ/2) e^{ρ²(T−t)} ) dW_t,  X*_0 = x

- The Lagrange multiplier is w = (z e^{ρ²T} − x)/(e^{ρ²T} − 1)

  (a Monte Carlo check that E[X*_T] ≈ z under this w follows this slide)
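The following sketch (mine, with assumed parameter values) simulates the optimal wealth SDE above with the stated Lagrange multiplier and checks that the sample mean of X*_T is close to the target z, as the constraint E[X*_T] = z requires.

# Monte Carlo check of the closed-form solution: simulate
# dX*_t = -rho^2 (X*_t - w) dt + sqrt( rho^2 (X*_t - w)^2 + (lam/2) e^{rho^2 (T-t)} ) dW_t
# with w = (z e^{rho^2 T} - x) / (e^{rho^2 T} - 1). Parameter values are illustrative.
import math, random, statistics

rho, lam = 0.4, 0.1
T, dt, x0, z = 1.0, 1.0 / 250, 1.0, 1.1
w = (z * math.exp(rho ** 2 * T) - x0) / (math.exp(rho ** 2 * T) - 1.0)

def terminal_X():
    X, t = x0, 0.0
    for _ in range(int(T / dt)):
        diff2 = rho ** 2 * (X - w) ** 2 + 0.5 * lam * math.exp(rho ** 2 * (T - t))
        X += -rho ** 2 * (X - w) * dt + math.sqrt(diff2 * dt) * random.gauss(0.0, 1.0)
        t += dt
    return X

Xs = [terminal_X() for _ in range(10000)]
print("E[X*_T] ~ %.4f  (target z = %.4f, Lagrange multiplier w = %.4f)"
      % (statistics.mean(Xs), z, w))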

Page 47:

The Exploratory Mean-Variance (MV) problem

Insights

- The best control distribution to balance exploration and exploitation is Gaussian, similar to the infinite-horizon linear-quadratic problem studied in our previous work Wang et al. (2019)
- The degree of exploration, characterized by the variance of the Gaussian, (λ/(2σ²)) e^{ρ²(T−t)}, decreases as t → T (tabulated numerically after this slide)
- This indicates that exploitation gradually dominates exploration as t → T, consistent with the original objective having a single cost at T
- At fixed t ∈ [0, T], the exploration variance decreases as σ increases; exploration is less necessary in a more random market environment
- There is a perfect separation between exploration and exploitation:
  - exploration is solely reflected in the variance of the Gaussian
  - exploitation is solely reflected in the mean of the Gaussian
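A tiny tabulation of the exploration variance for illustrative parameter values, showing both monotonicities claimed on this slide (in t and in σ):

# Tabulate (lam / (2*sigma^2)) * exp(rho^2 * (T - t)) on a time grid for two volatility
# levels; it shrinks as t -> T and as sigma grows. Parameter values are illustrative.
import math

lam, rho, T = 1.0, 0.4, 1.0
for sigma in (0.1, 0.3):
    row = ["sigma = %.1f:" % sigma]
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        var = lam / (2.0 * sigma ** 2) * math.exp(rho ** 2 * (T - t))
        row.append("t=%.2f -> %.2f" % (t, var))
    print("  ".join(row))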


Page 54:

Effect and Cost of Exploration

Equivalence of Solvability

- Equivalence between the solvability of the exploratory MV problem and that of its classical counterpart ...
- ... in the sense that the value function and optimal control of one problem lead immediately to those of the other
- The following two statements are equivalent:

  (a) The function

      V(t, x; w) = (x − w)² e^{−ρ²(T−t)} + (λρ²/4)(T² − t²) − (λ/2)(ρ²T − ln(σ²/(πλ)))(T − t) − (w − z)²,

      (t, x) ∈ [0, T] × R, is the value function of the exploratory MV problem, and the corresponding optimal feedback control is

      π*(u; t, x, w) = N( u | −(ρ/σ)(x − w), (λ/(2σ²)) e^{ρ²(T−t)} ).

  (b) The function V^cl(t, x; w) = (x − w)² e^{−ρ²(T−t)} − (w − z)², (t, x) ∈ [0, T] × R, is the value function of the classical MV problem, and the corresponding optimal feedback control is

      u*(t, x; w) = −(ρ/σ)(x − w).

- Moreover, the two problems have the same Lagrange multiplier w = (z e^{ρ²T} − x)/(e^{ρ²T} − 1).


Page 58:

Effect and Cost of Exploration

Cost of Exploration

- Define the exploration cost as

  C^{u*,π*}(0, x; w) := ( V(0, x; w) − λ E[ ∫_0^T ∫_R π*_t(u) ln π*_t(u) du dt | X^{π*}_0 = x ] ) − V^cl(0, x; w)

- The entropy of the Gaussian N(· | µ, σ²) is ln(σ√(2πe))
- Hence, for the MV problem,

  C^{u*,π*}(0, x; w) = V(0, x; w) + (λ/2) ∫_0^T ln( (πeλ/σ²) e^{ρ²(T−t)} ) dt − V^cl(0, x; w) = λT/2,  x ∈ R, w ∈ R

- The cost is independent of the specific (unknown) wealth dynamics, the initial wealth, and the expected target z

  (a numerical check of the λT/2 identity follows this slide)
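The λT/2 identity can be checked numerically from the closed forms on the previous slides (V from the Verification slide, V^cl from the Equivalence slide, and the Gaussian entropy of π*_t); parameter values below are illustrative.

# Check the lambda*T/2 formula: cost = V(0,x;w) + lambda * int_0^T H(pi*_t) dt - V_cl(0,x;w),
# with H(pi*_t) = 0.5 * ln( 2*pi*e * (lam/(2 sigma^2)) * exp(rho^2 (T-t)) ).
import math

rho, sigma, lam, T = 0.4, 0.2, 1.0, 2.0
x, z = 1.0, 1.2
w = (z * math.exp(rho ** 2 * T) - x) / (math.exp(rho ** 2 * T) - 1.0)

V0 = ((x - w) ** 2 * math.exp(-rho ** 2 * T)
      + lam * rho ** 2 / 4.0 * T ** 2
      - lam / 2.0 * (rho ** 2 * T - math.log(sigma ** 2 / (math.pi * lam))) * T
      - (w - z) ** 2)
V_cl0 = (x - w) ** 2 * math.exp(-rho ** 2 * T) - (w - z) ** 2

n = 200000
entropy_integral = sum(
    0.5 * math.log(2.0 * math.pi * math.e * lam / (2.0 * sigma ** 2)
                   * math.exp(rho ** 2 * (T - (k + 0.5) * T / n)))
    for k in range(n)) * (T / n)

print("numerical cost :", V0 + lam * entropy_integral - V_cl0)
print("lambda*T/2     :", lam * T / 2.0)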


Page 62: Continuous-time mean-variance portfolio selection: …ims.nus.edu.sg/events/2019/qfinance/files/haoran.pdfRelated Works I Exploration versus exploitation in reinforcement learning:

Effect and Cost of Exploration

Vanishing Exploration

I Assume that either the exploratory or the classical MV problem is solvable
I Then, for each (t, x, w) ∈ [0, T] × R × R,

lim_{λ→0} π∗(·; t, x; w) = δ_{u∗(t,x;w)}(·) weakly

I Moreover (illustrated numerically below),

lim_{λ→0} |V(t, x; w) − V^cl(t, x; w)| = 0
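
A small numerical illustration of the weak-convergence statement, under placeholder parameter values: as λ shrinks, the optimal exploratory policy puts vanishing mass outside any ε-ball around the classical control u∗(t, x; w).

```python
import numpy as np
from scipy.stats import norm

# Placeholder values; u_star is the classical optimal control at (t, x).
rho, sigma, T, t = 0.4, 0.2, 1.0, 0.0
x, w, eps = 1.0, 2.0, 1e-2
u_star = -(rho / sigma) * (x - w)

for lam in [1.0, 1e-1, 1e-2, 1e-4]:
    std = np.sqrt(lam / (2.0 * sigma**2) * np.exp(rho**2 * (T - t)))
    # Mass that pi* puts outside the eps-ball around its mean u*(t, x; w);
    # it vanishes as lam -> 0, i.e. pi* converges weakly to the Dirac at u*.
    tail_mass = 2.0 * norm.sf(eps / std)
    print(lam, u_star, tail_mass)
```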

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 23 / 42


The EMV Algorithm

RL Algorithm for MV

I We design an implementable reinforcement learning algorithm, the EMV algorithm, for solving the exploratory MV problem
I The EMV algorithm consists of three concurrently ongoing processes
    I policy evaluation
    I policy improvement
    I a self-correcting scheme for learning the Lagrange multiplier w based on stochastic approximation
I The policy improvement process relies on a provable Policy Improvement Theorem (PIT) for the continuous-time relaxed stochastic control problem
I A convergence result will also be provided

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 24 / 42


The EMV Algorithm

Policy Improvement Theorem

I A policy improvement theorem is an essential component of most RL algorithms, as it indicates how one should update the policy based on the currently learned value functions
I It has been proved for discrete-time RL problems with and without entropy regularization, and for continuous-time classical stochastic control problems
I We provide a PIT for the continuous-time MV problem with both control relaxation and entropy regularization
I Let π(·; t, x, w) be an arbitrary admissible feedback policy
I Define the value function under π as

V^π(s, y; w) = E[(X^π_T − w)² + λ ∫_s^T ∫_R π_t(u) ln π_t(u) du dt | X^π_s = y] − (w − z)²,

for (s, y) ∈ [0, T) × R

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 25 / 42


The EMV Algorithm

Policy Improvement Theorem (cont’d)

I Suppose that the following conditions hold
    I V^π(t, x; w) ∈ C^{2,1}([0, T) × R) ∩ C^0([0, T] × R)
    I V^π_{xx}(t, x; w) > 0 for any (t, x) ∈ [0, T) × R
    I the feedback policy

    π̃(u; t, x, w) = N(u | −(ρ/σ) V^π_x(t, x; w) / V^π_{xx}(t, x; w), λ / (σ² V^π_{xx}(t, x; w)))

    is admissible
I Then the improved policy π̃ satisfies

V^{π̃}(t, x; w) ≤ V^π(t, x; w), (t, x) ∈ [0, T] × R

I The PIT indicates that one can focus on the Gaussian family of policies w.l.o.g.
I The Gaussian family is closed under the policy improvement update (see the sketch below)

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 26 / 42
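
A minimal sketch of the policy-improvement map in the theorem above. The function name and the numerical inputs are hypothetical; as a sanity check, feeding in the derivatives of V^cl(t, x; w) = (x − w)² e^{−ρ²(T−t)} − (w − z)² recovers the optimal exploratory policy from the earlier theorem, illustrating that the Gaussian family is closed under the update.

```python
import numpy as np

# Policy-improvement map from the theorem: given V_x and V_xx of the current
# value function at (t, x), return the mean and variance of the improved
# Gaussian policy. Here rho, sigma, lam are treated as known inputs; the
# function name and the numbers below are illustrative only.
def improved_policy(V_x, V_xx, rho, sigma, lam):
    assert V_xx > 0.0, "the PIT requires V_xx > 0"
    mean = -(rho / sigma) * V_x / V_xx
    var = lam / (sigma**2 * V_xx)
    return mean, var

# Sanity check: plug in the derivatives of V^cl(t, x; w).
rho, sigma, lam, T, t = 0.4, 0.2, 2.0, 1.0, 0.0
x, w = 1.0, 2.0
V_x = 2.0 * (x - w) * np.exp(-rho**2 * (T - t))
V_xx = 2.0 * np.exp(-rho**2 * (T - t))
print(improved_policy(V_x, V_xx, rho, sigma, lam))
# mean = -(rho/sigma)(x - w), var = lam/(2 sigma^2) e^{rho^2 (T - t)}: the
# optimal exploratory policy, so the Gaussian family is closed under the update.
```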


The EMV Algorithm

Convergence

I A convergence result is provable if one
    I applies the PIT to update the policy
    I selects the initial feedback policy π_0 wisely
I Let π_0(u; t, x, w) = N(u | a(x − w), c_1 e^{c_2(T−t)}), with a, c_2 ∈ R and c_1 > 0
I Denote by π_n(u; t, x, w), n ≥ 1, (t, x) ∈ [0, T] × R, the sequence of feedback policies updated by the PIT
I Denote by V^{π_n}(t, x; w), n ≥ 1, (t, x) ∈ [0, T] × R, the sequence of the corresponding value functions
I Then

lim_{n→∞} π_n(·; t, x, w) = π∗(·; t, x, w) weakly,

and

lim_{n→∞} V^{π_n}(t, x; w) = V(t, x; w),

for any (t, x) ∈ [0, T] × R

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 27 / 42


The EMV Algorithm

EMV Algorithm

I The theoretical results provide a good starting point in the policy space, and an update scheme for later policies
I Policy evaluation amounts to estimating the value function of a given feedback policy
I We follow Doya (2000) and minimize the (squared) continuous-time Bellman's error (a.k.a. temporal-difference error)

δ_t := V̇^π_t + λ ∫_R π_t(u) ln π_t(u) du,

where V̇^π_t = (V^π(X_{t+∆t}, t+∆t) − V^π(X_t, t)) / ∆t is the total derivative and ∆t is the discretization step for the learning algorithm (see the sketch below).
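
A sketch of the discretized Bellman error on a single observed transition, assuming a placeholder candidate value function and a known policy entropy (for a Gaussian policy, ∫ π_t(u) ln π_t(u) du = −H(π_t)). Function and variable names are illustrative, not from the paper.

```python
import numpy as np

# Discretized Bellman (temporal-difference) error on one transition.
def td_error(V, t, x_t, x_next, dt, policy_entropy_t, lam):
    V_dot = (V(x_next, t + dt) - V(x_t, t)) / dt   # estimate of the total derivative
    return V_dot - lam * policy_entropy_t          # delta_t = V_dot + lam * int pi ln pi

T, w = 1.0, 2.0
V = lambda x, t: (x - w)**2 * np.exp(-0.16 * (T - t))   # placeholder candidate value function
H_t = 0.5 + 0.2 * T                                     # placeholder policy entropy at t = 0
print(td_error(V, t=0.0, x_t=1.0, x_next=1.01, dt=1.0 / 252,
               policy_entropy_t=H_t, lam=2.0))
```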

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 28 / 42


The EMV Algorithm

EMV Algorithm (cont’d)

I Using samples, we aim to minimize

C(θ, φ) = (1/2) ∑_{(t_i, x_i) ∈ D} (V̇^θ(t_i, x_i) + λ ∫_R π^φ_{t_i}(u) ln π^φ_{t_i}(u) du)² ∆t

for parametrized V^θ(t, x) and π^φ_t(u)
I The policy π^φ is Gaussian, with variance of the form c_1 e^{c_2(T−t)} and entropy H(π^φ_t) = φ_1 + φ_2(T − t)
I The value function takes the form

V^θ(t, x) = (x − w)² e^{−θ_3(T−t)} + θ_2 t² + θ_1 t + θ_0

I Apply stochastic gradient descent, with easily computable gradients ∇_θ C and ∇_φ C (a sample-based sketch of C follows below)
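
A sample-based sketch of C(θ, φ) under the parametrizations above, computed along one placeholder wealth path D. In the actual EMV algorithm the gradients ∇_θ C and ∇_φ C are computed analytically; this function only illustrates how the objective is assembled.

```python
import numpy as np

# V^theta(t, x) = (x - w)^2 e^{-theta3 (T - t)} + theta2 t^2 + theta1 t + theta0,
# entropy H(pi^phi_t) = phi1 + phi2 (T - t), hence int pi ln pi du = -(phi1 + phi2 (T - t)).
# D is one sampled wealth path of (t_i, x_i) pairs; the path below is fake data.
def C(theta, phi, D, lam, T, w, dt):
    theta0, theta1, theta2, theta3 = theta
    phi1, phi2 = phi
    V = lambda t, x: ((x - w)**2 * np.exp(-theta3 * (T - t))
                      + theta2 * t**2 + theta1 * t + theta0)
    total = 0.0
    for (t_i, x_i), (t_j, x_j) in zip(D[:-1], D[1:]):
        V_dot = (V(t_j, x_j) - V(t_i, x_i)) / dt
        bellman_err = V_dot - lam * (phi1 + phi2 * (T - t_i))
        total += 0.5 * bellman_err**2 * dt
    return total

dt, T, w, lam = 1.0 / 252, 1.0, 2.0, 2.0
D = [(k * dt, 1.0 + 0.001 * k) for k in range(253)]   # placeholder path
print(C((0.0, 0.1, 0.05, 0.16), (0.5, 0.2), D, lam, T, w, dt))
```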

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 29 / 42


The EMV Algorithm

EMV Algorithm (cont’d)

I The PIT indicates the policy improvement as

π(u; t, x, w) = N(u | −(ρ/σ)(x − w), (λ/(2σ²)) e^{θ_3(T−t)})
             = N(u | −√(2φ_2/(λπ)) e^{(2φ_1−1)/2} (x − w), (1/(2π)) e^{2φ_2(T−t) + 2φ_1 − 1})

I Finally, we provide a self-correcting scheme for learning the Lagrange multiplier w based on stochastic approximation:

w_{n+1} = w_n − α_n ((1/N) ∑_{k=1}^N x^k_T − z),

with α_n > 0, n ≥ 1, being the learning rate (both updates are sketched below)
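
Minimal sketches of the two updates on this slide: the improved Gaussian policy expressed through the learned (φ_1, φ_2), and the stochastic-approximation step for w. Numerical values and the surrounding loop are placeholders.

```python
import numpy as np

def policy_from_phi(phi1, phi2, t, x, w, lam, T):
    """Improved Gaussian policy written in terms of the learned (phi1, phi2)."""
    mean = -np.sqrt(2.0 * phi2 / (lam * np.pi)) * np.exp(phi1 - 0.5) * (x - w)
    var = np.exp(2.0 * phi2 * (T - t) + 2.0 * phi1 - 1.0) / (2.0 * np.pi)
    return mean, var

def update_w(w, terminal_wealths, z, alpha):
    """Self-correcting stochastic-approximation step for the Lagrange multiplier."""
    return w - alpha * (np.mean(terminal_wealths) - z)

# Hypothetical usage inside a training loop
lam, T, z = 2.0, 1.0, 1.4
w, alpha = 2.0, 0.05
print(policy_from_phi(phi1=0.5, phi2=0.2, t=0.0, x=1.0, w=w, lam=lam, T=T))
terminal_wealths = np.random.normal(1.3, 0.2, size=10)   # placeholder X_T samples
print(update_w(w, terminal_wealths, z, alpha))
```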

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 30 / 42


Large Scale Empirical Tests

Large scale empirical study

I Data and methods
    I Portfolio consists of a large number of stocks (d = 20, 50, 75, 100, etc.)
    I Use price data of all S&P 500 stocks
    I Compare EMV with econometric methods (Black-Litterman, Fama-French, Markowitz, etc.) and a deep learning based method – deep deterministic policy gradient (DDPG)
I Our tests include
    I two training and testing methods — batch and universal methods
    I long and medium investment horizons — 10 years and 1 year
    I monthly and daily rebalancing frequencies
    I with and without leverage constraints

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 31 / 42


Large Scale Empirical Tests

Training methods

I Seed composition: randomly select d stocks from the S&P 500 (with repetition) to compose 100 seeds (a sketch follows below)
    I Each seed consists of training data, followed by testing data
I Universal training and testing
    I Randomly select 1 seed for each episode during training
    I Randomly select 100 seeds for testing, then average
    I This method artificially generates randomness and tests the universal applicability of the algorithms
I Batch (off-line) RL training and testing
    I Use 1 seed for all episodes during training
    I Use the same seed for testing
    I Repeat over 100 seeds, then average
I Rolling-horizon based training and testing
    I One-period-ahead testing data is added to the training set progressively
    I The most obsolete training data is discarded
    I Adopted only by the econometric methods, for competitive performance
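
A rough sketch of the seed-composition step, assuming a placeholder ticker universe; the actual data pipeline (price loading, train/test splitting) is not shown on the slides.

```python
import random

# Each seed is a portfolio of d tickers drawn with repetition; its price
# history would then be split into training data followed by testing data.
def make_seeds(sp500_tickers, d, n_seeds=100, rng_seed=0):
    rng = random.Random(rng_seed)
    return [rng.choices(sp500_tickers, k=d) for _ in range(n_seeds)]

def pick_training_seed(seeds, universal, rng):
    # Universal training: a fresh randomly chosen seed every episode.
    # Batch (off-line) training: the same single seed for all episodes.
    return rng.choice(seeds) if universal else seeds[0]

# Hypothetical usage
tickers = ["AAPL", "MSFT", "XOM", "JPM", "PG"]          # placeholder universe
seeds = make_seeds(tickers, d=3, n_seeds=100)
episode_seed = pick_training_seed(seeds, universal=True, rng=random.Random(1))
print(episode_seed)
```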

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 32 / 42


Large Scale Empirical Tests

Test I: monthly rebalancing

I Training data: monthly price data over 08-31-1990 to 08-31-2000
I Testing data: monthly price data over 09-29-2000 to 09-30-2010
I Investment horizon: 10 years
I Trading frequency: monthly
I Normalized initial wealth: x_0 = 1
I Target return: 23% annually (i.e. z = 8)
I Gross leverage constraints
    I L = ∞, 100%, 150%, 200% for EMV
    I L = 200% for DDPG
    I L = ∞ for all the other econometric methods

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 33 / 42


Large Scale Empirical Tests

Test I: monthly rebalancing – Universal method

Figure 1: Investment performance comparison for a 10-year horizon with monthly rebalancing (d = 20).

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 34 / 42


Large Scale Empirical Tests

Test I: monthly rebalancing – Batch RL method

Figure 2: Investment performance comparison for a 10-year horizon with monthly rebalancing (d = 20).

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 35 / 42


Large Scale Empirical Tests

Test I: monthly rebalancing – Universal vs. Batch RL

Figure 3: Investment performance comparison for a 10-year horizon with monthly rebalancing (d = 20).

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 36 / 42


Large Scale Empirical Tests

Test II: daily rebalancing

I Training data: daily price data over 01-09-2017 to 01-08-2018
I Testing data: daily price data over 01-09-2018 to 01-09-2019
I Investment horizon: 1 year
I Trading frequency: daily
I Normalized initial wealth: x_0 = 1
I Target return: 40% annually
I Gross leverage constraints
    I L = 100%, 150%, 200% for EMV
    I L = 200% for DDPG

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 37 / 42


Large Scale Empirical Tests

Test II: daily rebalancing (cont’d)

Figure 4: Investment performance comparison for a 1-year horizon with daily rebalancing (d = 50).

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 38 / 42


Large Scale Empirical Tests

Scaling up d – monthly rebalancing

Table 1: Annualized return (with Sharpe ratio) and training time corresponding to different d for a 10-year horizon with monthly rebalancing.

                               d = 20                         d = 60                         d = 100
EMV (L=200%)                   10.8% (0.797), 0.31 hrs        11.2% (1.323), 4.34 hrs        6.3% (1.627), 1.53 hrs [1]
DDPG (L=200%, unannualized)    −300.1% (−0.411), 4.23 hrs     476.3% (0.359), 5.32 hrs       −653.4% (−0.432), 6.68 hrs

[1] Only 1000 episodes were trained, while the training episodes for the other experiments were 20000.

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 39 / 42


Large Scale Empirical Tests

Scaling up d – daily rebalancing

Table 2: Annualized return (with Sharpe ratio) and training time corresponding to different d for a 1-year horizon with daily rebalancing.

                    d = 50                         d = 75                         d = 100
EMV (L=200%)        44.9% (1.347), 1.36 hrs        33.0% (1.370), 5.54 hrs        17.9% (1.124), 1.63 hrs [2]
DDPG (L=200%)       −189.6% (−0.096), 6.20 hrs     −27.9% (−0.012), 8.45 hrs      −640.6% (−0.219), 14.42 hrs

[2] Only 1000 episodes were trained, while the training episodes for the other experiments were 20000.

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 40 / 42


Conclusions

Main Contributions

I An entropy-regularized reward function, and an "exploratory formulation" for the state dynamics
I A complete analysis of the MV problem in continuous time
I The optimal control distribution for balancing exploitation and exploration is Gaussian, with time-decaying variance
I Perfect separation between exploitation and exploration
I A more random environment contains more learning opportunities
I The cost of exploration is characterized
I Interpretability based on a provable policy improvement theorem
I The EMV algorithm outperforms both the econometric methods and the deep learning method by large margins

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 41 / 42


Conclusions

Thank you.

Haoran Wang (Columbia University) Exploratory MV and Reinforcement Learning 42 / 42