Dynamic Replication and Hedging: A Reinforcement Learning Approach

Petter Kolm
Courant Institute, NYU
[email protected]
https://www.linkedin.com/in/petterkolm

Published in Kolm and Ritter (2019), “Dynamic Replication and Hedging: A Reinforcement Learning Approach,” The Journal of Financial Data Science, Winter 2019, 1 (1), pp. 159-171.

Matlab Computational Finance Conference
October 15, 2019

1 / 40



Outline

Background & motivation
  Machine learning in finance
  Replication & hedging
  What we do

Reinforcement learning for hedging
  Brief introduction to RL
  Automatic hedging in theory
  Automatic hedging in practice
  Examples

Conclusions

2 / 40

Background & motivation

3 / 40

Machine learning in finance I

Adoption of ML has been rapid in the financial industry. Why?
- The data
- The algos & tools
- The infrastructure for data storage & computation

Differences today versus three years ago:
- Many use-cases are available now
- New technology is constantly being adopted
  - Robo-advisors, blockchains, algo & systematic trading. Many hedge funds use the Amazon S3 cloud, etc.
- Alternative data & its handling/preparation is better understood
  - A shift towards being incorporated into existing models
  - Common issues: debiasing, tagging & paneling

4 / 40

Machine learning in finance II

Buyer beware:
- Testing models: in- versus out-of-sample
- Where are your algos and data from?
- “Black box” risk on steroids
- Is your data biased? Are your algos correct? Do you know what they do? Did you implement them yourself?
- Data science & ML are not cure-alls (who knew?)

What do we want in the financial industry?
- Financial domain expertise is often critical
- More business use cases
- Interpretable ML

5 / 40

Machine learning for trading I

While the role of trading has not changed, modern ML allows us to:
- Create more realistic models
- Build models that adapt to changing market conditions
- Solve the models faster

6 / 40

Machine learning for trading II

Examples of ML tools that are being leveraged for trading:
- LASSO-based techniques yield very fast, derivative-free solvers for single- and multi-period problems
- Bayesian learning and Gaussian processes used for building forecasts (forecasting distributions) and dealing with estimation/model risk
- Unsupervised learning techniques for building statistical risk models
- Regime-switching models for dealing with different market environments (e.g. risk on/off)
- Mixed distributions to model skewed distributions and tail behavior
- Bootstrap and cross-validation techniques for backtesting

7 / 40

Machine learning for trading III

- Reinforcement learning (often tree- or NN-based models) for complex trading/planning problems in the presence of uncertainty (where the value function is not easily obtainable)
- NLP, LDA & extensions, ICA for analysis of news, filings and reports

8 / 40

Replication & hedging

- Replicating and hedging an option position is fundamental in finance.
- The core idea of the seminal work by Black-Scholes-Merton (BSM): In a complete and frictionless market there is a continuously rebalanced dynamic trading strategy in the stock and riskless security that perfectly replicates the option (Black and Scholes (1973), Merton (1973)).
- In practice, continuous trading of arbitrarily small amounts of stock is infinitely costly, and the replicating portfolio is adjusted at discrete times.
  - Perfect replication is impossible, and an optimal hedging strategy will depend on the desired trade-off between replication error and trading costs.

9 / 40

Related work I

While a number of articles have considered hedging in discrete time or transaction costs alone:
- Leland (1985) was first to address discrete hedging under transaction costs. His work was subsequently followed by others.¹ The majority of these studies treat proportionate transaction costs.
- More recently, several studies have considered option pricing and hedging subject to both permanent and temporary market impact, in the spirit of Almgren and Chriss (1999), including Rogers and Singh (2010), Almgren and Li (2016), Bank, Soner and Voß (2017), and Saito and Takahashi (2017).
- Halperin (2017) applies reinforcement learning to options, but the approach is specific to the BSM model and does not consider transaction costs.

10 / 40

Related work II

- Buehler et al. (2018) evaluate NN-based hedging under coherent risk measures subject to proportional transaction costs.

¹See, for example, Figlewski (1989), Boyle and Vorst (1992), Henrotte (1993), Grannan and Swindle (1996), Toft (1996), Whalley and Wilmott (1997), and Martellini (2000).

11 / 40

What we do

In the article we:
- Show how to build a reinforcement learning (RL) system which can learn how to optimally hedge an option (or other derivative securities) in a fully realistic setting:
  - Discrete time
  - Nonlinear transaction costs
  - Round-lotting
- Method allows the user to “plug in” any option pricing and simulation library and then train the system with no further modifications.
- The system learns how to optimally trade off trading costs versus hedging variance for that security:
  - Uses a continuous state space
  - Relies on nonlinear regression techniques applied to the “sarsa targets” derived from the Bellman equation
- Method extends in a straightforward way to arbitrary portfolios of derivative securities.

12 / 40

Reinforcement learning for hedging

13 / 40

Brief introduction to RL I

- The RL agent interacts with its environment. The “environment” is the part of the system outside of the agent's direct control.
- At each time step t, the agent observes the current state of the environment s_t and chooses an action a_t from the action set.
- This choice influences both the transition to the next state as well as the reward R_t the agent receives.

[Diagram: the RL agent sends an action to the environment; the environment returns the next state and a reward to the agent.]

14 / 40

Brief introduction to RL II

- A policy π is a way of choosing an action a_t conditional on the current state s_t.
- RL is the search for policies which maximize the expectation of the cumulative reward G_t,

  E[G_t] = E[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...]

  where γ is a discount factor (such that the infinite sum converges).
- Mathematically speaking, RL is a way to solve multi-period optimal control problems.
- Standard texts on RL include Sutton and Barto (2018) and Szepesvári (2010).

15 / 40

Brief introduction to RL III

- The action-value function expresses the value of starting in state s, taking an arbitrary action a, and then following policy π thereafter:

  q_π(s, a) = E_π[G_t | S_t = s, A_t = a]   (1)

  where E_π denotes the expectation under the assumption that policy π is followed.
- If we knew the q-function corresponding to the optimal policy, q*, we would know the optimal policy itself, namely: we choose a in the action set that maximizes q*(s_t, a).
  - This is called the greedy policy.
- Hence the problem is reduced to finding q*, or producing a sequence of iterates that converges to q*.
- Methods for producing those iterates are based on the Bellman equations.

16 / 40
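To connect the last two bullets with the “sarsa targets” mentioned on the “What we do” slide, here is a minimal sketch, not the paper's implementation, of acting greedily with respect to an estimated q-function and refitting it by nonlinear regression on sarsa targets R + γ·q(s′, a′). The random-forest regressor, feature layout, and action grid are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's code):
# approximate q(s, a) with a nonlinear regressor fit to sarsa targets.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

GAMMA = 0.99                        # discount factor
ACTIONS = np.arange(-100, 101)      # candidate trades in shares (assumed grid)

def greedy_action(q_model, state):
    """Greedy policy: the action maximizing the current q-estimate in `state`."""
    X = np.column_stack([np.tile(state, (len(ACTIONS), 1)), ACTIONS])
    return int(ACTIONS[np.argmax(q_model.predict(X))])

def refit_q(q_model, states, actions, rewards, next_states, next_actions,
            is_fitted=True):
    """One sweep of regression on the sarsa targets R_{t+1} + gamma * q(s', a')
    implied by the Bellman equation."""
    X = np.column_stack([states, actions])
    if is_fitted:
        X_next = np.column_stack([next_states, next_actions])
        targets = rewards + GAMMA * q_model.predict(X_next)
    else:                            # first sweep: no q-estimate to bootstrap from
        targets = rewards
    q_model.fit(X, targets)
    return q_model

q_model = RandomForestRegressor(n_estimators=200, random_state=0)
```

In practice one would alternate between generating episodes with an exploration policy (e.g. ε-greedy) and refitting q on the accumulated transitions.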

A visual example

YouTube example

17 / 40

Supervised learning vs RL

[Diagram: in supervised learning, inputs map to outputs and the training signals ≡ target outputs from the training set; in reinforcement learning, states map to actions and the training signals ≡ rewards.]

18 / 40

RL has a feedback loop

[Diagram: the reinforcement learner maps states to actions; actions feed back into the environment, which returns new states and rewards (training signals ≡ rewards).]

19 / 40

Is RL worth the trouble?

RL is often harder than supervised learning:
- Joint learning and planning/optimizing from correlated samples
- The distribution of the data changes with the choice of actions
- Need access to the environment for training

So when may it be a good idea to use RL?
- Data comes in the form of trajectories (i.e. non-IID sequences)
- We need to make a sequence of decisions (i.e. non-IID decisions)
- We are able to observe feedback to state or choice of actions. This information can be partial and/or noisy
- There is a gain to be made by optimizing action choice over a portion of the trajectory (i.e. time consistency needed for the Bellman equation to hold)
- Other ML techniques cannot easily deal with these situations

20 / 40

Common challenges when solving RL problems

- Specifying the model:
  - Representing the state
  - Choosing the set of actions
  - Designing the reward
- Acquiring data for training:
  - Exploration / exploitation
  - High-cost actions
  - Time-delayed reward
- Function approximation (random forests, CNNs, ...)
- Validation, confidence measures

21 / 40

Automatic hedging in theory I

- We define automatic hedging to be the practice of using trained RL agents to handle hedging.
- With no trading frictions and where continuous trading is possible, there may be a dynamic replicating portfolio which hedges the option position perfectly, meaning that the overall portfolio (option minus replication) has zero variance.
- With frictions and where only discrete trading is possible, the goal becomes to minimize variance and cost.
  - We will use this to define the reward.

22 / 40

Automatic hedging in theory II

- This suggests we can seek the agent's optimal portfolio as the solution to a mean-variance optimization problem with risk-aversion κ:

  max ( E[w_T] − (κ/2) V[w_T] )   (2)

  where the final wealth w_T is the sum of individual wealth increments δw_t,

  w_T = w_0 + Σ_{t=1}^{T} δw_t

- We will let wealth increments include trading costs.

23 / 40

Automatic hedging in theory III

- In the random-walk case this leads to solving

  min over permissible strategies of  Σ_{t=0}^{T} ( E[−δw_t] + (κ/2) V[δw_t] )   (3)

  where −δw_t = c_t, and c_t is the total trading cost paid in period t (including commissions, bid-offer spread cost, market impact cost, and other sources of slippage).
- With an appropriate choice of reward function, the problem of maximizing this mean-variance objective can be recast as an RL problem.

24 / 40

Automatic hedging in theory IV

- We choose the reward in each period to be²

  R_t = δw_t − (κ/2) (δw_t)²   (4)

- Thus, training reinforcement learners with this kind of reward function amounts to training automatic hedgers who trade off costs versus hedging variance.

²See Ritter (2017) for a general discussion of reward functions in trading.

25 / 40
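As a direct transcription of equation (4), here is a minimal sketch of the per-period reward; the function names are mine, and the wealth increment is assumed to already net out trading costs, as the previous slide states.

```python
KAPPA = 0.1   # risk-aversion; matches the value used in the examples later

def wealth_increment(option_pnl: float, stock_pnl: float, trading_cost: float) -> float:
    """delta_w_t for the hedged book, with trading costs netted out."""
    return option_pnl + stock_pnl - trading_cost

def period_reward(delta_w: float, kappa: float = KAPPA) -> float:
    """Equation (4): R_t = delta_w_t - (kappa / 2) * delta_w_t ** 2."""
    return delta_w - 0.5 * kappa * delta_w ** 2

# Example period: option and stock P&L nearly offset; costs reduce the increment.
r = period_reward(wealth_increment(option_pnl=-3.2, stock_pnl=3.0, trading_cost=0.4))
```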

Automatic hedging in practice I

- Simplest possible example: A European call option with strike price K and expiry T on a non-dividend-paying stock.
- We take the strike and maturity as fixed, exogenously-given constants. For simplicity, we assume the risk-free rate is zero.
- The agent we train will learn to hedge this specific option with this strike and maturity. It is not being trained to hedge any option with any possible strike/maturity.
- For European options, the state must minimally contain (1) the current price S_t of the underlying, (2) the time τ = T − t > 0 remaining to expiry, and (3) our current position of n shares.
- The state is thus naturally an element of³

  S = R²₊ × Z = {(S, τ, n) | S > 0, τ > 0, n ∈ Z}

26 / 40
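A minimal sketch of this state as a data structure, purely for illustration (the paper does not prescribe any particular encoding; the class and field names are assumptions):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HedgeState:
    """State (S, tau, n): spot price, time to expiry, and current share position."""
    spot: float    # S > 0, current price of the underlying
    tau: float     # tau = T - t > 0, time remaining to expiry
    shares: int    # n in Z, current stock position in shares

# e.g. mid-episode: spot at 100, 10 trading days left (in year units), short 52 shares
state = HedgeState(spot=100.0, tau=10 / 252, shares=-52)
```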

Automatic hedging in practice II

- The state does not need to contain the option Greeks, because they are (nonlinear) functions of the variables the agent has access to via the state.
  - We expect the agent to learn such nonlinear functions on its own.
- A key point: This has the advantage of not requiring any special model-specific calculations that may not extend beyond BSM models.

³If the option is American, then it may be optimal to exercise early just before an ex-dividend date. In this situation the state must be augmented with one additional variable: the size of the anticipated dividend in period t + 1.

27 / 40

Let's put RL at a disadvantage

- The RL agent is at a disadvantage: It does not know any of the following information:
  - the strike price K
  - that the stock price process is a geometric Brownian motion (GBM)
  - the volatility of the price process
  - the BSM formula
  - the payoff function (S − K)⁺ at maturity
  - any of the Greeks
- Thus, it must infer the relevant information, insofar as it affects the value function, by interacting with a simulated environment.

28 / 40

Simulation assumptions I

- We simulate a discrete BSM world where the stock price process is a geometric Brownian motion (GBM) with initial price S_0 and daily lognormal volatility of σ_day.
- We consider an initially at-the-money European call option (struck at K = S_0) with T days to maturity.
- We discretize time with D periods per day; hence each “episode” has T · D total periods.
- We require trades (hence also holdings) to be integer numbers of shares.
- We assume that our agent's job is to hedge one contract of this option.
- In the specific examples below, the parameters are σ = 0.01, S_0 = 100, T = 10, and D = 5. We set the risk-aversion κ = 0.1.

29 / 40
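Here is a minimal sketch of a price simulator consistent with these assumptions (a zero-drift GBM path with D periods per day and daily lognormal volatility σ_day); it is an illustration, not the paper's simulator.

```python
import numpy as np

def simulate_price_path(S0=100.0, sigma_day=0.01, T_days=10, D=5, seed=None):
    """One episode: a zero-drift GBM path with T_days * D periods,
    daily lognormal volatility sigma_day (risk-free rate assumed zero)."""
    rng = np.random.default_rng(seed)
    n_steps = T_days * D
    vol = sigma_day / np.sqrt(D)                       # per-period volatility
    log_rets = rng.normal(-0.5 * vol ** 2, vol, size=n_steps)
    return S0 * np.exp(np.concatenate(([0.0], np.cumsum(log_rets))))

path = simulate_price_path(seed=42)                    # length T_days * D + 1 = 51
```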

Simulation assumptions II

- T-costs: For a trade size of n shares we define

  cost(n) = multiplier × TickSize × (|n| + 0.01 n²)

  where we take TickSize = 0.1. With multiplier = 1, the term TickSize × |n| represents a cost, relative to the midpoint, of crossing a bid-offer spread that is two ticks wide. The quadratic term is a simplistic model for market impact. (See the code sketch after this slide.)

- All of the examples were trained on a single CPU, and the longest training time allowed was one hour.
- Baseline agent = RL agent trained in a frictionless world.

30 / 40
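The transaction-cost function above transcribes directly into code; a minimal sketch, with defaults following the slide:

```python
def trading_cost(n_shares: int, tick_size: float = 0.1, multiplier: float = 1.0) -> float:
    """cost(n) = multiplier * TickSize * (|n| + 0.01 * n^2).
    The linear term reflects crossing a two-tick-wide bid-offer spread
    (relative to the midpoint); the quadratic term is a simple impact model."""
    return multiplier * tick_size * (abs(n_shares) + 0.01 * n_shares ** 2)

# e.g. trading 100 shares costs 1.0 * 0.1 * (100 + 0.01 * 100**2) = 20.0 dollars
```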

Example: Baseline agent (discrete & no t-costs)

[Figure: time series over timestep (D·T) of delta hedge shares, option P&L, stock P&L, stock position (shares), and total P&L, in dollars or shares.]

Figure 1: Stock & options P&L roughly cancel to give the (relatively low variance) total P&L. The agent's position tracks the delta.

31 / 40

Example: Baseline agent (discrete & t-costs)

[Figure: time series over timestep (D·T) of cost P&L, delta hedge shares, option P&L, stock P&L, stock position (shares), and total P&L, in dollars or shares.]

Figure 2: Stock & options P&L roughly cancel to give the (relatively low variance) total P&L. The agent trades so that the position in the next period will be the quantity −100 · Δ, rounded to shares.

32 / 40
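For reference, the benchmark behavior described in the Figure 2 caption can be sketched as follows: rebalance so that next period's stock position equals −100 · Δ rounded to whole shares, with Δ the BSM call delta at zero rate. This is my own illustration, not the paper's code; σ and τ must be supplied in consistent units (e.g. annualized volatility and year fraction).

```python
import math

def bsm_call_delta(S: float, K: float, sigma: float, tau: float) -> float:
    """Delta of a European call under BSM with zero risk-free rate."""
    d1 = (math.log(S / K) + 0.5 * sigma ** 2 * tau) / (sigma * math.sqrt(tau))
    return 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))    # standard normal CDF

def delta_hedge_trade(S, K, sigma, tau, current_shares, contract_mult=100):
    """Shares to trade now so that next period's position is
    -contract_mult * Delta, rounded to a whole number of shares."""
    target = -round(contract_mult * bsm_call_delta(S, K, sigma, tau))
    return target - current_shares

# e.g. spot 100, strike 100, ~16% annualized vol, 8 trading days left, short 48 shares
trade = delta_hedge_trade(100.0, 100.0, 0.01 * math.sqrt(252), 8 / 252, -48)
```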

Example: T-cost aware agent (discrete & t-costs)

[Figure: time series over timestep (D·T) of cost P&L, delta hedge shares, option P&L, stock P&L, stock position (shares), and total P&L, in dollars or shares.]

33 / 40

Kernel density estimates of total cost & volatility

[Figure: kernel density estimates of total cost (left panel) and realized volatility (right panel) for the “delta” and “reinf” methods.]

Figure 3: Kernel density estimates for total cost (left panel) and volatility of total P&L (right panel) from N = 10,000 out-of-sample simulations. The “reinf” policy achieves much lower cost (t-statistic = −143.22) with no significant difference in volatility of total P&L.

34 / 40

Kernel density estimates of total P&L

[Figure: kernel density estimates of the t-statistic of total P&L for the “delta” and “reinf” methods.]

Figure 4: Kernel density estimates of the t-statistic of total P&L for each of our out-of-sample simulation runs, and for both policies represented above (“delta” and “reinf”). The “reinf” method is seen to outperform in the sense that the t-statistic is much more often close to zero and insignificant.

35 / 40

Conclusions I

- We have introduced an RL system that hedges an option under realistic conditions of discrete trading and nonlinear t-costs.
- The approach does not depend on the existence of perfect dynamic replication. The system learns to optimally trade off variance and cost as best as possible, using whatever securities it is given as potential candidates for inclusion in the replicating portfolio.
- It accomplishes this without the user providing any of the following information:
  - the strike price K
  - the fact that the stock price process is a geometric Brownian motion
  - the volatility of the price process
  - the Black-Scholes-Merton formula
  - the payoff function (S − K)⁺ at maturity
  - any of the Greeks

36 / 40

Conclusions II

- A key strength of the RL approach: It does not make any assumptions about the form of t-costs. RL learns the minimum-variance hedge subject to whatever t-cost function one provides. All it needs is a good simulator in which t-costs and options prices are simulated accurately.

37 / 40

References I

Almgren, Robert and Neil Chriss (1999). “Value under liquidation”. In: Risk 12.12, pp. 61–63.

Almgren, Robert and Tianhui Michael Li (2016). “Option hedging with smooth market impact”. In: Market Microstructure and Liquidity 2.1, p. 1650002.

Bank, Peter, H. Mete Soner, and Moritz Voß (2017). “Hedging with temporary price impact”. In: Mathematics and Financial Economics 11.2, pp. 215–239.

Black, Fischer and Myron Scholes (1973). “The pricing of options and corporate liabilities”. In: Journal of Political Economy 81.3, pp. 637–654.

Boyle, Phelim P. and Ton Vorst (1992). “Option replication in discrete time with transaction costs”. In: The Journal of Finance 47.1, pp. 271–293.

Buehler, Hans et al. (2018). “Deep hedging”. In: arXiv:1802.03042.

Figlewski, Stephen (1989). “Options arbitrage in imperfect markets”. In: The Journal of Finance 44.5, pp. 1289–1311.

Grannan, Erik R. and Glen H. Swindle (1996). “Minimizing transaction costs of option hedging strategies”. In: Mathematical Finance 6.4, pp. 341–364.

Halperin, Igor (2017). “QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds”. In: arXiv:1712.04609.

Henrotte, Philippe (1993). “Transaction costs and duplication strategies”. In: Graduate School of Business, Stanford University.

38 / 40

References II

Kolm, Petter and Gordon Ritter (2019). “Dynamic Replication and Hedging: A Reinforcement Learning Approach”. In: The Journal of Financial Data Science 1.1, pp. 159–171.

Leland, Hayne E. (1985). “Option pricing and replication with transactions costs”. In: The Journal of Finance 40.5, pp. 1283–1301.

Martellini, Lionel (2000). “Efficient option replication in the presence of transactions costs”. In: Review of Derivatives Research 4.2, pp. 107–131.

Merton, Robert C. (1973). “Theory of rational option pricing”. In: The Bell Journal of Economics and Management Science, pp. 141–183.

Ritter, Gordon (2017). “Machine Learning for Trading”. In: Risk 30.10, pp. 84–89.

Rogers, Leonard C.G. and Surbjeet Singh (2010). “The cost of illiquidity and its effects on hedging”. In: Mathematical Finance 20.4, pp. 597–615.

Saito, Taiga and Akihiko Takahashi (2017). “Derivatives pricing with market impact and limit order book”. In: Automatica 86, pp. 154–165.

Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. Second edition in progress. MIT Press, Cambridge.

Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.

Toft, Klaus Bjerre (1996). “On the mean-variance tradeoff in option replication with transactions costs”. In: Journal of Financial and Quantitative Analysis 31.2, pp. 233–263.

39 / 40

References III

Whalley, A. Elizabeth and Paul Wilmott (1997). “An asymptotic analysis of an optimal hedging model for option pricing with transaction costs”. In: Mathematical Finance 7.3, pp. 307–324.

40 / 40


coherent risk measures subject to proportional transactioncosts

1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)

11 40

What we doIn the article we Show how to build a reinforcement learning (RL) system which

can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting

Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications

The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo

derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios

of derivative securities12 40

Reinforcement learning for hedging

13 40

Brief introduction to RL I

RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control

At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set

This choice influences both the transition to the next state aswell as the reward Rt the agent receives

Environment

Reward ActionState

RL Agent

14 40

Brief introduction to RL II A policy π is a way of choosing an action at conditional on

the current state st RL is the search for policies which maximize the expectation of

the cumulative reward Gt

E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]

where γ is discount factor (such that the infinite sumconverges)

Mathematically speaking RL is a way to solve multi-periodoptimal control problems

Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)

15 40

Brief introduction to RL III The action-value function expresses the value of starting in

state s taking an arbitrary action a and then following policyπ thereafter

qπ(s a) = Eπ[Gt | St = sAt = a] (1)

where Eπ denotes the expectation under the assumption thatpolicy π is followed

If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)

This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a

sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman

equations

16 40

A visual example

YouTube example

17 40

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 5: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Machine learning in finance IAdoption of ML has been rapid in the financial industry Why

The data The algos amp tools The infrastructure for data storage amp computation

Difference today versus 3 years ago Many use-cases are available now New technology constantly being adopted

Robo advisors block chains algo amp systematic trading Many hedge funds use the Amazon S3 cloud etc

Alternative data amp its handlingpreparation is betterunderstood A shift towards being incorporated into existing models Common issues Debiasing tagging amp paneling

4 40

Machine learning in finance IIBuyer beware Testing models In- versus out-of-sample Where are your algos and data from

ldquoBlack boxrdquo risk on steroids Is your data biased Are your algos correct Do you know what they do Did you

implement them yourself

Data science amp ML are not cure-alls (who knew)

What do we want in the financial indsutry Financial domain expertise often critical More business use cases Interpretable ML

5 40

Machine learning for trading I

While the role of trading has not changed modern ML allowsus to Create more realistic models Build models that adapt to changing market conditions Solve the models faster

6 40

Machine learning for trading IIExamples of ML tools that are being leveraged for trading LASSO-based techniques yield very fast derivative free solvers

for single- and multi-period problems Bayesian learning and Gaussian processes used for building

forecasts (forecasting distributions) and dealing withestimationmodel risk

Unsupervised learning techniques for building statistical riskmodels

Regime-switching models for dealing with different marketenvironments (eg risk onoff)

Mixed-distributions to model skewed distributions andtail-behavior

Bootstrap and cross-validation techniques for backtesting

7 40

Machine learning for trading III Reinforcement learning (often tree- or NN-based models) for

complex trading planning problems in the presence ofuncertainty (where the value function is not easily obtainable)

NLP LDA amp extensions ICA for analysis of news filings andreports

8 40

Replication amp hedging

Replicating and hedging an option position is fundamental infinance

The core idea of the seminal work by Black-Scholes-Merton(BSM) In a complete and frictionless market there is a continuously

rebalanced dynamic trading strategy in the stock and risklesssecurity that perfectly replicates the option (Black and Scholes(1973) Merton (1973))

In practice continuous trading of arbitrarily small amounts ofstock is infinitely costly and the replicating portfolio isadjusted at discrete times Perfect replication is impossible and an optimal hedging

strategy will depend on the desired trade-off betweenreplication error and trading costs

9 40

Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone

Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction

costs

More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)

Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs

10 40

Related work II Buehler et al (2018) evaluate NN-based hedging under

coherent risk measures subject to proportional transactioncosts

1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)

11 40

What we doIn the article we Show how to build a reinforcement learning (RL) system which

can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting

Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications

The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo

derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios

of derivative securities12 40

Reinforcement learning for hedging

13 40

Brief introduction to RL I

RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control

At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set

This choice influences both the transition to the next state aswell as the reward Rt the agent receives

Environment

Reward ActionState

RL Agent

14 40

Brief introduction to RL II A policy π is a way of choosing an action at conditional on

the current state st RL is the search for policies which maximize the expectation of

the cumulative reward Gt

E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]

where γ is discount factor (such that the infinite sumconverges)

Mathematically speaking RL is a way to solve multi-periodoptimal control problems

Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)

15 40

Brief introduction to RL III The action-value function expresses the value of starting in

state s taking an arbitrary action a and then following policyπ thereafter

qπ(s a) = Eπ[Gt | St = sAt = a] (1)

where Eπ denotes the expectation under the assumption thatpolicy π is followed

If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)

This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a

sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman

equations

16 40

A visual example

YouTube example

17 40

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 6: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Machine learning in finance IIBuyer beware Testing models In- versus out-of-sample Where are your algos and data from

ldquoBlack boxrdquo risk on steroids Is your data biased Are your algos correct Do you know what they do Did you

implement them yourself

Data science amp ML are not cure-alls (who knew)

What do we want in the financial indsutry Financial domain expertise often critical More business use cases Interpretable ML

5 40

Machine learning for trading I

While the role of trading has not changed modern ML allowsus to Create more realistic models Build models that adapt to changing market conditions Solve the models faster

6 40

Machine learning for trading IIExamples of ML tools that are being leveraged for trading LASSO-based techniques yield very fast derivative free solvers

for single- and multi-period problems Bayesian learning and Gaussian processes used for building

forecasts (forecasting distributions) and dealing withestimationmodel risk

Unsupervised learning techniques for building statistical riskmodels

Regime-switching models for dealing with different marketenvironments (eg risk onoff)

Mixed-distributions to model skewed distributions andtail-behavior

Bootstrap and cross-validation techniques for backtesting

7 40

Machine learning for trading III Reinforcement learning (often tree- or NN-based models) for

complex trading planning problems in the presence ofuncertainty (where the value function is not easily obtainable)

NLP LDA amp extensions ICA for analysis of news filings andreports

8 40

Replication amp hedging

Replicating and hedging an option position is fundamental infinance

The core idea of the seminal work by Black-Scholes-Merton(BSM) In a complete and frictionless market there is a continuously

rebalanced dynamic trading strategy in the stock and risklesssecurity that perfectly replicates the option (Black and Scholes(1973) Merton (1973))

In practice continuous trading of arbitrarily small amounts ofstock is infinitely costly and the replicating portfolio isadjusted at discrete times Perfect replication is impossible and an optimal hedging

strategy will depend on the desired trade-off betweenreplication error and trading costs

9 40

Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone

Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction

costs

More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)

Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs

10 40

Related work II Buehler et al (2018) evaluate NN-based hedging under

coherent risk measures subject to proportional transactioncosts

1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)

11 40

What we doIn the article we Show how to build a reinforcement learning (RL) system which

can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting

Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications

The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo

derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios

of derivative securities12 40

Reinforcement learning for hedging

13 40

Brief introduction to RL I

RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control

At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set

This choice influences both the transition to the next state aswell as the reward Rt the agent receives

Environment

Reward ActionState

RL Agent

14 40

Brief introduction to RL II A policy π is a way of choosing an action at conditional on

the current state st RL is the search for policies which maximize the expectation of

the cumulative reward Gt

E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]

where γ is discount factor (such that the infinite sumconverges)

Mathematically speaking RL is a way to solve multi-periodoptimal control problems

Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)

15 40

Brief introduction to RL III The action-value function expresses the value of starting in

state s taking an arbitrary action a and then following policyπ thereafter

qπ(s a) = Eπ[Gt | St = sAt = a] (1)

where Eπ denotes the expectation under the assumption thatpolicy π is followed

If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)

This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a

sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman

equations

16 40

A visual example

YouTube example

17 40

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 7: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Machine learning for trading I

While the role of trading has not changed modern ML allowsus to Create more realistic models Build models that adapt to changing market conditions Solve the models faster

6 40

Machine learning for trading IIExamples of ML tools that are being leveraged for trading LASSO-based techniques yield very fast derivative free solvers

for single- and multi-period problems Bayesian learning and Gaussian processes used for building

forecasts (forecasting distributions) and dealing withestimationmodel risk

Unsupervised learning techniques for building statistical riskmodels

Regime-switching models for dealing with different marketenvironments (eg risk onoff)

Mixed-distributions to model skewed distributions andtail-behavior

Bootstrap and cross-validation techniques for backtesting

7 40

Machine learning for trading III Reinforcement learning (often tree- or NN-based models) for

complex trading planning problems in the presence ofuncertainty (where the value function is not easily obtainable)

NLP LDA amp extensions ICA for analysis of news filings andreports

8 40

Replication amp hedging

Replicating and hedging an option position is fundamental infinance

The core idea of the seminal work by Black-Scholes-Merton(BSM) In a complete and frictionless market there is a continuously

rebalanced dynamic trading strategy in the stock and risklesssecurity that perfectly replicates the option (Black and Scholes(1973) Merton (1973))

In practice continuous trading of arbitrarily small amounts ofstock is infinitely costly and the replicating portfolio isadjusted at discrete times Perfect replication is impossible and an optimal hedging

strategy will depend on the desired trade-off betweenreplication error and trading costs

9 40

Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone

Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction

costs

More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)

Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs

10 40

Related work II

Buehler et al. (2018) evaluate NN-based hedging under coherent risk measures subject to proportional transaction costs.

¹See, for example, Figlewski (1989), Boyle and Vorst (1992), Henrotte (1993), Grannan and Swindle (1996), Toft (1996), Whalley and Wilmott (1997), and Martellini (2000).

11 40

What we do

In the article we:

Show how to build a reinforcement learning (RL) system which can learn how to optimally hedge an option (or other derivative securities) in a fully realistic setting: discrete time, nonlinear transaction costs, and round-lotting.

The method allows the user to "plug in" any option pricing and simulation library and then train the system with no further modifications.

The system learns how to optimally trade off trading costs versus hedging variance for that security. It uses a continuous state space and relies on nonlinear regression applied to the "sarsa targets" derived from the Bellman equation (see the sketch below).

The method extends in a straightforward way to arbitrary portfolios of derivative securities.

12 40
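To illustrate the last two bullets, here is a minimal sketch of one sweep of fitted sarsa with a nonlinear regressor. The transition format, the choice of scikit-learn's RandomForestRegressor, and the function name are assumptions made for this example, not the paper's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_sarsa_step(transitions, q_model=None, gamma=0.99):
    """One sweep of fitted sarsa: regress Bellman targets on (state, action) features.

    transitions: iterable of (s, a, r, s_next, a_next), with s and s_next as 1-D arrays.
    q_model: a fitted regressor with predict(), or None on the first sweep.
    """
    X = np.array([np.append(s, a) for s, a, _, _, _ in transitions])
    X_next = np.array([np.append(s2, a2) for _, _, _, s2, a2 in transitions])
    rewards = np.array([r for _, _, r, _, _ in transitions])

    # Sarsa targets from the Bellman equation: reward plus discounted value of (s', a').
    next_q = q_model.predict(X_next) if q_model is not None else np.zeros(len(rewards))
    targets = rewards + gamma * next_q

    new_model = RandomForestRegressor(n_estimators=100, random_state=0)
    new_model.fit(X, targets)
    return new_model
```

Iterating this step over batches of simulated transitions produces the sequence of q-function estimates referred to above.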

Reinforcement learning for hedging

13 40

Brief introduction to RL I

The RL agent interacts with its environment. The "environment" is the part of the system outside of the agent's direct control.

At each time step t, the agent observes the current state of the environment s_t and chooses an action a_t from the action set.

This choice influences both the transition to the next state as well as the reward R_t the agent receives.

[Diagram: the agent-environment loop — the environment sends a state and a reward to the RL agent, which sends an action back to the environment.]

14 40

Brief introduction to RL II

A policy π is a way of choosing an action a_t conditional on the current state s_t.

RL is the search for policies which maximize the expectation of the cumulative reward G_t,

E[G_t] = E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots]

where γ is the discount factor (chosen such that the infinite sum converges).

Mathematically speaking, RL is a way to solve multi-period optimal control problems.

Standard texts on RL include Sutton and Barto (2018) and Szepesvari (2010).
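As a small illustration of the quantity being maximized, the sketch below computes the discounted cumulative reward G_t for a finite reward sequence; the function name and the use of a plain Python list are assumptions made for this example only.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...

    rewards: rewards received after time t, i.e. [R_{t+1}, R_{t+2}, ...].
    """
    g = 0.0
    # Accumulate from the last reward backwards so each step applies one factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three periods of unit reward with gamma = 0.9 gives 1 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```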

15 40

Brief introduction to RL III

The action-value function expresses the value of starting in state s, taking an arbitrary action a, and then following policy π thereafter:

q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]    (1)

where E_\pi denotes the expectation under the assumption that policy π is followed.

If we knew the q-function corresponding to the optimal policy, q_*, we would know the optimal policy itself, namely: we choose a in the action set that maximizes q_*(s_t, a). This is called the greedy policy.

Hence the problem is reduced to finding q_*, or producing a sequence of iterates that converges to q_*. Methods for producing those iterates are based on the Bellman equations.
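A minimal sketch of the greedy policy over a finite action set, assuming we already have some approximation of q_* available as a callable; the function and variable names are illustrative, not from the paper.

```python
def greedy_action(q_star, state, action_set):
    """Return the action in action_set that maximizes the estimated q_*(state, a)."""
    return max(action_set, key=lambda a: q_star(state, a))

# Toy example with a quadratic q-function: the best trade is the one closest to +3 shares.
toy_q = lambda s, a: -(a - 3) ** 2
print(greedy_action(toy_q, state=None, action_set=range(-5, 6)))  # -> 3
```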

16 40

A visual example

YouTube example

17 40

Supervised learning vs RL

[Diagram: In supervised learning, inputs map to outputs, and the training signals are the target outputs from the training set. In reinforcement learning, states map to actions, and the training signals are the rewards.]

18 40

RL has a feedback loop

[Diagram: the same reinforcement-learning block, but with states, actions, and rewards flowing through the environment, closing the feedback loop.]

19 40

Is RL worth the trouble?

RL is often harder than supervised learning:
Joint learning and planning/optimizing from correlated samples.
The distribution of the data changes with the choice of actions.
Need access to the environment for training.

So when may it be a good idea to use RL?
Data comes in the form of trajectories (i.e., non-IID sequences).
We need to make a sequence of decisions (i.e., non-IID decisions).
We are able to observe feedback to the state or choice of actions. This information can be partial and/or noisy.
There is a gain to be made by optimizing action choice over a portion of the trajectory (i.e., the time consistency needed for the Bellman equation to hold).

Other ML techniques cannot easily deal with these situations.

20 40

Common challenges when solving RL problems

Specifying the model: representing the state, choosing the set of actions, designing the reward.

Acquiring data for training: exploration vs. exploitation, high-cost actions, time-delayed reward.

Function approximation (random forests, CNNs). Validation, confidence measures.

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of using trained RL agents to handle hedging.

With no trading frictions and where continuous trading is possible, there may be a dynamic replicating portfolio which hedges the option position perfectly, meaning that the overall portfolio (option minus replication) has zero variance.

With frictions and where only discrete trading is possible, the goal becomes to minimize variance and cost. We will use this to define the reward.

22 40

Automatic hedging in theory II

This suggests we can seek the agent's optimal portfolio as the solution to a mean-variance optimization problem with risk-aversion κ:

\max \left( E[w_T] - \frac{\kappa}{2} V[w_T] \right)    (2)

where the final wealth w_T is the sum of individual wealth increments δw_t,

w_T = w_0 + \sum_{t=1}^{T} \delta w_t.

We will let wealth increments include trading costs.
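To make the accounting concrete, here is a small sketch that accumulates wealth increments over simulated paths and evaluates the mean-variance objective in equation (2); the array shapes and the function name are assumptions for illustration only.

```python
import numpy as np

def mean_variance_objective(delta_w, w0=0.0, kappa=0.1):
    """Evaluate E[w_T] - (kappa/2) * V[w_T] from simulated wealth increments.

    delta_w: array of shape (n_paths, T) holding per-period wealth increments
             (already net of trading costs).
    """
    w_T = w0 + delta_w.sum(axis=1)           # terminal wealth on each path
    return w_T.mean() - 0.5 * kappa * w_T.var()

# Example with random increments, just to show the call signature.
rng = np.random.default_rng(0)
print(mean_variance_objective(rng.normal(0.0, 1.0, size=(1000, 50)), kappa=0.1))
```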

23 40

Automatic hedging in theory III

In the random walk case this leads to solving

\min_{\text{permissible strategies}} \sum_{t=0}^{T} \left( E[-\delta w_t] + \frac{\kappa}{2} V[\delta w_t] \right)    (3)

where

-\delta w_t = c_t

and c_t is the total trading cost paid in period t (including commissions, bid-offer spread cost, market impact cost, and other sources of slippage).

With an appropriate choice of reward function, the problem of maximizing this mean-variance objective can be recast as an RL problem.

24 40

Automatic hedging in theory IV

We choose the reward in each period to be²

R_t = \delta w_t - \frac{\kappa}{2} (\delta w_t)^2    (4)

Thus, training reinforcement learners with this kind of reward function amounts to training automatic hedgers who trade off costs versus hedging variance.

²See Ritter (2017) for a general discussion of reward functions in trading.

25 40
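The per-period reward in equation (4) is a one-liner in code; the function name is an assumption for this sketch.

```python
def hedging_reward(delta_w, kappa=0.1):
    """Per-period reward R_t = delta_w_t - (kappa/2) * delta_w_t**2, as in equation (4)."""
    return delta_w - 0.5 * kappa * delta_w ** 2

print(hedging_reward(2.0, kappa=0.1))  # 2.0 - 0.05 * 4.0 = 1.8
```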

Automatic hedging in practice I

Simplest possible example: a European call option with strike price K and expiry T on a non-dividend-paying stock.

We take the strike and maturity as fixed, exogenously-given constants. For simplicity, we assume the risk-free rate is zero.

The agent we train will learn to hedge this specific option with this strike and maturity. It is not being trained to hedge any option with any possible strike/maturity.

For European options, the state must minimally contain (1) the current price S_t of the underlying, (2) the time τ = T − t > 0 remaining to expiry, and (3) our current position of n shares.

The state is thus naturally an element of³

\mathcal{S} = \mathbb{R}^2_+ \times \mathbb{Z} = \{(S, \tau, n) \mid S > 0, \tau > 0, n \in \mathbb{Z}\}.
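A minimal way to carry this state around in code; the class name and field names are assumptions made for illustration.

```python
from typing import NamedTuple

class HedgingState(NamedTuple):
    """State (S, tau, n): underlying price, time remaining to expiry, and share position."""
    price: float            # S > 0, current price of the underlying
    time_to_expiry: float   # tau = T - t > 0, measured in periods
    position: int           # n, current holding in shares (integer-valued)

state = HedgingState(price=100.0, time_to_expiry=50.0, position=-42)
print(state)
```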

26 40

Automatic hedging in practice II

The state does not need to contain the option Greeks, because they are (nonlinear) functions of the variables the agent has access to via the state. We expect the agent to learn such nonlinear functions on its own.

A key point: This has the advantage of not requiring any special model-specific calculations that may not extend beyond BSM models.

³If the option is American, then it may be optimal to exercise early just before an ex-dividend date. In this situation the state must be augmented with one additional variable: the size of the anticipated dividend in period t + 1.

27 40

Let's put RL at a disadvantage

The RL agent is at a disadvantage: it does not know any of the following information:
the strike price K,
that the stock price process is a geometric Brownian motion (GBM),
the volatility of the price process,
the BSM formula,
the payoff function (S − K)+ at maturity,
any of the Greeks.

Thus, it must infer the relevant information, insofar as it affects the value function, by interacting with a simulated environment.

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock price process is a geometric Brownian motion (GBM) with initial price S_0 and daily lognormal volatility of σ_day.

We consider an initially at-the-money European call option (struck at K = S_0) with T days to maturity.

We discretize time with D periods per day; hence each "episode" has T · D total periods.

We require trades (hence also holdings) to be integer numbers of shares.

We assume that our agent's job is to hedge one contract of this option.

In the specific examples below, the parameters are σ = 0.01, S_0 = 100, T = 10, and D = 5. We set the risk-aversion κ = 0.1.
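A sketch of the simulated world under these assumptions (a driftless GBM with D periods per day; integer share trades are handled elsewhere). Parameter names mirror the slide, and the function name is an assumption.

```python
import numpy as np

def simulate_gbm_paths(n_paths, s0=100.0, sigma_day=0.01, T_days=10, D=5, seed=0):
    """Simulate GBM price paths with T_days * D periods and zero drift / risk-free rate.

    Returns an array of shape (n_paths, T_days * D + 1) including the initial price.
    """
    rng = np.random.default_rng(seed)
    n_steps = T_days * D
    vol_per_step = sigma_day / np.sqrt(D)              # per-period lognormal volatility
    z = rng.standard_normal((n_paths, n_steps))
    log_increments = -0.5 * vol_per_step ** 2 + vol_per_step * z  # martingale (zero price drift)
    log_paths = np.cumsum(log_increments, axis=1)
    return s0 * np.exp(np.hstack([np.zeros((n_paths, 1)), log_paths]))

paths = simulate_gbm_paths(n_paths=3)
print(paths.shape)  # (3, 51): 10 days * 5 periods per day, plus the initial price
```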

29 40

Simulation assumptions II

T-costs: For a trade size of n shares we define

cost(n) = multiplier × TickSize × (|n| + 0.01 n²)

where we take TickSize = 0.1. With multiplier = 1, the term TickSize × |n| represents a cost, relative to the midpoint, of crossing a bid-offer spread that is two ticks wide. The quadratic term is a simplistic model for market impact.
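The cost model translates directly into code; this small sketch only illustrates the formula above, with the decimal constants as assumed.

```python
def trade_cost(n_shares, tick_size=0.1, multiplier=1.0):
    """Transaction cost of trading n_shares: multiplier * TickSize * (|n| + 0.01 * n^2).

    The linear term approximates crossing a two-tick-wide bid-offer spread;
    the quadratic term is a simplistic market-impact model.
    """
    return multiplier * tick_size * (abs(n_shares) + 0.01 * n_shares ** 2)

print(trade_cost(10))    # 0.1 * (10 + 1) = 1.1
print(trade_cost(-100))  # 0.1 * (100 + 100) = 20.0
```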

All of the examples were trained on a single CPU, and the longest training time allowed was one hour.

Baseline agent = RL agent trained in a frictionless world.

30 40

Example: Baseline agent (discrete & no t-costs)

[Figure 1: One simulated episode plotted against timestep (D·T), values in dollars or shares: delta-hedge shares, option P&L, stock P&L, stock position (shares), and total P&L. Stock & options P&L roughly cancel to give the (relatively low-variance) total P&L. The agent's position tracks the delta.]

31 40

Example: Baseline agent (discrete & t-costs)

[Figure 2: One simulated episode plotted against timestep (D·T), values in dollars or shares: cost P&L, delta-hedge shares, option P&L, stock P&L, stock position (shares), and total P&L. Stock & options P&L roughly cancel to give the (relatively low-variance) total P&L. The agent trades so that the position in the next period will be the quantity −100 · ∆, rounded to shares.]
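For reference, the benchmark position described in the caption (−100 · ∆ rounded to whole shares) can be computed from the BSM delta. The sketch below assumes zero rates, uses scipy's normal CDF, and its function names are illustrative, not the paper's code.

```python
import numpy as np
from scipy.stats import norm

def bsm_call_delta(S, K, sigma, tau):
    """Black-Scholes-Merton delta of a European call with zero rates.

    S: spot, K: strike, sigma: volatility per unit time, tau: time to expiry (same units).
    """
    d1 = (np.log(S / K) + 0.5 * sigma ** 2 * tau) / (sigma * np.sqrt(tau))
    return norm.cdf(d1)

def delta_hedge_position(S, K, sigma, tau, contract_size=100):
    """Shares to hold against one short call contract: -contract_size * delta, rounded."""
    return int(round(-contract_size * bsm_call_delta(S, K, sigma, tau)))

# At the money, 10 days to expiry, 1% daily volatility.
print(delta_hedge_position(S=100.0, K=100.0, sigma=0.01, tau=10.0))
```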

32 40

Example: T-cost aware agent (discrete & t-costs)

[Figure: One simulated episode for the t-cost aware agent plotted against timestep (D·T), values in dollars or shares: cost P&L, delta-hedge shares, option P&L, stock P&L, stock position (shares), and total P&L.]

33 40

Kernel density estimates of total cost & volatility

[Figure 3: Kernel density estimates for total cost (left panel) and volatility of total P&L (right panel) from N = 10,000 out-of-sample simulations, for the "delta" and "reinf" methods. The "reinf" policy achieves much lower cost (t-statistic = −14322) with no significant difference in volatility of total P&L.]

34 40

Kernel density estimates of total P&L

[Figure 4: Kernel density estimates of the t-statistic of total P&L for each of our out-of-sample simulation runs and for both policies represented above ("delta" and "reinf"). The "reinf" method is seen to outperform in the sense that the t-statistic is much more often close to zero and insignificant.]

35 40
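For readers who want to reproduce this kind of diagnostic, one plausible reading of the per-run t-statistic is the mean per-period total P&L divided by its standard error within that run; the sketch below is an assumption about that construction, not the authors' exact code.

```python
import numpy as np

def pnl_t_statistic(pnl_increments):
    """t-statistic of total P&L for one simulation run: mean / standard error of increments."""
    x = np.asarray(pnl_increments, dtype=float)
    return x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))

# Example: t-statistics across many runs, e.g. as input to a kernel density estimate.
rng = np.random.default_rng(1)
runs = rng.normal(-0.05, 1.0, size=(10_000, 50))   # hypothetical per-period total P&L
t_stats = np.array([pnl_t_statistic(run) for run in runs])
print(t_stats.mean())
```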

Conclusions I

We have introduced an RL system that hedges an option under realistic conditions of discrete trading and nonlinear t-costs.

The approach does not depend on the existence of perfect dynamic replication. The system learns to optimally trade off variance and cost, as best as possible, using whatever securities it is given as potential candidates for inclusion in the replicating portfolio.

It accomplishes this without the user providing any of the following information:
the strike price K,
the fact that the stock price process is a geometric Brownian motion,
the volatility of the price process,
the Black-Scholes-Merton formula,
the payoff function (S − K)+ at maturity,
any of the Greeks.

36 40

Conclusions II

A key strength of the RL approach: It does not make any assumptions about the form of t-costs. RL learns the minimum-variance hedge subject to whatever t-cost function one provides. All it needs is a good simulator, in which t-costs and options prices are simulated accurately.

37 40

References I

Almgren, Robert and Neil Chriss (1999). "Value under liquidation". In: Risk 12.12, pp. 61–63.

Almgren, Robert and Tianhui Michael Li (2016). "Option hedging with smooth market impact". In: Market Microstructure and Liquidity 2.1, p. 1650002.

Bank, Peter, H. Mete Soner, and Moritz Voß (2017). "Hedging with temporary price impact". In: Mathematics and Financial Economics 11.2, pp. 215–239.

Black, Fischer and Myron Scholes (1973). "The pricing of options and corporate liabilities". In: Journal of Political Economy 81.3, pp. 637–654.

Boyle, Phelim P. and Ton Vorst (1992). "Option replication in discrete time with transaction costs". In: The Journal of Finance 47.1, pp. 271–293.

Buehler, Hans et al. (2018). "Deep hedging". In: arXiv:1802.03042.

Figlewski, Stephen (1989). "Options arbitrage in imperfect markets". In: The Journal of Finance 44.5, pp. 1289–1311.

Grannan, Erik R. and Glen H. Swindle (1996). "Minimizing transaction costs of option hedging strategies". In: Mathematical Finance 6.4, pp. 341–364.

Halperin, Igor (2017). "QLBS: Q-Learner in the Black-Scholes (-Merton) Worlds". In: arXiv:1712.04609.

Henrotte, Philippe (1993). "Transaction costs and duplication strategies". In: Graduate School of Business, Stanford University.

38 40

References II

Kolm, Petter and Gordon Ritter (2019). "Dynamic Replication and Hedging: A Reinforcement Learning Approach". In: The Journal of Financial Data Science 1.1, pp. 159–171.

Leland, Hayne E. (1985). "Option pricing and replication with transactions costs". In: The Journal of Finance 40.5, pp. 1283–1301.

Martellini, Lionel (2000). "Efficient option replication in the presence of transactions costs". In: Review of Derivatives Research 4.2, pp. 107–131.

Merton, Robert C. (1973). "Theory of rational option pricing". In: The Bell Journal of Economics and Management Science, pp. 141–183.

Ritter, Gordon (2017). "Machine Learning for Trading". In: Risk 30.10, pp. 84–89.

Rogers, Leonard C. G. and Surbjeet Singh (2010). "The cost of illiquidity and its effects on hedging". In: Mathematical Finance 20.4, pp. 597–615.

Saito, Taiga and Akihiko Takahashi (2017). "Derivatives pricing with market impact and limit order book". In: Automatica 86, pp. 154–165.

Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. Second edition, in progress. MIT Press, Cambridge.

Szepesvari, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.

Toft, Klaus Bjerre (1996). "On the mean-variance tradeoff in option replication with transactions costs". In: Journal of Financial and Quantitative Analysis 31.2, pp. 233–263.

39 40

References III

Whalley, A. Elizabeth and Paul Wilmott (1997). "An asymptotic analysis of an optimal hedging model for option pricing with transaction costs". In: Mathematical Finance 7.3, pp. 307–324.

40 40

Page 8: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Machine learning for trading IIExamples of ML tools that are being leveraged for trading LASSO-based techniques yield very fast derivative free solvers

for single- and multi-period problems Bayesian learning and Gaussian processes used for building

forecasts (forecasting distributions) and dealing withestimationmodel risk

Unsupervised learning techniques for building statistical riskmodels

Regime-switching models for dealing with different marketenvironments (eg risk onoff)

Mixed-distributions to model skewed distributions andtail-behavior

Bootstrap and cross-validation techniques for backtesting

7 40

Machine learning for trading III Reinforcement learning (often tree- or NN-based models) for

complex trading planning problems in the presence ofuncertainty (where the value function is not easily obtainable)

NLP LDA amp extensions ICA for analysis of news filings andreports

8 40

Replication amp hedging

Replicating and hedging an option position is fundamental infinance

The core idea of the seminal work by Black-Scholes-Merton(BSM) In a complete and frictionless market there is a continuously

rebalanced dynamic trading strategy in the stock and risklesssecurity that perfectly replicates the option (Black and Scholes(1973) Merton (1973))

In practice continuous trading of arbitrarily small amounts ofstock is infinitely costly and the replicating portfolio isadjusted at discrete times Perfect replication is impossible and an optimal hedging

strategy will depend on the desired trade-off betweenreplication error and trading costs

9 40

Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone

Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction

costs

More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)

Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs

10 40

Related work II Buehler et al (2018) evaluate NN-based hedging under

coherent risk measures subject to proportional transactioncosts

1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)

11 40

What we doIn the article we Show how to build a reinforcement learning (RL) system which

can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting

Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications

The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo

derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios

of derivative securities12 40

Reinforcement learning for hedging

13 40

Brief introduction to RL I

RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control

At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set

This choice influences both the transition to the next state aswell as the reward Rt the agent receives

Environment

Reward ActionState

RL Agent

14 40

Brief introduction to RL II A policy π is a way of choosing an action at conditional on

the current state st RL is the search for policies which maximize the expectation of

the cumulative reward Gt

E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]

where γ is discount factor (such that the infinite sumconverges)

Mathematically speaking RL is a way to solve multi-periodoptimal control problems

Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)

15 40

Brief introduction to RL III The action-value function expresses the value of starting in

state s taking an arbitrary action a and then following policyπ thereafter

qπ(s a) = Eπ[Gt | St = sAt = a] (1)

where Eπ denotes the expectation under the assumption thatpolicy π is followed

If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)

This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a

sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman

equations

16 40

A visual example

YouTube example

17 40

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 9: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Machine learning for trading III Reinforcement learning (often tree- or NN-based models) for

complex trading planning problems in the presence ofuncertainty (where the value function is not easily obtainable)

NLP LDA amp extensions ICA for analysis of news filings andreports

8 40

Replication amp hedging

Replicating and hedging an option position is fundamental infinance

The core idea of the seminal work by Black-Scholes-Merton(BSM) In a complete and frictionless market there is a continuously

rebalanced dynamic trading strategy in the stock and risklesssecurity that perfectly replicates the option (Black and Scholes(1973) Merton (1973))

In practice continuous trading of arbitrarily small amounts ofstock is infinitely costly and the replicating portfolio isadjusted at discrete times Perfect replication is impossible and an optimal hedging

strategy will depend on the desired trade-off betweenreplication error and trading costs

9 40

Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone

Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction

costs

More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)

Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs

10 40

Related work II Buehler et al (2018) evaluate NN-based hedging under

coherent risk measures subject to proportional transactioncosts

1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)

11 40

What we doIn the article we Show how to build a reinforcement learning (RL) system which

can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting

Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications

The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo

derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios

of derivative securities12 40

Reinforcement learning for hedging

13 40

Brief introduction to RL I

RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control

At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set

This choice influences both the transition to the next state aswell as the reward Rt the agent receives

Environment

Reward ActionState

RL Agent

14 40

Brief introduction to RL II A policy π is a way of choosing an action at conditional on

the current state st RL is the search for policies which maximize the expectation of

the cumulative reward Gt

E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]

where γ is discount factor (such that the infinite sumconverges)

Mathematically speaking RL is a way to solve multi-periodoptimal control problems

Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)

15 40

Brief introduction to RL III The action-value function expresses the value of starting in

state s taking an arbitrary action a and then following policyπ thereafter

qπ(s a) = Eπ[Gt | St = sAt = a] (1)

where Eπ denotes the expectation under the assumption thatpolicy π is followed

If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)

This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a

sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman

equations

16 40

A visual example

YouTube example

17 40

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 10: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Replication amp hedging

Replicating and hedging an option position is fundamental infinance

The core idea of the seminal work by Black-Scholes-Merton(BSM) In a complete and frictionless market there is a continuously

rebalanced dynamic trading strategy in the stock and risklesssecurity that perfectly replicates the option (Black and Scholes(1973) Merton (1973))

In practice continuous trading of arbitrarily small amounts ofstock is infinitely costly and the replicating portfolio isadjusted at discrete times Perfect replication is impossible and an optimal hedging

strategy will depend on the desired trade-off betweenreplication error and trading costs

9 40

Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone

Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction

costs

More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)

Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs

10 40

Related work II Buehler et al (2018) evaluate NN-based hedging under

coherent risk measures subject to proportional transactioncosts

1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)

11 40

What we doIn the article we Show how to build a reinforcement learning (RL) system which

can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting

Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications

The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo

derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios

of derivative securities12 40

Reinforcement learning for hedging

13 40

Brief introduction to RL I

RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control

At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set

This choice influences both the transition to the next state aswell as the reward Rt the agent receives

Environment

Reward ActionState

RL Agent

14 40

Brief introduction to RL II A policy π is a way of choosing an action at conditional on

the current state st RL is the search for policies which maximize the expectation of

the cumulative reward Gt

E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]

where γ is discount factor (such that the infinite sumconverges)

Mathematically speaking RL is a way to solve multi-periodoptimal control problems

Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)

15 40

Brief introduction to RL III The action-value function expresses the value of starting in

state s taking an arbitrary action a and then following policyπ thereafter

qπ(s a) = Eπ[Gt | St = sAt = a] (1)

where Eπ denotes the expectation under the assumption thatpolicy π is followed

If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)

This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a

sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman

equations

16 40

A visual example

YouTube example

17 40

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Let's put RL at a disadvantage

 The RL agent is at a disadvantage: It does not know any of the following information:
 the strike price K
 that the stock price process is a geometric Brownian motion (GBM)
 the volatility of the price process
 the BSM formula
 the payoff function (S − K)+ at maturity
 any of the Greeks

 Thus it must infer the relevant information, insofar as it affects the value function, by interacting with a simulated environment

28 40

Simulation assumptions I

 We simulate a discrete BSM world where the stock price process is a geometric Brownian motion (GBM) with initial price S0 and daily lognormal volatility of σday

 We consider an initially at-the-money European call option (struck at K = S0) with T days to maturity

 We discretize time with D periods per day; hence each "episode" has T · D total periods

 We require trades (hence also holdings) to be integer numbers of shares

 We assume that our agent's job is to hedge one contract of this option

 In the specific examples below the parameters are σ = 0.01, S0 = 100, T = 10, and D = 5. We set the risk-aversion κ = 0.1

29 40
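A minimal sketch of the simulated environment's price dynamics under these assumptions; the parameter defaults mirror the slide (σ = 0.01 daily, S0 = 100, T = 10 days, D = 5 periods per day), but the code itself is illustrative, not the authors' simulator.

```python
# Minimal sketch (illustrative): one zero-drift GBM episode with T*D periods
# and daily lognormal volatility sigma_day, as assumed on this slide.
import numpy as np

def simulate_gbm_episode(s0=100.0, sigma_day=0.01, days=10, periods_per_day=5, seed=None):
    rng = np.random.default_rng(seed)
    n_steps = days * periods_per_day
    dt = 1.0 / periods_per_day                    # time step, in days
    sigma_step = sigma_day * np.sqrt(dt)          # per-period volatility
    z = rng.standard_normal(n_steps)
    log_returns = -0.5 * sigma_step**2 + sigma_step * z   # martingale prices
    prices = s0 * np.exp(np.cumsum(log_returns))
    return np.concatenate(([s0], prices))         # path of length T*D + 1

path = simulate_gbm_episode(seed=0)
```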

Simulation assumptions II
 T-costs: For a trade size of n shares we define

\[
\mathrm{cost}(n) = \text{multiplier} \times \text{TickSize} \times \left( |n| + 0.01\, n^2 \right)
\]

where we take TickSize = 0.1. With multiplier = 1, the term TickSize × |n| represents the cost, relative to the midpoint, of crossing a bid-offer spread that is two ticks wide. The quadratic term is a simplistic model for market impact
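The transaction-cost model above translates directly into a one-line function; this is a sketch under the slide's parameter choices (TickSize = 0.1, multiplier = 1), not a reference implementation.

```python
# Minimal sketch: per-trade transaction cost used in the simulations,
# cost(n) = multiplier * TickSize * (|n| + 0.01 * n**2).

def trade_cost(n_shares: int, tick_size: float = 0.1, multiplier: float = 1.0) -> float:
    """Spread-crossing cost (linear in |n|) plus a simple quadratic impact term."""
    return multiplier * tick_size * (abs(n_shares) + 0.01 * n_shares ** 2)

# e.g. trade_cost(100) == 0.1 * (100 + 100.0) == 20.0 dollars
```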

 All of the examples were trained on a single CPU, and the longest training time allowed was one hour

 Baseline agent = RL agent trained in a frictionless world

30 40

Example: Baseline agent (discrete & no t-costs)

[Figure 1: Stock & options P&L roughly cancel to give the (relatively low-variance) total P&L. The agent's position tracks the delta. x-axis: timestep (D·T); y-axis: value (dollars or shares); series: delta hedge (shares), option P&L, stock P&L, stock position (shares), total P&L.]

31 40

Example: Baseline agent (discrete & t-costs)

[Figure 2: Stock & options P&L roughly cancel to give the (relatively low-variance) total P&L. The agent trades so that the position in the next period will be the quantity −100 · ∆, rounded to shares. x-axis: timestep (D·T); y-axis: value (dollars or shares); series: cost P&L, delta hedge (shares), option P&L, stock P&L, stock position (shares), total P&L.]

32 40

Example: T-cost aware agent (discrete & t-costs)

[Figure: same layout as Figures 1-2, now for the t-cost aware agent. x-axis: timestep (D·T); y-axis: value (dollars or shares); series: cost P&L, delta hedge (shares), option P&L, stock P&L, stock position (shares), total P&L.]

33 40

Kernel density estimates of total cost & volatility

[Figure 3: Kernel density estimates for total cost (left panel, x-axis: total cost) and volatility of total P&L (right panel, x-axis: realized vol) from N = 10,000 out-of-sample simulations, for the "delta" and "reinf" methods. The "reinf" policy achieves much lower cost (t-statistic = −143.22) with no significant difference in volatility of total P&L.]

34 40
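The comparison in Figure 3 amounts to running many out-of-sample episodes for both policies and comparing the resulting samples. Below is a minimal sketch of that evaluation step, with scipy's two-sample t-test and Gaussian KDE standing in for whatever tooling the authors actually used; the arrays `costs_delta` and `costs_reinf` are assumed to hold the total cost per simulated episode for each policy.

```python
# Minimal sketch (illustrative): compare total cost per episode for the
# delta-hedge and RL ("reinf") policies over N out-of-sample simulations.
import numpy as np
from scipy import stats

def compare_costs(costs_delta: np.ndarray, costs_reinf: np.ndarray):
    # Two-sample t-test (unequal variances) for the difference in mean cost.
    t_stat, p_value = stats.ttest_ind(costs_reinf, costs_delta, equal_var=False)
    # Gaussian kernel density estimates, as plotted in Figure 3.
    kde_delta = stats.gaussian_kde(costs_delta)
    kde_reinf = stats.gaussian_kde(costs_reinf)
    grid = np.linspace(min(costs_delta.min(), costs_reinf.min()),
                       max(costs_delta.max(), costs_reinf.max()), 200)
    return t_stat, p_value, (grid, kde_delta(grid), kde_reinf(grid))
```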

Kernel density estimates of total P&L

[Figure 4: Kernel density estimates of the t-statistic of total P&L (x-axis: Student t-statistic of total P&L) for each of our out-of-sample simulation runs, for both policies represented above ("delta" and "reinf"). The "reinf" method is seen to outperform in the sense that the t-statistic is much more often close to zero and insignificant.]

35 40

Conclusions I
 We have introduced an RL system that hedges an option under realistic conditions of discrete trading and nonlinear t-costs

 The approach does not depend on the existence of perfect dynamic replication. The system learns to optimally trade off variance and cost, as best as possible, using whatever securities it is given as potential candidates for inclusion in the replicating portfolio

 It accomplishes this without the user providing any of the following information:
 the strike price K
 the fact that the stock price process is a geometric Brownian motion
 the volatility of the price process
 the Black-Scholes-Merton formula
 the payoff function (S − K)+ at maturity
 any of the Greeks

36 40

Conclusions II
 A key strength of the RL approach: It does not make any assumptions about the form of t-costs. RL learns the minimum-variance hedge subject to whatever t-cost function one provides. All it needs is a good simulator, in which t-costs and options prices are simulated accurately

37 40

References I

Almgren, Robert and Neil Chriss (1999). "Value under liquidation". In: Risk 12.12, pp. 61–63.

Almgren, Robert and Tianhui Michael Li (2016). "Option hedging with smooth market impact". In: Market Microstructure and Liquidity 2.1, p. 1650002.

Bank, Peter, H. Mete Soner, and Moritz Voß (2017). "Hedging with temporary price impact". In: Mathematics and Financial Economics 11.2, pp. 215–239.

Black, Fischer and Myron Scholes (1973). "The pricing of options and corporate liabilities". In: Journal of Political Economy 81.3, pp. 637–654.

Boyle, Phelim P. and Ton Vorst (1992). "Option replication in discrete time with transaction costs". In: The Journal of Finance 47.1, pp. 271–293.

Buehler, Hans et al. (2018). "Deep hedging". In: arXiv:1802.03042.

Figlewski, Stephen (1989). "Options arbitrage in imperfect markets". In: The Journal of Finance 44.5, pp. 1289–1311.

Grannan, Erik R. and Glen H. Swindle (1996). "Minimizing transaction costs of option hedging strategies". In: Mathematical Finance 6.4, pp. 341–364.

Halperin, Igor (2017). "QLBS: Q-Learner in the Black-Scholes (-Merton) Worlds". In: arXiv:1712.04609.

Henrotte, Philippe (1993). "Transaction costs and duplication strategies". In: Graduate School of Business, Stanford University.

38 40

References II

Kolm, Petter and Gordon Ritter (2019). "Dynamic Replication and Hedging: A Reinforcement Learning Approach". In: The Journal of Financial Data Science 1.1, pp. 159–171.

Leland, Hayne E. (1985). "Option pricing and replication with transactions costs". In: The Journal of Finance 40.5, pp. 1283–1301.

Martellini, Lionel (2000). "Efficient option replication in the presence of transactions costs". In: Review of Derivatives Research 4.2, pp. 107–131.

Merton, Robert C. (1973). "Theory of rational option pricing". In: The Bell Journal of Economics and Management Science, pp. 141–183.

Ritter, Gordon (2017). "Machine Learning for Trading". In: Risk 30.10, pp. 84–89.

Rogers, Leonard C.G. and Surbjeet Singh (2010). "The cost of illiquidity and its effects on hedging". In: Mathematical Finance 20.4, pp. 597–615.

Saito, Taiga and Akihiko Takahashi (2017). "Derivatives pricing with market impact and limit order book". In: Automatica 86, pp. 154–165.

Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement learning: An introduction. Second edition in progress. MIT Press, Cambridge.

Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.

Toft, Klaus Bjerre (1996). "On the mean-variance tradeoff in option replication with transactions costs". In: Journal of Financial and Quantitative Analysis 31.2, pp. 233–263.

39 40

References III

Whalley, A. Elizabeth and Paul Wilmott (1997). "An asymptotic analysis of an optimal hedging model for option pricing with transaction costs". In: Mathematical Finance 7.3, pp. 307–324.

40 40

Page 11: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone

Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction

costs

More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)

Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs

10 40

Related work II Buehler et al (2018) evaluate NN-based hedging under

coherent risk measures subject to proportional transactioncosts

1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)

11 40

What we doIn the article we Show how to build a reinforcement learning (RL) system which

can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting

Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications

The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo

derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios

of derivative securities12 40

Reinforcement learning for hedging

13 40

Brief introduction to RL I

RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control

At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set

This choice influences both the transition to the next state aswell as the reward Rt the agent receives

Environment

Reward ActionState

RL Agent

14 40

Brief introduction to RL II A policy π is a way of choosing an action at conditional on

the current state st RL is the search for policies which maximize the expectation of

the cumulative reward Gt

E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]

where γ is discount factor (such that the infinite sumconverges)

Mathematically speaking RL is a way to solve multi-periodoptimal control problems

Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)

15 40

Brief introduction to RL III The action-value function expresses the value of starting in

state s taking an arbitrary action a and then following policyπ thereafter

qπ(s a) = Eπ[Gt | St = sAt = a] (1)

where Eπ denotes the expectation under the assumption thatpolicy π is followed

If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)

This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a

sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman

equations

16 40

A visual example

YouTube example

17 40

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 12: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Related work II Buehler et al (2018) evaluate NN-based hedging under

coherent risk measures subject to proportional transactioncosts

1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)

11 40

What we doIn the article we Show how to build a reinforcement learning (RL) system which

can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting

Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications

The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo

derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios

of derivative securities12 40

Reinforcement learning for hedging

13 40

Brief introduction to RL I

RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control

At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set

This choice influences both the transition to the next state aswell as the reward Rt the agent receives

Environment

Reward ActionState

RL Agent

14 40

Brief introduction to RL II A policy π is a way of choosing an action at conditional on

the current state st RL is the search for policies which maximize the expectation of

the cumulative reward Gt

E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]

where γ is discount factor (such that the infinite sumconverges)

Mathematically speaking RL is a way to solve multi-periodoptimal control problems

Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)

15 40

Brief introduction to RL III The action-value function expresses the value of starting in

state s taking an arbitrary action a and then following policyπ thereafter

qπ(s a) = Eπ[Gt | St = sAt = a] (1)

where Eπ denotes the expectation under the assumption thatpolicy π is followed

If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)

This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a

sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman

equations

16 40

A visual example

YouTube example

17 40

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 13: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

What we doIn the article we Show how to build a reinforcement learning (RL) system which

can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting

Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications

The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo

derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios

of derivative securities12 40

Reinforcement learning for hedging

13 40

Brief introduction to RL I

RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control

At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set

This choice influences both the transition to the next state aswell as the reward Rt the agent receives

Environment

Reward ActionState

RL Agent

14 40

Brief introduction to RL II A policy π is a way of choosing an action at conditional on

the current state st RL is the search for policies which maximize the expectation of

the cumulative reward Gt

E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]

where γ is discount factor (such that the infinite sumconverges)

Mathematically speaking RL is a way to solve multi-periodoptimal control problems

Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)

15 40

Brief introduction to RL III The action-value function expresses the value of starting in

state s taking an arbitrary action a and then following policyπ thereafter

qπ(s a) = Eπ[Gt | St = sAt = a] (1)

where Eπ denotes the expectation under the assumption thatpolicy π is followed

If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)

This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a

sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman

equations

16 40

A visual example

YouTube example

17 40

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 14: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Reinforcement learning for hedging

13 40

Brief introduction to RL I

RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control

At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set

This choice influences both the transition to the next state aswell as the reward Rt the agent receives

Environment

Reward ActionState

RL Agent

14 40

Brief introduction to RL II A policy π is a way of choosing an action at conditional on

the current state st RL is the search for policies which maximize the expectation of

the cumulative reward Gt

E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]

where γ is discount factor (such that the infinite sumconverges)

Mathematically speaking RL is a way to solve multi-periodoptimal control problems

Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)

15 40

Brief introduction to RL III The action-value function expresses the value of starting in

state s taking an arbitrary action a and then following policyπ thereafter

qπ(s a) = Eπ[Gt | St = sAt = a] (1)

where Eπ denotes the expectation under the assumption thatpolicy π is followed

If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)

This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a

sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman

equations

16 40

A visual example

YouTube example

17 40

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function, the problem of maximizing this mean-variance objective can be recast as an RL problem.

24 40

Automatic hedging in theory IV We choose the reward in each period to be²

R_t = δw_t − (κ/2)(δw_t)²   (4)

Since E[(δw_t)²] = V[δw_t] + (E[δw_t])², and the squared per-period mean increments are negligible, maximizing the expected sum of these rewards approximately maximizes the mean-variance objective above.

Thus, training reinforcement learners with this kind of reward function amounts to training automatic hedgers who trade off costs versus hedging variance.

²See Ritter (2017) for a general discussion of reward functions in trading.

25 40
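To make the reward concrete, here is a minimal sketch (not from the deck) of the per-period wealth increment and reward under this mean-variance criterion; the sign conventions and inputs are illustrative assumptions, since the deck only states that wealth increments include trading costs.

```python
def wealth_increment(option_value_change, position, price_change, trade_cost):
    """delta w_t: change in option value plus stock P&L, net of trading costs (illustrative)."""
    return option_value_change + position * price_change - trade_cost

def reward(delta_w, kappa=0.1):
    """Per-period reward R_t = delta_w - (kappa / 2) * delta_w**2, as in equation (4)."""
    return delta_w - 0.5 * kappa * delta_w ** 2
```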

Automatic hedging in practice I

Simplest possible example: a European call option with strike price K and expiry T on a non-dividend-paying stock.

We take the strike and maturity as fixed, exogenously-given constants. For simplicity, we assume the risk-free rate is zero.

The agent we train will learn to hedge this specific option with this strike and maturity. It is not being trained to hedge any option with any possible strike/maturity.

For European options, the state must minimally contain: (1) the current price S_t of the underlying, (2) the time τ = T − t > 0 remaining to expiry, and (3) our current position of n shares.

The state is thus naturally an element of³

S = R₊² × Z = {(S, τ, n) | S > 0, τ > 0, n ∈ Z}
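A minimal sketch of this state as a data structure (the class and field names are illustrative, not from the deck):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HedgeState:
    """State (S, tau, n): underlying price, time to expiry, current share position."""
    price: float            # S > 0, current price of the underlying
    time_to_expiry: float   # tau = T - t > 0, measured in trading periods (or days)
    position: int           # n, integer number of shares currently held

# Example: at-the-money start, 50 periods to expiry, flat position (values are illustrative).
state = HedgeState(price=100.0, time_to_expiry=50.0, position=0)
```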

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks, because they are (nonlinear) functions of the variables the agent has access to via the state. We expect the agent to learn such nonlinear functions on its own.

A key point: this has the advantage of not requiring any special model-specific calculations that may not extend beyond BSM models.

³If the option is American, then it may be optimal to exercise early just before an ex-dividend date. In this situation, the state must be augmented with one additional variable: the size of the anticipated dividend in period t + 1.

27 40

Let's put RL at a disadvantage

The RL agent is at a disadvantage: it does not know any of the following information:
the strike price K
that the stock price process is a geometric Brownian motion (GBM)
the volatility of the price process
the BSM formula
the payoff function (S − K)⁺ at maturity
any of the Greeks

Thus, it must infer the relevant information, insofar as it affects the value function, by interacting with a simulated environment.

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock price process is a geometric Brownian motion (GBM) with initial price S_0 and daily lognormal volatility of σ_day.

We consider an initially at-the-money European call option (struck at K = S_0) with T days to maturity.

We discretize time with D periods per day; hence each "episode" has T · D total periods.

We require trades (hence also holdings) to be integer numbers of shares.

We assume that our agent's job is to hedge one contract of this option.

In the specific examples below, the parameters are σ = 0.01, S_0 = 100, T = 10, and D = 5. We set the risk-aversion κ = 0.1.
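A minimal sketch of this simulation setup under the stated parameters (the zero-drift assumption and the seeding are illustrative choices, not specified in the deck):

```python
import numpy as np

def simulate_gbm_episode(s0=100.0, sigma_day=0.01, days=10, periods_per_day=5, seed=None):
    """Simulate one episode of GBM prices on a grid of days * periods_per_day steps.

    Drift is taken to be zero, an illustrative assumption consistent with a zero risk-free rate.
    """
    rng = np.random.default_rng(seed)
    n_steps = days * periods_per_day
    dt = 1.0 / periods_per_day                 # time measured in days
    z = rng.standard_normal(n_steps)
    log_returns = -0.5 * sigma_day**2 * dt + sigma_day * np.sqrt(dt) * z
    prices = s0 * np.exp(np.cumsum(log_returns))
    return np.concatenate(([s0], prices))      # length n_steps + 1, starting at S0

# Example: one 50-step price path matching the settings used in the deck's examples.
path = simulate_gbm_episode(seed=0)
```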

29 40

Simulation assumptions II T-costs: for a trade size of n shares, we define

cost(n) = multiplier × TickSize × (|n| + 0.01 n²)

where we take TickSize = 0.1. With multiplier = 1, the term TickSize × |n| represents a cost, relative to the midpoint, of crossing a bid-offer spread that is two ticks wide. The quadratic term is a simplistic model for market impact.
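A direct transcription of this cost function as code (parameter defaults follow the values above):

```python
def trade_cost(n_shares, tick_size=0.1, multiplier=1.0):
    """cost(n) = multiplier * TickSize * (|n| + 0.01 * n^2).

    The linear term proxies crossing a two-tick-wide bid-offer spread;
    the quadratic term is a simplistic market-impact model.
    """
    return multiplier * tick_size * (abs(n_shares) + 0.01 * n_shares**2)
```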

All of the examples were trained on a single CPU, and the longest training time allowed was one hour.

Baseline agent = RL agent trained in a frictionless world.

30 40

Example: Baseline agent (discrete & no t-costs)

[Figure: time series over 50 timesteps (DT) of value in dollars or shares, showing delta hedge shares, option P&L, stock P&L, stock position in shares, and total P&L.]

Figure 1: Stock & options P&L roughly cancel to give the (relatively low variance) total P&L. The agent's position tracks the delta.

31 40

Example: Baseline agent (discrete & t-costs)

[Figure: time series over 50 timesteps (DT) of value in dollars or shares, showing cost P&L, delta hedge shares, option P&L, stock P&L, stock position in shares, and total P&L.]

Figure 2: Stock & options P&L roughly cancel to give the (relatively low variance) total P&L. The agent trades so that the position in the next period will be the quantity −100 · Δ, rounded to shares.
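For reference, a sketch of this benchmark delta-hedge target position under the zero-rate BSM assumptions (the BSM delta formula is standard and not stated in the deck; the contract size of 100 shares is inferred from the −100 · Δ quantity above):

```python
import math
from statistics import NormalDist

def bsm_call_delta(s, k, sigma, tau):
    """Black-Scholes-Merton delta of a European call, assuming a zero risk-free rate."""
    d1 = (math.log(s / k) + 0.5 * sigma**2 * tau) / (sigma * math.sqrt(tau))
    return NormalDist().cdf(d1)

def delta_hedge_target(s, k, sigma, tau, contracts=1, contract_size=100):
    """Target share position -contracts * contract_size * delta, rounded to whole shares."""
    return round(-contracts * contract_size * bsm_call_delta(s, k, sigma, tau))
```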

32 40

Example: T-cost aware agent (discrete & t-costs)

[Figure: time series over 50 timesteps (DT) of value in dollars or shares, showing cost P&L, delta hedge shares, option P&L, stock P&L, stock position in shares, and total P&L.]

33 40

Kernel density estimates of total cost & volatility

[Figure: two kernel density panels comparing the "delta" and "reinf" methods; the left panel shows the density of total cost (roughly 100–300), the right panel the density of realized volatility of total P&L (roughly 5–15).]

Figure 3: Kernel density estimates for total cost (left panel) and volatility of total P&L (right panel) from N = 10,000 out-of-sample simulations. The "reinf" policy achieves much lower cost (t-statistic = −143.22) with no significant difference in volatility of total P&L.

34 40

Kernel density estimates of total P&L

[Figure: kernel density of the t-statistic of total P&L (roughly −8 to 0) for the "delta" and "reinf" methods.]

Figure 4: Kernel density estimates of the t-statistic of total P&L for each of our out-of-sample simulation runs and for both policies represented above ("delta" and "reinf"). The "reinf" method is seen to outperform in the sense that the t-statistic is much more often close to zero and insignificant.

35 40

Conclusions I
We have introduced an RL system that hedges an option under realistic conditions of discrete trading and nonlinear t-costs.

The approach does not depend on the existence of perfect dynamic replication. The system learns to optimally trade off variance and cost as best as possible, using whatever securities it is given as potential candidates for inclusion in the replicating portfolio.

It accomplishes this without the user providing any of the following information:
the strike price K
the fact that the stock price process is a geometric Brownian motion
the volatility of the price process
the Black-Scholes-Merton formula
the payoff function (S − K)⁺ at maturity
any of the Greeks

36 40

Conclusions II A key strength of the RL approach: it does not make any assumptions about the form of t-costs. RL learns the minimum-variance hedge subject to whatever t-cost function one provides. All it needs is a good simulator in which t-costs and options prices are simulated accurately.

37 40

References I

Almgren, Robert and Neil Chriss (1999). "Value under liquidation". In: Risk 12.12, pp. 61–63.

Almgren, Robert and Tianhui Michael Li (2016). "Option hedging with smooth market impact". In: Market Microstructure and Liquidity 2.1, p. 1650002.

Bank, Peter, H. Mete Soner, and Moritz Voß (2017). "Hedging with temporary price impact". In: Mathematics and Financial Economics 11.2, pp. 215–239.

Black, Fischer and Myron Scholes (1973). "The pricing of options and corporate liabilities". In: Journal of Political Economy 81.3, pp. 637–654.

Boyle, Phelim P. and Ton Vorst (1992). "Option replication in discrete time with transaction costs". In: The Journal of Finance 47.1, pp. 271–293.

Buehler, Hans et al. (2018). "Deep hedging". In: arXiv:1802.03042.

Figlewski, Stephen (1989). "Options arbitrage in imperfect markets". In: The Journal of Finance 44.5, pp. 1289–1311.

Grannan, Erik R. and Glen H. Swindle (1996). "Minimizing transaction costs of option hedging strategies". In: Mathematical Finance 6.4, pp. 341–364.

Halperin, Igor (2017). "QLBS: Q-Learner in the Black-Scholes (-Merton) Worlds". In: arXiv:1712.04609.

Henrotte, Philippe (1993). "Transaction costs and duplication strategies". In: Graduate School of Business, Stanford University.

38 40

References II
Kolm, Petter and Gordon Ritter (2019). "Dynamic Replication and Hedging: A Reinforcement Learning Approach". In: The Journal of Financial Data Science 1.1, pp. 159–171.

Leland, Hayne E. (1985). "Option pricing and replication with transactions costs". In: The Journal of Finance 40.5, pp. 1283–1301.

Martellini, Lionel (2000). "Efficient option replication in the presence of transactions costs". In: Review of Derivatives Research 4.2, pp. 107–131.

Merton, Robert C. (1973). "Theory of rational option pricing". In: The Bell Journal of Economics and Management Science, pp. 141–183.

Ritter, Gordon (2017). "Machine Learning for Trading". In: Risk 30.10, pp. 84–89.

Rogers, Leonard C.G. and Surbjeet Singh (2010). "The cost of illiquidity and its effects on hedging". In: Mathematical Finance 20.4, pp. 597–615.

Saito, Taiga and Akihiko Takahashi (2017). "Derivatives pricing with market impact and limit order book". In: Automatica 86, pp. 154–165.

Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. Second edition (in progress). MIT Press, Cambridge.

Szepesvari, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.

Toft, Klaus Bjerre (1996). "On the mean-variance tradeoff in option replication with transactions costs". In: Journal of Financial and Quantitative Analysis 31.2, pp. 233–263.

39 40

References III
Whalley, A. Elizabeth and Paul Wilmott (1997). "An asymptotic analysis of an optimal hedging model for option pricing with transaction costs". In: Mathematical Finance 7.3, pp. 307–324.

40 40

Page 15: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Brief introduction to RL I

RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control

At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set

This choice influences both the transition to the next state aswell as the reward Rt the agent receives

Environment

Reward ActionState

RL Agent

14 40

Brief introduction to RL II A policy π is a way of choosing an action at conditional on

the current state st RL is the search for policies which maximize the expectation of

the cumulative reward Gt

E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]

where γ is discount factor (such that the infinite sumconverges)

Mathematically speaking RL is a way to solve multi-periodoptimal control problems

Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)

15 40

Brief introduction to RL III The action-value function expresses the value of starting in

state s taking an arbitrary action a and then following policyπ thereafter

qπ(s a) = Eπ[Gt | St = sAt = a] (1)

where Eπ denotes the expectation under the assumption thatpolicy π is followed

If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)

This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a

sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman

equations

16 40

A visual example

YouTube example

17 40

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 16: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Brief introduction to RL II A policy π is a way of choosing an action at conditional on

the current state st RL is the search for policies which maximize the expectation of

the cumulative reward Gt

E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]

where γ is discount factor (such that the infinite sumconverges)

Mathematically speaking RL is a way to solve multi-periodoptimal control problems

Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)

15 40

Brief introduction to RL III The action-value function expresses the value of starting in

state s taking an arbitrary action a and then following policyπ thereafter

qπ(s a) = Eπ[Gt | St = sAt = a] (1)

where Eπ denotes the expectation under the assumption thatpolicy π is followed

If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)

This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a

sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman

equations

16 40

A visual example

YouTube example

17 40

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 17: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Brief introduction to RL III The action-value function expresses the value of starting in

state s taking an arbitrary action a and then following policyπ thereafter

qπ(s a) = Eπ[Gt | St = sAt = a] (1)

where Eπ denotes the expectation under the assumption thatpolicy π is followed

If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)

This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a

sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman

equations

16 40

A visual example

YouTube example

17 40

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 18: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

A visual example

YouTube example

17 40

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost & volatility

[Figure: kernel density estimates by method ("delta" vs. "reinf"); left panel: total cost, right panel: realized vol of total P&L.]

Figure 3: Kernel density estimates for total cost (left panel) and volatility of total P&L (right panel) from N = 10,000 out-of-sample simulations. The "reinf" policy achieves much lower cost (t-statistic = −143.22) with no significant difference in volatility of total P&L.

34 / 40
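A brief sketch of how such an out-of-sample comparison could be computed (the file names and variable names are hypothetical, and this is not the study's code): a two-sample t-test on total cost plus kernel density estimates for plotting.

import numpy as np
from scipy import stats

# Hypothetical per-episode totals from N out-of-sample simulations, one value per run
total_cost_delta = np.loadtxt("total_cost_delta.csv")
total_cost_reinf = np.loadtxt("total_cost_reinf.csv")

# Two-sample t-test on mean total cost (Welch's version, not assuming equal variances)
t_stat, p_value = stats.ttest_ind(total_cost_reinf, total_cost_delta, equal_var=False)

# Kernel density estimates of the two cost distributions on a common grid
grid = np.linspace(min(total_cost_delta.min(), total_cost_reinf.min()),
                   max(total_cost_delta.max(), total_cost_reinf.max()), 200)
density_delta = stats.gaussian_kde(total_cost_delta)(grid)
density_reinf = stats.gaussian_kde(total_cost_reinf)(grid)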

Kernel density estimates of total P&L

[Figure: kernel density estimates of the t-statistic of total P&L, by method ("delta" vs. "reinf").]

Figure 4: Kernel density estimates of the t-statistic of total P&L for each of our out-of-sample simulation runs and for both policies represented above ("delta" and "reinf"). The "reinf" method is seen to outperform in the sense that the t-statistic is much more often close to zero and insignificant.

35 / 40

Conclusions I

We have introduced a RL system that hedges an option under realistic conditions of discrete trading and nonlinear t-costs.

The approach does not depend on the existence of perfect dynamic replication. The system learns to optimally trade off variance and cost, as best as possible, using whatever securities it is given as potential candidates for inclusion in the replicating portfolio.

It accomplishes this without the user providing any of the following information:
- the strike price K
- the fact that the stock price process is a geometric Brownian motion
- the volatility of the price process
- the Black-Scholes-Merton formula
- the payoff function (S − K)+ at maturity
- any of the Greeks

36 / 40

Conclusions II

A key strength of the RL approach: it does not make any assumptions about the form of t-costs. RL learns the minimum variance hedge subject to whatever t-cost function one provides. All it needs is a good simulator in which t-costs and options prices are simulated accurately.

37 / 40

References I

Almgren, Robert and Neil Chriss (1999). "Value under liquidation". In: Risk 12.12, pp. 61–63.

Almgren, Robert and Tianhui Michael Li (2016). "Option hedging with smooth market impact". In: Market Microstructure and Liquidity 2.1, p. 1650002.

Bank, Peter, H. Mete Soner, and Moritz Voß (2017). "Hedging with temporary price impact". In: Mathematics and Financial Economics 11.2, pp. 215–239.

Black, Fischer and Myron Scholes (1973). "The pricing of options and corporate liabilities". In: Journal of Political Economy 81.3, pp. 637–654.

Boyle, Phelim P. and Ton Vorst (1992). "Option replication in discrete time with transaction costs". In: The Journal of Finance 47.1, pp. 271–293.

Buehler, Hans et al. (2018). "Deep hedging". In: arXiv:1802.03042.

Figlewski, Stephen (1989). "Options arbitrage in imperfect markets". In: The Journal of Finance 44.5, pp. 1289–1311.

Grannan, Erik R. and Glen H. Swindle (1996). "Minimizing transaction costs of option hedging strategies". In: Mathematical Finance 6.4, pp. 341–364.

Halperin, Igor (2017). "QLBS: Q-Learner in the Black-Scholes (-Merton) Worlds". In: arXiv:1712.04609.

Henrotte, Philippe (1993). "Transaction costs and duplication strategies". In: Graduate School of Business, Stanford University.

38 / 40

References II

Kolm, Petter and Gordon Ritter (2019). "Dynamic Replication and Hedging: A Reinforcement Learning Approach". In: The Journal of Financial Data Science 1.1, pp. 159–171.

Leland, Hayne E. (1985). "Option pricing and replication with transactions costs". In: The Journal of Finance 40.5, pp. 1283–1301.

Martellini, Lionel (2000). "Efficient option replication in the presence of transactions costs". In: Review of Derivatives Research 4.2, pp. 107–131.

Merton, Robert C. (1973). "Theory of rational option pricing". In: The Bell Journal of Economics and Management Science, pp. 141–183.

Ritter, Gordon (2017). "Machine Learning for Trading". In: Risk 30.10, pp. 84–89.

Rogers, Leonard C. G. and Surbjeet Singh (2010). "The cost of illiquidity and its effects on hedging". In: Mathematical Finance 20.4, pp. 597–615.

Saito, Taiga and Akihiko Takahashi (2017). "Derivatives pricing with market impact and limit order book". In: Automatica 86, pp. 154–165.

Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement learning: An introduction. Second edition in progress. MIT Press, Cambridge.

Szepesvari, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.

Toft, Klaus Bjerre (1996). "On the mean-variance tradeoff in option replication with transactions costs". In: Journal of Financial and Quantitative Analysis 31.2, pp. 233–263.

39 / 40

References III

Whalley, A. Elizabeth and Paul Wilmott (1997). "An asymptotic analysis of an optimal hedging model for option pricing with transaction costs". In: Mathematical Finance 7.3, pp. 307–324.

40 / 40

Page 19: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Supervised learning vs RL

Training signals equiv target output from training set

Supervisedlearning

Inputs Outputs

Training signals equiv rewards

ReinforcementlearningStates Actions

18 40

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 20: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

RL has a feedback loop

Environment

Training signals equiv rewards

ReinforcementlearningStates Actions

19 40

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 21: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of

actions Need access to the environment for training

So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID

decisions) We are able to observe feedback to state or choice of actions

This information can be partial andor noisy There is a gain to be made by optimizing action choice over a

portion of the trajectory (ie time consistency needed for theBellman equation to hold)

Other ML techniques cannot easily deal with these situations20 40

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 22: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Common challenges when solving RL problems

Specifying the model Representing the state Choosing the set of actions Designing the reward

Acquiring data for training Exploration exploitation High cost actions Time-delayed reward

Function approximation (random forests CNNs) Validation confidence measures

21 40

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 23: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Automatic hedging in theory I

We define automatic hedging to be the practice of usingtrained RL agents to handle hedging

With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance

With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward

22 40

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 / 40

Example: T-cost aware agent (discrete & t-costs)

[Figure: time series over the 50 timesteps of one episode; x-axis: timestep (DT); y-axis: value (dollars or shares); series: costpnl, deltahedgeshares, optionpnl, stockpnl, stockposshares, totalpnl]

33 / 40

Kernel density estimates of total cost & volatility

[Figure: kernel density estimates in two panels; left panel x-axis: totalcost, right panel x-axis: realizedvol; y-axis: density; methods: delta, reinf]

Figure 3: Kernel density estimates for total cost (left panel) and volatility of total P&L (right panel) from N = 10,000 out-of-sample simulations. The "reinf" policy achieves much lower cost (t-statistic = −143.22) with no significant difference in volatility of total P&L.
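A comparison in this spirit can be computed with a simple two-sample t-test on the per-path total costs of the two policies; that this matches the exact statistic reported above is our assumption:

from scipy import stats

def compare_policies(costs_delta, costs_reinf):
    """Welch two-sample t-test on per-path total costs of the two policies."""
    t_stat, p_value = stats.ttest_ind(costs_reinf, costs_delta, equal_var=False)
    return t_stat, p_value

# costs_delta and costs_reinf would each hold the N = 10,000 out-of-sample
# total costs of the "delta" and "reinf" policies, respectively.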

34 / 40

Kernel density estimates of total P&L

[Figure: kernel density estimates of the t-statistic of total P&L; x-axis: student t statistic of total pnl; y-axis: density; methods: delta, reinf]

Figure 4: Kernel density estimates of the t-statistic of total P&L for each of our out-of-sample simulation runs and for both policies represented above ("delta" and "reinf"). The "reinf" method is seen to outperform in the sense that the t-statistic is much more often close to zero and insignificant.

35 / 40

Conclusions I

We have introduced an RL system that hedges an option under realistic conditions of discrete trading and nonlinear t-costs.

The approach does not depend on the existence of perfect dynamic replication. The system learns to optimally trade off variance and cost as best as possible, using whatever securities it is given as potential candidates for inclusion in the replicating portfolio.

It accomplishes this without the user providing any of the following information: the strike price K; the fact that the stock price process is a geometric Brownian motion; the volatility of the price process; the Black-Scholes-Merton formula; the payoff function (S − K)+ at maturity; or any of the Greeks.

36 / 40

Conclusions II

A key strength of the RL approach: It does not make any assumptions about the form of t-costs. RL learns the minimum-variance hedge subject to whatever t-cost function one provides. All it needs is a good simulator, in which t-costs and options prices are simulated accurately.

37 / 40

References I

Almgren, Robert and Neil Chriss (1999). "Value under liquidation". In: Risk 12.12, pp. 61–63.

Almgren, Robert and Tianhui Michael Li (2016). "Option hedging with smooth market impact". In: Market Microstructure and Liquidity 2.1, p. 1650002.

Bank, Peter, H. Mete Soner, and Moritz Voß (2017). "Hedging with temporary price impact". In: Mathematics and Financial Economics 11.2, pp. 215–239.

Black, Fischer and Myron Scholes (1973). "The pricing of options and corporate liabilities". In: Journal of Political Economy 81.3, pp. 637–654.

Boyle, Phelim P. and Ton Vorst (1992). "Option replication in discrete time with transaction costs". In: The Journal of Finance 47.1, pp. 271–293.

Buehler, Hans et al. (2018). "Deep hedging". In: arXiv:1802.03042.

Figlewski, Stephen (1989). "Options arbitrage in imperfect markets". In: The Journal of Finance 44.5, pp. 1289–1311.

Grannan, Erik R. and Glen H. Swindle (1996). "Minimizing transaction costs of option hedging strategies". In: Mathematical Finance 6.4, pp. 341–364.

Halperin, Igor (2017). "QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds". In: arXiv:1712.04609.

Henrotte, Philippe (1993). "Transaction costs and duplication strategies". Graduate School of Business, Stanford University.

38 / 40

References II

Kolm, Petter and Gordon Ritter (2019). "Dynamic Replication and Hedging: A Reinforcement Learning Approach". In: The Journal of Financial Data Science 1.1, pp. 159–171.

Leland, Hayne E. (1985). "Option pricing and replication with transactions costs". In: The Journal of Finance 40.5, pp. 1283–1301.

Martellini, Lionel (2000). "Efficient option replication in the presence of transactions costs". In: Review of Derivatives Research 4.2, pp. 107–131.

Merton, Robert C. (1973). "Theory of rational option pricing". In: The Bell Journal of Economics and Management Science, pp. 141–183.

Ritter, Gordon (2017). "Machine Learning for Trading". In: Risk 30.10, pp. 84–89.

Rogers, Leonard C.G. and Surbjeet Singh (2010). "The cost of illiquidity and its effects on hedging". In: Mathematical Finance 20.4, pp. 597–615.

Saito, Taiga and Akihiko Takahashi (2017). "Derivatives pricing with market impact and limit order book". In: Automatica 86, pp. 154–165.

Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. Second edition, in progress. MIT Press, Cambridge.

Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.

Toft, Klaus Bjerre (1996). "On the mean-variance tradeoff in option replication with transactions costs". In: Journal of Financial and Quantitative Analysis 31.2, pp. 233–263.

39 / 40

References III

Whalley, A. Elizabeth and Paul Wilmott (1997). "An asymptotic analysis of an optimal hedging model for option pricing with transaction costs". In: Mathematical Finance 7.3, pp. 307–324.

40 / 40

Page 24: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the

solution to a mean-variance optimization problem withrisk-aversion κ

max983043E[wT ]minus

κ

2V[wT ]

983044(2)

where the final wealth wT is the sum of individual wealthincrements δwt

wT = w0 +T983131

t=1

δwt

We will let wealth increments include trading costs

23 40

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 25: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Automatic hedging in theory III In the random walk case this leads to solving

minpermissible strategies

T983131

t=0

983043E[minusδwt ] +

κ

2V[δwt ]

983044(3)

whereminusδwt = ct

and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)

With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem

24 40

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 26: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Automatic hedging in theory IV We choose the reward in each period to be2

Rt = δwt minusκ

2(δwt)

2 (4)

Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance

2See Ritter (2017) for a general discussion of reward functions in trading25 40

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 27: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Automatic hedging in practice I

Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock

We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero

The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity

For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares

The state is thus naturally an element of3

S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z

26 40

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 28: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Automatic hedging in practice II The state does not need to contain the option Greeks because

they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their

own

A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models

3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1

27 40

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 40

References I

Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63

Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002

Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239

Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654

Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293

Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042

Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311

Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364

Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609

Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University

38 40

References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning

Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171

Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301

Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131

Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183

Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89

Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615

Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165

Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge

Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers

Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263

39 40

References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model

for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324

40 40

Page 29: Dynamic Replication and Hedging: A Reinforcement Learning ...€¦ · Machine learning in finance I Adoption of ML has been rapid in the financial industry. Why? The data The algos

Letrsquos put RL at a disadvantage

The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion

(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks

Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment

28 40

Simulation assumptions I

We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday

We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity

We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods

We require trades (hence also holdings) to be integer numbersof shares

We assume that our agentrsquos job is to hedge one contract ofthis option

In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01

29 40

Simulation assumptions II T-costs For a trade size of n shares we define

cost(n) = multiplier times TickSize times (|n|+ 001n2)

where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact

All of the examples were trained on a single CPU and thelongest training time allowed was one hour

Baseline agent = RL agent trained in a friction-less world

30 40

Example Baseline agent (discrete amp no t-costs)

minus200

minus100

0

100

200

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta

31 40

Example Baseline agent (discrete amp t-costs)

minus100

0

100

0 10 20 30 40 50

timestep (DT)

valu

e (

dolla

rs o

r sh

are

s)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares

32 40

Example T-cost aware agent (discrete amp t-costs)

minus100

minus50

0

50

100

0 10 20 30 40 50

timestep (DT)

valu

e (

do

llars

or

sha

res)

costpnl

deltahedgeshares

optionpnl

stockpnl

stockposshares

totalpnl

33 40

Kernel density estimates of total cost amp volatility

000

001

002

003

100 200 300

totalcost

densi

ty

method

delta

reinf

000

025

050

075

100

5 10 15

realizedvol

densi

ty

method

delta

reinf

Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL

34 40

Kernel density estimates of total PampL

00

01

02

03

04

minus8 minus6 minus4 minus2 0

studenttstatistictotalpnl

densi

ty

method

delta

reinf

Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40

Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs

The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio

It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning

motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks

36 40

Conclusions II A key strength of the RL approach It does not make any

assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately

37 / 40

References I

Almgren, Robert and Neil Chriss (1999). "Value under liquidation". In: Risk 12.12, pp. 61–63.

Almgren, Robert and Tianhui Michael Li (2016). "Option hedging with smooth market impact". In: Market Microstructure and Liquidity 2.1, p. 1650002.

Bank, Peter, H. Mete Soner, and Moritz Voß (2017). "Hedging with temporary price impact". In: Mathematics and Financial Economics 11.2, pp. 215–239.

Black, Fischer and Myron Scholes (1973). "The pricing of options and corporate liabilities". In: Journal of Political Economy 81.3, pp. 637–654.

Boyle, Phelim P. and Ton Vorst (1992). "Option replication in discrete time with transaction costs". In: The Journal of Finance 47.1, pp. 271–293.

Buehler, Hans et al. (2018). "Deep hedging". In: arXiv:1802.03042.

Figlewski, Stephen (1989). "Options arbitrage in imperfect markets". In: The Journal of Finance 44.5, pp. 1289–1311.

Grannan, Erik R. and Glen H. Swindle (1996). "Minimizing transaction costs of option hedging strategies". In: Mathematical Finance 6.4, pp. 341–364.

Halperin, Igor (2017). "QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds". In: arXiv:1712.04609.

Henrotte, Philippe (1993). "Transaction costs and duplication strategies". In: Graduate School of Business, Stanford University.

38 / 40

References II

Kolm, Petter and Gordon Ritter (2019). "Dynamic Replication and Hedging: A Reinforcement Learning Approach". In: The Journal of Financial Data Science 1.1, pp. 159–171.

Leland, Hayne E. (1985). "Option pricing and replication with transactions costs". In: The Journal of Finance 40.5, pp. 1283–1301.

Martellini, Lionel (2000). "Efficient option replication in the presence of transactions costs". In: Review of Derivatives Research 4.2, pp. 107–131.

Merton, Robert C. (1973). "Theory of rational option pricing". In: The Bell Journal of Economics and Management Science, pp. 141–183.

Ritter, Gordon (2017). "Machine Learning for Trading". In: Risk 30.10, pp. 84–89.

Rogers, Leonard C.G. and Surbjeet Singh (2010). "The cost of illiquidity and its effects on hedging". In: Mathematical Finance 20.4, pp. 597–615.

Saito, Taiga and Akihiko Takahashi (2017). "Derivatives pricing with market impact and limit order book". In: Automatica 86, pp. 154–165.

Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. Second edition in progress. MIT Press, Cambridge.

Szepesvari, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.

Toft, Klaus Bjerre (1996). "On the mean-variance tradeoff in option replication with transactions costs". In: Journal of Financial and Quantitative Analysis 31.2, pp. 233–263.

39 / 40

References III

Whalley, A. Elizabeth and Paul Wilmott (1997). "An asymptotic analysis of an optimal hedging model for option pricing with transaction costs". In: Mathematical Finance 7.3, pp. 307–324.

40 / 40
