Robust Log-Optimal Strategy with Reinforcement Learning

YiFeng Guo, XingYu Fu, YuYan Shi, MingWen Liu
Sun Yat-sen University

arXiv:1805.00205v1 [q-fin.PM] 1 May 2018

We propose a new Portfolio Management method termed the Robust Log-Optimal Strategy (RLOS), which ameliorates the General Log-Optimal Strategy (GLOS) by approximating the traditional objective function with a quadratic Taylor expansion. It avoids GLOS's complex CDF estimation process and hence resists the "Butterfly Effect" caused by estimation error. Besides, RLOS retains GLOS's profitability, and the optimization problem involved in RLOS is computationally far more practical than that of GLOS. Further, we combine RLOS with Reinforcement Learning (RL) and propose the so-called Robust Log-Optimal Strategy with Reinforcement Learning (RLOSRL), where the RL agent receives the analyzed results from RLOS and observes the trading environment to make comprehensive investment decisions. RLOSRL's performance is compared to some traditional strategies on several back tests, where we randomly choose a selection of constituent stocks of the CSI300 index as assets under management; the test results validate its profitability and stability.

Keywords: Portfolio Management; Mathematical Finance; Artificial Intelligence; Information Theory; Log-Optimal Strategy; Robustness Analysis; Reinforcement Learning; Deep Learning; Convolutional Neural Network.

1 Introduction

Portfolio Management (PM), which aims to balance risk and return optimally by sequentially allocating an amount of wealth to a collection of assets over consecutive trading periods based on investors' return-risk profile [9], is a crucial problem in both theoretical and practical finance. There are mainly four types of PM strategies [11] proposed from both industry and academia: Follow-the-Winner, which allocates wealth to the best-performing asset; Follow-the-Loser, which allocates wealth to the worst-performing asset; Pattern-Matching, which predicts the distribution of price fluctuation of the next trading period based on historical data and chooses the optimal portfolio according to the prediction; and Meta-Learning, which combines several different PM strategies to form a better-performing ensembled investment policy. In our work, we focus on the latter two types, where we ensemble two algorithms from the Pattern-Matching category, i.e. a Log-Optimal based Strategy and a Reinforcement Learning based Strategy, to form our final PM policy.

The General Log-Optimal Strategy (GLOS), which originated from information theory [1] and attempts to maximize the expectation of the logarithmic rate of return by choosing the optimal portfolio weight, is a natural and one of the most renowned PM methods, and many exciting results have emerged from it [10,12,13]. In our work, we prove rigorously that GLOS is endowed with several elegant characteristics, i.e. Information-Benefit, Greed, and Long-Term Superiority, which justify its importance. In practice, however, the complicated estimation of the Cumulative Distribution Function (CDF) and the high computational complexity of GLOS make it hard to employ in the finance industry. We therefore propose the so-called Robust Log-Optimal Strategy (RLOS), which avoids GLOS's complex CDF estimation process and hence resists the "Butterfly Effect" caused by estimation error. Besides, RLOS retains GLOS's profitability, and the optimization problem involved in RLOS is computationally far more practical than that of GLOS.

Reinforcement Learning (RL) has witnessed tremendous success in various agent-environment interaction tasks [2,3,4] and is believed to be a possible avenue to General Artificial Intelligence, with many large companies (e.g. Google DeepMind, Facebook, Baidu) leading its pioneering research. It is also natural to apply the RL technique to the PM scenario [14,15], where the agent gets a reward if its investment decision increases the logarithmic rate of return and gets a punishment if it decreases it. In our work, we train an RL agent to manage a selection of assets in the highly noisy finance environment [5]; the agent receives the
analyzed results from RLOS and simultaneously observes the trading environment through a deep Convolutional Neural Network (CNN). The loss function of the RL algorithm, which is composed of reward and penalty, is designed to guide the virtual portfolio manager to trade in a profit-maximizing way. We term the ensemble of RLOS and RL the Robust Log-Optimal Strategy with Reinforcement Learning (RLOSRL).

Our strategies are back-tested on several random selections (bootstrapping) of constituent stocks of the CSI300 index, competing with Naive-Average, Follow-the-Loser, and Follow-the-Winner. RLOS and RLOSRL outperform all three baseline strategies in all of the back tests, validating the profitability and stability of our algorithms. Further, we can see from the back tests that the trading behaviors of RLOS and RLOSRL are similar to some extent, while RLOSRL defeats RLOS in all back tests, demonstrating the added value of RL. (The implementation can be viewed at https://github.com/fxy96/Robust-Log-Optimal-Strategy-with-Reinforcement-Learning.)

2 General Log-Optimal Strategy

PM is one of the most common investment methods: a collection of assets is managed to optimize a certain objective function. The objective function is often the trade-off between expected risk and expected return over a long-term investment horizon, while in the context of GLOS, the objective function is the expectation of the logarithmic rate of return. In this section, we first define GLOS formally and then study three important characteristics of the strategy, i.e. Information-Benefit, Greed, and Long-Term Superiority.

2.1 Definition of GLOS

Say we have $d$ assets under management in a single trading period. Let $X = (X_1, X_2, \ldots, X_d)^T$ represent the asset price fluctuation vector in a single trading period and $b = (b_1, b_2, \ldots, b_d)^T$ denote the portfolio weight vector that we set at the start of the trading period. For $X$, each $X_i$ is defined as the ratio of the closing price to the opening price of the $i$th asset over the trading period, so $X$ can be deemed a random vector whose CDF we denote by $F$. For $b$, each $b_i$ is defined as the proportion of the total investment allocated to the $i$th asset, and choosing the optimal $b$ at the start of the trading period is the portfolio manager's mission.

In the context of GLOS, the portfolio manager tries to maximize the expected logarithmic rate of return, i.e. to choose $b^*_X$ satisfying the following equation:

$$b^*_X \in \arg\max_{b \in B} r_X(b) \tag{1}$$

where $r_X(b) = E\log(b^T X) = \int \log(b^T x)\, dF(x)$ is the expected logarithmic rate of return and $B$ is the constraint set imposed on $b$.
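
As a concrete, hypothetical illustration of (1) (not part of the original paper), the sketch below estimates a log-optimal portfolio by maximizing the sample average of $\log(b^T x)$ over historical fluctuation vectors; the long-only simplex constraint, the solver, and all names are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def glos_weights(samples: np.ndarray) -> np.ndarray:
    """Approximate b* = argmax E[log(b^T X)] by maximizing the sample
    average of log(b^T x) over the simplex.

    samples: (T, d) array of price fluctuation vectors (close/open ratios).
    """
    T, d = samples.shape
    # Negative empirical objective: -1/T * sum_t log(b^T x_t)
    objective = lambda b: -np.mean(np.log(samples @ b))
    constraints = [{"type": "eq", "fun": lambda b: np.sum(b) - 1.0}]
    bounds = [(0.0, 1.0)] * d          # long-only portfolio (our assumption)
    b0 = np.full(d, 1.0 / d)           # start from the naive average
    res = minimize(objective, b0, bounds=bounds, constraints=constraints)
    return res.x

# Toy usage: two assets, three observed periods.
x = np.array([[1.02, 0.99], [0.97, 1.01], [1.05, 0.98]])
print(glos_weights(x))
```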

The above analysis is set on a single trading period, while in practice we usually trade over consecutive trading periods, which yields a sequence $\{X_i\}_{i=1}^n$. If $\{X_i\}_{i=1}^n$ is independent and identically distributed (i.i.d.) across trading periods, then the optimal portfolio weight vector that maximizes the expected logarithmic rate of return is the same for every trading period, since the objective function is identical and the periods are not interdependent. In the general case, however, the i.i.d. assumption does not hold.

2.2 Characteristics of GLOS

The reason we study GLOS is the elegant mathematical foundation behind it, i.e. Information-Benefit, Greed, and Long-Term Superiority.

2.2.1 Information-Benefit

Market information influences the trading environment directly in a way that can be detected by GLOS. We call this property Information-Benefit.

Say we now trade on a sequence of consecutive trading periods. When new market information arrives at each trading period, the i.i.d. assumption on the distribution of $\{X_i\}_{i=1}^n$ is violated, and therefore we cannot just maintain a common portfolio weight vector $b^*_X$ for every trading period. The information, however, also brings an information benefit: it increases the expectation of the optimal logarithmic rate of return, as illustrated below.

Denote the information by $Y$ and the conditional distribution of $X$ given $Y = y$ by $F(x \mid Y = y)$ at each trading period. Let $b^*_{X|Y}$ be the optimal portfolio weight vector such that:

$$b^*_{X|Y} \in \arg\max_{b \in B} r_{X|Y}(b) = \arg\max_{b \in B} \int \log(b^T x)\, dF(x \mid Y = y) \tag{2}$$

The increment of the expected logarithmic rate of return is defined as:

$$\Delta V_Y = r_{X|Y}(b^*_{X|Y}) - r_{X|Y}(b^*_X) \tag{3}$$

$\Delta V_Y$ satisfies some elegant mathematical properties which place reasonable constraints on it.

Theorem 1. Suppose $\Delta V = E(\Delta V_Y)$. Then $\Delta V_Y$ and $\Delta V$ satisfy:

1) $\Delta V_Y \geq 0$;

2) $\Delta V_Y \leq \int f_{X|Y=y}(x) \log \frac{f_{X|Y=y}(x)}{f(x)}\, dx$;

3) $\Delta V \leq \iint h(x, y) \log \frac{h(x, y)}{f(x) g(y)}\, dx\, dy$;

where $f$ and $g$ are the marginal density functions of $X$ and $Y$ respectively and $h$ is the joint density function of $X$ and $Y$. See the proof in the appendix.

Note that $\Delta V_Y$ and $\Delta V$ are controlled by upper bounds, which means the increment of the expected logarithmic rate of return brought by the information benefit is not of an unreasonable scale.

Further, note that the upper bound of $\Delta V$, $\iint h(x, y) \log \frac{h(x, y)}{f(x) g(y)}\, dx\, dy$, is the mutual information between $X$ and $Y$, a concept frequently studied in information theory [1]. When $X$ and $Y$ are independent, we see that $\Delta V = 0$ (independence implies that the joint density function equals the product of the marginal density functions), which indicates that no information has been detected by the trading strategy. When $X$ is completely determined by $Y$, this upper bound is exactly the entropy of the information $Y$, another important concept in information theory [1]. Further exploration of this upper bound may be intriguing but is out of the scope of this paper, and we leave it for future research.

2.2.2 Greed and Long-Term Superiority

Say we trade on some consecutive trading periods. When $\{X_i\}_{i=1}^n$ is i.i.d., GLOS fixes a certain portfolio weight vector $b^*_X$ for all trading periods at the start. Due to the market information discussed in Section 2.2.1, however, the i.i.d. assumption does not hold. Fortunately, the i.i.d. assumption is not necessary. GLOS can make a greedy investment decision for every trading period, i.e. it only focuses on maximizing the expected logarithmic rate of return of each single trading period, while the strategy remains asymptotically superior to other strategies in terms of final gross wealth.

Denote the final gross wealth at the end of the $n$th trading period under a sequence of portfolio weight vectors $\{b_i\}_{i=1}^n$ by $S_n = S_0 \prod_{i=1}^n b_i^T X_i$, and the final gross wealth under GLOS by $S_n^* = S_0 \prod_{i=1}^n b_i^{*T} X_i$. The next theorem states the superiority of the General Log-Optimal Strategy over other PM strategies.

Theorem 2. $S_n^*$ is asymptotically superior to $S_n$ with probability 1.

Proof: See the proof in the appendix.

3 Robust Log-Optimal Strategy

There are several disadvantages to implementing GLOS in the finance industry. GLOS assumes that the CDF $F(x)$ of the price fluctuation vector $X$ is given, while in practice it is impossible for a portfolio manager to know $F(x)$ in advance. To implement GLOS, one must estimate $F(x)$ from historical data under certain assumptions, and the estimation error and improper assumptions may cause the so-called "Butterfly Effect". Besides, even if $F(x)$ were known, it is computationally expensive to optimize the GLOS objective $\int \log(b^T x)\, dF(x)$ because of the high dimensionality of $X$ and $b$ and the logarithmic operation in the expression.

Therefore, we propose RLOS, in which we neither estimate $F(x)$ nor maximize $\int \log(b^T x)\, dF(x)$ directly. Instead, RLOS introduces a new objective function called the Allocation Utility, a quadratic Taylor approximation to the expectation of the logarithmic rate of return, which is simple enough to involve only the expectation and covariance of $X$. Indeed, it can be viewed as a trade-off between the logarithmic return and its squared coefficient of variation.

RLOS is computationally efficient compared to GLOS, and it is robust in the sense that the upper bound of the Allocation Utility deviation can be controlled by the L1-norm of the portfolio weight vector and the L-infinity norm of the bias of the covariance matrix estimator.


3.1 Objective Function of RLOS (Allocation Utility)

GLOS solves (1). However, that optimization problem relies on the distribution function of $X$, which is hard to know in practice.

Supposing $X$ has expectation $\mu$ and covariance matrix $\Sigma$, in RLOS we adopt a quadratic Taylor expansion to approximate $E\log(b^T X)$. Thus, we define the Allocation Utility, the objective function of RLOS, as:

$$M(b, \mu, \Sigma) = \log(b^T \mu) - \frac{1}{2 (b^T \mu)^2}\, b^T \Sigma b \tag{4}$$

See the details of the computation in the appendix.

By maximizing $M(b, \mu, \Sigma)$, which is the mission of RLOS, we can estimate the solution to GLOS in a robust manner. Note that $M(b, \mu, \Sigma)$ does not involve the distribution function of $X$; the only parameters to be estimated are the expectation and covariance of $X$, which is far more practical than the objective function of GLOS, where an estimate of $F(x)$ is required.

Besides, $M(b, \mu, \Sigma)$ can be deemed a trade-off between the logarithmic return and its squared coefficient of variation, which somewhat coincides with Markowitz's "expected returns - variance of returns" rule [9].

3.2 Optimal Portfolio Weight Vector for RLOS

Consider the optimization problem in RLOS:

$$b^{opt} \in \arg\max M(b, \mu, \Sigma), \quad \text{s.t. } b \in B$$

$b^{opt}$ can be solved analytically for some natural constraint regions if $\mu$ and $\Sigma$ are given. For example, take $B = \{\, b \mid b^T e = 1,\ b^T \mu \geq c \,\}$, where $e = (1, 1, \ldots, 1)^T$. The first constraint follows from the definition of the portfolio weight vector and the second constraint imposes a minimal expected rate of return. We apply the Karush-Kuhn-Tucker conditions to solve this optimization problem; the computation is provided in the appendix.

From the optimization procedure, we can see that $b^{opt}$ depends on the values of $\mu$ and $\Sigma$. If the estimates of $\mu$ and $\Sigma$ deviate from their true values, the estimated optimal portfolio weight may deviate as well, which may result in the so-called "Butterfly Effect". Thus, we need to control the estimation process to reduce estimation deviation. In the next section, we prove that RLOS is robust, i.e. it tolerates reasonable estimation error.
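
For readers who prefer a numerical route, the sketch below (ours, not the authors' implementation) maximizes the Allocation Utility (4) under $b^T e = 1$ and $b^T \mu \geq c$ with a generic solver; the long-only bounds are an extra assumption for stability, and the paper's analytical KKT solution is given in the appendix.

```python
import numpy as np
from scipy.optimize import minimize

def rlos_weights(mu: np.ndarray, Sigma: np.ndarray, c: float) -> np.ndarray:
    """Maximize M(b, mu, Sigma) = log(b^T mu) - b^T Sigma b / (2 (b^T mu)^2)
    subject to b^T e = 1 and b^T mu >= c (long-only bounds are our assumption)."""
    d = len(mu)

    def neg_utility(b):
        m = b @ mu
        return -(np.log(m) - (b @ Sigma @ b) / (2.0 * m ** 2))

    constraints = [
        {"type": "eq", "fun": lambda b: np.sum(b) - 1.0},   # b^T e = 1
        {"type": "ineq", "fun": lambda b: b @ mu - c},       # b^T mu >= c
    ]
    b0 = np.full(d, 1.0 / d)
    res = minimize(neg_utility, b0, bounds=[(0.0, 1.0)] * d,
                   constraints=constraints)
    return res.x
```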

3.3 Robustness Analysis of RLOS

Suppose that $\hat b^{opt}$ is the optimal portfolio weight vector obtained by replacing $\mu$ and $\Sigma$ with their estimators $\hat\mu$ and $\hat\Sigma$ in the optimization procedure. The estimation errors of $\hat\mu$ and $\hat\Sigma$ may cause the so-called "Butterfly Effect". Hence, we need to study the robustness of RLOS.

We first give two reasonable assumptions here:

1) Suppose $E(\hat\mu) = \mu$, i.e. $\hat\mu$ is an unbiased estimator of $\mu$.

2) Let $\sigma_{ij}$ be the entry in the $i$th row and $j$th column of $\Sigma - \hat\Sigma$. We assume that $\max_i \sum_{j=1}^n |\sigma_{ij}| \leq M$, where $M$ is a positive constant.

From 1), we know $E(\hat b^{opt\,T} \hat\mu) = \hat b^{opt\,T} \mu$. By the Law of Large Numbers, for every $\varepsilon > 0$ we have $P(|\hat b^{opt\,T} \hat\mu - \hat b^{opt\,T} \mu| > \varepsilon) \to 0$ as the sample size $n \to \infty$; hence for every $\varepsilon > 0$ there exists $n_0 \in \mathbb{N}^+$ such that $|\hat b^{opt\,T} \hat\mu - \hat b^{opt\,T} \mu| \leq \varepsilon$ whenever the sample size exceeds $n_0$. A similar analysis applies to 2), so we do not repeat it here. Based on the above two assumptions, we now study the deviation between $M(\hat b^{opt}, \hat\mu, \hat\Sigma)$ and $M(\hat b^{opt}, \mu, \Sigma)$, which reflects the robustness of RLOS.

Theorem 3. Suppose $b^T \mu \geq c$ and the assumptions above hold. Then the bias of RLOS induced by the estimation has an upper bound:

$$|M(\hat b^{opt}, \hat\mu, \hat\Sigma) - M(\hat b^{opt}, \mu, \Sigma)| \leq \frac{1}{2 c^2} \Big( \sum_{i=1}^n |\hat b_i| \Big)^2 \max_i \sum_{j=1}^n |\sigma_{ij}| \tag{5}$$

Proof: See the proof in the appendix.

Thus, if we choose a suitable constant $c_0 > 0$ and restrict $\sum_{i=1}^n |\hat b_i| \leq c_0$, we can ensure that the estimation error is controlled. Hence, to obtain a robust optimization result and keep $|M(\hat b^{opt}, \hat\mu, \hat\Sigma) - M(\hat b^{opt}, \mu, \Sigma)|$ under an upper bound, we add the extra constraint $\sum_{i=1}^n |\hat b_i| \leq c_0$ to the optimization procedure. The parameter $c_0$ has a financial interpretation: it is larger than 1 if short-selling is allowed and equals 1 if short-selling is forbidden.

3.4 Implementation of RLOS

Say we now trade in the $k$th trading period. To implement RLOS, the portfolio manager needs to estimate the parameters involved in the objective function

$$M(b, \mu, \Sigma) = \log(b^T \mu) - \frac{1}{2 (b^T \mu)^2}\, b^T \Sigma b \tag{4}$$

i.e. we need to estimate the expectation $\mu$ and the covariance matrix $\Sigma$ of $X_k$ for the $k$th trading period.

To estimate these parameters, our PM strategy selects a collection of past trading periods that are similar to the $k$th trading period and then calculates $\hat\mu$ and $\hat\Sigma$ over that collection, which is the methodology of Pattern-Matching [11]. The problem that remains is how to define similarity between trading periods; we define it as the Pearson correlation between the market backgrounds of the trading periods. We now go into the implementation specifically.

3.4.1 Definition of Market Backgrounds

Consider the trading background of the $i$th trading period. We define it as the price fluctuation matrix reflecting the market situation from the $(i-n)$th to the $(i-1)$th trading period. Formally,

$$\mathrm{Background}(i, n) = \begin{pmatrix} x_{1,(i-n)} & \cdots & x_{1,(i-1)} \\ \vdots & \ddots & \vdots \\ x_{d,(i-n)} & \cdots & x_{d,(i-1)} \end{pmatrix}$$

where $x_{a,t}$ is the price fluctuation (ratio of the closing price to the opening price) of the $a$th asset in the $t$th trading period. Note that $n$ is a hyperparameter of the background controlling the length of history under consideration for each trading period. To estimate the statistics of $X_i$ more accurately, in practice we use multiple values of $n$ to define multiple backgrounds for the same trading period.

If $\mathrm{Background}(i, n)$ and $\mathrm{Background}(j, n)$ are "similar" in some sense, then $X_j$ may contain useful information for predicting the statistics of $X_i$.

3.4.2 Definition of Similarity and Similar Trading Periods Selection

We use the Pearson correlation to define similarity between trading periods. Formally,

$$\mathrm{Similar}(i, j, n) = \mathrm{corr}^*\big(\mathrm{Background}(i, n), \mathrm{Background}(j, n)\big)$$

where $\mathrm{corr}^*$ is the Pearson correlation. We also set a hyperparameter $\rho$ as the threshold that determines the notion of "similar": if $\mathrm{Similar}(i, j, n) > \rho$, we say the $i$th and $j$th trading periods are similar, and vice versa.

Having defined the notion of "similar", we now select the periods whose backgrounds are similar to that of the $k$th trading period, which forms the set:

$$S(k, n, \rho) = \{\, i \mid k - n \leq i < k,\ \mathrm{Similar}(i, k, n) > \rho \,\}$$
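
A minimal sketch of the similarity computation and period selection is given below; flattening the background matrices before taking the Pearson correlation, as well as the function names, are our assumptions, since the paper does not spell out these details.

```python
import numpy as np

def background(prices: np.ndarray, i: int, n: int) -> np.ndarray:
    """Price fluctuation matrix of periods (i-n) .. (i-1).
    prices: (d, T) array of close/open ratios x_{a,t}."""
    return prices[:, i - n:i]

def similar(prices: np.ndarray, i: int, j: int, n: int) -> float:
    """Pearson correlation between two flattened background matrices."""
    bi = background(prices, i, n).ravel()
    bj = background(prices, j, n).ravel()
    return np.corrcoef(bi, bj)[0, 1]

def select_similar_periods(prices: np.ndarray, k: int, n: int, rho: float):
    """Indices i with k-n <= i < k whose background correlates with the
    k-th period's background above the threshold rho (cf. S(k, n, rho))."""
    return [i for i in range(max(k - n, n), k)   # need i >= n so the background exists
            if similar(prices, i, k, n) > rho]
```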

3.4.3 Algorithm for RLOS

So far, we have developed enough notions to state the RLOS algorithm. Say we trade in the $k$th period. For different history lengths $n$, the selection of similar trading periods may vary, and therefore we obtain a collection of estimated parameters $\{\hat\mu_k^{(n)}, \hat\Sigma_k^{(n)}\}_{n=1}^N$. For each pair $(\hat\mu_k^{(n)}, \hat\Sigma_k^{(n)})$, the optimal portfolio vector $\hat b_k^{opt(n)}$ can be obtained by maximizing the corresponding objective function, so we end up with a collection of portfolio vectors; the remaining task is to ensemble them into a single portfolio vector for trading. We assign each $\hat b_k^{opt(n)}$ a weight $w^{(n)} \in \mathbb{R}$ that reflects the profitability of the vector and then ensemble them linearly.

To summarize the above discussion, we state the algorithm:

Algorithm 1 RLOS

Input: $N$, $\rho$, $k$, historical price data
Output: optimal $\hat b_k^{opt}$ for the $k$th trading period

1: for $n = 2 : N$ do
2:   construct $S(k, n, \rho)$
3:   if $|S(k, n, \rho)| \leq 1$ then
4:     continue
5:   else
6:     $\hat\mu_k^{(n)} = \mathrm{Mean}(\{X_i \mid i \in S(k, n, \rho)\})$
7:     $\hat\Sigma_k^{(n)} = \mathrm{Cov}(\{X_i \mid i \in S(k, n, \rho)\})$
8:     $\hat b_k^{opt(n)} = \arg\max_b M(b, \hat\mu_k^{(n)}, \hat\Sigma_k^{(n)})$
9:     $w^{(n)} = \log\big( \prod_{i \in S(k, n, \rho)} \hat b_k^{opt(n)T} X_i \big)$
10: end for
11: return $\hat b_k^{opt} = \big( \sum_n w^{(n)} \hat b_k^{opt(n)} \big) \big/ \big( \sum_n w^{(n)} \big)$


4 Robust Log-Optimal Strategy with Reinforcement Learning

In this paper, we construct an automatic RL trading agent that interacts with the low signal-to-noise-ratio stock market environment $E(t)$. Before the opening of the $t$th trading period, the agent receives the analysis result $v_t$ from RLOS and observes the recent historical data $s_t$. The agent then predicts the optimal opening portfolio weights $b_t^{pre}$ and the logarithmic rate of return of the $t$th trading period $r_t^{pre}$ based on the information it has just received. If we denote the CNN with parameters $\theta$ by $f_\theta$, we can summarize the above decision-making process as:

$$(b_t^{pre}, r_t^{pre}) = f_\theta(s_t, v_t) \tag{6}$$

After predicting the optimal opening portfolio weights $b_t^{pre}$, we must carry out stock transactions to transform $b_{t-1}^{end}$ into $b_t^{pre}$ once the stock market opens. Here we assume that the transaction can be carried out instantly, which implies that we start the $t$th trading period at $b_t^{pre}$. We also assume that the transaction fee is zero, which is reasonable since the trading frequency of our strategy is relatively low (once a day in the back test) compared to high-frequency trading, where transaction fees matter; in our case the fee is also trivial compared to the total investment and the stock price fluctuation.

Non-stationary time series are common in finance, meaning the distribution of financial data may vary over time [5]. Therefore, it is crucial that we train our network constantly to fit the ever-changing environment $E(t)$. To train the network, we define the loss function as:

$$L(\theta) = \alpha (r_t^{pre} - r_t^{true})^2 - \beta\, b_t^{opt} \cdot \log(b_t^{pre}) - \sigma r_t^{true} + c \|\theta\| \tag{7}$$

where $\alpha, \beta, \sigma, c > 0$ are user-specified hyperparameters determining the importance of each term in the loss function; the details of the loss function are discussed in Section 4.3. An SGD algorithm with a history re-experience mechanism [4] is used to minimize this loss.

4.1 Predicting

Before the opening of the $t$th trading period, the RL trader receives two pieces of information: $v_t$, the estimated optimal portfolio weight vector predicted by the RLOS algorithm, and $s_t$, the recent historical data. We have discussed the RLOS algorithm thoroughly in Section 3, so we only discuss the formation of $s_t$ here.

$s_t$ is a tensor that reflects the trading circumstances of the $d$ assets over the most recent $n$ trading periods. The circumstances that $s_t$ takes into consideration are the opening, highest, and lowest prices and the trading volume of each asset in each of the $n$ periods. Formally, $s_t$ is a $d \times n \times 4$ tensor whose rows represent assets, whose columns represent periods, and whose third dimension represents the various aspects of the recent trading environment.

The RL trading agent feeds $s_t$ into the Convolutional Neural Network and plugs $v_t$ into the feature map tensors before the output layer. The output layer is the concatenation of a softmax layer and a fully connected layer, which output the predicted optimal opening portfolio weights $b_t^{pre}$ and the predicted logarithmic rate of return $r_t^{pre}$ of the $t$th trading period, respectively. The structure of the network is shown in Figure 1.

Figure 1. Topology of the Network

The filters of the network are all one-dimensional for the following two reasons (a sketch of such an architecture follows the list):

1) The AI trader can manage a different number of stocks without changing the network architecture, which enhances the flexibility of our trading system.

2) Due to the shared-weight architecture of the CNN, the filters are trained to capture common characteristics of the fluctuations of different stocks independently.
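
The sketch below illustrates one possible realization of this topology in PyTorch; the paper does not report layer sizes, kernel widths, or exactly how $v_t$ is injected, so all of those choices here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RLOSRLNet(nn.Module):
    """Illustrative network in the spirit of Figure 1: 1-D convolutions over the
    time axis shared across assets, with the RLOS output v_t concatenated to the
    per-asset features before the two output heads. Layer sizes are assumptions."""
    def __init__(self, n_features: int = 4, hidden: int = 16):
        super().__init__()
        # Input s_t arranged as (batch, 4, d, n); requires n >= 5 for the two convs.
        self.conv1 = nn.Conv2d(n_features, hidden, kernel_size=(1, 3))
        self.conv2 = nn.Conv2d(hidden, hidden, kernel_size=(1, 3))
        # Per-asset head: pooled conv features plus that asset's RLOS weight.
        self.weight_head = nn.Linear(hidden + 1, 1)   # -> softmax over assets
        self.return_head = nn.Linear(hidden + 1, 1)   # -> predicted log return

    def forward(self, s_t: torch.Tensor, v_t: torch.Tensor):
        # s_t: (batch, 4, d, n), v_t: (batch, d)
        h = F.relu(self.conv1(s_t))
        h = F.relu(self.conv2(h))
        h = h.mean(dim=3)                         # pool over time: (batch, hidden, d)
        h = h.permute(0, 2, 1)                    # (batch, d, hidden)
        h = torch.cat([h, v_t.unsqueeze(-1)], dim=-1)
        b_pre = torch.softmax(self.weight_head(h).squeeze(-1), dim=-1)  # (batch, d)
        r_pre = self.return_head(h).squeeze(-1).mean(dim=-1)            # (batch,)
        return b_pre, r_pre
```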


4.2 Trading Evaluation

After setting the opening portfolio weights of the $t$th trading period to $b_t^{pre}$, the market opens, and our RL agent does not trade until the opening of the next trading period $(t + 1)$. At the end of the $t$th trading period, we can evaluate the performance of the initial portfolio weights $b_t^{pre}$, and later train our neural network, by calculating the logarithmic rate of return $r_t^{true}$:

$$r_t^{true} = \log(b_t^{pre\,T} X_t) \tag{8}$$

where $X_t$ is the price fluctuation vector introduced in Section 2.1.

4.3 Training

To recap, we write the loss function here again:

$$L(\theta) = \alpha (r_t^{pre} - r_t^{true})^2 - \beta\, b_t^{opt} \cdot \log(b_t^{pre}) - \sigma r_t^{true} + c \|\theta\| \tag{7}$$

where $\alpha, \beta, \sigma, c > 0$ are user-specified hyperparameters determining the importance of each term in the loss function. We now explain each term in detail:

$\alpha (r_t^{pre} - r_t^{true})^2$: this term is the squared error between the network-predicted logarithmic rate of return and the true logarithmic rate of return, which is calculated at the end of the $t$th trading period.

$-\beta\, b_t^{opt} \cdot \log(b_t^{pre})$: this term is the cross entropy between the network-predicted optimal opening portfolio weights $b_t^{pre}$ and the "follow the winner" opening portfolio weights $b_t^{opt}$, which is obtained by setting $b_{t,i}^{opt} = 1$ and $b_{t,j}^{opt} = 0$ ($\forall j \neq i$), where $X_{t,i}$ is the largest entry of the price fluctuation vector $X_t$. The reason we call $b_t^{opt}$ the "optimal" portfolio is that

$$b_t^{opt} \in \arg\max_b \log(b \cdot X_t)$$

Thus, the more similar $b_t^{pre}$ is to $b_t^{opt}$, the better the logarithmic rate of return $r_t^{true}$ we can achieve.

$-\sigma r_t^{true}$: this term is the true logarithmic rate of return of the $t$th trading period, an instant reward for setting the opening portfolio weights to $b_t^{pre}$. The agent tries to maximize $r_t^{true}$, which is the goal of our trading system, by minimizing the loss function $L(\theta)$.

$c\|\theta\|$: this is a classical term which prevents over-fitting via L2-norm regularization.
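
A minimal sketch of the loss (7) for a single trading period is given below, assuming the PyTorch setting of the network sketch above; the small constant added inside the logarithm and the exact form of the L2 penalty are our assumptions.

```python
import torch

def rlosrl_loss(r_pre, r_true, b_pre, X_t, params,
                alpha=1e-4, beta=1e-2, sigma=1e-2, c=1e-4):
    """Loss of Eq. (7), sketched for one trading period.
    r_pre, r_true: scalar tensors; b_pre: (d,) predicted weights; X_t: (d,)
    fluctuations; params: iterable of network parameters (for the L2 penalty).
    r_true = log(b_pre . X_t) is realized at the period's end; pass it as a
    tensor in the graph if the reward term should backpropagate."""
    # "Follow the winner" target: all weight on the best-performing asset.
    b_opt = torch.zeros_like(b_pre)
    b_opt[torch.argmax(X_t)] = 1.0
    prediction_error = alpha * (r_pre - r_true) ** 2
    cross_entropy = -beta * torch.sum(b_opt * torch.log(b_pre + 1e-8))
    reward = -sigma * r_true
    l2 = c * sum(torch.sum(p ** 2) for p in params)
    return prediction_error + cross_entropy + reward + l2
```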

After each trading period ends, we need to train our network on the latest market information. Here we use the history re-experience mechanism [4]: we randomly sample some historical trading records and use the SGD algorithm with momentum to minimize the loss function on these records, i.e. we update the parameters $\theta$ as:

$$\theta := \theta - l \cdot \nabla L(\theta) \tag{9}$$

where $l > 0$ is the user-specified learning rate. Mechanisms such as Batch Normalization [6] and the ReLU-oriented initialization of [8] are used to improve the model's performance.

The purpose of random sampling is to break the correlations between consecutive samples, which is studied thoroughly in [4]. Instead of treating each sample equally as in Mnih et al.'s work, where a uniform sampling distribution is used, we use a Poisson distribution to emphasize recent samples, since data distributions in finance markets are often non-stationary [5].

Formally, consider the training at the end of the $t$th trading period. We sample the $\xi$th trading record to run the SGD algorithm, where $0 < \xi \leq t$. In our case, the random variable $(t - \xi)$ follows a Poisson distribution with parameter $\lambda > 0$:

$$(t - \xi) \sim P(\lambda)$$
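
A minimal sketch of this sampling rule, with clipping to the valid index range as our own assumption:

```python
import numpy as np

def sample_replay_index(t: int, lam: float = 50.0, rng=np.random) -> int:
    """Sample a historical record index xi with (t - xi) ~ Poisson(lam),
    clipped so that 0 < xi <= t (the clipping rule is our assumption)."""
    xi = t - rng.poisson(lam)
    return int(np.clip(xi, 1, t))

# Example: indices concentrate around t - 50 for lam = 50 (cf. Table 1).
print([sample_replay_index(1000) for _ in range(5)])
```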

5 Back Test

We use Naive-Average, Follow-the-Winner, and Follow-the-Loser as our baseline PM strategies, competing with the proposed RLOS and RLOSRL in several independent back tests. For each experiment, we randomly select 100 constituent stocks of the CSI300 index (using bootstrap sampling) as the assets under management, and the results of the back tests suggest the superiority of our strategies.
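
The paper does not spell out the baselines' formulas; the sketch below records the common-sense reading of the three baselines used in the comparison.

```python
import numpy as np

def naive_average(d: int) -> np.ndarray:
    """Equal weight on every asset."""
    return np.full(d, 1.0 / d)

def follow_the_winner(prev_fluctuation: np.ndarray) -> np.ndarray:
    """All wealth on the asset with the best previous-period fluctuation."""
    b = np.zeros_like(prev_fluctuation)
    b[np.argmax(prev_fluctuation)] = 1.0
    return b

def follow_the_loser(prev_fluctuation: np.ndarray) -> np.ndarray:
    """All wealth on the asset with the worst previous-period fluctuation."""
    b = np.zeros_like(prev_fluctuation)
    b[np.argmin(prev_fluctuation)] = 1.0
    return b
```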

5.1 Back Tests for RLOS

We run several back tests with different trading lengths on different stocks to evaluate RLOS's performance. We set the hyperparameters N (the maximal length of the historical background for each trading period) to 20 and ρ (the similarity threshold) to 0 in all experiments. We can see clearly from the back tests that RLOS outperforms all the other strategies in both short-term and long-term trading. For all the figures below, the horizontal axis represents time and the vertical axis represents total wealth.

Figure 2. Trading for 500 days (RLOS)

Figure 3. Trading for 1000 days (RLOS)

Figure 4. Trading for 1500 days (RLOS)

5.2 Back Tests for RLOSRL

We validate the performance of RLOSRL on several back tests. The RL agent is trained on data before June 1st, 2010 to avoid data leakage. To some extent, the trading behaviors of RLOS and RLOSRL are similar, since one of the inputs of RLOSRL is the analysis result from RLOS, yet RLOSRL is superior to RLOS in all the back tests.

Figure 5. Trading for 500 days (RLOSRL)

Table 1. Hyperparameters

Parameter                Value
λ (Poisson parameter)    50
r (momentum)             0.9
α                        10^-4
β                        10^-2
σ                        10^-2
c                        10^-4

Table 2. Learning Rate Decay

Hundreds of steps    Learning rate
0-500                10^-2
0-1000               10^-3
0-1500               10^-4

Figure 6. Trading for 1000 days (RLOSRL)

Figure 7. Trading for 1500 days (RLOSRL)

6 Conclusion

In this paper, we first analyze the advantages of the GLOS algorithm, namely Information-Benefit, Greed, and Long-Term Superiority, and then propose the RLOS algorithm, which is robust, profitable, and computationally efficient, by approximating the objective function of GLOS with a Taylor expansion. We further combine RLOS with the RL technique to form an ensembled strategy, where the PM agent is trained to maximize the expected logarithmic rate of return of the investment. The stability and profitability of our methods are empirically validated in several independent experiments.


We leave some possible avenues for future exploration here:

• Customize our PM strategies for other types of markets, e.g. futures, currencies, and bonds.

• Customize our PM strategies for high-frequency trading.

• Combine an asset selection strategy with our PM strategies.

• Take transaction fees into consideration.

• Estimate the parameters involved in our model more precisely.

• Approximate the objective function of GLOS more precisely.

References

[1] Cover, T.M. and Thomas, J.A., 2012. Elements of Information Theory. John Wiley & Sons.
[2] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), pp.484-489.
[3] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. and Chen, Y., 2017. Mastering the game of Go without human knowledge. Nature, 550(7676), p.354.
[4] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G. and Petersen, S., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540), p.529.
[5] Lopez de Prado, M., 2018. The 10 Reasons Most Machine Learning Funds Fail.
[6] Ioffe, S. and Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448-456).
[7] Nair, V. and Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 807-814).
[8] He, K., Zhang, X., Ren, S. and Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).
[9] Markowitz, H., 1952. Portfolio selection. The Journal of Finance, 7(1), pp.77-91.
[10] Algoet, P.H. and Cover, T.M., 2011. Asymptotic optimality and asymptotic equipartition properties of log-optimum investment. In The Kelly Capital Growth Investment Criterion: Theory and Practice (pp. 157-179).
[11] Li, B. and Hoi, S.C., 2014. Online portfolio selection: A survey. ACM Computing Surveys (CSUR), 46(3), p.35.
[12] Ormos, M. and Urban, A., 2013. Performance analysis of log-optimal portfolio strategies with transaction costs. Quantitative Finance, 13(10), pp.1587-1597.
[13] Vajda, I., 2006. Analysis of semi-log-optimal investment strategies. In Prague Stochastics (pp. 719-727).
[14] Jiang, Z. and Liang, J., 2016. Cryptocurrency portfolio management with deep reinforcement learning. arXiv preprint arXiv:1612.01277.
[15] Dempster, M.A. and Leemans, V., 2006. An automated FX trading system using adaptive reinforcement learning. Expert Systems with Applications, 30(3), pp.543-552.


Appendix

A Proof of Theorem 1

Lemma 1. For any random variable $X \geq 0$ and any $\varphi(x) > 0$, $E(\log(\varphi(X))) \leq \log(E(\varphi(X)))$.

Proof: Since the logarithm satisfies

$$\log(\lambda x_1 + (1 - \lambda) x_2) \geq \lambda \log(x_1) + (1 - \lambda) \log(x_2), \quad \lambda \in [0, 1],$$

it is a concave function. Let $\lambda = \frac{x_2 - x}{x_2 - x_1}$ for $x \in [x_1, x_2]$. Then $\log(x) \geq \frac{x_2 - x}{x_2 - x_1} \log(x_1) + \frac{x - x_1}{x_2 - x_1} \log(x_2)$. This inequality is equivalent to

$$\frac{1}{x - x_1} [\log(x) - \log(x_1)] \geq \frac{1}{x_2 - x_1} [\log(x_2) - \log(x_1)].$$

Letting $x \to x_1$, we have

$$(x_2 - x_1) \log'(x_1) \geq \log(x_2) - \log(x_1).$$

Let $x_0 = \sum_{i=1}^m \lambda_i x_i$ with $\sum_{i=1}^m \lambda_i = 1$ and $\lambda_i > 0$. For each $i$ we have

$$\lambda_i (x_i - x_0) \log'(x_0) \geq \lambda_i [\log(x_i) - \log(x_0)].$$

Thus

$$\sum_{i=1}^m \lambda_i (x_i - x_0) \log'(x_0) \geq \sum_{i=1}^m \lambda_i [\log(x_i) - \log(x_0)],$$

and since the left-hand side equals zero, $\log(x_0) \geq \sum_{i=1}^m \lambda_i \log(x_i)$. Since

$$E(\log(\varphi(X))) = \int \log(\varphi(x))\, dF(x) = \lim_{n \to \infty} \sum_k \log(\varphi(x_k)) \Big( F\big(\tfrac{k}{n}\big) - F\big(\tfrac{k-1}{n}\big) \Big),$$

$$\log(E(\varphi(X))) = \log\Big( \int \varphi(x)\, dF(x) \Big) = \log\Big( \lim_{n \to \infty} \sum_k \varphi(x_k) \Big( F\big(\tfrac{k}{n}\big) - F\big(\tfrac{k-1}{n}\big) \Big) \Big),$$

and the logarithm is continuous, letting $\lambda_k = F(\frac{k}{n}) - F(\frac{k-1}{n})$ and $m \to \infty$ in the finite inequality above gives

$$\int \log(\varphi(x))\, dF(x) \leq \log\Big( \int \varphi(x)\, dF(x) \Big).$$

So $E(\log(\varphi(X))) \leq \log(E(\varphi(X)))$ as desired.

Lemma 2. If $b^*$ is the optimal portfolio and $E\frac{b^T X}{b^{*T} X}$ exists, then $E\frac{b^T X}{b^{*T} X} \leq 1$ for any other portfolio $b$.

Proof: Let $W(b_\lambda, F) = \int \log(b_\lambda^T x)\, dF(x)$ with $b_\lambda = \lambda b + (1 - \lambda) b^*$, where $b$ is another portfolio. When $\lambda = 0$ we have $b_0 = b^*$. By definition, $W(b_0, F) = W(b^*, F) = \max_{b \in B} \int \log(b^T x)\, dF(x)$ is the greatest value, so $W(b_0, F) \geq W(b_\lambda, F)$ for all $\lambda \in [0, 1]$ and hence $\frac{dW(b_\lambda, F)}{d\lambda} \leq 0$ as $\lambda \to 0^+$ by the definition of the derivative. That is to say,

$$\begin{aligned}
\lim_{\lambda \to 0^+} \frac{dW(b_\lambda, F)}{d\lambda}
&= \lim_{\lambda \to 0^+} \frac{1}{\lambda} [W(b_\lambda, F) - W(b_0, F)] \\
&= \lim_{\lambda \to 0^+} \frac{1}{\lambda} \big[ E\big(\log(\lambda b^T X + (1 - \lambda) b^{*T} X)\big) - E\big(\log(b^{*T} X)\big) \big] \\
&= E\Big( \lim_{\lambda \to 0^+} \frac{1}{\lambda} \log\Big( \lambda \frac{b^T X}{b^{*T} X} + 1 - \lambda \Big) \Big) \quad (*) \\
&= E\Big( \lim_{\lambda \to 0^+} \frac{1}{\lambda} \log\Big( 1 + \lambda \Big( \frac{b^T X}{b^{*T} X} - 1 \Big) \Big) \Big) \\
&= E\Big( \frac{b^T X}{b^{*T} X} - 1 \Big) \quad (**) \\
&\leq 0
\end{aligned}$$

Equality (*) can be referred to [5] and equality (**) is due to L'Hopital's rule. Hence $E\frac{b^T X}{b^{*T} X} \leq 1$, as desired.

Then we will give the proof of Theorem 1.

The Proof of Theorem 1:

$$\begin{aligned}
\Delta V_Y &= r_{X|Y}(b^*_{X|Y}) - r_{X|Y}(b^*_X) \\
&= \int \log(b^{*T}_{X|Y} x)\, dF(x \mid Y = y) - \int \log(b^{*T}_X x)\, dF(x \mid Y = y) \\
&= \int \log \frac{b^{*T}_{X|Y} x}{b^{*T}_X x}\, dF(x \mid Y = y) \\
&= \int \log\Big( \frac{b^{*T}_{X|Y} x}{b^{*T}_X x} \cdot \frac{f(x)}{f_{X|Y=y}(x)} \Big)\, dF(x \mid Y = y) + \int \log \frac{f_{X|Y=y}(x)}{f(x)}\, dF(x \mid Y = y) \\
&= \int \log\Big( \frac{b^{*T}_{X|Y} x}{b^{*T}_X x} \cdot \frac{f(x)}{f_{X|Y=y}(x)} \Big)\, dF(x \mid Y = y) + \int f_{X|Y=y}(x) \log \frac{f_{X|Y=y}(x)}{f(x)}\, dx \\
&\leq \log \int \frac{b^{*T}_{X|Y} x}{b^{*T}_X x} \cdot \frac{f(x)}{f_{X|Y=y}(x)}\, dF(x \mid Y = y) + \int f_{X|Y=y}(x) \log \frac{f_{X|Y=y}(x)}{f(x)}\, dx \quad \text{(Lemma 1)} \\
&= \log \int \frac{b^{*T}_{X|Y} x}{b^{*T}_X x}\, dF(x) + \int f_{X|Y=y}(x) \log \frac{f_{X|Y=y}(x)}{f(x)}\, dx \\
&\leq \log 1 + \int f_{X|Y=y}(x) \log \frac{f_{X|Y=y}(x)}{f(x)}\, dx \quad \text{(Lemma 2)} \\
&= \int f_{X|Y=y}(x) \log \frac{f_{X|Y=y}(x)}{f(x)}\, dx
\end{aligned}$$

Furthermore, we define $\Delta V = E(\Delta V_Y)$, the expectation of the increment $\Delta V_Y$ with respect to $Y$. We now prove that $\Delta V$ also has an upper bound. Denote by $G$ (resp. $H$) the cumulative distribution function of $Y$ (resp. $(X, Y)$), and by $g$ (resp. $h$) the corresponding density function. We verify that

$$\begin{aligned}
\Delta V &= \int \Delta V_{Y=y}\, dG(y) \\
&\leq \iint f_{X|Y=y}(x) \log \frac{f_{X|Y=y}(x)}{f(x)}\, dx\, dG(y) \\
&= \iint f_{X|Y=y}(x)\, g(y) \log \frac{f_{X|Y=y}(x)\, g(y)}{f(x)\, g(y)}\, dx\, dy \\
&= \iint h(x, y) \log \frac{h(x, y)}{f(x)\, g(y)}\, dx\, dy
\end{aligned}$$


B Proof of Theorem 2

Proof: According to Lemma 2, we have $E\frac{S_n}{S_n^*} \leq 1$ and

$$\begin{aligned}
\Pr(S_n > n^2 S_n^*) &= \Pr\Big( \frac{S_n}{S_n^*} > n^2 \Big) = \int_{n^2}^{+\infty} dF\Big( \frac{S_n}{S_n^*} \Big) \\
&\leq \frac{1}{n^2} \int_{n^2}^{+\infty} \frac{S_n}{S_n^*}\, dF\Big( \frac{S_n}{S_n^*} \Big) \leq \frac{1}{n^2} \int_{0}^{+\infty} \frac{S_n}{S_n^*}\, dF\Big( \frac{S_n}{S_n^*} \Big) \\
&\leq \frac{1}{n^2}\, E\frac{S_n}{S_n^*} \leq \frac{1}{n^2}
\end{aligned}$$

That is to say,

$$\Pr\Big( \frac{1}{n} \log \frac{S_n}{S_n^*} > \frac{1}{n} \log n^2 \Big) \leq \frac{1}{n^2}$$

$$\sum_{n=1}^{\infty} \Pr\Big( \frac{1}{n} \log \frac{S_n}{S_n^*} > \frac{2 \log n}{n} \Big) \leq \sum_{n=1}^{\infty} \frac{1}{n^2} < \infty$$

and, by the Borel-Cantelli lemma,

$$\Pr\Big( \bigcap_{k=1}^{\infty} \bigcup_{n=k}^{\infty} \Big\{ \frac{1}{n} \log \frac{S_n}{S_n^*} > \frac{2 \log n}{n} \Big\} \Big) = \lim_{k \to \infty} \Pr\Big( \bigcup_{n=k}^{\infty} \Big\{ \frac{1}{n} \log \frac{S_n}{S_n^*} > \frac{2 \log n}{n} \Big\} \Big) \leq \lim_{k \to \infty} \sum_{n=k}^{\infty} \Pr\Big( \frac{1}{n} \log \frac{S_n}{S_n^*} > \frac{2 \log n}{n} \Big) = 0$$

This implies that, with probability 1, there exists $N > 0$ such that for all $n > N$,

$$\frac{1}{n} \log \frac{S_n}{S_n^*} \leq \frac{2 \log n}{n}.$$

Thus $\limsup_{n \to \infty} \frac{1}{n} \log \frac{S_n}{S_n^*} \leq 0$ with probability 1, as desired, and we conclude that $S_n^*$ is asymptotically superior to $S_n$.

C Computation of Taylor Expansion

$$\begin{aligned}
E\log(b^T X) &\approx E\Big( \log E(b^T X) + \frac{b^T X - E(b^T X)}{E(b^T X)} - \frac{(b^T X - E(b^T X))^2}{2 (E(b^T X))^2} \Big) \\
&= \log(b^T \mu) - \frac{1}{2 (b^T \mu)^2}\, b^T \Sigma b
\end{aligned} \tag{4}$$
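
For completeness, the approximation is the second-order Taylor expansion of $\log(y)$ around $y_0 = E(b^T X) = b^T \mu$; the short derivation below is ours and only spells out the steps implicit in (4):

$$\log(y) \approx \log(y_0) + \frac{y - y_0}{y_0} - \frac{(y - y_0)^2}{2 y_0^2}, \qquad y = b^T X,\ y_0 = E(b^T X) = b^T \mu.$$

Taking expectations term by term, $E(y - y_0) = 0$ and $E\big[(y - y_0)^2\big] = \mathrm{Var}(b^T X) = b^T \Sigma b$, so

$$E\log(b^T X) \approx \log(b^T \mu) - \frac{b^T \Sigma b}{2 (b^T \mu)^2} = M(b, \mu, \Sigma).$$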

D Proof of Theorem 3

Lemma 3. For all $p, q, q_1, q_2 \in \mathbb{R}$ satisfying $q_1 \leq q \leq q_2$, we have

$$|p - q| \leq \max\{\, p - q_1,\ q_2 - p \,\}.$$

Proof: If $p \geq q$, then $p - q_1 \geq p - q = |p - q| \geq 0$. If $p < q$, then $q_2 - p \geq q - p = |p - q| > 0$. In both cases $|p - q| \leq \max\{\, p - q_1,\ q_2 - p \,\}$.

Next, we will study the deviation between M(b̂opt, µ̂, Σ̂) and M(b̂opt, µ, Σ).

$$\begin{aligned}
&|M(\hat b^{opt}, \hat\mu, \hat\Sigma) - M(\hat b^{opt}, \mu, \Sigma)| \\
&= \Big| \log(\hat b^{opt\,T} \hat\mu) - \frac{1}{2 (\hat b^{opt\,T} \hat\mu)^2} \hat b^{opt\,T} \hat\Sigma \hat b^{opt} - \log(\hat b^{opt\,T} \mu) + \frac{1}{2 (\hat b^{opt\,T} \mu)^2} \hat b^{opt\,T} \Sigma \hat b^{opt} \Big| \\
&= \Big| \big( \log(\hat b^{opt\,T} \hat\mu) - \log(\hat b^{opt\,T} \mu) \big) + \Big( \frac{1}{2 (\hat b^{opt\,T} \mu)^2} \hat b^{opt\,T} \Sigma \hat b^{opt} - \frac{1}{2 (\hat b^{opt\,T} \hat\mu)^2} \hat b^{opt\,T} \hat\Sigma \hat b^{opt} \Big) \Big| \\
&\leq \big| \log(\hat b^{opt\,T} \hat\mu) - \log(\hat b^{opt\,T} \mu) \big| + \Big| \frac{1}{2 (\hat b^{opt\,T} \mu)^2} \hat b^{opt\,T} \Sigma \hat b^{opt} - \frac{1}{2 (\hat b^{opt\,T} \hat\mu)^2} \hat b^{opt\,T} \hat\Sigma \hat b^{opt} \Big| \\
&= \Big| \log \frac{\hat b^{opt\,T} \hat\mu}{\hat b^{opt\,T} \mu} \Big| + \frac{1}{2} \Big| \frac{1}{(\hat b^{opt\,T} \mu)^2} \hat b^{opt\,T} \Sigma \hat b^{opt} - \frac{1}{(\hat b^{opt\,T} \hat\mu)^2} \hat b^{opt\,T} \hat\Sigma \hat b^{opt} \Big| \\
&\leq \Big| \log \frac{\hat b^{opt\,T} \hat\mu + \varepsilon}{\hat b^{opt\,T} \hat\mu} \Big| + \frac{1}{2} \Big| \frac{1}{(\hat b^{opt\,T} \mu)^2} \hat b^{opt\,T} \Sigma \hat b^{opt} - \frac{1}{(\hat b^{opt\,T} \hat\mu)^2} \hat b^{opt\,T} \hat\Sigma \hat b^{opt} \Big| \\
&= \Big| \log\Big( 1 + \frac{\varepsilon}{\hat b^{opt\,T} \hat\mu} \Big) \Big| + \frac{1}{2} \Big| \frac{1}{(\hat b^{opt\,T} \mu)^2} \hat b^{opt\,T} \Sigma \hat b^{opt} - \frac{1}{(\hat b^{opt\,T} \hat\mu)^2} \hat b^{opt\,T} \hat\Sigma \hat b^{opt} \Big|
\end{aligned}$$

Since $\hat b^{opt\,T} \hat\mu \geq c$, we have $\big| \log\big( 1 + \frac{\varepsilon}{\hat b^{opt\,T} \hat\mu} \big) \big| \leq \log\big( 1 + \frac{\varepsilon}{c} \big)$ for the first term on the right-hand side.

For the second term on the right-hand side, according to Lemma 3 and since $-\varepsilon \leq \hat b^{opt\,T} \hat\mu - \hat b^{opt\,T} \mu \leq \varepsilon$, we have

$$\Big| \frac{1}{(\hat b^{opt\,T} \hat\mu)^2} \hat b^{opt\,T} \hat\Sigma \hat b^{opt} - \frac{1}{(\hat b^{opt\,T} \mu)^2} \hat b^{opt\,T} \Sigma \hat b^{opt} \Big| \leq \max\Big\{ \frac{1}{(\hat b^{opt\,T} \hat\mu)^2} \hat b^{opt\,T} \hat\Sigma \hat b^{opt} - \frac{1}{(\hat b^{opt\,T} \hat\mu + \varepsilon)^2} \hat b^{opt\,T} \Sigma \hat b^{opt},\ \frac{1}{(\hat b^{opt\,T} \hat\mu - \varepsilon)^2} \hat b^{opt\,T} \Sigma \hat b^{opt} - \frac{1}{(\hat b^{opt\,T} \hat\mu)^2} \hat b^{opt\,T} \hat\Sigma \hat b^{opt} \Big\}$$

For the first candidate,

$$\frac{1}{(\hat b^{opt\,T} \hat\mu)^2} \hat b^{opt\,T} \hat\Sigma \hat b^{opt} - \frac{1}{(\hat b^{opt\,T} \hat\mu + \varepsilon)^2} \hat b^{opt\,T} \Sigma \hat b^{opt} = \frac{1}{(\hat b^{opt\,T} \hat\mu)^2} \hat b^{opt\,T} \Big[ \hat\Sigma - \frac{(\hat b^{opt\,T} \hat\mu)^2}{(\hat b^{opt\,T} \hat\mu + \varepsilon)^2} \Sigma \Big] \hat b^{opt} \leq \frac{1}{c^2} \hat b^{opt\,T} \Big[ \hat\Sigma - \frac{(\hat b^{opt\,T} \hat\mu)^2}{(\hat b^{opt\,T} \hat\mu + \varepsilon)^2} \Sigma \Big] \hat b^{opt}$$

and, analogously, for the second candidate

$$\frac{1}{(\hat b^{opt\,T} \hat\mu - \varepsilon)^2} \hat b^{opt\,T} \Sigma \hat b^{opt} - \frac{1}{(\hat b^{opt\,T} \hat\mu)^2} \hat b^{opt\,T} \hat\Sigma \hat b^{opt} \leq \frac{1}{c^2} \hat b^{opt\,T} \Big[ \frac{(\hat b^{opt\,T} \hat\mu)^2}{(\hat b^{opt\,T} \hat\mu - \varepsilon)^2} \Sigma - \hat\Sigma \Big] \hat b^{opt}$$

The above inequalities immediately imply that, for small $\varepsilon$,

$$\Big| \frac{1}{(\hat b^{opt\,T} \hat\mu)^2} \hat b^{opt\,T} \hat\Sigma \hat b^{opt} - \frac{1}{(\hat b^{opt\,T} \mu)^2} \hat b^{opt\,T} \Sigma \hat b^{opt} \Big| \approx \frac{1}{c^2} \big| \hat b^{opt\,T} (\hat\Sigma - \Sigma) \hat b^{opt} \big|$$

Let $\hat b^{opt} = (\hat b_1, \ldots, \hat b_n)^T$ and let $\sigma_{ij}$ be the entry in the $i$th row and $j$th column of $\Sigma - \hat\Sigma$. We have

$$\begin{aligned}
|\hat b^{opt\,T} (\hat\Sigma - \Sigma) \hat b^{opt}| &= \Big| \sum_{i=1}^n \hat b_i \Big( \sum_{j=1}^n \hat b_j \sigma_{ij} \Big) \Big| \\
&\leq \sum_{i=1}^n |\hat b_i| \Big| \sum_{j=1}^n \hat b_j \sigma_{ij} \Big| \\
&\leq \sum_{i=1}^n |\hat b_i| \sum_{j=1}^n |\hat b_j| |\sigma_{ij}| \\
&\leq \sum_{i=1}^n |\hat b_i| \Big( \sum_{j=1}^n |\hat b_j| \Big) \Big( \sum_{j=1}^n |\sigma_{ij}| \Big) \\
&\leq \sum_{i=1}^n |\hat b_i| \Big( \sum_{j=1}^n |\hat b_j| \Big) \max_i \sum_{j=1}^n |\sigma_{ij}| \\
&= \Big( \sum_{i=1}^n |\hat b_i| \Big)^2 \max_i \sum_{j=1}^n |\sigma_{ij}|
\end{aligned}$$

Combining all of the computations above, we obtain

$$\begin{aligned}
|M(\hat b^{opt}, \hat\mu, \hat\Sigma) - M(\hat b^{opt}, \mu, \Sigma)|
&\leq \Big| \log\Big( 1 + \frac{\varepsilon}{\hat b^{opt\,T} \hat\mu} \Big) \Big| + \frac{1}{2} \Big| \frac{1}{(\hat b^{opt\,T} \mu)^2} \hat b^{opt\,T} \Sigma \hat b^{opt} - \frac{1}{(\hat b^{opt\,T} \hat\mu)^2} \hat b^{opt\,T} \hat\Sigma \hat b^{opt} \Big| \\
&\leq \log\Big( 1 + \frac{\varepsilon}{c} \Big) + \frac{1}{2 c^2} \Big( \sum_{i=1}^n |\hat b_i| \Big)^2 \max_i \sum_{j=1}^n |\sigma_{ij}| \\
&= \frac{1}{2 c^2} \Big( \sum_{i=1}^n |\hat b_i| \Big)^2 \max_i \sum_{j=1}^n |\sigma_{ij}|
\end{aligned}$$

where the last equality holds in the limit $\varepsilon \to 0$, which is justified by the unbiasedness of $\hat\mu$ and the Law of Large Numbers argument above.

E Computation of Optimal Portfolio

Let $F(b, \alpha, \beta) = \log(b^T \mu) - \frac{1}{2 (b^T \mu)^2} b^T \Sigma b + \alpha (b^T e - 1) + \beta (c - b^T \mu)$, where $\beta \geq 0$.

Set $\frac{\partial F(b, \alpha, \beta)}{\partial b} = 0$ and consider $\beta (c - b^T \mu) = 0$, $\beta \geq 0$, $b^T e - 1 = 0$ simultaneously:

$$\begin{cases}
-\dfrac{\mu}{b^T \mu} - \dfrac{(b^T \Sigma b)\, \mu}{(b^T \mu)^3} + \dfrac{\Sigma b}{(b^T \mu)^2} + \alpha e - \beta \mu = 0 \\
\beta (c - b^T \mu) = 0 \\
\beta \geq 0 \\
b^T e - 1 = 0
\end{cases}$$

Multiplying both sides of the first equality by $b^T$, we have

$$0 = b^T \Big[ -\frac{\mu}{b^T \mu} - \frac{(b^T \Sigma b)\, \mu}{(b^T \mu)^3} + \frac{\Sigma b}{(b^T \mu)^2} + \alpha e - \beta \mu \Big] = -1 + \alpha - \beta\, b^T \mu \tag{10}$$

(Case 1) If $\beta = 0$, then (10) implies $\alpha = 1$. Substituting $\alpha = 1$, $\beta = 0$ into the first equality gives the value of the optimal portfolio $b^{opt}$ immediately.

(Case 2) If $\beta > 0$, then $b^T \mu = c$ by the second equality, and by (10) we have $\alpha = 1 + \beta c$. Thus

$$\begin{cases}
-\dfrac{\mu}{b^T \mu} - \dfrac{(b^T \Sigma b)\, \mu}{(b^T \mu)^3} + \dfrac{\Sigma b}{(b^T \mu)^2} + \alpha e - \beta \mu = 0 \\
b^T \mu = c \\
\alpha = 1 + \beta c
\end{cases}$$

Solving the above equation system, we obtain the value of the optimal portfolio $b^{opt}$.