
Predicting the Volatility Index Returns Using Machine Learning

by

Michael Yu

A thesis submitted in conformity with the requirements for the degree of Master of Science

Graduate Department of Mathematics
University of Toronto

© Copyright 2017 by Michael Yu


Abstract

Predicting the Volatility Index Returns Using Machine Learning

Michael Yu

Master of Science

Graduate Department of Mathematics

University of Toronto

2017

We probe how predictable the short term future behaviour of the Chicago Board Options Exchange (CBOE) Volatility Index (ticker symbol VIX) is given past market price data, within the constraints of a simple classic machine learning framework. We use past VIX and SPX price time windows as input to predict the movement direction, i.e. the sign of the return, of VIX over the next 1 to 6 weekdays. For the successful cases of predicting the return direction from one particular weekday to another particular future weekday, we obtain moderately reliable accuracies of between about 55% and 65%, depending on the particular time bridge. We find that 1 day returns are difficult to predict except in a few particular cases, and that as the prediction window grows our models predict more and more accurately, reaching a consistent 62% for both 5 and 6 days into the future.


Contents

1 Background
2 Setup
  2.1 Definitions
  2.2 Hypothesis
3 Method
  3.1 Machine Learning Paradigm
    3.1.1 Financial Time Series
  3.2 Machine Learning Concepts
    3.2.1 Feature Engineering
    3.2.2 Ensemble Learning
  3.3 Machine Learning Primitives
    3.3.1 Decision Trees
  3.4 Specified Model
    3.4.1 Input Space and Output Space
    3.4.2 Combinations Features Bank
    3.4.3 Hyperparameter Search
    3.4.4 Cross Validation Techniques
4 Results
  4.1 Classification Evaluation
    4.1.1 Confusion Matrix
    4.1.2 Precision and Recall
    4.1.3 Accuracy
  4.2 Sample Procedure
    4.2.1 Best Committee of the Best XGBoost Models
    4.2.2 XGBoost Specifications
  4.3 Sample Test Results
  4.4 Interpretation
5 Discussion
  5.1 Discovery Process
  5.2 Limitations and Extensions

References


1 Background

The S&P 500 index (ticker symbol SPX) is a weighted sum of the stock prices of 500 influential American companies. Its performance over time represents well the economic growth of the US. The Chicago Board Options Exchange (CBOE) offers SPX options to traders so that positions in their portfolios can be hedged. The prices of SPX options enter as variables into relations involving the future volatility of the SPX price. As the prices of SPX options are determined purely by buy and sell activity in their market, the volatility of SPX as expected by participants of the stock market can be meaningfully inferred from market option prices. CBOE's volatility index, VIX [2], defines a changing portfolio of SPX options that constantly seeks to track the mathematically implied volatility over the next 30 days according to market behaviour.

2 Setup

We obtained SPX data from Yahoo Finance and VIX data from CBOE. The dates (in yyyy/mm/dd format) forming our dataset range from 1990/01/02 to 2017/09/06.

2.1 Definitions

We use $[a, b)$ to denote the range of indices $a, a+1, \ldots, b-1$ where $a \le b$ (an empty set of indices when $a = b$). Let $V = \{V_i : i \in [i_0 - 1, i_0 + N)\}$ denote the time series of VIX prices. We have on the order of $N \approx 7000$ data points. Define the time series $L_i = \log(V_i)$ and $R_i = \log(V_i / V_{i-1})$ over the domain $[i_0, i_0 + N)$.

Define the parameters $p = 30$ and $q$, which denote the past time window size and the future prediction horizon size, respectively. Use the variable names $\ell$ and $r$ to denote the collections of rolling window views onto the time series $L$ and $R$. That is, $\ell_{i,j} = L_{i+j}$ for $i = i_0 + p, \ldots, i_0 + N - q$, where for each $i$ the index $j$ ranges over $[-p, q)$, with an analogous definition for $r_{i,j}$. In other words, $\ell_i = L_{[i-p, i+q)}$ and $r_i = R_{[i-p, i+q)}$. For each $i$, then, the rolling windows $\ell_i$ and $r_i$ have $p$ days of past, and therefore observable, data, and $q$ days of future data (or "current" when $j = 0$, still interpreted as being in the future).

We want to consider other data series besides the VIX price in exactly the same way. Let us consider VIX the index-0 data series, and add the SPX price as the index-1 data series, so that we have the corresponding series $L^{(0)} = L$ as above and $L^{(1)}$ defined analogously for the SPX price; the other series $R$ and the rolling windows $\ell$ and $r$ can all carry the $(0)$, $(1)$ modifiers corresponding to VIX and SPX respectively. Let $d = 2$ denote the number of data series considered.
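To make the indexing concrete, the following is a minimal sketch of building $L$, $R$, and the rolling window views in Python with NumPy; the synthetic price array is a stand-in, not the actual dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 30, 6                       # past window size and future horizon
d = 2                              # number of data series (VIX, SPX)

# Synthetic stand-ins for the aligned daily VIX and SPX closes.
V = np.exp(3.0 + np.cumsum(0.05 * rng.standard_normal((7000, d)), axis=0))

L = np.log(V)[1:]                  # L_i = log(V_i), aligned with R below
R = np.diff(np.log(V), axis=0)     # R_i = log(V_i / V_{i-1})

# Rolling window views: for a valid i, rows [i-p, i) are observable past
# data and rows [i, i+q) are future data used only to construct targets.
def past_window(X, i):
    return X[i - p : i]            # X_{i,[-p,0)}

def future_window(X, i):
    return X[i : i + q]            # X_{i,[0,q)}

i = 1000                                           # any index in [p, N - q)
target = np.sign(future_window(R, i)[:, 0].sum())  # sign of q-day VIX return
```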

2.2 Hypothesis

We hypothesize that the behaviour of the VIX price over the next few days can be predicted using data on the VIX price and other related market time series over the past. One stronger hypothesis we test is that there is a pattern relating each $p$ days of past data on all the considered data series to the behaviour of the VIX price over the next $q$ days. In particular, we investigate the predictive potential for the cumulative return of VIX over the future $q$ days at each time window. In general we could have chosen any target future quantity thought to depend largely on market factors that manifest in the chosen data series in the past. In a similar vein, we can grow the prediction model by considering more data series thought to contain extra market information pertaining to future VIX prices. The choice of a fixed length past time window on which to base predictions is made with the goal of restricting model complexity and probing the sufficiency of fairly short history memory in forecasting future returns.

3 Method

3.1 Machine Learning Paradigm

First we describe the standard machine learning paradigm, which does not theoretically subsume all of the work we do for the application in this paper, but nonetheless provides instructive insight and acts as a good frame of reference for thinking about other practical prediction methods.

There is a space $S$ of possible input signals and a space $T$ of possible output targets. The product space $S \times T$ has a probability distribution $D$ from which we can repeatedly sample. Our goal is to find a function $m$ such that $m(s)$ and $t$ are expected to be similar according to the similarity metric $E$ when $(s, t)$ is sampled from $D$.

To find $m$, we will use the idea behind the following approach. We hypothesize a family of functions $\{m_a : S \to T\}$, each function $m_a$ in which is characterized by a parameter $a$ in the parameter space $A$, and which is also endowed with a prior probability $c$, where $c(a)$ denotes how likely we believe $m_a$ is, out of all the functions in the family, to best satisfy our goal when knowing nothing about $D$. We take i.i.d. samples from $D$ many times to get two sets of observations, $\{(s,t)\}_{\mathrm{train}}$ and $\{(s,t)\}_{\mathrm{cv}}$ (cv stands for cross validation). We choose $a$ so as to balance maximizing $c(a)$ against optimizing that $m_a(s)$ and $t$ are most similar according to $E$ when $(s,t)$ is drawn from the empirical distribution defined by $\{(s,t)\}_{\mathrm{train}}$. Then we evaluate whether $m_a(s)$ and $t$ are similar by $E$ when $(s,t) \sim \{(s,t)\}_{\mathrm{cv}}$. A satisfactory result indicates we can choose $m = m_a$. Call this procedure a machine learning iteration.

The family of functions must be inclusive enough to contain a function $m_a$ that can satisfactorily meet our goal. On the other hand, the family of functions should be exclusive enough to preclude the possibility of $m_a$ being chosen especially to minimize the expected $E(m_a(s), t)$ for $(s,t) \sim \{(s,t)\}_{\mathrm{train}}$ while missing out on minimizing $E(m_a(s), t)$ for $(s,t) \sim D$ in general.

The evaluation of how well $m_a(s)$ predicts $t$ using $\{(s,t)\}_{\mathrm{cv}}$ loses its reliability if we repeat the machine learning iteration with different family definitions, different $c(a)$, or different methods of searching for $a$ across $A$, and pick the best result. This issue arises because we do not discern which machine learning iteration should give the best predictive model prior to peeking at the results on the cross validation dataset. The inaccessibility of $\{(s,t)\}_{\mathrm{cv}}$ to the training part of the machine learning iteration is what makes its measurement of how well $m_a(s)$ tracks $t$ useful. Executing a machine learning iteration on a prediction problem, however, can give crucial insight for designing a better machine learning iteration approach for that problem. Therefore we find ways to repeat the machine learning iteration without incurring the cost of overfitting to the cross validation dataset with our meta selection of families of functions and other machine learning iteration parameters.

The same training and cross validation datasets may be used in multiple different machine learning iterations if the number of machine learning iterations is small. A series of machine learning iterations can be considered as one machine learning iteration if we instead acknowledge a bigger training set of $\{(s,t)\}_{\mathrm{train}} \cup \{(s,t)\}_{\mathrm{cv}}$ and sample a different cross validation set anew.

The final cross validation set on which the performance of the machine learning approach is evaluated is better known as the test set. The results on this set can be taken as a probabilistic sample of the effectiveness of the resulting prediction solution.

3.1.1 Financial Time Series

For our application to predicting VIX returns, the input space is all possible values of a number of real valued time series up to any final day. We can ignore part of the information in the input and instead consider the input space to be all possible $p$ day past time windows into the data series we have chosen, provided our prediction functions only need to take in these $p$ day windows. The output space is either the value of the $q$ day (log) return immediately following each of those $p$ day windows, or some less specific characteristic of that value, like its sign.

We split the physically available collection of time series data up to some cutoff day into three different sample sets: training, cross validation, and testing. A free variable $i$ denoting the day after the latest day present in the inputs to our machine learning models (the same variable that indexes time) can designate the identity of an individual sample from the input space, i.e. act as an index for the windows. Denote the sets of $i$ indices represented by the training, cross validation, and testing collections as $tr$, $cv$, and $te$, respectively.

We can assume that the distributions of these time windows are independent across different $i$ to align with the stated machine learning paradigm, but this assumption is made more for ease of understanding of the paradigm than out of application requirements. For the sake of practical use of a VIX return prediction model, however, results obtained by using data from the future to help make predictions about unseen events in the past are not generalizable. Therefore, all of the $i \in cv$ are greater than those in $tr$, and likewise for $te$ relative to $cv$, i.e. the splits are in chronological order. The windows of $te$ consist of the latest 2 years of data, amounting to about 7.3% of the dataset. For typical machine learning applications the split would not be so skewed, but for our application market behaviour can change significantly very quickly, so being able to test our algorithm on the 2 latest years should mark it as sufficiently effective.
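As a minimal sketch of this chronological split (the window counts and the boundary gap are illustrative assumptions; see also the footnote in the next paragraph on dropping windows that border adjacent sets):

```python
import numpy as np

def chronological_split(n_windows, q=6, test_days=2 * 252, cv_days=2 * 252):
    """Split window indices into tr, cv, te in chronological order.

    q windows are dropped at each boundary so the future days of the last
    windows in one set do not overlap the windows of the next set.
    Sizes here are illustrative, not the thesis's exact split.
    """
    idx = np.arange(n_windows)
    te_start = n_windows - test_days
    cv_start = te_start - cv_days
    tr = idx[: cv_start - q]
    cv = idx[cv_start : te_start - q]
    te = idx[te_start:]
    return tr, cv, te

tr, cv, te = chronological_split(7000)
assert tr.max() < cv.min() and cv.max() < te.min()  # strictly chronological
```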

We review the machine learning paradigm applied to our use case. We define a family of functions, characterized by different parameterizations, which map a past time window to a value that seeks to approximate the target future $q$ day VIX return. Using the input and target samples represented by the index set $tr$, we select a choice of function from the family, or equivalently choose parameters, seeking good predictive power on the $tr$ future target values while maintaining generalizability. We then fix this function and evaluate its predictive power on the $cv$ input-target samples. Good performance of the selected function in predicting our target value on the $cv$ set gives a theoretically accountable indication that the function captures a predictive relation in the quantities studied, as the function was chosen without using future information from the $cv$ time period.¹ This process is repeated to select the best family of functions to choose from and the best learning algorithm for attaining the optimal function within the family, according to prediction performance on the $cv$ set; then an optimal function is chosen using all past and future data in $tr$ combined with $cv$, and tested on the $te$ set to determine a final measure of future predictive potential.

¹Note that if the $cv$ time indices consecutively follow those in $tr$, then the future data of the last few $tr$ time windows is also future data for the first few $cv$ time windows, so the parameter learning process would leak some future information from the $cv$ set. Therefore, some time windows bordering the adjacent split sets are dropped.

3.2 Machine Learning Concepts

3.2.1 Feature Engineering

A major part of most successful machine learning pipelines is feature engineering. Features are transformations of the input values into a space that is more easily related to the target space than the original inputs: we map $S$ to an intermediary space of features $X$ which is then more easily related to $T$. Famous deep learning algorithms built on multi-layer neural network models can learn features without manual human choice, given a very large amount (on the order of at least millions) of training data. In our case, I continue to make creative judgements on what features to use based on seeing previous results on cross validation.

3.2.2 Ensemble Learning

Ensemble learning leverages the power of multiple learning algorithms, used in possibly distinct contexts, in order to more effectively capture useful patterns in the input. Creating a majority voting committee of base models is a simple way of aggregating multiple decision algorithms. However, it is not a simple matter to decide which base models should vote towards the committee decision for a better combined model. I do not know any standard approaches to this, so I try a few methods explained in the results section. Finally, with zero prior intuition about which of several candidate models might work best on a problem, one solution is the so-called bucket of models approach: train all the candidate models on a training set and then pick the one that attains the best score on the cross validation set to solve the test set problem. Sketches of both ideas follow.
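As a minimal sketch, assuming scikit-learn style models (fit/predict on {0, 1} labels); the helper names are hypothetical:

```python
import numpy as np

def majority_vote(models, X):
    """Majority-vote committee over binary {0, 1} predictions. Listing a
    model twice gives it two votes, as done for even committees later."""
    votes = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)

def bucket_of_models(models, X_tr, y_tr, X_cv, y_cv):
    """Bucket of models: train every candidate, keep the best cv scorer."""
    for m in models:
        m.fit(X_tr, y_tr)
    accs = [np.mean(m.predict(X_cv) == y_cv) for m in models]
    return models[int(np.argmax(accs))]
```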

3.3 Machine Learning Primitives

3.3.1 Decision Trees

Decision trees are conceptually very simple. A single branch decision tree might look like "Is the 5th component of $s$ greater than 0.42? If so, output 0. Otherwise, output 1." More complicated decision trees have more branches and greater branch depth. A gradient boosted tree model has multiple trees, and the resulting prediction is a linear combination of the value outputs of the individual trees. The term "gradient boosting" refers to the model learning by iteratively adding new trees chosen to descend along the gradient of the residual error left by the predictions of the ensemble of preceding trees. An extremely popular and industry tested implementation of gradient boosted trees is XGBoost, which is available as a library [1]. We choose XGBoost as our main workhorse.
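As a minimal usage sketch (the feature matrix and labels here are random placeholders, and the hyperparameter values are just examples from the grids of section 3.4.3):

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 465))      # e.g. an I(k=1, n=30) feature bank
y = (rng.random(1000) > 0.5).astype(int)  # placeholder -/+ return labels

model = XGBClassifier(
    booster="gbtree",
    n_estimators=100,       # number of trees
    max_depth=3,            # maximum depth of each tree
    learning_rate=0.1,
    subsample=0.75,         # fraction of training rows sampled per tree
    colsample_bytree=0.5,   # fraction of feature columns sampled per tree
    gamma=0.0,
)
model.fit(X, y)
print(model.predict(X[:5]))               # predicted classes in {0, 1}
```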

3.4 Specified Model

3.4.1 Input Space and Output Space

Given that we only have technical indicators over the past $p$ days, the quintessential features relevant for prediction must be found in how high or low our numbers are hovering compared to their historical tendency, and in how these numbers move up and down within the past window. All of this information can be expressed by the daily log returns data $r^{(k)}_{i,[-p,0)}$ limited to the past $p$ days, in addition to the original past $p$ days of price data. Arbitrary linear combinations of these log returns can represent a wide class of shape descriptors for the movement of the time series in the past window, and certain linear combinations of the log prices can express comparison information between different days that is not available in the log returns alone. We hypothesize that machine learning algorithms can learn a variation of the appropriate linear transformation of the combined information vector

$$s_i = \bigoplus_k \left( r^{(k)}_{i,[-p,0)} \oplus \ell^{(k)}_{i,[-p,0)} \right)$$

and map it to the target.


The time windows model we have does not intrinsically factor the day of the week into consideration. Through experimentation, I have found that separately training 5 different models, one for each weekday Monday to Friday, gives richer results than training the same model on all days. We consider a completely distinct model for each target prediction weekday and for each choice of 1 to 5 days earlier as the last day of past input data.

For the target value, I have tried to predict the exact return $q$ days into the future, but it appears that my method cannot make useful predictions of it. Instead we use the sign of the return $q$ days in the future, giving a 2 class ($-$ or $+$) classification problem. We test our algorithm's performance for each future window from $q = 1$ to $q = 6$.

3.4.2 Combinations Features Bank

The primary bottleneck in feature creation is arriving at features that can, with little further effort, effectively distinguish the target classes. The approach we take here follows that principle and uses 3 different varieties of all inclusive combinatorial patterns of linear, all integer combinations on $n$ data points. We describe what the length $n$ arrays of coefficients included in each of these types of combinations consist of.

type C A specific collection of combinations of this variety is parameterized by the variables $h$ and $m$. These combinations include only arrays $a_{[0,n)}$ such that $\sum_{i=0}^{n-1} a_i = 0$. Here $h$ indicates the maximum absolute value an entry $a_i$ can take, and $m$ indicates the maximum number of entries in $a_{[0,n)}$ that are allowed to be nonzero. Exactly one of $a$ and its negation $-a$ is chosen to be kept in the collection.

type I A specific collection of combinations of this variety is parameterized by the variable $k$. To begin, consider the case $k = 1$. Included are all arrays $a_{[0,n)}$ with entries 0, 1 only, where all the 1s are consecutively placed; in this specific case of $k = 1$ only, the trivial array where all values are 0 is not included. For $k > 1$, the entries consist of adding or subtracting, decided independently, $k$ arrays of the $k = 1$ type. For each of these arrays $a$ there would be an array whose values are just the negation, i.e. $-a$; we have a method to choose exactly one of $a$ and $-a$ to keep. Also, we throw away those arrays $a$ whose values have a nontrivial common factor.


type A A specific collection of combinations of this variety is parameterized by the variables $h$ and $m$. These combinations include all arrays $a$ where the absolute value of each entry does not exceed $h$ and where at most $m$ entries are nonzero. (Sketches of enumerating two of these banks are given below.)
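As a minimal sketch, here is one way the type I ($k = 1$) and the full type A ($h = 1$, $m = n$) banks could be enumerated; the counts agree with the table below. Keeping the array whose first nonzero entry is $+1$ is an assumed concrete rule for "keep exactly one of $a$ and $-a$".

```python
import itertools
import numpy as np

def type_I_k1(n):
    """Type I, k = 1: all 0/1 arrays whose 1s form one consecutive run."""
    rows = []
    for j0 in range(n):
        for j1 in range(j0 + 1, n + 1):
            a = np.zeros(n, dtype=int)
            a[j0:j1] = 1
            rows.append(a)
    return np.array(rows)       # binom(n+1, 2) arrays

def type_A_full(n):
    """Type A, h = 1, m = n: all {-1, 0, 1} arrays, one of each {a, -a}."""
    rows = [a for a in itertools.product((-1, 0, 1), repeat=n)
            if next((v for v in a if v != 0), 0) == 1]
    return np.array(rows)       # (3**n - 1) / 2 arrays

assert len(type_I_k1(30)) == 465      # matches the table entry below
assert len(type_A_full(10)) == 29524  # matches the table entry below
```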

With the intuition of aiming to capture the data of price movement across past days, $\ell^{(k)}_{i,[-p,0)}$ is fitted with type C combinations and $r^{(k)}_{i,[-p,0)}$ is fitted with type I and type A combinations, separately for each $k$. Due to computational limitations, we may want to limit $n$ to be smaller than $p$, so an example set of features we might use is written explicitly as

$$x_i = \bigoplus_{k=0}^{d-1} \bigoplus_{a \in C(h=1,\, m=4)} a \cdot \ell^{(k)}_{i,[-n,0)}$$

or, if we think of $C(h = 1, m = 4)$ as a matrix of coefficient rows instead of a set of coefficient arrays,

$$x_i = \bigoplus_{k=0}^{d-1} C(h = 1, m = 4)\, \ell^{(k)}_{i,[-n,0)}.$$

So we have as options for the $i$th sample feature vector $x_i \in X$ any vector whose components are the values that result when the linear combinations from a specific collection of each type are applied to the data series that that type is used with.
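In code, this is just a matrix product per series followed by concatenation; a minimal sketch with a stand-in bank and placeholder windows:

```python
import numpy as np

def features(M, windows):
    """x_i: concatenation over the d series of M @ (that series' window)."""
    return np.concatenate([M @ w for w in windows])

n = 30
M = np.eye(n, dtype=int)        # stand-in bank: A(h=1, m=1), the raw values
rng = np.random.default_rng(0)
windows = [rng.standard_normal(n) for _ in range(2)]  # placeholder windows
x_i = features(M, windows)      # length d * (number of combinations)
```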

Combinations specification      Number of combinations   Short description/comments
A(h = 1, m = 1)                 $n$                      the $n$ values in $\ell^{(k)}_{i,[-n,0)}$ as is
I(k = 1) = C(h = 1, m = 2)      $\binom{n+1}{2}$         all $\ell^{(k)}_{i,j_1} - \ell^{(k)}_{i,j_0}$ where $j_0 < j_1$
A(h = 1, m = n)                 $\frac{1}{2}(3^n - 1)$   all $\{-1, 0, 1\}$ combinations of $r^{(k)}_{i,[-n,0)}$ (except the trivial 0 combination)
I(k = 1, n = 30)                465
C(h = 1, m = 4, n = 15)         4200                     $\subset I(k = 2, n = 15)$
I(k = 2, n = 15)                7260
A(h = 1, m = n, n = 10)         29524

$x_i$ is mapped from a space of $n$ real numbers per series, and some choices of combination sets make the length of $x_i$ significantly exceed $n$. While that fact alone does not imply that most of the dimensions of $x_i$ are redundant, the relative simplicity with which we constructed $x_i$ from $s_i$ likely indicates so. Moreover, the number of training samples for each weekday is only on the order of $10^3 = 1000$, which the number of constructed features can easily exceed. XGBoost excels at prediction problems with many redundant feature dimensions like this, where it can quickly and effectively pinpoint the most relevant features to use.

3.4.3 Hyperparameter Search

XGBoost has a number of manually adjustable settings that can greatly affect its training and prediction effectiveness, so we need to repeatedly run XGBoost over a wide range of settings. These settings, or hyperparameters, can be thought of as coming from a subspace of $A$, the parameter space, in that each instance of these hyperparameters defines the function $m_a$ by specifying how $m_a$ is to be constructed in the training procedure. Our approach to searching this hyperparameter space is a randomized grid search. For each hyperparameter we define a list of distributions to sample from. From these lists we form a grid, so that each grid point assigns one distribution to each hyperparameter; we then sample from those distributions to construct one search sample of the hyperparameter space. We now describe the hyperparameters; a sketch of the search loop follows the list.

• booster This string parameter specifies the type of learning algorithm to be used in training. "gbtree" is the default original XGBoost algorithm and "dart" is a dropout analogue for XGBoost, where trees can be dropped randomly. We use the parameters rate_drop=0.1 and skip_drop=0.5 for the tree dropping behaviour.

• n_estimators This is an integer that specifies how many trees will be fitted in the model. The values we use are 10, 30, 100, 200.

• max_depth This is an integer that specifies the maximum depth of the trees built. We use 2, 3, 4, 5.

• learning_rate This floating point number is typically set at about 0.1. We consider 0.1 and 0.5.


• subsample This floating point number in $(0, 1]$ is the proportion of training examples kept at each learning step. Typically set between 0.5 and 0.9, it injects randomness and so helps prevent overfitting; the default is 1. We use values between 0.3 and 0.9 across our different hyperparameter search iterations.

• colsample_bytree Like subsample, but this is instead the proportion of feature dimensions kept at each learning step. We use values ranging from 0.003 to 0.7, depending on the number of dimensions in our input feature space. We also try the default value of 1 for this hyperparameter, as keeping all of the features might be more important here due to the differing semantics of each generated feature dimension.

• gamma This floating point number is a minimum bound on the training loss reduction needed to partition deeper in a tree. Increasing this value makes the model less likely to overfit. We try the default value of 0 and some positive values.

• (early stopping) XGBoost can be given the option to train with early stopping enabled, which makes it stop adding trees to the model when the score on a separate cross validation dataset stops improving after some specified number of rounds. We choose 12 rounds for this threshold.
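As a minimal sketch of the randomized grid search described above (the particular distributions chosen per axis are illustrative assumptions):

```python
import itertools
import random

# Each axis lists samplers: zero-argument callables drawing one value.
grid = {
    "booster":          [lambda: "gbtree", lambda: "dart"],
    "n_estimators":     [lambda: random.choice([10, 30, 100, 200])],
    "max_depth":        [lambda: random.choice([2, 3, 4, 5])],
    "learning_rate":    [lambda: random.choice([0.1, 0.5])],
    "subsample":        [lambda: random.uniform(0.3, 0.9)],
    "colsample_bytree": [lambda: random.uniform(0.003, 0.7), lambda: 1.0],
    "gamma":            [lambda: 0.0, lambda: random.uniform(0.1, 1.0)],
}

# Each grid point fixes one sampler per hyperparameter; sampling the point
# yields one concrete setting to train and cross validate.
names, axes = zip(*grid.items())
for point in itertools.product(*axes):
    setting = {name: draw() for name, draw in zip(names, point)}
    # ... train an XGBoost model with `setting`, score it on the cv sets ...
```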

3.4.4 Cross Validation Techniques

Limitations on the amount of data available, i.e. of samples of $(s, t)$ from $D$, make it fruitful for us to reuse each data point in several roles. An example of this is k-fold cross validation. The entire training data is split into equal groups, about 5 to 10 groups in total. For each group, we train candidate models on the other groups combined as the training set and cross validate their performance on the selected group. We average the cross validation scores over the groups to determine overall model performance.

For our application, differing periods of time may have dramatically different behaviours in price movements. For that reason, I decided to use a 5-fold cross validation where the cross validation sets consist of 4 four year spans prior to the final two years of test data, with each newer span offset forward from the previous one by two years. The last cross validation set is actually only a two year span: the two years immediately preceding the final two years of test data.
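A minimal sketch of constructing these folds (using an approximate 252 trading days per year; the exact boundaries in the thesis may differ):

```python
import numpy as np

def rolling_cv_folds(index, test_days=2 * 252, year=252):
    """Four 4-year cv spans stepped forward by 2 years, then the 2-year span
    just before the test period: 18 fold-years over 10 unique years."""
    te_start = len(index) - test_days
    folds = []
    for k in range(4):
        lo = te_start - (10 - 2 * k) * year   # 10, 8, 6, 4 years before test
        folds.append(index[lo : lo + 4 * year])
    folds.append(index[te_start - 2 * year : te_start])
    return folds

folds = rolling_cv_folds(np.arange(7000))
assert len(folds) == 5
```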


4 Results

4.1 Classification Evaluation

Suppose the target space $T$ is a finite set. For each sample $(s, t) \sim D$, we can say $s$ belongs to the class $t$. For the cross validation evaluation of a prediction algorithm, each sample $(s, t) \in \{(s,t)\}_{\mathrm{cv}}$ gets a predicted class $\tilde t = m(s)$. The target space in our results mostly consists of two roughly equally likely classes, negative return or positive return (labelled $-$ or $+$).

4.1.1 Confusion Matrix

We can readily see how closely $\{s \mapsto \tilde t\}_{\mathrm{cv}}$ follows $\{s \mapsto t\}_{\mathrm{cv}}$ by forming what is known as the confusion matrix. The larger the numbers on the diagonal of the matrix relative to the ones off the diagonal, the better the predictions.

Figure 1: Confusion matrix (rows: true class; columns: predicted class)

                Predicted class
True class      0         1         Total
0               (count)   (count)   (count)
1               (count)   (count)   (count)
Total           (count)   (count)   (count)

e.g.

t\t̃   0   1   All
0      3   1    4
1      2   6    8
All    5   7   12

4.1.2 Precision and Recall

In binary classification, that is, classification with two classes, a standard method of visualizing the effectiveness of different instances of prediction functions is a precision-recall curve (or the closely related ROC curve). However, this visualization technique requires choosing one class as the "positive" case and the other as the "negative" case, presenting information about the two classes asymmetrically thereafter. For a classification problem where both classes are equals, both in proportion of occurrence and in theoretical meaning, we benefit instead from considering the precision and the recall of each of the two classes.


Precision, denoted $\mathrm{prc}_0$ and $\mathrm{prc}_1$ for classes 0 and 1, is defined as

$$\mathrm{prc}_0 = \frac{|\{\tilde t = 0 \wedge t = 0\}_{\mathrm{cv}}|}{|\{\tilde t = 0\}_{\mathrm{cv}}|}, \qquad \mathrm{prc}_1 = \frac{|\{\tilde t = 1 \wedge t = 1\}_{\mathrm{cv}}|}{|\{\tilde t = 1\}_{\mathrm{cv}}|}.$$

Recall, denoted $\mathrm{rec}_0$ and $\mathrm{rec}_1$ for classes 0 and 1, is defined as

$$\mathrm{rec}_0 = \frac{|\{\tilde t = 0 \wedge t = 0\}_{\mathrm{cv}}|}{|\{t = 0\}_{\mathrm{cv}}|}, \qquad \mathrm{rec}_1 = \frac{|\{\tilde t = 1 \wedge t = 1\}_{\mathrm{cv}}|}{|\{t = 1\}_{\mathrm{cv}}|}.$$

4.1.3 Accuracy

Accuracy, denoted $\mathrm{acc}$, is defined as

$$\mathrm{acc} = \frac{|\{(\tilde t = 0 \wedge t = 0) \vee (\tilde t = 1 \wedge t = 1)\}_{\mathrm{cv}}|}{|\{\text{all } t\}_{\mathrm{cv}}|}.$$

It is simply the ratio of correctly classified cross validation samples. For prediction problems with a class imbalance, accuracy by itself is not useful, since we could be 0.9 accurate on a problem where 0.9 of the cross validation samples are class 0 simply by predicting that everything is in class 0. However, for our problem of predicting the sign of the returns of a financial time series, accuracy is the most important measure of performance when considering the practical value of applying the prediction solution in trading.
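A minimal sketch computing all of these quantities, checked against the example matrix in Figure 1:

```python
import numpy as np

def binary_metrics(t, t_pred):
    """Confusion matrix plus the metrics of sections 4.1.1-4.1.3."""
    t, t_pred = np.asarray(t), np.asarray(t_pred)
    cm = np.array([[np.sum((t == a) & (t_pred == b)) for b in (0, 1)]
                   for a in (0, 1)])     # rows: true class, cols: predicted
    prc = [cm[c, c] / cm[:, c].sum() for c in (0, 1)]
    rec = [cm[c, c] / cm[c, :].sum() for c in (0, 1)]
    acc = np.trace(cm) / cm.sum()
    return cm, prc, rec, acc

# The example from Figure 1: cm = [[3, 1], [2, 6]], acc = 9/12.
t      = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
t_pred = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]
cm, prc, rec, acc = binary_metrics(t, t_pred)
assert acc == 0.75 and cm.tolist() == [[3, 1], [2, 6]]
```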

4.2 Sample Procedure

In gaining an idea of which directions to pursue to better tackle the project's prediction problem, I tried many different approaches and methods, not all of which had a formal structure to them. For the benefit of having a quantifiable procedure and results, I present one model conceived and designed in its completeness, after having explored the available data in an unstructured way for much time beforehand.


4.2.1 Best Committee of the Best XGBoost Models

The same procedure is carried out for each separate problem of the $q$ day future return with a particular starting weekday. We sort a collection of XGBoost models trained on our 5 sections of training data according to their average cross validation accuracies. There are a total of 10 unique years of cross validation data, and 18 years of simulated freshly encountered cross validation data. This is enough cross validation that the XGBoost hyperparameter search finds it hard to overfit to the cross validation sets, noticeable from the sometimes less than 50% accuracy scores on some of the validation samples. We then try to combine these models to better fit the span of cross validation sets. We form majority voting committees of sizes 3, 4, and 5 (one model is given two votes in a committee of 4) over some choice of all the XGBoost models, and choose the committee that performs best over all the cross validation sets. The limited choice is mainly for computational reasons, but limiting our search to only the best $n$ cross validated XGBoost models also decreases the risk of overfitting this combinatorial committee search to the cross validation slices. We choose the 10 best cross validated XGBoost models in the results section.
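A minimal sketch of this committee search, reusing majority_vote from the sketch in section 3.2.2 (which of the 4 members receives the duplicated vote is an assumed choice, not specified in the thesis):

```python
import itertools
import numpy as np

def best_committee(top_models, cv_sets, sizes=(3, 4, 5)):
    """Pick the committee of top models with the best mean cv accuracy."""
    best, best_acc = None, -1.0
    for size in sizes:
        for combo in itertools.combinations(top_models, size):
            members = list(combo)
            if size % 2 == 0:               # even committee: duplicate one
                members.append(members[0])  # member so the votes are odd
            acc = np.mean([np.mean(majority_vote(members, X) == y)
                           for X, y in cv_sets])
            if acc > best_acc:
                best, best_acc = members, acc
    return best, best_acc
```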

4.2.2 XGBoost Specifications

The same collection of 608 XGBoost models for each separate problem, specified by different input features and hyperparameters, is selected with care to have the best hope of capturing the correct patterns for prediction, based on prior experience. We use 4 particular feature sets: $A(h=1, m=1, n=30)$, $I(k=1, n=30)$, $I(k=2, n=15)$, $A(h=1, m=10, n=10)$. With the parameters listed in the order booster, n_estimators, max_depth, learning_rate, subsample, colsample_bytree, gamma, the XGBoost parameter grid for each feature set is, respectively:

1. ["gbtree", "dart"] × [30, 100, 200, (early stop), (early stop)] × [3, 4, 5, 6] × [0.1, 0.5] × [0.75] × [0.6, 0.84] × [0]

2. ["gbtree", "dart"] × [30, 100, 200, (early stop), (early stop)] × [3, 4, 5, 6] × [0.1, 0.5] × [0.75] × [0.5, 0.7] × [0]

3. ["gbtree", "dart"] × [30, 100, (early stop)] × [3, 4, 5, 6] × [0.1, 0.5] × [0.75] × [0.1, 0.3, 0.6] × [0]

4. ["gbtree", "dart"] × [[10]×[4], [10]×[5], [10]×[6], [30]×[3], [30]×[3], [30]×[4], [30]×[5], [100]×[2], [100]×[3], [(early stop)]×[2], [(early stop)]×[3], [(early stop)]×[4], [(early stop)]×[5]] × [0.1, 0.5] × [0.75] × [0.05, 0.15, 0.5] × [0]

(in the fourth grid, n_estimators and max_depth are paired). Each model is given a randomly and distinctly chosen random number generator starting seed.


Figure 2: Committee predictions for one day returns (0 with −). Each row gives one prediction window's confusion counts (true → predicted) followed by metrics in %.

Window             +→+   +→0−   0−→+   0−→0−   acc   prc+   prc−   rec+   rec−
last Fri to Mon     38    12     30     18      57    56     60     76     38
Mon to Tue          26    27     13     32      59    67     54     49     71
Tue to Wed           9    41     12     36      46    43     47     18     75
Wed to Thu          17    27     23     31      49    42     53     39     57
Thu to Fri           3    29      2     64      68    60     69      9     97
all 1 day returns   93   136     80    181      56    54     57     41     69

4.3 Sample Test Results

The test set results are collected in Figures 2 through 8.

4.4 Interpretation

When a particular subproblem (a particular choice of starting and ending weekday for the prediction period) shows a testing accuracy of less than 55%, it essentially means that the best cross validation performing committee model has failed to generalize appropriately from the data before the test set. We tried quite a number of hyperparameter states and extracted a wide range of input features, which, together with the consistently 60%+ prediction accuracies on some of the other subproblems, makes it seem more likely that in these cases the patterns in the previous years simply did not carry over to the most recent 2 years.

The relatively consistent, strong predictability of Friday's price direction from Thursday makes sense, as at the end of the week traders want to close their positions and so may act more predictably. However, the prediction skews significantly towards the more likely direction, and the low precision scores indicate that this does not necessarily give better discrimination of future behaviour. The growing consistency of good predictions with the time span of the future window shows that high frequency price signals are more difficult to algorithmically find than lower frequency signals.


Figure 3: Committee predictions for one day returns (0 with +).

Window             0+→0+   0+→−   −→0+   −→−   acc   prc+   prc−   rec+   rec−
last Fri to Mon      56      4     29      9    66    66     69     93     24
Mon to Tue           25     30     12     31    57    68     51     45     72
Tue to Wed           11     39     11     37    49    50     49     22     77
Wed to Thu           22     24     29     23    46    43     49     48     44
Thu to Fri           10     26      8     54    65    56     68     28     87
all 1 day returns   124    123     89    154    57    58     56     50     63

Figure 4: Committee predictions for two day returns (0 with +).

Window             0+→0+   0+→−   −→0+   −→−   acc   prc+   prc−   rec+   rec−
last Thu to Mon      26     12     42     18    45    38     60     68     30
last Fri to Tue      47     11     23     17    65    67     61     81     42
Mon to Wed           26     28     16     28    55    62     50     48     64
Tue to Thu           15     36     14     33    49    52     48     29     70
Wed to Fri            2     33     11     52    55    15     61      6     83
all 2 day returns   116    120    106    148    54    52     55     49     58


Figure 5: Committee predictions for three day returns (0 with +).

Window             0+→0+   0+→−   −→0+   −→−   acc   prc+   prc−   rec+   rec−
last Wed to Mon      24     18     32     24    49    43     57     57     43
last Thu to Tue      29     13     28     28    58    51     68     69     50
last Fri to Wed      42     12     20     24    67    68     67     78     55
Mon to Thu           26     25     17     30    57    60     55     51     64
Tue to Fri            6     30      5     57    64    55     66     17     92
all 3 day returns   127     98    102    163    59    55     62     56     62

Figure 6: Committee predictions for four day returns (0 with +).

Window             0+→0+   0+→−   −→0+   −→−   acc   prc+   prc−   rec+   rec−
last Tue to Mon      21     22     19     36    58    52     62     49     65
last Wed to Tue      31     14     32     21    53    49     60     69     40
last Thu to Wed      36     11     22     29    66    62     72     77     57
last Fri to Thu      43     12     20     23    67    68     66     78     53
Mon to Fri           13     27     11     47    61    54     64     32     81
all 4 day returns   144     86    104    156    61    58     64     63     60


Figure 7: Committee predictions for five day returns (0 with +).

Window             0+→0+   0+→−   −→0+   −→−   acc   prc+   prc−   rec+   rec−
last Mon to Mon      33     14     25     26    60    57     65     70     51
last Tue to Tue      27     20     18     33    61    60     62     57     65
last Wed to Wed      29     16     22     31    61    57     66     64     58
last Thu to Thu      32     14     19     33    66    63     70     70     63
last Fri to Fri      26     15     24     33    60    52     69     63     58
all 5 day returns   147     79    108    156    62    58     66     65     59

Figure 8: Committee predictions for six day returns (0 with +).

Window               0+→0+   0+→−   −→0+   −→−   acc   prc+   prc−   rec+   rec−
last last Fri to Mon   39     12     24     23    63    62     66     76     49
last Mon to Tue        30     22     17     29    60    64     57     58     63
last Tue to Wed        24     21     14     39    64    63     65     53     74
last Wed to Thu        28     18     21     31    60    57     63     61     60
last Thu to Fri        19     20     15     44    64    56     69     49     75
all 6 day returns     140     93     91    166    62    61     64     60     65


5 Discussion

5.1 Discovery Process

This section is an account of the process of arriving at the final model. To begin, the time series in our work did not have missing weekdays filled in, and the prediction problem was not separated by starting weekday. So for each $q = 1, \ldots, 5$, I sought to find how well I could predict the $q$ day return, or just its sign, regardless of what the current day is. Holidays and other days on which prices were not recorded did not count as days in the time series, and the data only went back to 2004.

I tried simpler models, multilayer perceptrons or XGBoost trained on just the daily log returns over a 30 day past time window for example, and they could not predict the future day's return, or even the sign of the return, better than chance. Slightly more hopeful results came from trying to predict the sign of the 3 day return, so I used this prediction task to search for models with potential.

I came up with the idea of using the $A(h=1, m=n)$ features to better capture the information hidden in the slopes of price movements. $n = 12$ was as high as was computationally feasible, and $n = 6$ led to faster training times, so I ran parameter searches for XGBoost using these features. Through the parameter searching, I gained an idea of which XGBoost learning hyperparameters mattered, and of the optimal value ranges for the ones that did. I also found that using the prices of both VIX and SPX gave better results than including additional features from SPX volume data or taking only one of the price time series. I was able to attain about 60% accuracy with fairly balanced precision and recall on $q = 3$, even attaining 59.5% accuracy on a test set of roughly the latest 2.5 years.

After not being able to improve the cross validation accuracy despite spending much time on hyperparameter searching, I thought about how differing weekdays might systematically affect the behaviour of price movements. I filled in missing values with the last available price and separated the prediction windows by their starting weekday. Immediately the cross validation scores jumped to about 70%. I did more parameter searching and tried majority vote committee ensembles of the XGBoost models, which also improved the cross validation accuracies significantly, passing 75%.

However, now came the problem of overfitting. It turned out that I had searched too many hyperparameters over too small a cross validation set, as the test scores were erratic and no better than chance. So I added more cross validation and, after some tuning, arrived at the model included here. I also tried another method of selecting the committee candidates: take the top $n$ XGBoost models, where $n$ is any odd number. This worked significantly worse on the final test set than the current committee selection method.

5.2 Limitations and Extensions

I note that I looked at the results of some of my previous models on the current test set multiple times before arriving at my current model. This should not significantly degrade the predictive power implied by the modeling process presented here, as I did not systematically mine the test set for patterns.

The histograms of the returns on successfully predicted days do not differ in any visible way from the histograms of the returns on unsuccessfully predicted days. Applying these results to trading will require overhauling the current cross validation model, to become more certain that the predictions are reliable, before they find use in backtesting for possible trading profits.

References

[1] Tianqi Chen and Carlos Guestrin. "XGBoost: A scalable tree boosting system". In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 785-794.

[2] Chicago Board Options Exchange. "The CBOE Volatility Index - VIX". In: White Paper (2009), pp. 1-23.