Statistical Models and Machine Learning Algorithms to Forecast Future Prices in the Stock Market
Ana Rita Silveira da Costa
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor(s): Prof. Nuno Cavaco Gomes Horta, Prof. Rui Fuentecilla Maia Ferreira Neves
Examination Committee
Chairperson: Prof. António Manuel Raminhos Cordeiro Grilo
Supervisor: Prof. Nuno Cavaco Gomes Horta
Member of the Committee: Prof. Alexandra Sofia Martins de Carvalho
June 2018
Declaration
I declare that this document is an original work of my own authorship and that it fulfills all the require-
ments of the Code of Conduct and Good Practices of the Universidade de Lisboa.
Acknowledgments
Firstly, I would like to thank my supervisor Professor Nuno Cavaco Gomes Horta for the support and
knowledge he gave me during the development of this thesis. I would also like to thank my family,
especially my parents, who have always supported me along the whole thesis process and gave me the
opportunity to study at such a good university. Finally, a very special acknowledgment to Joao Salvado,
Ines Gil, Ines Goncalves, Joao Villa de Brito and Vera Pedras, who helped me not only during the thesis
development but also throughout the whole degree.
Resumo
Os preços de ações podem ser interpretados como séries temporais que podem ser previstas, de forma
a melhorar os resultados para um investidor. Vários métodos encontram-se em desenvolvimento com
o objetivo de obter uma previsão mais precisa. A previsão de uma série temporal é um problema de
regressão, visto ser uma variável contínua que é prevista. A presente dissertação aplica um método
estatístico, ARIMA, e dois de machine learning, K-Nearest Neighbors (KNN) e Support Vector Regression
(SVR), com vista a prever o preço das ações. O presente trabalho apresenta previsões diárias,
semanais e mensais, fazendo uso de ações com diferentes características. Os três modelos estudados
são comparados em cada uma das situações referidas, considerando o erro das previsões, os retornos
de uma estratégia simples e ainda o risco e precisão da estratégia. Os dados utilizados para o período
de treino correspondem a 4 anos de uma ação com uma tendência clara e outra ação com tendência
lateral. O período de teste corresponde a 1 ano das mesmas ações. O melhor resultado foi obtido com
o ARIMA numa previsão mensal, alcançando retornos de 40% e uma precisão de 90.9%. Os algoritmos
KNN e SVR demonstraram ser mais precisos em ações de tendência lateral, sendo as soluções destes
superiores às soluções obtidas com o ARIMA. Ambas as abordagens de machine learning beneficiam
da introdução de um retreino durante o período de teste, tendo em alguns casos decrescido o erro em
10 vezes.
Palavras-chave: Séries Temporais, Análise Preditiva, Stock Market, ARIMA, K-Nearest Neighbors, Support Vector Regression
Abstract
Stock prices can be interpreted as time series that can be forecasted in order to improve the returns of
a trader. Several methods, drawing on statistics and artificial intelligence, are being developed to make
this prediction more accurate and reliable. Forecasting a time series is a regression problem, since the
variable being forecasted is continuous. This thesis applies a statistical method, ARIMA, and two
machine learning models, K-Nearest Neighbors and Support Vector Regression, to forecast future stock
prices. The presented work shows predictions over daily, weekly, and monthly horizons using stocks
with different characteristics. The three studied models are compared in each of these situations,
considering the error of the forecasted values, the returns of a strategy that relies on these predictions,
and the risk and accuracy of that strategy. The data sets used for the training period correspond to
4 years of data from a clear trend stock and a sideways stock, in order to cover data with different
characteristics. The test period corresponds to 1 year of the same stocks. The best result was obtained
by the ARIMA model in a monthly forecast, reaching returns of 40% and an accuracy of 90.9%. The
K-Nearest Neighbors and Support Vector Regression algorithms are more precise on a sideways stock,
where they are superior to the ARIMA solution. Both machine learning approaches benefit from the
introduction of retraining during the test period, in some cases decreasing the error by a factor of 10.
Keywords: Time Series, Forecast, Stock Market, ARIMA, K-Nearest Neighbors, Support Vector
Regression
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Document Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background and Related Work 5
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Concepts of Stock Market . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Time Series characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Modeling and Forecasting Time Series . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.4 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Works about modeling and forecasting a time series . . . . . . . . . . . . . . . . . 17
2.2.2 Works on forecast concerning Big Data . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Proposed Architecture 25
3.1 Architecture Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Architecture Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Data Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Train Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.3 Forecast and validation Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4 Results 37
4.1 ARIMA Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.1 Stock with a Clear Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.2 Sideways Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1.3 ARIMA performance conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 K-Nearest Neighbors Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 Clear Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.2 Sideways stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.3 KNN performance conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Support Vector Regression Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.1 Clear Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.2 Sideways Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.3 Support Vector Regression performance conclusions . . . . . . . . . . . . . . . . 51
4.4 ARIMA vs. KNN vs. SVR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.1 Clear Trend Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.2 Sideways Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.3 Overall comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.5 Studying the impact of retraining KNN and SVR . . . . . . . . . . . . . . . . . . . . . . . . 56
5 Conclusions and Future Work 59
5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Bibliography 61
List of Tables
2.1 Description of the most common metrics used in statistical works . . . . . . . . . . . . . . 15
2.2 Description of the most common metrics used in computational finance . . . . . . . . . . 16
2.3 Algorithm comparison based on the Related Work . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Summary of Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 ARIMA parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 SVR parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 KNN parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1 ARIMA data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 ARIMA results for a clear trend stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 ARIMA results for a sideways stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 ARIMA Performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 K-Nearest Neighbors results for a clear trend stock . . . . . . . . . . . . . . . . . . . . . . 46
4.6 K-Nearest Neighbors results for a sideways stock . . . . . . . . . . . . . . . . . . . . . . . 47
4.7 KNN performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.8 KNN MAPE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.9 Support Vector Regression results for a clear trend stock . . . . . . . . . . . . . . . . . . 50
4.10 Support Vector Regression results for a sideways stock . . . . . . . . . . . . . . . . . . . 50
4.11 Support Vector Regression performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.12 ARIMA vs. KNN vs. SVR in a Clear Trend Stock. . . . . . . . . . . . . . . . . . . . . . . . 52
4.13 ARIMA vs. KNN vs. SVR in a Sideways Stock. . . . . . . . . . . . . . . . . . . . . . . . . 54
4.14 Best results for each stock. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
List of Figures
1.1 Problem to be solved . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Example of a time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Example of PACF and ACF functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 SVM classification approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 ε-insensitive loss function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Representation of the problem architecture (adapted from [19]) . . . . . . . . . . . . . . . 21
3.1 Proposed architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Pseudo-code for the transformation of a .csv file into a time series . . . . . . . . . . . . . 27
3.3 Overfitting and Underfitting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Data separation into Training and Test sets . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5 Pseudo-code for the ARIMA implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 Transformation of a time series into a supervised learning format. . . . . . . . . . . . . . . 31
3.7 Cross-Validation with K=3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.8 Pseudo-code for feature selection and hyper-parameters tuning . . . . . . . . . . . . . . 32
3.9 MAE calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.10 ROI calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.11 Sharpe Ratio calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.12 Accuracy calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.13 Evaluation metrics calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 VRSN Stock Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 BEN Stock Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Results of Dickey-Fuller Test for the original series . . . . . . . . . . . . . . . . . . . . . . 40
4.4 Results of Dickey-Fuller Test for the stationary series . . . . . . . . . . . . . . . . . . . . . 40
4.5 Results of Dickey-Fuller Test of the original series . . . . . . . . . . . . . . . . . . . . . . . 42
4.6 Results of Dickey-Fuller Test of stationary series . . . . . . . . . . . . . . . . . . . . . . . 43
4.7 Comparison of the three algorithms in a clear trend stock . . . . . . . . . . . . . . . . . . 53
4.8 Comparison of the three algorithms in a sideways stock . . . . . . . . . . . . . . . . . . . 54
4.9 KNN with Retraining for a Clear Trend Stock . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.10 KNN with Retraining for a Sideways Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.11 SVR with Retraining for a Clear Trend Stock . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.12 SVR with Retraining for a Sideways Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 1
Introduction
This chapter describes the motivation behind this work and the problem to be solved. After understand-
ing the context of the problem, the objectives of this thesis are enumerated as well as a few contributions
resulting from all the conducted research. At the end of the chapter, the document structure is described.
1.1 Motivation
The stock market refers to the collection of markets and exchanges where the trading of securities takes
place, and it is considered one of the most vital components of a free-market economy. It is known that
the first stock exchange happened in 1531 in Belgium; even though the concept of a “stock” has changed
over time, in the beginning it was similar to a financier partnership that produced income the way stocks
do. The London stock exchange officially started in 1773, and 19 years later came the first New York
Stock Exchange. Stock market news is everywhere: newspapers, TV news, and entire websites dedicated
to the matter. The reason why the stock market is so important is that it allows companies to raise money
by offering part of their equity, letting investors participate in their financial achievements. The stock
market also serves as an economic barometer, since share prices rise and fall depending largely on
economic factors. For example, share prices tend to increase when the economy shows signs of growth
and, on the other hand, tend to fall sharply, sometimes leading to a stock market crash, during an
economic recession, depression, or financial crisis. A good knowledge of the stock indexes serves as
a reference to the general trend in the economy, influencing decisions from the average family to the
wealthiest executive.
Going back to the fact that the stock market gives small investors the opportunity to participate in a
company’s financial achievements, the idea of owning shares of a big company is very appealing, leading
many people to invest in this market. Naturally, an investor wants to own shares of a wealthy company
with expectations of future returns, and not of a company that will lose value in the future. This leads
to the need to ponder which stock one should invest in. To solve this impasse, the concept of predictive
analysis, also known as forecasting, started to appear in the financial field. To “forecast” is to predict or
estimate a future event or trend based on past and present information. In this case, the future event
is the future value of a share, used to decide whether or not to buy it. The idea of learning from the
past in order to predict the future has gained popularity in recent years, and many techniques concerning
this topic are now being used and tested to make good predictions that help those involved in this market.
There are statistical models for this predictive task, and they have started to give very good results.
Statistics deals with the collection, classification, analysis, and interpretation of numerical data, and
provides tools for forecasting through statistical models. Early statistical models were almost all from
the class of linear models, but the behavior of the data, especially financial data, created a particular
interest in nonlinear models. Also because of the non-linearity of the data, artificial intelligence methods
gained a lot of popularity in the forecasting field. The term “Machine Learning” is now widely used in
financial computation, although the origins of machine learning date back to the 1950s. In 1952, Arthur
Samuel wrote the first computer learning program, a checkers game in which the IBM computer improved
the more it played, studying which moves made up winning strategies and using them in its program to
win. It was one of the first times that a kind of “artificial intelligence” was created. In 1957, Frank
Rosenblatt designed the first neural network, the beginning of one of the most powerful machine learning
algorithms used nowadays. During the 1990s, machine learning shifted from a knowledge-driven to a
data-driven approach. Programs were created for computers to analyze large amounts of data and draw
conclusions — or “learn” — from the results. Today we can even talk about “deep learning”, the ability
to see and distinguish objects and text in images and videos [1]. Computers’ abilities to see, understand,
and interact with the world around them are growing at a remarkable rate, and many traders are using
these abilities to forecast the stock market.
This work is thus motivated by the challenge of conducting predictive analysis in the stock market
using statistical and machine learning algorithms to improve the returns of a trader.
1.2 Objectives
The main purpose of this work is to introduce the use of forecasting tools in the financial computation
field, implementing one statistical algorithm and two machine learning algorithms to predict future prices
of a share in the stock market. After the implementation, the results of each model will be evaluated
with different metrics and compared based on that evaluation. The tests will be conducted with different
data volumes and for different time frames. At the end of this work, it will be possible to see how these
models behave and to choose the best of them for different situations. Briefly, the main objectives are
enumerated below and illustrated in Figure 1.1:
1. Study time series and their role in the stock market;
2. Introduce the use of forecast tools in financial computation;
3. Use one statistical algorithm to forecast future prices of a share in the stock market;
4. Use two machine learning algorithms to forecast future prices of a share in the stock market;
5. Compare the behavior of the three different models for a daily, weekly and monthly forecast.
Figure 1.1: Problem to be solved and questions related to each one of the main topics.
This work aims to contribute to the field of statistical and machine learning algorithms for
forecasting financial data, and the main contributions of this research are the following:
1. Give an idea of how to evaluate a forecasting algorithm in financial computation;
2. Show the behavior of different algorithms when forecasting in the stock market;
3. Serve as a base for a future Big Data platform integration;
4. Serve as a base for future forecasting tests in financial computation.
1.3 Document Structure
This document contains five chapters. Chapter 1 introduces the problem being solved in this work
and defines the objectives. Chapter 2 covers the background and related work: its first part introduces
theoretical background on the context of the problem (stock markets) and the algorithms used along the
work; its second part presents works related to forecasting, most of them in the stock market context.
Chapter 3 contains a description of the proposed architecture to solve the problem of forecasting in the
stock market and the logic behind its step-by-step implementation. Chapter 4 contains case studies
serving as an evaluation context for each of the implemented algorithms. It is also in this chapter that
the algorithms are evaluated (for each case study) and compared to each other. The last chapter,
Chapter 5, presents conclusions about the conducted work and some thoughts on future work that can
be done in this field of research.
Chapter 2
Background and Related Work
2.1 Background
To provide a better understanding of the problem, this section describes some important terms,
techniques, and technologies. Following the problem definition, there are three main topics: the stock
market data as a starting point, the forecasting techniques that can be used with this data, and finally
the discussion of the results of this application. The questions enumerated in the objectives, Section
1.2, Figure 1.1, are answered in this section.
2.1.1 Concepts of Stock Market
The stock market is all about companies. Companies have assets, an asset being something that
has value and provides some type of future benefit. In a company context, assets can be the sum of
cash, buildings, inventories, copyrights, etc. On the other hand, companies have liabilities, the amount of
money they owe to some entity. The difference between the assets and the liabilities, in other words,
what is left after paying the liabilities, is what is called “owner’s equity”. When someone buys a share,
he or she becomes a partial owner of the company, more specifically, a part-owner of the owner’s equity.
For future clarification, the difference between the terms “share” and “stock” is that a share refers
to a specific company, while stocks may refer to one or more companies. Supposing a company has
a certain number of shares, the value of each share is the value of the owner’s equity divided by the
number of shares, and it is this share that is sold on the stock market. To better understand this
scenario, an example is given below. Company X has $30m of assets, $22m of liabilities, $8m of owner’s
equity and 2m shares. The value of each share is the value of the owner’s equity divided by the number
of shares, so it is $4. Very often this value does not correspond to the selling price of the same share
in the market: sometimes the market presents a higher value, other times a smaller one, because of
speculation among other reasons. The “market capitalization” is what the market thinks the equity is
worth, reflecting it in share prices.
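The arithmetic of the Company X example can be sketched in a few lines of Python; the figures are the illustrative ones from the text, not real company data:

```python
# Illustrative figures from the Company X example above.
assets = 30_000_000       # total assets ($)
liabilities = 22_000_000  # total liabilities ($)
shares_outstanding = 2_000_000

# Owner's equity is what remains after paying the liabilities.
owners_equity = assets - liabilities              # $8m
book_value_per_share = owners_equity / shares_outstanding

print(book_value_per_share)  # 4.0 (dollars per share)
```

As the text notes, the market price will usually deviate from this $4 book value because of speculation and other factors.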
Imagine that, for some reason, a trader can make an informed guess about what the price of a specific
share will be the day after. The question here is: how can a trader make money with this
information? Suppose the trader has strong reasons to believe IBM shares will increase tomorrow,
and with that information he or she decides to buy one or more IBM shares. If the prediction is right, on
the day after the trader owns a share that has more value than when it was bought. This action is called
a “long position”, and it is commonly addressed as “going long”. When owning these shares, if the trader
has strong reasons to believe the price will start to decrease, he or she can sell the shares and close the
long position with a positive profit, since the share was sold for a higher price than it was bought.
If instead the trader has strong reasons to believe that IBM shares will fall the day after, for
example from $100 to $50, and the trader does not own any IBM shares, he or she can borrow a share
from the broker and sell it on the market for its current value, $100. After this step, the trader holds $100
in cash and owes one share to the broker, since the sold share was borrowed. If the share price drops
to $50 the day after, the trader can buy the share back and return it to the broker, ending the trade with
a profit of $50. This is what is called “shorting a stock”.
In summary, in a long position the main goal is to buy a share at a low price and sell it at a higher
price, and in a short position the goal is to sell high and buy low.
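The two position types described above can be captured by a small sketch; the function names and prices are illustrative only, not part of any trading library:

```python
def long_profit(buy_price: float, sell_price: float, shares: int = 1) -> float:
    """Profit of a long position: buy low, later sell higher."""
    return (sell_price - buy_price) * shares

def short_profit(sell_price: float, buyback_price: float, shares: int = 1) -> float:
    """Profit of a short position: sell a borrowed share high, buy it back lower."""
    return (sell_price - buyback_price) * shares

# The IBM scenario from the text: shorting at $100 and buying back at $50.
print(short_profit(sell_price=100, buyback_price=50))  # 50
# A long position bought at $100 and sold at $150 yields the same profit.
print(long_profit(buy_price=100, sell_price=150))      # 50
```

Both formulas are symmetric: a correct directional guess earns the price difference, and a wrong one loses it.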
These strong guesses that traders have concerning future share values are the basis of trading strategies.
There are two main approaches that influence trading strategies nowadays: fundamental and technical
analysis. Fundamental analysis focuses on the fundamentals of the company or industry, i.e. data such
as sales and debt level, which of course are affected by the macroeconomic environment. It tries to go
to the facts and numbers of each company. This kind of research uses economic reports, internal
documents, and even public news. Technical analysis is a method of analyzing the statistics generated by
market activity to evaluate securities. It relies on three hypotheses: 1) the market discounts everything, 2)
price moves in trends, and 3) history tends to repeat itself [2]. Technical indicators are operations based
on the price and volume of a security that measure money flow, trends, volatility, and momentum.
Stock market price sequences are available on multiple web platforms like Yahoo, Google Finance,
or OANDA, and there is an enormous quantity of data available. This data can be interpreted as time
series, and it is important to understand some specificities of time series to know how to use them for
data analysis and forecasting.
2.1.2 Time Series characteristics
A univariate time series is a sequence of measurements of the same variable collected over time, where
ordering matters due to the dependency on the past. Figure 2.1 shows an example of a time series.
Figure 2.1: Example of a time series.
Time series can have multiple behaviors, such as trends and seasonal periods, and they can even show
a random walk behavior. The existence of a trend means that, on average, the measured variable tends
to increase or decrease. One example is the number of people using cellphones measured over the last
10 years. On the other hand, seasonality is a regular pattern related to calendar seasons and can be
observed, for example, in a time series representing the percentage of rain over the last 5 years. Some
measurements seem to have a random walk behavior, almost like white noise. Most series can be
decomposed in order to identify whether they have some trend or seasonality that is not obvious just by
looking [3].
Identifying these characteristics is very important when considering statistical analysis, since stationary
time series are easier to work with. A stationary time series is one with constant mean and variance.
With this definition, it is possible to conclude that an uptrending time series is not stationary.
Modeling and analyzing a time series is nothing more than finding a mathematical model that can
describe the time series values over time. This analysis can help explain how the past affects the
present and future values, forecast future values, and serve as a control standard.
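As a toy illustration of the stationarity idea (constant mean and variance), the crude check below compares the first and second halves of a series. It is only a plain-Python heuristic sketch, not the formal Dickey-Fuller test applied later in this work:

```python
import random

def split_mean_var(series):
    """Crude stationarity heuristic: compare the mean and variance of the
    first and second halves of a series. A stationary series should give
    similar values for both halves; a trending series will not."""
    half = len(series) // 2
    def stats(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / len(xs)
        return m, v
    return stats(series[:half]), stats(series[half:])

random.seed(0)
# White noise: stationary (constant mean and variance).
noise = [random.gauss(0, 1) for _ in range(1000)]
# An uptrending series: the mean grows over time, so it is not stationary.
trend = [0.01 * t + random.gauss(0, 1) for t in range(1000)]

print(split_mean_var(noise))  # both halves: mean near 0, variance near 1
print(split_mean_var(trend))  # the second half's mean is clearly larger
```

For the trending series, the two half-means differ by roughly 5 (the drift accumulated over 500 steps), which is exactly why such a series must be differenced before ARIMA can be applied.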
2.1.3 Modeling and Forecasting Time Series
Forecasting is the process of making predictions about the future based on data from the past. Stock
market traders use forecasting to predict the evolution of stock prices and take advantage of it to decide
on a strategy, meaning either going long or short on a share. This can be done without any computer
program, with economists analyzing all sources of data and trying to find relations and patterns sufficient
to make assumptions about the future. This approach became impractical due to the amount of data that
exists, so many methodologies were developed to facilitate this process. A forecast is a prediction, so
it has a degree of risk and uncertainty attached to it. There are two main forecasting approaches
discussed in this work: the statistical approach and the machine learning approach. The three methods
described in this section each fall into one of these two categories, as enumerated below:
1. Statistical approach: ARIMA model.
2. Machine Learning approach: Support Vector Machines and K-Nearest Neighbors.
a) Statistical Approach
A time series can be modeled and forecasted with a statistical equation. The most common statistical
methods are GARCH (generalized autoregressive conditional heteroscedasticity) and ARIMA
(autoregressive integrated moving average). The first one is used mainly to forecast the volatility of
financial time series, i.e. the periodic standard deviation. ARIMA fits the data itself rather than the
volatility, and it is used to forecast the actual time series [3]. For the purposes of this work, only the
ARIMA model will be described in this section.
a.1) ARIMA
ARIMA is a combination of an autoregressive model (AR) and a moving average model (MA) with one
or more orders of differencing. ARIMA has three parameters, p, d, and q, and it is commonly written
as ARIMA(p,d,q).
Before explaining the autoregressive model, it is important to understand what an autoregression is.
An autoregression is nothing more than a regression of the variable against itself, and the expression
of an autoregressive model is shown in Equation (2.1),
y_t(AR) = \mu + \sum_{i=1}^{p} \gamma_i y_{t-i} + \varepsilon_t. \qquad (2.1)
In this equation, y_t(AR) is the variable to be predicted by the autoregressive model, µ is the average
of the changes between consecutive observations, γ_i are the coefficients of the lagged values, y_{t−i} are
the lagged values of y_t, p is the number of these lagged values, and ε_t is white noise. Multiple regression
uses a linear combination of predictors to forecast the variable of interest. Looking at the autoregressive
model equation, Equation (2.1), it is possible to observe that an autoregression is similar to a multiple
regression, but with lagged values of y_t as predictors. The number of lagged values used as predictors
is the value of p, one of the ARIMA parameters.
While an autoregressive model (AR) uses past values of the forecast variable in a regression, a
moving average (MA) model uses past forecast errors in a regression-like model. The expression of a
moving average model is presented in Equation (2.2),
y_t(MA) = \mu + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}. \qquad (2.2)
In Equation (2.2), µ is the average of the changes between consecutive observations, ε_t is white
noise, θ_i are the coefficients of the lagged forecast errors, and q is the number of these lagged errors.
Looking at this equation, it is possible to observe that each value of y_t can be thought of as a weighted
moving average of the past few forecast errors. The number of these past forecast errors, as referenced
before, is the parameter q of the ARIMA model.
The combination of differencing with these last two equations results in the ARIMA (p,d,q) expression
since ARIMA is an autoregressive integrated moving average model. The difference in a time series is
the series of changes from one period to the next. The ARIMA (p,d,q) expression, Equation (2.3), is
presented below where y′t represents the differenced series that may have been differenced more than
once:
y′_t(ARIMA) = µ + ∑_{i=1}^{p} γ_i y′_{t−i} + ε_t + ∑_{i=1}^{q} θ_i ε_{t−i}.  (2.3)
The number of times that the series was differenced is the value of the parameter d of ARIMA (p,d,q).
The reason why some series need one or more orders of differencing is that they are not stationary, and
ARIMA only works with stationary series. If the series becomes stationary after one order of differencing,
the value of the parameter d is 1. If the original series is already stationary, there is no need for
differencing. In conclusion, the three parameters are summarized below:
1. p is the number of autoregressive terms;
2. d is the number of orders of difference to turn a time series stationary;
3. q is the number of moving average terms.
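The differencing step behind the parameter d can be sketched as follows; this is a minimal pure-Python illustration on toy numbers, not the implementation used in this work:

```python
def difference(series, d=1):
    """Apply d orders of first-differencing (the 'I' in ARIMA)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

print(difference([1, 4, 9, 16], d=1))  # [3, 5, 7]
print(difference([1, 4, 9, 16], d=2))  # [2, 2]: a quadratic trend needs d = 2
```

Each order of differencing shortens the series by one observation and removes one order of polynomial trend.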
To forecast a time series with ARIMA, the values of p, d, and q should be calculated.
The first step when modeling a time series with ARIMA is to check the series' stationarity, i.e., whether
it has a constant mean and variance. A nonlinear transformation, for example a logarithmic transformation,
can convert the original series to a form where its local random variations have constant variance
over time. After this step, if the time series is still non-stationary, a first-difference transformation can
be applied until the series shows a constant mean. The number of orders of differencing needed to turn
the original series into a stationary one is the parameter d, and its value is determined in this phase.
Now that the series is stationary and the order of differencing, d, is known, there are still two parameters
to discover: the parameter p, which is the number of autoregressive terms, and the parameter q, the
number of moving average terms. To identify the number of AR and MA terms, a PACF (Partial
Autocorrelation Function) and an ACF (Autocorrelation Function) can be observed. Examples of ACF
and PACF plots are presented in Figure 2.2.
Figure 2.2: Example of PACF and ACF functions.
PACF and ACF are both measures of association between current and past series values. The ACF
presents the relation between y_t and its lagged values y_{t−k}, for different values of k. From the ACF
perspective, if yt is correlated with yt−1, then yt−1 and yt−2 are also correlated. This correlation may be
due to new information contained in yt−2 that could be used in forecasting the value of yt, or it can be
simply because they are both connected to yt−1. To solve this uncertainty, a PACF is conducted, since
PACF measure the relationship between yt and yt−k after removing the effects of lags 1, 2, 3, ..., k − 1.
Looking at Figure 2.2, the ACF shows a spike in the first lag and the PACF shows the same, so it is very
reasonable to believe that y_t is strongly correlated with y_{t−1}. Generally, the lag beyond which the PACF
cuts off is the indicated number of AR terms, and the lag beyond which the ACF cuts off is the indicated
number of MA terms.
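The sample autocorrelation at lag k, the quantity plotted in an ACF, can be computed as in the following sketch; the trending toy series is illustrative only:

```python
def acf(series, k):
    """Sample autocorrelation of a series at lag k."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n))
    return cov / var

trend = [1.0, 2.0, 3.0, 4.0, 5.0]  # strongly trending toy series
print(acf(trend, 0))               # lag 0 is always 1.0
print(acf(trend, 1))               # positive lag-1 autocorrelation
```

The PACF additionally removes the effect of shorter lags; in practice both are usually obtained from a statistics library rather than computed by hand.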
b) Machine Learning Approach
The widely-quoted definition of Machine learning by Tom Mitchell [4] says “A computer program is said
to learn from experience E with respect to some class of tasks T and performance measure P if its
performance at tasks in T, as measured by P, improves with experience E”. Machine Learning algorithms
can then simulate a brain, in the sense that, given a set of information, they can interpret and process
it and draw conclusions. In forecasting, these algorithms are gaining a special place, with very good
results in many fields. The idea is to use a set of data (a time series, for example) as input to the
algorithm, train the algorithm with that data, and produce an expectation of the future values that this
time series can take.
There are several types of machine learning algorithms, the main groups being the supervised and the
unsupervised ones. Supervised learning finds a mapping function between input variables, also known
as features, and output variables, also known as targets, both of them known at the beginning [5], with
the goal of finding the relation/pattern between the input and the output. The ideal result is a perfect
matching between these values, in such a way that given a new input set it is possible to obtain a reliable
output. It is called supervised because the correct output is known from the beginning, so the whole
process can be seen as successive attempts to get it right, constantly adjusting to achieve the perfect
mapping.
Supervised Learning problems can be grouped into classification problems when the output is a
category, and regression problems when the output variable is a real or continuous value.
b.1) K-Nearest Neighbors
When talking about forecasting, many algorithms come up. One of the simplest is the Nearest Neighbor
algorithm. K-Nearest Neighbors (KNN) is a non-parametric method since it does not assume a linear
functional form for f(X), being more flexible than a parametric model [6] because it does not make any
assumptions about the underlying data distribution. Non-parametric models can be more complex to
understand, but they work better than linear models when dealing with a great number of observations.
Besides being a non-parametric method, KNN is also called a lazy algorithm, meaning it does not use
the training data points for any generalization, which makes the training phase minimal and fast. The
KNN algorithm is based on feature similarity, and in regression it is based on how many of the previous
values are considered similar to the out-of-sample data point. This number of previous values is the number of
neighbors.
Assuming a value for the number of nearest neighbors, K, and a point to be predicted at an instant
i, xi, the KNN algorithm identifies the K training observations Ni closest to the prediction point. The
estimation for xi is given by:
f(x_i) = (1/K) ∑_{x_i ∈ N_i} y_i.  (2.4)
In other words, to predict the value at time t, for example with K = 10, the algorithm takes the average
of the last ten values and assumes that this is the value at instant t. The optimal value for K depends
on the bias-variance tradeoff. Bias measures how far a model's predictions are from reality, and variance
represents the variability of a model's prediction for a given data point. A small K provides a more
flexible fit, which has a low bias but a high variance. On the other hand, a large value of K provides
a low variance since the prediction in a region is an average of several neighbors, and changing one
observation has a smaller effect.
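Under this interpretation, KNN regression on a time series reduces to averaging the K most recent (most similar) observations, as sketched below with toy values:

```python
def knn_forecast(history, k):
    """Forecast the next value as the mean of the k nearest (here: most recent) points."""
    neighbors = history[-k:]
    return sum(neighbors) / k

prices = [10.0, 10.4, 10.2, 10.6, 11.0]
print(knn_forecast(prices, k=2))  # average of the two most recent values
```

A small k here tracks the latest movements closely (high variance), while a large k smooths them out (high bias), matching the tradeoff described above.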
The main advantages of KNN are its simplicity, high accuracy, quickness, and the fact that it does not
make assumptions about the data. On the other hand, the prediction phase can be slow, and it can be
sensitive to irrelevant features.
b.2) Support Vector Regression
Support Vector Machines (SVM) are supervised learning algorithms used for classification and regression.
In the context of this problem, Support Vector Regression is the one that matters, since the goal
is to predict a real value, but it is important to understand how support vector machines work in general.
Support Vector Machines were invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963,
and since then the method has undergone some changes. The SVM algorithm is also known as the
widest street approach, and the basic idea is described below.
Figure 2.3: SVM classification approach.
It is given a training dataset ((x_1, y_1), ..., (x_n, y_n)), where y_i is the class to which x_i belongs. The
Support Vector Machine intends to find the maximum-margin hyperplane that divides the group of points
x_i that belong to one class from the group of points x_i that belong to the other class. Taking the example
from Figure 2.3, the x_i values can be health indicators and y_i can be the result of a health test for those
health indicators. The hyperplane then divides the group of health indicators of type “positive for
disease” from the group of indicators of type “negative for disease”. The hyperplane is the decision
boundary and it is made as wide as possible, knowing it cannot contain any sample. Some samples lie
on the margin, and they are called the support vectors.
The hyperplane is the decision boundary, but it is necessary to define a decision rule that uses this
boundary. In Figure 2.3, w is a vector of any length perpendicular to the median line of the hyperplane,
u is an unknown sample and its representative vector, and the decision rule has to tell whether the
unknown sample is on the left or the right side of the “street”. Projecting u onto the perpendicular
vector w can give that information, meaning that if the dot product between the two is greater than a
constant, the sample belongs to the right side of the hyperplane: if w · u ≥ c, u is a positive sample.
Without loss of generality, and assuming c = −b, it is possible to say that if the dot product plus some
constant b is greater than or equal to zero, w · u + b ≥ 0, then u is a positive sample. The problem with
this equation is that there are no constraints to determine which particular w and b to choose, so it is
necessary to add some constraints to calculate these values. Taking one positive sample, x_+, it is
acceptable to insist that its dot product with w plus a constant has to be at least 1, w · x_+ + b ≥ 1, and
taking a negative sample, its dot product with w plus a constant has to be at most −1, w · x_− + b ≤ −1.
This assumption is made for mathematical convenience.
With these last two equations, it is possible to calculate the two values w and b, but it is still a long
calculation, so another variable, y_i, is introduced into the problem for mathematical convenience. This
variable takes the value 1 for positive samples and −1 for negative samples. Multiplying the two last
equations by y_i, the two equations become equal to y_i (w · x_i + b) ≥ 1, which is equivalent to
y_i (w · x_i + b) − 1 ≥ 0, with y_i (w · x_i + b) − 1 = 0 for the samples on the margins.
Choosing one vector for a negative sample and another for a positive sample, the width of the street
can be described as the dot product between the difference of these two vectors, x_+ − x_−, and a unit
vector perpendicular to the “street”. It is known that w is a perpendicular vector of any length, so w
divided by its norm is a unit vector perpendicular to the hyperplane. The width is then equal to
(x_+ − x_−) · w/‖w‖. Looking at this expression, x_+ · w and x_− · w can be solved using
y_i (w · x_i + b) − 1 = 0, giving x_+ · w = 1 − b and x_− · w = −(1 + b). Substituting these results in
(x_+ − x_−) · w/‖w‖, the width of the street becomes 2/‖w‖. The goal of all these calculations is to
maximize the width of the decision boundary and find the decision rule, in other words, to maximize
2/‖w‖. For mathematical convenience, this is the same as maximizing 1/‖w‖, which is the same as
minimizing ‖w‖, and formally the problem of maximizing the width of the hyperplane is mathematically
the same as minimizing (1/2)‖w‖². The original problem of maximizing the width of the hyperplane is
now reduced to Equation (2.5):
Minimize (1/2)‖w‖².  (2.5)
Finding an extremum of a function with constraints is not obvious, and to solve this problem Vapnik
used Lagrange multipliers. The use of these multipliers gives a new expression to minimize without
having to think about the constraints of the problem. L is now what has to be minimized, and it is written
in Equation (2.6):

L = (1/2)‖w‖² − ∑_i α_i [y_i (w · x_i + b) − 1].  (2.6)
To find an extremum of a function, one should set its derivatives to zero, so in this case the derivatives
of L with respect to w and b are set to zero, as shown below:

∂L/∂w = w − ∑_i α_i y_i x_i = 0 ⇒ w = ∑_i α_i y_i x_i.  (2.7)

∂L/∂b = −∑_i α_i y_i = 0 ⇒ ∑_i α_i y_i = 0.  (2.8)
Substituting (2.7) and (2.8) in (2.6) results in Equation (2.9):

L = ∑_i α_i − (1/2) ∑_i ∑_j α_i α_j y_i y_j (x_i · x_j).  (2.9)
The conclusion is that the optimization depends only on the dot product of pairs of samples. Taking
this into consideration, the decision rule is given by the two equations below and it only depends on the
dot product of the sample vectors and the unknown sample vector:
If ∑_i α_i y_i x_i · u + b ≥ 0, then u is a positive sample.  (2.10)

If ∑_i α_i y_i x_i · u + b < 0, then u is a negative sample.  (2.11)
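The resulting decision rule of Equations (2.10) and (2.11) can be sketched directly; the multipliers, labels, and support vectors below are illustrative placeholders, not values obtained from an actual optimization:

```python
def svm_decision(alphas, labels, supports, u, b):
    """Classify u with the rule sign(sum_i alpha_i * y_i * (x_i . u) + b)."""
    dot = lambda a, c: sum(p * q for p, q in zip(a, c))
    score = sum(a * y * dot(x, u) for a, y, x in zip(alphas, labels, supports)) + b
    return 1 if score >= 0 else -1

# Two toy support vectors, one per class; score = 12 - 2 = 10, so +1.
print(svm_decision([1.0, 1.0], [1, -1], [[2.0, 2.0], [0.0, 0.0]], [3.0, 3.0], -2.0))
```

Only samples with non-zero α_i (the support vectors) contribute to the sum, which is why the model depends on a subset of the training data.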
The original support vector machines proposed by Vapnik in 1963 only covered situations where
it was possible to linearly separate the two classes of samples. To make the algorithm able to solve
nonlinear problems, Vapnik applied the kernel trick: the dot product is replaced by a nonlinear kernel
function, allowing the algorithm to fit the hyperplane in a transformed feature space. The most common
kernels besides the linear one are the polynomial and the Gaussian radial basis kernels.
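As an example, the Gaussian radial basis kernel that can replace the dot product is sketched below; the value of gamma is an illustrative choice, normally tuned per problem:

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel: k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 0.0], [1.0, 0.0]))  # identical points give 1.0
print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))  # similarity decays with distance
```

Because the optimization in Equation (2.9) depends only on dot products of pairs of samples, swapping each x_i · x_j for k(x_i, x_j) is all that is needed.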
The idea of using SVM for regression, better known as Support Vector Regression, is very similar to
the idea of SVM for classification. In the SVM version for classification, the model depends only on a
subset of samples because the model does not care about points beyond the hyperplane margins.
Analogously, in regression, the model depends only on a subset of the training data because the cost
function for building the model does not care about points close to the model prediction. Formally, this
problem can be written as a convex optimization:
Minimize (1/2)‖w‖²
subject to y_i − w · x_i − b ≤ ε
           w · x_i + b − y_i ≤ ε.
In the problem above, x_i is a training sample, y_i is the target value for that sample, and w · x_i + b is
the prediction for that sample. The constant ε is a free parameter that serves as a threshold, meaning all
the predictions have to be within an ε range of the true values. The learning algorithm minimizes the
ε-insensitive loss function illustrated in Figure 2.4.
Figure 2.4: ε-insensitive loss function.
This function is zero for errors that do not exceed the tolerance margin [−ε, ε]; by taking this function
as a reference, errors smaller than ε are ignored.
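The ε-insensitive loss described above can be sketched as follows, with ε = 0.1 as an illustrative tolerance:

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Zero inside the [-eps, eps] tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)

print(eps_insensitive_loss(1.0, 1.05))  # inside the tube: 0.0
print(eps_insensitive_loss(1.0, 1.30))  # outside the tube, penalty of about 0.2
```

Points whose residual falls inside the tube contribute nothing to the cost, which is why only the points outside it (the support vectors of the regression) shape the model.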
2.1.4 Metrics
The goal of this work is to forecast values of stock prices. To evaluate the forecast results from a
mathematical and theoretical point of view, there are some metrics that must be taken into account.
Since there is more than one model used to solve the problem, each one of them should be evaluated
using the same metrics in order to have a fair comparison between them. In statistical models, and
sometimes in machine learning, errors are frequently used to measure precision and approximation to
reality. The most common error measures are described in Table 2.1.
The first thing to understand is the concept of error, also known as a residual. When modeling a time
series, the result is a line that best fits the data. This line can be linear, polynomial, etc. Even though it
is the line of best fit, the data points do not fall exactly on it, being scattered around it. A residual is the
vertical distance between a data point and the regression line, and there is one residual value for each
point. If the data point is exactly on the line, the residual is zero; if it is above the line, the residual is
positive; and if it is below the line, the residual is negative. The sum of the residuals is always zero, and
so is its mean. These residuals are often called errors, even though in this context a residual does not
mean there is something wrong with the analysis. Equation (2.12) describes the mathematical
expression of an error:

e = y − ŷ.  (2.12)

Table 2.1: Description of the most common metrics used in statistical works.

Error Measure                  | Criteria
Mean Absolute Error            | MAE = (1/n) ∑ |y_j − ŷ_j|
Mean Absolute Percentage Error | MAPE = (100/n) ∑ |y_j − ŷ_j| / y_j
Mean Squared Error             | MSE = (1/n) ∑ (y_j − ŷ_j)²
Root Mean Squared Error        | RMSE = √((1/n) ∑ (y_j − ŷ_j)²)
Mean Squared Error (MSE) measures the average squared difference between the prediction and the
actual observation. It is commonly used to evaluate model performance. Root Mean Squared Error
(RMSE) is the square root of the average of squared differences between prediction and actual
observation, and it measures the standard deviation of the residuals, that is, how concentrated the data
is around the line of best fit. It is analogous to MSE, but it has the same units as the quantity being
estimated, which is the reason why RMSE is more popularly used than MSE.
MSE and RMSE are more popular when evaluating the quality of a model fitting while MAE and
MAPE are more commonly used when measuring forecast errors in time series forecast. Mean Absolute
Error is the average of the absolute differences between prediction and actual observations where all
individual differences have equal weight. MAE also uses the same scale as the data being measured
and measures the average magnitude of the residuals without considering their direction. MAPE is the
percentage of the average of the absolute error between two variables [7].
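The error measures of Table 2.1 translate directly to code; the short vectors below are illustrative toy values:

```python
import math

def mae(y, yhat):
    """Mean Absolute Error: average magnitude of the residuals."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    """Root Mean Squared Error: same units as the data, penalizes large residuals."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

actual, predicted = [2.0, 4.0, 6.0], [2.5, 3.5, 6.0]
print(mae(actual, predicted))   # (0.5 + 0.5 + 0.0) / 3
print(rmse(actual, predicted))  # penalizes large residuals more than MAE
```

Because RMSE squares each residual before averaging, a single large miss raises it much more than it raises MAE.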
Since this work intends to use forecasting techniques in the stock market, errors by themselves do not
give an absolute evaluation [8]. There are forecast results that can have very low values of RMSE and
MAE and still generate losses for the trader. For example, if a trader buys shares of a company
suspecting that the share price will increase by $2 the day after, and the actual value happens to
decrease by $1, the error is not very significant, but the prediction will create losses, since the trader
bought a share expecting it to be more valuable the day after and the opposite happened. To evaluate
whether a prediction creates gains or losses, there are some common evaluation metrics that need to
be considered when talking about forecasting in the stock market. These metrics are described in
Table 2.2.
Table 2.2: Description of the most common metrics used in computational finance.

Metric               | Description
Return On Investment | ROI = (Gain of investment − Initial investment) / Initial investment
Sharpe Ratio         | SR = (Mean Return − Risk Free Rate) / Std Return
Accuracy             | Accuracy = Nr. of right guesses / Total of trades
The Return on Investment (ROI) is a performance metric used to evaluate the efficiency of an in-
vestment, measuring the amount of return on an investment relative to the investment’s cost, and it is
calculated as shown in Table 2.2. The result is usually expressed as a percentage.
Since ROI does not measure how much risk is involved in producing that same return, Sharpe Ratio
is commonly calculated and usually is what the hedge funds want to maximize. Sharpe Ratio is one of
the most referenced risk/return measures used in finance and describes how much excess return one
receives for the extra volatility of holding a riskier asset. A possible scenario where the use of the Sharpe
Ratio makes sense is when comparing an investor A with a return of 15% and an investor B with a return
of 12%. At first sight, it may seem that A is the better performer. However, A may have taken a larger
risk, so B can have a better risk-adjusted return. For future interpretation, a ratio of 1 or better is
considered good, 2 or better is very good, and 3 or better is excellent.
Accuracy is also important when considering the evaluation of results, and it is simply the ratio between
the right guesses about the increase or decrease in the share price and the total number of trades
executed. It is also a widely used metric in many works about forecasting in the specific context of
financial data.
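The financial metrics of Table 2.2 can be sketched in a few lines; the investment values and return series below are illustrative figures:

```python
def roi(final_value, initial_investment):
    """Return On Investment as a fraction of the initial outlay."""
    return (final_value - initial_investment) / initial_investment

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the standard deviation of returns."""
    mean = sum(returns) / len(returns)
    std = (sum((r - mean) ** 2 for r in returns) / len(returns)) ** 0.5
    return (mean - risk_free) / std

print(roi(1200.0, 1000.0))         # a 20% return on investment
print(sharpe_ratio([0.10, 0.20]))  # about 3.0: excellent by the rule of thumb above
```

Two strategies with the same ROI can thus have very different Sharpe Ratios if one achieved its return with much more volatile period returns.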
When looking at the results of a work, one may not have the sensitivity to understand whether it is a
good or a bad result, or whether it is in line with current market return values. Analyzing the Bloomberg
stocks section, the average annual return in American indexes is 15.27%, in Europe, Middle East and
Africa indexes it is 4%, and in Asian indexes the average is 18.69%. For example, Google shares have
an average annual return of 11% and the S&P 500 an average of 14.19%.
2.2 Related Work
This section describes some works developed in the financial context and some models and algorithms
techniques used to predict financial time series.
2.2.1 Works about modeling and forecasting a time series
This section presents works concerning modeling and forecast in computational finance field using stock
and Foreign Exchange (FOREX) data.
a) Statistical Approach
Rounaghi and Zadeh [9] tried to model and forecast the stock value of 350 firms listed in London Stock
Exchange and S&P 500 from 2007 until the end of 2013 using an autoregressive and moving average
model (ARMA model). As a starting point, they verified that a forecast must take three factors into
consideration: 1) the choice of the time periods (lags) used as a base, 2) the market trend, and 3) the
prediction period. They applied monthly and yearly forecasting to both the London Stock Exchange and
S&P 500 Index. To model monthly data from London Stock Exchange, and according to PACF and ACF
graphs, the model used is ARMA (4,4) because the lag beyond which the PACF cuts off is 4 and is
the indicated number of AR terms, and the lag beyond which the ACF cuts off is 4 and is the indicated
number of MA terms. The yearly data from London Stock Exchange shows an increasing behavior, being
necessary to eliminate this trend using the regression method. After this elimination, and according to
the PACF and ACF graphs, the chosen model is ARMA(3,3). To model monthly data from S&P 500,
and according to PACF and ACF graphs the model used is ARMA(4,4) because the lag beyond which
the PACF cuts off is 4 and is the indicated number of AR terms, and the lag beyond which the ACF
cuts off is 4 and is the indicated number of MA terms. Lastly, to model yearly data from S&P 500,
and according to PACF and ACF graphs, the model used is the ARMA (3,3) because the lag beyond
which the PACF cuts off is 3 and is the indicated number of AR terms, and the lag beyond which the
ACF cuts off is 3 and is the indicated number of MA terms. To measure the quality of the proposed
ARMA model, researchers used MAE (Mean Absolute Error), MAPE (Mean Absolute Percentage Error),
MDAPE (Median Absolute Percentage Error), SMDAPE (Symmetric Median Absolute Percentage Error),
and MASE (Mean Absolute Scaled Error). All the measures are calculated with the following definitions:
Y_t is the observation at time t = 1, 2, ..., n; F_t is the forecast of Y_t; e_t is the forecast error
(e_t = Y_t − F_t); p_t = 100 e_t / Y_t is the percentage error; and finally q_t is determined using
Equation (2.13):

q_t = e_t / ( (1/(n−1)) ∑_{i=2}^{n} |Y_i − Y_{i−1}| ).  (2.13)
The results show that medium- and long-term forecasting of time series is possible in the S&P 500 and
the London Stock Exchange at the error level of 1%. Both markets are considered efficient and
financially stable during periods of boom and bust. The statistical analysis of the S&P 500 shows better
results than the London Stock Exchange in both medium and long horizons. The analysis of the London
Stock Exchange shows better results in medium horizons (monthly), outperforming the yearly results.
Vantuch et al. [10] also tried to predict future prices of the Microsoft stock (MSFT) using the ARIMA
model. The data is in a daily format and it has four years of length. To calculate the values of (p,d,q),
a Genetic Algorithm is used and the best found model was ARIMA (12, 2, 8). The Akaike Information
Criteria (AIC) and the Baysien Information Criteria (BIC) values of this model were compared with models
chosen without the help of the Genetic Algorithm. The Akaike Information Criterion is an estimator of
the relative quality of statistical models for a given set of data, and the Bayesian Information Criterion
is also a criterion for model selection among a finite set of models. The model with the lowest AIC and
BIC is preferred. The results for the genetic-algorithm-assisted model, GA-ARIMA (12,2,8), showed a
BIC of 458.6266 and an AIC of 400.4396, while an ARIMA (2,1,3) without the Genetic Algorithm showed
a BIC of 434.0470 and an AIC of 408.1205. As is observable, the GA-ARIMA model did not show
significantly better results, its AIC being only slightly lower than the AIC of the plain ARIMA. Also, the
tests with PSO optimization did not prove that the estimation of the
to the AIC of the ARIMA. Also, the tests with PSO optimization did not prove that the estimation of the
coefficients by PSO has significant importance in the ARIMA results, maybe because of the low number
of PSO iterations. PSO is more popular in parallel computing, where it can obtain better results.
Wu & Lu [8] used Neural Networks to predict future values of the S&P 500 Index. Their paper
compares Neural Networks performance against an ARIMA model and the Neural Network model out-
performed the ARIMA model only in a stable market. When dealing with volatile markets, the Neural
Network system only showed an accuracy of 23% and the ARIMA’s accuracy was 42%. The same
comparison was conducted by Kamruzzaman & Sarker [11] but applied to exchange rates. The Neu-
ral Networks were trained with back-propagation, scaled conjugate gradient, and back-propagation with
Bayesian regularization. The algorithm used technical indicators and outperformed the ARIMA model,
having an impressive accuracy of 80%. Considering these results, Gerlein et al. [8] stated that an
accuracy of 80% must be considered with care, since only the best results are reported and, in general,
machine learning techniques do not present such high levels of accuracy.
b) Machine Learning Approach
Machine learning is becoming very popular in all kinds of prediction. Some authors see the forecasting
challenge as a simple classification problem where they only want to classify the future in classes such
as up, down or sideways trends. Others use this classification only as a first stage.
Mandziuk et al. [12] used Neural Networks to train data in order to do a prediction for a 5-day pe-
riod of EUR/USD trading. The data that is used in Forex has a limited time of applicability so the input
data in forecasting models should change and should be as diverse as possible. In order to choose a
suitable subset of input variables to train with Neural Networks, Mandziuk et al. [12] used GA to perform
the selection process out of a large pool of diverse data sources available. Supposing that a chromo-
some consists of N data sources, the Neural Network has N inputs, N/2 hidden layers, and one output
(forecasted change of EUR/USD on the following day). If the output is positive, it means a purchase
opportunity. Results were compared with MACD, MA and CONTINUE methods, three deterministic
methods, and also with an early version of the proposed model. The neuro-evolutionary model proved
to make more than 56% correct decisions, 30% more than wrong ones, and it showed far more activity
than the rest of the algorithms in the comparison. The weighted version achieved more than 111% profit,
corresponding to more than 25% annual profit.
Yoo et al. [13], in their survey on machine learning techniques for stock market prediction, talk about
the use of Neural Networks, Support Vector Machines, and Case Based Reasoning. The researchers
admit that Neural Networks are gaining a lot of popularity in this field of study, but they have some
related issues, such as the black box problem, meaning that the significance of each variable is not
known, nor is it possible to understand how the network produces future prices. Another problem
with Neural Networks is the overfitting problem since Neural Networks fit the data too well and lose the
ability of generalization. This can be due to many nodes in the networks or long periods of training. Yoo
et al. [13] also assume Support Vector Machines to be very interesting when applied to classification
and regressions tasks in time series prediction related to financial applications. Unlike Neural Networks,
Support Vector Machines are resistant to overtraining achieving a high generalization performance and
one of the main advantages is that it is equivalent to solving a linear quadratic problem, having a unique
and globally optimal solution while Neural Networks have the danger of getting stuck at local minima.
Despite this, when entering with event information such as web mining information, Neural Networks
show better results.
Kim [14] also stated some of the limitations of Neural Networks, such as the overfitting problem and
the local optimal solution, and tried Support Vector Machines to solve the problem of predicting future
prices in the stock market. Technical indicators are used in this solution and the prediction is done in
a way that the output only takes values “0” if next day’s index is lower than today’s index, and “1” if
the next day’s index is higher than today’s index. In this solution, Support Vector Machine outperforms
Backpropagation Neural Networks and Case Based Reasoning, with a hit ratio of 57.8313%, Neural
Networks with 54.7332%, and CBR with 51.9793%. This study ends with a proposal of a Support Vector
Machine hyper-parameters optimization and also with the conclusion that low accuracies are a common
and expected result when dealing with capital markets since there is no single model perfectly suited in
all market conditions. Tay and Cao [15] compared Support Vector Machines against back-propagation
Neural Networks to forecast futures contracts, and the Support Vector Machines solution obtained better
accuracy (47.7%) than the Neural Networks solution (45.0%).
The same happened in the work reported by Chen and Shih [16] where the two techniques were
compared when applied to six Asian indices, with an accuracy of 57.2% for the Support Vector Machines
and 56.7% for the Neural Networks.
Putting Neural Networks aside and entering with K-Nearest Neighbors, Chen and Hao [17] pro-
posed a hybridized framework of the Feature Weighted Support Vector Machine (FWSVM) and Feature
Weighted K-Nearest Neighbors (FWKNN) to predict stock market indices. The FWSVM was used to
classify the technical indicators of the stock data and the output of the classification is either “1” or “-1”.
This value is then compared with the class label to compute the accuracy of the model. After this classi-
fication step, an FWKNN algorithm is used to find K nearest neighbors of the testing data and to evaluate
the mean of those neighbors to predict prices. The proposed model was applied to the Chinese stock
market (the Shanghai and Shenzhen stock indices) and the results were slightly better than for regular
Support Vector Machine with K-Nearest Neighbors approach (SVM-KNN) and sometimes even equal.
For Shanghai composite index, FWSVM-FWKNN is better than SVM-KNN for time horizons of 1, 5,
15, and 30 days, even though the results have very similar values of Mean Absolute Percentage Error
(MAPE) and RMSE. For a time horizon of 10 and 20 days, the values of MAPE are the same. When
observing the Shenzhen composite index results, the SVM-KNN and FWSVM-FWKNN are even more
similar.
Still within the scope of simpler algorithms, in his work with simple classifiers (instance-based classi-
fiers, decision trees, and rule-based learners), Barbosa [18] claimed outstanding financial results taking
advantage of the low computational requirements for both the training and the classifying process of
such algorithms.
Gerlein et al. [8] concluded in their work that models do not generalize well when using a large dataset, since points in time closer to the trading period being predicted are more likely to exhibit similar conditions. For testing accuracy and cumulative returns, Gerlein et al. [8] obtained the best setup with 1000 instances, retrained over 10 periods with five attributes. Even though the results were very satisfying, with accuracies reaching 53.70% and cumulative returns reaching 156.82%, the trading period between 2007 and 2009 does not show good performance, since it coincides with the economic crisis. This may suggest that simple machine learning algorithms can be useful in times of normal market conditions, but can be weak predictors in unusual periods.
2.2.2 Works on forecast concerning Big Data
This section describes forecasting works that took into consideration the rising problem of Big Data.
Liu and Wang [19] considered trading in financial markets a big data problem, since large transaction data for 120 futures contracts are produced every minute. Given this amount of data, the researchers decided to store it in a distributed way using Hive, a high-performance distributed data warehouse system. Several replicas are stored to keep the information safe and reliable. After the storage step, the data is processed with MapReduce. The control node distributes assignments while the computing nodes compute the features (maximum, minimum, count, summation, mean value, sigma, median value, and median absolute deviation (MAD)) and train the proposed Decision Tree with Support Vector Machine model (DT-SVM). First, the data is fetched from the distributed database and then split into different groups according to their time spans. Features are extracted from each group, each group being an input of the distributed system. After this step, the number of values larger or smaller than the mean value, the median value, and the number of values within the 3-sigma rule are calculated. Data is classified as "1" if the price has increased by a certain percentage, "-1" if it has decreased, and "0" if it remained the same. The hybrid model, DT-SVM, is then used to train on the data with the help of the statistical features. The hybrid model is needed to solve the data imbalance problem and to filter out the noise: Decision Trees filter most of the noise and leave data of good quality for the Support Vector Machine, which then handles the complexity of the data. The overall strategy is represented in Figure 2.5.
This model was compared to Bootstrap-SVM, Bootstrap-DT, and Back Propagation Neural Networks (BPNN) and outperformed these three strategies in precision rate, recall rate, and F1 rate. Using timestamps of 60 minutes, the precision of this model is about 70%. The reason for this is the use of a two-phase classifier to handle large amounts of information, bearing in mind the imbalance and noise that characterize a big data platform.
Figure 2.5: Representation of the problem architecture (adapted from [19]).
Liu et al. [20] used clustering to improve NNs in order to forecast a financial time series. Clustering is the task of grouping data points so that points within each cluster are similar to each other [21]. The main purpose of this grouping is that every group is used to train a corresponding neural network for prediction, so the model does not have to handle a big data group as a single input to the neural network. The clustering algorithm used in this research was fuzzy c-means, and the neural network is an RBF one, which converges faster and models more precisely than a back propagation neural network. To solve the data imbalance problem, an experimental data preprocessing method is used that handles normalization and smoothing in one process. To evaluate the system, the researchers adopt the Average Absolute Error (E) and Trend Accuracy in Direction (TAD). The RBF Neural Network with clustering was compared with the plain RBF in precision, efficiency, complexity, and outlier detection. Considering precision, RBF Neural Networks with clustering are more precise, the predicted value being closer to the real one; considering efficiency, smaller data groups lead to shorter training times, reducing the convergence time; considering complexity, clustering contributes to complexity reduction because each group has high similarity between its points; finally, considering outlier detection, clustering can efficiently detect outliers and keep them away from the training.
2.2.3 Summary
This chapter introduced all of the terms that will appear in the next sections and gave a brief explanation of each of them. More specifically, it presented a description of what a stock market is, the techniques used in this work to predict stock prices, and how this prediction is evaluated. The analysis of some related works is also presented in this chapter, since it is important to know what has been done in this thesis' field of research.
Table 2.4 is a summary of all the referenced works and is intended to show a comparison between them, in order to serve as a base and starting point when thinking about which techniques to use. Looking at this table, it is possible to see that Support Vector Machines, Neural Networks, and the ARIMA model are the most popular models. It is also possible to conclude that accuracy appears very often as an evaluation function, as do ROI and RMSE/MSE. To compare the models, only the accuracy is used, since it is presented in almost all of the works and is a good trade-off between error and profit, serving both goals: forecasting close to the real value and generating profit. In general, accuracies are between 50% and 60%, and accuracies far above those values need to be analyzed carefully [8]. Support Vector Machines and Neural Networks have the best accuracies compared to the other models. Looking at the "Period" column, most works use around five years of data, except [11] and [14], which used roughly ten years.
Based on this comparison table and the analysis made in this section, another table, Table 2.3, illustrates the advantages and disadvantages of the principal techniques presented in the related works.
Table 2.3: Algorithm comparison based on the Related Work.

NN
  Advantages:
    1. Capable of discovering non-linear relationships, which makes it ideal for modeling non-linear dynamic systems;
    2. One of the more accurate algorithms.
  Disadvantages:
    1. Overtraining problem, losing generality;
    2. Black box problem, not revealing the significance of each variable or the way independent variables are weighted;
    3. Danger of getting stuck at local minima.

SVM
  Advantages:
    1. Training is equivalent to solving a linearly constrained quadratic problem;
    2. Does not have the black box and overtraining problems;
    3. The solution is relatively unique and globally optimal.
  Disadvantages:
    1. Usually not as accurate as Neural Networks.

ARIMA
  Advantages:
    1. Good results for clear trends;
    2. Often used in forecasting research.
  Disadvantages:
    1. Usually not as good for irregular series;
    2. Less accurate in multiple-period-ahead forecasting.
Table 2.4: Summary of Related Work.

Work | Date | Method | Financial Application | Period | Evaluation Function | Comparison | Accuracy
[9]  | 2016 | ARMA | London Stock Exchange and S&P 500 | 2007-2013 | MAE, MAPE, MDAPE, SMDAPE, MASE | Monthly vs. S&P 500 | 24.9%
[10] | 2014 | GA-ARIMA | Microsoft Stock Price Index (MSFT) | 4 years | BIC, AIC, MSE | ARIMA | -
[11] | 2003 | NN | USD/AUD, GBP/AUD, JPY/AUD, SGD/AUD, NZD/AUD, CHF/AUD | 1991-2002 | NMSE, MAE, DS | ARIMA | 80%
[12] | 2016 | NN | EUR/USD | 2009-2014 | Profit, Efficiency, Weighted Efficiency | MACD, MA, CONTINUE | -
[14] | 2003 | SVM | Stock KOSPI | 1989-1998 | Accuracy | NN, CBR | 57.8313%
[15] | 2001 | SVM | Stocks CME-SP, CBOT-US, CBOT-BO, EUREX-BUND, MATIF-CAC40 | 1992-1999 | Accuracy | NN | 47.7%
[16] | 2006 | SVM | Futures NK, AU, HS, ST, TW, KO | 1984-2002 | MSE, NMSE, MAE, DS, WDS | NN | 65.333%
[17] | 2017 | FWSVM-FWKNN | Stocks SSE, SZSE | 2008-2014 | MAPE, RMSE | SVM-KNN | -
[18] | 2011 | IBC, DT, RBL | Exchange Rates and Stocks | 2007-2009 | MDD, ROI, RMD | Between each other | 21.9%
[8]  | 2016 | NB, K* model, C4.5, LMT, OneR | Exchange Rates and Stocks | 2007-2013 | Accuracy, ROI | Between each other | 53.70%
Chapter 3
Proposed Architecture
The proposed solution intends to predict the future values of a share taking into consideration the past
values of the same share. One statistical algorithm and two machine learning algorithms will be used in
order to solve the proposed problem of forecasting in the stock market. This chapter will go deeper into
the architecture of this solution.
3.1 Architecture Design
This section describes the architecture design of the proposed solution for forecasting future prices in the stock market. The module developed in this work was implemented in Python and can be divided into three layers, as shown in Figure 3.1.
Figure 3.1: Proposed architecture.
The overall solution uses a statistical method, ARIMA, and two machine learning algorithms, SVR and KNN, to solve the same problem, i.e. forecasting future prices of a share. To perform this prediction, past price sequences are used as inputs to the algorithms. After fitting ARIMA and the machine learning algorithms to the data, the result of the forecast is evaluated using different metrics and evaluation parameters. The algorithms are also compared with each other to draw some conclusions about which of them is better at solving the problem in different contexts. The choice of these three specific algorithms was related not only to the state of the art but also to the motivation of trying three different methods to solve the same problem. ARIMA was chosen because it is one of the most popular statistical models and is well documented. In the machine learning field, since Support Vector Machines were showing such good results in classification tasks [13], SVR was chosen to check the behavior of Support Vectors in regression tasks. Finally, KNN is introduced as a representative of simple, lazy machine learning algorithms, to see if a simple non-parametric method can achieve good results in forecasting tasks.
In the architecture design, the Data Layer is responsible for fetching the stock price series needed as inputs to the Train Layer. After the data is fetched, it is divided into two groups: the training set and the testing set. The training data is used to train the algorithm in question (ARIMA or a machine learning algorithm), and the testing dataset is used as a comparison term for the forecast results.
The Train Layer is responsible for the hyper-parameter optimization and model training using the training dataset. This layer is divided into a statistical sub-layer and a machine learning sub-layer, since these two approaches are considered and implemented. This separation is also observable in Figure 3.1.
After the right model is chosen and trained, the forecast can start. The Forecast and Validation Layer is then concerned with the forecasts and their results, as well as their comparison. One-step (daily) and multi-step (weekly and monthly) forecasts will be presented using different evaluation metrics, in order to show a wide comparison between the statistical method and the machine learning algorithms.
3.2 Architecture Implementation
3.2.1 Data Layer
The Data Layer is responsible for treating the price sequences that will serve as inputs to the Train Layer. The two main reasons for this layer to exist are:
1. Transform the .csv files that contain the daily open, close, highest, and lowest prices of a stock into a time series with the close price;
2. Separate the data into training and test sets.
The transformation of a comma-separated values file (.csv file) into a time series takes place in a function that uses the pandas library, and the pseudo-code describing this small function is presented in Figure 3.2.
Figure 3.2: Pseudo-code for the transformation of a .csv file into a time series.
When the data is in a time series format, it is divided into training and test datasets. The reason behind this separation is that the algorithms cannot be evaluated on the same data they were trained on, since that would lead to overfitting and a loss of generality. Overfitting is the inability of a model to generalize to out-of-sample data. It usually happens when too many features are considered and the model fits the training set with a cost function of almost zero. Conversely, underfitting is when the model does not follow the behavior of the training set. Both situations are problematic and, for a better understanding, they are illustrated in Figure 3.3.
(a) Overfitting (b) Underfitting
Figure 3.3: Overfitting and Underfitting.
The training dataset is then used for learning and tuning the hyper-parameters of each of the algorithms, and the test dataset is used to assess the performance of the algorithm in question. Taking this into consideration, the percentage chosen for the training set was 80% and for the test set 20%. This procedure is illustrated in Figure 3.4.
Figure 3.4: Data separation into Training and Test sets.
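The 80/20 split of Figure 3.4 can be sketched as below. This is a minimal, library-free illustration, not the thesis code; note that, unlike a random split, the chronological order of the series is preserved.

```python
def train_test_split_series(values, train_frac=0.8):
    """Chronological split: the first 80% of the series is used for
    training and the remaining 20% for testing (Figure 3.4)."""
    cut = int(len(values) * train_frac)
    return values[:cut], values[cut:]

# Toy series of 10 "prices": 8 go to training, the last 2 to testing
train, test = train_test_split_series(list(range(10)))
```

Keeping the test set strictly after the training set in time avoids look-ahead bias, which a random split would introduce.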
The Data Layer ends when a time series is ready to be used by the algorithms in question. After this step, the training dataset enters the Train Layer, where each of the algorithms is trained using this set of data.
3.2.2 Train Layer
The Train Layer, as the name suggests, is responsible for training each of the three algorithms developed
in this work using the training dataset. Training a model is applying it to a training dataset so that the
model can perceive hidden patterns and mapping relationships that help it perform well during the test
period. This layer is divided into two sub-layers, the statistical and the machine learning one. The
reason behind this separation is the existence of common steps in the implementation of the machine
learning algorithms that do not make sense in the implementation of a statistical model and vice versa.
In the statistical sub-layer the ARIMA model is implemented, and in the machine learning sub-layer, the
Support Vector Machine and K-Nearest Neighbor are implemented.
a) Statistical Sub-Layer
The only statistical method implemented is an ARIMA(p,d,q) model. To use the ARIMA model, the Python statistical library statsmodels.tsa.arima_model is used, and to conduct the stationarity analysis, adfuller is imported from statsmodels.tsa.stattools. The statsmodels.tsa package contains model classes for time series analysis, including ARIMA, as well as related statistical tests such as the Dickey-Fuller test. It is built on top of NumPy and SciPy, and it also integrates with pandas.
The process of training the ARIMA(p,d,q) model with the training dataset is described in three steps: 1) analyze the dataset's stationarity and induce stationarity if needed; 2) find the optimal ARIMA parameters p, d, and q; 3) fit the best ARIMA(p,d,q).
Step 1: Check for stationarity
The first thing to check when applying the ARIMA model is the stationarity of the time series. Price sequences are not usually stationary and typically need some transformations before they show stationary behavior. To check for stationarity, a Dickey-Fuller test is conducted. For this test, the regression model y′t = α + βt + φyt−1 + γ1y′t−1 + γ2y′t−2 + ... + γky′t−k is estimated, where y′t denotes the first-differenced series and k the number of lags to include in the regression. If the original series is non-stationary, then the coefficient φ should be approximately zero. The null hypothesis states that the data is non-stationary; if φ is significantly less than zero, the hypothesis is rejected, meaning that the series is stationary. Based on this stationarity test, two things can be done: 1) stabilize the variance and 2) stabilize the mean. To stabilize the variance, a non-linear transformation should be applied; in this case, a logarithmic one. To stabilize the mean, a first difference is applied, i.e. the differences between consecutive observations are computed. This process should be repeated until the series is stationary. Once the series is stationary, the model can be trained using all the past prices up to instant t.
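The two stabilizing transformations of Step 1 can be sketched as below (a library-free toy illustration with made-up prices, not the thesis code): a logarithmic transform stabilizes the variance, and a first difference stabilizes the mean.

```python
import math

def log_transform(series):
    """Logarithmic transform to stabilize the variance."""
    return [math.log(x) for x in series]

def difference(series):
    """First-difference the series to stabilize the mean, i.e. compute
    the differences between consecutive observations."""
    return [b - a for a, b in zip(series, series[1:])]

# Toy price series growing by 10% each step (an exponential trend)
prices = [100.0, 110.0, 121.0]
stationary = difference(log_transform(prices))
# Log-differencing turns the exponential trend into constant log-returns
```

On this toy series both differenced values equal log(1.1), so the transformed series has a constant mean, which is what the Dickey-Fuller test is checking for.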
Step 2: Finding the best order of p, d, and q
Once the series has a constant mean and variance, the next step is the choice of the right ARIMA model. Selecting the right order means finding the best combination of the three parameters (p,d,q) for a specific training dataset. The only thing that the training algorithm needs to know is the minimum and maximum values that these parameters can take, described in Table 3.1.

Table 3.1: ARIMA parameters.

Parameter   Range
p           [1, 5]
d           [1, 2]
q           [1, 5]

The search for the best ARIMA parameters can be done in various ways, but since the values of p and q are not usually greater than 5, a brute-force search is used to find the best possible combination of parameters. The ACF and PACF plots can help in this task by giving an idea of which lags are correlated, possibly narrowing the search. For each set of parameters, the model is evaluated using the mean squared error (MSE) as the selection criterion. The best combination is the one with the lowest MSE.
Step 3: Fit the model
After choosing the best ARIMA model based on the lowest value of MSE, the model is fitted to the data and future prices can be calculated. ARIMA uses all the information (prices) from the past as inputs.
The pseudo-code for the ARIMA implementation is in Figure 3.5.
Figure 3.5: Pseudo-code for the ARIMA implementation.
b) Machine Learning Sub-Layer
In this sub-layer, two algorithms are implemented: Support Vector Regression (SVR) and K-Nearest
Neighbors (KNN).
To use Support Vector Regression, the scikit-learn library is used. Scikit-learn is an open-source machine learning library for Python, sponsored by INRIA, Telecom ParisTech, and Google through the Google Summer of Code. From scikit-learn, the sklearn.svm implementation is used. Support Vector Regression follows the logic described in Section 2.1.
The implementation of K-Nearest Neighbors is very similar to the Support Vector Regression implementation. The library used is scikit-learn as well, more specifically the sklearn.neighbors module with KNeighborsRegressor.
As the two techniques are machine learning algorithms that use supervised learning, there are some implementation steps in common. These steps are described below and their structure is the following:
1. Find the right number of features;
2. Tune the hyper-parameters of each model;
3. Fit each of the two models.
Step 1: Choosing features and targets
Initially, the dataset that enters this layer is in a time series format. Taking into consideration that the two algorithms use supervised learning, it is necessary to transform the dataset into a format that matches the supervised learning problem. The concept of supervised learning was introduced in Section 2.1; the short idea is to have input variables called features (X), output variables called targets (y), and an algorithm to learn the mapping function from the input to the output. The goal is to approximate this mapping so well that, given new input data (X), it is possible to predict the output variables (y) for that data. Time series data can be phrased as a supervised learning problem, with features and targets, by using previous time steps as input variables (features) and the next time step as the output variable (target).
The use of previous time steps to predict the next time step is called the sliding window method, and the number of previous time steps is called the window width. This sliding window method is the basis of turning any time series into a supervised learning format, and once a time series is prepared this way, any of the standard linear and nonlinear machine learning algorithms can be applied. The window width then corresponds to the number of features. An example of the transformation of a time series into a supervised learning problem with two features and one target is illustrated in Figure 3.6. The (t − 1) and (t − 2) columns are the two features, and the Close Price is the target.

t            1   2   3   4   5   6   7   8
Close Price  10  20  30  40  50  60  70  80

t − 2   t − 1   Close Price
10      20      30
20      30      40
30      40      50
40      50      60
50      60      70
60      70      80

Figure 3.6: Transformation of a time series into a supervised learning format.

The window width, in other words the number of features, is not known at the beginning and has to be calculated. This width can take any value, but there is one optimal width that maximizes the mapping. To find it, windows of width 5, 10, 15, and 22 are tested on the training dataset. These values correspond roughly to one, two, three, and four weeks of prices, since the stock market closes during weekends. The reason why the range does not go beyond 22 is that today's price is more influenced by the last 22 days than by the days before. For each iteration along the window width range, the machine learning algorithm is trained.
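The sliding window transformation of Figure 3.6 can be sketched as below (a minimal, library-free sketch, not the thesis code).

```python
def sliding_window(series, width):
    """Turn a time series into supervised-learning (features, target)
    pairs: the previous `width` values are the features and the next
    value is the target (the sliding window method of Figure 3.6)."""
    X, y = [], []
    for i in range(width, len(series)):
        X.append(series[i - width:i])  # previous `width` prices = features
        y.append(series[i])            # next price = target
    return X, y

# The toy series of Figure 3.6, with a window width of 2
X, y = sliding_window([10, 20, 30, 40, 50, 60, 70, 80], width=2)
```

With width 2, the first training pair is ([10, 20], 30) and the last is ([60, 70], 80), exactly the rows of the supervised table in Figure 3.6.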
Step 2: Tuning hyper-parameters
For each iteration along the window width range, the machine learning algorithm is trained to find the best set of hyper-parameters for that window width. The search for the best set of hyper-parameters is done with a RandomizedSearchCV, implemented by sklearn.model_selection.RandomizedSearchCV. This method is similar to a GridSearchCV, where a Python dictionary is created with combinations of the algorithm's hyper-parameters, and then those combinations are tested and scored by the resulting MSE. The difference between RandomizedSearchCV and GridSearchCV is that in RandomizedSearchCV not all the combinations of hyper-parameter values are tried out; instead, a fixed number of parameter settings is sampled from the specified distributions, reducing the computational effort. The number of parameter settings that are tried out can be chosen.
When calculating the best hyper-parameters, it is important to avoid training and validating on the same data, since this can cause overfitting. The cross-validation method is used to split the data into training and validation sets. The logic behind this method is simply to divide the data into k folds, using k − 1 folds for training and the remaining one for validation and cost calculation (MSE). The process is repeated k times, and the validation fold alternates in each of the k rounds. After the k iterations, the average cost over the validation sets is calculated. In the end, there is an average cost for each set of parameters, and the one with the lowest cost is chosen. This process is illustrated in Figure 3.7. It is important to notice that these training sets are not the same as the one from Figure 3.4. The training set from Figure 3.4 is represented in Figure 3.7 as "All training dataset", and the training and validation sets are partitions of that "All training dataset".
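The k-fold scheme can be sketched as below; `fit_and_score` is a hypothetical stand-in for training a model on one split and returning its validation MSE, so the sketch stays library-free.

```python
def kfold_average_cost(n_samples, k, fit_and_score):
    """Average validation cost over k folds (the scheme of Figure 3.7).
    `fit_and_score(train_idx, val_idx)` stands in for training the model
    on one split and returning its validation MSE."""
    fold = n_samples // k
    costs = []
    for i in range(k):
        # Fold i is held out for validation; the rest is used for training
        end = (i + 1) * fold if i < k - 1 else n_samples
        val = list(range(i * fold, end))
        train = [j for j in range(n_samples) if j not in val]
        costs.append(fit_and_score(train, val))
    return sum(costs) / k

# Toy scorer that just reports the validation-fold size: with 9 samples
# and k=3, every fold holds out 3 points, so the average is 3.0
avg = kfold_average_cost(9, 3, lambda train, val: len(val))
```

Each sample appears in exactly one validation fold, so every data point contributes to the averaged cost exactly once.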
In summary, for each window width the best set of parameters is calculated. At the end of all the iterations through the window width range, the scores of the best parameters for each window are compared, and the one with the lowest MSE is chosen. The result is then the optimal window width (number of features) and the parameter values.
Figure 3.7: Cross-Validation with K=3.
In Support Vector Regression, the three hyper-parameters and their test ranges are described in Table 3.2. The C parameter is called the soft margin. A small value of C allows ignoring points close to the boundary, increasing the margin, while a large value of C assigns a large penalty to errors and margin errors, so the margin is smaller in those cases. The decision boundary is also affected by the kernel, which is commonly either linear, polynomial, or Gaussian, also known as Radial Basis Function (RBF). The degree of the polynomial kernel and the width parameter of the Gaussian kernel, gamma, influence the flexibility of the decision boundary.
Table 3.2: SVR parameters.

Parameter   Range
C           1, 10, 100, 1000
Kernel      Linear, polynomial, RBF
Gamma       0.1, 0.01, 0.001, 0.0001
In K-Nearest Neighbors, there is only one hyper-parameter to tune, described in Table 3.3 together with its test range. The value of K is the number of nearest neighbors used to predict the value in question.

Table 3.3: KNN parameters.

Parameter   Range
K           [1, 50]
The process up to this point is described in the pseudo-code of Figure 3.8.
Figure 3.8: Pseudo-code for feature selection and hyper-parameters tuning.
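Assuming the standard scikit-learn API, the tuning step for SVR might look roughly like the sketch below. The tiny synthetic series and the window width of 5 are illustrative assumptions, while the parameter ranges come from Table 3.2; a time-series-aware splitter is used in place of plain k-fold so validation folds always come after their training folds.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

# Toy price series (an assumption for illustration, not the thesis data)
prices = np.sin(np.arange(60) / 5.0) + np.arange(60) / 30.0

# Sliding window transform: last 5 prices as features, next price as target
width = 5
X = np.array([prices[i - width:i] for i in range(width, len(prices))])
y = prices[width:]

# Hyper-parameter ranges from Table 3.2
param_dist = {"C": [1, 10, 100, 1000],
              "kernel": ["linear", "poly", "rbf"],
              "gamma": [0.1, 0.01, 0.001, 0.0001]}

# Sample 10 settings at random and score them by (negated) MSE
search = RandomizedSearchCV(SVR(), param_dist, n_iter=10,
                            scoring="neg_mean_squared_error",
                            cv=TimeSeriesSplit(n_splits=3), random_state=0)
search.fit(X, y)
best = search.best_params_
```

The same skeleton works for KNN by swapping SVR for KNeighborsRegressor and the dictionary for `{"n_neighbors": range(1, 51)}`.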
Step 3: Fit the model
In this step, the model is fitted using the best window width and hyper-parameters calculated. For example, for Support Vector Regression, if the optimal number of features discovered is 10, and the optimal hyper-parameters are [C=1, gamma=0.1, kernel='rbf'], SVR will use sequences of the last 10 prices to characterize the next instant, and tomorrow's price will be predicted using a soft margin of 1 and a Gaussian kernel with an inverse-width parameter of 0.1. Similarly, for K-Nearest Neighbors, if the optimal number of features discovered is 20, and the optimal hyper-parameter is [K=10], KNN will use sequences of 20 prices to characterize the next instant, and it will use the 10 closest neighbors to predict tomorrow's price.
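The prediction rule that KNN applies here can be illustrated with a minimal, library-free sketch (the thesis itself uses sklearn's KNeighborsRegressor); the training windows and the query window are made-up numbers.

```python
def knn_predict(X_train, y_train, x_new, k):
    """Predict the next price as the mean target of the K training
    windows closest (in squared Euclidean distance) to the new window."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    ranked = sorted(zip(X_train, y_train), key=lambda xy: dist(xy[0], x_new))
    return sum(target for _, target in ranked[:k]) / k

# Made-up training windows (width 2) and their next-day prices
X_train = [[10, 20], [20, 30], [30, 40], [40, 50]]
y_train = [30, 40, 50, 60]

# The two nearest windows to [32, 41] are [30, 40] and [40, 50],
# so the prediction is the mean of their targets: (50 + 60) / 2
pred = knn_predict(X_train, y_train, [32, 41], k=2)
```

Being a lazy learner, KNN does no work at "training" time; all the cost is in this neighbor search at prediction time.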
3.2.3 Forecast and Validation Layer
The last step for each of the studied models is the validation of the predictions using the test dataset. There are four evaluation metrics calculated in this layer: mean absolute error (MAE), return on investment (ROI), Sharpe Ratio (SR), and accuracy.
The mean absolute error is implemented with sklearn.metrics, which contains the mean_absolute_error function. This function takes as inputs the actual values and the predicted values, calculates the absolute deviation between each pair of predicted and actual values, and computes the average of these deviations. To accomplish this calculation, the predicted value of each trade is appended to a list called "predictions" that serves as an input to this mean absolute error function. The pseudo-code of this calculation is in Figure 3.9.
Figure 3.9: MAE calculation.
To calculate the Return on Investment, a strategy must be implemented to enter and exit the market. Since the focus of this work is forecasting prices and not optimizing strategies, a very simple strategy is implemented, just to evaluate how well a strategy would work depending on the prediction made. Depending on the prediction for the following day, one of two actions is taken in each trade:
1. If the predicted price is greater than today's price, one should buy shares (go long), and a variable named strategy takes the value 1;
2. If the predicted price is smaller than today's price, one should "short" the stock, and the strategy variable takes the value −1.
After the strategy definition, the profit for that same trade is calculated using today's price as the "cost of the investment", tomorrow's actual price as the "gain of the investment", and multiplying the relative change by the strategy variable, 1 or −1. For example, if the predicted price reflects an increase in the share price, the strategy takes the value 1, and if the price actually increases, say from $50 to $75, the profit is 0.50 times 1, a positive profit. If, on the other hand, the price dropped to $25, the profit is −0.50 times 1, a negative profit, since the trader was wrong about tomorrow's price. If a "go long" position was already taken and the chosen strategy is again "go long", the position is maintained. The same happens for the short strategy. If the trader owns a share and the chosen strategy has the value −1, this is equivalent to selling and immediately "going short" on that share. The pseudo-code of this calculation is in Figure 3.10.
Figure 3.10: ROI calculation.
The Sharpe Ratio is calculated at the end of the test period using the profits accumulated along the trading period, dividing their mean by their standard deviation. For that, a list called profits, containing the profit of each trade, is used as the input to the Sharpe Ratio function. The pseudo-code of this calculation is in Figure 3.11.
Figure 3.11: Sharpe Ratio calculation.
Finally, the accuracy counts the number of times the strategy was in agreement with reality, in other words, the number of times that the strategy chose long or short and the price actually rose or dropped, respectively. The pseudo-code of its calculation is in Figure 3.12. A right guess generates a positive profit, so the accuracy is the number of positive profits divided by the number of trades. To count the right guesses, each time a positive profit is calculated, a variable that counts the number of right guesses (nr_right_guesses) is incremented; if the profit is negative, the variable keeps its value. The accuracy is then the result of this variable divided by the number of trades (t).
Figure 3.12: Accuracy calculation.
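The four metrics and the long/short strategy described above can be combined in one library-free sketch. The toy prices and simplified relative-profit bookkeeping (no held-position logic) are assumptions for illustration; the thesis uses sklearn.metrics for the MAE.

```python
import statistics

def evaluate(actual, predicted):
    """Compute MAE, cumulative ROI, Sharpe Ratio and accuracy for one
    test period, applying the simple long/short strategy per trade."""
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    profits = []
    for t in range(len(actual) - 1):
        # Go long (1) if tomorrow's predicted price beats today's, else short (-1)
        strategy = 1 if predicted[t + 1] > actual[t] else -1
        # Relative profit: today's price is the cost, tomorrow's the gain
        profits.append(strategy * (actual[t + 1] - actual[t]) / actual[t])
    roi = sum(profits)
    # Sharpe: mean profit over its standard deviation (needs >= 2 varied trades)
    sharpe = statistics.mean(profits) / statistics.stdev(profits)
    accuracy = sum(p > 0 for p in profits) / len(profits)
    return mae, roi, sharpe, accuracy

actual    = [50.0, 75.0, 60.0, 66.0]   # toy test-period prices
predicted = [51.0, 70.0, 55.0, 70.0]   # toy model predictions
mae, roi, sharpe, accuracy = evaluate(actual, predicted)
```

On this toy run every directional call is right (accuracy 1.0): the first trade matches the $50 to $75 example in the text, earning a profit of 0.50.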
These metrics are not individually calculated, even though the pseudo-code for each metric is shown separately. This separation is intended only to explain the implementation of each specific metric, for better understanding and replication. Figure 3.13 shows how the calculation of all these metrics is integrated in the prediction period.
Figure 3.13: Evaluation metrics calculation.
When validating the models, different stocks are considered as well as different trading periods in
order to evaluate the model’s performance in different situations.
3.2.4 Summary
This chapter presented the overall architecture. The implementation, from the Data Layer to the results evaluation, is fully described, as well as the pseudo-code for each of the algorithms.
Chapter 4
Results
This chapter describes the performance of each of the three implemented algorithms, following the architecture described in Chapter 3, in the task of forecasting future prices of a quoted stock. Since stocks can have very different behaviors, two different stocks were chosen: one with a clear increasing trend, and one considered to have no trend, with ups and downs. The three algorithms are tested on these two stocks with a test period of 1 year. The stock market is closed during the weekend, so a trading period of 1 year corresponds to roughly 251 trading days, 1 week corresponds to 5 trading days, and 1 month to 22 trading days. The chapter is divided into five sections, as enumerated below:
1. ARIMA performance: this section describes how ARIMA behaves with a clear-trend stock and with a sideways stock, starting with the process of turning the series stationary and ending with a conclusion about the obtained results and a comparison with Buy&Hold (B&H) and ARIMA-related works;
2. KNN performance: this section describes how KNN behaves also with a clear trend stock and with
a sideways stock, going through the calculation of its K parameter and ending with a conclusion of
the obtained results and a comparison with B&H and previous works related to KNN;
3. SVR performance: this section describes how SVR behaves with the same clear trend stock and
sideways stock, discriminating the hyper-parameters that were used by the algorithm for each
situation, ending with a conclusion of the obtained results and a comparison with B&H and state-
of-the-art;
4. Comparison of the three models taking into account the obtained results in the three previous
sections;
5. Studying the impact of retraining KNN and SVR.
Stocks with clear trends are usually easy to identify because they have very strong increasing or decreasing behaviors. Even though this is often clear just by looking, a time series can sometimes be difficult to analyze, so it is common to decompose the time series to check for this trend. This decomposition results in the identification of a trend, a seasonal component, and noise (the random variation in the series). The stock used as the "clear trend stock" is VeriSign, Inc. (VRSN) over a five-year period from 2013 to 2017. The graphical representation of the time series is shown in Figure 4.1, as well as its decomposition into trend, seasonality, and noise components, obtained with matplotlib. As can be seen in Figure 4.1, the stock has a very strong and clear trend but does not have any kind of seasonality, since the variation of the seasonal component is only 0.0001 units.
Figure 4.1: VRSN Stock Decomposition.
There are some stocks that do not have a clear behavior: in general, they neither constantly increase
nor constantly decrease. These quoted companies have ups and downs in their growth, and Franklin Templeton
Investments (BEN stock) is one of those sideways stocks. The graphical representation of this stock is
shown in Figure 4.2, as well as its decomposition into trend, seasonality and noise components. As is
observable, the trend component does not show any consistent uptrend or downtrend, and the values for
seasonality are again very low and insignificant.
The three algorithms are then tested on these two different stocks and compared with each other, also
considering the B&H strategy and the collected state-of-the-art. The Buy&Hold strategy, also known as
B&H, is a common strategy, especially in the stock market, that consists in buying a share and holding it for
months or even years, expecting that at some point in time it will yield a profit. The Short&Hold strategy, also
known as S&H, consists in shorting a share and waiting months or even years with the same expectation.
The Random Walk theory suggests that share prices take a random
and unpredictable path, since price movements are independent of each other, so a past movement does not
influence a future movement. For each case study, the evaluation metrics will be the ones described in
Chapter 2.1: Mean Absolute Error (MAE), Return on Investment (ROI), Sharpe Ratio (SR) and Accuracy.
Also, for each of the case studies, daily, weekly and monthly forecast periods will be presented.
Figure 4.2: BEN Stock Decomposition.
4.1 ARIMA Performance
In this section, ARIMA is tested with a clear trend stock and a sideways stock. For each stock, daily, weekly
and monthly forecasts will be performed and the results will be discussed, also taking into account the
B&H strategy. The ARIMA training set for both stocks corresponds to 4 years, from 2013-02-08 to 2017-
02-08, corresponding to 80% of the total dataset. The test set corresponds to 1 year, from 2017-02-09
to 2018-02-07. This separation is illustrated in Table 4.1.
Table 4.1: ARIMA data.

Parameters     Range
Training Set   from 2013-02-08 to 2017-02-08
Test Set       from 2017-02-09 to 2018-02-07
Before applying ARIMA to any kind of stock, the stationarity of the series must be assured. After the
time series becomes stationary, the model can be fitted and the predictions can be made. ARIMA uses all
previous data during the training phase, and the same happens in the test period. For example, if the
goal is to forecast tomorrow's price and there is data available from the last 5 years, ARIMA uses all that
data during the training phase, and when forecasting tomorrow's price, all the data serves as input.
4.1.1 Stock with a Clear Trend
This subsection describes the ARIMA performance with a clear trend stock, the VeriSign, Inc. (VRSN)
stock. The time series representing VRSN is not stationary, since it has an increasing
behavior and a stationary time series must have a constant mean and variance, as explained in the
ARIMA background (Section 2.1). This can be verified with a Dickey-Fuller test, since it can sometimes
be hard to be sure about the stationarity of a time series just by looking at it. The results of this test are
described in Figure 4.3.
Figure 4.3: Results of Dickey-Fuller Test for the original series.
A stationary time series must have a Test Statistic smaller than the Critical Values, and this is not the
case here: looking at the Dickey-Fuller test of Figure 4.3, it is possible to see that the Test Statistic
is higher than the Critical Values. To solve this problem, two things have to be done: stabilize the
variance, and stabilize the mean. To stabilize the variance, a logarithmic transformation is applied, and
the first differences are calculated in order to make the mean constant. After these transformations, a
second Dickey-Fuller test is conducted, Figure 4.4, to see if the series is now stationary. The Dickey-
Fuller test of Figure 4.4 confirms that, after the transformations, the series is stationary, since the Test
Statistic is lower than the Critical Values, and the ARIMA model can now be applied.
Figure 4.4: Results of Dickey-Fuller Test for the stationary series.
The next step is to find the best ARIMA order to predict daily, weekly and monthly prices. All three
approaches use the same stationary time series, so the step above is done only once.
For the daily forecast, 251 days are predicted and, for each prediction, a strategy is calculated. In
order to find the best combination of parameters, all combinations of ARIMA(p,d,q) were tested, with p
and q ranging between 0 and 5 and d ranging between 0 and 2. In total, 50 executions were
made in order to find the best combination based on the lowest MSE. The result of this brute search
is ARIMA(2,1,0), with two autoregressive terms, one order of differencing (as expected) and zero
moving average terms.
For the weekly forecast, 49 points are predicted and, for each prediction, a strategy is calculated.
When forecasting more than one value ahead with ARIMA, the number of out-of-sample points is set
by the "steps" parameter of the ARIMAResults.forecast() function. In the weekly situation, the number of
steps is 5, and the output of the forecast function is the 5th out-of-sample point. In order to find the best
combination of parameters for the weekly forecast, again all combinations of ARIMA(p,d,q) were tested,
with p and q ranging between 0 and 5 and d ranging between 0 and 2. In total, 50 executions were
made in order to find the best combination based on the lowest MSE; this time, the MSE is relative to the
weekly forecast values. The result of this brute search is ARIMA(0,1,3), with zero autoregressive
terms, one order of differencing (as expected) and three moving average terms.
Finally, the monthly forecast is a prediction for the next 22 days, so in the end 11 points are predicted
and, for each prediction, a strategy is calculated. In the monthly situation, the number of steps is 22 and
the output of the forecast function is the 22nd out-of-sample point. In order to find the best combination
of parameters for the monthly forecast, all combinations of ARIMA(p,d,q) were tested, with p and q
ranging between 0 and 5 and d ranging between 0 and 2. In total, 50 executions were made in
order to find the best combination based on the lowest MSE; this time, the MSE is relative to the monthly
forecast values. The result of this brute search is ARIMA(0,1,3), with zero autoregressive terms,
one order of differencing (as expected) and three moving average terms.
The ARIMA performance results for forecasting daily, weekly and monthly future prices of a clear
trend stock, in this case the VRSN stock, are detailed in Table 4.2.
Table 4.2: ARIMA results for a clear trend stock.

Forecast   MAE     ROI     Sharpe Ratio   Accuracy
Daily      0.695   34.5%   1.906          57.8%
Weekly     1.256   36.7%   2.752          67.3%
Monthly    2.025   40.7%   3.556          90.9%
The best results in terms of error correspond to the daily forecast, while the best returns,
Sharpe ratio and accuracy correspond to the monthly forecast.
Starting with the mean absolute error, it is reasonable that the error should be lower for a daily
forecast, since all previous time steps are known and real. On the other hand, when forecasting more
than one day ahead, there is a gap between the last real known price and the predicted price. This gap
is 5 days in a weekly forecast and 22 days in a monthly forecast, so it is reasonable to conclude that
multi-step-ahead forecasting reduces the quality of the predictions, increasing the deviation from the
actual values, which is reflected in a greater MAE.
Concerning the returns, as explained in the metrics section of the background, they are calculated
based on a simple strategy that relies on the forecasted values. Briefly, if the predicted price is higher
than the last known value, the strategy is "going long" on that share; if the predicted price is lower
than the last known value, the strategy is "going short" on that share. It is important to compare these
returns with the B&H strategy to frame the results in a fair context. The B&H strategy for this
specific stock gives a return of 31.2% over the same period as the test period of this work. ARIMA gives
better results in all three cases, with returns above 34%, even though the implemented strategy is
very simplistic. This means that, after one year, a trader entering the market with $1000 and using this
strategy based on the ARIMA predictions would end up with roughly $340 more. The returns are best for the
monthly forecast, reaching 40.7%. This is probably explained by the reduced number of trades
executed in a monthly forecast: only 11, compared to the 251 of the daily case. By reducing the
number of trades, even though the error is larger in this case, the returns are higher because
there are fewer opportunities for the strategy to fail. The same happens for the weekly forecast.
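The long/short rule and its compounded return can be sketched as follows. This is a toy illustration of the simple strategy described above; position sizing and transaction costs are ignored, and the three trades are invented numbers:

```python
import numpy as np

def strategy_return(last_known, predicted, actual):
    """Go long when the predicted price is above the last known price,
    short otherwise; return the realized fractional gain of that trade."""
    move = (actual - last_known) / last_known
    return move if predicted > last_known else -move

# Toy example: three trades as (last price, prediction, outcome).
trades = [(100.0, 103.0, 102.0),   # long, price rose  -> gain
          (102.0,  99.0, 101.0),   # short, price fell -> gain
          (101.0, 104.0, 100.0)]   # long, price fell  -> loss

returns = [strategy_return(*t) for t in trades]
roi = float(np.prod([1 + r for r in returns]) - 1)  # compounded ROI
print(round(roi, 4))  # -> 0.0198, i.e. roughly +2%
```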
For the same reason, the accuracy of the monthly forecast is higher, since there are fewer opportunities
for the strategy to fail. The accuracy divides the number of right guesses by the total number of
trades, and in the monthly multi-step forecast the number of trades is reduced to 11.
Sometimes there are strategies that lead to very optimistic returns but carry very high risk. This
is not ideal, which is why hedge funds often want to maximize return while minimizing
risk, sometimes preferring strategies that lead to lower profits but have a more comfortable risk. In this
context, the Sharpe ratio should be analyzed, since it reflects the trade-off between return and risk. In this
case, the Sharpe ratio follows the returns and has very good values in all three ranges: more than 1 is
already considered good, and more than 2 is considered very good, confirming that the ARIMA
results are indeed good results.
4.1.2 Sideways Stock
This subsection describes the ARIMA performance with a sideways stock, the Franklin Templeton
Investments stock (BEN). This series does not seem to be stationary, since it does not show a constant mean
and variance. The results of the Dickey-Fuller test are described in Figure 4.5. Although the Test Statistic
value is not as high as the one for the VRSN stock, it is still greater than the Critical
Values.
Figure 4.5: Results of Dickey-Fuller Test of the original series.
In order to have a stationary time series, both the variance and the mean should be stabilized. To do so, a
logarithmic transformation is applied and the first differences are calculated. After these transformations,
a second test, Figure 4.6, is conducted to see if the series became stationary. The Test Statistic is now
much lower than the Critical Values. Compared with the clear trend stock, this stock shows stronger signs
of stationarity according to the Dickey-Fuller test.
The next step is to find the best ARIMA order to predict daily, weekly and monthly prices using this
sideways stock. All three approaches use the same stationary time series.
Figure 4.6: Results of Dickey-Fuller Test for the stationary series.

For the daily forecast, 251 days are predicted and, for each prediction, a strategy is calculated. In
order to find the best combination of parameters, all combinations of ARIMA(p,d,q) were tested, with p
and q ranging between 0 and 5 and d ranging between 0 and 2. In total, 50 executions were
made in order to find the best combination based on the lowest MSE. The result of this brute search
is ARIMA(3,1,3), with three autoregressive terms, one order of differencing (as expected) and three
moving average terms.
For the weekly forecast, 49 points are predicted and, for each prediction, a strategy is calculated.
In order to find the best combination of parameters for the weekly forecast, again all combinations of
ARIMA(p,d,q) were tested, with p and q ranging between 0 and 5 and d ranging between 0 and
2. In total, 50 executions were made in order to find the best combination based on the lowest MSE;
this time, the MSE is relative to the weekly forecast values. The result of this brute search is
ARIMA(0,1,1), with zero autoregressive terms, one order of differencing (as expected) and one moving
average term.
Finally, the monthly forecast is a prediction for the next 22 days, so in the end 11 points are predicted
and, for each prediction, a strategy is calculated. In order to find the best combination of parameters
for the monthly forecast, all combinations of ARIMA(p,d,q) were tested, with p and q ranging
between 0 and 5 and d ranging between 0 and 2. In total, 50 executions were made in order to find
the best combination based on the lowest MSE; this time, the MSE is relative to the monthly forecast
values. The result of this brute search is ARIMA(0,1,1), with zero autoregressive terms, one order
of differencing (as expected) and one moving average term.
The ARIMA performance results for forecasting daily, weekly and monthly future prices of a sideways
stock, in this case the BEN stock, are detailed in Table 4.3.
Table 4.3: ARIMA results for a sideways stock.

Forecast   MAE     ROI      Sharpe Ratio   Accuracy
Daily      0.362   -29.5%   -2.034         44.6%
Weekly     0.776   -4.7%    -0.318         44.9%
Monthly    2.230   -15.2%   -0.908         36.4%
Analyzing only the error values, it seems that the predictions are very good and close to the real values.
The error increases when forecasting weekly and monthly prices, but the values still do not deviate
dramatically from the real ones.
Although this information may give an optimistic perception of the ARIMA performance on a sideways
stock, the return values do not show the same optimism. For the daily forecast, the return on
investment is actually negative, reaching -29.5%, which corresponds to a huge loss of money: for example, a
trader entering the market with $1000 and using the ARIMA daily predictions to decide a strategy would lose
approximately $300 by the end of one year. While also negative, the weekly and monthly forecasts
give higher ROI values, -4.7% and -15.2% respectively. Comparing these results with the B&H
strategy, which gives a return of -0.74% for this stock over the same period as the test period, the ARIMA
returns do not show any improvement over that strategy, always leading to a higher loss of money.
The weekly forecast gives better results than the other two approaches but, taking into account the daily
results, it is possible that the weekly results only seem better because very few points are predicted.
In fact, the dispersion of returns for a sideways stock, also known as volatility, is much higher
than the dispersion of returns for a clear trend stock. Commonly, the higher the volatility, the riskier the
investment. This fact is reflected in the Sharpe ratio values, which are very low, since a good value should
be 1 or greater; here the Sharpe ratio is negative in all situations. This is due to the high volatility
of the stock, which makes the investment a riskier move.
The accuracy of the strategy based on the ARIMA predictions for this sideways stock does not reach
very high values, never exceeding 50%. An accuracy below 50% means that the algorithm
fails more often than it hits, so it would almost be better not to follow the strategy at all.
4.1.3 ARIMA performance conclusion
In this subsection, the obtained results are compared to each other and a conclusion is presented about
the ARIMA performance. For a better comparison, the implemented algorithm is also assessed against
some related works on ARIMA models. The merged results for the two stocks, VRSN
and BEN, are presented in Table 4.4.
Table 4.4: ARIMA Performance.

                Clear Trend                 Sideways
           Daily    Weekly   Monthly   Daily    Weekly   Monthly
MAE        0.695    1.256    2.025     0.362    0.762    2.230
ROI        34.5%    36.7%    40.7%     -29.5%   -4.7%    -15.2%
SR         1.906    2.752    3.556     -2.034   -0.318   -0.908
Accuracy   57.8%    67.3%    81.8%     44.6%    44.9%    36.4%
The best set of results corresponds to the monthly forecast for the clear trend stock, VRSN.
On average, the ARIMA results for the clear trend stock are better than the ones for the sideways stock.
Chan et al. [7] also stated that the ARIMA model does not fit well at the beginning of a downward/upward
period, and that it should be used when a clear trend is shown, such as in the VRSN stock. Although this
appears to be true, the errors presented for the sideways stock are still good. The same cannot be
said of the sideways stock returns: in the clear trend situation the returns always exceed the one
obtained by the B&H strategy, while for the sideways stock all the returns are exceeded by the
B&H strategy.
One of the works referenced in the related work section, conducted by Rounaghi et al. [9], tries to
forecast the S&P 500 and the London Stock Exchange with ARIMA using data between 2007 and 2013.
That work presents very small MAE values, reaching 0.0283 for the monthly
forecast of the S&P 500 index, but does not show any metrics related to returns, risk or accuracy. Vantuch
et al. [10] predicted future prices of Microsoft shares and none of the predictions had errors below
0.5, with the majority of the errors above 3, reaching values of 6 and 7. Again, these predictions are
not applied to any strategy, so one cannot get a sense of how much profit or loss these predictions
could lead to, but the MAE results of this thesis are better (lower) than the ones obtained by Vantuch et
al. [10].
Concluding, ARIMA performs better on a clear trend stock, reaching very good profit and accuracy
results. When applied to a sideways stock, it behaves well considering only the error values, but very poorly
considering the returns, since the model is harder to fit when there are downward/upward moves. The MAE
values obtained are better than those of the work conducted by Vantuch et al. [10].
4.2 K-Nearest Neighbors Performance
In this section, the K-Nearest Neighbors algorithm is applied to the same two stocks used to evaluate
the ARIMA performance. While the ARIMA model uses all the previous data as input to predict
future prices, machine learning algorithms, and specifically KNN, work differently. K-Nearest
Neighbors uses supervised learning, so the first thing to do is to reframe the data into a
(features, target) format. Four numbers of features are tested: 5, 10, 15 and 22. These values
correspond roughly to one, two, three and four weeks of prices that characterize the target. No
more than 22 features are tested, since it is assumed that a price is more influenced by the prices that
are closer to it. The number of features is also referred to throughout this work as the window width.
After this reformulation, the algorithm uses only a set of previous values (targets) to predict the next
price. The number of previous prices that KNN uses is the number of neighbors, represented as
"K". It is important not to confuse the number of features with the number of neighbors. Features are
simply the prices that describe one price; for example, today's price is described by the last five prices.
The number of neighbors is the number of past prices (targets) that KNN uses to predict the next price.
Besides these steps, the data is divided in the same proportion as in the ARIMA implementation:
80% for the training phase and the remaining 20% for the test period.
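The (features, target) reframing described above amounts to sliding a window over the price sequence. A minimal sketch, with invented prices and a window width of 5 (one of the tested widths):

```python
import numpy as np

def reframe(prices, window):
    """Turn a price sequence into (features, target) pairs: each target
    price is described by the `window` prices that precede it."""
    X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
    y = np.array(prices[window:])
    return X, y

prices = [10.0, 10.5, 10.2, 10.8, 11.0, 11.3, 11.1, 11.6]
X, y = reframe(prices, window=5)  # one week of prices per target
print(X.shape, y.shape)  # -> (3, 5) (3,)
print(X[0], y[0])        # the first 5 prices describe the 6th price, 11.3
```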
4.2.1 Clear Trend
This subsection presents the KNN performance when a clear trend stock is used to forecast future
prices.
For the clear trend stock, KNN is used to forecast daily, weekly and monthly prices. For each of these
three options, the algorithm is tested using different values for the window width (number of features):
5, 10, 15 and 22. For each of these window widths, all values of K between
1 and 50 are tested using a grid search with 10-Fold cross-validation. This K parameter of the KNN
algorithm is the number of previous prices (neighbors) used to predict the next day's price, which is why
it was said previously that KNN, contrary to the ARIMA models, does not use all the past prices as
input to the forecast function. These neighbors do not all have the same importance and influence:
the nearest neighbors carry a heavier weight.
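The K search can be sketched with scikit-learn as below; synthetic data stands in for the stock prices, and weights="distance" implements the heavier weighting of closer neighbors:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(5)
prices = np.cumsum(rng.normal(0.05, 0.5, 300)) + 50  # synthetic price series

# Reframe with a window width of 5 features (one trading week).
window = 5
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]

# Grid search over K = 1..50 with 10-fold cross-validation, scored by
# MSE; weights="distance" gives nearer neighbors a heavier vote.
search = GridSearchCV(
    KNeighborsRegressor(weights="distance"),
    {"n_neighbors": list(range(1, 51))},
    cv=10,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best K:", search.best_params_["n_neighbors"])
```

In the full procedure, this search would be repeated for each of the four window widths, keeping the (window, K) pair with the lowest cross-validated MSE.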
For the daily forecast, the optimal number of features and neighbors found during the training
phase was 5 features and 23 neighbors. For the weekly forecast, the best model found used 5
features and 28 neighbors. Finally, for the monthly forecast, the model resulting from the training phase
had 5 features and 29 neighbors. The results obtained by these three models are described in Table
4.5.
Table 4.5: K-Nearest Neighbors results for a clear trend stock.
Forecast MAE ROI Sharpe Ratio Accuracy
Daily 10.838 -8.3% -0.591 44.6%
Weekly 12.073 -16.6% -1.568 40%
Monthly 16.080 -24.3% -2.754 18.2%
It is clear that KNN does not produce very good results when forecasting this clear trend stock. The errors
for the daily, weekly and monthly forecasts are very high and there are no positive returns. Even though
the daily forecast gives the best return, its value is far below the B&H strategy result of 32.2%. The
weekly and monthly forecasts, both using 5 features, are negative, generating losses for a
trader who follows the strategy based on these predictions. Also, the Sharpe ratio is very low, and the
highest Sharpe ratio, obtained for the daily forecast, does not correspond to a good return. The accuracies
are all below 50%, which is very low for common accuracy values. One of the reasons behind these poor
results is the behavior of this particular stock: the clear trend stock, VRSN, is almost always increasing
in price. KNN takes the average of the nearest neighbors, and if the prices are mostly increasing,
this average will most of the time correspond to a price lower than the actual price, leading to a
wrong strategy and, consequently, to a negative return.
4.2.2 Sideways stock
This subsection presents the results of KNN for a sideways stock, meaning a stock whose uptrends
and downtrends are mixed together.
For the sideways stock, BEN, KNN is also used to forecast daily, weekly and monthly
prices. Again, for each of these three options, the algorithm is tested using different values for the
window width: 5, 10, 15 and 22 features. For each of these window widths, all
values of K between 1 and 50 are tested using a grid search with 10-Fold cross-validation.
For the daily forecast, the optimal number of features and neighbors found during the training
phase was 5 features and 16 neighbors. For the weekly forecast, the best model found used 5
features and 29 neighbors. Finally, for the monthly forecast, the model resulting from the training phase
had 5 features and 22 neighbors. Weights are assigned to the chosen neighbors so that the nearest
neighbors contribute more to the average than the more distant ones. The results obtained by these
three models are described in Table 4.6.
Table 4.6: K-Nearest Neighbors results for a sideways stock.
Forecast MAE ROI Sharpe Ratio Accuracy
Daily 0.810 -28.4% -1.704 44.6%
Weekly 3.378 -1.5% -0.082 40%
Monthly 5.574 -15.2% -0.908 36.4%
For the sideways stock, the lowest error value corresponds to the daily forecast. The errors
grow with the forecast range, exceeding 5 for the monthly forecast. The errors are not
ideal, even though they are not that bad.
The returns are all negative and inferior to the B&H strategy, which gives a return of -0.74%
for this specific stock over this specific period. The daily return is the lowest one, even though it corresponds
to the situation with the lowest error value, proving again that a good error value does not imply
a great return.
All the investments represent a very high risk, since the Sharpe ratios are all negative. The least risky
investment is the weekly forecast.
In general, the accuracies do not present very high results, all being below 50%, with the monthly
accuracy being the lowest.
4.2.3 KNN performance conclusion
In this subsection, the KNN performance for the clear trend and the sideways stocks is compared and
analyzed in order to draw some conclusions about the KNN behavior. The algorithm is also compared to
the state of the art in order to give context to the presented values. The KNN results for the two stocks are
described in Table 4.7.
Table 4.7: KNN performance.

                Clear Trend                 Sideways
           Daily    Weekly   Monthly   Daily    Weekly   Monthly
MAE        10.838   12.073   16.080    0.810    3.378    5.574
ROI        -8.3%    -16.6%   -24.3%    -28.4%   -1.5%    -15.2%
SR         -0.591   -1.568   -2.754    -1.704   -0.082   -0.908
Accuracy   44.6%    40%      18.2%     44.6%    40%      36.4%
It is clear that the results obtained for the sideways stock are better than the ones obtained for the
clear trend stock. Starting with the errors, the clear trend forecast presents very high MAE values, the
lowest being above 10. The errors for the sideways stock are always below 6, which is very
low compared with the clear trend errors. The error increases with the forecast range for both
stocks, since KNN starts to use neighbors in its prediction function that are not actual values but
predictions themselves.
The returns are also better for the sideways stock, even though they are all below the
B&H strategy. It is important to remember that the B&H strategy gives 32.2% for the clear trend stock and
-0.74% for the sideways stock. It is also relevant to note that, even though the returns are a valid evaluation
metric for this work, they are calculated based on a very simplistic strategy.
The Sharpe ratio is not considered good in any of the cases, because it is always below 1, meaning
that in both stock situations the investments were risky. Finally, considering the accuracies, the values
are very low, all being below 50%.
Overall, KNN did not show very good results in the forecasting task, but it performs better
on a stock that does not have a clear trend. The reason is that KNN takes the weighted
average of the K nearest neighbors, and when a price is constantly increasing this average will always be
lower than the nearest neighbor and, consequently, lower than the actual price for the next day. On the
other hand, for a stock with higher volatility, it is easier to approximate the next day's price,
since the prices keep moving up and down.
It is also important to observe that the results presented in Table 4.7 correspond to specific
models, with the number of features and neighbors obtained during the training period.
For example, for the daily forecast of the clear trend stock, the presented results come from the
best combination of the number of features and the number of neighbors, in this case 5 features and
23 neighbors. All the situations had an optimal number of features of 5, meaning that it is not necessary
to use many features to find the optimal model. It is also curious to observe that the number of
neighbors is never higher than 29, meaning that minimizing the error does not require a large number
of neighbors.
Chen and Hao [17] also used KNN to predict future stock prices and presented their results using
the mean absolute percentage error (MAPE). MAPE can be problematic, since it can cause
division-by-zero errors, and it is not used as an evaluation metric throughout this work. However, in this
specific case, the MAPE is calculated so that the results obtained by the weighted KNN implemented in
this work can be compared with the one implemented by Chen and Hao [17]. The MAPE results
for the implemented KNN are shown in Table 4.8.
Table 4.8: KNN MAPE.

           Clear Trend                 Sideways
       Daily   Weekly   Monthly   Daily   Weekly   Monthly
MAPE   -       0.218    1.547     -       0.153    1.258
In Table 4.8, the entries with "-" are the ones with a division-by-zero error. Chen and Hao [17] obtained
a MAPE of 0.18 for a daily forecast and 0.22 for a weekly forecast, with no values presented for
the monthly forecast, probably because of the division-by-zero error. The only possible comparison is
then for the weekly forecast, and the values obtained in this work are better than
the ones obtained by Chen and Hao [17]. Even though the data sets are not the same, this comparison
gives an idea of how the KNN performed in this work compared to the other. Dash and Dash [22] also used
KNN, but in its classification form: it was used to generate buy and sell signals and led to very high
profits of 30% on the BSE SENSEX data set. In this case it is not possible to compare, since in this work
KNN was used as a regressor.
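For reference, the MAPE used in the comparison above can be sketched with an explicit guard for the division-by-zero cases marked "-" in Table 4.8 (the guard and the toy inputs are illustration choices, not the thesis implementation):

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error; returns None when any actual
    value is zero, the division-by-zero case marked "-" in Table 4.8."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if np.any(actual == 0):
        return None
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

print(mape([100.0, 102.0, 101.0], [101.0, 101.0, 101.0]))  # small error
print(mape([0.0, 102.0], [1.0, 101.0]))                    # None: zero actual
```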
4.3 Support Vector Regression Performance
In this section, the Support Vector Regression algorithm is tested on a 1-year test data set for daily,
weekly and monthly forecasts. Like K-Nearest Neighbors, SVR uses supervised learning, so the
price sequence has to be reframed into a features and targets format. Again, the algorithm is tested
with 5, 10, 15 and 22 features. The SVR is trained for each of these window widths, and the best
hyper-parameters are calculated with a grid search using 10-Fold cross-validation. The best combination
of hyper-parameters is chosen based on the lowest MSE and, in the end, among the best sets of
hyper-parameters for each of the window widths, the number of features with the lowest MSE is the
window width used during the test period.
In SVR there are three hyper-parameters to tune: the kernel, gamma and the C parameter.
The parameter ε is set to 0.1, its default value. The kernel can be linear, polynomial or
Gaussian, also known as radial basis function (RBF). Contrary to the linear and polynomial kernels, which
are considered parametric models, the RBF kernel is non-parametric, with a potentially infinite complexity
that can grow with the data, allowing it to represent more complex relations and outperform the parametric kernels.
Even though the RBF kernel is promising, sometimes the linear and polynomial kernels give better
results, so it is always advisable to test the three options. The soft-margin parameter, C, can take the values
[1, 10, 100, 1000], and the kernel parameter, gamma, can take the values [0.1, 0.01, 0.001, 0.0001] if the
kernel is RBF. The parameters are calculated using a grid search with 10-Fold cross-validation.
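The hyper-parameter search just described can be sketched with scikit-learn. Synthetic small-scale data stands in for the reframed price series; the C, gamma and kernel grids match the values stated above, with epsilon left at its 0.1 default:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(6)
# Synthetic log-price-like series (small scale keeps the kernels stable).
prices = np.cumsum(rng.normal(0.001, 0.01, 200))

window = 5  # one of the tested window widths (5, 10, 15, 22)
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]

# Grid over the three kernels, C in [1, 10, 100, 1000] and, for the RBF
# kernel only, gamma in [0.1, 0.01, 0.001, 0.0001]; epsilon stays at 0.1.
param_grid = [
    {"kernel": ["linear", "poly"], "C": [1, 10, 100, 1000]},
    {"kernel": ["rbf"], "C": [1, 10, 100, 1000],
     "gamma": [0.1, 0.01, 0.001, 0.0001]},
]
search = GridSearchCV(SVR(epsilon=0.1), param_grid, cv=10,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print("best hyper-parameters:", search.best_params_)
```

As with KNN, the full procedure repeats this search for each window width and keeps the combination with the lowest cross-validated MSE.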
The data is divided in the same proportion as in the two previous implementations: 80% for
the training phase and the remaining 20% for the test period. The algorithm is applied to the same stocks
as ARIMA and KNN: a clear trend stock and a sideways stock.
4.3.1 Clear Trend
This subsection presents the SVR performance on a clear trend stock for daily, weekly and monthly forecasts.
During the training phase, the C and gamma parameters are optimized within the ranges referenced
before.
For the daily forecast, the best results were obtained using 22 features, a soft-margin of 100 and
a polynomial kernel. For the weekly forecast, the optimal number of features is 15, the parameter C is
1000 and the kernel is polynomial. Finally, for the monthly forecast, the best results were obtained with
5 features, a C equal to 100 and an RBF kernel with gamma 0.01. The results obtained by the
Support Vector Regression algorithm on the clear trend stock are described in Table 4.9.
Table 4.9: Support Vector Regression results for a clear trend stock.
Forecast MAE ROI Sharpe Ratio Accuracy
Daily 4.552 23.41% 1.368 57.8%
Weekly 2.463 8.7% 0.725 53.1%
Monthly 13.004 -23.4% -2.166 27.3%
The results of SVR for this clear trend stock are not very optimistic. The errors are high for all forecast ranges. The returns are all below those of the B&H strategy, and the monthly return is even negative. The Sharpe ratio is higher than 1 in the daily forecast, representing a less risky investment, but the daily return of 23.41% is low compared to the 33.2% of B&H. The accuracy of the daily case is good compared to the others. The monthly accuracy is very low, so those values should perhaps not be taken into account. The monthly forecast is also the only one that uses an RBF kernel, since the daily and weekly cases use polynomial kernels.
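For reference, the B&H (buy-and-hold) baseline used in these comparisons simply buys at the first test-period price and sells at the last. The prices below are hypothetical figures chosen only to reproduce a 33.2% return, not the actual stock data.

```python
# Buy-and-hold baseline: one purchase at the start, one sale at the end.
first_price, last_price = 100.0, 133.2     # illustrative prices
bh_return = (last_price - first_price) / first_price
print(f"{bh_return:.1%}")                  # → 33.2%
```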
4.3.2 Sideways Stock
This subsection presents the SVR performance on the sideways stock for a daily, weekly and monthly forecast. During the training phase, the C and gamma parameters are optimized within the ranges referenced above.
For the daily forecast, the best results were obtained using 10 features, a soft-margin of 1000, and a polynomial kernel. In the weekly forecast, the optimal number of features is 5, the parameter C is 100 and the kernel is again polynomial. Finally, for the monthly forecast, the best results were obtained with 15 features, a C equal to 10, and again an RBF kernel with a gamma of 0.01. The results obtained by the Support Vector Regression algorithm in the sideways stock are described in Table 4.10.
Table 4.10: Support Vector Regression results for a sideways stock.
Forecast MAE ROI Sharpe Ratio Accuracy
Daily 0.807 1.9% 0.101 51.8%
Weekly 1.422 -4.4% -0.227 50%
Monthly 3.662 -1.5% -0.078 63.6%
Considering the errors, the daily forecast has an acceptable MAE, which increases in the weekly and again in the monthly forecast. The returns are also acceptable taking into account that this is a sideways stock and the B&H strategy for it yields -0.74%. The weekly and monthly returns are lower than -0.74%, with negative values of -4.4% and -1.5%. The Sharpe ratio indicates that all the investments are risky, since they are all less than 1. Only the monthly accuracy takes a good value, being higher than 60%.
4.3.3 Support Vector Regression performance conclusions
In this subsection, the SVR performance for the clear trend and the sideways stocks is compared and analyzed in order to draw some conclusions about the SVR behavior. The results are also compared with the works referenced in Section 2.1. The SVR results for the two stocks are described in Table 4.11.
Table 4.11: Support Vector Regression performance.

                 Clear Trend                  Sideways
           Daily   Weekly  Monthly    Daily   Weekly  Monthly
MAE        4.552   2.463   13.004     0.807   1.422    3.662
ROI        23.4%   8.7%    -23.4%     1.9%    -4.4%   -1.5%
SR         1.368   0.725   -2.166     0.101   -0.227  -0.078
Accuracy   57.8%   53.1%    27.3%     51.8%   50%      63.6%
The Support Vector Regression is clearly superior when used to forecast the sideways stock. Starting with the errors, the clear trend stock presents very high MAE values, whereas the sideways errors, although not ideal, have a considerably lower average value. In the clear trend stock, the returns never exceed the B&H return of 33.2%, while in the sideways stock the daily forecast exceeded the -0.74% of the B&H strategy. On average, the Sharpe ratios of the sideways stock are lower than those calculated for the clear trend investments, meaning it is riskier to invest in a stock with constant up and down moves. Four accuracies are higher than or equal to 50% and one is considerably low: the monthly clear trend forecast.
Concluding, the SVR behaves better for a sideways stock than for a clear trend stock. This may be due to the complexity of the model, which can perceive complex relations in the data while sometimes failing on the more linear ones.
Tay and Cao [15] also applied SVMs to financial time series forecasting. A true and fair comparison cannot be made, since they did not use the same data set or the same evaluation metrics, being concerned only with error measures. The only metric that appears both in their work and in this thesis is the mean absolute error (MAE). Even so, it is interesting to observe their conclusions to contextualize the results obtained here. Considering that only their best results are presented, the researchers obtained MAE values between 0.2361 and 0.4105. These values are very good compared to the ones obtained here, and only the daily forecast for the sideways stock is close to Tay and Cao's results [15].
4.4 ARIMA vs. KNN vs. SVR
This section presents a comparison between ARIMA, K-Nearest Neighbors and Support Vector Regres-
sion applied to daily, weekly and monthly forecast of a stock. The three algorithms are compared based
on the same metrics: mean absolute error, return on investment, Sharpe ratio and accuracy. Before comparing the three implemented methods, it is important to understand that the four metrics should not be interpreted in the same way. The mean absolute error (MAE) concerns the precision of the prediction and its deviation from the real values, so a reader interested in precise forecasts should analyze the MAE. On the other hand, the return on investment (ROI), the Sharpe ratio (SR) and the accuracy are more concerned with the utility of the predicted values, since the algorithms are applied to financial data. These three metrics reflect the results of a very simple strategy based on the obtained predictions, so a reader interested in defining strategies to invest in the stock market should look at these metrics.
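To make the four metrics concrete, the following sketch computes them on toy data. The long-only rule (hold the stock only when the predicted move is upward) is an assumption standing in for the thesis's simple strategy, and the annualization factor assumes daily data.

```python
import numpy as np

actual = np.array([10.0, 10.5, 10.2, 10.8, 11.0])     # real closing prices
predicted = np.array([10.1, 10.4, 10.4, 10.6, 11.2])  # model forecasts

# Precision metric: mean absolute error between forecast and reality.
mae = np.mean(np.abs(predicted - actual))

# Utility metrics, based on a simple long-only rule: hold the stock on
# days when the predicted move is upward.
real_ret = np.diff(actual) / actual[:-1]       # realised daily returns
signal = np.sign(np.diff(predicted))           # predicted direction
strat_ret = np.where(signal > 0, real_ret, 0.0)

roi = np.prod(1.0 + strat_ret) - 1.0           # compounded return
sharpe = np.mean(strat_ret) / np.std(strat_ret) * np.sqrt(252)  # annualised
accuracy = np.mean(np.sign(np.diff(actual)) == signal)

print(round(mae, 2), round(accuracy, 2))       # → 0.16 0.75
```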
4.4.1 Clear Trend Stock
The performance of the three algorithms is detailed in Table 4.12 and a thorough comparison is conducted below.
Table 4.12: ARIMA vs. KNN vs. SVR in a Clear Trend Stock.

                    Daily    Weekly   Monthly
MAE       ARIMA     0.695    1.256     2.025
          KNN      10.838   12.073    16.080
          SVR       4.552    2.463    13.004
ROI       ARIMA     34.5%    36.7%     40.7%
          KNN       -8.3%   -16.6%    -11.9%
          SVR       23.4%     8.7%    -23.4%
SR        ARIMA     1.906    2.752     3.556
          KNN      -0.591   -1.568    -2.754
          SVR       1.368    0.725    -2.166
Accuracy  ARIMA     57.8%    67.3%     90.9%
          KNN       44.6%    40%       18.2%
          SVR       57.8%    53.1%     27.3%
It is clear that one model stands out: the ARIMA model. ARIMA has the lowest MAE for the daily, weekly and monthly forecasts, and this difference becomes more pronounced as the forecast range increases. The highest MAE belongs to KNN in the monthly range, at 16.080. In general, the KNN errors for this stock are very high. The SVR also has high MAE values, although not as high as the KNN ones. In general, the error grows with the forecast range, being highest for the monthly forecast.
To better understand the magnitude of the errors obtained by the three algorithms, the daily predictions made by each of them are compared against the actual values of the test data set. Figure 4.7 illustrates this comparison.
The model closest to the actual price is the ARIMA model, as expected since it is the model with the lowest daily MAE for the clear trend, 0.695. Regarding KNN, at the beginning of the test period the algorithm seems to fit the data, but it quickly starts to output bad results. The same happens with SVR. The SVR is able to more or less predict the volatility of the prices but not the actual values, staying slightly above them, with a MAE of 4.552.
Figure 4.7: Comparison of the three algorithms in a clear trend stock.
Considering the returns, the comparison baseline is the B&H strategy, which gives a return of 33.2% for this specific stock. This value is exceeded only by the ARIMA forecasts, in all ranges, and
KNN and SVR give lower values in all the forecast ranges. Even though these two algorithms do not give results as good as the ARIMA model's, the SVR is superior to the KNN in terms of returns.
The Sharpe ratio reflects a good strategy for all ARIMA predictions and also for the daily SVR forecast, showing again that SVR performs better than the KNN.
The accuracies are very high in the ARIMA case and not so high for the two machine learning algorithms. Again, even though the SVR does not show good accuracies, it performed better than the KNN model, whose accuracies are all below 50%.
Concluding, the ARIMA model performs very well in a clear trend stock, outperforming the B&H strategy and also the two machine learning algorithms. The Support Vector Regression outperformed the KNN, taking into account all the evaluation metrics. The KNN is very weak in its predictions and seems to fit the data only at the beginning of the test period.
4.4.2 Sideways Stock
Table 4.13 presents the performance of the three algorithms in a sideways stock, the BEN stock, which does not show a clear up or down trend. Contrary to the clear trend stock, no model or algorithm stands out in this situation. The errors are similar among the three models, and on average they increase with the forecast range. The KNN is still the solution with the highest MAE values, and the lowest error is found again in the daily forecast of the ARIMA model. Figure 4.8 shows how far the predictions are from the real values; in other words, it illustrates the MAE values.
The ARIMA predictions are very close to the actual values, and the KNN performs much better than in the clear trend stock. The SVR does not show very precise values, and it is not obvious just by looking at Figure 4.8 whether KNN or SVR is more accurate. Only by looking at the MAE values is it possible to check that SVR outputs values closer to the actual prices.
Concerning the returns, the results may not seem very optimistic. In fact, this specific stock is very
Table 4.13: ARIMA vs. KNN vs. SVR in a Sideways Stock.

                    Daily    Weekly   Monthly
MAE       ARIMA     0.362    0.776     2.230
          KNN       0.810    3.378     5.574
          SVR       0.807    1.422     3.662
ROI       ARIMA    -29.5%    -4.7%    -15.2%
          KNN      -28.4%    -1.5%    -15.2%
          SVR        1.9%    -4.4%     -1.5%
SR        ARIMA    -2.034   -0.318    -0.908
          KNN      -1.704   -0.082    -0.908
          SVR       0.101   -0.227    -0.078
Accuracy  ARIMA     44.6%    44.9%     36.4%
          KNN       44.6%    40%       36.4%
          SVR       51.8%    50%       63.6%
Figure 4.8: Comparison of the three algorithms in a sideways stock.
volatile, with very inconstant behavior, making it more difficult to invest with good results. The B&H strategy gives a profit of -0.74%, exceeded only by the daily SVR forecast. In such volatile stocks, even when the errors are small, the returns can be very low, since it is difficult to predict whether the price will increase or decrease. ARIMA has the lowest return for the daily forecast, -29.5%.
The Sharpe ratio is lower than 1 in all situations, meaning that even the least negative returns represent a very risky investment. The accuracies are high for the SVR algorithm and very low for the KNN method. ARIMA exhibits average accuracy results, but all below 50%.
4.4.3 Overall comparison
Forecasting in the stock market is not an easy task, and forecast results can sometimes be dubious. Most works related to forecasting use only error metrics or only return/profit/accuracy metrics. When dealing with financial data, it is extremely important to look at both metric types and to find the appropriate trade-off between the two. For example, the daily ARIMA forecast applied to the sideways stock has a very low error of 0.362; evaluated alone, without the remaining context, it would seem a very good and precise result. Yet the returns for the same daily forecast are very low, being negative and almost -30%. This is just one example of why results should be analyzed carefully and in context.
To solve the problem of forecasting in the stock market, ARIMA gives results very close to the actual values, and its results are consistent along the test period. This model performs much better when applied to a clear trend stock, and it does not fit well when consistent up and down movements occur. When dealing with the clear trend stock, it outperforms the two machine learning algorithms. ARIMA can also give very high returns when used to build a strategy, well above the B&H values. ARIMA is also a simple, not very complex statistical algorithm, which is compensated by its low computational effort. Besides this, for the purposes of this work, only order values between 0 and 5 were tested, since the computational resources were limited.
The two machine learning algorithms did not perform as expected, being surpassed by the ARIMA model in the clear trend stock. Support Vector Regression performed better than the KNN model, mainly because KNN is a very simple, lazy algorithm with slow learning, while SVR is more complex and, with more complex kernels, can perceive complex relations in the data. Also, KNN is more commonly used in classification tasks, where feature scaling is very common and greatly helps the algorithm's performance.
Another clear conclusion, looking at the MAE values, is that as the forecast range increases, so does the error, in almost all situations. The results also prove that good error values do not mean good returns, and good returns do not mean a good investment, since the risk must be taken into account. For KNN, no optimal K value above 29 was found, indicating there is no need to try very high values of K. For SVR, the polynomial kernel is the one with the best results, even though the RBF kernel is the one most commonly used with complex data such as financial data.
Table 4.14 shows which algorithm obtained the best result for each of the two stocks in each of the four metrics.
Table 4.14: Best results for each stock.

                               ARIMA   KNN   SVR
Clear Trend Stock   MAE          x
                    ROI          x
                    SR           x
                    Accuracy     x             x
Sideways Stock      MAE          x
                    ROI                        x
                    SR                         x
                    Accuracy                   x
Both machine learning algorithms fell short of expectations. Looking at Figure 4.7 and Figure 4.8, it seems that both KNN and SVR fit the data much better at the beginning of the test period than at the end. This may be caused by the hyper-parameters becoming out of date, so that the model needs to be retrained. Taking this into consideration, the machine learning algorithms were retrained in order to compare the results and see whether retraining gives more precise values.
4.5 Studying the impact of retraining KNN and SVR
The introduction of a retraining period is due to the fact that, concerning the mean absolute errors, the two machine learning algorithms present good predictions at the beginning of the test period and their performance deteriorates over time. This observation leads to the suspicion that the algorithms may lose their predictive ability, since they were trained before the test period and their hyper-parameters can be out of date. It is important to give the models the opportunity to learn the price trend from the past, but it is also relevant that the algorithms use up-to-date information and are not influenced only by data located too far in the past.
Taking this into consideration, a retraining step is introduced for both KNN and SVR in order to check whether their results in terms of mean absolute error improve. This retraining step is incorporated during the test period, and the retraining period varies between 5, 10 and 15 trades. A retraining period of 5 means that after every 5 executed trades, the algorithm is retrained and new hyper-parameters are calculated and used until the next retrain.
This approach is applied to the KNN and SVR in both stocks for a daily forecast, and the results are compared with those obtained in Sections 4.2 and 4.3.
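The retraining scheme described above can be sketched as a walk-forward loop. The KNN regressor, the trending toy series, and the assumption that the realised price becomes available after each trade are illustrative stand-ins, not the exact thesis pipeline.

```python
# Walk-forward prediction with periodic retraining: the model is refit on
# all data seen so far after every `retrain_every` executed trades.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def walk_forward(X_tr, y_tr, X_te, y_te, retrain_every=5):
    X_hist, y_hist = list(X_tr), list(y_tr)
    model = KNeighborsRegressor(n_neighbors=3).fit(X_hist, y_hist)
    preds = []
    for i in range(len(X_te)):
        preds.append(float(model.predict(X_te[i:i + 1])[0]))
        X_hist.append(X_te[i])            # realised sample joins history
        y_hist.append(y_te[i])
        if (i + 1) % retrain_every == 0:  # refit every 5 trades
            model = KNeighborsRegressor(n_neighbors=3).fit(X_hist, y_hist)
    return np.array(preds)

X = np.arange(50, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()                       # a toy upward-trending target
X_tr, y_tr, X_te, y_te = X[:40], y[:40], X[40:], y[40:]

preds_retrain = walk_forward(X_tr, y_tr, X_te, y_te, retrain_every=5)
preds_static = KNeighborsRegressor(n_neighbors=3).fit(X_tr, y_tr).predict(X_te)

mae_retrain = np.mean(np.abs(preds_retrain - y_te))
mae_static = np.mean(np.abs(preds_static - y_te))
print(mae_retrain < mae_static)           # → True
```

On a trending series the static model keeps predicting from stale neighbors, while the retrained one follows the trend, which mirrors the kind of MAE improvement reported for KNN and SVR in this section.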
The new daily results of the KNN with a 5-period retraining for the clear trend stock are illustrated in Figure 4.9. The KNN MAE value dropped from 10.838 to 1.330.
Figure 4.9: KNN with Retraining for a Clear Trend Stock.
KNN had the most inadequate results relative to the actual values in the clear trend forecast, and retraining every 5 trades greatly improved its error and, consequently, the returns, Sharpe ratio and accuracy of the strategy based on the KNN predictions. These results were obtained for the same test period and number of features used in Section 4.2.
Even though the KNN had a better performance in the sideways stock, its results are also improved by the introduction of a retraining period, reducing the error and improving the returns, Sharpe ratio and accuracy of the algorithm. The optimal retraining period was again every 5 trades. The improvements in the daily results are illustrated in Figure 4.10. The KNN MAE value dropped from 0.810 to 0.545.
Figure 4.10: KNN with Retraining for a Sideways Stock.
The SVR was also well below expectations, and retraining was introduced during its test period, with the retraining period varying between 5, 10 and 15 trades. For the clear trend, the SVR predicted values higher than the actual prices, even though the shape of the output time series was similar. With the introduction of retraining every 5 trades, the SVR error improved considerably, but the same did not happen with the returns, Sharpe ratio and accuracy. The results are illustrated in Figure 4.11. The SVR MAE value dropped from 4.552 to 0.990.
Figure 4.11: SVR with Retraining for a Clear Trend Stock.
Concerning the sideways stock, the SVR errors are better than without retraining. Again, the optimal period is every 5 trades. The SVR MAE value dropped from 0.807 to 0.366. The results are illustrated in Figure 4.12.
Figure 4.12: SVR with Retraining for a Sideways Stock.
Concluding, retraining the machine learning algorithms is something to take into consideration when forecasting over a long test period, as it showed improvements in every case. The chosen period in all cases was every 5 trades. This may not work in all situations, since it increases the computational effort, and the period should be optimized for each algorithm in its specific context.
Chapter 5
Conclusions and Future Work
5.1 Conclusions
The objectives of this work were to study stock price sequences as time series and to introduce the use of forecasting to predict future prices. To this end, the goal was to implement one statistical model and two machine learning techniques and to compare the three of them when forecasting with daily, weekly and monthly ranges. Price sequences were used in all three cases in order to have a fair comparison, even though the machine learning algorithms are commonly used with technical indicators.
To complete the proposed goals, the chosen techniques were the ARIMA model, K-Nearest Neighbors, and Support Vector Regression: the ARIMA model is a statistical approach, K-Nearest Neighbors a simple machine learning algorithm, and Support Vector Regression a more complex and advanced machine learning technique. For each of the three algorithms, the hyper-parameters were optimized in order to have the best possible model for each situation, based on the lowest MSE. To compare the three solutions, a simple strategy was computed based on the forecasted values, and four metrics were used to evaluate both the prediction and the strategy: the mean absolute error, the return on investment, the Sharpe ratio, and the accuracy. The algorithms were tested on two different types of stocks: a clear trend stock and a sideways stock.
The best performance for the clear trend stock corresponds to the ARIMA model, which exceeds the B&H strategy with returns of 40.7%. For the sideways stock, the SVR was the one with the highest returns, even though they were not very good. The two machine learning algorithms demonstrated a good fit at the beginning of the test period, and their performance degraded over time. The introduction of a retraining period was tested in order to find out whether the results could be improved. With this introduction, both algorithms' results were better than the ones obtained before, proving the point that model retraining should be considered when forecasting over a long test period.
There are some points worth enumerating in order to summarize the conclusions reached:
1. The choice of the evaluation metrics is extremely important, and a reliable comparison between algorithms should not be conducted based on only one metric;
2. The ARIMA model has very good results in a clear trend stock, while in a sideways stock the same
does not happen;
3. K-Nearest Neighbors is a very simple algorithm that does not fit the stock data very well due to
the complexity of price moves and simplicity of the algorithm, being the weakest implemented
algorithm;
4. Support Vector Regression performs better than K-Nearest Neighbors in modeling and forecasting financial data, but it is exceeded by the ARIMA model in a clear trend stock;
5. Machine learning algorithms can lose their validity, and introducing retraining along the test period greatly improves the error results.
Concluding, forecasting is a very difficult task, even more so in the financial field, where prices can be so unpredictable. The evolution of a stock price can have multiple channels of influence, and in this work only past price sequences were used, probably impairing the results. Machine learning algorithms are used in the financial field more often as classifiers than as regressors, which made the task of forecasting a continuous value (the close price) more challenging. In the end, the three algorithms show very interesting results even though only price sequences were used as their input, making the point that forecasting future prices as continuous variables can be a very promising tool for investors and traders.
5.2 Future Work
For future work, the present thesis should be seen as a starting point in the forecasting of stock prices as continuous variables. To continue this work, the following approaches can be pursued:
1. For each of the implemented algorithms, conduct a deeper study of the influence of each of the hyper-parameters on each of the models;
2. Evaluate the models with more refined and complex strategies, since the implemented one was very simplistic and served only to gain an idea of how useful the predictions were;
3. Optimize the retraining periods, since only 3 periods were tested;
4. Integrate the solution into a Big Data platform in order to process more data more quickly;
5. Use different algorithms that can work as regressors to forecast future prices;
6. Combine price sequences with fundamental analysis in order to include more channels of influence.
Bibliography
[1] B. Marr. "A Short History of Machine Learning". Forbes, pages 1–2, 2016. URL http://www.forbes.com/sites/bernardmarr/2016/02/19/a-short-history-of-machine-learning-every-manager-should-read/#7eaad602323f.
[2] Investopedia.com. "Technical Analysis Tutorial". pages 1–42, 2010. URL https://www.investopedia.com/exam-guide/series-7/portfolio-management/technical-analysis.asp.
[3] G. E. P. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung. Time Series Analysis: Forecasting and Control, 5th Edition. 2015.
[4] T. G. Dietterich. "Machine learning in ecosystem informatics and sustainability". 2009. ISBN 9781577354260. doi: 10.1007/978-3-540-75488-6_2.
[5] A. Ng. "Lecture CS229: Machine Learning." Stanford University. 2011. URL http://cs229.stanford.edu.
[6] R. C. Steorts. "Lecture STA 325, Chapter 3.5 ISL - Comparison of Linear Regression with K-Nearest Neighbors." Duke University. URL http://www2.stat.duke.edu/~rcs46/lectures_2017/03-lr/03-knn.pdf.
[7] E. G. Chan, S. Fellow, P. H. Director, and S. Program. "Forecasting the S&P 500 Index Using Time Series Analysis and Simulation Methods". Submitted to the MIT Sloan School of Management and the School of Engineering, 2009.
[8] E. A. Gerlein, M. McGinnity, A. Belatreche, and S. Coleman. "Evaluating machine learning classification for financial trading: An empirical approach". Expert Systems with Applications, 54:193–207, 2016. ISSN 09574174. doi: 10.1016/j.eswa.2016.01.018. URL http://dx.doi.org/10.1016/j.eswa.2016.01.018.
[9] M. M. Rounaghi and F. Nassir Zadeh. "Investigation of market efficiency and Financial Stability between S&P 500 and London Stock Exchange: Monthly and yearly Forecasting of Time Series Stock Returns using ARMA model". Physica A: Statistical Mechanics and its Applications, 456:10–21, 2016. ISSN 03784371. doi: 10.1016/j.physa.2016.03.006. URL http://dx.doi.org/10.1016/j.physa.2016.03.006.
[10] T. Vantuch and I. Zelinka. "ECC 14 - Evolutionary Based ARIMA Models for Stock Price Forecasting". 2014. doi: 10.1007/978-3-319-10759-2_25. URL https://link.springer.com/content/pdf/10.1007%2F978-3-319-10759-2_25.pdf.
[11] J. Kamruzzamana and R. A. Sarkerb. "Comparing ANN Based Models with ARIMA for Prediction of Forex Rates". ASOR BULLETIN, 22(2):2–11, 2003. URL http://www.asor.org.au/publication/files/jun2003/Joarder.pdf.
[12] J. Mandziuk and P. Rajkiewicz. "Neuro-evolutionary system for FOREX trading". 2016 IEEE Congress on Evolutionary Computation, CEC 2016, pages 4654–4661, 2016. doi: 10.1109/CEC.2016.7744384.
[13] P. Yoo, M. Kim, and T. Jan. "Machine Learning Techniques and Use of Event Information for Stock Market Prediction: A Survey and Evaluation". International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC'06), 2:835–841, 2007. doi: 10.1109/CIMCA.2005.1631572. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1631572.
[14] K. J. Kim. "Financial time series forecasting using support vector machines". Neurocomputing, 55(1-2):307–319, 2003. ISSN 09252312. doi: 10.1016/S0925-2312(03)00372-2.
[15] L. Cao and F. E. H. Tay. "Application of support vector machines in financial time series forecasting". Omega, 29(4):309–317, 2001. ISSN 03050483. doi: 10.1016/S0305-0483(01)00026-3.
[16] W. H. Chen, J. Y. Shih, and S. Wu. "Comparison of support-vector machines and back propagation neural networks in forecasting the six major Asian stock markets". International Journal of Electronic Finance, 1(1):49, 2006. ISSN 1746-0069. doi: 10.1504/IJEF.2006.008837. URL http://www.inderscience.com/link.php?id=8837.
[17] Y. Chen and Y. Hao. "A feature weighted support vector machine and K-nearest neighbor algorithm for stock market indices prediction". Expert Systems with Applications, 80:340–355, 2017. ISSN 09574174. doi: 10.1016/j.eswa.2017.02.044.
[18] R. P. da Costa Barbosa. "Agents in the Market Place: An Exploratory Study on Using Intelligent Agents to Trade Financial Instruments". 2011.
[19] D. Wang, X. Liu, and M. Wang. "A DT-SVM strategy for stock futures prediction with big data". Proceedings - 16th IEEE International Conference on Computational Science and Engineering, CSE 2013, pages 1005–1012, 2013. ISSN 1949-0828. doi: 10.1109/CSE.2013.147.
[20] F. Liu, P. Du, F. Weng, and J. Qu. "Use clustering to improve neural network in financial time series prediction". Proceedings - Third International Conference on Natural Computation, ICNC 2007, 2(Icnc):89–93, 2007. doi: 10.1109/ICNC.2007.796.
[21] J. Leskovec and A. Rajaraman. "Lecture CS345a: Data Mining - Clustering algorithms". Stanford University. 1975. URL http://dl.acm.org/citation.cfm?id=540298.
[22] R. Dash and P. K. Dash. "A hybrid stock trading framework integrating technical analysis with machine learning techniques". The Journal of Finance and Data Science, 2(1):42–57, 2016. ISSN 24059188. doi: 10.1016/j.jfds.2016.03.002. URL http://linkinghub.elsevier.com/retrieve/pii/S2405918815300179.