Statistical Models and Machine Learning Algorithms to Forecast Future Prices in the Stock Market
Ana Rita Silveira da Costa
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor(s): Prof. Nuno Cavaco Gomes Horta, Prof. Rui Fuentecilla Maia Ferreira Neves
Examination Committee
Chairperson: Prof. António Manuel Raminhos Cordeiro Grilo
Supervisor: Prof. Nuno Cavaco Gomes Horta
Member of the Committee: Prof. Alexandra Sofia Martins de Carvalho
June 2018
Declaration
I declare that this document is an original work of my own authorship and that it fulfills all the require-
ments of the Code of Conduct and Good Practices of the Universidade de Lisboa.
Acknowledgments
Firstly, I would like to thank my supervisor Professor Nuno Cavaco Gomes Horta for the support and
knowledge he gave me during the development of this thesis. I would also like to thank my family,
especially my parents, who have always supported me along the whole thesis process and gave me the
opportunity to study at such a good university. Finally, a very special acknowledgment to Joao Salvado,
Ines Gil, Ines Goncalves, Joao Villa de Brito and Vera Pedras, who helped me not only during the thesis
development but also throughout the whole degree.
Resumo
Os preços de ações podem ser interpretados como séries temporais que podem ser previstas, de forma
a melhorar os resultados para um investidor. Vários métodos encontram-se em desenvolvimento com
o objetivo de obter uma previsão mais precisa. A previsão de uma série temporal é um problema de
regressão, visto ser uma variável contínua que é prevista. A presente dissertação aplica um método
estatístico, ARIMA, e dois de machine learning, K-Nearest Neighbors (KNN) e Support Vector Regression
(SVR), com vista a prever o preço das ações. O presente trabalho apresenta previsões diárias,
semanais e mensais, fazendo uso de ações com diferentes características. Os três modelos estudados
são comparados em cada uma das situações referidas, considerando o erro das previsões, os retornos
de uma estratégia simples e ainda o risco e precisão da estratégia. Os dados utilizados para o período
de treino correspondem a 4 anos de uma ação com uma tendência clara e outra ação com tendência
lateral. O período de teste corresponde a 1 ano das mesmas ações. O melhor resultado foi obtido com
o ARIMA numa previsão mensal, alcançando retornos de 40% e uma precisão de 90.9%. Os algoritmos
KNN e SVR demonstraram ser mais precisos em ações de tendência lateral, sendo as soluções destes
superiores às soluções obtidas com o ARIMA. Ambas as abordagens de machine learning beneficiam
da introdução de um retreino durante o período de teste, tendo em alguns casos decrescido o erro em
10 vezes.
Palavras-chave: Séries Temporais, Análise Preditiva, Stock Market, ARIMA, K-Nearest Neighbors, Support Vector Regression
Abstract
Stock prices can be interpreted as time series that can be forecasted in order to improve the returns of
a trader. Several methods, drawing on statistics and artificial intelligence, are being developed to make
this prediction more accurate and reliable. Forecasting a time series is a regression problem, since the
variable being forecasted is continuous. This thesis applies a statistical method, ARIMA, and two
machine learning models, K-Nearest Neighbors and Support Vector Regression, to forecast future stock
prices. The presented work shows predictions over daily, weekly, and monthly horizons using stocks
with different characteristics. The three studied models are compared in each of these situations,
considering the error of the forecasted values, the returns of a strategy that relies on these predictions,
and the risk and accuracy of that strategy. The data sets used for the training period correspond to
4 years of data from a clear trend stock and a sideways stock, in order to cover data with different
characteristics. The test period corresponds to 1 year of the same stocks. The best result was obtained
by the ARIMA model in a monthly forecast, reaching returns of 40% and an accuracy of 90.9%. The
K-Nearest Neighbors and Support Vector Regression algorithms are more precise on a sideways stock,
where they are superior to the ARIMA solution. Both machine learning approaches benefit from the
introduction of retraining during the test period, in some cases decreasing the error by a factor of 10.
Keywords: Time Series, Forecast, Stock Market, ARIMA, K-Nearest Neighbors, Support Vector
Regression
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Document Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background and Related Work 5
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Concepts of Stock Market . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Time Series characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Modeling and Forecasting Time Series . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.4 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Works about modeling and forecasting a time series . . . . . . . . . . . . . . . . . 17
2.2.2 Works on forecast concerning Big Data . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Proposed Architecture 25
3.1 Architecture Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Architecture Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Data Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Train Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.3 Forecast and validation Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4 Results 37
4.1 ARIMA Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.1 Stock with a Clear Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.2 Sideways Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1.3 ARIMA performance conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 K-Nearest Neighbors Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 Clear Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.2 Sideways stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.3 KNN performance conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Support Vector Regression Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.1 Clear Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.2 Sideways Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.3 Support Vector Regression performance conclusions . . . . . . . . . . . . . . . . 51
4.4 ARIMA vs. KNN vs. SVR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.1 Clear Trend Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.2 Sideways Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.3 Overall comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.5 Studying the impact of retraining KNN and SVR . . . . . . . . . . . . . . . . . . . . . . . . 56
5 Conclusions and Future Work 59
5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Bibliography 61
List of Tables
2.1 Description of the most common metrics used in statistical works . . . . . . . . . . . . . . 15
2.2 Description of the most common metrics used in computational finance . . . . . . . . . . 16
2.3 Algorithm comparison based on the Related Work . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Summary of Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 ARIMA parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 SVR parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 KNN parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1 ARIMA data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 ARIMA results for a clear trend stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 ARIMA results for a sideways stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 ARIMA Performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 K-Nearest Neighbors results for a clear trend stock . . . . . . . . . . . . . . . . . . . . . . 46
4.6 K-Nearest Neighbors results for a sideways stock . . . . . . . . . . . . . . . . . . . . . . . 47
4.7 KNN performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.8 KNN MAPE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.9 Support Vector Regression results for a clear trend stock . . . . . . . . . . . . . . . . . . 50
4.10 Support Vector Regression results for a sideways stock . . . . . . . . . . . . . . . . . . . 50
4.11 Support Vector Regression performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.12 ARIMA vs. KNN vs. SVR in a Clear Trend Stock. . . . . . . . . . . . . . . . . . . . . . . . 52
4.13 ARIMA vs. KNN vs. SVR in a Sideways Stock. . . . . . . . . . . . . . . . . . . . . . . . . 54
4.14 Best results for each stock. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
List of Figures
1.1 Problem to be solved . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Example of a time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Example of PACF and ACF functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 SVM classification approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 ε-insensitive loss function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Representation of the problem architecture (adapted from [19]) . . . . . . . . . . . . . . . 21
3.1 Proposed architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Pseudo-code for the transformation of a .csv file into a time series . . . . . . . . . . . . . 27
3.3 Overfitting and Underfitting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Data separation into Training and Test sets . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5 Pseudo-code for the ARIMA implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 Transformation of a time series into a supervised learning format. . . . . . . . . . . . . . . 31
3.7 Cross-Validation with K=3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.8 Pseudo-code for feature selection and hyper-parameters tuning . . . . . . . . . . . . . . 32
3.9 MAE calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.10 ROI calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.11 Sharpe Ratio calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.12 Accuracy calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.13 Evaluation metrics calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 VRSN Stock Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 BEN Stock Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Results of Dickey-Fuller Test for the original series . . . . . . . . . . . . . . . . . . . . . . 40
4.4 Results of Dickey-Fuller Test for the stationary series . . . . . . . . . . . . . . . . . . . . . 40
4.5 Results of Dickey-Fuller Test of the original series . . . . . . . . . . . . . . . . . . . . . . . 42
4.6 Results of Dickey-Fuller Test of stationary series . . . . . . . . . . . . . . . . . . . . . . . 43
4.7 Comparison of the three algorithms in a clear trend stock . . . . . . . . . . . . . . . . . . 53
4.8 Comparison of the three algorithms in a sideways stock . . . . . . . . . . . . . . . . . . . 54
4.9 KNN with Retraining for a Clear Trend Stock . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.10 KNN with Retraining for a Sideways Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.11 SVR with Retraining for a Clear Trend Stock . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.12 SVR with Retraining for a Sideways Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 1
Introduction
This chapter describes the motivation behind this work and the problem to be solved. After understand-
ing the context of the problem, the objectives of this thesis are enumerated as well as a few contributions
resulting from all the conducted research. At the end of the chapter, the document structure is described.
1.1 Motivation
The stock market refers to the collection of markets and exchanges where the trading of securities takes
place, and it is considered one of the most vital components of a free-market economy. It is known that
the first stock exchange happened in 1531 in Belgium; even though the concept of a “stock” has changed
over time, in the beginning it was similar to a financier partnership that produced income the way stocks
do. The London stock exchange officially started in 1773, and 19 years later came the first New York
Stock Exchange. Stock market news is everywhere: newspapers, TV news, and entire websites dedicated
to the matter. The reason why the stock market is so important is that it allows companies to raise money
by offering part of their equity, letting investors participate in their financial achievements. The stock
market also serves as an economic barometer, since share prices rise and fall depending largely on
economic factors. For example, share prices tend to increase when the economy shows signs of growth
and, on the other hand, tend to fall sharply, sometimes leading to a stock market crash, during an
economic recession, depression, or financial crisis. A good knowledge of the stock indexes serves as
a reference to the general trend in the economy, influencing decisions from the average family to the
wealthiest executive.
Going back to the fact that the stock market gives small investors the opportunity to participate in a
company’s financial achievements, the idea of owning shares of a big company is very appealing, leading
many people to invest in this market. Naturally, an investor wants to own shares of a wealthy company
with expectations of future returns, and not of a company that will lose value in the future. This leads
to the need to ponder which stock one should invest in. To solve this impasse, the concept of predictive
analysis, also known as forecasting, started to appear in the financial field. To “forecast” is to predict or
estimate a future event or trend based on past and present information. In this case, the future event
is the future value of a share, used to decide whether or not to buy it. The idea of learning from the
past in order to predict the future has gained popularity in recent years, and many techniques concerning
this topic are now being used and tested to make good predictions that help those involved in this market.
There are statistical models for this predictive task, and they have started to give very good results.
Statistics deals with the collection, classification, analysis, and interpretation of numerical data, and
provides tools for forecasting through statistical models. Early statistical models were almost all from
the class of linear models, but the behavior of the data, especially financial data, created a particular
interest in nonlinear models. Also because of the non-linearity of the data, artificial intelligence methods
gained a lot of popularity in the forecasting field. The term “Machine Learning” is now widely used in
financial computation, although the origins of machine learning date back to the 1950s. In 1952, Arthur
Samuel wrote the first computer learning program, a checkers game in which the IBM computer improved
the more it played, studying which moves made up winning strategies and using them in its program to
win. It was one of the first times that a kind of “artificial intelligence” was created. In 1957, Frank
Rosenblatt designed the first neural network, the beginning of one of the most powerful machine learning
algorithms used nowadays. During the 1990s, machine learning shifted from a knowledge-driven to a
data-driven approach. Programs were created for computers to analyze large amounts of data and draw
conclusions — or “learn” — from the results. Today we can even talk about “deep learning”, the ability
to see and distinguish objects and text in images and videos [1]. Computers’ abilities to see, understand,
and interact with the world around them are growing at a remarkable rate, and many traders are using
these abilities to forecast the stock market.
This work is thus motivated by the challenge of conducting predictive analysis in the stock market
using statistical and machine learning algorithms to improve the returns of a trader.
1.2 Objectives
The main purpose of this work is to introduce the use of forecasting tools in the financial computation
field, implementing one statistical algorithm and two machine learning algorithms to predict future prices
of a share in the stock market. After the implementation, the results of each model will be evaluated
with different metrics and compared based on that evaluation. The tests will be conducted with different
data volumes and for different time frames. At the end of this work, it will be possible to see how these
models behave and to choose the best of them for different situations. Briefly, the main objectives are
enumerated below and illustrated in Figure 1.1:
1. Study time series and their role in the stock market;
2. Introduce the use of forecast tools in financial computation;
3. Use one statistical algorithm to forecast future prices of a share in the stock market;
4. Use two machine learning algorithms to forecast future prices of a share in the stock market;
5. Compare the behavior of the three different models for a daily, weekly and monthly forecast.
Figure 1.1: Problem to be solved and questions related to each one of the main topics.
This work aims to contribute to the field of statistical and machine learning algorithms for
forecasting financial data, and the main contributions of this research are the following:
1. Give an idea of how to evaluate a forecasting algorithm in financial computation;
2. Show the behavior of different algorithms when forecasting in the stock market;
3. Serve as a base for a future Big Data platform integration;
4. Serve as a base for future forecasting tests in financial computation.
1.3 Document Structure
This document contains five chapters. Chapter 1 introduces the problem being solved in this work
and defines the objectives. Chapter 2 covers the background and related work: its first part introduces
theoretical background on the context of the problem (stock markets) and the algorithms used along the
work; its second part presents works related to forecasting, most of them in the stock market context.
Chapter 3 contains a description of the proposed architecture to solve the problem of forecasting in the
stock market and the logic behind its step-by-step implementation. Chapter 4 contains case studies
serving as an evaluation context for each of the implemented algorithms. It is also in this chapter that
the algorithms are evaluated (for each case study) and compared to each other. The last chapter,
Chapter 5, presents conclusions about the conducted work and some thoughts on future work that can
be done in this field of research.
Chapter 2
Background and Related Work
2.1 Background
To provide a better understanding of the problem, this section describes some important terms,
techniques, and technologies. Following the problem definition, there are three main topics: the stock
market data as a starting point, the forecasting techniques that can be used with this data, and finally
the discussion of the results of this application. The questions enumerated in the objectives, Section
1.2, Figure 1.1, are answered in this section.
2.1.1 Concepts of Stock Market
The stock market is all about companies. Companies have assets, an asset being something that
has value and provides some type of future benefit. In a company context, assets can be the sum of
cash, buildings, inventories, copyrights, etc. On the other hand, companies have liabilities, the amount of
money they owe to some entity. The difference between the assets and the liabilities, in other words,
what is left after paying the liabilities, is what is called “owner’s equity”. When someone buys a share,
he or she becomes a partial owner of the company, more specifically, a part-owner of the owner’s equity.
For future clarification, the difference between the terms “share” and “stock” is that a share refers
to a specific company, while stocks may refer to one or more companies. Supposing a company has
a certain number of shares, the value of each share is the value of the owner’s equity divided by the
number of shares, and it is this share that is sold on the stock market. To better understand this
scenario, an example is given below. Company X has $30m of assets, $22m of liabilities, $8m of owner’s
equity and 2m shares. The value of each share is the value of the owner’s equity divided by the number
of shares, so it is $4. Very often this value does not correspond to the selling price of the same share
in the market: sometimes the market presents a higher value, other times a smaller one, because of
speculation among other reasons. The “market capitalization” is what the market thinks the equity is
worth, reflecting it in share prices.
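The arithmetic of the Company X example can be sketched in a few lines of Python; the figures are the illustrative ones from the text, not real company data:

```python
# Illustrative figures from the Company X example above.
assets = 30_000_000       # total assets ($)
liabilities = 22_000_000  # total liabilities ($)
shares_outstanding = 2_000_000

# Owner's equity is what remains after paying the liabilities.
owners_equity = assets - liabilities              # $8m
book_value_per_share = owners_equity / shares_outstanding

print(book_value_per_share)  # 4.0 (dollars per share)
```

As the text notes, the market price will usually deviate from this $4 book value because of speculation and other factors.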
Imagine that, for some reason, a trader can make an informed guess about what the price of a specific
share will be the day after. The question here is: how can a trader make money with this
information? Suppose the trader has strong reasons to believe IBM shares will increase tomorrow,
and with that information he or she decides to buy one or more IBM shares. If the prediction is right, on
the day after the trader owns a share that has more value than when it was bought. This action is called
a “long position”, and it is commonly addressed as “going long”. When owning these shares, if the trader
has strong reasons to believe the price will start to decrease, he or she can sell the shares and close the
long position with a positive profit, since the share was sold for a higher price than it was bought.
If instead the trader has strong reasons to believe that IBM shares will fall the day after, for
example from $100 to $50, and the trader does not own any IBM shares, he or she can borrow a share
from the broker and sell it on the market for its current value, $100. After this step, the trader holds $100
in cash and owes one share to the broker, since the sold share was borrowed. If the share price drops
to $50 the day after, the trader can buy the share back and return it to the broker, ending the trade with
a profit of $50. This is what is called “shorting a stock”.
In summary, in a long position the main goal is to buy a share at a low price and sell it at a higher
price, and in a short position the goal is to sell high and buy low.
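The two position types described above can be captured by a small sketch; the function names and prices are illustrative only, not part of any trading library:

```python
def long_profit(buy_price: float, sell_price: float, shares: int = 1) -> float:
    """Profit of a long position: buy low, later sell higher."""
    return (sell_price - buy_price) * shares

def short_profit(sell_price: float, buyback_price: float, shares: int = 1) -> float:
    """Profit of a short position: sell a borrowed share high, buy it back lower."""
    return (sell_price - buyback_price) * shares

# The IBM scenario from the text: shorting at $100 and buying back at $50.
print(short_profit(sell_price=100, buyback_price=50))  # 50
# A long position bought at $100 and sold at $150 yields the same profit.
print(long_profit(buy_price=100, sell_price=150))      # 50
```

Both formulas are symmetric: a correct directional guess earns the price difference, and a wrong one loses it.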
These strong guesses that traders have concerning future share values are the basis of trading strategies.
There are two main approaches that influence trading strategies nowadays: fundamental and technical
analysis. Fundamental analysis focuses on the fundamentals of the company or industry, i.e. data such
as sales and debt level, which of course are affected by the macroeconomic environment. It tries to go
to the facts and numbers of each company. This kind of research uses economic reports, internal
documents, and even public news. Technical analysis is a method of analyzing the statistics generated by
market activity to evaluate securities. It relies on three hypotheses: 1) the market discounts everything, 2)
price moves in trends, and 3) history tends to repeat itself [2]. Technical indicators are operations based
on the price and volume of a security that measure money flow, trends, volatility, and momentum.
Stock market price sequences are available on multiple web platforms like Yahoo, Google Finance,
or OANDA, and there is an enormous quantity of data available. This data can be interpreted as time
series, and it is important to understand some specificities of time series to know how to use them for
data analysis and forecasting.
2.1.2 Time Series characteristics
A univariate time series is a sequence of measurements of the same variable collected over time, where
ordering matters due to the dependency on the past. Figure 2.1 shows an example of a time series.
Figure 2.1: Example of a time series.
Time series can have multiple behaviors, such as trends and seasonal periods, and they can even show
a random walk behavior. The existence of a trend means that, on average, the measured variable tends
to increase or decrease. One example is the number of people using cellphones measured over the last
10 years. On the other hand, seasonality is a regular pattern related to calendar seasons and can be
observed, for example, in a time series representing the percentage of rain over the last 5 years. Some
measurements seem to have a random walk behavior, almost like white noise. Most series can be
decomposed in order to identify whether they have some trend or seasonality that is not obvious just by
looking [3].
Identifying these characteristics is very important when considering statistical analysis, since stationary
time series are easier to work with. A stationary time series is one with constant mean and variance.
With this definition, it is possible to conclude that an uptrending time series is not stationary.
Modeling and analyzing a time series is nothing more than finding a mathematical model that can
describe the time series values over time. This analysis can help explain how the past affects the
present and future values, forecast future values, and serve as a control standard.
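As a toy illustration of the stationarity idea (constant mean and variance), the crude check below compares the first and second halves of a series. It is only a plain-Python heuristic sketch, not the formal Dickey-Fuller test applied later in this work:

```python
import random

def split_mean_var(series):
    """Crude stationarity heuristic: compare the mean and variance of the
    first and second halves of a series. A stationary series should give
    similar values for both halves; a trending series will not."""
    half = len(series) // 2
    def stats(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / len(xs)
        return m, v
    return stats(series[:half]), stats(series[half:])

random.seed(0)
# White noise: stationary (constant mean and variance).
noise = [random.gauss(0, 1) for _ in range(1000)]
# An uptrending series: the mean grows over time, so it is not stationary.
trend = [0.01 * t + random.gauss(0, 1) for t in range(1000)]

print(split_mean_var(noise))  # both halves: mean near 0, variance near 1
print(split_mean_var(trend))  # the second half's mean is clearly larger
```

For the trending series, the two half-means differ by roughly 5 (the drift accumulated over 500 steps), which is exactly why such a series must be differenced before ARIMA can be applied.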
2.1.3 Modeling and Forecasting Time Series
Forecasting is the process of making predictions about the future based on data from the past. Stock
market traders use forecasting to predict the evolution of stock prices and take advantage of it to decide
on a strategy, meaning either going long or short on a share. This can be done without any computer
program, with economists analyzing all sources of data and trying to find relations and patterns sufficient
to make assumptions about the future. This approach became impractical due to the amount of data that
exists, so many methodologies were developed to facilitate this process. A forecast is a prediction, so
it has a degree of risk and uncertainty attached to it. There are two main forecasting approaches
discussed in this work: the statistical approach and the machine learning approach. The three methods
described in this section each fall into one of these two categories, as enumerated below:
1. Statistical approach: ARIMA model.
2. Machine Learning approach: Support Vector Machines and K-Nearest Neighbors.
a) Statistical Approach
A time series can be modeled and forecasted with a statistical equation. The most common statistical
methods are GARCH (generalized autoregressive conditional heteroscedasticity) and ARIMA
(autoregressive integrated moving average). The first one is used mainly to forecast the volatility of
financial time series, i.e. the periodic standard deviation. ARIMA fits the data itself rather than the
volatility, and it is used to forecast the actual time series [3]. For the purposes of this work, only the
ARIMA model will be described in this section.
a.1) ARIMA
ARIMA is a combination of an autoregressive model (AR) and a moving average model (MA) with one
or more orders of differencing. ARIMA has three parameters, p, d, and q, and it is commonly written
as ARIMA(p,d,q).
Before explaining the autoregressive model, it is important to understand what an autoregression is.
An autoregression is nothing more than a regression of the variable against itself, and the expression
of an autoregressive model is shown in Equation (2.1),
y_t(AR) = \mu + \sum_{i=1}^{p} \gamma_i y_{t-i} + \varepsilon_t. \qquad (2.1)
In this equation, y_t(AR) is the variable to be predicted by the autoregressive model, µ is the average
of the changes between consecutive observations, γ_i are the coefficients of the lagged values, y_{t−i} are
the lagged values of y_t, p is the number of these lagged values, and ε_t is white noise. Multiple regression
uses a linear combination of predictors to forecast the variable of interest. Looking at the autoregressive
model equation, Equation (2.1), it is possible to observe that an autoregression is similar to a multiple
regression, but with lagged values of y_t as predictors. The number of lagged values used as predictors
is the value of p, one of the ARIMA parameters.
While an autoregressive model (AR) uses past values of the forecast variable in a regression, a
moving average (MA) model uses past forecast errors in a regression-like model. The expression of a
moving average model is presented in Equation (2.2),
y_t(MA) = \mu + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}. \qquad (2.2)
In Equation (2.2), µ is the average of the changes between consecutive observations, ε_t is white
noise, θ_i are the coefficients of the lagged forecast errors, and q is the number of these lagged errors.
Looking at this equation, it is possible to observe that each value of y_t can be thought of as a weighted
moving average of the past few forecast errors. The number of these past forecast errors, as referenced
before, is the parameter q of the ARIMA model.
The combination of differencing with these last two equations results in the ARIMA (p,d,q) expression
since ARIMA is an autoregressive integrated moving average model. The difference in a time series is
the series of changes from one period to the next. The ARIMA (p,d,q) expression, Equation (2.3), is
presented below where y′t represents the differenced series that may have been differenced more than
once:
y′_t(ARIMA) = µ + ∑_{i=1}^{p} γ_i y′_{t−i} + ε_t + ∑_{i=1}^{q} θ_i ε_{t−i}.  (2.3)
The number of times that the series was differenced is the value of the parameter d of ARIMA (p,d,q).
The reason why some series need one or more orders of differencing is that they are not stationary, and
ARIMA only works with stationary series. If the series becomes stationary after one order of differencing,
the value of the parameter d is 1. If the original series is already stationary, there is no need for
differencing. In conclusion, the three parameters are summarized below:
1. p is the number of autoregressive terms;
2. d is the number of orders of difference to turn a time series stationary;
3. q is the number of moving average terms.
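The differencing step behind the parameter d can be sketched as follows; this is a minimal pure-Python illustration on toy numbers, not the implementation used in this work:

```python
def difference(series, d=1):
    """Apply d orders of first-differencing (the 'I' in ARIMA)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

print(difference([1, 4, 9, 16], d=1))  # [3, 5, 7]
print(difference([1, 4, 9, 16], d=2))  # [2, 2]: a quadratic trend needs d = 2
```

Each order of differencing shortens the series by one observation and removes one order of polynomial trend.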
To forecast a time series with ARIMA, the values of p, d, and q should be calculated.
The first step when modeling a time series with ARIMA is to check the series' stationarity, i.e., whether
it has a constant mean and variance. A nonlinear transformation, for example a logarithmic transformation,
can convert the original series to a form where its local random variations have constant variance
over time. After this step, if the time series is still non-stationary, a first-difference transformation can
be applied until the series shows a constant mean. The number of orders of differencing needed to turn
the original series into a stationary one is the parameter d, and its value is determined in this phase.
Now that the series is stationary and the order of differencing, d, is known, there are still two parameters
to discover: the parameter p, which is the number of autoregressive terms, and the parameter q, the
number of moving average terms. To identify the number of AR and MA terms, a PACF (Partial
Autocorrelation Function) and an ACF (Autocorrelation Function) can be observed. Examples of ACF
and PACF plots are presented in Figure 2.2.
Figure 2.2: Example of PACF and ACF functions.
PACF and ACF are both measures of association between current and past series values. The ACF
presents the relation between y_t and its lagged values y_{t−k}, for different values of k. From the ACF
perspective, if yt is correlated with yt−1, then yt−1 and yt−2 are also correlated. This correlation may be
due to new information contained in yt−2 that could be used in forecasting the value of yt, or it can be
simply because they are both connected to yt−1. To solve this uncertainty, a PACF is conducted, since
PACF measure the relationship between yt and yt−k after removing the effects of lags 1, 2, 3, ..., k − 1.
Looking at Figure 2.2, the ACF shows a spike in the first lag and the PACF shows the same, so it is very
reasonable to believe that y_t is strongly correlated with y_{t−1}. Generally, the lag beyond which the PACF
cuts off is the indicated number of AR terms, and the lag beyond which the ACF cuts off is the indicated
number of MA terms.
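The sample autocorrelation at lag k, the quantity plotted in an ACF, can be computed as in the following sketch; the trending toy series is illustrative only:

```python
def acf(series, k):
    """Sample autocorrelation of a series at lag k."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n))
    return cov / var

trend = [1.0, 2.0, 3.0, 4.0, 5.0]  # strongly trending toy series
print(acf(trend, 0))               # lag 0 is always 1.0
print(acf(trend, 1))               # positive lag-1 autocorrelation
```

The PACF additionally removes the effect of shorter lags; in practice both are usually obtained from a statistics library rather than computed by hand.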
b) Machine Learning Approach
The widely-quoted definition of Machine learning by Tom Mitchell [4] says “A computer program is said
to learn from experience E with respect to some class of tasks T and performance measure P if its
performance at tasks in T, as measured by P, improves with experience E”. Machine Learning algorithms
can then simulate a brain, in the sense that, given a set of information, they can interpret and process
it and draw conclusions. In forecasting, these algorithms are gaining a special place, with very good
results in many fields. The idea is to use a set of data (a time series, for example) as input to the
algorithm, train the algorithm with that data, and produce an expectation of the future values that this
time series can take.
There are several types of machine learning algorithms, the main groups being the supervised and the
unsupervised ones. Supervised learning finds a mapping function between input variables, also known
as features, and output variables, also known as targets, both of them known at the beginning [5], with
the goal of finding the relation/pattern between the input and the output. The ideal result is a perfect
matching between these values, in such a way that given a new input set it is possible to obtain a reliable
output. It is called supervised because the correct output is known from the beginning, so the whole
process can be seen as successive attempts to get it right, constantly adjusting to achieve the perfect
mapping.
Supervised Learning problems can be grouped into classification problems when the output is a
category, and regression problems when the output variable is a real or continuous value.
b.1) K-Nearest Neighbors
When talking about forecasting, many algorithms come up. One of the simplest is the Nearest Neighbor
algorithm. K-Nearest Neighbors (KNN) is a non-parametric method since it does not assume a linear
functional form for f(X), being more flexible than a parametric model [6] because it does not make any
assumptions about the underlying data distribution. Non-parametric models can be more complex to
understand, but they work better than linear models when dealing with a great number of observations.
Besides being a non-parametric method, KNN is also called a lazy algorithm, meaning it does not use
the training data points for any generalization, which makes the training phase minimal and fast. The
KNN algorithm is based on feature similarity, and in regression it is based on how many of the previous
values are considered similar to the out-of-sample data point. This number of previous values is the number of
neighbors.
Assuming a value for the number of nearest neighbors, K, and a point to be predicted at an instant
i, xi, the KNN algorithm identifies the K training observations Ni closest to the prediction point. The
estimation for xi is given by:
f(x_i) = (1/K) ∑_{x_i ∈ N_i} y_i.  (2.4)
In other words, to predict the value at time t, for example with K = 10, the algorithm takes the average
of the last ten values and assumes that this is the value at instant t. The optimal value for K depends
on the bias-variance tradeoff. Bias measures how far a model's predictions are from reality, and variance
represents the variability of a model's prediction for a given data point. A small K provides a more
flexible fit, which has a low bias but a high variance. On the other hand, a large value of K provides
a low variance since the prediction in a region is an average of several neighbors, and changing one
observation has a smaller effect.
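Under this interpretation, KNN regression on a time series reduces to averaging the K most recent (most similar) observations, as sketched below with toy values:

```python
def knn_forecast(history, k):
    """Forecast the next value as the mean of the k nearest (here: most recent) points."""
    neighbors = history[-k:]
    return sum(neighbors) / k

prices = [10.0, 10.4, 10.2, 10.6, 11.0]
print(knn_forecast(prices, k=2))  # average of the two most recent values
```

A small k here tracks the latest movements closely (high variance), while a large k smooths them out (high bias), matching the tradeoff described above.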
The main advantages of KNN are its simplicity, high accuracy, quickness, and the fact that it does not
make assumptions about the data. On the other hand, the prediction phase can be slow, and it can be
sensitive to irrelevant features.
b.2) Support Vector Regression
Support Vector Machines (SVM) are supervised learning algorithms used for classification and regression.
In the context of this problem, Support Vector Regression is the one that matters, since the goal
is to predict a real value, but it is important to understand how support vector machines work in general.
Support Vector Machines were invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963,
and since then the method has undergone some changes. The SVM algorithm is also known as the
widest street approach, and the basic idea is described below.
Figure 2.3: SVM classification approach.
It is given a training dataset ((x_1, y_1), ..., (x_n, y_n)), where y_i is the class to which x_i belongs. The
Support Vector Machine intends to find the maximum-margin hyperplane that divides the group of points
x_i that belong to one class from the group of points x_i that belong to the other class. Taking the example
from Figure 2.3, the x_i values can be health indicators and y_i can be the result of a health test for those
health indicators. The hyperplane then divides the group of health indicators of type “positive for
disease” from the group of indicators of type “negative for disease”. The hyperplane is the decision
boundary and it is made as wide as possible, knowing it cannot contain any sample. Some samples lie
on the margin, and they are called the support vectors.
The hyperplane is the decision boundary, but it is necessary to define a decision rule that uses this
boundary. In Figure 2.3, w is a vector of any length perpendicular to the median line of the hyperplane,
u is an unknown sample and its representative vector, and the decision rule has to tell whether the
unknown sample is on the left or the right side of the “street”. Projecting u onto the perpendicular
vector w can give that information, meaning that if the dot product between the two is greater than a
constant, the sample belongs to the right side of the hyperplane: if w · u ≥ c, u is a positive sample.
Without loss of generality, and assuming c = −b, it is possible to say that if the dot product plus some
constant b is greater than or equal to zero, w · u + b ≥ 0, then u is a positive sample. The problem with
this equation is that there are no constraints to determine which particular w and b to choose, so it is
necessary to add some constraints to calculate these values. Taking one positive sample, x_+, it is
acceptable to insist that its dot product with w plus a constant has to be at least 1, w · x_+ + b ≥ 1, and
taking a negative sample, its dot product with w plus a constant has to be at most −1, w · x_− + b ≤ −1.
This assumption is made for mathematical convenience.
With these last two equations, it is possible to calculate the two values w and b, but it is still a long
calculation, so another variable, y_i, is introduced into the problem for mathematical convenience. This
variable takes the value 1 for positive samples and −1 for negative samples. Multiplying the two last
equations by y_i, the two equations become equal to y_i (w · x_i + b) ≥ 1, which is equivalent to
y_i (w · x_i + b) − 1 ≥ 0, with y_i (w · x_i + b) − 1 = 0 for the samples on the margins.
Choosing one vector for a negative sample and another for a positive sample, the width of the street
can be described as the dot product between the difference of these two vectors, x_+ − x_−, and a unit
vector perpendicular to the “street”. It is known that w is a perpendicular vector of any length, so w
divided by its norm is a unit vector perpendicular to the hyperplane. The width is then equal to
(x_+ − x_−) · w/‖w‖. Looking at this expression, x_+ · w and x_− · w can be solved using
y_i (w · x_i + b) − 1 = 0, giving x_+ · w = 1 − b and x_− · w = −(1 + b). Substituting these results in
(x_+ − x_−) · w/‖w‖, the width of the street becomes 2/‖w‖. The goal of all these calculations is to
maximize the width of the decision boundary and find the decision rule, in other words, to maximize
2/‖w‖. For mathematical convenience, this is the same as maximizing 1/‖w‖, which is the same as
minimizing ‖w‖, and formally the problem of maximizing the width of the hyperplane is mathematically
the same as minimizing (1/2)‖w‖². The original problem of maximizing the width of the hyperplane is
now reduced to Equation (2.5):
Minimize (1/2)‖w‖².  (2.5)
Finding an extremum of a function with constraints is not obvious, and to solve this problem Vapnik
used Lagrange multipliers. The use of these multipliers gives a new expression to minimize without
having to think about the constraints of the problem. L is now what has to be minimized, and it is written
in Equation (2.6):

L = (1/2)‖w‖² − ∑_i α_i [y_i (w · x_i + b) − 1].  (2.6)
To find an extremum of a function, one should set its derivatives to zero, so in this case the derivatives
of L with respect to w and b are set to zero, as shown below:

∂L/∂w = w − ∑_i α_i y_i x_i = 0 ⇒ w = ∑_i α_i y_i x_i.  (2.7)

∂L/∂b = −∑_i α_i y_i = 0 ⇒ ∑_i α_i y_i = 0.  (2.8)
Substituting (2.7) and (2.8) in (2.6) results in Equation (2.9):

L = ∑_i α_i − (1/2) ∑_i ∑_j α_i α_j y_i y_j (x_i · x_j).  (2.9)
The conclusion is that the optimization depends only on the dot product of pairs of samples. Taking
this into consideration, the decision rule is given by the two equations below and it only depends on the
dot product of the sample vectors and the unknown sample vector:
If ∑_i α_i y_i x_i · u + b ≥ 0, then u is a positive sample.  (2.10)

If ∑_i α_i y_i x_i · u + b < 0, then u is a negative sample.  (2.11)
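The resulting decision rule of Equations (2.10) and (2.11) can be sketched directly; the multipliers, labels, and support vectors below are illustrative placeholders, not values obtained from an actual optimization:

```python
def svm_decision(alphas, labels, supports, u, b):
    """Classify u with the rule sign(sum_i alpha_i * y_i * (x_i . u) + b)."""
    dot = lambda a, c: sum(p * q for p, q in zip(a, c))
    score = sum(a * y * dot(x, u) for a, y, x in zip(alphas, labels, supports)) + b
    return 1 if score >= 0 else -1

# Two toy support vectors, one per class; score = 12 - 2 = 10, so +1.
print(svm_decision([1.0, 1.0], [1, -1], [[2.0, 2.0], [0.0, 0.0]], [3.0, 3.0], -2.0))
```

Only samples with non-zero α_i (the support vectors) contribute to the sum, which is why the model depends on a subset of the training data.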
The original support vector machines proposed by Vapnik in 1963 only covered situations where
it was possible to linearly separate the two classes of samples. To make the algorithm able to solve
nonlinear problems, Vapnik applied the kernel trick: the dot product is replaced by a nonlinear kernel
function, allowing the algorithm to fit the hyperplane in a transformed feature space. The most common
kernels besides the linear one are the polynomial and the Gaussian radial basis kernels.
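As an example, the Gaussian radial basis kernel that can replace the dot product is sketched below; the value of gamma is an illustrative choice, normally tuned per problem:

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel: k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 0.0], [1.0, 0.0]))  # identical points give 1.0
print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))  # similarity decays with distance
```

Because the optimization in Equation (2.9) depends only on dot products of pairs of samples, swapping each x_i · x_j for k(x_i, x_j) is all that is needed.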
The idea of using SVM for regression, better known as Support Vector Regression, is very similar to
the idea of SVM for classification. In the SVM version for classification, the model depends only on a
subset of samples because the model does not care about points beyond the hyperplane margins.
Analogously, in regression, the model depends only on a subset of the training data because the cost
function for building the model does not care about points close to the model prediction. Formally, this
problem can be written as a convex optimization:
Minimize (1/2)‖w‖²
subject to y_i − w · x_i − b ≤ ε
           w · x_i + b − y_i ≤ ε.
In the problem above, x_i is a training sample, y_i is the target value for that sample, and w · x_i + b is
the prediction for that sample. The constant ε is a free parameter that serves as a threshold, meaning all
the predictions have to be within an ε range of the true values. The learning algorithm minimizes the
ε-insensitive loss function illustrated in Figure 2.4.
Figure 2.4: ε-insensitive loss function.
This function is zero for errors that do not exceed the tolerance margin [−ε, ε]; by taking this function
as a reference, errors smaller than ε are ignored.
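The ε-insensitive loss described above can be sketched as follows, with ε = 0.1 as an illustrative tolerance:

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Zero inside the [-eps, eps] tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)

print(eps_insensitive_loss(1.0, 1.05))  # inside the tube: 0.0
print(eps_insensitive_loss(1.0, 1.30))  # outside the tube, penalty of about 0.2
```

Points whose residual falls inside the tube contribute nothing to the cost, which is why only the points outside it (the support vectors of the regression) shape the model.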
2.1.4 Metrics
The goal of this work is to forecast values of stock prices. To evaluate the forecast results from a
mathematical and theoretical point of view, there are some metrics that must be taken into account.
Since there is more than one model used to solve the problem, each one of them should be evaluated
using the same metrics in order to have a fair comparison between them. In statistical models, and
sometimes in machine learning, errors are frequently used to measure precision and approximation to
reality. The most common error measures are described in Table 2.1.
The first thing to understand is the concept of error, also known as a residual. When modeling a time
series, the result is a line that best fits the data. This line can be linear, polynomial, etc. Even though it
is the line of best fit, the data points do not fall exactly on it, being scattered around it. A residual is the
vertical distance between a data point and the regression line, and there is one residual value for each
point. If the data point is exactly on the line, the residual is zero; if it is above the line, the residual is
positive; and if it is below the line, the residual is negative. The sum of the residuals is always zero, and
so is its mean. These residuals are often called errors, even though in this context a residual does not
mean there is something wrong with the analysis. Equation (2.12) describes the mathematical
expression of an error:

e = y − ŷ.  (2.12)

Table 2.1: Description of the most common metrics used in statistical works.

Error Measure                  | Criteria
Mean Absolute Error            | MAE = (1/n) ∑ |y_j − ŷ_j|
Mean Absolute Percentage Error | MAPE = (100/n) ∑ |y_j − ŷ_j| / y_j
Mean Squared Error             | MSE = (1/n) ∑ (y_j − ŷ_j)²
Root Mean Squared Error        | RMSE = √((1/n) ∑ (y_j − ŷ_j)²)
Mean Squared Error (MSE) measures the average squared difference between the prediction and the
actual observation. It is commonly used to evaluate model performance. Root Mean Squared Error
(RMSE) is the square root of the average of squared differences between prediction and actual
observation, and it measures the standard deviation of the residuals, that is, how concentrated the data
is around the line of best fit. It is analogous to MSE, but it has the same units as the quantity being
estimated, which is the reason why RMSE is more popularly used than MSE.
MSE and RMSE are more popular when evaluating the quality of a model fitting while MAE and
MAPE are more commonly used when measuring forecast errors in time series forecast. Mean Absolute
Error is the average of the absolute differences between prediction and actual observations where all
individual differences have equal weight. MAE also uses the same scale as the data being measured
and measures the average magnitude of the residuals without considering their direction. MAPE is the
percentage of the average of the absolute error between two variables [7].
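The error measures of Table 2.1 translate directly to code; the short vectors below are illustrative toy values:

```python
import math

def mae(y, yhat):
    """Mean Absolute Error: average magnitude of the residuals."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    """Root Mean Squared Error: same units as the data, penalizes large residuals."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

actual, predicted = [2.0, 4.0, 6.0], [2.5, 3.5, 6.0]
print(mae(actual, predicted))   # (0.5 + 0.5 + 0.0) / 3
print(rmse(actual, predicted))  # penalizes large residuals more than MAE
```

Because RMSE squares each residual before averaging, a single large miss raises it much more than it raises MAE.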
Since this work intends to use forecasting techniques in the stock market, errors by themselves do not
give an absolute evaluation [8]. There are forecast results that can have very low values of RMSE and
MAE and still generate losses for the trader. For example, if a trader buys shares of a company
suspecting that the share price will increase by $2 the day after, and the actual value happens to
decrease by $1, the error is not very significant, but the prediction will create losses, since the trader
bought a share expecting it to be more valuable the day after and the opposite happened. To evaluate
whether a prediction creates gains or losses, there are some common evaluation metrics that need to
be considered when talking about forecasting in the stock market. These metrics are described in
Table 2.2.
Table 2.2: Description of the most common metrics used in computational finance.

Metric               | Description
Return On Investment | ROI = (Gain of investment − Initial investment) / Initial investment
Sharpe Ratio         | SR = (Mean Return − Risk Free Rate) / Std Return
Accuracy             | Accuracy = Nr. of right guesses / Total of trades
The Return on Investment (ROI) is a performance metric used to evaluate the efficiency of an in-
vestment, measuring the amount of return on an investment relative to the investment’s cost, and it is
calculated as shown in Table 2.2. The result is usually expressed as a percentage.
Since ROI does not measure how much risk is involved in producing that same return, Sharpe Ratio
is commonly calculated and usually is what the hedge funds want to maximize. Sharpe Ratio is one of
the most referenced risk/return measures used in finance and describes how much excess return one
receives for the extra volatility of holding a riskier asset. A possible scenario where the use of the Sharpe
Ratio makes sense is when comparing an investor A with a return of 15% and an investor B with a return
of 12%. At first sight, it may seem that A is the better performer. However, A may have taken a larger
risk, so B can have a better risk-adjusted return. For future interpretation, a ratio of 1 or better is
considered good, 2 or better is very good, and 3 or better is excellent.
Accuracy is also important when considering the evaluation of results, and it is simply the ratio between
the right guesses about the increase or decrease in the share price and the total number of trades
executed. It is also a widely used metric in many works about forecasting in the specific context of
financial data.
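The financial metrics of Table 2.2 can be sketched in a few lines; the investment values and return series below are illustrative figures:

```python
def roi(final_value, initial_investment):
    """Return On Investment as a fraction of the initial outlay."""
    return (final_value - initial_investment) / initial_investment

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the standard deviation of returns."""
    mean = sum(returns) / len(returns)
    std = (sum((r - mean) ** 2 for r in returns) / len(returns)) ** 0.5
    return (mean - risk_free) / std

print(roi(1200.0, 1000.0))         # a 20% return on investment
print(sharpe_ratio([0.10, 0.20]))  # about 3.0: excellent by the rule of thumb above
```

Two strategies with the same ROI can thus have very different Sharpe Ratios if one achieved its return with much more volatile period returns.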
When looking at the results of a work, one may not have the sensitivity to understand whether it is a
good or a bad result, or whether it is in line with current market return values. Analyzing the Bloomberg
stocks section, the average annual return in American indexes is 15.27%, in Europe, Middle East and
Africa indexes it is 4%, and in Asian indexes the average is 18.69%. For example, Google shares have
an average annual return of 11% and the S&P 500 an average of 14.19%.
2.2 Related Work
This section describes some works developed in the financial context and some models and algorithms
techniques used to predict financial time series.
2.2.1 Works about modeling and forecasting a time series
This section presents works concerning modeling and forecast in computational finance field using stock
and Foreign Exchange (FOREX) data.
a) Statistical Approach
Rounaghi and Zadeh [9] tried to model and forecast the stock value of 350 firms listed in London Stock
Exchange and S&P 500 from 2007 until the end of 2013 using an autoregressive and moving average
model (ARMA model). As a starting point, they verified that a forecast must take three factors into
consideration: 1) the choice of the time periods (lags) used as a base, 2) the market trend, and 3) the
prediction period. They applied monthly and yearly forecasting to both the London Stock Exchange and
S&P 500 Index. To model monthly data from London Stock Exchange, and according to PACF and ACF
graphs, the model used is ARMA (4,4) because the lag beyond which the PACF cuts off is 4 and is
the indicated number of AR terms, and the lag beyond which the ACF cuts off is 4 and is the indicated
number of MA terms. The yearly data from London Stock Exchange shows an increasing behavior, being
necessary to eliminate this trend using the regression method. After this elimination, and according to
the PACF and ACF graphs, the chosen model is ARMA(3,3). To model monthly data from S&P 500,
and according to PACF and ACF graphs the model used is ARMA(4,4) because the lag beyond which
the PACF cuts off is 4 and is the indicated number of AR terms, and the lag beyond which the ACF
cuts off is 4 and is the indicated number of MA terms. Lastly, to model yearly data from S&P 500,
and according to PACF and ACF graphs, the model used is the ARMA (3,3) because the lag beyond
which the PACF cuts off is 3 and is the indicated number of AR terms, and the lag beyond which the
ACF cuts off is 3 and is the indicated number of MA terms. To measure the quality of the proposed
ARMA model, researchers used MAE (Mean Absolute Error), MAPE (Mean Absolute Percentage Error),
MDAPE (Median Absolute Percentage Error), SMDAPE (Symmetric Median Absolute Percentage Error),
and MASE (Mean Absolute Scaled Error). All the measures are calculated with the following definitions:
Y_t is the observation at time t = 1, 2, ..., n; F_t is the forecast of Y_t; e_t is the forecast error
(e_t = Y_t − F_t); p_t = 100 e_t / Y_t is the percentage error; and finally q_t is determined using
Equation (2.13):

q_t = e_t / ( (1/(n−1)) ∑_{i=2}^{n} |Y_i − Y_{i−1}| ).  (2.13)
The results show that medium- and long-term forecasting of time series is possible in the S&P 500 and
the London Stock Exchange at the error level of 1%. Both markets are considered efficient and
financially stable during periods of boom and bust. The statistical analysis of the S&P 500 shows better
results than the London Stock Exchange in both medium and long horizons. The analysis of the London
Stock Exchange shows better results in medium horizons (monthly), outperforming the yearly results.
Vantuch et al. [10] also tried to predict future prices of the Microsoft stock (MSFT) using the ARIMA
model. The data is in a daily format and it has four years of length. To calculate the values of (p,d,q),
a Genetic Algorithm is used and the best found model was ARIMA (12, 2, 8). The Akaike Information
Criteria (AIC) and the Baysien Information Criteria (BIC) values of this model were compared with models
chosen without the help of the Genetic Algorithm. The Akaike Information Criterion is an estimator of
the relative quality of statistical models for a given set of data, and the Bayesian Information Criterion
is also a criterion for model selection among a finite set of models. The model with the lowest AIC and
BIC is preferred. The results for the genetic-algorithm-assisted model, GA-ARIMA (12,2,8), showed a
BIC of 458.6266 and an AIC of 400.4396, while an ARIMA (2,1,3) without the Genetic Algorithm showed
a BIC of 434.0470 and an AIC of 408.1205. As is observable, the GA-ARIMA model did not show
significantly better results, its AIC being only slightly lower than the AIC of the plain ARIMA. Also, the
tests with PSO optimization did not prove that the estimation of the
to the AIC of the ARIMA. Also, the tests with PSO optimization did not prove that the estimation of the
coefficients by PSO has significant importance in the ARIMA results, maybe because of the low number
of PSO iterations. PSO is more popular in parallel computing, where it can obtain better results.
Wu & Lu [8] used Neural Networks to predict future values of the S&P 500 Index. Their paper
compares Neural Networks performance against an ARIMA model and the Neural Network model out-
performed the ARIMA model only in a stable market. When dealing with volatile markets, the Neural
Network system only showed an accuracy of 23% and the ARIMA’s accuracy was 42%. The same
comparison was conducted by Kamruzzaman & Sarker [11] but applied to exchange rates. The Neu-
ral Networks were trained with back-propagation, scaled conjugate gradient, and back-propagation with
Bayesian regularization. The algorithm used technical indicators and outperformed the ARIMA model,
having an impressive accuracy of 80%. Considering these results, Gerlein et al. [8] stated that an
accuracy of 80% must be considered with care, since only the best results are reported and, in general,
machine learning techniques do not present such high levels of accuracy.
b) Machine Learning Approach
Machine learning is becoming very popular in all kinds of prediction. Some authors see the forecasting
challenge as a simple classification problem where they only want to classify the future in classes such
as up, down or sideways trends. Others use this classification only as a first stage.
Mandziuk et al. [12] used Neural Networks to train data in order to do a prediction for a 5-day pe-
riod of EUR/USD trading. The data that is used in Forex has a limited time of applicability so the input
data in forecasting models should change and should be as diverse as possible. In order to choose a
suitable subset of input variables to train with Neural Networks, Mandziuk et al. [12] used GA to perform
the selection process out of a large pool of diverse data sources available. Supposing that a chromo-
some consists of N data sources, the Neural Network has N inputs, N/2 hidden layers, and one output
(forecasted change of EUR/USD on the following day). If the output is positive, it means a purchase
opportunity. Results were compared with MACD, MA and CONTINUE methods, three deterministic
methods, and also with an early version of the proposed model. The neuro-evolutionary model proved
to make more than 56% correct decisions, 30% more than wrong ones, and it showed far more activity
than the rest of the algorithms in the comparison. The weighted version achieved more than 111% profit,
corresponding to more than 25% annual profit.
Yoo et al. [13], in their survey on machine learning techniques for stock market prediction, talk about
the use of Neural Networks, Support Vector Machines, and Case Based Reasoning. The researchers
admit that Neural Networks are gaining a lot of popularity in this field of study, but they have some
related issues, such as the black box problem, meaning that the significance of each variable is not
known, nor is it possible to understand how the network produces future prices. Another problem
with Neural Networks is the overfitting problem since Neural Networks fit the data too well and lose the
ability of generalization. This can be due to many nodes in the networks or long periods of training. Yoo
et al. [13] also assume Support Vector Machines to be very interesting when applied to classification
and regressions tasks in time series prediction related to financial applications. Unlike Neural Networks,
Support Vector Machines are resistant to overtraining achieving a high generalization performance and
one of the main advantages is that it is equivalent to solving a linear quadratic problem, having a unique
and globally optimal solution while Neural Networks have the danger of getting stuck at local minima.
Despite this, when entering with event information such as web mining information, Neural Networks
show better results.
Kim [14] also stated some of the limitations of Neural Networks, such as the overfitting problem and
the local optimal solution, and tried Support Vector Machines to solve the problem of predicting future
prices in the stock market. Technical indicators are used in this solution and the prediction is done in
a way that the output only takes values “0” if next day’s index is lower than today’s index, and “1” if
the next day’s index is higher than today’s index. In this solution, Support Vector Machine outperforms
Backpropagation Neural Networks and Case Based Reasoning, with a hit ratio of 57.8313%, Neural
Networks with 54.7332%, and CBR with 51.9793%. This study ends with a proposal of a Support Vector
Machine hyper-parameters optimization and also with the conclusion that low accuracies are a common
and expected result when dealing with capital markets since there is no single model perfectly suited in
all market conditions. Tay and Cao [15] compared Support Vector Machines against back-propagation
Neural Networks to forecast futures contracts, and the Support Vector Machines solution obtained better
accuracy (47.7%) than the Neural Networks solution (45.0%).
The same happened in the work reported by Chen and Shih [16] where the two techniques were
compared when applied to six Asian indices, with an accuracy of 57.2% for the Support Vector Machines
and 56.7% for the Neural Networks.
Putting Neural Networks aside and entering with K-Nearest Neighbors, Chen and Hao [17] pro-
posed a hybridized framework of the Feature Weighted Support Vector Machine (FWSVM) and Feature
Weighted K-Nearest Neighbors (FWKNN) to predict stock market indices. The FWSVM was used to
classify the technical indicators of the stock data and the output of the classification is either “1” or “-1”.
This value is then compared with the class label to compute the accuracy of the model. After this classi-
fication step, an FWKNN algorithm is used to find K nearest neighbors of the testing data and to evaluate
the mean of those neighbors to predict prices. The proposed model was applied to the Chinese stock
market (the Shanghai and Shenzhen stock indices) and the results were slightly better than for regular
Support Vector Machine with K-Nearest Neighbors approach (SVM-KNN) and sometimes even equal.
For Shanghai composite index, FWSVM-FWKNN is better than SVM-KNN for time horizons of 1, 5,
15, and 30 days, even though the results have very similar values of Mean Absolute Percentage Error
(MAPE) and RMSE. For a time horizon of 10 and 20 days, the values of MAPE are the same. When
observing the Shenzhen composite index results, the SVM-KNN and FWSVM-FWKNN are even more
similar.
Still within the scope of simpler algorithms, in his work with simple classifiers (instance-based classi-
fiers, decision trees, and rule-based learners), Barbosa [18] claimed outstanding financial results taking
advantage of the low computational requirements for both the training and the classifying process of
such algorithms.
Gerlein et al. [8] concluded in their work that models do not generalize well when using a large dataset, since points in time closer to the trading period being predicted are more likely to exhibit similar conditions. For testing accuracy and cumulative returns, Gerlein et al. [8] obtained the best setup with 1000 instances, retrained over 10 periods with five attributes. Even though the results were very satisfying, with accuracies reaching 53.70% and cumulative returns reaching 156.82%, the trading period between 2007 and 2009 does not show good performance, since it coincides with the economic crisis. This may suggest that simple machine learning algorithms can be useful in times of normal market conditions, but can be weak predictors in unusual periods.
2.2.2 Works on forecast concerning Big Data
This section describes forecasting works that took into consideration the rising problem of Big Data.
Liu and Wang [19] considered trading in financial markets a big data problem, since large transaction data for 120 futures contracts are produced every minute. Given this amount of data, the researchers decided to store it in a distributed way using Hive, a high-performance distributed data warehouse system. Several replicas are stored to keep the information safe and reliable. After the storage step, the data is processed with MapReduce. The control node distributes assignments while the computing nodes compute the features (maximum, minimum, count, summation, mean value, sigma, median value, and median absolute deviation (MAD)) and train the proposed Decision Tree with Support Vector Machine model (DT-SVM). First, the data is fetched from the distributed database and then split into different groups according to their time spans. Features are extracted from each group, each group being an input of the distributed system. After this step, the number of values larger or smaller than the mean value, the median value, and the number of values within the 3-sigma rule are calculated. Data is classified as "1" if the price has increased by a certain percentage, "-1" if it has decreased, and "0" if it remained the same. The hybrid model, DT-SVM, is then used to train on the data with the help of the statistical features. The hybrid model is needed to solve the data imbalance problem and to filter out the noise: Decision Trees filter most of the noise and leave data of good quality for the Support Vector Machine, which then handles the complexity of the data. The overall strategy is represented in Figure 2.5.
This model was compared to Bootstrap-SVM, Bootstrap-DT, and Back Propagation Neural Networks (BPNN) and outperformed these three strategies in precision rate, recall rate, and F1 rate. Using timestamps of 60 minutes, the precision of this model is about 70%. The reason for this is the use of a two-phase classifier to handle large amounts of information, bearing in mind the imbalance and noise that characterize a big data platform.
Figure 2.5: Representation of the problem architecture (adapted from [19]).
Liu et al. [20] used clustering to improve NNs in order to forecast a financial time series. Clustering is the task of grouping data points so that points within each cluster are similar to each other [21]. The main purpose of this grouping is that every group is used to train a corresponding neural network for prediction, so the model does not have to handle a big data group as a single input to the neural network. The clustering algorithm used in this research was fuzzy c-means, and the neural network is an RBF one, which converges faster and models more precisely than a back propagation neural network. To solve the data imbalance problem, an experimental data preprocessing method is used that handles normalization and smoothing in one process. To evaluate the system, the researchers adopt the Average Absolute Error (E) and Trend Accuracy in Direction (TAD). The RBF Neural Network with clustering was compared with the plain RBF in precision, efficiency, complexity, and outlier detection. Considering precision, RBF Neural Networks with clustering are more precise, the predicted value being closer to the real one; considering efficiency, smaller data groups lead to shorter training times, reducing the convergence time; considering complexity, clustering contributes to complexity reduction because each group has high similarity between its points; finally, considering outlier detection, clustering can efficiently detect outliers and keep them away from the training.
2.2.3 Summary
This chapter introduced all of the terms that will appear in the next sections and gave a brief explanation of each of them. More specifically, it presented a description of what a stock market is, the techniques used in this work to predict stock prices, and how this prediction is evaluated. The analysis of some related works is also presented in this chapter, since it is important to know what has been done in this thesis' field of research.
Table 2.4 is a summary of all the referenced works and is intended to show a comparison between them, in order to serve as a base and starting point when thinking about which techniques to use. Looking at this table, it is possible to see that Support Vector Machines, Neural Networks, and the ARIMA model are the most popular models. It is also possible to conclude that accuracy appears very often as an evaluation function, as do ROI and RMSE/MSE. To compare the models, only the accuracy is used, since it is presented in almost all of the works and is a good trade-off between error and profit, serving both goals: forecasting close to the real value and generating profit. In general, accuracies are between 50% and 60%, and accuracies far above those values need to be analyzed carefully [8]. Support Vector Machines and Neural Networks have the best accuracies compared to the other models. Looking at the "Period" column, most works use around five years of data, except [11] and [14], which used roughly ten years.
Based on this comparison table and the analysis made in this section, another table, Table 2.3, illustrates the advantages and disadvantages of the principal techniques presented in the related works.
Table 2.3: Algorithm comparison based on the Related Work.

NN
  Advantages:
    1. Capable of discovering non-linear relationships, which makes it ideal for modeling non-linear dynamic systems;
    2. One of the more accurate algorithms.
  Disadvantages:
    1. Overtraining problem, losing generality;
    2. Black box problem, not revealing the significance of each variable or the way independent variables are weighted;
    3. Danger of getting stuck at local minima.

SVM
  Advantages:
    1. Training is equivalent to solving a linearly constrained quadratic problem;
    2. Does not have the black box and overtraining problems;
    3. The solution is relatively unique and globally optimal.
  Disadvantages:
    1. Usually not as accurate as Neural Networks.

ARIMA
  Advantages:
    1. Good results for clear trends;
    2. Often used in forecasting research.
  Disadvantages:
    1. Usually not as good for irregular series;
    2. Less accurate in multiple-period-ahead forecasting.
Table 2.4: Summary of Related Work.

Work | Date | Method | Financial Application | Period | Evaluation Function | Comparison | Accuracy
[9]  | 2016 | ARMA | London Stock Exchange and S&P 500 | 2007-2013 | MAE, MAPE, MDAPE, SMDAPE, MASE | Monthly vs. S&P 500 | 24.9%
[10] | 2014 | GA-ARIMA | Microsoft Stock Price Index (MSFT) | 4 years | BIC, AIC, MSE | ARIMA | -
[11] | 2003 | NN | USD/AUD, GBP/AUD, JPY/AUD, SGD/AUD, NZD/AUD, CHF/AUD | 1991-2002 | NMSE, MAE, DS | ARIMA | 80%
[12] | 2016 | NN | EUR/USD | 2009-2014 | Profit, Efficiency, Weighted Efficiency | MACD, MA, CONTINUE | -
[14] | 2003 | SVM | Stock KOSPI | 1989-1998 | Accuracy | NN, CBR | 57.8313%
[15] | 2001 | SVM | Stocks CME-SP, CBOT-US, CBOT-BO, EUREX-BUND, MATIF-CAC40 | 1992-1999 | Accuracy | NN | 47.7%
[16] | 2006 | SVM | Futures NK, AU, HS, ST, TW, KO | 1984-2002 | MSE, NMSE, MAE, DS, WDS | NN | 65.333%
[17] | 2017 | FWSVM-FWKNN | Stocks SSE, SZSE | 2008-2014 | MAPE, RMSE | SVM-KNN | -
[18] | 2011 | IBC, DT, RBL | Exchange Rates and Stocks | 2007-2009 | MDD, ROI, RMD | Between each other | 21.9%
[8]  | 2016 | NB, K* model, C4.5, LMT, OneR | Exchange Rates and Stocks | 2007-2013 | Accuracy, ROI | Between each other | 53.70%
Chapter 3
Proposed Architecture
The proposed solution intends to predict the future values of a share taking into consideration the past
values of the same share. One statistical algorithm and two machine learning algorithms will be used in
order to solve the proposed problem of forecasting in the stock market. This chapter will go deeper into
the architecture of this solution.
3.1 Architecture Design
This section describes the architecture design of the proposed solution for forecasting future prices in the stock market. The module developed in this work was implemented in Python and can be divided into three layers, as shown in Figure 3.1.
Figure 3.1: Proposed architecture.
The overall solution uses a statistical method, ARIMA, and two machine learning algorithms, SVR and KNN, to solve the same problem, i.e. forecasting future prices of a share. To perform this prediction, past price sequences are used as inputs to the algorithms. After fitting ARIMA and the machine learning algorithms to the data, the result of the forecast is evaluated using different metrics and evaluation parameters. The algorithms are also compared with each other to draw some conclusions about which of them is better at solving the problem in different contexts. The choice of these three specific algorithms was related not only to the state of the art but also to the motivation of trying three different methods to solve the same problem. ARIMA was chosen because it is one of the most popular statistical models and is well documented. In the machine learning field, since Support Vector Machines were showing such good results in classification tasks [13], SVR was chosen to check the behavior of Support Vectors in regression tasks. Finally, KNN is introduced as a representative of simple, lazy machine learning algorithms, to see if a simple non-parametric method can achieve good results in forecasting tasks.
In the architecture design, the Data Layer is responsible for fetching the stock price series needed as inputs to the Train Layer. After the data is fetched, it is divided into two groups: the training set and the testing set. The training data is used to train the algorithm in question (ARIMA or a machine learning algorithm), and the testing dataset is used as a comparison term for the forecast results.
The Train Layer is responsible for the hyper-parameter optimization and model training using the training dataset. This layer is divided into a statistical sub-layer and a machine learning sub-layer, since these two approaches are considered and implemented. This separation is also observable in Figure 3.1.
After the right model is chosen and trained, the forecast can start. The Forecast and Validation Layer is then concerned with the forecasts and their results, as well as their comparison. One-step (daily) and multi-step (weekly and monthly) forecasts will be presented using different evaluation metrics, in order to show a wide comparison between the statistical method and the machine learning algorithms.
3.2 Architecture Implementation
3.2.1 Data Layer
The Data Layer is responsible for treating the price sequences that will serve as inputs to the Train Layer. The two main reasons for this layer to exist are:
1. Transform the .csv files that contain the daily open, close, highest, and lowest prices of a stock into a time series with the close price;
2. Separate the data into training and test sets.
The transformation of a comma-separated values file (.csv file) into a time series takes place in a function that uses the pandas library, and the pseudo-code describing this small function is presented in Figure 3.2.
Figure 3.2: Pseudo-code for the transformation of a .csv file into a time series.
When the data is in a time series format, it is divided into training and test datasets. The reason behind this separation is that the algorithms cannot be evaluated on the same data they were trained on, since that would lead to overfitting and a loss of generality. Overfitting is the inability of a model to generalize to out-of-sample data. It usually happens when too many features are considered and the model fits the training set with a cost function of almost zero. Conversely, underfitting is when the model does not follow the behavior of the training set. Both situations are problematic and, for a better understanding, they are illustrated in Figure 3.3.
(a) Overfitting (b) Underfitting
Figure 3.3: Overfitting and Underfitting.
The training dataset is then used for learning and tuning the hyper-parameters of each of the algorithms, and the test dataset is used to assess the performance of the algorithm in question. Taking this into consideration, the percentage chosen for the training set was 80% and for the test set 20%. This procedure is illustrated in Figure 3.4.
Figure 3.4: Data separation into Training and Test sets.
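The 80/20 split of Figure 3.4 can be sketched as below. This is a minimal, library-free illustration, not the thesis code; note that, unlike a random split, the chronological order of the series is preserved.

```python
def train_test_split_series(values, train_frac=0.8):
    """Chronological split: the first 80% of the series is used for
    training and the remaining 20% for testing (Figure 3.4)."""
    cut = int(len(values) * train_frac)
    return values[:cut], values[cut:]

# Toy series of 10 "prices": 8 go to training, the last 2 to testing
train, test = train_test_split_series(list(range(10)))
```

Keeping the test set strictly after the training set in time avoids look-ahead bias, which a random split would introduce.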
The Data Layer ends when a time series is ready to be used by the algorithms in question. After this step, the training dataset enters the Train Layer, where each of the algorithms is trained using this set of data.
3.2.2 Train Layer
The Train Layer, as the name suggests, is responsible for training each of the three algorithms developed
in this work using the training dataset. Training a model is applying it to a training dataset so that the
model can perceive hidden patterns and mapping relationships that help it perform well during the test
period. This layer is divided into two sub-layers, the statistical and the machine learning one. The
reason behind this separation is the existence of common steps in the implementation of the machine
learning algorithms that do not make sense in the implementation of a statistical model and vice versa.
In the statistical sub-layer the ARIMA model is implemented, and in the machine learning sub-layer, the
Support Vector Machine and K-Nearest Neighbor are implemented.
a) Statistical Sub-Layer
The only statistical method implemented is an ARIMA(p,d,q) model. To use the ARIMA model, the Python statistical library statsmodels.tsa.arima_model is used, and to conduct the stationarity analysis, adfuller is imported from statsmodels.tsa.stattools. The statsmodels.tsa package contains model classes for time series analysis, including ARIMA, as well as related statistical tests such as the Dickey-Fuller test. It is built on top of NumPy and SciPy, and it also integrates with pandas.
The process of training the ARIMA(p,d,q) model with the training dataset is described in three steps: 1) analyze the dataset's stationarity and induce stationarity if needed; 2) find the optimal ARIMA parameters p, d, and q; 3) fit the best ARIMA(p,d,q).
Step 1: Check for stationarity
The first thing to check when applying the ARIMA model is the stationarity of the time series. Price sequences are not usually stationary and typically need some transformations before they show stationary behavior. To check for stationarity, a Dickey-Fuller test is conducted. For this test, the regression model y′t = α + βt + φyt−1 + γ1y′t−1 + γ2y′t−2 + ... + γky′t−k is estimated, where y′t denotes the first-differenced series and k the number of lags to include in the regression. If the original series is non-stationary, then the coefficient φ should be approximately zero. The null hypothesis states that the data is non-stationary; if φ is significantly less than zero, the hypothesis is rejected, meaning that the series is stationary. Based on this stationarity test, two things can be done: 1) stabilize the variance and 2) stabilize the mean. To stabilize the variance, a non-linear transformation should be applied; in this case, a logarithmic one. To stabilize the mean, a first difference is applied, i.e. the differences between consecutive observations are computed. This process should be repeated until the series is stationary. Once the series is stationary, the model can be trained using all the past prices up to instant t.
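The two stabilizing transformations of Step 1 can be sketched as below (a library-free toy illustration with made-up prices, not the thesis code): a logarithmic transform stabilizes the variance, and a first difference stabilizes the mean.

```python
import math

def log_transform(series):
    """Logarithmic transform to stabilize the variance."""
    return [math.log(x) for x in series]

def difference(series):
    """First-difference the series to stabilize the mean, i.e. compute
    the differences between consecutive observations."""
    return [b - a for a, b in zip(series, series[1:])]

# Toy price series growing by 10% each step (an exponential trend)
prices = [100.0, 110.0, 121.0]
stationary = difference(log_transform(prices))
# Log-differencing turns the exponential trend into constant log-returns
```

On this toy series both differenced values equal log(1.1), so the transformed series has a constant mean, which is what the Dickey-Fuller test is checking for.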
Step 2: Finding the best order of p, d, and q
Once the series has a constant mean and variance, the next step is the choice of the right ARIMA model. Selecting the right order means finding the best combination of the three parameters (p,d,q) for a specific training dataset. The only thing that the training algorithm needs to know is the minimum and maximum values that these parameters can take, described in Table 3.1.

Table 3.1: ARIMA parameters.

Parameter   Range
p           [1, 5]
d           [1, 2]
q           [1, 5]

The search for the best ARIMA parameters can be done in various ways, but since the values of p and q are not usually greater than 5, a brute-force search is used to find the best possible combination of parameters. The ACF and PACF plots can help in this task by giving an idea of which lags are correlated, possibly narrowing the search. For each set of parameters, the model is evaluated using the mean squared error (MSE) as the selection criterion. The best combination is the one with the lowest MSE.
Step 3: Fit the model
After choosing the best ARIMA model based on the lowest value of MSE, the model is fitted to the data and future prices can be calculated. ARIMA uses all the information (prices) from the past as inputs.
The pseudo-code for the ARIMA implementation is in Figure 3.5.
Figure 3.5: Pseudo-code for the ARIMA implementation.
b) Machine Learning Sub-Layer
In this sub-layer, two algorithms are implemented: Support Vector Regression (SVR) and K-Nearest
Neighbors (KNN).
To use Support Vector Regression, the scikit-learn library is used. Scikit-learn is an open-source machine learning library for Python, sponsored by INRIA, Telecom ParisTech, and Google through the Google Summer of Code. From scikit-learn, the sklearn.svm implementation is used. Support Vector Regression follows the logic described in Section 2.1.
The implementation of K-Nearest Neighbors is very similar to the Support Vector Regression implementation. The library used is scikit-learn as well, more specifically the sklearn.neighbors module with KNeighborsRegressor.
As the two techniques are machine learning algorithms that use supervised learning, there are some implementation steps in common. These steps are described below and their structure is the following:
1. Find the right number of features;
2. Tune the hyper-parameters of each model;
3. Fit each of the two models.
Step 1: Choosing features and targets
Initially, the dataset that enters this layer is in a time series format. Taking into consideration that the two algorithms use supervised learning, it is necessary to transform the dataset into a format that matches the supervised learning problem. The concept of supervised learning was introduced in Section 2.1; the short idea is to have input variables called features (X), output variables called targets (y), and an algorithm to learn the mapping function from the input to the output. The goal is to approximate this mapping so well that, given new input data (X), it is possible to predict the output variables (y) for that data. Time series data can be phrased as a supervised learning problem, with features and targets, by using previous time steps as input variables (features) and the next time step as the output variable (target).
The use of previous time steps to predict the next time step is called the sliding window method, and the number of previous time steps is called the window width. This sliding window method is the basis of turning any time series into a supervised learning format, and once a time series is prepared this way, any of the standard linear and nonlinear machine learning algorithms can be applied. The window width then corresponds to the number of features. An example of the transformation of a time series into a supervised learning problem with two features and one target is illustrated in Figure 3.6. The (t − 1) and (t − 2) columns are the two features, and the Close Price is the target.

t            1   2   3   4   5   6   7   8
Close Price  10  20  30  40  50  60  70  80

t − 2   t − 1   Close Price
10      20      30
20      30      40
30      40      50
40      50      60
50      60      70
60      70      80

Figure 3.6: Transformation of a time series into a supervised learning format.

The window width, in other words the number of features, is not known at the beginning and has to be calculated. This width can take any value, but there is one optimal width that maximizes the mapping. To find it, windows of width 5, 10, 15, and 22 are tested on the training dataset. These values correspond roughly to one, two, three, and four weeks of prices, since the stock market closes during weekends. The reason why the range does not go beyond 22 is that today's price is more influenced by the last 22 days than by the days before. For each iteration along the window width range, the machine learning algorithm is trained.
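The sliding window transformation of Figure 3.6 can be sketched as below (a minimal, library-free sketch, not the thesis code).

```python
def sliding_window(series, width):
    """Turn a time series into supervised-learning (features, target)
    pairs: the previous `width` values are the features and the next
    value is the target (the sliding window method of Figure 3.6)."""
    X, y = [], []
    for i in range(width, len(series)):
        X.append(series[i - width:i])  # previous `width` prices = features
        y.append(series[i])            # next price = target
    return X, y

# The toy series of Figure 3.6, with a window width of 2
X, y = sliding_window([10, 20, 30, 40, 50, 60, 70, 80], width=2)
```

With width 2, the first training pair is ([10, 20], 30) and the last is ([60, 70], 80), exactly the rows of the supervised table in Figure 3.6.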
Step 2: Tuning hyper-parameters
For each iteration along the window width range, the machine learning algorithm is trained to find the best set of hyper-parameters for that window width. The search for the best set of hyper-parameters is done with a RandomizedSearchCV, implemented by sklearn.model_selection.RandomizedSearchCV. This method is similar to a GridSearchCV, where a Python dictionary is created with combinations of the algorithm's hyper-parameters, and then those combinations are tested and scored by the resulting MSE. The difference between RandomizedSearchCV and GridSearchCV is that in RandomizedSearchCV not all the combinations of hyper-parameter values are tried out; instead, a fixed number of parameter settings is sampled from the specified distributions, reducing the computational effort. The number of parameter settings that are tried out can be chosen.
When calculating the best hyper-parameters, it is important to avoid training and validating on the same data, since this can cause overfitting. The cross-validation method is used to split the data into training and validation sets. The logic behind this method is simply to divide the data into k folds, using k − 1 folds for training and the remaining one for validation and cost calculation (MSE). The process is repeated k times, and the validation fold alternates in each of the k rounds. After the k iterations, the average cost over the validation sets is calculated. In the end, there is an average cost for each set of parameters, and the one with the lowest cost is chosen. This process is illustrated in Figure 3.7. It is important to notice that these training sets are not the same as the one from Figure 3.4. The training set from Figure 3.4 is represented in Figure 3.7 as "All training dataset", and the training and validation sets are partitions of that "All training dataset".
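The k-fold scheme can be sketched as below; `fit_and_score` is a hypothetical stand-in for training a model on one split and returning its validation MSE, so the sketch stays library-free.

```python
def kfold_average_cost(n_samples, k, fit_and_score):
    """Average validation cost over k folds (the scheme of Figure 3.7).
    `fit_and_score(train_idx, val_idx)` stands in for training the model
    on one split and returning its validation MSE."""
    fold = n_samples // k
    costs = []
    for i in range(k):
        # Fold i is held out for validation; the rest is used for training
        end = (i + 1) * fold if i < k - 1 else n_samples
        val = list(range(i * fold, end))
        train = [j for j in range(n_samples) if j not in val]
        costs.append(fit_and_score(train, val))
    return sum(costs) / k

# Toy scorer that just reports the validation-fold size: with 9 samples
# and k=3, every fold holds out 3 points, so the average is 3.0
avg = kfold_average_cost(9, 3, lambda train, val: len(val))
```

Each sample appears in exactly one validation fold, so every data point contributes to the averaged cost exactly once.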
In summary, for each window width the best set of parameters is calculated. At the end of all the iterations through the window width range, the scores of the best parameters for each window are compared, and the one with the lowest MSE is chosen. The result is then the optimal window width (number of features) and the parameter values.
Figure 3.7: Cross-Validation with K=3.
In Support Vector Regression, the three hyper-parameters and their test ranges are described in Table 3.2. The C parameter is called the soft margin. A small value of C allows ignoring points close to the boundary, increasing the margin, while a large value of C assigns a large penalty to errors and margin errors, so the margin is smaller in those cases. The decision boundary is also affected by the kernel, which is commonly either linear, polynomial, or Gaussian, also known as Radial Basis Function (RBF). The degree of the polynomial kernel and the width parameter of the Gaussian kernel, gamma, influence the flexibility of the decision boundary.
Table 3.2: SVR parameters.

Parameter   Range
C           1, 10, 100, 1000
Kernel      Linear, polynomial, RBF
Gamma       0.1, 0.01, 0.001, 0.0001
In K-Nearest Neighbors, there is only one hyper-parameter to tune, described in Table 3.3 together with its test range. The value of K is the number of nearest neighbors used to predict the value in question.

Table 3.3: KNN parameters.

Parameter   Range
K           [1, 50]
The process up to this point is described in the pseudo-code of Figure 3.8.
Figure 3.8: Pseudo-code for feature selection and hyper-parameters tuning.
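Assuming the standard scikit-learn API, the tuning step for SVR might look roughly like the sketch below. The tiny synthetic series and the window width of 5 are illustrative assumptions, while the parameter ranges come from Table 3.2; a time-series-aware splitter is used in place of plain k-fold so validation folds always come after their training folds.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

# Toy price series (an assumption for illustration, not the thesis data)
prices = np.sin(np.arange(60) / 5.0) + np.arange(60) / 30.0

# Sliding window transform: last 5 prices as features, next price as target
width = 5
X = np.array([prices[i - width:i] for i in range(width, len(prices))])
y = prices[width:]

# Hyper-parameter ranges from Table 3.2
param_dist = {"C": [1, 10, 100, 1000],
              "kernel": ["linear", "poly", "rbf"],
              "gamma": [0.1, 0.01, 0.001, 0.0001]}

# Sample 10 settings at random and score them by (negated) MSE
search = RandomizedSearchCV(SVR(), param_dist, n_iter=10,
                            scoring="neg_mean_squared_error",
                            cv=TimeSeriesSplit(n_splits=3), random_state=0)
search.fit(X, y)
best = search.best_params_
```

The same skeleton works for KNN by swapping SVR for KNeighborsRegressor and the dictionary for `{"n_neighbors": range(1, 51)}`.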
Step 3: Fit the model
In this step, the model is fitted using the best window width and hyper-parameters calculated. For example, for Support Vector Regression, if the optimal number of features discovered is 10, and the optimal hyper-parameters are [C=1, gamma=0.1, kernel='rbf'], SVR will use sequences of the last 10 prices to characterize the next instant, and tomorrow's price will be predicted using a soft margin of 1 and a Gaussian kernel with an inverse-width parameter of 0.1. Similarly, for K-Nearest Neighbors, if the optimal number of features discovered is 20, and the optimal hyper-parameter is [K=10], KNN will use sequences of 20 prices to characterize the next instant, and it will use the 10 closest neighbors to predict tomorrow's price.
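The prediction rule that KNN applies here can be illustrated with a minimal, library-free sketch (the thesis itself uses sklearn's KNeighborsRegressor); the training windows and the query window are made-up numbers.

```python
def knn_predict(X_train, y_train, x_new, k):
    """Predict the next price as the mean target of the K training
    windows closest (in squared Euclidean distance) to the new window."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    ranked = sorted(zip(X_train, y_train), key=lambda xy: dist(xy[0], x_new))
    return sum(target for _, target in ranked[:k]) / k

# Made-up training windows (width 2) and their next-day prices
X_train = [[10, 20], [20, 30], [30, 40], [40, 50]]
y_train = [30, 40, 50, 60]

# The two nearest windows to [32, 41] are [30, 40] and [40, 50],
# so the prediction is the mean of their targets: (50 + 60) / 2
pred = knn_predict(X_train, y_train, [32, 41], k=2)
```

Being a lazy learner, KNN does no work at "training" time; all the cost is in this neighbor search at prediction time.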
3.2.3 Forecast and Validation Layer
The last step for each of the studied models is the validation of the predictions using the test dataset. There are four evaluation metrics calculated in this layer: mean absolute error (MAE), return on investment (ROI), Sharpe Ratio (SR), and accuracy.
The mean absolute error is implemented with sklearn.metrics, which contains the mean_absolute_error function. This function takes as inputs the actual values and the predicted values, calculates the absolute deviation between each pair of predicted and actual values, and computes the average of these deviations. To accomplish this calculation, the predicted value of each trade is appended to a list called "predictions" that serves as an input to this mean absolute error function. The pseudo-code of this calculation is in Figure 3.9.
Figure 3.9: MAE calculation.
To calculate the Return on Investment, a strategy must be implemented to enter and exit the market. Since the focus of this work is forecasting prices and not optimizing strategies, a very simple strategy is implemented, just to evaluate how well a strategy would work depending on the prediction made. Depending on the prediction for the following day, one of two actions is taken in each trade:
1. If the predicted price is greater than today's price, one should buy shares (go long), and a variable named strategy takes the value 1;
2. If the predicted price is smaller than today's price, one should "short" the stock, and the strategy variable takes the value −1.
After the strategy definition, the profit for that same trade is calculated using today's price as the "cost of the investment", tomorrow's actual price as the "gain of the investment", and multiplying the relative change by the strategy variable, 1 or −1. For example, if the predicted price reflects an increase in the share price, the strategy takes the value 1, and if the price actually increases, say from $50 to $75, the profit is 0.50 times 1, a positive profit. If, on the other hand, the price dropped to $25, the profit is −0.50 times 1, a negative profit, since the trader was wrong about tomorrow's price. If a "go long" position was already taken and the chosen strategy is again "go long", the position is maintained. The same happens for the short strategy. If the trader owns a share and the chosen strategy has the value −1, this is equivalent to selling and immediately "going short" on that share. The pseudo-code of this calculation is in Figure 3.10.
Figure 3.10: ROI calculation.
The Sharpe Ratio is calculated at the end of the test period using the profits accumulated along the trading period, dividing their mean by their standard deviation. For that, a list called profits, containing the profit of each trade, is used as the input to the Sharpe Ratio function. The pseudo-code of this calculation is in Figure 3.11.
Figure 3.11: Sharpe Ratio calculation.
Finally, the accuracy counts the number of times the strategy was in agreement with reality, in other words, the number of times that the strategy chose long or short and the price actually rose or dropped, respectively. The pseudo-code of its calculation is in Figure 3.12. A right guess generates a positive profit, so the accuracy is the number of positive profits divided by the number of trades. To count the right guesses, each time a positive profit is calculated, a variable that counts the number of right guesses (nr_right_guesses) is incremented; if the profit is negative, the variable keeps its value. The accuracy is then the result of this variable divided by the number of trades (t).
Figure 3.12: Accuracy calculation.
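The four metrics and the long/short strategy described above can be combined in one library-free sketch. The toy prices and simplified relative-profit bookkeeping (no held-position logic) are assumptions for illustration; the thesis uses sklearn.metrics for the MAE.

```python
import statistics

def evaluate(actual, predicted):
    """Compute MAE, cumulative ROI, Sharpe Ratio and accuracy for one
    test period, applying the simple long/short strategy per trade."""
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    profits = []
    for t in range(len(actual) - 1):
        # Go long (1) if tomorrow's predicted price beats today's, else short (-1)
        strategy = 1 if predicted[t + 1] > actual[t] else -1
        # Relative profit: today's price is the cost, tomorrow's the gain
        profits.append(strategy * (actual[t + 1] - actual[t]) / actual[t])
    roi = sum(profits)
    # Sharpe: mean profit over its standard deviation (needs >= 2 varied trades)
    sharpe = statistics.mean(profits) / statistics.stdev(profits)
    accuracy = sum(p > 0 for p in profits) / len(profits)
    return mae, roi, sharpe, accuracy

actual    = [50.0, 75.0, 60.0, 66.0]   # toy test-period prices
predicted = [51.0, 70.0, 55.0, 70.0]   # toy model predictions
mae, roi, sharpe, accuracy = evaluate(actual, predicted)
```

On this toy run every directional call is right (accuracy 1.0): the first trade matches the $50 to $75 example in the text, earning a profit of 0.50.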
These metrics are not individually calculated, even though the pseudo-code for each metric is shown separately. This separation is intended only to explain the implementation of each specific metric, for better understanding and replication. Figure 3.13 shows how the calculation of all these metrics is integrated in the prediction period.
Figure 3.13: Evaluation metrics calculation.
When validating the models, different stocks are considered as well as different trading periods in
order to evaluate the model’s performance in different situations.
3.2.4 Summary
This chapter presented the overall architecture. The implementation, from the Data Layer to the results evaluation, is fully described, as well as the pseudo-code for each of the algorithms.
Chapter 4
Results
This chapter describes the performance of each of the three implemented algorithms, following the architecture described in Chapter 3, in the task of forecasting future prices of a quoted stock. Since stocks can have very different behaviors, two different stocks were chosen: one with a clear increasing trend, and one considered to have no trend, with ups and downs. The three algorithms are tested on these two stocks with a test period of 1 year. The stock market is closed during the weekend, so a trading period of 1 year corresponds to roughly 251 trading days, 1 week corresponds to 5 trading days, and 1 month to 22 trading days. The chapter is divided into five sections, as enumerated below:
1. ARIMA performance: this section describes how ARIMA behaves with a clear-trend stock and with a sideways stock, starting with the process of turning the series stationary and ending with a conclusion about the obtained results and a comparison with Buy&Hold (B&H) and ARIMA-related works;
2. KNN performance: this section describes how KNN behaves also with a clear trend stock and with
a sideways stock, going through the calculation of its K parameter and ending with a conclusion of
the obtained results and a comparison with B&H and previous works related to KNN;
3. SVR performance: this section describes how SVR behaves with the same clear trend stock and
sideways stock, discriminating the hyper-parameters that were used by the algorithm for each
situation, ending with a conclusion of the obtained results and a comparison with B&H and state-
of-the-art;
4. Comparison of the three models taking into account the obtained results in the three previous
sections;
5. Studying the impact of retraining KNN and SVR.
Stocks with clear trends are usually easy to identify because they have very strong increasing or decreasing behaviors. Even though this is often clear just by looking, a time series can sometimes be difficult to analyze, so it is common to decompose the time series to check for this trend. This decomposition results in the identification of a trend, a seasonal component, and noise (the random variation in the series). The stock used as the "clear trend stock" is VeriSign, Inc. (VRSN) over a five-year period from 2013 to 2017. The graphical representation of the time series is shown in Figure 4.1, as well as its decomposition into trend, seasonality, and noise components, obtained with matplotlib. As can be seen in Figure 4.1, the stock has a very strong and clear trend but does not have any kind of seasonality, since the variation of the seasonal component is only 0.0001 units.
Figure 4.1: VRSN Stock Decomposition.
There are some stocks that do not have a clear behavior: in general, they neither constantly increase
nor constantly decrease. These quoted companies have ups and downs in their growth, and Franklin Templeton
Investments (BEN stock) is one of those sideways stocks. The graphical representation of this stock is
shown in Figure 4.2, as well as its decomposition into trend, seasonality and noise components. As is
observable, the trend component does not show any consistent uptrend or downtrend, and the values for
seasonality are again very low and insignificant.
The three algorithms are then tested on these two different stocks and compared with each other, also
considering the B&H strategy and the collected state-of-the-art. The Buy&Hold strategy, also known as
B&H, is a common strategy, especially in the stock market, that consists in buying a share and holding it for
months or even years, expecting that at some point in time it will yield a profit. The Short&Hold strategy, also
known as S&H, consists in shorting a share and waiting months or even years with the same expectation.
The Random Walk theory suggests that share prices take a random
and unpredictable path, since price movements are independent of each other, so a past movement does not
influence a future movement. For each case study, the evaluation metrics will be the ones described in
Chapter 2.1: Mean Absolute Error (MAE), Return on Investment (ROI), Sharpe Ratio (SR) and Accuracy.
Also, for each of the case studies, daily, weekly and monthly forecast periods will be presented.
Figure 4.2: BEN Stock Decomposition.
4.1 ARIMA Performance
In this section, ARIMA is tested with a clear trend stock and a sideways stock. For each stock, daily, weekly
and monthly forecasts will be performed and the results will be discussed, also taking into account the
B&H strategy. The ARIMA training set for both stocks corresponds to 4 years, from 2013-02-08 to 2017-
02-08, corresponding to 80% of the total dataset. The test set corresponds to 1 year, from 2017-02-09
to 2018-02-07. This separation is illustrated in Table 4.1.
Table 4.1: ARIMA data.

Parameters     Range
Training Set   from 2013-02-08 to 2017-02-08
Test Set       from 2017-02-09 to 2018-02-07
Before applying ARIMA to any kind of stock, the stationarity of the series must be assured. After the
time series becomes stationary, the model can be fitted and the predictions can be made. ARIMA uses all
previous data during the training phase, and the same happens in the test period. For example, if the
goal is to forecast tomorrow's price and there is data available from the last 5 years, ARIMA uses all that
data during the training phase, and when forecasting tomorrow's price, all the data serves as input.
4.1.1 Stock with a Clear Trend
This subsection describes the ARIMA performance with a clear trend stock, the VeriSign, Inc. (VRSN)
stock. The time series representing VRSN is not stationary, since it has an increasing
behavior and a stationary time series must have a constant mean and variance, as explained in the
ARIMA background (Section 2.1). This can be verified with a Dickey-Fuller test, since it can sometimes
be hard to be sure about the stationarity of a time series just by looking at it. The results of this test are
described in Figure 4.3.
Figure 4.3: Results of Dickey-Fuller Test for the original series.
A stationary time series must have a Test Statistic smaller than the Critical Values, and this is not the
case here: looking at the Dickey-Fuller test of Figure 4.3, it is possible to see that the Test Statistic
is higher than the Critical Values. To solve this problem, two things have to be done: stabilize the
variance, and stabilize the mean. To stabilize the variance, a logarithmic transformation is applied, and
the first differences are calculated in order to make the mean constant. After these transformations, a
second Dickey-Fuller test is conducted, Figure 4.4, to see if the series is now stationary. The Dickey-
Fuller test of Figure 4.4 confirms that, after the transformations, the series is stationary, since the Test
Statistic is lower than the Critical Values, and the ARIMA model can now be applied.
Figure 4.4: Results of Dickey-Fuller Test for the stationary series.
The next step is to find the best ARIMA order to predict daily, weekly and monthly prices. All three
approaches use the same stationary time series, so the step above is done only once.
For the daily forecast, 251 days are predicted and, for each prediction, a strategy is calculated. In
order to find the best combination of parameters, all combinations of ARIMA(p,d,q) were tested, with p
and q ranging between 0 and 5 and d ranging between 0 and 2. In total, 50 executions were
made in order to find the best combination based on the lowest MSE. The result of this brute search
is ARIMA(2,1,0), with two autoregressive terms, one order of differencing (as expected) and zero
moving average terms.
For the weekly forecast, 49 points are predicted and, for each prediction, a strategy is calculated.
When forecasting more than one value ahead with ARIMA, the number of out-of-sample points is set
by the "steps" parameter of the ARIMAResults.forecast() function. In the weekly situation, the number of
steps is 5, and the output of the forecast function is the 5th out-of-sample point. In order to find the best
combination of parameters for the weekly forecast, again all combinations of ARIMA(p,d,q) were tested,
with p and q ranging between 0 and 5 and d ranging between 0 and 2. In total, 50 executions were
made in order to find the best combination based on the lowest MSE; this time, the MSE is relative to the
weekly forecast values. The result of this brute search is ARIMA(0,1,3), with zero autoregressive
terms, one order of differencing (as expected) and three moving average terms.
Finally, the monthly forecast is a prediction for the next 22 days, so in the end 11 points are predicted
and, for each prediction, a strategy is calculated. In the monthly situation, the number of steps is 22 and
the output of the forecast function is the 22nd out-of-sample point. In order to find the best combination
of parameters for the monthly forecast, all combinations of ARIMA(p,d,q) were tested, with p and q
ranging between 0 and 5 and d ranging between 0 and 2. In total, 50 executions were made in
order to find the best combination based on the lowest MSE; this time, the MSE is relative to the monthly
forecast values. The result of this brute search is ARIMA(0,1,3), with zero autoregressive terms,
one order of differencing (as expected) and three moving average terms.
The ARIMA performance results for forecasting daily, weekly and monthly future prices of a clear
trend stock, in this case the VRSN stock, are detailed in Table 4.2.
Table 4.2: ARIMA results for a clear trend stock.

Forecast   MAE     ROI     Sharpe Ratio   Accuracy
Daily      0.695   34.5%   1.906          57.8%
Weekly     1.256   36.7%   2.752          67.3%
Monthly    2.025   40.7%   3.556          90.9%
The best results in terms of error correspond to the daily forecast, while the best returns,
Sharpe ratio and accuracy correspond to the monthly forecast.
Starting with the mean absolute error, it is reasonable that the error should be lower for a daily
forecast, since all previous time steps are known and real. On the other hand, when forecasting more
than one day ahead, there is a gap between the last real known price and the predicted price. This gap
is 5 days in a weekly forecast and 22 days in a monthly forecast, so it is reasonable to conclude that
multi-step-ahead forecasting reduces the quality of the predictions, increasing the deviation from the
actual values, which is reflected in a greater MAE.
Concerning the returns, as explained in the metrics section of the background, they are calculated
based on a simple strategy that relies on the forecasted values. Briefly, if the predicted price is higher
than the last known value, the strategy is "going long" on that share; if the predicted price is lower
than the last known value, the strategy is "going short" on that share. It is important to compare these
returns with the B&H strategy to frame the results in a fair context. The B&H strategy for this
specific stock gives a return of 31.2% over the same period as the test period of this work. ARIMA gives
better results in all three cases, with returns above 34%, even though the implemented strategy is
very simplistic. This means that, after one year, a trader entering the market with $1000 and using this
strategy based on the ARIMA predictions would end up with roughly $340 more. The returns are best for the
monthly forecast, reaching 40.7%. This is probably explained by the reduced number of trades
executed in a monthly forecast: only 11, compared to the 251 of the daily case. By reducing the
number of trades, even though the error is larger in this case, the returns are higher because
there are fewer opportunities for the strategy to fail. The same happens for the weekly forecast.
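The long/short rule and its compounded return can be sketched as follows. This is a toy illustration of the simple strategy described above; position sizing and transaction costs are ignored, and the three trades are invented numbers:

```python
import numpy as np

def strategy_return(last_known, predicted, actual):
    """Go long when the predicted price is above the last known price,
    short otherwise; return the realized fractional gain of that trade."""
    move = (actual - last_known) / last_known
    return move if predicted > last_known else -move

# Toy example: three trades as (last price, prediction, outcome).
trades = [(100.0, 103.0, 102.0),   # long, price rose  -> gain
          (102.0,  99.0, 101.0),   # short, price fell -> gain
          (101.0, 104.0, 100.0)]   # long, price fell  -> loss

returns = [strategy_return(*t) for t in trades]
roi = float(np.prod([1 + r for r in returns]) - 1)  # compounded ROI
print(round(roi, 4))  # -> 0.0198, i.e. roughly +2%
```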
For the same reason, the accuracy of the monthly forecast is higher, since there are fewer opportunities
for the strategy to fail. The accuracy divides the number of right guesses by the total number of
trades, and in the monthly multi-step forecast the number of trades is reduced to 11.
Sometimes there are strategies that lead to very optimistic returns but carry very high risk. This
is not ideal, which is why hedge funds often want to maximize return while minimizing
risk, sometimes preferring strategies that lead to lower profits but have a more comfortable risk. In this
context, the Sharpe ratio should be analyzed, since it reflects the trade-off between return and risk. In this
case, the Sharpe ratio follows the returns and has very good values in all three ranges: more than 1 is
already considered good, and more than 2 is considered very good, confirming that the ARIMA
results are indeed good results.
4.1.2 Sideways Stock
This subsection describes the ARIMA performance with a sideways stock, the Franklin Templeton
Investments stock (BEN). This series does not seem to be stationary, since it does not show a constant mean
and variance. The results of the Dickey-Fuller test are described in Figure 4.5. Although the Test Statistic
value is not as high as the one for the VRSN stock, it is still greater than the Critical
Values.
Figure 4.5: Results of Dickey-Fuller Test of the original series.
In order to have a stationary time series, both the variance and the mean should be stabilized. To do so, a
logarithmic transformation is applied and the first differences are calculated. After these transformations,
a second test, Figure 4.6, is conducted to see if the series became stationary. The Test Statistic is now
much lower than the Critical Values. Compared with the clear trend stock, this stock shows stronger signs
of stationarity according to the Dickey-Fuller test.
The next step is to find the best ARIMA order to predict daily, weekly and monthly prices using this
sideways stock. All three approaches use the same stationary time series.
Figure 4.6: Results of Dickey-Fuller Test for the stationary series.

For the daily forecast, 251 days are predicted and, for each prediction, a strategy is calculated. In
order to find the best combination of parameters, all combinations of ARIMA(p,d,q) were tested, with p
and q ranging between 0 and 5 and d ranging between 0 and 2. In total, 50 executions were
made in order to find the best combination based on the lowest MSE. The result of this brute search
is ARIMA(3,1,3), with three autoregressive terms, one order of differencing (as expected) and three
moving average terms.
For the weekly forecast, 49 points are predicted and, for each prediction, a strategy is calculated.
In order to find the best combination of parameters for the weekly forecast, again all combinations of
ARIMA(p,d,q) were tested, with p and q ranging between 0 and 5 and d ranging between 0 and
2. In total, 50 executions were made in order to find the best combination based on the lowest MSE;
this time, the MSE is relative to the weekly forecast values. The result of this brute search is
ARIMA(0,1,1), with zero autoregressive terms, one order of differencing (as expected) and one moving
average term.
Finally, the monthly forecast is a prediction for the next 22 days, so in the end 11 points are predicted
and, for each prediction, a strategy is calculated. In order to find the best combination of parameters
for the monthly forecast, all combinations of ARIMA(p,d,q) were tested, with p and q ranging
between 0 and 5 and d ranging between 0 and 2. In total, 50 executions were made in order to find
the best combination based on the lowest MSE; this time, the MSE is relative to the monthly forecast
values. The result of this brute search is ARIMA(0,1,1), with zero autoregressive terms, one order
of differencing (as expected) and one moving average term.
The ARIMA performance results for forecasting daily, weekly and monthly future prices of a sideways
stock, in this case the BEN stock, are detailed in Table 4.3.
Table 4.3: ARIMA results for a sideways stock.

Forecast   MAE     ROI      Sharpe Ratio   Accuracy
Daily      0.362   -29.5%   -2.034         44.6%
Weekly     0.776   -4.7%    -0.318         44.9%
Monthly    2.230   -15.2%   -0.908         36.4%
Analyzing only the error values, it seems that the predictions are very good and close to the real values.
The error increases when forecasting weekly and monthly prices, but the values still do not deviate
dramatically from the real ones.
Although this information may give an optimistic perception of the ARIMA performance on a sideways
stock, the return values do not show the same optimism. For the daily forecast, the return on
investment is actually negative, reaching -29.5%, which corresponds to a huge loss of money: for example, a
trader entering the market with $1000 and using the ARIMA daily predictions to decide a strategy would lose
approximately $300 by the end of one year. While also negative, the weekly and monthly forecasts
give higher ROI values, -4.7% and -15.2% respectively. Comparing these results with the B&H
strategy, which gives a return of -0.74% for this stock over the same period as the test period, the ARIMA
returns do not show any improvement over that strategy, always leading to a higher loss of money.
The weekly forecast gives better results than the other two approaches but, taking into account the daily
results, it is possible that the weekly results only seem better because very few points are predicted.
In fact, the dispersion of returns for a sideways stock, also known as volatility, is much higher
than the dispersion of returns for a clear trend stock. Commonly, the higher the volatility, the riskier the
investment. This fact is reflected in the Sharpe ratio values, which are very low, since a good value should
be 1 or greater; here the Sharpe ratio is negative in all situations. This is due to the high volatility
of the stock, which makes the investment a riskier move.
The accuracy of the strategy based on the ARIMA predictions for this sideways stock does not reach
very high values, never exceeding 50%. An accuracy below 50% means that the algorithm
fails more often than it hits, so it would almost be better not to follow the strategy at all.
4.1.3 ARIMA performance conclusion
In this subsection, the obtained results are compared to each other and a conclusion is presented about
the ARIMA performance. For a better comparison, the implemented algorithm is also assessed against
some related works on ARIMA models. The merged results for the two stocks, VRSN
and BEN, are presented in Table 4.4.
Table 4.4: ARIMA Performance.

                Clear Trend                 Sideways
           Daily    Weekly   Monthly   Daily    Weekly   Monthly
MAE        0.695    1.256    2.025     0.362    0.762    2.230
ROI        34.5%    36.7%    40.7%     -29.5%   -4.7%    -15.2%
SR         1.906    2.752    3.556     -2.034   -0.318   -0.908
Accuracy   57.8%    67.3%    81.8%     44.6%    44.9%    36.4%
The best set of results corresponds to the monthly forecast for the clear trend stock, VRSN.
On average, the ARIMA results for the clear trend stock are better than the ones for the sideways stock.
Chan et al. [7] also stated that the ARIMA model does not fit well at the beginning of a downward/upward
period, and that it should be used when a clear trend is shown, such as in the VRSN stock. Although this
appears to be true, the errors presented for the sideways stock are still good. The same cannot be
said of the sideways stock returns: in the clear trend situation the returns always exceed the one
obtained by the B&H strategy, while for the sideways stock all the returns are exceeded by the
B&H strategy.
One of the works referenced in the related work section, conducted by Rounaghi et al. [9], tries to
forecast the S&P 500 and the London Stock Exchange with ARIMA using data between 2007 and 2013.
That work presents very small MAE values, reaching 0.0283 for the monthly
forecast of the S&P 500 index, but does not show any metrics related to returns, risk or accuracy. Vantuch
et al. [10] predicted future prices of Microsoft shares and none of the predictions had errors below
0.5, with the majority of the errors above 3, reaching values of 6 and 7. Again, these predictions are
not applied to any strategy, so one cannot get a sense of how much profit or loss these predictions
could lead to, but the MAE results of this thesis are better (lower) than the ones obtained by Vantuch et
al. [10].
Concluding, ARIMA performs better on a clear trend stock, reaching very good profit and accuracy
results. When applied to a sideways stock, it behaves well considering only the error values, but very poorly
considering the returns, since the model is harder to fit when there are downward/upward moves. The MAE
values obtained are better than those of the work conducted by Vantuch et al. [10].
4.2 K-Nearest Neighbors Performance
In this section, the K-Nearest Neighbors algorithm is applied to the same two stocks used to evaluate
the ARIMA performance. While the ARIMA model uses all the previous data as input to predict
future prices, machine learning algorithms, and specifically KNN, work differently. K-Nearest
Neighbors uses supervised learning, so the first thing to do is to reframe the data into a
(features, target) format. Four numbers of features are tested: 5, 10, 15 and 22. These values
correspond roughly to one, two, three and four weeks of prices that characterize the target. No
more than 22 features are tested, since it is assumed that a price is more influenced by the prices that
are closer to it. The number of features is also referred to throughout this work as the window width.
After this reformulation, the algorithm uses only a set of previous values (targets) to predict the next
price. The number of previous prices that KNN uses is the number of neighbors, represented as
"K". It is important not to confuse the number of features with the number of neighbors. Features are
simply the prices that describe one price; for example, today's price is described by the last five prices.
The number of neighbors is the number of past prices (targets) that KNN uses to predict the next price.
Besides these steps, the data is divided in the same proportion as in the ARIMA implementation:
80% for the training phase and the remaining 20% for the test period.
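The (features, target) reframing described above amounts to sliding a window over the price sequence. A minimal sketch, with invented prices and a window width of 5 (one of the tested widths):

```python
import numpy as np

def reframe(prices, window):
    """Turn a price sequence into (features, target) pairs: each target
    price is described by the `window` prices that precede it."""
    X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
    y = np.array(prices[window:])
    return X, y

prices = [10.0, 10.5, 10.2, 10.8, 11.0, 11.3, 11.1, 11.6]
X, y = reframe(prices, window=5)  # one week of prices per target
print(X.shape, y.shape)  # -> (3, 5) (3,)
print(X[0], y[0])        # the first 5 prices describe the 6th price, 11.3
```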
4.2.1 Clear Trend
This subsection presents the KNN performance when a clear trend stock is used to forecast future
prices.
For the clear trend stock, KNN is used to forecast daily, weekly and monthly prices. For each of these
three options, the algorithm is tested using different values for the window width (number of features):
5, 10, 15 and 22. For each of these window widths, all values of K between
1 and 50 are tested using a grid search with 10-Fold cross-validation. This K parameter of the KNN
algorithm is the number of previous prices (neighbors) used to predict the next day's price, which is why
it was said previously that KNN, contrary to the ARIMA models, does not use all the past prices as
input to the forecast function. These neighbors do not all have the same importance and influence:
the nearest neighbors carry a heavier weight.
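The K search can be sketched with scikit-learn as below; synthetic data stands in for the stock prices, and weights="distance" implements the heavier weighting of closer neighbors:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(5)
prices = np.cumsum(rng.normal(0.05, 0.5, 300)) + 50  # synthetic price series

# Reframe with a window width of 5 features (one trading week).
window = 5
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]

# Grid search over K = 1..50 with 10-fold cross-validation, scored by
# MSE; weights="distance" gives nearer neighbors a heavier vote.
search = GridSearchCV(
    KNeighborsRegressor(weights="distance"),
    {"n_neighbors": list(range(1, 51))},
    cv=10,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best K:", search.best_params_["n_neighbors"])
```

In the full procedure, this search would be repeated for each of the four window widths, keeping the (window, K) pair with the lowest cross-validated MSE.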
For the daily forecast, the optimal number of features and neighbors found during the training
phase was 5 features and 23 neighbors. For the weekly forecast, the best model found used 5
features and 28 neighbors. Finally, for the monthly forecast, the model resulting from the training phase
had 5 features and 29 neighbors. The results obtained by these three models are described in Table
4.5.
Table 4.5: K-Nearest Neighbors results for a clear trend stock.
Forecast MAE ROI Sharpe Ratio Accuracy
Daily 10.838 -8.3% -0.591 44.6%
Weekly 12.073 -16.6% -1.568 40%
Monthly 16.080 -24.3% -2.754 18.2%
It is clear that KNN does not produce very good results when forecasting this clear trend stock. The errors
for the daily, weekly and monthly forecasts are very high and there are no positive returns. Even though
the daily forecast gives the best return, its value is far below the B&H strategy result of 32.2%. The
weekly and monthly forecasts, both using 5 features, are negative, generating losses for a
trader who follows the strategy based on these predictions. Also, the Sharpe ratio is very low, and the
highest Sharpe ratio, obtained for the daily forecast, does not correspond to a good return. The accuracies
are all below 50%, which is very low for common accuracy values. One of the reasons behind these poor
results is the behavior of this particular stock: the clear trend stock, VRSN, is almost always increasing
in price. KNN takes the average of the nearest neighbors, and if the prices are mostly increasing,
this average will most of the time correspond to a price lower than the actual price, leading to a
wrong strategy and, consequently, to a negative return.
4.2.2 Sideways stock
This subsection presents the results of KNN for a sideways stock, meaning a stock whose uptrends
and downtrends are mixed together.
For the sideways stock, BEN, KNN is also used to forecast daily, weekly and monthly
prices. Again, for each of these three options, the algorithm is tested using different values for the
window width: 5, 10, 15 and 22 features. For each of these window widths, all
values of K between 1 and 50 are tested using a grid search with 10-Fold cross-validation.
For the daily forecast, the optimal number of features and neighbors found during the training
phase was 5 features and 16 neighbors. For the weekly forecast, the best model found used 5
features and 29 neighbors. Finally, for the monthly forecast, the model resulting from the training phase
had 5 features and 22 neighbors. Weights are assigned to the chosen neighbors so that the nearest
neighbors contribute more to the average than the more distant ones. The results obtained by these
three models are described in Table 4.6.
Table 4.6: K-Nearest Neighbors results for a sideways stock.
Forecast MAE ROI Sharpe Ratio Accuracy
Daily 0.810 -28.4% -1.704 44.6%
Weekly 3.378 -1.5% -0.082 40%
Monthly 5.574 -15.2% -0.908 36.4%
For the sideways stock, the lowest error value corresponds to the daily forecast. The errors
grow with the forecast range, exceeding 5 for the monthly forecast. The errors are not
ideal, even though they are not that bad.
The returns are all negative and inferior to the B&H strategy, which gives a return of -0.74%
for this specific stock over this specific period. The daily return is the lowest one, even though it corresponds
to the situation with the lowest error value, proving again that a good error value does not imply
a great return.
All the investments represent a very high risk, since the Sharpe ratios are all negative. The least risky
investment is the weekly forecast.
In general, the accuracies do not present very high results, all being below 50%, with the monthly
accuracy being the lowest.
4.2.3 KNN performance conclusion
In this subsection, the KNN performance for the clear trend and the sideways stocks is compared and
analyzed in order to draw some conclusions about the KNN behavior. The algorithm is also compared to
the state of the art in order to give context to the presented values. The KNN results for the two stocks are
described in Table 4.7.
Table 4.7: KNN performance.

                Clear Trend                 Sideways
           Daily    Weekly   Monthly   Daily    Weekly   Monthly
MAE        10.838   12.073   16.080    0.810    3.378    5.574
ROI        -8.3%    -16.6%   -24.3%    -28.4%   -1.5%    -15.2%
SR         -0.591   -1.568   -2.754    -1.704   -0.082   -0.908
Accuracy   44.6%    40%      18.2%     44.6%    40%      36.4%
It is clear that the results obtained for the sideways stock are better than the ones obtained for the
clear trend stock. Starting with the errors, the clear trend forecast presents very high MAE values, the
lowest being above 10. The errors for the sideways stock are always below 6, which is very
low compared with the clear trend errors. The error increases with the forecast range for both
stocks, since KNN starts to use neighbors in its prediction function that are not actual values but
predictions themselves.
The returns are also better for the sideways stock, even though they are all below the
B&H strategy. It is important to remember that the B&H strategy gives 32.2% for the clear trend stock and
-0.74% for the sideways stock. It is also relevant to note that, even though the returns are a valid evaluation
metric for this work, they are calculated based on a very simplistic strategy.
The Sharpe ratio is not considered good in any of the cases, because it is always below 1, meaning
that in both stock situations the investments were risky. Finally, considering the accuracies, the values
are very low, all being below 50%.
Overall, KNN did not show very good results in the forecasting task, but it performs better
on a stock that does not have a clear trend. The reason is that KNN takes the weighted
average of the K nearest neighbors, and when a price is constantly increasing this average will always be
lower than the nearest neighbor and, consequently, lower than the actual price for the next day. On the
other hand, for a stock with higher volatility, it is easier to approximate the next day's price,
since the prices keep moving up and down.
It is also important to observe that the results presented in Table 4.7 correspond to specific
models, with the number of features and neighbors obtained during the training period.
For example, for the daily forecast of the clear trend stock, the presented results come from the
best combination of the number of features and the number of neighbors, in this case 5 features and
23 neighbors. All the situations had an optimal number of features of 5, meaning that it is not necessary
to use many features to find the optimal model. It is also curious to observe that the number of
neighbors is never higher than 29, meaning that minimizing the error does not require a large number
of neighbors.
Chen and Hao [17] also used KNN to predict future stock prices and presented their results using
the mean absolute percentage error (MAPE). MAPE can be problematic, since it can cause
division-by-zero errors, and it is not used as an evaluation metric throughout this work. However, in this
specific case, the MAPE is calculated so that the results obtained by the weighted KNN implemented in
this work can be compared with the one implemented by Chen and Hao [17]. The MAPE results
for the implemented KNN are shown in Table 4.8.
Table 4.8: KNN MAPE.

           Clear Trend                 Sideways
       Daily   Weekly   Monthly   Daily   Weekly   Monthly
MAPE   -       0.218    1.547     -       0.153    1.258
In Table 4.8, the entries with "-" are the ones with a division-by-zero error. Chen and Hao [17] obtained
a MAPE of 0.18 for a daily forecast and 0.22 for a weekly forecast, with no values presented for
the monthly forecast, probably because of the division-by-zero error. The only possible comparison is
then for the weekly forecast, and the values obtained in this work are better than
the ones obtained by Chen and Hao [17]. Even though the data sets are not the same, this comparison
gives an idea of how the KNN performed in this work compared to the other. Dash and Dash [22] also used
KNN, but in its classification form: it was used to generate buy and sell signals and led to very high
profits of 30% on the BSE SENSEX data set. In this case it is not possible to compare, since in this work
KNN was used as a regressor.
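For reference, the MAPE used in the comparison above can be sketched with an explicit guard for the division-by-zero cases marked "-" in Table 4.8 (the guard and the toy inputs are illustration choices, not the thesis implementation):

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error; returns None when any actual
    value is zero, the division-by-zero case marked "-" in Table 4.8."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if np.any(actual == 0):
        return None
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

print(mape([100.0, 102.0, 101.0], [101.0, 101.0, 101.0]))  # small error
print(mape([0.0, 102.0], [1.0, 101.0]))                    # None: zero actual
```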
4.3 Support Vector Regression Performance
In this section, the Support Vector Regression algorithm is tested on a 1-year test data set for daily,
weekly and monthly forecasts. Like K-Nearest Neighbors, SVR uses supervised learning, so the
price sequence has to be reframed into a features and targets format. Again, the algorithm is tested
with 5, 10, 15 and 22 features. The SVR is trained for each of these window widths, and the best
hyper-parameters are calculated with a grid search using 10-Fold cross-validation. The best combination
of hyper-parameters is chosen based on the lowest MSE and, in the end, among the best sets of
hyper-parameters for each of the window widths, the number of features with the lowest MSE is the
window width used during the test period.
In SVR there are three hyper-parameters to tune: the kernel, gamma and the C parameter.
The parameter ε is set to 0.1, its default value. The kernel can be linear, polynomial or
Gaussian, also known as radial basis function (RBF). Contrary to the linear and polynomial kernels, which
are considered parametric models, the RBF kernel is non-parametric, with a potentially infinite complexity
that can grow with the data, allowing it to represent more complex relations and outperform the parametric kernels.
Even though the RBF kernel is promising, sometimes the linear and polynomial kernels give better
results, so it is always advisable to test the three options. The soft-margin parameter, C, can take the values
[1, 10, 100, 1000], and the kernel parameter, gamma, can take the values [0.1, 0.01, 0.001, 0.0001] if the
kernel is RBF. The parameters are calculated using a grid search with 10-Fold cross-validation.
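The hyper-parameter search just described can be sketched with scikit-learn. Synthetic small-scale data stands in for the reframed price series; the C, gamma and kernel grids match the values stated above, with epsilon left at its 0.1 default:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(6)
# Synthetic log-price-like series (small scale keeps the kernels stable).
prices = np.cumsum(rng.normal(0.001, 0.01, 200))

window = 5  # one of the tested window widths (5, 10, 15, 22)
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]

# Grid over the three kernels, C in [1, 10, 100, 1000] and, for the RBF
# kernel only, gamma in [0.1, 0.01, 0.001, 0.0001]; epsilon stays at 0.1.
param_grid = [
    {"kernel": ["linear", "poly"], "C": [1, 10, 100, 1000]},
    {"kernel": ["rbf"], "C": [1, 10, 100, 1000],
     "gamma": [0.1, 0.01, 0.001, 0.0001]},
]
search = GridSearchCV(SVR(epsilon=0.1), param_grid, cv=10,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print("best hyper-parameters:", search.best_params_)
```

As with KNN, the full procedure repeats this search for each window width and keeps the combination with the lowest cross-validated MSE.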
The data is divided in the same proportion as in the two previous implementations: 80% for
the training phase and the remaining 20% for the test period. The algorithm is applied to the same stocks
as ARIMA and KNN: a clear trend stock and a sideways stock.
4.3.1 Clear Trend
This subsection presents the SVR performance on a clear trend stock for daily, weekly and monthly forecasts.
During the training phase, the C and gamma parameters are optimized within the ranges referenced
before.
For the daily forecast, the best results were obtained using 22 features, a soft-margin of 100 and
a polynomial kernel. For the weekly forecast, the optimal number of features is 15, the parameter C is
1000 and the kernel is polynomial. Finally, for the monthly forecast, the best results were obtained with
5 features, a C equal to 100 and an RBF kernel with gamma 0.01. The results obtained by the
Support Vector Regression algorithm on the clear trend stock are described in Table 4.9.
Table 4.9: Support Vector Regression results for a clear trend stock.
Forecast MAE ROI Sharpe Ratio Accuracy
Daily 4.552 23.41% 1.368 57.8%
Weekly 2.463 8.7% 0.725 53.1%
Monthly 13.004 -23.4% -2.166 27.3%
The results of SVR for this clear trend stock are not very optimistic. The errors are high for all forecast ranges. The returns are all below those of the B&H strategy, and the monthly return is even negative. The Sharpe ratio is higher than 1 in the daily forecast, representing a less risky investment, but the daily return of 23.41% is low compared to the 33.2% of B&H. The accuracy of the daily case is good compared to the others. The monthly accuracy is very low, so those values should perhaps not be taken into account. The monthly forecast is also the only one that uses an RBF kernel, since the daily and weekly cases use polynomial kernels.
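For reference, the B&H (buy-and-hold) baseline used in these comparisons simply buys at the first test-period price and sells at the last. The prices below are hypothetical figures chosen only to reproduce a 33.2% return, not the actual stock data.

```python
# Buy-and-hold baseline: one purchase at the start, one sale at the end.
first_price, last_price = 100.0, 133.2     # illustrative prices
bh_return = (last_price - first_price) / first_price
print(f"{bh_return:.1%}")                  # → 33.2%
```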
4.3.2 Sideways Stock
This subsection presents the SVR performance on the sideways stock for a daily, weekly and monthly forecast. During the training phase, the C and gamma parameters are optimized within the ranges referenced above.
For the daily forecast, the best results were obtained using 10 features, a soft-margin of 1000, and a polynomial kernel. In the weekly forecast, the optimal number of features is 5, the parameter C is 100 and the kernel is again polynomial. Finally, for the monthly forecast, the best results were obtained with 15 features, a C equal to 10, and again an RBF kernel with a gamma of 0.01. The results obtained by the Support Vector Regression algorithm in the sideways stock are described in Table 4.10.
Table 4.10: Support Vector Regression results for a sideways stock.
Forecast MAE ROI Sharpe Ratio Accuracy
Daily 0.807 1.9% 0.101 51.8%
Weekly 1.422 -4.4% -0.227 50%
Monthly 3.662 -1.5% -0.078 63.6%
Considering the errors, the daily forecast has an acceptable MAE, which increases in the weekly and again in the monthly forecast. The returns are also acceptable taking into account that this is a sideways stock and the B&H strategy for it yields -0.74%. The weekly and monthly returns are lower than -0.74%, with negative values of -4.4% and -1.5%. The Sharpe ratio indicates that all the investments are risky, since they are all less than 1. Only the monthly accuracy takes a good value, being higher than 60%.
4.3.3 Support Vector Regression performance conclusions
In this subsection, the SVR performance for the clear trend and the sideways stocks is compared and analyzed in order to draw some conclusions about the SVR behavior. The results are also compared with the works referenced in Section 2.1. The SVR results for the two stocks are described in Table 4.11.
Table 4.11: Support Vector Regression performance.

                 Clear Trend                  Sideways
           Daily   Weekly  Monthly    Daily   Weekly  Monthly
MAE        4.552   2.463   13.004     0.807   1.422    3.662
ROI        23.4%   8.7%    -23.4%     1.9%    -4.4%   -1.5%
SR         1.368   0.725   -2.166     0.101   -0.227  -0.078
Accuracy   57.8%   53.1%    27.3%     51.8%   50%      63.6%
The Support Vector Regression is clearly superior when used to forecast the sideways stock. Starting with the errors, the clear trend stock presents very high MAE values, whereas the sideways errors, although not ideal, have a considerably lower average value. In the clear trend stock, the returns never exceed the B&H return of 33.2%, while in the sideways stock the daily forecast exceeded the -0.74% of the B&H strategy. On average, the Sharpe ratios of the sideways stock are lower than those calculated for the clear trend investments, meaning it is riskier to invest in a stock with constant up and down moves. Four accuracies are higher than or equal to 50% and one is considerably low: the monthly clear trend forecast.
Concluding, the SVR behaves better for a sideways stock than for a clear trend stock. This may be due to the complexity of the model, which can perceive complex relations in the data while sometimes failing on the more linear ones.
Tay and Cao [15] also applied SVMs to financial time series forecasting. A true and fair comparison cannot be made, since they did not use the same data set or the same evaluation metrics, being concerned only with error measures. The only metric that appears both in their work and in this thesis is the mean absolute error (MAE). Even so, it is interesting to observe their conclusions to contextualize the results obtained here. Considering that only their best results are presented, the researchers obtained MAE values between 0.2361 and 0.4105. These values are very good compared to the ones obtained here, and only the daily forecast for the sideways stock is close to Tay and Cao's results [15].
4.4 ARIMA vs. KNN vs. SVR
This section presents a comparison between ARIMA, K-Nearest Neighbors and Support Vector Regres-
sion applied to daily, weekly and monthly forecast of a stock. The three algorithms are compared based
on the same metrics: mean absolute error, return on investment, Sharpe ratio and accuracy. Before comparing the three implemented methods, it is important to understand that the four metrics should not be interpreted in the same way. The mean absolute error (MAE) concerns the precision of the prediction and its deviation from the real values, so a reader interested in precise forecasts should analyze the MAE. On the other hand, the return on investment (ROI), the Sharpe ratio (SR) and the accuracy are more concerned with the utility of the predicted values, since the algorithms are applied to financial data. These three metrics reflect the results of a very simple strategy based on the obtained predictions, so a reader interested in defining strategies to invest in the stock market should look at these metrics.
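To make the four metrics concrete, the following sketch computes them on toy data. The long-only rule (hold the stock only when the predicted move is upward) is an assumption standing in for the thesis's simple strategy, and the annualization factor assumes daily data.

```python
import numpy as np

actual = np.array([10.0, 10.5, 10.2, 10.8, 11.0])     # real closing prices
predicted = np.array([10.1, 10.4, 10.4, 10.6, 11.2])  # model forecasts

# Precision metric: mean absolute error between forecast and reality.
mae = np.mean(np.abs(predicted - actual))

# Utility metrics, based on a simple long-only rule: hold the stock on
# days when the predicted move is upward.
real_ret = np.diff(actual) / actual[:-1]       # realised daily returns
signal = np.sign(np.diff(predicted))           # predicted direction
strat_ret = np.where(signal > 0, real_ret, 0.0)

roi = np.prod(1.0 + strat_ret) - 1.0           # compounded return
sharpe = np.mean(strat_ret) / np.std(strat_ret) * np.sqrt(252)  # annualised
accuracy = np.mean(np.sign(np.diff(actual)) == signal)

print(round(mae, 2), round(accuracy, 2))       # → 0.16 0.75
```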
4.4.1 Clear Trend Stock
The performance of the three algorithms is detailed in Table 4.12 and a thorough comparison is conducted below.
Table 4.12: ARIMA vs. KNN vs. SVR in a Clear Trend Stock.

                    Daily    Weekly   Monthly
MAE       ARIMA     0.695    1.256     2.025
          KNN      10.838   12.073    16.080
          SVR       4.552    2.463    13.004
ROI       ARIMA     34.5%    36.7%     40.7%
          KNN       -8.3%   -16.6%    -11.9%
          SVR       23.4%     8.7%    -23.4%
SR        ARIMA     1.906    2.752     3.556
          KNN      -0.591   -1.568    -2.754
          SVR       1.368    0.725    -2.166
Accuracy  ARIMA     57.8%    67.3%     90.9%
          KNN       44.6%    40%       18.2%
          SVR       57.8%    53.1%     27.3%
It is clear that one model stands out: the ARIMA model. ARIMA has the lowest MAE for the daily, weekly and monthly forecasts, and this difference becomes more pronounced as the forecast range increases. The highest MAE belongs to KNN in the monthly range, at 16.080. In general, the KNN errors for this stock are very high. The SVR also has high MAE values, although not as high as the KNN ones. In general, the error grows with the forecast range, being highest for the monthly forecast.
To better understand the magnitude of the errors obtained by the three algorithms, the daily predictions made by each of them are compared against the actual values of the test data set. Figure 4.7 illustrates this comparison.
The model closest to the actual price is the ARIMA model, as expected since it is the model with the lowest daily MAE for the clear trend, 0.695. Regarding KNN, at the beginning of the test period the algorithm seems to fit the data, but it quickly starts to output bad results. The same happens with SVR. The SVR is able to more or less predict the volatility of the prices but not the actual values, staying slightly above them, with a MAE of 4.552.
Figure 4.7: Comparison of the three algorithms in a clear trend stock.
Considering the returns, the comparison baseline is the B&H strategy, which gives a return of 33.2% for this specific stock. This value is exceeded only by the ARIMA forecasts, in all ranges, and
KNN and SVR give lower values in all the forecast ranges. Even though these two algorithms do not give results as good as the ARIMA model's, the SVR is superior to the KNN in terms of returns.
The Sharpe ratio reflects a good strategy for all ARIMA predictions and also for the daily SVR forecast, showing again that SVR performs better than the KNN.
The accuracies are very high in the ARIMA case and not so high for the two machine learning algorithms. Again, even though the SVR does not show good accuracies, it performed better than the KNN model, whose accuracies are all below 50%.
Concluding, the ARIMA model performs very well in a clear trend stock, outperforming the B&H strategy and also the two machine learning algorithms. The Support Vector Regression outperformed the KNN, taking into account all the evaluation metrics. The KNN is very weak in its predictions and seems to fit the data only at the beginning of the test period.
4.4.2 Sideways Stock
Table 4.13 presents the performance of the three algorithms in a sideways stock, the BEN stock, which does not show a clear up or down trend. Contrary to the clear trend stock, no model or algorithm stands out in this situation. The errors are similar among the three models, and on average they increase with the forecast range. The KNN is still the solution with the highest MAE values, and the lowest error is found again in the daily forecast of the ARIMA model. Figure 4.8 shows how far the predictions are from the real values; in other words, it illustrates the MAE values.
The ARIMA predictions are very close to the actual values, and the KNN performs much better than in the clear trend stock. The SVR does not show very precise values, and it is not obvious just by looking at Figure 4.8 whether KNN or SVR is more accurate. Only by looking at the MAE values is it possible to check that SVR outputs values closer to the actual prices.
Concerning the returns, the results may not seem very optimistic. In fact, this specific stock is very
Table 4.13: ARIMA vs. KNN vs. SVR in a Sideways Stock.

                    Daily    Weekly   Monthly
MAE       ARIMA     0.362    0.776     2.230
          KNN       0.810    3.378     5.574
          SVR       0.807    1.422     3.662
ROI       ARIMA    -29.5%    -4.7%    -15.2%
          KNN      -28.4%    -1.5%    -15.2%
          SVR        1.9%    -4.4%     -1.5%
SR        ARIMA    -2.034   -0.318    -0.908
          KNN      -1.704   -0.082    -0.908
          SVR       0.101   -0.227    -0.078
Accuracy  ARIMA     44.6%    44.9%     36.4%
          KNN       44.6%    40%       36.4%
          SVR       51.8%    50%       63.6%
Figure 4.8: Comparison of the three algorithms in a sideways stock.
volatile, with very inconstant behavior, making it more difficult to invest with good results. The B&H strategy gives a profit of -0.74%, exceeded only by the daily SVR forecast. In such volatile stocks, even when the errors are small, the returns can be very low, since it is difficult to predict whether the price will increase or decrease. ARIMA has the lowest return for the daily forecast, -29.5%.
The Sharpe ratio is lower than 1 in all situations, meaning that even the least negative returns represent a very risky investment. The accuracies are high for the SVR algorithm and very low for the KNN method. ARIMA exhibits average accuracy results, but all below 50%.
4.4.3 Overall comparison
Forecasting in the stock market is not an easy task, and forecast results can sometimes be dubious. Most works related to forecasting use only error metrics or only return/profit/accuracy metrics. When dealing with financial data, it is extremely important to look at both metric types and to find the appropriate trade-off between the two. For example, the daily ARIMA forecast applied to the sideways stock has a very low error of 0.362; evaluated alone, without the remaining context, it would seem a very good and precise result. Yet the returns for the same daily forecast are very low, being negative and almost -30%. This is just one example of why results should be analyzed carefully and in context.
To solve the problem of forecasting in the stock market, ARIMA gives results very close to the actual values, and its results are consistent along the test period. This model performs much better when applied to a clear trend stock, and it does not fit well when consistent up and down movements occur. When dealing with the clear trend stock, it outperforms the two machine learning algorithms. ARIMA can also give very high returns when used to build a strategy, well above the B&H values. ARIMA is also a simple, not very complex statistical algorithm, which is compensated by its low computational effort. Besides this, for the purposes of this work, only order values between 0 and 5 were tested, since the computational resources were limited.
The two machine learning algorithms did not perform as expected, being surpassed by the ARIMA model in the clear trend stock. Support Vector Regression performed better than the KNN model, mainly because KNN is a very simple, lazy algorithm with slow learning, while SVR is more complex and, with more complex kernels, can perceive complex relations in the data. Also, KNN is more commonly used in classification tasks, where feature scaling is very common and greatly helps the algorithm's performance.
Another clear conclusion, looking at the MAE values, is that as the forecast range increases, so does the error, in almost all situations. The results also prove that good error values do not mean good returns, and good returns do not mean a good investment, since the risk must be taken into account. For KNN, no optimal K value above 29 was found, indicating there is no need to try very high values of K. For SVR, the polynomial kernel is the one with the best results, even though the RBF kernel is the one most commonly used with complex data such as financial data.
Table 4.14 shows which algorithm obtained the best result for each of the two stocks in each of the four metrics.
Table 4.14: Best results for each stock.

                               ARIMA   KNN   SVR
Clear Trend Stock   MAE          x
                    ROI          x
                    SR           x
                    Accuracy     x             x
Sideways Stock      MAE          x
                    ROI                        x
                    SR                         x
                    Accuracy                   x
Both machine learning algorithms fell short of expectations. Looking at Figure 4.7 and Figure 4.8, it seems that both KNN and SVR fit the data much better at the beginning of the test period than at the end. This may be caused by the hyper-parameters becoming out of date, so that the model needs to be retrained. Taking this into consideration, the machine learning algorithms were retrained in order to compare the results and see whether retraining gives more precise values.
4.5 Studying the impact of retraining KNN and SVR
The introduction of a retraining period is due to the fact that, concerning the mean absolute errors, the two machine learning algorithms present good predictions at the beginning of the test period and their performance deteriorates over time. This observation leads to the suspicion that the algorithms may lose their predictive ability, since they were trained before the test period and their hyper-parameters can be out of date. It is important to give the models the opportunity to learn the price trend from the past, but it is also relevant that the algorithms use up-to-date information and are not influenced only by data located too far in the past.
Taking this into consideration, a retraining step is introduced for both KNN and SVR in order to check whether their results in terms of mean absolute error improve. This retraining step is incorporated during the test period, and the retraining period varies between 5, 10 and 15 trades. A retraining period of 5 means that after every 5 executed trades, the algorithm is retrained and new hyper-parameters are calculated and used until the next retrain.
This approach is applied to the KNN and SVR in both stocks for a daily forecast, and the results are compared with those obtained in Sections 4.2 and 4.3.
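The retraining scheme described above can be sketched as a walk-forward loop. The KNN regressor, the trending toy series, and the assumption that the realised price becomes available after each trade are illustrative stand-ins, not the exact thesis pipeline.

```python
# Walk-forward prediction with periodic retraining: the model is refit on
# all data seen so far after every `retrain_every` executed trades.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def walk_forward(X_tr, y_tr, X_te, y_te, retrain_every=5):
    X_hist, y_hist = list(X_tr), list(y_tr)
    model = KNeighborsRegressor(n_neighbors=3).fit(X_hist, y_hist)
    preds = []
    for i in range(len(X_te)):
        preds.append(float(model.predict(X_te[i:i + 1])[0]))
        X_hist.append(X_te[i])            # realised sample joins history
        y_hist.append(y_te[i])
        if (i + 1) % retrain_every == 0:  # refit every 5 trades
            model = KNeighborsRegressor(n_neighbors=3).fit(X_hist, y_hist)
    return np.array(preds)

X = np.arange(50, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()                       # a toy upward-trending target
X_tr, y_tr, X_te, y_te = X[:40], y[:40], X[40:], y[40:]

preds_retrain = walk_forward(X_tr, y_tr, X_te, y_te, retrain_every=5)
preds_static = KNeighborsRegressor(n_neighbors=3).fit(X_tr, y_tr).predict(X_te)

mae_retrain = np.mean(np.abs(preds_retrain - y_te))
mae_static = np.mean(np.abs(preds_static - y_te))
print(mae_retrain < mae_static)           # → True
```

On a trending series the static model keeps predicting from stale neighbors, while the retrained one follows the trend, which mirrors the kind of MAE improvement reported for KNN and SVR in this section.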
The new daily results of the KNN with a 5-period retraining for the clear trend stock are illustrated in Figure 4.9. The KNN MAE value dropped from 10.838 to 1.330.
Figure 4.9: KNN with Retraining for a Clear Trend Stock.
KNN had the most inadequate results relative to the actual values in the clear trend forecast, and retraining every 5 trades greatly improved its error and, consequently, the returns, Sharpe ratio and accuracy of the strategy based on the KNN predictions. These results were obtained for the same test period and number of features used in Section 4.2.
Even though the KNN had a better performance in the sideways stock, its results are also improved by the introduction of a retraining period, reducing the error and improving the returns, Sharpe ratio and accuracy of the algorithm. The optimal retraining period was again every 5 trades. The improvements in the daily results are illustrated in Figure 4.10. The KNN MAE value dropped from 0.810 to 0.545.
Figure 4.10: KNN with Retraining for a Sideways Stock.
The SVR was also well below expectations, and retraining was introduced during its test period, with the retraining period varying between 5, 10 and 15 trades. For the clear trend, the SVR predicted values higher than the actual prices, even though the shape of the output time series was similar. With the introduction of retraining every 5 trades, the SVR error improved considerably, but the same did not happen with the returns, Sharpe ratio and accuracy. The results are illustrated in Figure 4.11. The SVR MAE value dropped from 4.552 to 0.990.
Figure 4.11: SVR with Retraining for a Clear Trend Stock.
Concerning the sideways stock, the SVR errors are better than without retraining. Again, the optimal period is every 5 trades. The SVR MAE value dropped from 0.807 to 0.366. The results are illustrated in Figure 4.12.
Figure 4.12: SVR with Retraining for a Sideways Stock.
Concluding, retraining the machine learning algorithms is something to take into consideration when forecasting over a long test period, as it showed improvements in every case. The chosen period in all cases was every 5 trades. This may not work in all situations, since it increases the computational effort, and the period should be optimized for each algorithm in its specific context.
Chapter 5
Conclusions and Future Work
5.1 Conclusions
The objectives of this work were to study stock price sequences as time series and to introduce the use of forecasting to predict future prices. To this end, the goal was to implement one statistical model and two machine learning techniques and to compare the three of them when forecasting with daily, weekly and monthly ranges. Price sequences were used in all three cases in order to have a fair comparison, even though the machine learning algorithms are commonly used with technical indicators.
To complete the proposed goals, the chosen techniques were the ARIMA model, K-Nearest Neighbors, and Support Vector Regression: the ARIMA model is a statistical approach, K-Nearest Neighbors a simple machine learning algorithm, and Support Vector Regression a more complex and advanced machine learning technique. For each of the three algorithms, the hyper-parameters were optimized in order to have the best possible model for each situation, based on the lowest MSE. To compare the three solutions, a simple strategy was computed based on the forecasted values, and four metrics were used to evaluate both the prediction and the strategy: the mean absolute error, the return on investment, the Sharpe ratio, and the accuracy. The algorithms were tested on two different types of stocks: a clear trend stock and a sideways stock.
The best performance for the clear trend stock corresponds to the ARIMA model, which exceeds the B&H strategy with returns of 40.7%. For the sideways stock, the SVR was the one with the highest returns, even though they were not very good. The two machine learning algorithms demonstrated a good fit at the beginning of the test period, and their performance degraded over time. The introduction of a retraining period was tested in order to find out whether the results could be improved. With this introduction, both algorithms' results were better than the ones obtained before, proving the point that model retraining should be considered when forecasting over a long test period.
There are some points worth enumerating in order to summarize the conclusions reached:
1. The choice of the evaluation metrics is extremely important, and a reliable comparison between algorithms should not be conducted based on only one metric;
2. The ARIMA model has very good results in a clear trend stock, while in a sideways stock the same
does not happen;
3. K-Nearest Neighbors is a very simple algorithm that does not fit the stock data very well due to
the complexity of price moves and simplicity of the algorithm, being the weakest implemented
algorithm;
4. Support Vector Regression performs better than K-Nearest Neighbors in modeling and forecasting financial data, but it is exceeded by the ARIMA model in a clear trend stock;
5. Machine learning algorithms can lose their validity, and introducing retraining along the test period greatly improves the error results.
Concluding, forecasting is a very difficult task, even more so in the financial field, where prices can be so unpredictable. The evolution of a stock price can have multiple channels of influence, and in this work only past price sequences were used, probably impairing the results. Machine learning algorithms are used in the financial field more often as classifiers than as regressors, which made the task of forecasting a continuous value (the close price) more challenging. In the end, the three algorithms show very interesting results even though only price sequences were used as their input, making the point that forecasting future prices as continuous variables can be a very promising tool for investors and traders.
5.2 Future Work
For future work, the present thesis should be seen as a starting point in the forecasting of stock prices as continuous variables. To continue this work, the following approaches can be pursued:
1. For each of the implemented algorithms, conduct a deeper study of the influence of each of the hyper-parameters on each of the models;
2. Evaluate the models with more refined and complex strategies, since the implemented one was very simplistic and served only to gain an idea of how useful the predictions were;
3. Optimize the retraining periods, since only 3 periods were tested;
4. Integrate the solution into a Big Data platform in order to process more data more quickly;
5. Use different algorithms that can work as regressors to forecast future prices;
6. Combine price sequences with fundamental analysis in order to include more channels of influence.
Bibliography
[1] B. Marr. "A Short History of Machine Learning". Forbes, pages 1–2, 2016. URL http://www.forbes.com/sites/bernardmarr/2016/02/19/a-short-history-of-machine-learning-every-manager-should-read/#7eaad602323f.
[2] Investopedia.com. "Technical Analysis Tutorial". pages 1–42, 2010. URL https://www.investopedia.com/exam-guide/series-7/portfolio-management/technical-analysis.asp.
[3] G. E. P. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung. Time Series Analysis: Forecasting and Control, 5th Edition. 2015.
[4] T. G. Dietterich. "Machine learning in ecosystem informatics and sustainability". 2009. ISBN 9781577354260. doi: 10.1007/978-3-540-75488-6_2.
[5] A. Ng. "Lecture CS229: Machine Learning." Stanford University. 2011. URL http://cs229.stanford.edu.
[6] R. C. Steorts. "Lecture STA 325, Chapter 3.5 ISL - Comparison of Linear Regression with K-Nearest Neighbors." Duke University. URL http://www2.stat.duke.edu/~rcs46/lectures_2017/03-lr/03-knn.pdf.
[7] E. G. Chan, S. Fellow, P. H. Director, and S. Program. "Forecasting the S&P 500 Index Using Time Series Analysis and Simulation Methods". Submitted to the MIT Sloan School of Management and the School of Engineering, 2009.
[8] E. A. Gerlein, M. McGinnity, A. Belatreche, and S. Coleman. "Evaluating machine learning classification for financial trading: An empirical approach". Expert Systems with Applications, 54:193–207, 2016. ISSN 09574174. doi: 10.1016/j.eswa.2016.01.018. URL http://dx.doi.org/10.1016/j.eswa.2016.01.018.
[9] M. M. Rounaghi and F. Nassir Zadeh. "Investigation of market efficiency and Financial Stability between S&P 500 and London Stock Exchange: Monthly and yearly Forecasting of Time Series Stock Returns using ARMA model". Physica A: Statistical Mechanics and its Applications, 456:10–21, 2016. ISSN 03784371. doi: 10.1016/j.physa.2016.03.006. URL http://dx.doi.org/10.1016/j.physa.2016.03.006.
[10] T. Vantuch and I. Zelinka. "ECC 14 - Evolutionary Based ARIMA Models for Stock Price Forecasting". 2014. doi: 10.1007/978-3-319-10759-2_25. URL https://link.springer.com/content/pdf/10.1007%2F978-3-319-10759-2_25.pdf.
[11] J. Kamruzzamana and R. A. Sarkerb. "Comparing ANN Based Models with ARIMA for Prediction of Forex Rates". ASOR BULLETIN, 22(2):2–11, 2003. URL http://www.asor.org.au/publication/files/jun2003/Joarder.pdf.
[12] J. Mandziuk and P. Rajkiewicz. "Neuro-evolutionary system for FOREX trading". 2016 IEEE Congress on Evolutionary Computation, CEC 2016, pages 4654–4661, 2016. doi: 10.1109/CEC.2016.7744384.
[13] P. Yoo, M. Kim, and T. Jan. "Machine Learning Techniques and Use of Event Information for Stock Market Prediction: A Survey and Evaluation". International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC'06), 2:835–841, 2007. doi: 10.1109/CIMCA.2005.1631572. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1631572.
[14] K. J. Kim. "Financial time series forecasting using support vector machines". Neurocomputing, 55(1-2):307–319, 2003. ISSN 09252312. doi: 10.1016/S0925-2312(03)00372-2.
[15] L. Cao and F. E. H. Tay. "Application of support vector machines in financial time series forecasting". Omega, 29(4):309–317, 2001. ISSN 03050483. doi: 10.1016/S0305-0483(01)00026-3.
[16] W. H. Chen, J. Y. Shih, and S. Wu. "Comparison of support-vector machines and back propagation neural networks in forecasting the six major Asian stock markets". International Journal of Electronic Finance, 1(1):49, 2006. ISSN 1746-0069. doi: 10.1504/IJEF.2006.008837. URL http://www.inderscience.com/link.php?id=8837.
[17] Y. Chen and Y. Hao. "A feature weighted support vector machine and K-nearest neighbor algorithm for stock market indices prediction". Expert Systems with Applications, 80:340–355, 2017. ISSN 09574174. doi: 10.1016/j.eswa.2017.02.044.
[18] R. P. da Costa Barbosa. "Agents in the Market Place: An Exploratory Study on Using Intelligent Agents to Trade Financial Instruments". 2011.
[19] D. Wang, X. Liu, and M. Wang. "A DT-SVM strategy for stock futures prediction with big data". Proceedings - 16th IEEE International Conference on Computational Science and Engineering, CSE 2013, pages 1005–1012, 2013. ISSN 1949-0828. doi: 10.1109/CSE.2013.147.
[20] F. Liu, P. Du, F. Weng, and J. Qu. "Use clustering to improve neural network in financial time series prediction". Proceedings - Third International Conference on Natural Computation, ICNC 2007, 2(Icnc):89–93, 2007. doi: 10.1109/ICNC.2007.796.
[21] J. Leskovec and A. Rajaraman. "Lecture CS345a: Data Mining - Clustering algorithms". Stanford University. 1975. URL http://dl.acm.org/citation.cfm?id=540298.
[22] R. Dash and P. K. Dash. "A hybrid stock trading framework integrating technical analysis with machine learning techniques". The Journal of Finance and Data Science, 2(1):42–57, 2016. ISSN 24059188. doi: 10.1016/j.jfds.2016.03.002. URL http://linkinghub.elsevier.com/retrieve/pii/S2405918815300179.