journal of class files, vol. , no. , november ... of class files, vol. , no. , november 2016 1...
TRANSCRIPT
JOURNAL OF CLASS FILES, VOL. , NO. , NOVEMBER 2016 1
Classification and Clustering of Stocks,using Genetic Algorithms and
Fundamental AnalysisDavid Moura, MSc Student, IST
Abstract—Since the last two decades the ease of access to information has grown exponentially, making iteasier to analyze and use this data in every field of science, including computational finance. This work presentsan architecture made from scratch of a trading system that classifies stocks using two techniques, in order toconclude which one is superior: one with user input parameters, and an unsupervised one using a geneticalgorithm to optimize clustering position, with a constant number of clusters. A genetic algorithm is also appliedto optimize fundamental indicators to give buy and sell signals in each of the groups obtained, in order toconclude if stocks in the same group behave in similar fashion. Results have shown that the group with bestresults obtained with user input parameters is superior to the group with best returns obtained with theclustering algorithm. However the clustering algorithm classified stocks better, having increased performanceover the user input method when used with the genetic algorithm using fundamental indicators. The proposedsystem was implemented from scratch, and was contains optimization modules, processing modules and atrading simulator.
Index Terms—Genetic Algorithms; Fundamental Analysis; Fundamental Indicators; Classification; Clustering;Stock Market
F
1 INTRODUCTION
THIS section is an introduction to the subject ofcomputational techniques applied to compa-
nies’ financial data as a way to find trading rulesand patterns in the stock market.
1.1 Motivation and ContextSeveral techniques are used to predict stock mar-ket quotes, being the most popular ones ArtificialIntelligence (AI) methods to optimize financialindicators’ parameters. There are two types offinancial indicators: fundamental and technicalindicators. Other methods include the evaluationof each stock of a certain type, being this typedefined by its sector or any other rule.
1.2 Problem StatementThe problem here is to construct a system thatevaluates the stock market and accurately groups
stocks that show similar behaviors.
1.3 Proposed Solution
The implemented solution is a system that classi-fies companies into groups based on their finan-cial statements, in two ways: a supervised waywith parameters defined by the user and an un-supervised way, applying a genetic algorithm tooptimize clustering. After both classifications, theapplication uses a genetic algorithm once more tooptimize indicators to buy and sell stocks, com-paring all the results to conclude which methodclassifies better. The proposed system was imple-mented in C++ with about 9000 lines of code andit was made to be used by third parties that wishto improve its features.
JOURNAL OF CLASS FILES, VOL. , NO. , NOVEMBER 2016 2
1.4 Document StructureIn this section the document structure is pre-sented. Section 2 shows theoretical concepts re-quired to develop this project and an overviewof the results obtained by related works. Section3 documents the proposed solution. Section 4presents a validation of the system. Section 5summarizes this work, concluding the achieve-ments, limitations and proposing future improve-ments.
2 STATE OF ART
2.1 Financial IndicatorsFinancial indicators analyze statistics and presentthat information in the form of ratios. These in-dicators will give better results if used alongsidethe right strategy [1].
There are several types of financial indicators,divided into Fundamental Indicators (FI) or Tech-nical Indicators, which come from FundamentalAnalysis (FA) and Technical Analysis (TA).
2.1.1 Fundamental IndicatorsFA is used to check the intrinsic value of a certainmarket, industry or company, being FA appliedto the latter the most used one. FA uses finan-cial statements to know if a company is underor overvalued. FI are constructed using FA, andthese type of indicators do not take into accountmarket trends, only its intrinsic value, or in otherwords, the real raw value of the company [2].
FA has been made famous by value investorslike Benjamin Graham and Warren Buffet (fortheir value investment strategies [3], [4]).
Using information given in financial state-ments, one can calculate some fundamental ratiosused to compare companies, to decide which onesare the best to invest in.
2.1.2 Technical IndicatorsTA and FA use different approaches towards in-vestment, since TA uses movement of stock prices[5] and volume of transactions [6] as the maininformation to predict stock markets. TIs lookfor patterns in past data and use those patternsto forecast market tendencies [7]. TA generatestrading rules by analyzing previous patterns oftechnical indicators [2].
2.2 Computational Intelligence (CI) Algo-rithms
This work uses Genetic Algorithms (GA). GAs arethe most used kind of Evolutionary Algorithms(EA), an CI technique.
2.2.1 Genetic Algorithms
GAs are stochastic algorithms based on the evo-lutionary process of species. A simple GA pseu-docode is given in algorithm 1.
t← 0;P (t)← random;Evaluation P (t);while notEndcondition do
Pp(t)← Selection of parents from P (t);Pc(t)← Crossover from Pp(t);Pm(t)←Mutation of Pc(t);Evaluation Pm(t);P (t+ 1)← New Generation Creationfrom (P (t), Pm(t));t← t+ 1 ;
endAlgorithm 1: Simple GA
The basic operators of a GA are selection,crossover and mutation. There are also other op-erators or techniques that can be applied to theGA in order to improve its results, such as elitism(propagating a percentage of the best individualsin a population into the next generation) or ran-dom immigrants (substituting the worst individ-uals in the population by random individuals).
2.3 Clustering
Clustering is an unsupervised technique, sinceit does not use preclassified data. Instead thealgorithm discovers similarities (in the requestedattributes) between objects of the set, groupingthem in the same cluster. In [8] a GA is used tooptimize clustering, using the Calinski-Harabaszindex as fitness function, obtaining better resultsthan classical clustering methods as K-means andFuzzy C-Means (FCM) (for more information see[9] and [10]).
JOURNAL OF CLASS FILES, VOL. , NO. , NOVEMBER 2016 3
2.4 Related Work Results
In this section, some previous work results anddata used are analyzed and compared betweenthem in table 2.4.
Work Date Approach FinancialData ApplicationBenchmarkValidation
Period Returns
[11] 2015 MOGA FI & TI
PortfolioCom-posi-tion
S&P500 2013 -2014 28.3%
[5] 2011 GA TI
PortfolioCom-posi-tion
DJI 2003 -2009
BetterthanB%HandRan-domWalk
[12] &[13] 2008 R-
SPEA2RawData
PortfolioCom-posi-tion
FTSE100
May2004 -
Decem-ber
2005
Betterthe
SPEA2ap-
proach
[14] 2004 EC MarkowitzModel
FundMan-age-mentCom-posi-tion
S&PDatabase N/A
Betterthan2PLS,
SA andTS
[15] 2015 GA TI
CrudeOil
FutureMarket
CrudeOil
Market
1987 -2013
BetterthanB&H
[16] 2010
MEPA&
CPDA& RST& GA
TI
PortfolioCom-posi-tion
TAIEX 2000 -2005
BetterthanB&H,GA orRST
alone
[6] 2009ANN &
FL &GA
TI
PortfolioCom-posi-tion
S%P500 1997 -2002
BetterthanB&H,
NN, NFor GAalone
[17] 2007RST &FuzzyRules
TIForex
Ex-change
USD vsothermain
curren-cies
2004 -2006 33.95%
[18] 2009 GP TI
PortfolioCom-posi-tion
S&P500 1990 -2002
BetterthanB&H
[19] 2014 GA TI
PortfolioCom-posi-tion
TAIEX
118days(end
date 14June2014)
BetterthanB&H
[20] 1990 ANN RawData
PortfolioCom-posi-tion
TOPIX
January1987 -
Septem-ber
1989
BetterthanB&H
[21] 2011 ANFIS TI
PortfolioCom-posi-tion
IBM &Wal-
mart &Citi-
group&
Wyeth& GM
24thAugust1994 -30th
August2006
240.32%, better
thanS&P500and DJI
3 ARCHITECTURE
This Chapter will give a description of the systemarchitecture. First it will be given an overview ofthe proposed solution, afterwards a module styledescription of the most important modules andlastly, an explanation of other implementationdetails. The architecture of the proposed solutionwas made from scratch in C++. The implementa-tion has 22 C++ classes, and a size of about 9000lines of code.
3.1 Module View of the Global Architecture
The overview of the module architecture of theproposed solution is presented in this section (seefigure 1).
There are nine main modules in the system,each one with a specific functionality.
Fig. 1. Module View of the Architecture
See figure 2 to better understand the data flowof the system.
3.2 Modules
This section explains in more detail the mostimportant modules of this work. The modules areconnected between them as showed in figure 2.
3.2.1 Fundamental Analysis (FA)This module is a data analysis module, that willcompute growth analysis and FIs from the fi-nancial statements of the companies (this datawas obtained with [22] algorithm). The indicatorsused in this work will always have a maximiza-tion objective. They are the following:
JOURNAL OF CLASS FILES, VOL. , NO. , NOVEMBER 2016 4
Fig. 2. Modules Data Flow
Debtindicator = 1− Total DebtTotal Assets
(1)
Payout Ratio =DPSEPS
(2)
ReturnOnEquity =Net Income (NI)
Total Equity(3)
ProfitMargin =NI
Revenue(4)
RevenueGrowth =RevActual −RevLastY ear
RevLastY ear(5)
NetIncomeGrowth =NIActual −NILastY ear
NILastY ear(6)
∆RG =RGActual −RGLastY ear
RGLastY ear(7)
∆NIG =NIGActual −NIGLastY ear
NIGLastY ear(8)
∆CFOA =CFOAActual − CFOALastY ear
CFOALastY ear(9)
3.2.2 ClassifierThis module does the classification of stocks foreach quarter, based on the approach used by PeterLynch in [23] (see figure 3 for better understand-ing of the structure).
Very Good
Good
Normal
Bad
Very Bad
MediumSmall Big
Fast Grower
Stalwart
Slow Grower
Potential
Turn Around
Growth
Size
Fig. 3. Types’ Quadrants
The classification given has only into accountthe size and the growth of the company (seefigure 3), and not the financial health, that wasimplemented for possible human analysis.
An example of how size classification is com-puted is given (health is similar). For growthclassification the method is the same but withnine instead of three thresholds:
• Classification = 1→ V alue < THLow
JOURNAL OF CLASS FILES, VOL. , NO. , NOVEMBER 2016 5
• Classification = 2→ THLow ≤ V alue <THHigh
• Classification = 3→ THHigh ≤ V alue
Afterwards the classification given is accord-ingly:
• Small→ Total Classification ∈]0, 1.5[• Medium→ Total Classification ∈ [1.5, 2.5[• Big→ Total Classification ∈ [2.5, 3]
In this work the only Size indicator used is thelast year’s Total Assets average, and the thresh-olds are1: THLow = 10B and THHigh = 25B.
The only growth indicator used is the revenueyearly growth, and the thresholds are: TH1 =−0.2, TH2 = −0.1, TH3 = −0.05, TH4 = −0.02,TH5 = 0, TH6 = 0.02, TH7 = 0.05, TH8 = 0.1,TH9 = 0.2.
And the only financial Health indicator usedis the last year TotalDebt
TotalAssets indicator average, andthe thresholds are: THLow = 0.3 and THHigh =0.7
(a) Health (b) Size
(c) Growth
Fig. 4. Classifier Structure
The type of the stock is, as said before, ob-tained from the evaluation of the size and growthof a company, based on the approach presentedby [23]. After a stock has its growth and size eval-uated, the type is determined by combinationsbetween them.
The 5 classifications given in this work are (seefigure 3):
• Slow Grower - Companies that are consid-ered Medium in terms of size and had aNormal or Good growth classification
1. the 5B and 10B represents respectively 5 and 10 Billiondollars
• Stalwart - Companies that are consideredBig in terms of size and had a Normal,Good or Very Good growth classification
• Fast Grower - Companies that are consid-ered Small or Medium in terms of size andhad a Very Good growth classification
• Potential - Companies that are consideredSmall in terms of size and had a Normalor Good growth classification
• Turn Around - Companies that had a Bador Very Bad growth classification, inde-pendently of their size.
3.2.3 Clustering
This module creates five clusters each quarter,and associates each stock to the nearest cluster inthe Growth × Size space. It uses the GA moduleto optimize the clusters’ positions, given at leastone year training.
Values are scaled to give more appropriateweights to both growth and size. These scaledunits are given in equation 10.
Size =Total Assets[Million$]
1000(10a)
Growth = ∆Revenue× 100 (10b)
The growth measure is the annual revenuegrowth, and the size measure is the average ofthe Total Assets over the last year.
There will be a fixed amount of 5 clusters (thesame amount as the types in the Classifier mod-ule), enumerated from A to E (the GA will recom-pute the best locations for clusters each quarterpassed). To maintain consistency, the first cluster(cluster A) will always be the one nearest to theorigin of the plane (Origin = Coordinates (0, 0)),B the second nearest, and so on.
3.2.4 GA
This module has two functionalities. One is tooptimize weights given to fundamental indica-tors, and other is to optimize the location of theclusters in the plane. The way each works isdescribed as follows:
JOURNAL OF CLASS FILES, VOL. , NO. , NOVEMBER 2016 6
3.2.4.1 Fundamental Analysis GA (FAGA): Since the Portfolios use FI, and FI requires amore long term analysis than TI, each time unit isconsidered a quarter. A generation is an iterationover the last 4 quarters of the GA. See figure 5 tosee a structure of a chromosome.
Fig. 5. FI Chromosome Representation
At the beginning of the algorithm the popu-lation is generated randomly. Each FI weight isinitialized as in equation 11a, the buy signal valueis initiated as in equation 11b, and the sell signalvalue initiated as in equation 11c.
r ∈ [0, 1] (11a)
b =N∑i=1
riai (11b)
s =b
2(11c)
In equations 11 N is the number of indicatorsused, r and a are different random numbers, b isthe buy signal value and s is the sell signal value.The s value is calculated as in equation 11c toguarantee that it is smaller and has a substantialpercentage difference from the b value.
Buy and sell signals are given as in equation12, where v is the value of indicator i, and w isthe weight of indicator i.
Signal =
{BUY →
∑Ni=1 vi × wi > b
SELL→∑N
i=1 vi × wi < s(12)
The fitness function will be the ROI of eachindividual, when simulating investment.
ROI =Return - Initial Investment
InitialInvestment(13)
3.2.4.2 Clustering GA: This is where thetraining and validation of the algorithm to op-timize clusters’ positions occur, inspired in theACGA algorithm from [8]. The GA module willrun the GA to find the best locations for clusters
centroids, being the output the centroids loca-tions in the Size×Growth plane, using Calinski-Harabasz index as fitness function. See figure 6 tosee the structure of a chromosome.
The chromosomes are constructed with clusterpoints. Activation values are auxiliary structuresthat will determine if a certain cluster point isgoing to be used or not. The number of clusterpoints and activation values is the possible num-ber of cluster positions given by the user. Clusterpoints and activation values are such that:
• Cluster Point is a tuple (size, growth),where size ∈ R+ and growth ∈ R
• Activation signal is a number n ∈ [0, 1]
Fig. 6. Clustering Chromosome Representation
The minimum training period of the algo-rithm is 1 year. Chromosomes are randomly ini-tiated, by assigning random values to Size andGrowth, in order to create theN different possiblepoints. This is done as in equation 14.
Sizerandom =maxsize
2× r1 (14a)
Growthrandom =maxgrowth
2× r2 (14b)
In equation 14 r1 and r2 are distinct random num-bers and maxsize, maxgrowth are the maximumsize and growth measured in the first year.
The activation values are obtained as in equa-tion 15.
Activationi =# Stocks ∈ Ci
# Stocks(15)
In equation 15 i is the index of the solution, andCi is the cluster of index i.
Within the number of solutions decided by theuser, the clusters with bigger activation values’are chosen as solutions of that chromosome. Thefitness function used is the Calinski-Harabaszmetric (sometimes called variance ratio criterion),as described in equations 16. Although this metricis usually used to optimize the number of clusterscreated, in this work it will be used only as a
JOURNAL OF CLASS FILES, VOL. , NO. , NOVEMBER 2016 7
fitness function to the GA that optimizes clusters’positions.
CH =SSB
SSW× N −K
K − 1(16a)
SSB =K∑j=1
nj‖Cj − C‖ (16b)
SSW =K∑j=1
∑i∈Ij
‖Xij − Cj‖ (16c)
C = (
∑Stockssize
N,
∑Stocksgrowth
N) (16d)
In equations 16 SSB is the between-clustervariance, SSW is the within-cluster variance, Nis the total number of stocks, K is the number ofclusters (the number of solution clusters), nj is thenumber of data points belonging to cluster indexj, C is the centroid of the dataset, Cj is cluster ofindex j, Ij is the set of data points belonging tocluster j and Xij is data point index i belongingto cluster index j.
3.2.4.3 Operators: Two types of selectionare implemented. A roulette wheel and a rankingselection. For this it will be used the fitness valuesshifted by the fitness of the worst individual, toavoid negative fitnesses.
The type of crossover done is a single pointcrossover.
The mutation implemented uses a ψ value todetermine the maximum value of the perturba-tion. This ψ is a percentage δ of the value that willbe mutated. After finding this value, the mutationvalue α will be computed as being a randomnumber between [0, ψ]
The mutation can be mathematically written:
ψ = δ × v (17)α = random ∈ [0, ψ] (18)
value = value± α (19)
In equation 17 value is the value receiving themutation, α is the perturbation applied to thevalue, ψ is the maximum value of the pertur-bation and δ is the percentage of perturbationchosen.
In this work values of δ = 0.5 in indicatorsweights, and δ = 0.2 in cluster positions are used.
3.3 Implementation Details
3.3.1 Configuration
The population size is 100, number of generationsis 50 for FA GA and 200 for clustering optimiza-tion, mutation rate is 5%, elitism rate is 40%, ran-dom immigration rate is 20% and chromosomesize is 11 for the FA GA (equals the number ofindicators) and 75 for the clustering optimization(number of possible cluster points). These valueswere chosen taking into account related worksand the execution time of the algorithm, althoughthey did not took into account neither the sizeof the problem or the features optimized in thiswork. Transaction costs of 0,3% of the stock valueare used in this work.
4 RESULTS
Every solution is compared with the S&P500 in-dex and the B&H of the dataset in the specifictime period. In the GAs used, the roulette wheelselection got better results, and these are the onespresented. These results are not compared withthe existing related works because I was unableto find works that traded in a similar time period,and because this work uses quarterly tradingonly, while most works use daily and weeklytrading.
In order to test the performance of the pro-posed solution, the metrics used in this workwere the following:
• ROI - The Return on Investment will beused to measure the amount of return,given in percentage, that a certain invest-ment had.
• Drawdown - This metric evaluates thebiggest peak-to-trough decline during aspecific period of investment.
• Sharpe Ratio - Used to measure the riskassociated with the return of a portfolio.
• Success rate of trades• Average time in the market• Average return per trade• Rate of positive quarters• Average return per quarter
JOURNAL OF CLASS FILES, VOL. , NO. , NOVEMBER 2016 8
4.1 Case Studies
All the results were computed without dividends(stocks’ closing quotes were used), and withtransaction costs of 0,3% in every buy or sellaction.
There is no comparison with related studiesbecause when this work was proposed I wasunable to find neither works that used the samevalidation period, neither works that traded onlyquarterly, and so this work uses the B&H and theS&P500 index (specially the latter) as comparisonwith the obtained results.
The execution time of the clusters’ positionsoptimization algorithm is of about 5 hours and ittakes 7 hours to create all of the portfolios once,with an i7 microprocessor.
4.1.1 Case Study I - User Input ClassificationThis case study presents the results of classifyingstocks using user input data.
The strategy is to do a portfolio containingonly a type of stock. This portfolio will buy stocksof a certain type the quarter that stock is classifiedas belonging to that type, and sell it in the quarterit stops belonging to that type. Since this classifi-cation is fixed, the values presented are the resultof a single run.
We can see in figure 7 the graph of the accu-mulated returns, and in figure 8 the results of theseveral metrics used.
One can see that every type got worst re-sults than the S&P500 index and the B&H of thedataset. The results of the Turnarounds type (thetype with worst results) was expected, since com-panies with worst revenue growths are expectedto have more trouble growing.
4.1.2 Case Study II - ClusteringThis case study presents the results of clusteringstocks by their size (total assets) and revenuegrowth, in the plane Size × Growth, with thealgorithm detailed in section 3. The algorithmhas run 10 times, and the solution that achievedthe median value of the averages of all quarters’fitnesses was used, as shown in table 1.
Just as in Case Study I, the portfolios will buystocks of a certain cluster the quarter that stock is
2012 Q1 2012 Q2 2012 Q3 2012 Q4 2013 Q1 2013 Q2 2013 Q3 2013 Q4 2014 Q1 2014 Q2 2014 Q3 2014 Q4 2015 Q1 2015 Q2 2015 Q3 2015 Q4SG 8,49% 6,43% 12,23% 15,80% 24,49% 26,26% 32,88% 39,70% 48,27% 49,73% 45,48% 55,84% 59,99% 55,89% 49,29% 48,82%SW 8,95% 5,07% 9,84% 12,69% 22,73% 26,40% 33,43% 43,68% 49,05% 54,11% 50,71% 61,50% 63,33% 59,31% 48,30% 49,44%FG 14,95% 4,99% 14,49% 18,63% 22,81% 25,09% 30,13% 41,31% 46,98% 51,52% 44,77% 49,81% 60,43% 58,93% 48,10% 48,00%POT 10,34% 6,24% 11,46% 16,84% 23,80% 22,96% 29,63% 39,81% 45,26% 46,31% 43,61% 54,85% 59,67% 57,23% 50,65% 52,15%TA 3,79% -0,40% 4,77% 6,63% 18,84% 23,85% 34,23% 41,89% 46,56% 48,73% 44,90% 50,16% 50,22% 45,78% 34,12% 36,13%B&H 10,25% 5,35% 11,73% 15,50% 26,05% 29,50% 37,89% 48,41% 55,24% 59,29% 54,97% 65,68% 68,94% 64,51% 52,44% 54,25%S&P500 11,99% 8,31% 14,54% 13,39% 24,76% 27,70% 33,69% 46,96% 48,87% 55,85% 56,80% 63,68% 64,40% 64,03% 52,64% 62,49%
-20%
0%
20%
40%
60%
80%
100%
Re
turn
s
Quarters
Type's Portfolios
Fig. 7. User input classification portfolios accumulated re-turns. The key is the following: SG - Slow Growers; SW- Stalwarts; FG - Fast Growers; POT - Potentials; TA -Turnarounds
Number of Trades
Rate of Trade
Success
Biggest Trade Return
Biggest Trade Loss
Average Time on
Market (in
Average Return
Per trade
Rate of Positive Quarters
Average Return Per
Quarter
4 Year Return
Sharpe Ratio
Drawdown
SG 132 65,91% 213,04% -65,39% 4,95 24,46% 68,75% 2,39% 44,09% 0,83 10,07%
SW 243 78,19% 253,67% -74,54% 8,51 36,03% 75,00% 2,84% 54,33% 0,87 13,77%
FG 72 73,61% 151,01% -34,50% 3,93 18,67% 75,00% 3,87% 79,32% 2,15 9,96%
POT 85 72,94% 226,94% -41,83% 5,94 26,76% 68,75% 2,90% 55,73% 1,41 6,81%
TA 203 59,11% 221,01% -80,57% 4,09 10,83% 62,50% 1,78% 30,13% 0,36 18,13%
Fig. 8. User input portfolios metrics table. The key is thefollowing: SG - Slow Growers; SW - Stalwarts; FG - FastGrowers; POT - Potentials; TA - Turnarounds
classified as belonging to that cluster, and sold inthe quarter it stops belonging to that cluster.
We can see in figure 9 the accumulated returnsof the portfolios, and in figure 10 the metricsevaluating each portfolio.
In figure 11 is shown how companies are
JOURNAL OF CLASS FILES, VOL. , NO. , NOVEMBER 2016 9
TABLE 1Average fitness per quarter of the clustering algorithm
Best Worst MedianCalinski-Harabasz 806,4 665,37 767,8
2012 Q1 2012 Q2 2012 Q3 2012 Q4 2013 Q1 2013 Q2 2013 Q3 2013 Q4 2014 Q1 2014 Q2 2014 Q3 2014 Q4 2015 Q1 2015 Q2 2015 Q3 2015 Q4A 9,13% 3,96% 10,33% 14,36% 24,87% 27,79% 36,86% 46,58% 53,08% 56,58% 52,68% 63,90% 66,93% 60,47% 51,00% 51,87%B 11,64% 7,31% 12,49% 17,81% 24,72% 23,45% 28,92% 38,72% 47,30% 54,08% 47,79% 53,25% 61,41% 60,19% 55,58% 56,06%C 10,06% 4,45% 7,59% 8,75% 19,24% 26,03% 30,88% 42,32% 48,97% 54,63% 51,91% 59,03% 57,79% 59,05% 43,94% 45,02%D 9,46% 15,49% 22,17% 19,05% 25,65% 30,77% 32,35% 43,80% 44,59% 40,37% 35,78% 43,74% 46,18% 47,82% 32,55% 34,37%E 28,17% 21,00% 36,22% 49,04% 59,20% 70,12% 78,24% 94,57% 91,90% 100,64% 97,29% 104,26% 96,95% 101,68% 77,85% 77,34%B&H 10,14% 5,25% 11,77% 15,51% 26,10% 29,49% 37,84% 48,32% 55,03% 59,04% 54,74% 65,60% 68,90% 64,51% 52,77% 54,54%S&P500 11,99% 8,31% 14,54% 13,39% 24,76% 27,70% 33,69% 46,96% 48,87% 55,85% 56,80% 63,68% 64,40% 64,03% 52,64% 62,49%
-10,00%
10,00%
30,00%
50,00%
70,00%
90,00%
110,00%
Re
turn
s
Quarters
Cluster's Portfolios
Fig. 9. Clustering classification portfolios. The key is thefollowing: A - Cluster A; B - Cluster B; C - Cluster C; D -Cluster D; E - Cluster E.
divided into clusters. It is remarkable how datais so sparse in the Size axis. Cluster A, is thebiggest cluster, containing most of the companies,while cluster E has only 4 companies in the fourthquarter of 2012.
One can conclude that the biggest companiesof the dataset got huge returns on bull markets,and huge losses on bear markets, having smallSharpe ratios, and huge drawdowns. Big compa-nies, outside this small set of huge companies,have the worst returns of all (the ones fromcluster D). It can also be concluded that revenuegrowth can be chosen as an indicator to pickstocks with relatively good Sharpe ratios.
With the scale and number of clusters usedthere was a better grouping in size than ingrowth. Due to the amount and sparsity of data,to get a better clustering in the growth axis (spe-cially in high values of size) it would be necessarymore cluster points.
Number of Trades
Rate of Trade Success
Biggest Gain Biggest LossAverage Time on Market (in
Quarters)
Average Return Per
trade
Rate of Positive Quarters
Average Return Per
Quarter4 Year Return Sharpe Ratio Drawdown
Cluster A
104 84,62% 202,85% -48,88% 6,64 46,58% 75,00% 2,76% 51,87% 0,76 15,93%
Cluster B
107 74,77% 128,38% -50,11% 3,55 14,75% 68,75% 2,91% 56,06% 1,50 6,29%
Cluster C
65 73,85% 167,25% -57,45% 8,80 29,69% 75,00% 2,48% 45,02% 0,60 15,11%
Cluster D
31 80,65% 107,34% -84,16% 6,23 23,36% 75,00% 1,99% 34,37% 0,51 15,27%
Cluster E
9 88,89% 150,20% -36,59% 7,33 66,46% 62,50% 4,00% 77,34% 0,66 23,83%
Fig. 10. Clustering portfolios metrics table. The key is thefollowing: A - Cluster A; B - Cluster B; C - Cluster C; D -Cluster D; E - Cluster E.
-80
-60
-40
-20
0
20
40
60
80
100
0 100 200 300 400 500 600 700 800 900 1000
Gro
wth
[%
]
Size [1.000.000.000$]
Clusters 2012 Q4
A
B
C
D
E
Cluster A
Cluster B
Cluster C
Cluster D
Cluster E
Fig. 11. Cluster’s representation in the whole plane in thefourth quarter of 2012. Cluster centers have the key ”ClusterX” where X is the name of the cluster (A, B, C, D or E), andthe points belonging to a certain cluster have only the letterof that cluster.
4.1.3 Case Study III - Using GAs to optimize FI
This case study studies the results of applyingGAs to give buy and sell signals on each of thegroups of case studies 1 and 2, and the wholedataset.
The algorithm ran 10 times for each model,and in the models that used the clustering al-
JOURNAL OF CLASS FILES, VOL. , NO. , NOVEMBER 2016 10
gorithm, the clustering algorithm has run twice,so for each portfolio that uses clustering, 5 runsused the first run of the clustering algorithm, andthe other 5 used the other run of the clusteringalgorithm.
One portfolio using user input classifications(the Slow Growers type) and two portfolios usingclustering showed improvements (cluster D andE). The results present how the GA improved acluster type portfolio (cluster E) and a classifi-cation type portfolio (Slow Growers), from theoriginal portfolios of case study 1 and 2.
2012 Q1 2012 Q2 2012 Q3 2012 Q4 2013 Q1 2013 Q2 2013 Q3 2013 Q4 2014 Q1 2014 Q2 2014 Q3 2014 Q4 2015 Q1 2015 Q2 2015 Q3 2015 Q4E-GA 29,50% 30,50% 41,65% 49,71% 61,30% 72,99% 74,92% 89,28% 94,96% 105,01% 93,62% 105,34% 118,52% 125,49% 103,61% 104,68%E 28,17% 21,00% 36,22% 49,04% 59,20% 70,12% 78,24% 94,57% 91,90% 100,64% 97,29% 104,26% 96,95% 101,68% 77,85% 77,34%GA 7,67% 3,89% 7,64% 10,12% 19,05% 20,35% 27,93% 36,93% 42,23% 46,18% 42,40% 51,94% 53,70% 50,07% 36,15% 41,30%
-20%
0%
20%
40%
60%
80%
100%
120%
140%
Re
turn
s
Quarters
Type's Portfolios
Fig. 12. GA optimization portfolios. The key ”SG” repre-sents the portfolio of type Slow Growers and the key ”E”represents the portfolio of cluster E, as constructed in casestudies 1 and 2. The suffix ”-GA” represents the portfoliosusing GA to optimize FI weights. The key ”GA” representsthe GA algorithm running over the whole dataset.
One can notice in figure 12 that in the portfolioE-GA there was a 27,34% increase in performance,most of which in the year 2015. The GA limitedthe losses of cluster E in 2015, and consequen-tially, increased the Sharpe ratio.
In figure 13 the metrics of these portfolios areshown. It reduced substantially the number oftrades done, when compared with portfolio SG(78,03% - from 132 trades in portfolio SG, to 29trades in portfolio SG-GA).
One can conclude from this case study that
Number of
Trades
Rate of Trade
Success
Biggest Trade Return
Biggest Trade Loss
Average Time on Market
(in Quarters)
Average Return
Per trade
Rate of Positive Quarters
Average Return
Per Quarter
4 Year Return
Sharpe
Ratio
Drawdown
GA 368 72,83% 276,67% -69,44% 5,49 24,02% 75,00% 2,29% 41,30% 0,67 17,55%
SG-GA 29 68,97% 264,07% -35,46% 3,79 29,16% 75,00% 2,49% 46,03% 0,84 16,45%
E-GA 8 100,00% 117,26% - 3,50 45,94% 87,50% 4,87% 104,68% 0,98 21,89%
Fig. 13. Clustering portfolios metrics table. The key is thefollowing: A - Cluster A; B - Cluster B; C - Cluster C; D -Cluster D; E - Cluster E.
Fundamental Indicators optimization has betterresults when applied to small sets of companies(as there were cluster E and D).
Also, the GA used in this work greatly re-duced the number of trades done in all portfolios,reducing potential losses, but also potential gains.In the case of the E-GA portfolio, the major dif-ference from the cluster E portfolio presented incase study 2 was the reduction of losses in 2015.
5 CONCLUSION
This work describes the implementation of a sys-tem that classifies stocks in the plane Size ×Growth with two methods, using in both thesame metrics of Fundamental Analysis, and thatuses Fundamental Indicators for trading simula-tion.
The classification methods used were success-ful, having both achieved great returns in the ex-pected groups. The genetic algorithm for weightoptimization of fundamental indicators improvedtwo cluster based portfolios, and one classifica-tion based portfolio. It also showed great successwhen applied to the cluster with bigger returns,improving the returns by 27,34%. To have betterresults in this optimization it would be necessarya bigger number of clusters so the algorithmcould optimize Fundamental Indicators weightsin smaller sets of stocks. These results allowed
JOURNAL OF CLASS FILES, VOL. , NO. , NOVEMBER 2016 11
to conclude that automatic classification usinggenetic algorithms is a better way of classifyingstocks than using human input, since it is notprone to human error and does a more carefulanalysis of the data, grouping companies withmore similar behaviors.
5.1 AchievementsThe main achievements in this work were thefollowing:
• The implementation from scratch of a ar-chitecture of a system containing data pro-cessing, genetic algorithms and a tradingsimulator.
• A classification method using Fundamen-tal Analysis and Genetic Algorithm tooptimize clusters’ positions, with a fixednumber of clusters.
• The comparison between an user inputclassification method and an automaticclassification method.
• A long term investment strategy based onthe implemented classification methods.
• The exclusive use of Fundamental Analy-sis and Fundamental Indicators with Ge-netic Algorithms.
5.2 Future Work• Try different number of clusters, and op-
timize the number of clusters created tomaximize the Calinski-Harabasz index.
• Use Technical Analysis to give the buy andsell signals of a portfolio.
• Tackle the Markwoitz Portfolio composi-tion problem.
REFERENCES
[1] G. S. Atsalakis and K. P. Valavanis, “Surveying stockmarket forecasting techniques part ii: Soft computingmethods,” Expert Systems with Applications, vol. 36, no.3, Part 2, pp. 5932 – 5941, 2009.
[2] S. B. Achelis, Technical Analysis from A to Z. McGrawHill New York, 2001.
[3] B. Greenwald, J. Kahn, P. Sonkin, andM. van Biema, Value Investing: From Grahamto Buffett and Beyond, ser. Wiley finance.John Wiley & Sons, 2004. [Online]. Available:https://books.google.pt/books?id=gvCzlskpZxoC
[4] M. Buffett and D. Clark, Warren Buffett and theInterpretation of Financial Statements: The Searchfor the Company with a Durable CompetitiveAdvantage. Scribner, 2008. [Online]. Available:https://books.google.pt/books?id=7iqO6rGdrAYC
[5] A. Gorgulho, R. Neves, and N. Horta, “Applying a{GA} kernel on optimizing technical analysis rulesfor stock picking and portfolio composition,” ExpertSystems with Applications, vol. 38, no. 11, pp. 14 072 –14 085, 2011.
[6] T. Chavarnakul and D. Enke, “A hybrid stock tradingsystem for intelligent technical analysis-based equiv-olume charting,” Neurocomputing, vol. 72, no. 1618,pp. 3517 – 3528, 2009, financial EngineeringCompu-tational and Ambient Intelligence (IWANN 2007).
[7] R. D. Edwards, J. Magee, and W. C. Bassetti, Technicalanalysis of stock trends. CRC Press, 2007.
[8] C. Raposo, C. H. Antunes, and J. P. Barreto, AutomaticClustering Using a Genetic Algorithm with New SolutionEncoding and Operators. Cham: Springer InternationalPublishing, 2014, pp. 92–103.
[9] J. C. Dunn, “A fuzzy relative of the isodata processand its use in detecting compact well-separated clus-ters,” 1973.
[10] J. C. Bezdek, Pattern recognition with fuzzy objectivefunction algorithms. Springer Science & BusinessMedia, 2013.
[11] A. Silva, R. Neves, and N. Horta, “A hybrid approachto portfolio composition based on fundamental andtechnical indicators,” Expert Systems with Applications,vol. 42, no. 4, pp. 2036 – 2048, 2015.
[12] G. Hassan, “Multiobjective robustness for portfoliooptimization in volatile environments,” in In Proc.GECCO 08. ACM, 2008, pp. 1507–1514.
[13] G. Hassan and C. D. Clack, “Robustness of multipleobjective gp stock-picking in unstable financial mar-kets: Real-world applications track,” in Proceedings ofthe 11th Annual Conference on Genetic and EvolutionaryComputation, ser. GECCO ’09. New York, NY,USA: ACM, 2009, pp. 1513–1520. [Online]. Available:http://doi.acm.org/10.1145/1569901.1570104
[14] M. Ehrgott, K. Klamroth, and C. Schwehm, “An{MCDM} approach to portfolio optimization,” Eu-ropean Journal of Operational Research, vol. 155, no. 3,pp. 752 – 770, 2004, traffic and Transportation SystemsAnalysis.
[15] L. Wang, H. An, X. Liu, and X. Huang, “Selectingdynamic moving average trading rules in the crudeoil futures market using a genetic approach,” AppliedEnergy, vol. 162, pp. 1608 – 1618, 2016.
[16] C.-H. Cheng, T.-L. Chen, and L.-Y. Wei, “A hybridmodel based on rough sets theory and genetic al-gorithms for stock price forecasting,” Information Sci-ences, vol. 180, no. 9, pp. 1610 – 1629, 2010.
[17] S. Yao, M. Pasquier, and C. Quek, “A foreign exchangeportfolio management mechanism based on fuzzyneural networks.” in IEEE Congress on EvolutionaryComputation. IEEE, 2007, pp. 2576–2583.
[18] D. Lohpetch and D. Corne, “Discovering effectivetechnical trading rules with genetic programming:Towards robustly outperforming buy-and-hold,” in
JOURNAL OF CLASS FILES, VOL. , NO. , NOVEMBER 2016 12
Nature & Biologically Inspired Computing, 2009. NaBIC2009. World Congress on. IEEE, 2009, pp. 439–444.
[19] M. Radeerom, “Automatic trading system based ongenetic algorithm and technical analysis for stockindex,” International Journal of Information Processingand Management, vol. 5, no. 4, p. 124, 2014.
[20] T. Kimoto, K. Asakawa, M. Yoda, and M. Takeoka,“Stock market prediction system with modular neuralnetworks,” in Neural Networks, 1990., 1990 IJCNN In-ternational Joint Conference on, June 1990, pp. 1–6 vol.1.
[21] Z. Tan, C. Quek, and P. Y. Cheng, “Stock trading withcycles: A financial application of {ANFIS} and rein-forcement learning,” Expert Systems with Applications,vol. 38, no. 5, pp. 4741 – 4755, 2011.
[22] A. Fernandes, An Evolutionary Computing Approach toFinancial Portfolio Management Based on Growth Stocks& Sector/Industry Distribution. Instituto Superior Tc-nico, May 2016.
[23] P. Lynch and J. Rothchild, One Up On Wall Street:How To Use What You Already Know To Make MoneyIn. Simon & Schuster, 2012. [Online]. Available:https://books.google.pt/books?id=TYOdIrFJ2SkC