
Overview

Evolutionary design of decision trees
Vili Podgorelec,* Matej Šprogar and Sandi Pohorec

The decision tree (DT) is one of the most popular symbolic machine learning approaches to classification, with a wide range of applications. Decision trees are especially attractive in data mining. They have an intuitive representation and are, therefore, easy to understand and interpret, also by nontechnical experts. The most important and critical aspect of DTs is the process of their construction. Several induction algorithms exist that use the recursive top-down principle to divide training objects into subgroups, based on different statistical measures, in order to achieve homogeneous subgroups. Although being robust and fast, and generally providing good results, their deterministic and heuristic nature can lead to suboptimal solutions. Therefore, alternative approaches have been developed which try to overcome the drawbacks of classical induction. One of the most viable approaches seems to be the use of evolutionary algorithms, which can produce better DTs, as they search for globally optimal solutions, evaluating potential solutions with regard to different criteria. We review the process of evolutionary design of DTs, providing a description of the most common approaches as well as referring to recognized specializations. The overall process is first explained and later demonstrated in a step-by-step case study using a dataset from the University of California, Irvine (UCI) machine learning repository. © 2012 Wiley Periodicals, Inc.

How to cite this article: WIREs Data Mining Knowl Discov 2013, 3: 63–82. doi: 10.1002/widm.1079

INTRODUCTION

A decision tree (DT) is a classifier commonly represented as a tree structure. DTs are built with DT induction algorithms, which are robust, have relatively low computational cost, and are able to work with redundant attributes.1 Most induction algorithms use a greedy top-down recursive partitioning strategy. Attributes to be tested at internal nodes are selected according to different measures: distance-based measures,2 gain ratio,3 Gini index,4 and information gain.5 Greedy search can lead to suboptimal solutions, as recursive partitioning of the training dataset can result in data overfitting.

Evolutionary algorithms (EAs) are able to overcome the problems of convergence to suboptimal solutions as they execute a robust global search. Compared with greedy methods, this results in improved handling of attribute interactions.6 EAs are inspired by the principle of natural evolution. They use a population of individuals (candidate solutions) that evolve through generations while being subjected to exchange of genetic material (reproduction), mutation, and pressure to adapt to their environment (selection with regard to fitness). Each individual is evaluated according to a fitness function, and at each generation fitter individuals have a higher probability of advancing to the next population and reproducing. EAs are increasingly being used to evolve DTs because they provide accurate solutions and are able to maintain comprehensibility. Comprehensibility can be assured by evaluating candidate solutions with regard to both the accuracy and the size of the tree.

*Correspondence to: [email protected]
Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
DOI: 10.1002/widm.1079

This paper gives a general overview of the evolutionary approach to DT construction by describing the most relevant achievements and techniques in this field and by applying the described methods in an informative and easy-to-follow case study; for more detailed technical descriptions and discussions, other papers are available.1

Volume 3, March /Apr i l 2013 63c© 2012 John Wi ley & Sons , Inc .

Overview wires.wiley.com/widm

FIGURE 1 | An example of a DT (predicting whether a person makes more than 50k, from the Adult dataset11).

The paper is structured into five content sections followed by the conclusions. In the first two sections, we introduce DT classifiers and EAs. The next section focuses on the evolutionary construction of DTs, after which techniques for performance improvements are presented. Finally, the last section presents a case study of evolutionary DT construction for the University of California, Irvine (UCI) Adult dataset.11

DECISION TREES

A DT is a typical representative of a symbolic machine learning approach used for the classification of objects into decision classes, where an object is represented in the form of an attribute-value vector (attribute1, attribute2, . . . attributeN, decision class).7–10 An object is described by attributes (sometimes called fields, variables, or features), which are usually identified and selected by domain experts. The decision class is a special attribute whose value is known for the objects in the training set, and which will be predicted according to the induced DT for all further objects with an unknown decision class. Normally, the decision class is a feature that could not be measured (e.g., some prediction for the future) or a feature whose measuring is unacceptably expensive, complex, or not known at all. Examples of attributes-decision class objects are: a patient's examination results and diagnosis, a bank client's profile and credit loan decision, past and present weather conditions and the weather forecast, and business operations data and a business decision.

DTs can be used for two types of problems: classification (the decision class is a discrete variable, a label or category to which the data belongs) and regression (the decision class is a continuous variable). In a DT, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. An example of a DT is presented in Figure 1 (the Adult dataset is used in this example; its explanation is presented below). The machine learning technique for inducing a DT classifier from data (training objects) is called DT learning or DT induction.

The main goal of induction is to build a classification (or regression) model that can be used for prediction.12 Classification is thus a process of mapping instances (i.e., training or testing objects) represented by attribute-value vectors to decision classes. The aim of DT learning is to induce a DT model that is able to accurately predict the decision class of an object based on the values of its attributes. The classification of an object is accurate if the predicted decision class of the object is equal to the actual decision class of the object. The DT is induced using a training set (a set of objects where both the values of the attributes and the decision class are known), and the resulting DT is used to determine decision classes for unseen objects (where the values of the attributes are known but the decision class is unknown). A good DT should accurately classify both the given instances (a training set) and other unseen instances (a testing set).

DTs exhibit a wide range of applications, but they are especially attractive in data mining.12 Because of their intuitive representation, the resulting classification model is easy to understand, interpret, and criticize.4 A DT can be constructed relatively fast compared with some other methods.13 And last, the accuracy of DTs is comparable to that of other classification models.13,14 Every major data mining tool includes some form of DT model construction component.15


The Classical (Statistical) Induction of DTs
Classically, a DT is built from a set of training objects according to the 'divide and conquer' principle. When all objects are of the same decision class (the value of the output attribute is the same), then the tree consists of a single node: a leaf with the appropriate decision. Otherwise, an attribute is selected and the set of objects is divided according to the splitting function of the selected attribute. The selected attribute builds an attribute (test) node when growing the DT classifier, and for each outgoing edge from that node the induction procedure is repeated upon the remaining objects of the respective division, until a leaf (a decision class) is encountered.
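The recursive 'divide and conquer' procedure just described can be summarized in a few lines of code. The following is only a minimal sketch under stated assumptions: training objects are dictionaries of attribute values with a 'class' key, every attribute is nominal, and choose_attribute is a placeholder for whichever statistical measure (information gain, gain ratio, Gini index, . . .) a concrete algorithm uses.

```python
from collections import Counter

def choose_attribute(objects, attributes):
    # Placeholder: a real induction algorithm picks the attribute that
    # maximizes a statistical measure such as information gain.
    return attributes[0]

def induce(objects, attributes):
    """Top-down ('divide and conquer') induction of a DT."""
    classes = [o["class"] for o in objects]
    if len(set(classes)) == 1:                        # homogeneous subgroup: a leaf
        return {"leaf": classes[0]}
    if not attributes:                                # no tests left: majority leaf
        return {"leaf": Counter(classes).most_common(1)[0][0]}
    attr = choose_attribute(objects, attributes)
    remaining = [a for a in attributes if a != attr]
    node = {"test": attr, "branches": {}}
    for value in sorted({o[attr] for o in objects}):  # one edge per attribute value
        subset = [o for o in objects if o[attr] == value]
        node["branches"][value] = induce(subset, remaining)
    return node
```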

In the 1960s, Hunt et al.16 introduced the Concept Learning System (CLS), which used a look-ahead heuristic to construct trees. CLS was a learning algorithm that learned concepts and used them to classify new cases. CLS was the precursor to DTs, and it led to Quinlan's ID3 system,5 which added the idea of using information content to choose the attribute to split on. Quinlan later upgraded ID3 with an improved industrial version, C4.5,3 that is still regarded as the reference model for building a DT based on the traditional statistical approach. However, there are many methods for constructing DT classifiers, various splitting criteria, and pruning methodologies.17

Variants of DTs
Both algorithms, ID3 and C4.5, represent the most common DT approach: univariate classification trees. They use the information gain of a single attribute to build a DT. In this manner, the attribute that adds the most information about the decision upon the training set is selected first, and the next one selected is the most informative among the remaining attributes, until a subset of the training data is clear enough to be classified with a specific decision class in a leaf node of the DT.

Although univariate classification trees are used in most applications, there are some extensions and variations of this concept. Unlike a univariate DT, a multivariate DT18 is not restricted to splits of the instance space that are orthogonal to the features' axes. Utgoff's ID519 was an extension of ID3 that allowed many-valued classifications as well as incremental learning. In 1984, Breiman et al.4 introduced CART, which, in general, uses the same basic algorithm as Quinlan's ID3 and C4.5. At the decision node level, however, the algorithm becomes extremely complex. CART starts out with the best univariate split. It then iteratively searches for perturbations in attribute values (one attribute at a time) that maximize some goodness metric. At the end of the procedure, the best oblique and axis-parallel splits found are compared, and the better of these is selected. CART also allowed the induction of a special kind of DT, namely regression trees, where a tree leaf is a continuous-valued attribute and not a discrete attribute (decision class) as in more traditional classification trees. A particular case of DTs employed to solve regression problems are also model trees,20 which have the advantage of presenting an interpretable output with an acceptable level of predictive performance.

EVOLUTIONARY ALGORITHMS

Many computational problems require a search through a huge solution space to find potential solutions. To achieve good performance in a nonstatic environment, some problem-solving computer programs must be adaptable or able to invent new, original solutions. Biological evolution is an inspiration for methods that address these requirements. Evolution can be regarded as a search method: possible solutions are organisms that are able to survive and reproduce in the given environment. The diversity of the population is achieved through reproduction and mutation of individuals, while survival within the environment serves as the evaluation of the fitness of each individual. Individuals better adapted to the environment live to reproduce, whereas others become extinct.

In computer science, EAs were introduced during the 1950s and 1960s. Representatives of EAs are genetic algorithms (GAs),21,22 genetic programming,22 evolutionary programming,23 and evolution strategies.24 A standard GA is executed as follows: an initial population of individuals is generated, and the individuals in the population evolve by repeating the process of selection, cross-over, and mutation at each generation. The fittest candidates are selected into the next generation and are able to reproduce and mutate. The process stops when either the sought solution is found or a predetermined number of generations has been reached. The stopping condition is one of the control parameters that guide the evolution process. Control parameters are essential for successful convergence of the evolution, and they include: population size, selection strategy, generation gap, cross-over rate, and mutation rate. The values selected for the control parameters depend on the type of problem being solved, its representation, and its complexity.

Rationale for Using Evolutionary Methods
The fact is that deterministically derived DTs do not always achieve the best solutions. The interactions that govern complex natural systems prevent simple one-by-one mathematical inference of tests, as practiced by most DT induction algorithms. To illustrate why, consider a game of chess.

FIGURE 2 | The general process of DT construction: preinduction (data preparation: feature selection and reduction, subsampling, oversampling, . . .), induction (attribute selection, discretization, in-node test selection, . . .), and postinduction (tree optimization: pruning, . . .).

Current state-of-the-art chess playing algorithms evaluate the position after every move for several moves in advance. However, the combinatorial explosion prevents them from evaluating every possible position (if this were possible, human masters could never beat computers; even more, the winner would be known before the start of the game). Consequently, they must decide to ignore the 'unpromising' game paths and focus on the remaining ones. The problem lies in the decision which game unfolding to inspect and which to ignore. A mistake here, and the algorithm will overlook the path to an otherwise winning position.

The same is true in the DT induction process. The choice of a suboptimal test will influence the distribution of training samples later on. It is impossible to computationally create the best test by inspecting only a subset of the resulting subtrees. But it is also impossible to evaluate all possible resulting subtrees; therefore, some kind of compromise must be made. EAs, in contrast, build the complete tree first; only then is it inspected for performance. This approach is unbiased and free to achieve otherwise overlooked solutions.

Evolutionary DT construction differs from deterministic induction in many aspects. On the downside, it is slower; it is governed by plenty of control parameters and is very susceptible to their values; it needs an experienced human operator to fine-tune the fitness function and all the parameters; and, because of the underlying random mechanisms, it produces results of different quality in different runs. Deterministic approaches, however, can get stuck in local optima, are unable to produce different results if needed, and are more susceptible to noise and errors in the training data.

CONSTRUCTION OF DTs WITH EAs

To be able to use EAs for the construction of DTs, one must be familiar with both processes. The general process of DT induction is presented in Figure 2. There are three main phases of the process: preinduction, where data is prepared to be used for the induction and evaluation of DTs; induction, where DTs are actually being induced using the prepared data; and postinduction, where the induced DTs are being optimized.

The general process of solving a problem with EAs is presented in Figure 3. It starts with setting up the environment, where the representation of an individual within the evolutionary process is defined first, then the genetic operators are chosen, and finally the control parameters are set. After the setup, the evolutionary process can be started by constructing the initial population of individuals. Then the evolutionary cycle starts: each individual is evaluated using the defined fitness function, selection is performed based on the fitness evaluation, and selected individuals undergo cross-over and mutation, producing offspring, which represent the next generation of individuals.


FIGURE 3 | The general process of EAs (for DT construction): a set-up phase (representation, evolutionary parameters), the initial population (initial random DT induction), the fitness function (e.g., multi-objective or cost-sensitive), selection (e.g., tournament or roulette wheel), the genetic operators (cross-over, mutation), and termination (fitness threshold, generation count, time).

The evolutionary cycle is repeated until the defined termination criteria are met. This whole EA process of inducing DTs is described in detail in the following sections (see Box 1).
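The evolutionary cycle described above can be written down as a short loop. The sketch below only fixes the control flow (evaluate, select, recombine, mutate, repeat for a fixed number of generations); the concrete operators are passed in as functions, and fitness is assumed to be a penalty score (lower is better), as in the case study later in this paper.

```python
import random

def evolve(make_random_tree, fitness, select, crossover, mutate,
           population_size=100, generations=200, mutation_rate=0.1):
    """Generic EA loop for DT construction (compare Figure 3)."""
    population = [make_random_tree() for _ in range(population_size)]
    for _ in range(generations):
        # Evaluate every individual with the supplied fitness function.
        scored = [(fitness(tree), tree) for tree in population]
        offspring = []
        while len(offspring) < population_size:
            parent1, parent2 = select(scored), select(scored)
            child = crossover(parent1, parent2)       # exchange genetic material
            if random.random() < mutation_rate:       # occasional mutation
                child = mutate(child)
            offspring.append(child)
        population = offspring                        # next generation
    return min(population, key=fitness)               # best (lowest-penalty) DT
```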

Representation
Search and optimization algorithms require a representation of individuals that is appropriate to the problem they are solving. The representation (encoding) is important as it influences the performance. A representation can be either direct or indirect.25 A direct representation means that individuals are in their native format. The main benefit of using the direct representation is that the complexity of the problem is not expanded; the representation does not add any additional limitation or redundancy to the problem. The drawback is that direct representations usually do not allow for standard genetic operators (cross-over, mutation). Direct representations for nontrivial individuals require problem-specific search operators. Indirect representations allow the introduction of sensible additional constraints (possibly introducing problem-specific knowledge) and allow standard genetic operators. The representation (direct or indirect) should be a natural representation of the problem.26

DTs have an obvious direct representation: the tree data structure. As the tree is a common data structure in computer science, it is often used to represent DTs. Figure 4 presents the direct representation of a DT as a tree data structure, where each node has four values: node type (test/leaf node), testing attribute (test node) or decision class (leaf node), and splitting function (testing operator and value). Direct representations of DTs do not allow the use of standard genetic operators; however, modifications to their implementations are not too demanding.


FIGURE 4 | A decision tree's direct representation as a tree data structure. A node is a quadruple of values: node type, testing attribute or decision class, and splitting function (testing operator and testing value).

Therefore, direct representations of DTs are a natural representation of individuals, with manageable effort required for the changes to the genetic operators. In the literature, direct representations of DTs are often used.27–29
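The quadruple-valued node of Figure 4 can be expressed, for illustration, as a small data structure. The sketch below assumes binary, axis-parallel tests on numeric attributes and instances given as attribute-value dictionaries; the field names are ours, not those of any published implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    is_leaf: bool                     # node type: leaf or test node
    attribute: Optional[str] = None   # testing attribute (test nodes only)
    operator: str = "<"               # splitting function: operator ...
    value: Optional[float] = None     # ... and testing value
    decision: Optional[str] = None    # decision class (leaf nodes only)
    left: Optional["Node"] = None     # subtree where the test succeeds
    right: Optional["Node"] = None    # subtree where the test fails

def classify(node: Node, instance: dict) -> str:
    """Route an attribute-value vector down the tree to a decision class."""
    while not node.is_leaf:
        attribute_value = instance[node.attribute]
        passes = (attribute_value < node.value if node.operator == "<"
                  else attribute_value >= node.value)
        node = node.left if passes else node.right
    return node.decision
```

For example, a leaf is simply Node(is_leaf=True, decision='>50k'), while an internal node tests one attribute and routes instances into its two subtrees.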

Indirect representations are also common,30,31 as they allow the introduction of sensible additional constraints (possibly introducing problem-specific knowledge) and allow standard genetic operators. Indirect representations include the Prufer sequence,32 predecessor representations,33 characteristic vector encoding,34,35 and link and node biased encodings.32 An approach to constructing binary DTs using an indirect representation was presented by Cha and Tappert.30 The encoding transforms the binary trees into a string representation composed of integers. The attribute names in the tree are converted into an index of attribute encodings ordered according to the attribute list.

Initial Population
The first step in the evolutionary process is the generation of the initial population of individuals. The process of generating the initial population depends on the encoding. If the encoding allows for infeasible solutions, then each generated individual of the first population needs to be verified for its feasibility. The most common technique for generating the initial population is the random generation of individuals. It resembles the process of deterministic DT induction, described earlier, but all the attributes and their splitting functions are chosen randomly rather than by some statistical measure. However, problem-specific knowledge can be used to seed the initial population with solutions that have been proven good and/or are previously known for the solved task. This can shorten the convergence of the evolutionary approach significantly, but it brings the inherent risk of the algorithm getting stuck in a local optimum and therefore never finding the globally optimal solution.

In practice, the first generation is very often created randomly. The procedure for the generation of the first population is straightforward: generate a random tree with a predetermined depth (an optional control parameter), randomly selecting the attributes of the node tests (for each node in the tree) as well as the splitting functions. Repeat the process of tree generation until a sufficient number of trees has been generated (determined by the population size). Approaches are known where the distance from the root to any leaf node is the same36,37 or varies38 within generated DTs. Another approach is to restrain the depth of DTs to two (a root node and its leaves).29
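A random initialization of the kind described above can be sketched as follows, assuming binary numeric tests, a dictionary-based node format, and a simple depth limit; the probability of stopping early and the way thresholds are drawn from the training objects are arbitrary illustrative choices.

```python
import random

def random_tree(objects, attributes, classes, max_depth=4):
    """Grow a random DT: attributes and split thresholds are chosen at
    random rather than by a statistical measure."""
    if max_depth == 0 or random.random() < 0.2:                  # stop: random leaf
        return {"leaf": random.choice(classes)}
    attribute = random.choice(attributes)                        # random test attribute
    threshold = random.choice([o[attribute] for o in objects])   # random split value
    return {"test": attribute,
            "threshold": threshold,
            "left":  random_tree(objects, attributes, classes, max_depth - 1),
            "right": random_tree(objects, attributes, classes, max_depth - 1)}
```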

Various strategies for improving the totally random generation of the node tests exist. For example, Kretowski and Grzes27 preprocess the continuous-valued attributes by calculating the boundary thresholds. This significantly limits the number of possible splits, therefore making the entire algorithm faster and more robust. The individuals in the population are generated by first applying the classic top-down algorithm, which chooses tests in a dipolar way. Two objects are randomly chosen from the feature vectors located in the considered node; then a test that separates the two objects into different subtrees is created by considering only attributes with different feature values. The divisions are recursively repeated until the stopping conditions are met. The final generated tree (an individual of the initial population) is postpruned based on the fitness function.

E-motion's28 initial population generation is twofold: first, basic trees are built based on each dataset attribute.


For each numerical attribute of the dataset, five different basic trees are generated; the first uses standard deviation reduction over the entire training set to define the threshold value, and the other four are based on four partitions of the training set. For categorical attributes, the tree is composed of a root node that tests the given attribute and an edge for each attribute category. The second step is the aggregation of different basic trees, whose maximum depth value is set by the user: E-motion randomly combines the basic trees to create trees with a limited maximum depth.

Fitness Function
The most critical point of any evolutionary process is the estimation of the quality of an individual DT, which is typically done by the so-called fitness function. Each DT's associated fitness determines which DTs survive into the next generation. Fitness is critical because assigning a bad fitness to a good DT hinders further evolution of this particular tree and results in investing time in other, not so good DTs. Consequently, we end up dealing with the offspring of worse trees instead of searching among the children of good trees. Fitness mistakes therefore waste resources (CPU, memory) and delay the already slow evolution even more. Moreover, bad fitness prevents the EA from recognizing even an otherwise perfect DT.

The most common fitness function simultaneously observes two conflicting goals: classification accuracy and DT simplicity. The classification accuracy of a given individual (candidate DT) on the training set is, by definition, an imperfect approximation of the quality measure that we really want to maximize, namely the predictive accuracy on new or previously unseen data. This brings forward the second property, namely the DT's size. By optimizing for smaller trees, we hope to follow the principle of Occam's razor and produce trees that are able to generalize and perform well also on previously unseen data.

Because the two properties are conflicting (only a large tree can classify every instance in the training set correctly, and a small tree is more likely to be more general), a delicate balance between the two properties is needed. Many authors use weights to tailor the fitness to a particular problem. For example, Zorman et al.39 used special self-adapting weights.

With any fitness function there is always a danger of producing DTs with a good fitness score but of small practical value. For example, a fitness function rewarding small trees will assign the best reward to an empty tree! Or, in the case of maximizing accuracy, the false reward easily goes to one-leaf majority-class trees. Evolutionary search requires a fitness with a stable, continuous balance between classification accuracy and DT complexity.40

For this reason, the fitness function for DT evolution is in essence a multiobjective function, which transforms two or more separate assessments of quality of the respective objectives into one scalar grade. Merging is necessary for the subsequent phase of selection. In general, multiobjective optimization results in a set of nondominated DTs, where each tree is such that it cannot get better in one objective unless it gets worse in another. The claim by many authors that their particular multiobjective EA implementations are successful implies that the associated fitness functions are appropriate for the given problem domains.41 Multiobjective optimization for a cost-sensitive classification system is used in Ref 37; the evolution of model trees with linear-regression models in their leaves using distinct multiobjective optimization strategies is described in Ref 42.

An additional issue when calculating fitness is cost. Typically, a cost is associated with the execution of certain tests required by the DT to make the classification. Another type of cost relates to erroneous predictions. Evolution can optimize the solution with respect to such costs only if the costs are incorporated into the fitness function. A typical solution is then to introduce particular costs for particular types of errors. In Ref 39, costs were associated with the use of certain attributes in the decision process.

To simultaneously optimize the conflicting goals of accuracy and simplicity, several approaches are used in practice: for example, the (conventional) weighted formula,39 the Pareto approach,43 the lexicographic multiobjective approach,44,45 and others, each with its own strengths (simplicity) and weaknesses (the 'magic number' problem of weights, adding up noncommensurable criteria such as accuracy and tree size, and mixing different units of measurement, . . .).46 To push the search in desired directions, many authors introduce special costs.47
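As a concrete illustration of the (conventional) weighted-formula approach, the sketch below combines the classification error on the training set with the tree size into a single penalty score. The weights, the dictionary-based tree format, and the classify helper passed as an argument are illustrative assumptions, not the formula of any particular cited system.

```python
def count_nodes(tree):
    """Size of a binary, dictionary-based DT (test nodes plus leaves)."""
    if "leaf" in tree:
        return 1
    return 1 + count_nodes(tree["left"]) + count_nodes(tree["right"])

def weighted_fitness(tree, objects, classify, w_error=1.0, w_size=0.001):
    """Penalty fitness: weighted sum of error rate and tree size (lower is better)."""
    errors = sum(1 for o in objects if classify(tree, o) != o["class"])
    return w_error * errors / len(objects) + w_size * count_nodes(tree)
```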

Selection
There are three basic components used to evolve a population of individuals: selection, cross-over, and mutation. Selection is a method of choosing individuals for reproduction; it promotes fitter individuals (according to the fitness function) to be transferred to the next generation with the intention of further improving the fitness of their offspring. Cross-over is used to exchange genetic material of (normally two) individuals to propagate their characteristics within the population. Mutation is used to alter a (rather small, i.e., a gene) part of an individual to enhance genetic diversity within the population.


FIGURE 5 | An example of the cross-over operator on a tree-based representation of a DT.

Although the selection is usually independent of the chromosome (a representation of an individual), both cross-over and mutation are defined in accordance with the representation of an individual within the evolving population of DTs.

Selection is the process of choosing individuals that will undergo cross-over and mutation. Selection is based on the fitness value of individuals, as the fittest individuals have the best chance to be selected for reproduction. Selection is one of the most independent parts within an EA, as it is independent of the domain. This means that basically the same selection operator can be used regardless of the rest of the evolutionary process and also regardless of the application. In this manner, for the evolutionary construction of DTs, the same selection operators are used as for the majority of other applications of EAs. The main difference between selection operators is how they define the chance of an individual to be selected for reproduction based on its fitness. The most frequently used approaches are tournament selection, roulette wheel selection, and rank-based selection. Selection should be balanced with the other operators: if the selection pressure is too high, the diversity of the population will be reduced (the solution converges to a local optimum); if the pressure is too low, the evolution tends to be very slow (many generations with little progress).
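Tournament selection, one of the approaches listed above, is particularly simple to sketch: draw a few individuals at random and keep the best of them. The snippet assumes the population is kept as (fitness, individual) pairs with a penalty fitness (lower is better); the tournament size is a control parameter (a value of 3 is used in the case study later on).

```python
import random

def tournament_select(scored_population, tournament_size=3):
    """Pick tournament_size random (fitness, individual) pairs and return
    the individual with the best (lowest) fitness."""
    contestants = random.sample(scored_population, tournament_size)
    return min(contestants, key=lambda pair: pair[0])[1]
```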

Genetic Operators: Cross-over and Mutation
The cross-over operator is used to exchange genetic material between two chromosomes. It imitates sexual reproduction in nature and produces new chromosomes (offspring). Basic cross-over implementations on a fixed-length string representation work by selecting two 'parents' from the current generation, then randomly selecting a position within the parents and exchanging the genes. This is called a one-point cross-over; other well-known types include a two-point cross-over (a substring with a start and end point is selected within each parent) and a uniform cross-over (parents exchange information at the gene level with multiple-point cross-over).

The most common cross-over operation on a tree-based representation of a DT is to randomly select nodes in two individuals and then exchange the entire subtrees that correspond to each selected node (see Figure 5). In this manner, one or two offspring are generated.


This kind of cross-over is used in the majority of existing approaches, including Refs 29, 37, 38, 40, 47–49.
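The subtree-exchange cross-over of Figure 5 can be sketched on the dictionary-based binary tree format used in the previous snippets: pick a random node in each parent and graft a copy of the second parent's subtree into the chosen position of a copy of the first, yielding one offspring. The path-based bookkeeping is an implementation convenience, not part of any cited method.

```python
import copy
import random

def node_paths(tree, path=()):
    """Paths (sequences of 'left'/'right' steps) to every node in the tree."""
    paths = [path]
    if "leaf" not in tree:
        paths += node_paths(tree["left"], path + ("left",))
        paths += node_paths(tree["right"], path + ("right",))
    return paths

def subtree_at(tree, path):
    for step in path:
        tree = tree[step]
    return tree

def crossover(parent1, parent2):
    """One-offspring subtree exchange between two DTs."""
    child = copy.deepcopy(parent1)
    cut = random.choice(node_paths(child))
    graft = copy.deepcopy(subtree_at(parent2, random.choice(node_paths(parent2))))
    if not cut:                      # the cut point is the root: return the graft
        return graft
    parent_of_cut = subtree_at(child, cut[:-1])
    parent_of_cut[cut[-1]] = graft
    return child
```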

Variants of this cross-over approach exist, which usually define some restrictions on when and how the cross-over is performed. In Ref 50, the authors introduce a cross-over that uses a randomly selected training object to determine the paths (by finding a decision node through the tree) in both selected individuals. Then, an attribute node is randomly selected on the path in the first individual and an attribute node is randomly selected on the path in the second individual. Finally, the subtree from the selected attribute node in the first individual is replaced with the subtree from the selected attribute node in the second individual, providing one offspring. This approach has been adopted and slightly changed also in Ref 51. In Ref 36, the authors introduce a 'test-only exchange cross-over' in which, instead of replacing the entire subtrees, only the test that is represented by an attribute-value pair is replaced. This type of cross-over demands that the number of outcomes of the selected nodes be equal in order to preserve a valid tree structure.

The mutation operator is used for reclaiming diversity in the population, and it is an instrument of innovation and variation. Mutation changes individual genes in an individual (chromosome) and can thus bring new information into the population (or can reanimate information lost in previous generations). Mutation differs from cross-over in that it is used for local search: it can only change some properties of an individual and cannot combine properties from multiple individuals. The mutation is not applied to all individuals; whether it is applied or not is determined randomly, based on the values of the mutation probability parameters.

The most common mutation operations on a tree-based representation of a DT are:

• Change of an attribute node (one attribute is randomly replaced by another),

• Change of a split function (a splitting function in an attribute node is changed; in the case of axis-parallel DTs, this means that different intervals are chosen randomly, which distribute all the instances into subtrees from the attribute node),

• Change of a leaf/a decision node (a decision class is randomly changed into another),

• Change of an attribute node into a decision node (a randomly chosen attribute node is changed into a decision node; all the subtrees from this node are subsequently deleted), and

• Change of a decision node into an attribute node (a randomly chosen leaf is replaced by a newly generated subtree).

The above mutation operations (or some subset of them) are used in the majority of existing approaches on a tree-based representation of a DT, including Refs 36–38, 45, 47, 52.

Some possible mutations (a change of a single attribute node, a change of a split function value, and a change of a single decision node into a new attribute node) on a tree-based representation of a DT are depicted in Figure 6.

The most common mutation operator on a fixed-length string representation of a DT is an alteration of a randomly chosen gene, which may change the attribute being used, its test value, or both, depending on the approach, but usually does not change the tree structure.
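Two of the mutation operations listed above (changing the attribute of a test node and changing its split function) can be sketched on the same dictionary-based format; how the node and the new threshold are chosen, and the 50/50 split between the two operations, are illustrative assumptions.

```python
import random

def test_nodes(tree):
    """References to all internal (test) nodes of a dictionary-based DT."""
    if "leaf" in tree:
        return []
    return [tree] + test_nodes(tree["left"]) + test_nodes(tree["right"])

def mutate(tree, attributes, objects):
    """Mutate one randomly chosen test node in place and return the tree."""
    nodes = test_nodes(tree)
    if not nodes:                     # a single leaf: nothing to mutate
        return tree
    node = random.choice(nodes)
    if random.random() < 0.5:
        # Change of an attribute node: replace the tested attribute
        # (a matching new threshold is drawn below).
        node["test"] = random.choice(attributes)
    # Change of the split function: draw a new threshold for the (possibly
    # new) tested attribute from the observed training values.
    node["threshold"] = random.choice([o[node["test"]] for o in objects])
    return tree
```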

With the combination of cross-over and mutation, the evolutionary search is directed toward the globally optimal solution, that is, the best DT with regard to the used fitness function. As the evolution proceeds, better solutions are obtained with regard to the chosen fitness function. The evolution stops when the termination criteria have been satisfied.

Termination Criteria
The termination (stopping) criterion is a decision about when to stop evolving. Although not critical to the evolution process itself, it is an important part of the EA. It is typically bound to:

• Quality of DT(s), and/or

• Availability of resources.

The quality of evolved DTs should be visible from their fitness scores, and a typical termination criterion is therefore a fitness threshold. Because fitness is an imperfect measure, other approaches are used to assess the achieved quality. Mainly they are governed by a human, who is given some graphic representation of the progress in separate objectives (e.g., accuracy, size, diversity, cost, . . .). Of course, a multiobjective termination rule can also be formed.

FIGURE 6 | Some examples of the mutation operator on a tree-based representation of a DT.

Termination because of limited resources typically observes resources like time (real-time clock, CPU cycles, generations, offspring, . . .) and memory (storage, population size, . . .). Evolution is a multipoint search method and as such consumes a lot of resources. Genetic operators and fitness calculation can also be quite time-consuming operations; therefore, time is the main termination criterion in this group.

A hybrid stopping criterion, for example, is trig-gered if there is no progress in fitness in the specifiedamount of time/number of generations.
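Such a hybrid rule is easy to state in code. The sketch below stops the evolution when the best (lowest) fitness has not improved for a given number of generations or when a generation budget is exhausted; the patience and budget values are illustrative.

```python
class NoProgressStop:
    """Hybrid termination: no fitness progress for `patience` generations,
    or a hard limit of `max_generations`."""

    def __init__(self, patience=30, max_generations=500):
        self.patience = patience
        self.max_generations = max_generations
        self.best = float("inf")
        self.stale = 0
        self.generation = 0

    def should_stop(self, best_fitness):
        self.generation += 1
        if best_fitness < self.best:     # progress: remember it and reset the counter
            self.best = best_fitness
            self.stale = 0
        else:                            # no progress in this generation
            self.stale += 1
        return self.stale >= self.patience or self.generation >= self.max_generations
```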

Artificial evolution often includes qualitative jumps that are impossible to predict. The termination problem is, just like fitness, a multiobjective problem. Evolution that lasts too long will probably produce DTs that are overfitted to the training data. Evolution that terminates too early, however, could still be far away from the best achievable solution.

ENVIRONMENT SETUP FOR PERFORMANCE IMPROVEMENT

The performance of machine learning approaches can be improved by using general procedures (independent of the method used) and specific procedures (which depend on the applied method). General procedures include both pre- and postinduction, whereas specific procedures, in the case of evolutionary construction of DTs, involve fine-tuning of the control parameters.

Preinduction: Preparing the Data
DT construction, being a representative of the symbolic machine learning approach, is a specific step in an overall process of knowledge discovery in databases (Figure 7). To date, most modern data mining tools have focused almost exclusively on building models.53 Yet enormous dividends come from applying the modeling tools to correctly prepared data, although preparing data for modeling can be a time-consuming process, traditionally carried out by hand and very hard to automate.

The phases of the knowledge discovery in data process are: identification of possible data sources, feature selection, data cleaning and preprocessing, model induction (the construction of the DT model in our case), and finally interpretation and evaluation of the obtained models.

The main goal of the three preinduction phases (identification of data sources, feature selection, and data cleaning and preprocessing) is to prepare an accurate and informative dataset of high quality, as the quality of the induced model can only be as high as the quality of the data used for the induction.

A very critical aspect of the evolutionary construction of DTs is the balance between exhaustiveness and conciseness within the dataset, both with regard to the number of attributes and the number of cases. Although more data increases the possibility of including the hidden patterns (knowledge) within the data, it also increases the search space. On the contrary, the reduction of data increases the possibility of finding those patterns (knowledge) within the data, but it also increases the danger of eliminating important patterns from the data and thus decreasing its informative value for prediction.

FIGURE 7 | The overview of the knowledge discovery in data process: data sources, feature selection, cleaning and preprocessing, DT induction, and interpretation and evaluation, transforming raw data into preprocessed data, a DT model, and finally knowledge.

Postinduction: Optimizing DTs
The structure of directly evolved DTs is mostly full of unnecessary nodes. Many are of a cosmetic nature, do not influence the decision process, and can be resolved easily. The most frequent problems are the presence of extraneous nodes (the introns) and the presence of duplicated leaves.

Introns are segments of a DT that do not contribute to the decision process and can be removed without changing the DT's behavior. During the early and middle part of the run, introns may well have beneficial effects as they protect against destructive cross-over; later on, they result in genetic bloat, which prevents further evolution.

An example intron is depicted in Figure 8, where the Capital-gain and Sex subtree is irrelevant for the decision process, as it is never used due to the mutual exclusiveness of the two Capital-gain tests. The intron nodes (tests and leaves) can be completely removed for better clarity and human comprehension. Note that the two leaves inside the intron segment have no associated class (marked with '?'), as no training samples have traversed the tree to reach them; hence, the decision is impossible and unnecessary.

Next is the issue of duplicated leaves, depicted on the hatched surface of Figure 8. The question asked produces the same decision regardless of the actual gender of the person. Consequently, the test and both leaves can be replaced with one leaf returning the '≤50k' decision.

Execution of both postprocessing optimizations (removal of introns and simplification of a subtree) effectively replaces both marked subtrees with a '≤50k' leaf.

In practice, more complex procedures that change the DT's structure are common. They include DT pruning, a technique similar to the removal of duplicated leaves. Pruning removes certain nodes and therefore reduces the tree size, with the hope of increasing the generalization capability; many tests and leaves are namely overfitted to the training data. For example, imagine a test that routes the arriving (training) samples into two leaves; to the first, 200 samples of class '>50k' are sent, to the second only one sample of class '≤50k'. Pruning replaces such a test with a leaf that has the majority (200:1) decision of '>50k'. Pruning itself is a complex issue and is difficult to perform quickly enough for on-the-fly pruning during an evolution run, which can be very beneficial for the evolution results.49
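The simple pruning step described in this paragraph can be sketched as follows on the dictionary-based binary tree format used earlier: if both children of a test node are leaves and the training samples reaching the node are dominated by one class (such as the 200:1 case above), the test is replaced by a majority leaf. The dominance threshold is an illustrative assumption; this is not the on-the-fly pruning of the cited work.

```python
from collections import Counter

def prune(tree, objects, max_minority=0.05):
    """Bottom-up pruning: collapse a test whose two leaf children see an
    (almost) pure sample of training objects into a majority-class leaf."""
    if "leaf" in tree or not objects:
        return tree
    left_objects = [o for o in objects if o[tree["test"]] < tree["threshold"]]
    right_objects = [o for o in objects if o[tree["test"]] >= tree["threshold"]]
    tree["left"] = prune(tree["left"], left_objects, max_minority)
    tree["right"] = prune(tree["right"], right_objects, max_minority)
    if "leaf" in tree["left"] and "leaf" in tree["right"]:
        counts = Counter(o["class"] for o in objects)
        majority_class, majority_count = counts.most_common(1)[0]
        if 1 - majority_count / len(objects) <= max_minority:
            return {"leaf": majority_class}   # e.g., a 200:1 split becomes one leaf
    return tree
```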

Although pruning and other postprocessing techniques can reduce a DT's size, they do not guarantee the smallest possible tree. It is believed that redundancy control with (explicit) code simplification during the evolution run results in more compact and better solutions.54

Control Parameters
The most common parameters that control the performance of EAs are: population size, mutation rate, cross-over rate, generation gap, and selection strategy. Numerical parameters are compared in Table 1 according to their effect on the GA with regard to higher and lower values of each parameter.


FIGURE 8 | Introns and duplicated leaves (a subtree with two mutually exclusive Capital-gain tests, unreachable '?' leaves, and a Sex test whose two leaves both return '≤50k').

TABLE 1 | A Comparison of Numerical Control Parameters with Regard to Range, Low and High Values of Each Parameter56

Operator | Effect | Range | High Value | Low Value
Population size | Performance and efficiency | ≥2 | Prevents premature convergence to a nonoptimal solution | Poor performance, insufficient coverage of the search space
Cross-over rate | Diversity of the population | 0–100% | Better solutions are discarded too quickly | Search stagnates
Mutation rate | Secondary operator of diversity | 0–100% | Increasingly random search | Insignificant changes to individuals
Generation gap | Part of population replaced at each generation | 0–100% | Value 100 means none of the individuals are replaced | Value 0 means all individuals are changed

The minimal value of the population size is two (to accommodate cross-over); however, low values mean poor performance. Population size is directly correlated with the size of the problem space. Both diversity operators (cross-over, mutation) and the generation gap are in the range from 0% to 100%; the value of these operators refers to the number of individuals that are affected by each operator. Common selection policies include fitness proportionate selection, tournament selection, and rank-based selection, and potentially include elitism (the top k fittest individuals are promoted to the next generation).55

CASE STUDY: INDUCING EVOLUTIONARY DT ON ADULT DATASET

In this section, we present a case study in which the whole process of inducing a DT using EAs is explained. For the case study, we use the Adult dataset from the UCI Machine Learning Repository.11 The Adult dataset was extracted in 1994 from census data of the United States. It contains six continuous and eight nominal attributes, describing some social information about the registered citizens. The prediction task related to this dataset is to determine whether a person, characterized by 14 attributes like education, race, occupation, marital status, and so on, makes over $50k/year or not. The attributes are described in Table 2.


TABLE 2 | Attributes of the Adult Dataset

Attribute Name | Possible Values
Age | Continuous: [17–90]
Work class | Private, Self-emp-not-inc, Self-emp-inc, Federal gov, Local gov, State gov, Without pay, Never worked
fnlwgt | Continuous: [12,285–1,484,705]
Education | Bachelors, Some college, 11th, HS grad, Prof school, Assoc acdm, Assoc voc, 9th, 7th 8th, 12th, Masters, 1st 4th, 10th, Doctorate, 5th 6th, Preschool
Education num | Continuous: [1–16]
Marital status | Married civ spouse, Divorced, Never married, Separated, Widowed, Married spouse absent, Married AF spouse
Occupation | Tech support, Craft repair, Other service, Sales, Exec managerial, Prof specialty, Handlers cleaners, Machine op inspct, Adm clerical, Farming fishing, Transport moving, Priv house serv, Protective serv, Armed Forces
Relationship | Wife, Own child, Husband, Not in family, Other relative, Unmarried
Race | White, Asian Pac Islander, Amer Indian Eskimo, Other, Black
Sex | Female, Male
Capital gain | Continuous: [0–99,999]
Capital loss | Continuous: [0–4,356]
Hours per week | Continuous: [1–99]
Native country | United States, Cambodia, England, Puerto Rico, Canada, Germany, Outlying US (Guam USVI etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El Salvador, Trinadad & Tobago, Peru, Hong, Holand Netherlands

BOX 1: DTs AND EAs

To replace the classic induction of DTs with EAs, the individuals (DTs) need to be represented in a way that allows the evolutionary process. In general, any problem can be approached (solved) with EAs as long as the individual's representation can be handled by the genetic operators (selection, cross-over, mutation) and the fitness function allows ranking of individuals according to their quality. DTs can be represented directly (as a classic tree data structure) or indirectly (e.g., as a fixed-length string). Direct representation requires the modification of standard operators (cross-over and mutation). These modifications, however, are straightforward to implement and are thus often used. The fitness function is essential for the evaluation of the quality of each individual and consequently for convergence to a globally optimal solution. As DTs are classifiers, the fitness function is obvious: classification accuracy. Often, the fitness function is multiobjective (accuracy, size, cost, . . .).

It can be seen that several nominal attributes have a lot of possible values, which may represent a difficulty for the construction (especially when small, compact DTs are sought).

The dataset consists of 48,842 data instances altogether, which were originally divided into a training set (32,561 instances) and a testing set (16,281 instances).

TABLE 3 | Comparison of Training and Testing Set for the Adult Dataset

 | Training Set | Testing Set
Instances | 32,561 | 16,281
Missing values | 4,262 | 2,203
Class distribution: >50k; ≤50k | 24.08%; 75.92% | 23.62%; 76.38%

For the purpose of comparison with other classification algorithms, this original division has also been used in our experiment. The basic comparison of the training and testing sets is presented in Table 3.

For this experiment, the genTrees algorithm has been used,56 but basically any of the existing implementations could have been chosen.

Setting Up the Environment for GA
First, the environment for the GA has to be set up. In this manner, an individual representation (a genotype) should be defined together with the genetic operators (selection, cross-over, and mutation), and the values of the genetic parameters for the evolutionary process should be set.

By choosing the genTrees algorithm, we decided to represent an individual in the most common way: a direct representation where an individual is represented as a binary DT.


FIGURE 9 | The fitness surface of the initial population (generation 0) with regard to accuracy and tree size (number of nodes); note that lower fitness is better.

TABLE 4 | The Setting Up of Genetic Parameters

Population size | 120
Elitism value | 1
Tournament size | 3
Mutation attribute prob | 0.13
Mutation split prob | 0.67
Mutation decision prob | 0.09
Mutation attribute to decision prob | 0.51
Mutation decision to attribute prob | 0.39
Weight for number of nodes (w_nn) | 0.0003
Weight for number of unused nodes (w_nu) | 0.0000
Weights for class accuracies (w_i) | w_1 = 0.61; w_2 = 1.51

Consequently, the binary DT representation allows us to use the very common cross-over and mutation operators, which will be explained later. As a binary DT is already a final solution to our problem (DT construction), no conversion from genotype to phenotype is needed.

After defining the genotype and the genetic operators, the values of the genetic parameters should be set up. Although there are some basic recommendations on how the genetic parameters should be set,57 the final setting still depends on experimentation and experience. For our experiment, we defined the genetic parameters as presented in Table 4.
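For reference, the parameter values of Table 4 can be collected in a single configuration object; the container below is merely a convenience for the sketches in this overview and is not the actual interface of the genTrees implementation.

```python
from dataclasses import dataclass

@dataclass
class GenTreesParameters:
    """The genetic parameters of Table 4 (attribute names are ours)."""
    population_size: int = 120
    elitism: int = 1
    tournament_size: int = 3
    mutation_attribute_prob: float = 0.13
    mutation_split_prob: float = 0.67
    mutation_decision_prob: float = 0.09
    mutation_attribute_to_decision_prob: float = 0.51
    mutation_decision_to_attribute_prob: float = 0.39
    weight_nodes: float = 0.0003         # w_nn, weight for number of nodes
    weight_unused_nodes: float = 0.0     # w_nu, weight for number of unused nodes
    class_weights: tuple = (0.61, 1.51)  # w_1, w_2, per-class accuracy weights
```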

Inducing the Initial Population
The genTrees algorithm induces binary DTs. The initial population of 120 individuals (generation 0) is induced randomly, according to the algorithm described in the section Initial Population.

Fitness Function
In our experiment, we relied on the genTrees' fitness function:

$$FF = \sum_{i=1}^{K} w_i \cdot (1 - acc_i) + w_{nn} \cdot nn$$

where K is the number of decision classes (K = 2 in our case), acc_i is the per-class accuracy for a specific decision class d_i (d_1 is '>50k', d_2 is '≤50k' in our case), w_i is the weight for penalizing the misclassification of objects of decision class d_i, nn is the number of decision (leaf) nodes, which represents the size of the tree, and w_nn is the weight for penalizing the size of the tree. The per-class accuracy describes the DT's accuracy when predicting one of the possible classes; contrary to the overall accuracy, the per-class accuracy does not take into account all records, just the records of the respective class. We must note that the fitness function defined above is a penalizing one, which means that its value represents a penalty score for a tree and thus the best fitness is the lowest one.

As can be seen from the above fitness function, the fitness of a DT in our case is determined by the accuracy of classification and the size of the tree. It is interesting to see how the fitness of individual DTs is distributed across the population with regard to both accuracy and size: Figure 9 shows this distribution for the initial population (generation 0), and Figure 10 shows this distribution for the population evolved after 100 generations (generation 100).

Performance of a Binary Classifier
The performance of a classifier is most easily described using a confusion matrix that summarizes the count of different predictions according to the actual results in the training data. The example matrix template in Figure 11 uses T_category (F_category) to denote the count of True (False) classifications into the respective category ('≤50k' or '>50k'). The Adult training set consists of 32,561 records, thus the sum of all four cells should equal 32,561.

In this overview, we present four particular fitness functions. The first fitness function is the accuracy. As always with accuracy, higher scores are better.

$$f_1 = \frac{T_{\leq 50k} + T_{>50k}}{T_{>50k} + T_{\leq 50k} + F_{>50k} + F_{\leq 50k}}$$


FIGURE 10 | The fitness surface of the population at generation 100 with regard to accuracy and tree size (number of nodes); note that lower fitness is better.

FIGURE 11 | DT's confusion matrix template: rows correspond to the actual class (>50k, ≤50k) and columns to the predicted class (>50k, ≤50k); the cells count T>50k and F≤50k for the actual >50k class, and F>50k and T≤50k for the actual ≤50k class.

The second function is based on the DT's error rate but also accounts for the costs of false categorizations. The costs of making an erroneous classification, C_false≤50k and C_false>50k, are problem dependent. We decided for C_false≤50k = 3 and C_false>50k = 1 (the error of falsely predicting low income is three times as expensive as the opposite error). If C_false≤50k were equal to C_false>50k, we would effectively produce the error rate function. The function f_2 is designed to assign low scores to good individuals:

f_2 = \frac{F_{\leq 50k} \cdot C_{false\leq 50k} + F_{>50k} \cdot C_{false>50k}}{T_{>50k} + T_{\leq 50k} + F_{>50k} + F_{\leq 50k}}

The third function is based on two properties: accuracy and size. To balance the ratio between them, a special arbitrarily large constant X = 10^8 is used. This function assigns high values to good DTs.

f_3 = \frac{(T_{>50k} + T_{\leq 50k})^2 + X}{size^2 + X}

The last fitness function, f_4, is, of course, the fitness function of the genTrees algorithm. It uses three special weights: two for the errors in the per-class accuracies and one for the size of the resulting tree. The size of the tree is measured as the number of internal tests, without the classifying leaves. As always, one has to experiment with the weights to see their effect on the overall progress. We chose w_>50k = 0.61, w_≤50k = 1.51, and w_size = 0.0003.

f_4 = w_{>50k} \cdot \left(1 - \frac{T_{>50k}}{T_{>50k} + F_{\leq 50k}}\right) + w_{\leq 50k} \cdot \left(1 - \frac{T_{\leq 50k}}{T_{\leq 50k} + F_{>50k}}\right) + w_{size} \cdot size

Examples of Fitness Calculation

We demonstrate the calculation of the last fitness function on the DT in Figure 12. This DT was produced quite early in the evolution run and exploits two features: capital gain and capital loss. Of the 32,561 records in the training set, it correctly predicts 2,300 records of the '>50k' class (T>50k) and 24,131 records of the '≤50k' class (T≤50k). This DT makes 5,541 (F≤50k) and 589 (F>50k) erroneous predictions. The formula gives: f4 = 0.61 × (1 − 2300/(2300 + 5541)) + 1.51 × (1 − 24,131/(24,131 + 589)) + 0.0003 × 3 = 0.467947.
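All four fitness functions can be reproduced from the confusion-matrix counts alone. The sketch below follows the formulas given above, with X = 10^8 and the weights chosen in the case study; run on the counts of the DT in Figure 12 (with 3 internal tests), it yields the scores reported in Table 5. Treat it as an illustration of the calculation, not the authors' implementation.

```python
X = 10 ** 8                                # balancing constant for f3
C_FALSE_LE, C_FALSE_GT = 3, 1              # misclassification costs used in f2
W_GT, W_LE, W_SIZE = 0.61, 1.51, 0.0003    # weights used in f4

def f1(t_gt, t_le, f_gt, f_le):
    """Accuracy (higher is better)."""
    return (t_le + t_gt) / (t_gt + t_le + f_gt + f_le)

def f2(t_gt, t_le, f_gt, f_le):
    """Cost-weighted error rate (lower is better)."""
    return (f_le * C_FALSE_LE + f_gt * C_FALSE_GT) / (t_gt + t_le + f_gt + f_le)

def f3(t_gt, t_le, size):
    """Accuracy/size ratio balanced by the large constant X (higher is better)."""
    return ((t_gt + t_le) ** 2 + X) / (size ** 2 + X)

def f4(t_gt, t_le, f_gt, f_le, size):
    """Weighted per-class errors plus a size penalty (lower is better)."""
    return (W_GT * (1 - t_gt / (t_gt + f_le))
            + W_LE * (1 - t_le / (t_le + f_gt))
            + W_SIZE * size)

# DT in Figure 12: T>50k = 2300, T<=50k = 24131, F>50k = 589, F<=50k = 5541, 3 internal tests.
print(f1(2300, 24131, 589, 5541))      # ~0.811738
print(f2(2300, 24131, 589, 5541))      # ~0.528608
print(f3(2300, 24131, 3))              # ~7.985978
print(f4(2300, 24131, 589, 5541, 3))   # ~0.467947
```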

Results for the other three previously described fitness functions are available in the last row of Table 5; the first row displays the confusion matrix and fitness scores of the simplest majority classifier, a trivial one-leaf DT, which predicts every record as belonging to the '≤50k' class. This majority classifier has a relatively high expected accuracy of almost 76%. On the downside, all the rich adults are miscategorized as having low income.

We should note here that the fitness scores are not interchangeable: we must always compare scores of the same fitness function, not across functions. In our simplistic setup, all four fitness functions recognized the majority classifier as the inferior solution. In general, however, it can be problematic even for a human expert to pick the better tree; for example, is a 0.5% increase in accuracy worth introducing three additional nodes into a 90% accurate tree of size 30?

Selection

In our experiment, we used simple tournament selection.

[FIGURE 12 content: a small binary DT with a root test on Capital loss (threshold 1794.672) and two further tests on Capital gain (thresholds 3999.96 and 41,999.58), with leaves predicting '≤50k' or '>50k'; the full layout is not reproduced here.]
FIGURE 12 | DT example.

TABLE 5 | Confusion Matrix and Fitness Scores for Two Exemplary Decision Trees; Better Values Marked with *

Classifier         T≤50k    F≤50k   T>50k   F>50k   f1          f2          f3          f4
'≤50k'             24,720   7,841   0       0       0.759190    0.722429    7.110784    0.610000
DT in Figure 12    24,131   5,541   2,300   589     0.811738*   0.528608*   7.985978*   0.467947*

[FIGURE 13 data not reproduced; axis labels: Individual, Fitness, Number of individuals.]
FIGURE 13 | Fitness distribution within the population before and after selection; no genetic operators have been applied yet (note: the lower the fitness, the better).

The basic idea of tournament selection is to randomly choose N individuals from the population; the fittest one is selected for reproduction. In each evolutionary cycle, we always preserved the elite (some number of the best-fit solutions, only 1 in our case) unchanged for the next generation. In Figure 13, the distribution of fitness within a population of DTs before and after selection is presented; it can be seen how selection forces better (fitter) individuals to be used for further evolution.
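A minimal sketch of tournament selection with a one-individual elite, assuming a penalizing fitness (lower is better) and a population stored as a plain Python list; the helper names are ours, not the genTrees code.

```python
import random

def tournament_select(population, fitness, n=3):
    """Pick N random individuals and return the fittest one (lowest penalty)."""
    contestants = random.sample(population, n)
    return min(contestants, key=fitness)

def next_generation_parents(population, fitness, elite_size=1):
    """Keep the elite unchanged and fill the rest of the mating pool by tournaments."""
    elite = sorted(population, key=fitness)[:elite_size]
    pool_size = len(population) - elite_size
    pool = [tournament_select(population, fitness) for _ in range(pool_size)]
    return elite, pool
```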

Cross-over and Mutation

When two individual trees are selected for recombination, the cross-over operator is used on them to produce offspring. In our experiment, we used a cross-over in which a randomly selected training object is used to determine a path (by following the decision nodes through the tree) in both selected individuals, as described earlier in the section Genetic Operators: Cross-over and Mutation.
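The operator itself is the one described in the earlier section; the sketch below is only one possible reading, assuming that the randomly chosen training object is routed through both parents, one decision node on each resulting path is picked, and the subtrees rooted at those nodes are exchanged. It reuses the illustrative Node class from the random-tree sketch above.

```python
import copy
import random

# Assumes the Node class from the random-tree sketch above
# (fields: attribute, threshold, label, left, right).

def path_of(tree, record):
    """Return the list of decision nodes visited when classifying one training object."""
    path, node = [], tree
    while node.label is None:                 # internal node: keep descending
        path.append(node)
        node = node.left if record[node.attribute] < node.threshold else node.right
    return path

def crossover(parent_a, parent_b, record):
    """Sketch: exchange subtrees reached by the same training object in both parents."""
    a, b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    path_a, path_b = path_of(a, record), path_of(b, record)
    if not path_a or not path_b:              # a parent is a single leaf: nothing to swap
        return a, b
    node_a = random.choice(path_a)            # a decision node on the object's path in A
    node_b = random.choice(path_b)            # a decision node on the object's path in B
    # Swap the contents of the chosen nodes (and with them their whole subtrees).
    node_a.__dict__, node_b.__dict__ = node_b.__dict__, node_a.__dict__
    return a, b
```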

[FIGURE 14 data not reproduced; series: Fitness_0 (generation 0) and Fitness_1000 (generation 1000); axis labels: Individual, Fitness.]

FIGURE 14 | Comparison of the fitness distribution within two different populations: the initial population (generation 0) and the population after 1000 generations (note: the lower the fitness, the better).

After an offspring tree has been constructed from the two selected individuals, the mutation operator is applied to it based on the defined mutation probabilities (see Table 4). In our experiment, all five of the most common mutation operations on a tree-based representation of a DT have been used (each applied with its given probability), as described earlier in the section Genetic Operators: Cross-over and Mutation.
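The five mutation operations are those described in the earlier section and are not repeated here; the sketch below only illustrates how each operation can be applied independently with its own probability. The operation names and probabilities are placeholders, not the Table 4 values.

```python
import random

# Placeholder operation names and probabilities; the actual five operations and
# their probabilities are those described earlier and listed in Table 4.
MUTATIONS = [
    ("change_test_threshold", 0.04),
    ("change_test_attribute", 0.02),
    ("swap_subtrees", 0.02),
    ("replace_subtree_with_leaf", 0.01),
    ("change_leaf_class", 0.01),
]

def mutate(tree, apply_op, operators=MUTATIONS):
    """Apply each mutation operation independently with its own probability."""
    for name, probability in operators:
        if random.random() < probability:
            tree = apply_op(name, tree)   # delegate the actual tree edit to a callback
    return tree
```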

Termination and Postproduction

When evolving DTs for the Adult problem space, we can choose one (or a combination) out of several possible termination criteria:

• Fitness threshold: for example, f1 ≥ 85% (achieved by C4.5),

• False classifications threshold: F≤50k + F>50k < 100,

• Average tree size ≥ 200,

• 1000 generations limit,

• No progress in 20 generations,

• 1 h of CPU time on the EA-dedicated server,

• 20 M evaluated DTs, and

• Intron explosion.

• . . .

A combined version of the termination criterion has been used in the experiment: the evolution cycle was repeated for 1,000 generations and then stopped after 50 successive generations did not produce a better solution (a sketch of such a combined check is given below). A comparison of fitness for all individuals in the initial population (generation 0) and in the population after 1,000 generations is presented in Figure 14. It can be seen how the quality of the entire population improves with evolution.
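Reading the combined criterion as 'run at least 1,000 generations, then stop once 50 successive generations bring no improvement', the check can be written as a small predicate; this is a sketch of the bookkeeping, not the authors' code.

```python
def should_stop(generation, best_history, min_generations=1000, patience=50):
    """Combined criterion (one reading): at least min_generations generations,
    then stop once the best fitness has not improved for `patience` generations."""
    if generation < min_generations:
        return False
    if len(best_history) <= patience:
        return False
    recent_best = min(best_history[-patience:])    # lower fitness is better
    earlier_best = min(best_history[:-patience])
    return recent_best >= earlier_best             # no improvement within the window

# Usage: append the best (lowest) fitness of each generation to best_history and
# call should_stop(generation, best_history) at the end of every evolutionary cycle.
```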

After the evolution has been terminated, the best individual of the population has been used as the final solution. Postproduction in our case consisted of an intron removal procedure: all intron nodes (which are never used in the classification process) have been removed from the tree.
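Intron removal can be sketched as a pass that routes the training records through the tree and replaces every decision node one of whose branches is never reached by the subtree of the other branch; this simplified illustration reuses the Node convention from the earlier sketch and is not the authors' postprocessing procedure.

```python
def remove_introns(tree, records):
    """Prune decision nodes with a branch that no training object ever reaches."""
    if tree is None or tree.label is not None:         # leaf (or empty): nothing to prune
        return tree
    left_records = [r for r in records if r[tree.attribute] < tree.threshold]
    right_records = [r for r in records if r[tree.attribute] >= tree.threshold]
    if not left_records:                                # left branch is an intron
        return remove_introns(tree.right, right_records)
    if not right_records:                               # right branch is an intron
        return remove_introns(tree.left, left_records)
    tree.left = remove_introns(tree.left, left_records)
    tree.right = remove_introns(tree.right, right_records)
    return tree
```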

RESULTS

As the evolutionary process involves randomness, the solution can be different in every run, even though the same algorithm and the same settings of the genetic parameters are used. Therefore, one always has to execute several runs to evaluate the results. In our experiment, we executed 10 evolutionary runs using the described algorithm and the parameter settings shown in Table 4.

Several different solutions have been found, which differ slightly with regard to all properties (accuracy, tree size, used attributes, etc.). From these solutions, we selected the one that has, at least in our opinion, the best ratio between complexity (number of nodes, depth, number of different attributes used) and accuracy; it is presented in Figure 15. It has 19 decision nodes (leaves) and its accuracy on the test set (unseen cases) is 85.63%.

[FIGURE 15 content: the evolved DT tests the attributes Capital gain, Capital loss, Education num, Age, Relationship, Hours per week, and Marital status; the full tree layout is not reproduced here.]

FIGURE 15 | The obtained solution: decision tree induced within the described evolutionary process for the Adult dataset.

It is interesting to see this result compared with other well-known classification algorithms. The UCI repository reports the classification results (classification accuracy) for 16 different algorithms using the original train/test split,11 the same as we did in our experiment. The best classification accuracy was obtained with two adapted Naïve-Bayes approaches (FSS NB achieved 85.95% and NBTree achieved 85.90%), followed by our solution at rank three with 85.63%. All of the other algorithms scored worse; the lowest accuracy of 78.58% was achieved with the k-Nearest Neighbor algorithm.

Among the existing, classical DT induction algorithms, the best accuracy of 85.54% was achieved with C4.5 (a benchmark algorithm for classical, statistical induction of DTs), which is slightly worse than our solution.

For the sake of further comparison, we induced a DT with the J48 algorithm (the widely used implementation of the C4.5 algorithm in Weka). It achieved 85.84% accuracy on the test set and was composed of 564 decision nodes. We can see that our evolutionarily constructed DT is incomparably simpler (representing only 3% of the size of the classically induced DT!) with practically the same accuracy (a difference of only 0.21%). As the goal of inducing DTs is to find a highly reliable (high accuracy) and simple (small size) tree, we may say that the evolutionary process resulted in a much better solution on this occasion; and it turns out to be so on many occasions.

CONCLUSIONS

DTs are widely used because of their intuitive representation, which enables easy understanding and interpretability of the final solution.4 Their accuracy is also at least comparable to that of other classification models.13,14 The classical approach to DT induction is the application of the 'divide and conquer' principle. It can, however, lead to local optima and is unable to provide diverse solutions (if these are required). The induction of DTs using EAs is a viable alternative.

Artificial evolution is a random process. The phase of evolutionary construction is about creating both the structure and the contents of the DT. To consistently evolve good (although diverse) DTs, guidance in the form of a fitness function must be provided. Although there is no universal recipe for constructing a fitness function, several typical existing fitness templates suffice for most problems.

The evolutionary process of DT construction is quite demanding in terms of setting up the initial environment (representation, control parameters); however, it is able to search past local optima, can optimize solutions according to multiobjective criteria, and is able to provide different solutions. The case study has shown that EA construction was able to find a relatively small, easily understandable tree with a very competitive accuracy.

To produce DTs, besides directly evolving DTs as we have demonstrated, evolution can also be used on a higher level to compose a (nonevolutionary) DT-inducing algorithm. This is viable as traditional algorithms typically have a top-down hierarchy of design components that can be changed and tweaked, a task ideal for the evolutionary approach.58,59

To summarize, EAs are a recognized method for DT construction. They are able to find solutions that can be subjected to multiobjective criteria. This allows experimentation with the construction process to produce solutions with the desired properties.

REFERENCES

1. Barros RC, Basgalupp MP, de Carvalho ACPLF, Freitas AA. A survey of evolutionary algorithms for decision-tree induction. IEEE Trans Syst Man Cybernet 2012, 42(3):291–312.
2. De Mantaras RL. A distance-based attribute selection measure for decision tree induction. Mach Learn 1991, 6(1):81–92.
3. Quinlan JR. C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann; 1993.
4. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Belmont, CA: Wadsworth; 1984.
5. Quinlan JR. Induction of decision trees. Mach Learn 1986, 1:81–106.
6. Freitas AA. Data Mining and Knowledge Discovery with Evolutionary Algorithms. Secaucus, NJ: Springer-Verlag; 2002.
7. Rokach L, Maimon OZ. Data mining with decision trees: theory and applications. Mach Percept Artif Intell 2008, 69:1–244.
8. Podgorelec V, Zorman M. Decision Trees. Encyclopedia of Complexity and Systems Science. New York, NY: Springer; 2009, 2:1826–1845.
9. Loh WY. Classification and regression trees. WIREs Data Min Knowledge Discov 2011, 1:14–23.
10. Kokol P, Pohorec S, Stiglic G, Podgorelec V. Evolutionary design of decision trees for medical application. WIREs Data Min Knowledge Discov 2012, 2:237–254.
11. Frank A, Asuncion A. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science; 2010. Available at: http://archive.ics.uci.edu/ml. (Accessed May 18, 2012).
12. Gehrke J. Decision Trees, The Handbook of Data Mining. Mahwah, NJ: Lawrence Erlbaum Associates, Publishers; 2003.
13. Lim TS, Loh WY, Shih YS. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Mach Learn 2000, 48:203–228.
14. Hand D. Construction and Assessment of Classification Rules. Chichester, United Kingdom: John Wiley & Sons; 1997.
15. Goebel M, Gruenwald L. A survey of data mining software tools. SIGKDD Explorations 1999, 1(1):20–33.
16. Hunt EB, Marin J, Stone PT. Experiments in Induction. New York, NY: Academic Press; 1966, 1:45–69.
17. Rokach L, Maimon O. Top-down induction of decision trees classifiers—a survey. IEEE Trans Syst Man Cybernet 2005, 35(4):476–487.
18. Brodley CE, Utgoff PE. Multivariate decision trees. Mach Learn 1995, 19(1):45–77.
19. Utgoff PE. Incremental induction of decision trees. Mach Learn 1989, 4(2):161–186.
20. Quinlan JR. Learning with continuous classes. Proceedings of the 5th Australian Joint Conference on AI; 1987, 221–234.
21. Holland JH. Adaptation in Natural and Artificial Systems. Ann Arbor, MI: The University of Michigan Press; 1975.
22. Koza JR. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press; 1992.
23. Fogel LJ, Owens AJ, Walsh MJ. Artificial Intelligence through Simulated Evolution. New York, NY: John Wiley & Sons; 1966.
24. Schwefel HP. Evolution and Optimum Seeking. New York, NY: John Wiley & Sons; 1995.
25. Rothlauf F. Representations for Genetic and Evolutionary Algorithms. Heidelberg, Germany/New York, NY: Springer; 2006.
26. Goldberg DE. Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley; 1989.
27. Kretowski M, Grzes M. Global learning of decision trees by an evolutionary algorithm. Inf Process Security Syst 2005, 3:401–410.
28. Barros RC, Basgalupp MP, Ruiz DD, de Carvalho ACPLF, Freitas AA. Evolutionary model tree induction. Proceedings of the 2010 ACM Symposium on Applied Computing; 2010, 1131–1137.
29. Papagelis A, Kalles D. Breeding decision trees using evolutionary techniques. Proceedings of the Eighteenth International Conference on Machine Learning; 2001, 393–400.
30. Cha SH, Tappert C. A genetic algorithm for constructing compact binary decision trees. J Pattern Recognit Res 2009, 1:1–13.
31. Dumitrescu D, Andras J. Generalized decision trees built with evolutionary techniques. Stud Inform Control 2005, 14(1):15–22.
32. Palmer CC. An Approach to a Problem in Network Design Using Genetic Algorithms. PhD thesis. Troy, NY: Polytechnic University; 1994.
33. Raidl GR, Drexel C. A predecessor coding in an EA for the capacitated minimum spanning tree problem. Late Breaking Papers at the 2000 Genetic and Evolutionary Computation Conference; Las Vegas, NV: 2000, 309–316.
34. Tang KS, Man KF, Ko KT. Wireless LAN design using hierarchical genetic algorithm. Proceedings of the Seventh International Conference on Genetic Algorithms. East Lansing, MI: Morgan Kaufmann; 1997, 629–635.
35. Sinclair MC. Minimum cost topology optimisation of the COST 239 European optical network. In: Pearson DW, Steele NC, Albrecht RF, eds. Proceedings of the 1995 International Conference on Artificial Neural Nets and Genetic Algorithms. New York, NY: Springer-Verlag; 1995, 26–29.
36. Kretowski M, Grzes M. Mixed decision trees: an evolutionary approach. Proceedings of the International Conference on Data Warehousing and Knowledge Discovery. Heidelberg, Germany: Springer-Verlag; 2006, 260–269.
37. Zhao H. A multi-objective genetic programming approach to developing Pareto optimal decision trees. Decis Support Syst 2007, 43(3):809–826.
38. Llora X, Garrell JM. Evolution of decision trees. Proceedings of the 4th Catalan Conference on Artificial Intelligence; 2001, 115–122.
39. Zorman M, Podgorelec V, Kokol P, Peterson MGE, Sprogar M, Ojstersek M. Finding the right decision tree's induction strategy for a hard real world problem. Int J Med Inform 2001, 63(1–2):109–121.
40. Papagelis A, Kalles D. Breeding decision trees using evolutionary techniques. Proceedings of the 18th International Conference on Machine Learning, ICML-2001. Burlington, MA: Morgan Kaufmann; 2001, 393–400.
41. Van Veldhuizen DA, Lamont GB. Multiobjective evolutionary algorithms: analyzing the state-of-the-art. Evol Comput 2000, 8(2):125–147.
42. Barros RC, Ruiz DD, Basgalupp MP. Evolutionary model trees for handling continuous classes in machine learning. Inf Sci 2011, 171(5):954–971.
43. Coello CA, Lamont GL, van Veldhuizen DA. Evolutionary Algorithms for Solving Multi-Objective Problems. Genetic and Evolutionary Computation. 2nd ed. New York, NY: Springer; 2007.
44. Freitas AA. A review of evolutionary algorithms for data mining. Soft Computing for Knowledge Discovery and Data Mining. New York, NY: Springer; 2008, 79–111.
45. Basgalupp MP, de Carvalho ACPLF, Barros RC, Ruiz DD, Freitas AA. Lexicographic multi-objective evolutionary induction of decision trees. Int J Bio-Inspired Comput 2009, 1(1–2):105–117.
46. Freitas AA. A critical review of multi-objective optimization in data mining: a position paper. ACM SIGKDD Explorations Newslett 2004, 6(2):77–86.
47. Lomax S, Vadera S. A survey of cost-sensitive decision tree induction algorithms. Soft Comput Knowledge Discov Data Min 2008, 79–111.
48. Fu Z, Golden BL, Lele S, Raghavan S, Wasil E. Diversification for better classification trees. Comput Operations Res 2006, 33(11):3185–3202.
49. Eggermont J, Kok JN. Detecting and pruning introns for faster decision tree evolution. Parallel Probl Solving Nat 2004, 1:1071–1080.
50. Sprogar M, Kokol P, Babic SH, Podgorelec V, Zorman M. Vector decision trees. Intell Data Anal 2000, 4:305–321.
51. Podgorelec V, Kokol P. Evolutionary decision forests—decision making with multiple evolutionary constructed decision trees. Problems in Applied Mathematics and Computational Intelligence. Piraeus, Greece: WSES Press; 2001, 1:97–103.
52. Nikolaev N, Slavov V. Inductive genetic programming with decision trees. In: van Someren M, Widmer G, eds. Machine Learning: ECML-97 (Lecture Notes in Computer Science Series), Vol. 1224. Berlin/Heidelberg, Germany: Springer; 1997, 183–190.
53. Pyle D. Data Preparation for Data Mining. San Francisco, CA: Morgan Kaufmann Publishers, Inc.; 1999.
54. Zhang M, Wong P. Genetic programming for medical classification: a program simplification approach. Genet Program Evolv Mach 2008, 9(3):229–255.
55. Mitchell M. An Introduction to Genetic Algorithms. Cambridge, MA: MIT Press; 1999.
56. Podgorelec V, Kokol P. Evolutionary induced decision trees for dangerous software modules prediction. Inf Process Lett 2002, 82(1):31–38.
57. Grefenstette JJ. Optimization of control parameters for genetic algorithms. IEEE Trans Syst Man Cybernet 1986, 16(1):122–128.
58. Barros RC, Basgalupp MP, Freitas AA. Towards the automatic design of decision tree induction algorithms. Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation GECCO'11. New York, NY: ACM; 2011, 567–574.
59. Barros RC, Basgalupp MP, de Carvalho ACPLF, Freitas AA. A hyper-heuristic evolutionary algorithm for automatically designing decision-tree algorithms. Proceedings of the 14th Annual Conference Companion on Genetic and Evolutionary Computation GECCO'12. New York, NY: ACM; 2012, 1237–1244.
