

ERK'2007, Portorož, B:81-84

Optimizing Accuracy and Size of Decision Trees

Tea Tušar

Department of Intelligent Systems

Jožef Stefan Institute

Jamova 39, SI-1000 Ljubljana, Slovenia

[email protected]

Abstract

This paper presents the problem of finding parameter settings of algorithms for building decision trees that yield optimal trees—accurate and small. The problem is tackled using the DEMO algorithm, an evolutionary algorithm for multiobjective optimization that uses differential evolution to explore the decision space. The results of the experiments on six datasets show that DEMO is capable of efficiently solving this problem, offering the users a wide choice of near-optimal decision trees with different accuracies and sizes in a reasonable time.

1 Introduction

Domain experts often use machine learning algorithms for finding theories that would explain their data. However, they are usually unfamiliar with these algorithms and do not know how to set their parameters in order to produce the desired results. Moreover, they rarely know beforehand exactly what kind of theory they are looking for. Our approach can help them by exploring the parameter space of the learning algorithms while searching for theories with the highest prediction accuracy and lowest complexity. The resulting set of the best found theories gives the users the possibility of comparing the theories among themselves and provides additional insight into the data. All this helps the users to choose the theory that best suits their needs.

In the following, we limit our discussion of machine learning theories to decision trees for classification, where the best trees (or theories) are regarded as those that are accurate and small. In Section 2 we define the mentioned optimization problem, review the related work and show how the DEMO algorithm can be employed to solve it¹. Section 3 presents the experiments and their results. The paper concludes with a summary and discussion of future work possibilities in Section 4.

¹ The source of the DEMO algorithm can be downloaded from http://dis.ijs.si/tea/research.htm.

2 Accuracy and size as two conflicting objectives

2.1 Optimization problem

The tackled optimization problem consists of finding the parameter settings of machine learning algorithms in order to construct accurate and small trees for a given domain. Accuracy is estimated using 10-fold cross validation, while the size of the tree is equal to the number of all nodes in the tree. Accuracy must be maximized and size minimized. We deal with this problem for the special case of decision trees induced by the C4.5 algorithm [7], or more precisely, its Java implementation in the Weka environment [12] called J48.

When building J48 trees, several parameters need to be set (see Table 1). The large number of possible parameter settings calls for a heuristic method for solving this problem.
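As an illustration of how one candidate parameter setting can be evaluated, the sketch below builds a J48 tree for given values of M, C, U, S and B and returns the two objective values. It assumes the Weka 3 Java API; the class and variable names, the dataset file name and the choice to measure size on a tree built from the whole dataset are our own assumptions, not details taken from the paper.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Objectives {

    /** Evaluates one parameter setting; returns {accuracy in %, tree size}. */
    static double[] evaluate(Instances data, int m, double c, boolean unpruned,
                             boolean subtreeRaising, boolean binarySplits) throws Exception {
        J48 tree = new J48();
        tree.setMinNumObj(m);                   // parameter M
        tree.setConfidenceFactor((float) c);    // parameter C
        tree.setUnpruned(unpruned);             // parameter U
        tree.setSubtreeRaising(subtreeRaising); // parameter S
        tree.setBinarySplits(binarySplits);     // parameter B

        // First objective: accuracy estimated with 10-fold cross validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));

        // Second objective: number of nodes of a tree built on the whole dataset
        // (an assumption on our part; the paper does not state how size is measured).
        tree.buildClassifier(data);
        return new double[] { eval.pctCorrect(), tree.measureTreeSize() };
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vehicle.arff"); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);
        double[] obj = evaluate(data, 2, 0.25, false, true, false); // J48 defaults
        System.out.printf("accuracy = %.2f%%, size = %.0f%n", obj[0], obj[1]);
    }
}
```

A heuristic search then calls such an evaluation once per generated parameter setting, which is why the number of evaluations directly determines the running time.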

2.2 Related work

Kohavi and John [3] searched for parameter settings of C4.5 decision trees that would result in optimal performance on a particular dataset. They considered four parameters: M, C and S with the same meaning as described in Table 1, and G—a binary parameter that determined whether the splitting criterion would be information gain or gain ratio. The optimization objective was ‘optimal performance’ of the tree, i.e., the accuracy measured using 10-fold cross validation. The problem was tackled as a discrete optimization problem and best-first search was chosen to explore the parameter space. Experiments on 33 datasets showed that best-first search found better parameter settings than the default on nine domains, while on one dataset, default parameter values yielded better trees than the heuristic search.

Similar experiments were performed by Mladenić [4], who searched for the optimal setting of the m-value in m-estimate postpruning of decision trees [2]. The sole optimization objective was accuracy estimated with cross validation. She explored the decision space using different deterministic and stochastic algorithms. The tests on eight datasets showed that the problem is rather simple and that all tested algorithms achieved comparable results.

Name | Possible values | Default value | Description
M – number of instances | 1, 2, ... | 2 | Minimum number of instances in leaves (higher values result in smaller trees).
U – unpruned trees | yes/no | no | Use unpruned tree (the default value ‘no’ means that the tree is pruned).
C – confidence factor | [10^-7, 0.5] | 0.25 | Confidence factor used in postpruning (smaller values incur more pruning).
S – subtree raising | yes/no | yes | Whether to consider the subtree raising operation in postpruning.
B – use binary splits | yes/no | no | Whether to use binary splits on nominal attributes when building the tree.

Table 1: Parameters for building J48 trees.

Both mentioned approaches optimized only the accuracy of the decision trees. To the best of our knowledge, no work has been done on searching for parameter settings of decision tree building algorithms that would consider accuracy and size of the trees as two optimization objectives.

Bohanec and Bratko [1] searched for good trade-offs between accuracy and size of decision trees in a different way. They presented the OPT algorithm, which explores the space of all trees that can be derived from a complete ID3 tree [6] by pruning, using dynamic programming. The result is an optimal sequence of pruned trees, decreasing in size, such that each tree has the highest accuracy among all possible pruned trees of the same size. While OPT works perfectly for its purpose, it has two serious drawbacks if it were to be applied to serve our needs. The first difficulty is its time complexity, which is quadratic with respect to the number of leaves of the original ID3 tree. The second disadvantage is that the accuracy of the trees is measured on the dataset that was used for constructing these trees, by simply counting the additional classification errors made with pruning. If a separate test set were used for estimating the accuracy, the time needed for building such trees would increase considerably.

2.3 Optimization with DEMO

DEMO [8, 9] is an evolutionary algorithm for multiobjective optimization that uses differential evolution (DE) instead of genetic algorithms (GAs) for exploring the decision space and outperforms state-of-the-art GA-based algorithms on several multiobjective benchmark problems [10].

Since we want to help the users of machine learning algorithms to find good trees without having to search for the right parameter settings manually, we must be careful not to demand that they set the parameters of DEMO instead. This is why we chose a single parameter setting of DEMO for all the experiments performed on this problem. This setting was in no way fitted to the domains used and should therefore be appropriate for any classification domain.

We used the Lamarckian repair procedure to round the values of the variable M to the nearest integer and the Baldwinian repair procedure for the binary variable C. The other parameters of DEMO were set as follows:

– population size = 20,

– number of generations = 25,

– DE selection scheme = DE/rand/1/bin,

– scaling factor F = 0.6,

– probability in binomial crossover CR = 0.6,

– environmental selection procedure = strength Pareto approach (as in SPEA2).
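To make the differential evolution step concrete, the following sketch (our own illustration, not code from the paper) shows how the DE/rand/1/bin scheme with F = 0.6 and CR = 0.6 could create a candidate parameter vector from a parent and three mutually different, randomly chosen population members, with the M component rounded to the nearest integer as in the Lamarckian repair mentioned above; the handling of the binary parameters and the Baldwinian repair is omitted here.

```java
import java.util.Random;

public class DECandidate {

    static final double F = 0.6;   // scaling factor
    static final double CR = 0.6;  // probability in binomial crossover

    /** DE/rand/1/bin: builds a candidate from the parent and three other
     *  population members r1, r2 and r3 (all mutually different). */
    static double[] createCandidate(double[] parent, double[] r1, double[] r2,
                                    double[] r3, Random rnd) {
        int d = parent.length;
        double[] candidate = parent.clone();
        int jRand = rnd.nextInt(d); // at least one component is taken from the mutant
        for (int j = 0; j < d; j++) {
            if (rnd.nextDouble() < CR || j == jRand) {
                candidate[j] = r1[j] + F * (r2[j] - r3[j]); // DE mutation
            }
        }
        // Lamarckian repair of M (stored at index 0 by our convention):
        // round to the nearest integer and keep it at least 1.
        candidate[0] = Math.max(1, Math.round(candidate[0]));
        return candidate;
    }
}
```

In DEMO, such a candidate is compared to its parent and, depending on dominance, either replaces it, is discarded, or is added to the population, after which the environmental selection procedure truncates the population back to its original size.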

3 Experiments and results

In the experiments, we compare trees found by DEMO to trees found by random search of the decision space and to the default tree—the tree that is constructed using the default parameters of J48. Because datasets can be very large and consequently the time to build a single J48 tree very long, we limit the number of trees generated by DEMO and random search to 500 each. While this was often not enough for DEMO to converge, we kept to this low number of evaluations to provide the final solutions in a reasonable time. Out of all trees, only the nondominated ones are presented in the results. Because DEMO and random search are stochastic algorithms, they were run 10 times on each dataset.
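As a side note, extracting the nondominated trees from a set of (accuracy, size) pairs, with accuracy maximized and size minimized, amounts to a simple pairwise dominance check; a minimal sketch (ours, not part of the paper) follows.

```java
import java.util.ArrayList;
import java.util.List;

public class ParetoFilter {

    /** A tree described by its two objective values. */
    record Tree(double accuracy, int size) {}

    /** True if a dominates b: a is at least as good in both objectives
     *  and strictly better in at least one. */
    static boolean dominates(Tree a, Tree b) {
        boolean noWorse = a.accuracy() >= b.accuracy() && a.size() <= b.size();
        boolean better  = a.accuracy() >  b.accuracy() || a.size() <  b.size();
        return noWorse && better;
    }

    /** Keeps only the trees that are dominated by no other tree. */
    static List<Tree> nondominated(List<Tree> trees) {
        List<Tree> front = new ArrayList<>();
        for (Tree t : trees) {
            boolean dominated = false;
            for (Tree other : trees) {
                if (dominates(other, t)) { dominated = true; break; }
            }
            if (!dominated) front.add(t);
        }
        return front;
    }
}
```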

All algorithms were run on six datasets. The first (EDM) refers to process parameter selection in electrical discharge machining [11], while the other five (dermatology, nursery, splice, vehicle and vowel) were obtained from the UCI repository [5].

Figure 1 shows the default tree and the trees found in the best run of DEMO and random search (according to the IH indicator). The comparison between the trees found by the two heuristic methods and the default tree shows that the default tree is often larger and less accurate than the other trees. On the dermatology, nursery, splice and vehicle datasets DEMO always finds trees that weakly dominate the default tree, while this happens in at least half of the runs on the EDM and vowel datasets.

[Figure 1: Objective values of trees found by the algorithms on the six chosen datasets (EDM, dermatology, nursery, splice, vehicle, vowel). The size of trees is placed on the abscissa, while classification accuracy is represented on the ordinate. Each panel compares the default tree with the trees found by random search and by DEMO.]

In some of the runs, even trees found by random search dominate the default tree. This shows that it is indeed important for the users of the J48 algorithm to try some other parameter setting besides the default one when building a decision tree model. Additional significance tests have shown that while DEMO outperforms random search with regard to some performance indicator on the five UCI datasets (see [9] for more details on these experiments), there is no significant difference between the algorithms on the EDM dataset.
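The IH indicator mentioned above is, in the usual notation, the hypervolume indicator: it measures the area of the objective space that is weakly dominated by a nondominated set with respect to a chosen reference point, so larger values mean better fronts. For the two objectives used here it can be computed with a simple sweep; the sketch below is our own illustration, and the reference point is a placeholder, since the paper does not state which one was used.

```java
public class Hypervolume {

    /**
     * Hypervolume of a nondominated set for two objectives (accuracy maximized,
     * size minimized). The points must be sorted by size in ascending order,
     * which for a nondominated set also sorts them by accuracy in ascending order.
     * refSize must be larger than all sizes, refAcc smaller than all accuracies.
     */
    static double hypervolume(double[] sizes, double[] accuracies,
                              double refSize, double refAcc) {
        double hv = 0.0;
        for (int i = 0; i < sizes.length; i++) {
            double nextSize = (i + 1 < sizes.length) ? sizes[i + 1] : refSize;
            hv += (nextSize - sizes[i]) * (accuracies[i] - refAcc); // one rectangle
        }
        return hv;
    }
}
```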

In a single run of DEMO and random search, 500 parameter settings were inspected by each algorithm. Since all experiments were repeated 10 times, the algorithms jointly built 10,000 decision trees for each dataset. This allows us to look at the decision space of the optimization problem and see if there exist specific parameter settings that induce good decision trees. To this end we gathered all 10,000 decision trees for each dataset and marked which of them are dominated and which are not. Since the decision space is five-dimensional, it cannot be simply presented here. Therefore, we only show its projection onto the two-dimensional space defined by the parameters M and C, where the parameter U is set to no (only trees subject to postpruning are considered).

Inspecting the plots of dominated and nondominated trees in Figure 2, we can easily see that the parameter M has a strong influence on the size and accuracy of the trees, while the effect of the parameter C seems to be much smaller. This causes the vertical ‘stripes’ of nondominated trees. When M is large, the trees are subject to heavy prepruning, which leaves little or no room for postpruning. This is why the ‘stripes’ are so well defined for large values of M. When M is smaller, the quality of trees also depends on the parameter C. Note that the quality of trees is always influenced by the other three parameters as well (U, S and B), whose values are not shown in these plots.

The plots indicate that nondominated trees can be found at almost any point in the decision space and that their location depends very much on the dataset. Consequently, no general rule for predicting the optimal parameter values can be found. This gives additional evidence that searching for parameters of decision tree building algorithms has to be performed for every dataset separately.

4 Conclusion

The proposed real-world optimization problem consists of finding the parameter settings of a machine learning algorithm that result in accurate and small decision trees. The results of the performed experiments on six real domains showed that DEMO is capable of efficiently solving this problem, offering the users a wide choice of near-optimal decision trees with different accuracies and sizes. Such an approach for searching for accurate and simple theories could be extended to other machine learning algorithms.

Another direction for future work is to use DEMO for optimizing more than two objectives for the considered problem. In the case of decision trees, for example, the users might want to optimize some other objective besides accuracy and size, such as the ‘degree of interestingness’ of the induced models, estimated in terms of the presence or absence of some attributes in the models.

[Figure 2: Dominated (grey) and nondominated (black) trees found by DEMO and random search for various values of parameters M (on the abscissa) and C (on the ordinate), while U is set to no. One panel per dataset: EDM, dermatology, nursery, splice, vehicle, vowel.]


References

[1] M. Bohanec and I. Bratko. Trading accuracy for simplicity in decision trees. Machine Learning, 15(3):223–250, 1994.

[2] B. Cestnik and I. Bratko. On estimating probabilities in tree pruning. In Proceedings of the European Working Session on Learning (EWSL '91), pages 138–150, 1991.

[3] R. Kohavi and G. H. John. Automatic parameter selection by minimizing estimated error. In Proceedings of the Twelfth International Conference on Machine Learning (ICML 1995), pages 304–312, 1995.

[4] D. Mladenić. Domain-tailored machine learning. Master's thesis, University of Ljubljana, Faculty of Electrical Engineering and Computer Science, 1995.

[5] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.

[6] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

[7] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[8] T. Robič and B. Filipič. DEMO: Differential evolution for multiobjective optimization. In Proceedings of the Third International Conference on Evolutionary Multi-Criterion Optimization (EMO 2005), pages 520–533, 2005.

[9] T. Tušar. Design of an algorithm for multiobjective optimization with differential evolution. Master's thesis, University of Ljubljana, Faculty of Electrical Engineering and Computer Science, 2007.

[10] T. Tušar and B. Filipič. Differential evolution versus genetic algorithms in multiobjective optimization. In Proceedings of the Fourth International Conference on Evolutionary Multi-Criterion Optimization (EMO 2007), pages 257–271, 2007.

[11] J. Valentinčič and M. Junkar. Detection of the eroding surface in the EDM process based on the current signal in the gap. The International Journal of Advanced Manufacturing Technology, 28(3-4):294–301, 2006.

[12] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.