
Auto-WEKA: Combined Selection and Hyperparameter Optimization of Supervised Machine Learning Algorithms

by

Chris Thornton

B.Sc., University of Calgary, 2011

a thesis submitted in partial fulfillment of
the requirements for the degree of

Master of Science

in

the faculty of graduate and postdoctoral studies
(Computer Science)

The University of British Columbia
(Vancouver)

March 2014

© Chris Thornton, 2014

Abstract

Many different machine learning algorithms exist; taking into account each algorithm's set of hyperparameters, there is a staggeringly large number of possible choices. This project considers the problem of simultaneously selecting a learning algorithm and setting its hyperparameters. Previous works attack these issues separately, but this problem can be addressed by a fully automated approach, in particular by leveraging recent innovations in Bayesian optimization. The WEKA software package provides an implementation for a number of feature selection and supervised machine learning algorithms, which we use inside our automated tool, Auto-WEKA. Specifically, we examined the 3 search and 8 evaluator methods for feature selection, as well as all of the classification and regression methods, spanning 2 ensemble methods, 10 meta-methods, 27 base algorithms, and their associated hyperparameters. On 34 popular datasets from the UCI repository, the Delve repository, the KDD Cup 09, variants of the MNIST dataset and CIFAR-10, our method produces classification and regression performance often much better than obtained using state-of-the-art algorithm selection and hyperparameter optimization methods from the literature. Using this integrated approach, users can more effectively identify not only the best machine learning algorithm, but also the corresponding hyperparameter settings and feature selection methods appropriate for that algorithm, and hence achieve improved performance for their specific classification or regression task.


Preface

This thesis is an expanded version of work that has been published as C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown; Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms; in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 847-855; ACM, 2013. I was involved in the conceptual design of Auto-WEKA, and was responsible for the development of Auto-WEKA's code. I performed all the experiments and analysis of results. In the remainder of this thesis, I adopt the first person plural in recognition of my collaborators.


Table of Contents

Abstract

Preface

Table of Contents

List of Tables

List of Figures

Acknowledgements

1 Introduction
  1.1 Supervised machine learning problems
  1.2 Learning algorithm selection
    1.2.1 Previous approaches to learning algorithm selection
  1.3 Hyperparameter optimization
    1.3.1 Previous approaches to solving hyperparameter optimization

2 CASH and algorithms for solving it
  2.1 Baselines
  2.2 Model-based methods
    2.2.1 Sequential model-based algorithm configuration (SMAC)
    2.2.2 Tree-structured Parzen estimator (TPE)
    2.2.3 Iterated F-Race (I/F-Race)

3 Auto-WEKA

4 Evaluating Auto-WEKA
  4.1 Experimental setup
  4.2 Classification results
    4.2.1 The importance of solving CASH effectively
    4.2.2 Results for training performance
    4.2.3 Results for test performance
    4.2.4 Selected methods
  4.3 Regression results
    4.3.1 Results for training performance
    4.3.2 Results for test performance
    4.3.3 Selected methods
  4.4 Other modifications of SMAC-based Auto-WEKA
    4.4.1 Immediate evaluation of all folds
    4.4.2 Multi-level cross-validation
    4.4.3 Repeated random subsampling validation (RRSV)
    4.4.4 Longer runtimes

5 Conclusion and future work

Bibliography

A Method Comparison Results

List of Tables

Table 3.1   Learning algorithms in Auto-WEKA. ∗ indicates meta-methods, which in addition to their own parameters take one base algorithm and its parameters. + indicates ensemble methods that take as input up to 5 base algorithms and their parameters. We report the number of categorical and numeric hyperparameters for each method.

Table 3.2   Feature Search/Evaluator methods in Auto-WEKA. ∗ indicates search methods requiring one feature evaluator that is used to determine the importance of a feature.

Table 4.1   Classification datasets used. Num Categorical and Num Numeric refer to the number of categorical and numeric attributes of elements in the dataset, respectively.

Table 4.2   Oracle performance of Ex-Def and grid search.

Table 4.3   Training performance on classification datasets (Error %). Bold entries denote performance not statistically significantly different from the best, according to a Welch's t test with p = 0.01.

Table 4.4   Test performance on classification datasets (Error %). Bold entries denote performance not statistically significantly different from the best, according to a Welch's t test with p = 0.01.

Table 4.5   Correlation between the withheld 30% validation data and the training data performance. Gap indicates the difference between the mean training performance and mean test performance from Tables 4.3 and 4.4.

Table 4.6   Regression datasets used. Num Categorical and Num Numeric refer to the number of categorical and numeric attributes of elements in the dataset, respectively.

Table 4.7   Training performance on regression datasets (RMSE). Bold entries denote performance not statistically significantly different from the best, according to a Welch's t test with p = 0.01.

Table 4.8   Test performance on regression datasets (RMSE). Bold entries denote performance not statistically significantly different from the best, according to a Welch's t test with p = 0.01.

Table 4.9   Correlation between the withheld 30% validation data and the training data performance. Gap indicates the difference between the mean training performance and mean test performance from Tables 4.7 and 4.8.

Table 4.10  Comparison of mean performance between the SMAC and SMAC-10-Batch variants on classification datasets. Bold entries denote performance not statistically significantly different from the best, according to a Welch's t test with p = 0.01.

Table 4.11  Comparison of mean performance between the SMAC and SMAC-10-Batch variants on regression datasets. Bold entries denote performance not statistically significantly different from the best, according to a Welch's t test with p = 0.01.

Table 4.12  Comparison of mean performance between the SMAC and SMAC-Multi-Level variants on classification datasets. Bold entries denote performance not statistically significantly different from the best, according to a Welch's t test with p = 0.01.

Table 4.13  Comparison of mean performance between the SMAC and SMAC-Multi-Level variants on regression datasets. Bold entries denote performance not statistically significantly different from the best, according to a Welch's t test with p = 0.01.

Table 4.14  Comparison of mean performance between the SMAC and SMAC-RRSV variants on classification datasets. Bold entries denote performance not statistically significantly different from the best, according to a Welch's t test with p = 0.01.

Table 4.15  Comparison of mean performance between the SMAC and SMAC-RRSV variants on regression datasets. Bold entries denote performance not statistically significantly different from the best, according to a Welch's t test with p = 0.01.

Table 4.16  Comparison of mean performance between the SMAC and SMAC-Long variants on classification datasets. Bold entries denote performance not statistically significantly different from the best, according to a Welch's t test with p = 0.01.

Table 4.17  Comparison of mean performance between the SMAC and SMAC-Long variants on regression datasets. Bold entries denote performance not statistically significantly different from the best, according to a Welch's t test with p = 0.01.

Table A.1   Number of statistically significant wins on training performance of each method compared against another on classification datasets.

Table A.2   Number of statistically significant wins on test performance of each method compared against another on classification datasets.

Table A.3   Number of statistically significant wins on training performance of each method compared against another on regression datasets.

Table A.4   Number of statistically significant wins on test performance of each method compared against another on regression datasets.

List of Figures

Figure 3.1 Auto-WEKA’s top-level parameters. Top: is base controls Auto-WEKA’s choice of either using a base algorithm or the using eithera meta or ensemble learner. The triangular items represent aparameter that selects one of the 27 base algorithms and associatedhyperparameters. Bottom: feat sel controls Auto-WEKA’s choiceof feature selection methods. . . . . . . . . . . . . . . . . . . . . . 20

Figure 3.2 Auto-WEKA’s wizard interface. . . . . . . . . . . . . . . . . . . . 22Figure 3.3 Auto-WEKA’s experiment builder workflow. . . . . . . . . . . . . 23Figure 3.4 Auto-WEKA’s interface for examining the best learning algorithm

and hyperparameters after an experiment has been run. . . . . . 24

Figure 4.1 Distribution of chosen classifiers aggregated across SMAC, I/F-Race and TPE Auto-WEKA variants across all the small and largedatasets, ranked on their frequency of being selected. Meta-methodsare marked by a ∗ suffix, ensemble methods by a + suffix. . . . . . 36

Figure 4.2 Heat map of chosen classifiers aggregated across SMAC, I/F-Raceand TPE Auto-WEKA variants for each dataset. A darker colourindicates the method was selected more often. Meta-methods aremarked by a ∗ suffix, ensemble methods by a + suffix. Datasets aresorted by size, classifiers are ordered by methodology. . . . . . . . 36

Figure 4.3 Left: distribution of chosen base classifiers for the two most fre-quently selected meta-methods: AdaBoostM1 and MultiClass clas-sifier. Right: distribution of chosen feature search and evaluatormethods. Both plots are aggregated across all Auto-WEKA variants;None indicates that no feature selection was performed. . . . . . . 37

ix

Figure 4.4 Heat map of chosen classifiers in all chosen meta-methods aggre-gated across SMAC, I/F-Race, and TPE Auto-WEKA variants foreach dataset. A darker colour indicates the method was selectedmore often. Datasets are sorted by size, classifiers are ordered bymethodology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

Figure 4.5 Distribution of chosen regression algorithms aggregated acrossSMAC, I/F-Race and TPE Auto-WEKA variants across all smalland large datasets, ranked on their frequency of being selected.Meta-methods are marked by a ∗ suffix, ensemble methods by a +

suffix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43Figure 4.6 Heat map of chosen regression algorithms aggregated across SMAC,

I/F-Race and TPE Auto-WEKA variants for each dataset. A darkercolour indicates that the method was selected more often. Meta-methods are marked by a ∗ suffix, ensemble methods by a + suffix.Datasets are sorted by size, regression algorithms are ordered bymethodology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

Figure 4.7 Left: distribution of chosen base regression algorithms for the twomost frequently selected meta-methods: additive regression andbagging. Right: distribution of chosen feature search and evaluatormethods. Both plots are aggregated across all Auto-WEKA variants;None indicates that no feature selection was performed. . . . . . . 44

Figure 4.8 Heat map of chosen regression algorithms in all chose meta-methodsaggregated across SMAC, I/F-Race and TPE Auto-WEKA vari-ants for each dataset. A darker colour indicates that the methodwas selected more often. Datasets are sorted by size, regressionalgorithms are ordered by methodology. . . . . . . . . . . . . . . . 44

Figure 4.9 Graphical representation of the training data partitioning schemeused by SMAC-Multi-Level. . . . . . . . . . . . . . . . . . . . . . . 47

Figure 4.10 Trajectories of training and test performance over time for twosmall datasets. The vertical black line indicates the original 30hour time budget. Shaded areas show the 10-90% quantile from thebootstrapped samples. . . . . . . . . . . . . . . . . . . . . . . . . . 54

Figure 4.11 Trajectories of training and test performance over time for twolarge datasets. The vertical black line indicates the original 30 hourtime budget. Shaded areas show the 10-90% quantile from thebootstrapped samples. . . . . . . . . . . . . . . . . . . . . . . . . . 55

x

Acknowledgements

There are many people who helped make this work happen. First, I would like to thank my supervisors, Holger Hoos and Kevin Leyton-Brown, as well as close collaborator Frank Hutter, for their enduring guidance in working on this project.

The members of both the β-lab and the GTDT reading group have been enormously supportive (both directly and indirectly): Alexandre Frechette, Baharak Rastegari, Chris Fawcett, David Thompson, James Wright, Sam Bayless, Steve Ramage, and Zach Drudi.

Finally, thanks to my friends and family for their unceasing encouragement.


Chapter 1

Introduction

An increasing variety of sophisticated feature selection and learning algorithms, complete with many hyperparameters, are currently available to the growing number of machine learning practitioners. These users require off-the-shelf solutions to their data analysis problems. The machine learning community has greatly aided such users by making available open source packages such as WEKA [Hall et al., 2009] and PyBrain [Schaul et al., 2010]. Such packages require a user to make two kinds of choices: first to select a learning algorithm, and second to customize it by setting hyperparameters, which may also control feature selection. It can be daunting to make the best choices when faced with so many degrees of freedom. Often a user may lack an in-depth understanding of the terminology and mechanics associated with each learning algorithm and its hyperparameter settings. This leads many users to select algorithms based on reputation or intuitive appeal, often leaving hyperparameters set to their default values. Adopting such a selection approach can yield poor performance.

This suggests a natural challenge for machine learning: given a dataset, automatically and simultaneously choose a learning algorithm and set its hyperparameters to optimize empirical performance. We dub this problem the combined algorithm selection and hyperparameter optimization (CASH) problem. We provide a tool, Auto-WEKA, which requires minimal input from its user and provides a solution to CASH, searching over the learning algorithms provided in the standard WEKA distribution.

The CASH problem consists of two main subproblems: algorithm selection and hyperparameter optimization. The remainder of this chapter defines these subproblems and discusses previous work by the machine learning community to address them individually. In Chapter 2, we formally define the CASH problem, discuss the small amount of attention that variants of the CASH problem have received in the literature, and describe some possible methods for solving it. Chapter 3 describes the design and mechanics of Auto-WEKA, our solution to an instance of the CASH problem. An in-depth empirical analysis of Auto-WEKA on 21 classification tasks and 13 regression tasks, comparing Auto-WEKA against standard baselines, is provided in Chapter 4. Future work is discussed in Chapter 5.

1.1 Supervised machine learning problems

Our work focuses on supervised machine learning problems: learning a function f : X → Y, with X a set of features and Y either a finite set of different labels (for classification) or a subset of R (for regression). A supervised learning algorithm A maps a set {d_1, ..., d_n} of training data points d_i = (x_i, y_i) ∈ X × Y to such a function. The family of functions that A can produce are often called models, while the output of A is often expressed via a vector of model parameters. The learned function can then be used on new data points x_j that were not contained in the training set, predicting the corresponding value y_j. Most learning algorithms A further expose hyperparameters λ from a hyperparameter space Λ, which change the way the learning algorithm A_λ learns the desired function. Hyperparameters are used to indicate quantities such as a description-length penalty, the kernel width of a support vector machine, the number of neurons in a hidden layer of a neural network, and the number of data points that a leaf in a decision tree must contain to be eligible for splitting. In order to obtain a function that produces accurate predictions, both learning algorithm selection and hyperparameter optimization need to be interwoven.

1.2 Learning algorithm selection

Learning algorithm selection, also called model selection, has been well studied by the machine learning community, a sample of which will be discussed in Section 1.2.1. Given a set of learning algorithms A and a set of training data D = {(x_1, y_1), ..., (x_n, y_n)}, the goal of model selection is to determine the algorithm A* ∈ A with the best generalization performance. Generalization performance is estimated by splitting D into (possibly many) disjoint training and validation sets D_train^(i) and D_valid^(i) for i = 1, ..., k, then learning functions f_i by applying A* to D_train^(i) and evaluating the predictive performance of these functions on D_valid^(i). This allows the learning algorithm selection problem to be written as:

$$A^* \in \operatorname*{argmin}_{A \in \mathcal{A}} \; \frac{1}{k}\sum_{i=1}^{k} \mathcal{L}\bigl(A, \mathcal{D}_{\mathrm{train}}^{(i)}, \mathcal{D}_{\mathrm{valid}}^{(i)}\bigr), \tag{1.1}$$

where L(A, D_train^(i), D_valid^(i)) is the loss achieved by A when trained on D_train^(i) and evaluated on D_valid^(i). For classification problems, the loss is typically defined as the rate at which the predictions have different labels than the validation data, whereas for regression problems the loss is often expressed as the root mean squared error (RMSE).

1.2.1 Previous approaches to learning algorithm selection

The simplest (and most general) approach that could be used to perform model selection is to first fix a set A of many different learning algorithms, compute an estimate of the loss function for each using the partitioned training data, and finally select the algorithm with the lowest loss. One of the most common techniques for splitting the training data into pairs of training and validation sets is k-fold cross-validation, which splits the training data into k equal-sized partitions D_valid^(1), ..., D_valid^(k) and sets D_train^(i) = D \ D_valid^(i) for i = 1, ..., k. This is not the only way to partition the training data; Kohavi [1995] presents other techniques such as repeated random subsampling validation. This exhaustive approach suffers from the high computational cost of computing the estimated loss function of each algorithm, and the more philosophical hurdle of deciding which algorithms should be included in the set A.
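A minimal sketch of this exhaustive baseline, assuming scikit-learn-style estimators; the candidate set and dataset are illustrative placeholders, not the algorithms considered in this thesis.

```python
# Exhaustive model selection: estimate each algorithm's loss by k-fold
# cross-validation (hyperparameters left at their defaults) and pick the
# algorithm with the lowest mean loss.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def select_algorithm(candidates, X, y, k=10):
    """Return the candidate with the lowest mean k-fold validation error."""
    losses = {}
    for name, algo in candidates.items():
        acc = cross_val_score(algo, X, y, cv=k, scoring="accuracy")
        losses[name] = 1.0 - acc.mean()          # classification loss = error rate
    best = min(losses, key=losses.get)
    return best, losses

candidates = {
    "tree": DecisionTreeClassifier(),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(),
}
# best, losses = select_algorithm(candidates, X, y)   # X, y: the training data
```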

Hoeffding races [Maron and Moore, 1994] address the first of these issues: the cost of selecting amongst a number of different algorithms. The main idea in racing algorithms for model selection is to determine which candidates (the models being compared) are highly probable to be inferior to the best candidates. Once inferior candidates have been identified, there is no need to expend further effort investigating their performance. In a Hoeffding race, a schedule is chosen over the pairs of training and validation data sets uniformly at random, determining the order in which the pairs will be used for estimating the loss of each candidate. The race consists of many rounds, where at each round the next pair of training and validation data is taken from the schedule and used to generate estimates of the loss function for each candidate. Hoeffding's bound [Hoeffding, 1963] is then used to produce an upper and lower bound on the true value of the loss function for each candidate algorithm. Any candidate that has a lower (i.e., best-case) bound that is above the best candidate's upper bound (i.e., worst-case) is eliminated. The race continues until only one candidate remains or all the pairs of training and validation data have been used to estimate the loss. Note that the race requires an initial burn-in period to gain a reliable estimate of the loss function before removing any candidates from the race. This implies that no candidate will be eliminated for some number of rounds at the beginning of the race. Additionally, because the data is used for multiple comparisons, techniques such as Bonferroni correction need to be used to avoid statistical errors [Maron and Moore, 1994].
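The elimination rule can be sketched as follows, assuming losses in [0, 1]; the confidence level delta and the bookkeeping are illustrative choices, not values prescribed here.

```python
# One Hoeffding-race elimination step: compute a confidence interval around each
# candidate's mean loss and drop any candidate whose best case is still worse
# than some other candidate's worst case.
import math

def hoeffding_bound(n, delta=0.05, loss_range=1.0):
    """Half-width of the Hoeffding confidence interval after n loss observations."""
    return loss_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def eliminate(candidates, delta=0.05):
    """candidates: dict name -> list of observed losses. Returns the survivors."""
    bounds = {}
    for name, losses in candidates.items():
        mean = sum(losses) / len(losses)
        eps = hoeffding_bound(len(losses), delta)
        bounds[name] = (mean - eps, mean + eps)      # (best case, worst case)
    best_worst_case = min(upper for _, upper in bounds.values())
    return {name: candidates[name]
            for name, (lower, _) in bounds.items() if lower <= best_worst_case}
```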

Meta-learning is a discipline that uses machine learning to make predictions about a dataset as a whole, rather than about a particular element in the dataset [Bardenet et al., 2013, Leite et al., 2012, Pfahringer et al., 2000, Vilalta and Drissi, 2002]. One such meta-learning technique is landmarking. For each dataset in a repository of many datasets, a vector of features of the dataset is computed, such as the number of categorical or numeric attributes, the number of prediction labels (only for classification), or the size of the dataset. Additionally, the loss function of a number of different learning algorithms is evaluated on each dataset in the repository. A meta-learner is then trained on these pairs of dataset features and algorithm performance, either predicting the best algorithm for a particular dataset or providing a ranking over the algorithms that should be used on the dataset. Using the formalization outlined in Section 1.1, the meta-learner operates on a dataset in which the x_i contain the features of datasets used in supervised machine learning tasks, and the corresponding y_i indicate the learning algorithm with the best performance. Landmarking suffers from the fact that even with an extensive repository of dataset features and algorithm performance results (which requires significant computational investment), it is likely that there will be subsequent machine learning problems proposed by the user for which the meta-learner makes inaccurate predictions. Such is the pitfall of research exploration in any discipline. Note that determining which learning algorithm to use for the meta-learner is itself another instance of model selection, so the algorithm chosen for the meta-learner can heavily influence which methods are selected.
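A minimal sketch of the landmarking idea; the particular meta-features and the choice of a random forest as meta-learner are illustrative assumptions, not taken from this thesis.

```python
# Landmarking: compute simple dataset-level features for each dataset in a
# repository, record which algorithm performed best on it, and train a
# meta-learner to predict that choice for new datasets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def meta_features(X, y):
    """A few simple dataset-level features."""
    return [X.shape[0],                 # number of data points
            X.shape[1],                 # number of attributes
            len(np.unique(y))]          # number of labels (classification)

def train_meta_learner(repository):
    """repository: list of (X, y, best_algorithm_name) triples."""
    feats = [meta_features(X, y) for X, y, _ in repository]
    labels = [best for _, _, best in repository]
    return RandomForestClassifier(n_estimators=100).fit(feats, labels)

# meta = train_meta_learner(repository)
# predicted = meta.predict([meta_features(X_new, y_new)])[0]
```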

Another consideration when performing model selection is the choice of loss function. There may be extra information inside the learning algorithm which provides a better indication of its generalization performance, allowing more accurate predictions on new data. One such measure is Akaike's entropic information criterion [Bozdogan, 1987], known as AIC. AIC represents a compromise between the complexity of the learned function and the loss estimate, with the idea that functions that are less complex are more likely to generalize to new data, consistent with the principle known as Occam's razor. There are similar techniques, such as the Bayesian information criterion [Schwarz, 1978], that provide alternate ways of balancing loss against model complexity.
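For reference, the standard forms of these two criteria (added here for convenience; the thesis does not rely on a particular form) for a model with p free parameters, maximized likelihood value L̂, and n training points are

$$\mathrm{AIC} = 2p - 2\ln\hat{L}, \qquad\qquad \mathrm{BIC} = p\ln n - 2\ln\hat{L}.$$

In both cases lower values are preferred; BIC penalizes model complexity more heavily as the number of training points n grows.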

1.3 Hyperparameter optimization

The problem of optimizing the hyperparameters λ ∈ Λ of a given learning algorithm A is conceptually similar to that of model selection. In both instances, the best performing predictive model for a given dataset is desired, but instead of selecting from many different learning algorithms, the optimization considers a single algorithm's hyperparameters. The hyperparameters of a learning algorithm are often continuous, and their hyperparameter spaces are often high-dimensional. Additionally, it is possible to exploit the correlation between different hyperparameter settings λ1, λ2 ∈ Λ, a characteristic with no natural analogue in model selection. Given n hyperparameters λ1, ..., λn with domains Λ1, ..., Λn, the hyperparameter space Λ is a subset of the cross-product of these domains: Λ ⊂ Λ1 × · · · × Λn. This subset is often strict, such as when certain settings of one hyperparameter render other hyperparameters inactive. For example, the parameters determining the specifics of the third layer of a deep belief network are not relevant if the network depth is set to one or two. Likewise, the parameters of a support vector machine's polynomial kernel are not relevant if a radial basis function kernel is used.

More formally, following Hutter et al. [2009], we say that hyperparameter λi is conditional on another hyperparameter λj if λi is only active when λj takes values from a given set Vi(j) ⊊ Λj; in this case, we call λj a parent of λi (and conversely, λi a child of λj). Conditional hyperparameters can in turn be parents of other conditional hyperparameters, giving rise to a tree-structured space [Bergstra et al., 2011] or, in some cases, a directed acyclic graph (DAG) [Hutter et al., 2009]. Given such a structured space Λ, the (hierarchical) hyperparameter optimization problem can be formalized as identifying

$$\lambda^* \in \operatorname*{argmin}_{\lambda \in \Lambda} \; \frac{1}{k}\sum_{i=1}^{k} \mathcal{L}\bigl(A_\lambda, \mathcal{D}_{\mathrm{train}}^{(i)}, \mathcal{D}_{\mathrm{valid}}^{(i)}\bigr).$$

1.3.1 Previous approaches to solving hyperparameter optimization

Manual tuning of hyperparameter values has often been used in the past, since experienced users may have good intuition about which hyperparameters are likely to influence the performance of their learning algorithm most. By iteratively trying new hyperparameter settings, a user can home in on those that perform well. However, this can be a time-consuming process and can nevertheless often result in suboptimal performance. The weaknesses of manual tuning may be particularly apparent when the user's intuition is not valid for their specific problem.

Rather than relying on a user to guide the choice of hyperparameter values, grid search [Friedman et al., 2009] is one of the simplest automatic alternatives. Grid search requires that each hyperparameter λi in the hyperparameter space be treated discretely. Each numeric hyperparameter is discretized between some minimal and maximal value, while categorical hyperparameters remain unchanged. The set of grid points is then defined to be the Cartesian product of each of the now discrete λi. At each of these grid points, the loss function is computed for all of the pairs (folds) of training and validation data. The hyperparameter settings with the best performance over this grid are then used. Note that due to the combinatorial nature of grid search, this can be quite a computational burden if the discretization is fine or (particularly) if there are many hyperparameters. This can be partially addressed by starting out with a very coarse discretization, then refining the upper and lower bounds of hyperparameters to explore the area around the grid point with the best performance in the previous iteration [Van Gestel et al., 2004].
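A minimal sketch of this procedure on a hypothetical SVM-style space (the parameter names C, gamma, and kernel are illustrative, not one of WEKA's learners): numeric hyperparameters are discretized on a log scale and every grid point is evaluated with k-fold cross-validation.

```python
import itertools
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

grid = {
    "C":      np.logspace(-3, 3, 7),      # 7 discrete values per numeric parameter
    "gamma":  np.logspace(-4, 2, 7),
    "kernel": ["rbf", "poly"],            # categorical hyperparameter, left as-is
}

def grid_search(X, y, k=10):
    best_loss, best_cfg = float("inf"), None
    for values in itertools.product(*grid.values()):   # Cartesian product of the grid
        cfg = dict(zip(grid.keys(), values))
        loss = 1.0 - cross_val_score(SVC(**cfg), X, y, cv=k).mean()
        if loss < best_loss:
            best_loss, best_cfg = loss, cfg
    return best_cfg, best_loss
```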

Grid search also suffers from the fact that often only a few hyperparameters are responsible for most of the performance of a learning algorithm. In order to prevent a combinatorial explosion of grid points, each hyperparameter is discretized into a relatively small number of values. While the total number of different hyperparameter combinations examined over the course of the grid search is often quite high, each individual hyperparameter has only a few possible values tested. This is particularly problematic because the few hyperparameters that are responsible for a large portion of the performance variation of the learning algorithm receive the same amount of attention as the hyperparameters that do not greatly affect performance.

By sampling values for all hyperparameters at random, important hyperparameters will take on many different values, resulting in a more effective search of the hyperparameter space. Using this random search, Bergstra and Bengio [2012] showed that with fewer resources, the performance of selected hyperparameter values was better than both grid search and expert manual tuning. Like grid search, random search is also trivially parallelizable; by performing independent runs of the search with different random seeds on all available machines, it is easy to take advantage of large compute clusters or cloud computing to simultaneously examine many different hyperparameter values.
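A random-search counterpart to the grid-search sketch above: each iteration samples every hyperparameter independently (here log-uniformly for C and gamma), so the important hyperparameters take on many distinct values. The space again mirrors the hypothetical SVM example.

```python
import random
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def sample_config(rng):
    return {
        "C":      10 ** rng.uniform(-3, 3),      # log-uniform over [1e-3, 1e3]
        "gamma":  10 ** rng.uniform(-4, 2),      # log-uniform over [1e-4, 1e2]
        "kernel": rng.choice(["rbf", "poly"]),
    }

def random_search(X, y, budget=100, k=10, seed=0):
    rng = random.Random(seed)
    best_loss, best_cfg = float("inf"), None
    for _ in range(budget):
        cfg = sample_config(rng)
        loss = 1.0 - cross_val_score(SVC(**cfg), X, y, cv=k).mean()
        if loss < best_loss:
            best_loss, best_cfg = loss, cfg
    return best_cfg, best_loss
```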


Evolutionary techniques have also been successfully applied to hyperparameter optimization, such as in work by Guo et al. [2008], where a particle swarm optimizer tuned the hyperparameters of a support vector machine. In work by Jin and Sendhoff [2008], evolutionary algorithms for multiobjective optimization were applied to set the hyperparameters and the complexity of the learned model. These techniques are promising, since they make few assumptions about the underlying optimization problem and are able to handle scenarios with many parameters, such as the work of Guo et al. [2008], which optimized 15 hyperparameters.

If all hyperparameters are numeric and the performance of the learning algorithm is well-behaved with respect to the hyperparameters, gradient-based techniques can be used [Bengio, 2000]. The gradient information can be computed directly or empirically approximated. One of the most popular of these techniques is stochastic gradient descent (SGD) [Bottou, 1998]. SGD is especially appealing for cases with large amounts of data, since the partial gradient information can be computed using mini-batches of the data, making it possible to optimize performance for datasets that cannot be loaded into memory. Like all gradient-based techniques, if the learning algorithm's loss function is convex, SGD will not end up trapped in a local minimum, resulting in optimal hyperparameter settings.

Recently, techniques from Bayesian optimization have been used to search over hyperparameters: Snoek et al. [2012] used Gaussian processes and Bergstra et al. [2011] used a tree of Parzen estimators to find good hyperparameter settings. These methods have been shown to perform better than either grid or random search; in particular, Bergstra et al. [2011] were able to find hyperparameter settings for a deep belief network that surpassed the state of the art on a variant of the MNIST character recognition dataset.

There also exist various techniques that optimize hyperparameters for a specific family of learning algorithms. For example, Strijov and Weber [2010] used coherent Bayesian inference to adjust the coefficients in their parametric regression procedure. The drawback of such targeted optimization approaches is that they rely heavily on the specifics of the algorithm they are optimizing, making them difficult to transfer to different learning algorithms.


Chapter 2

CASH and algorithms for solving it

The combined algorithm selection and hyperparameter optimization (CASH) problem formally captures the challenge of simultaneously selecting a machine learning algorithm and choosing appropriate hyperparameter values for it. Solutions to this problem have large practical importance to the machine learning community, as users seek to leverage state-of-the-art algorithms for their research. Given a set of algorithms A = {A^(1), ..., A^(k)} with associated hyperparameter spaces Λ^(1), ..., Λ^(k) and disjoint pairs of training and validation data D_train^(i) and D_valid^(i), the goal in solving the CASH problem is to find:

$$A^*_{\lambda^*} \in \operatorname*{argmin}_{A^{(j)} \in \mathcal{A},\, \lambda \in \Lambda^{(j)}} \; \frac{1}{k}\sum_{i=1}^{k} \mathcal{L}\bigl(A^{(j)}_{\lambda}, \mathcal{D}_{\mathrm{train}}^{(i)}, \mathcal{D}_{\mathrm{valid}}^{(i)}\bigr). \tag{2.1}$$

We note that this problem can be reformulated as a single combined hierarchical hyperparameter optimization problem with parameter space Λ = Λ^(1) ∪ · · · ∪ Λ^(k) ∪ {λ_r}, where λ_r ∈ {A^(1), . . . , A^(k)} is a new root-level hyperparameter that selects between the algorithms A^(1), . . . , A^(k). The root-level parameters of each subspace Λ^(i) are made conditional on λ_r being instantiated to A^(i).
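A minimal sketch of this reformulation: the CASH space becomes one hierarchical hyperparameter space whose root-level parameter selects the algorithm, with each algorithm's own hyperparameters conditional on that choice. The algorithms and ranges shown are illustrative, not Auto-WEKA's actual space.

```python
cash_space = {
    "algorithm": {                       # root-level hyperparameter lambda_r
        "type": "categorical",
        "choices": ["random_forest", "svm", "knn"],
    },
    # Each subspace is active only when "algorithm" takes the matching value.
    "rf_num_trees":  {"type": "int",   "range": (10, 500),   "active_if": ("algorithm", "random_forest")},
    "svm_C":         {"type": "float", "range": (1e-3, 1e3), "log": True, "active_if": ("algorithm", "svm")},
    "svm_gamma":     {"type": "float", "range": (1e-4, 1e2), "log": True, "active_if": ("algorithm", "svm")},
    "knn_neighbors": {"type": "int",   "range": (1, 50),     "active_if": ("algorithm", "knn")},
}

def active_parameters(config):
    """Return the subset of a sampled configuration that is actually active."""
    algo = config["algorithm"]
    return {name: value for name, value in config.items()
            if name == "algorithm"
            or cash_space[name].get("active_if") == ("algorithm", algo)}
```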

Given the extensive literature on model selection and hyperparameter optimization, and in light of the problem's practical importance, we were surprised to find that only limited variants of the CASH problem have been studied. Furthermore, each of these variants is applicable only to a fixed and relatively small number of parameter configurations for each algorithm. For example, in the meta-learning based work of Leite et al. [2012], a total of 292 algorithm-hyperparameter combinations were considered, spanning six different learning algorithms, while Sun and Pfahringer [2013] present another meta-learning approach that considers twenty learning algorithms over 466 datasets.

We agree that it is very challenging to search the combined space of learning algorithms and their hyperparameters: the space is high-dimensional, involving both categorical and continuous choices, and the response function is noisy due to the limited quantity of validation data. Furthermore, the search space contains hierarchical dependencies; for example, the hyperparameters of a learning algorithm are only meaningful if that algorithm is chosen, and the base algorithm choices in an ensemble method are only meaningful if that particular ensemble method is chosen. The remainder of this chapter describes a number of possible procedures for solving CASH, adapting existing selection and optimization strategies from the literature. The first three methods, described in Section 2.1, are either simple approaches or are already in wide use by the machine learning community, while the last three methods, detailed in Section 2.2, all employ more complex optimization strategies.

2.1 Baselines

In principle, a solution to the CASH problem may be identified in a variety of ways. Our Exhaustive-Default (Ex-Def) technique was implemented as a rudimentary approach using minimal computational resources. To use Ex-Def, the user obtains implementations of a number of different learning algorithms that are applicable to their specific learning task and dataset. Ex-Def then computes the standard k-fold cross-validation for each learning algorithm, leaving hyperparameters at their default values as set by the implementers of each learning algorithm. After these computations are completed, the learning algorithm with the best performance is selected by Ex-Def to be used on the dataset. Note that this simple selection technique is likely unable to produce optimal performance, since it does not optimize hyperparameters beyond the defaults for the particulars of the given dataset.

Users with more computational resources at their disposal may employ a grid search technique, where the grid is the union of the distinct sub-grids for each of the available learning algorithms. While grid search can require extensive CPU time budgets for optimizing the hyperparameters of a single learning algorithm, this cost only increases linearly with the number of learning algorithms considered. Setting up such a grid search can also be labour-intensive, even using readily available research tools, such as those found in the open source machine learning package WEKA. WEKA provides two implementations of grid search for tuning the hyperparameters of a single learning algorithm; the first can optimize any number of top-level hyperparameters, while the second can optimize any two hyperparameters, including nested ones. However, the user has to define the minimal and maximal values for each numeric hyperparameter. In order to perform a grid search to solve CASH, the user would have to prepare a number of different grid search experiments using these tools, and select amongst the best models from each of the smaller grid searches.

Random search alleviates some of the drawbacks of grid search and may be applied to CASH in a straightforward way. Samples for the random search are created by simply selecting a learning algorithm at random, then randomly sampling values for each of the hyperparameters (and children of the active hyperparameters) that are associated with the chosen algorithm. As described in Section 1.3.1, random search offers several advantages over grid search.

2.2 Model-based methods

A promising approach to solving CASH is model-based optimization [Zlochin et al., 2004]. This approach generates a predictive model of the underlying optimization problem and uses this model to guide the optimization process. In particular, the Bayesian approach of Sequential Model-Based Optimization (SMBO) [Hutter et al., 2011], a versatile stochastic optimization framework that can work explicitly with both categorical and continuous hyperparameters, has the ability to exploit the hierarchical structure stemming from the conditional parameters that are prevalent in CASH. As outlined in Algorithm 1, SMBO first builds a model M_L that captures the dependence of the loss function L on hyperparameter settings λ (line 1 in Algorithm 1). It then iterates the following steps: use M_L to determine a promising candidate configuration of hyperparameters λ to evaluate next (line 3), evaluate the loss c of λ (line 4), and update the model M_L with the new data point (λ, c) obtained (lines 5-6).

In order to select the next hyperparameter configuration λ using model M_L, SMBO uses a so-called acquisition function a_{M_L} : Λ → R, which uses the predictive distribution of model M_L at arbitrary hyperparameter configurations λ ∈ Λ to quantify (in closed form) how useful knowledge about λ would be. SMBO then simply maximizes this function over Λ to select the most promising configuration λ to evaluate next.

Algorithm 1 SMBO
Input: Algorithm A with hyperparameter space Λ; k pairs of D_train^(i), D_valid^(i); time budget for optimization
Output: λ ∈ Λ with best performance

1: initialise model M_L; H ← ∅
2: while time budget for optimization has not been exhausted do
3:     λ, i ← candidate configuration and dataset pair id from M_L
4:     Compute c = L(A_λ, D_train^(i), D_valid^(i))
5:     H ← H ∪ {(λ, c, i)}
6:     Update M_L based on H
7: end while
8: return λ from H with minimal c
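A minimal sketch of this loop in code; the model interface (fit / suggest) is hypothetical, and SMAC and TPE differ precisely in how they would implement those two calls.

```python
import time

def smbo(loss_fn, space, folds, budget_seconds, model):
    """loss_fn(config, fold) -> validation loss; model exposes suggest() and fit()."""
    history = []                                    # H: (config, loss, fold) triples
    deadline = time.time() + budget_seconds
    while time.time() < deadline:                   # line 2: time budget not exhausted
        config, fold = model.suggest(space, folds)  # line 3: candidate from the model
        loss = loss_fn(config, fold)                # line 4: evaluate on one fold
        history.append((config, loss, fold))        # line 5: record the observation
        model.fit(history)                          # line 6: update the model
    return min(history, key=lambda item: item[1])[0]   # line 8: best config seen
```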

Several well-studied acquisition functions exist [Jones et al., 1998, Schonlau et al., 1998, Srinivas et al., 2010]; all aim to automatically trade off exploitation (locally optimizing hyperparameters in regions known to contain good settings) versus exploration (trying hyperparameter settings in relatively unexplored regions). In this work, we maximized the positive expected improvement (EI) attainable over an existing given loss value c_min [Schonlau et al., 1998]; the EI is high for hyperparameter configurations with high uncertainty and good predicted performance in the model. Let c(λ) denote the loss achieved by hyperparameter configuration λ. Then, the positive improvement function over c_min is defined as

$$I_{c_{\min}}(\lambda) := \max\{c_{\min} - c(\lambda),\, 0\}.$$

Of course, we do not know c(λ). We can, however, compute its expectation with respect to the current model M_L:

$$\mathbb{E}_{\mathcal{M}_L}\bigl[I_{c_{\min}}(\lambda)\bigr] = \int_{-\infty}^{c_{\min}} \max\{c_{\min} - c,\, 0\} \cdot p_{\mathcal{M}_L}(c \mid \lambda)\, dc. \tag{2.2}$$

While SMBO algorithms are well suited to solving CASH, other model-based techniques are also applicable. We now review two SMBO algorithms and one general model-based optimization algorithm that are capable of handling the hierarchical hyperparameters prevalent in CASH. The first algorithm has been predominantly used for algorithm configuration, while the last two have previously been used to perform hyperparameter optimization. To our knowledge, these algorithms have not previously been used to consider many different learning algorithms simultaneously.

2.2.1 Sequential model-based algorithm configuration (SMAC)

Sequential model-based algorithm configuration [SMAC; Hutter et al., 2011] has been predominantly used for the task of algorithm configuration: determining the parameters of solvers for (often hard) computational problems in order to produce either higher-quality solutions or faster run times for tasks such as Boolean satisfiability and mixed integer programming. CASH is conceptually similar to algorithm configuration, since parameter settings for industry-standard solvers are often a mix of categorical and numeric parameters, and may include conditional parameters. SMAC supports a variety of models p(c | λ) to capture the dependence of the loss function c on hyperparameters λ, including approximate Gaussian processes and random forests. In this thesis we used random forest models, since they tend to perform well with discrete and high-dimensional input data. SMAC handles conditional parameters by instantiating inactive conditional parameters in λ to default values for model training and prediction. This allows individual decision trees to include splits of the kind "is hyperparameter λ_i active?", allowing them to focus on active hyperparameters. SMAC obtains a predictive mean µ_λ and variance σ_λ² of p(c | λ) as frequentist estimates over the predictions of its individual trees for λ; it then models p_{M_L}(c | λ) as a Gaussian N(µ_λ, σ_λ²).

SMAC uses the expected improvement criterion defined in Equation 2.2, instantiating c_min to the error rate of the best hyperparameter configuration measured so far. Under SMAC's predictive distribution p_{M_L}(c | λ) = N(µ_λ, σ_λ²), this expectation can be expressed in closed form as

$$\mathbb{E}_{\mathcal{M}_L}\bigl[I_{c_{\min}}(\lambda)\bigr] = \sigma_\lambda \cdot \bigl[u \cdot \Phi(u) + \varphi(u)\bigr], \qquad u = \frac{c_{\min} - \mu_\lambda}{\sigma_\lambda},$$

where φ and Φ denote the probability density function and cumulative distribution function of a standard normal distribution, respectively [Jones et al., 1998]. A multi-start local search procedure is used to select the next hyperparameter configurations to evaluate, using the ten hyperparameter configurations already considered by SMAC with the largest EI as starting points. The local search greedily considers a set of neighbouring hyperparameter settings, where neighbours differ in one hyperparameter value, and terminates when there are no neighbours with a higher EI. An additional 10 000 random hyperparameter configurations are also considered among the possible configurations to evaluate next. The EI of this combined set of 10 010 hyperparameter configurations is then computed from the predictive model, and the configuration with the largest EI is selected. Note that this local search process is computationally cheap, since it only queries the predictive model; it can be further optimized since many of the predictions are for relatively nearby points in the hyperparameter space.
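The closed form above is straightforward to compute; a minimal sketch, assuming the model returns a Gaussian predictive mean and standard deviation for a configuration (as SMAC's random-forest model does):

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, c_min):
    """EI of a configuration with predictive loss ~ N(mu, sigma^2), given the
    best loss c_min observed so far (losses are minimized)."""
    if sigma <= 0.0:
        return max(c_min - mu, 0.0)          # degenerate (deterministic) prediction
    u = (c_min - mu) / sigma
    return sigma * (u * norm.cdf(u) + norm.pdf(u))
```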

SMAC was designed for robust optimization under noisy function evaluations, and as such implements special mechanisms to keep track of its best known configuration and to assure high confidence in its estimate of that configuration's performance. This robustness against noisy function evaluations can be leveraged in combined algorithm selection and hyperparameter optimization, since the function to be optimized in Equation (1.1) is a mean over a set of loss terms (each corresponding to one pair of D_train^(i) and D_valid^(i) constructed from the training set). A key idea in SMAC is to make progressively better estimates of this mean by evaluating the loss terms one at a time, thereby trading off accuracy for computational cost. In order for a new configuration to become the new incumbent (the current best found so far), it must outperform the previous incumbent in every comparison made: considering only one fold, two folds, and so on, up to the total number of folds previously used to evaluate the incumbent. Furthermore, every time the incumbent survives such a comparison, it is evaluated on a new fold, up to the total number available, meaning that the number of folds used to evaluate the incumbent grows over time. This also allows a poorly performing configuration to be removed from consideration after evaluating it on a single fold.

Finally, SMAC implements a diversification mechanism to achieve robust performance even when its model is misled, and to explore new parts of the space: every other configuration is selected uniformly at random. These randomly selected points improve the accuracy of the model and will not significantly hamper SMAC's progress if it has found a high-quality region of the search space. Because of the evaluation procedure just described, this requires less overhead than one might imagine.

2.2.2 Tree-structured Parzen estimator (TPE)

The Tree-structured Parzen Estimator [TPE; Bergstra et al., 2011] is an optimization technique specifically designed for hyperparameter optimization. While SMAC models p(c | λ) explicitly, TPE uses separate models for p(c) and p(λ | c). Specifically, it models p(λ | c) as one of two density estimates, conditional on whether c is greater or less than a given threshold value c*:

$$p(\lambda \mid c) = \begin{cases} \ell(\lambda), & \text{if } c < c^*, \\ g(\lambda), & \text{if } c \geq c^*. \end{cases}$$

Here, c* is chosen as the γ-quantile of the losses TPE has obtained so far (where γ is an algorithm parameter with a default value of γ = 0.15), ℓ(·) is a density estimate learned from all previous hyperparameter settings λ with corresponding loss smaller than c*, and g(·) is a density estimate learned from all previous hyperparameter settings λ with corresponding loss greater than or equal to c*. Intuitively, this creates a probabilistic density estimator ℓ(·) for hyperparameter settings that appear to do "well", and a different density estimator g(·) for hyperparameter settings that appear "poor" with respect to the threshold. Bergstra et al. [2011] showed that the expected improvement E_{M_L}[I_{c_min}(λ)] from Equation 2.2 is proportional to

$$\left(\gamma + \frac{g(\lambda)}{\ell(\lambda)}\,(1 - \gamma)\right)^{-1}.$$

TPE maximizes this expression by generating many candidate hyperparameter configurations at random from ℓ(·) and picking a λ that minimizes g(λ)/ℓ(λ).

The density estimators ℓ(·) and g(·) have a hierarchical structure with continuous, discrete, and conditional variables reflecting the hyperparameters and their dependence relationships. For each node in this tree structure, a 1-D Parzen estimator is created to model the probability density function of the node's corresponding hyperparameter. For a given hyperparameter configuration λ that is added to either ℓ or g, only the 1-D estimators corresponding to active hyperparameters in λ are updated. For continuous hyperparameters, these 1-D estimators are constructed by placing density in the form of a Gaussian at each hyperparameter value λ_i, with standard deviation set to the larger of the distances to each value's left and right neighbours. Discrete hyperparameters are estimated with probabilities proportional to the number of times that a particular choice occurred in the set of observations. To evaluate a candidate hyperparameter configuration λ's probability estimate, TPE starts at the root of the tree and descends into the leaves by following paths that use only active hyperparameters. At each node in this traversal, the probability of the corresponding hyperparameter is computed according to its 1-D estimator, and the individual probabilities are combined on a pass back up to the root of the tree. Note this means that TPE assumes independence between hyperparameters that do not appear together along any path from the tree's root to one of its leaves. This assumption can be problematic, since it does not account for the case in which interactions between sibling hyperparameters are responsible for performance differences.
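A minimal sketch of the candidate-selection step just described, simplified to a single numeric hyperparameter: split the observations at the γ-quantile of the losses, fit one density to the "good" configurations and one to the rest, then pick the candidate (drawn from the good density) that minimizes g/ℓ. A Gaussian KDE stands in for TPE's tree of Parzen estimators, and enough observations on each side of the split are assumed.

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_suggest(configs, losses, gamma=0.15, n_candidates=100, seed=0):
    configs, losses = np.asarray(configs, float), np.asarray(losses, float)
    c_star = np.quantile(losses, gamma)
    good, bad = configs[losses < c_star], configs[losses >= c_star]
    l_density = gaussian_kde(good)                 # l(.): "well"-performing settings
    g_density = gaussian_kde(bad)                  # g(.): "poor" settings
    candidates = l_density.resample(n_candidates, seed=seed).ravel()
    scores = g_density(candidates) / l_density(candidates)
    return candidates[np.argmin(scores)]           # maximizing EI ~ minimizing g/l
```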


2.2.3 Iterated F-Race (I/F-Race)

Iterated F-Race [I/F-Race; Balaprakash et al., 2007] belongs to the more general family of model-based optimization algorithms and, as the name suggests, uses a racing procedure at its core. Like SMAC, I/F-Race has been primarily used for algorithm configuration tasks, such as configuring a solver for scheduling problems [Dubois-Lacoste et al., 2011]. Candidates for the race are sampled randomly, and conditional hyperparameters are supported by sampling child hyperparameters only when their parent hyperparameter is active. I/F-Race can be used to solve CASH by treating the choice of which learning algorithm to use as a root-level hyperparameter.

Recall that Hoeffding races use Hoeffding's bound to assess the likely performance of a racing candidate, and that this bound can often be quite loose. F-Race [Birattari et al., 2002] replaces the bound with the non-parametric Friedman test [Conover, 1998] to find inferior candidates. This test considers the ranks of all the candidates for each pair of training and validation data used so far in the race, and indicates whether there exist some candidates that tend to yield better performance than at least one other. As soon as the Friedman test detects the presence of such a difference, pairwise test statistics are computed between the candidates to eliminate those with poor performance. Unlike Hoeffding races, F-Race does not use any form of multiple testing correction when comparing candidates.

Note that F-Race is unable to select different learning algorithms or new values for hyperparameters once the race has begun, so the initial number of racing candidates should be quite large in order to ensure high performance. The initial candidates can be generated, for example, by either using all the points in a grid search or through random sampling. Since racing algorithms require a few iterations before they can begin to eliminate candidates, this means that a large portion of the computational resources will be spent investigating algorithms and hyperparameter settings that are not even close to optimal. I/F-Race solves this problem by performing many rounds of a modified F-Race procedure on a more manageable number of candidates, each time randomly sampling new candidates from the space of learning algorithms and hyperparameters. The modifications to the standard F-Race procedure are in the termination conditions: a race is terminated if the number of surviving candidates drops below a fixed threshold, if the race has used at least some number of folds of the dataset, or if some computational budget has been exhausted. These thresholds are all set adaptively based on the specifics of the problem I/F-Race is optimizing. As soon as a (fixed) small number of candidates remain, the round is terminated, and the sampling distributions are updated to be more concentrated around the algorithms and hyperparameter values that appear to provide good performance.

More specifically, in the first round of I/F-Race, all the algorithms and their hyperparameters are sampled uniformly at random. Once a round of F-Race is terminated, the surviving candidates are ranked by their performance. To generate new candidates for the next round of the race, I/F-Race first samples from the survivors of the previous round inversely proportionally to their rank (candidates with high performance are more likely to be sampled). A new candidate λ'_s = (λ'_1, ..., λ'_d) is then generated from the sampled survivor λ_s = (λ_1, ..., λ_d) by setting λ'_i ~ N(λ_i, σ'_i), where

$$\sigma'_i = \sigma_i \cdot (1/N_{\max})^{1/d}.$$

In this equation, N_max is the initial number of candidates used at the beginning of an iteration of I/F-Race. This approach was designed to reduce the volume of the sampled hyperparameter space at a constant rate each iteration, so that candidates generated in subsequent iterations are concentrated around hyperparameter values that were successful in previous iterations.
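A minimal sketch of this candidate-generation step, assuming purely numeric hyperparameters for simplicity; the rank-based sampling weights are one straightforward way to realize "inversely proportional to rank".

```python
import numpy as np

def new_candidate(survivors, sigmas, n_max, rng):
    """survivors: list of d-dimensional configs ordered best first;
    sigmas: per-dimension standard deviations from the previous iteration."""
    d = len(survivors[0])
    ranks = np.arange(1, len(survivors) + 1)
    probs = (1.0 / ranks) / (1.0 / ranks).sum()        # better rank -> higher probability
    parent = survivors[rng.choice(len(survivors), p=probs)]
    new_sigmas = np.asarray(sigmas, float) * (1.0 / n_max) ** (1.0 / d)  # shrink volume
    candidate = rng.normal(parent, new_sigmas)          # perturb around the survivor
    return candidate, new_sigmas

# rng = np.random.default_rng(0)
# candidate, sigmas = new_candidate(survivors, sigmas, n_max=100, rng=rng)
```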

When I/F-Race finishes the final round of racing, it is possible that many candidates remain without sufficient evidence to indicate which is best. In this case, I/F-Race selects the candidate with the best performance measured over the used pairs (folds) of training and validation data. Like TPE, I/F-Race also assumes independence between hyperparameters (and therefore cannot capture interactions between sibling hyperparameters in its model), and only samples child hyperparameters when their parents are active.


Chapter 3

Auto-WEKA

To demonstrate the feasibility of an automatic approach to solving the CASH problem, we built a tool, Auto-WEKA, that solves this problem for all classification and regression algorithms, in combination with all feature selectors/evaluators, implemented in the standard WEKA package [Hall et al., 2009].

Table 3.1 provides a list of all 39 WEKA learning algorithms. Of these methods, 27 are considered base algorithms (which can be used independently), 10 of the remaining algorithms are meta-methods (which take a single base algorithm and its parameters as input), and the final 2 ensemble algorithms can take any number of base algorithms as input. We allowed the meta-methods to use any base algorithm with any hyperparameter settings, and allowed the 2 ensemble methods to use up to five of the 27 base algorithms, again with any hyperparameter settings. Auto-WEKA automatically determines which algorithms are applicable to each dataset, ensuring that regression algorithms are used when the prediction target is numeric, and classification algorithms are used when the prediction target is categorical. Additionally, Auto-WEKA avoids the use of algorithms that are incompatible with a given dataset due to issues such as missing feature values.

Table 3.2 provides a list of WEKA's three feature search methods and its eight feature evaluators, along with their respective numbers of hyperparameters: up to five for search methods and up to four for evaluators. To perform feature selection, a search method is combined with a feature evaluator, and the hyperparameters of both need to be instantiated. Feature selection is run as a preprocessing phase before the training of any learning algorithm begins.

The algorithms in Tables 3.1 and 3.2 have a wide variety of hyperparameters, which take values from continuous intervals, from ranges of integers, and from other discrete sets.

Table 3.1: Learning algorithms in Auto-WEKA. ∗ indicates meta-methods, whichin addition to their own parameters take one base algorithm and its parameters. +

indicates ensemble methods that take as input up to 5 base algorithms and theirparameters. We report the number of categorical and numeric hyperparameters foreach method.

Algorithm                     Cat.  Num.   Algorithm                        Cat.  Num.

Bayes Net                      2     0     C4.5 Decision Tree                6     2
Naive Bayes                    2     0     Logistic Model Tree               5     2
Naive Bayes Multinomial        0     0     M5 Tree                           3     1
Gaussian Process               3     6     Random Forest                     2     3
Linear Regression              2     1     Random Tree                       4     4
Logistic Regression            0     1     REP Tree                          2     3
Single-Layer Perceptron        5     2
Stochastic Gradient Descent    3     2     Locally Weighted Learning∗        3     0
SVM                            4     6     AdaBoostM1∗                       2     2
Simple Linear Regression       0     0     Additive Regression∗              1     2
Simple Logistic Regression     2     1     Attribute Selected∗               2     0
Voted Perceptron               1     2     Bagging∗                          1     2
KNN                            4     1     Classification via Regression∗    0     0
K-Star                         2     1     LogitBoost∗                       4     4
Decision Table                 4     0     MultiClass Classifier∗            3     0
RIPPER                         3     1     Random Committee∗                 0     1
M5 Rules                       3     1     Random Subspace∗                  0     2
1-R                            0     1
PART                           2     2     Voting+                           1     0
0-R                            0     0     Stacking+                         0     0
Decision Stump                 0     0

We associated either a uniform or log-uniform prior with each numerical parameter, depending on its semantics and a brief survey of chosen values from the literature. For example, we set a log-uniform prior for the ridge regression penalty, and a uniform prior for the maximum depth for a tree in a random forest. Auto-WEKA works with continuous hyperparameter values up to the precision of the machine it is run on; nevertheless, to give a sense of the size of the space we studied, we note that discretizing hyperparameter domains to a maximum of 10 values each gives rise to over 10^47 hyperparameter settings. We emphasize that this space is much larger than a simple union of the base learners' hyperparameter spaces (whose size is roughly 10^8), since the ensemble methods allow up to 5 independent base learners, giving rise to a space with roughly (10^8)^5 = 10^40 elements. Feature selection gives rise to another independent decision between roughly 10^6 choices, and several parameters on the ensemble and meta-level contribute another order of magnitude to the total size of Auto-WEKA's hyperparameter space.
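As a rough check on these orders of magnitude (under the same discretization to at most 10 values per hyperparameter), the individual estimates quoted above combine approximately as

$$\underbrace{(10^{8})^{5}}_{\text{up to 5 base learners}} \;\times\; \underbrace{10^{6}}_{\text{feature selection}} \;\times\; \underbrace{10^{1}}_{\text{meta/ensemble-level parameters}} \;=\; 10^{40+6+1} \;=\; 10^{47}.$$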


Table 3.2: Feature search/evaluator methods in Auto-WEKA. ∗ indicates search methods requiring one feature evaluator that is used to determine the importance of a feature.

Feature Method                 Categorical  Numeric

Best First∗                         1          1
Greedy Stepwise∗                    3          2
Ranker∗                             0          1

CFS Subset Eval                     2          0
Pearson Correlation Eval            0          0
Gain Ratio Eval                     0          0
Info Gain Eval                      2          0
1-R Eval                            1          2
Principal Components Eval           2          2
RELIEF Eval                         1          2
Symmetrical Uncertainty Eval        1          0

Auto-WEKA can be thought of as a single learning algorithm with a highly conditional hyperparameter space. As depicted in Figure 3.1, Auto-WEKA has two top-level Boolean parameters. The first, is_base, selects between single base learning algorithms and ensemble or meta-algorithms. If is_base is true, then the parameter base determines which of the 27 base methods is to be used. If is_base is false, then learner indicates either an ensemble or a meta-algorithm. If learner is a meta-algorithm, then the parameter meta_base is chosen to be one of the 27 base algorithms. In the event that learner is an ensemble algorithm, an additional parameter num_learners, an integer chosen from {1, . . . , 5}, determines the number of base algorithms to be used. base_i variables are then selected according to the value of num_learners, each determining which of the 27 base algorithms to use. For each base parameter, the hyperparameters of every base algorithm are attached and made conditional upon base selecting the corresponding base algorithm.

Auto-WEKA's second top-level Boolean parameter, feat_sel, determines whether to apply one of the feature selection methods. If feat_sel is false, then Auto-WEKA passes the unmodified dataset to the learning algorithm. If it is true, then feat_ser selects the choice of feature search method, and feat_eval selects the choice of feature evaluator (with conditional hyperparameters attached). This results in a very wide tree that captures the hierarchical nature of the hyperparameters and allows the creation of a single hyperparameter optimization problem with four hierarchical layers, consisting of a total of 786 parameters for classification problems and 472 parameters for regression problems. The difference arises because far fewer base algorithms in WEKA are able to make numeric predictions than categorical predictions.
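The following sketch (in Python, with placeholder algorithm lists and hypothetical parameter handling; it is not Auto-WEKA's actual parameter-space definition) illustrates how such a conditional space can be sampled, drawing child parameters only when their parents make them active:

```python
# A minimal sketch of sampling a conditional configuration; the algorithm
# lists are placeholders standing in for WEKA's full sets of methods.
import random

BASE = ["SVM", "RandomForest", "KNN"]            # stands in for the 27 base methods
META = ["AdaBoostM1", "Bagging"]                 # stands in for the 10 meta-methods
ENSEMBLE = ["Voting", "Stacking"]                # the 2 ensemble methods

def sample_configuration(rng):
    config = {"is_base": rng.random() < 0.5, "feat_sel": rng.random() < 0.5}
    if config["is_base"]:
        config["base"] = rng.choice(BASE)        # this choice's hyperparameters
    else:                                        # would be attached conditionally
        learner = rng.choice(META + ENSEMBLE)
        config["learner"] = learner
        if learner in META:
            config["meta_base"] = rng.choice(BASE)
        else:
            k = rng.randint(1, 5)
            config["num_learners"] = k
            config.update({f"base_{i}": rng.choice(BASE) for i in range(k)})
    if config["feat_sel"]:
        config["feat_ser"] = rng.choice(["BestFirst", "GreedyStepwise", "Ranker"])
        config["feat_eval"] = rng.choice(["CFSSubset", "InfoGain", "RELIEF"])
    return config

print(sample_configuration(random.Random(1)))
```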



Figure 3.1: Auto-WEKA's top-level parameters. Top: is_base controls Auto-WEKA's choice between using a base algorithm and using either a meta or ensemble learner. The triangular items represent a parameter that selects one of the 27 base algorithms and its associated hyperparameters. Bottom: feat_sel controls Auto-WEKA's choice of feature selection methods.

Since Auto-WEKA is agnostic about the choice of optimizer, we implemented variants leveraging SMAC, TPE, and I/F-Race. SMAC, TPE and I/F-Race have their own parameters influencing performance, such as TPE's choice of the γ-quantile indicating 'good' or 'bad' performance, the number of trees inside SMAC's random forest model, or I/F-Race's number of newly sampled candidates at each iteration. In Auto-WEKA, we used the defaults for these meta-hyperparameters, as set by their respective authors. Further improvements might be obtained by optimizing these meta-hyperparameters, but a separate process with a meta-training/validation set split would be required to guard against over-fitting, and we did not attempt this due to the extreme computational cost of such experiments.

All three model-based optimizers are randomized algorithms and thus produce different results based on the random seed provided. As demonstrated in work by Hutter et al. [2012], this allows for trivial, yet effective parallelization of the optimization process via simply performing k independent runs of the optimization method in parallel and selecting the result of the run with the lowest cross-validation error. Other, more sophisticated methods for the parallelization of Bayesian optimization exist [Hutter et al., 2012, Bergstra et al., 2011, Desautels et al., 2012, Snoek et al., 2012], but to date, there is no empirical evidence that these methods outperform the simple approach we used here when the cost of evaluating hyperparameter configurations varies across the hyperparameter space. Our SMAC and TPE variants of Auto-WEKA use the simple parallelization approach, simulating runs on a standard quad-core desktop using 4 parallel jobs. The authors of I/F-Race, however, specifically designed their algorithm to run in parallel during the racing phase. As such, our I/F-Race variant of Auto-WEKA performs evaluations of candidates in parallel across 4 CPU cores.
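A minimal sketch of this simple parallelization scheme is shown below (in Python; run_optimizer is a stand-in for one seeded optimizer run, not Auto-WEKA's actual entry point):

```python
# A minimal sketch: run k independent, differently seeded optimizer runs in
# parallel and keep the result with the lowest cross-validation error.
from concurrent.futures import ProcessPoolExecutor
import random

def run_optimizer(seed):
    """Placeholder for a full optimizer run; returns (cv_error, chosen config)."""
    rng = random.Random(seed)
    return rng.uniform(0.1, 0.5), {"seed": seed, "config": "..."}

def parallel_best(seeds, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_optimizer, seeds))
    return min(results, key=lambda r: r[0])      # lowest CV error wins

if __name__ == "__main__":
    print(parallel_best(seeds=range(4)))
```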

Auto-WEKA also has support for various resource constraints. When evaluating the performance of a learning algorithm on a pair of training and validation datasets, Auto-WEKA considers both memory and time limits. If the learning algorithm requests more than a user-defined threshold of RAM, Auto-WEKA aborts the training of the learning algorithm (and treats the evaluation as a failure in the optimization method). Auto-WEKA limits the time that can be used for training a learning algorithm on each pair of training and validation datasets to ensure that the optimization technique has a chance to sufficiently explore the search space. The user sets a training budget in advance, which Auto-WEKA uses to send an interrupt to the learning algorithm to finish training as soon as possible once the budget has been consumed. The learning algorithm produces a (partially) trained model in this case, which is then used to generate an error estimate on the validation data. Snoek et al. [2011] presented a promising approach for using runtime predictions in the expected improvement calculation to automatically drive the search away from excessively expensive models. While we did not implement such a technique, we see it as an interesting avenue to be explored in future work.
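The following sketch illustrates the general idea of a training-time budget (in Python, using a hypothetical incrementally trainable DummyModel; Auto-WEKA's actual mechanism interrupts WEKA's Java learners rather than checking a budget inside a training loop):

```python
# A minimal sketch, assuming an incrementally trainable stand-in model: stop
# training once the budget is consumed and return the partially trained model.
import time

class DummyModel:
    """Stand-in for an incrementally trainable learner."""
    def __init__(self):
        self.steps = 0
    def partial_fit(self, X, y):
        self.steps += 1
        time.sleep(0.01)                          # pretend each step costs time

def train_with_budget(model, batches, budget_seconds):
    start = time.monotonic()
    for X, y in batches:
        model.partial_fit(X, y)                   # one incremental training step
        if time.monotonic() - start > budget_seconds:
            break                                 # budget consumed: stop early
    return model                                  # possibly only partially trained

model = train_with_budget(DummyModel(), [([0], [0])] * 1000, budget_seconds=0.05)
print("completed training steps:", model.steps)
```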

In addition to supporting large-scale experiments on many datasets simultaneously, Auto-WEKA provides a user-friendly graphical interface. The interface operates in two modes, the first acting as a wizard (Figure 3.2). In wizard mode, a user specifies their dataset and the amount of computation time available. Auto-WEKA's experiment builder mode (Figure 3.3) presents additional parameter choices. The first screen accepts input of training and test data, and additionally specifies the method that Auto-WEKA will use to generate pairs of training and validation datasets. With the second screen, the user customizes the learning algorithms to be included in the search, possibly excluding algorithms that may be problematic for the dataset. The final screen sets the optimizer to use and specifies the user's resource constraints. Both modes then provide a way to perform and monitor the optimization process for different random seeds. After the optimization is complete, Auto-WEKA provides a summary of the performance of the selected algorithm with its hyperparameters, and allows the user to make predictions on new data (Figure 3.4).

Like WEKA, we implemented Auto-WEKA in Java, and the software works both on UNIX-based and Windows machines. Auto-WEKA and its source code are available at http://www.cs.ubc.ca/labs/beta/Projects/autoweka/. We are committed to ensuring that Auto-WEKA remains available to new users.

Figure 3.2: Auto-WEKA’s wizard interface.


Figure 3.3: Auto-WEKA’s experiment builder workflow.


Figure 3.4: Auto-WEKA's interface for examining the best learning algorithm and hyperparameters after an experiment has been run.


Chapter 4

Evaluating Auto-WEKA

We performed an experimental study on the effectiveness of solving CASH using Auto-WEKA, comparing its performance against three baseline methods on benchmark datasets for both classification and regression problems. Section 4.2 provides details of our classification benchmarks, while Section 4.3 provides the same details for our regression benchmarks. On both types of machine learning problems, we showed that Auto-WEKA's enormous hyperparameter space can be searched effectively to achieve low cross-validation error, then examined how well the chosen algorithms and hyperparameters generalize to unseen test data, and finally provided an analysis of the types of learning methods selected. Section 4.4 details an investigation into some alternate variants of Auto-WEKA in which the training data is leveraged in different ways, and their impact on performance.

4.1 Experimental setup

All of our experiments were run on Linux machines in Westgrid's Bugaboo and Orcinus clusters, having dual Intel Xeon X5650 six-core 2.66GHz processors with 24GB of RAM. We enforced a RAM limit of 3GB for the training of the learning algorithm, and allocated an additional 1GB of RAM for the optimization method inside Auto-WEKA. We limited the CPU time that could be used when training a learning algorithm to 150 minutes and allocated 15 minutes for feature search and evaluation. In preliminary experiments, few models exceeded this timeout for the datasets studied here. We chose these limits to be representative of the resource limitations faced by a typical user of machine learning algorithms.

In order to evaluate the effectiveness of Auto-WEKA, we compared it against the three baseline approaches to solving CASH already discussed in Section 2.1, namely Ex-Def, grid search and random search. We implemented each method in Java using a straightforward approach. For each dataset, we considered all applicable learning algorithms for use in Ex-Def and evaluated the cross-validation performance of these algorithms using the default hyperparameter values. Ex-Def would not consider any learning algorithm for selection if the algorithm breached any of its resource constraints while being evaluated. Our grid search uses all applicable base learning algorithms and optimizes their hyperparameters. For each base learning algorithm, we discretized numeric hyperparameters to three values (minimum, mean, and maximum). We implemented random search by sampling values from the same hyperparameter space that would be used by Auto-WEKA on each dataset (using the same uniform or log-uniform priors as Auto-WEKA), as shown in Figure 3.1.
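The two sampling schemes used by the baselines can be sketched as follows (in Python; the parameter names and ranges are illustrative, not WEKA's actual hyperparameters):

```python
# A minimal sketch of a three-point grid (minimum, mean, maximum) and of random
# draws from uniform or log-uniform priors; parameter specs are illustrative.
import itertools
import math
import random

numeric_params = {
    "ridge":     {"lo": 1e-8, "hi": 10.0, "log": True},
    "max_depth": {"lo": 1.0,  "hi": 20.0, "log": False},
}

def three_point_grid(spec):
    lo, hi = spec["lo"], spec["hi"]
    return [lo, (lo + hi) / 2.0, hi]              # minimum, mean, maximum

grid = list(itertools.product(*(three_point_grid(s) for s in numeric_params.values())))
print("grid points:", grid)

def random_draw(spec, rng):
    if spec["log"]:                               # log-uniform prior
        return math.exp(rng.uniform(math.log(spec["lo"]), math.log(spec["hi"])))
    return rng.uniform(spec["lo"], spec["hi"])    # uniform prior

rng = random.Random(0)
print("random sample:", {k: random_draw(s, rng) for k, s in numeric_params.items()})
```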

To estimate the loss function obtained by a particular learning algorithm, we employed 10-fold cross-validation on the training data for both our baselines and Auto-WEKA. SMAC and I/F-Race are capable of operating on a single fold at a time, but TPE does not support this, instead requiring a mean estimate over all folds for a hyperparameter setting at once. As such, the wrapper between Auto-WEKA and TPE must compute the full 10-fold cross-validation performance immediately when considering a potential learning algorithm and its corresponding hyperparameters. If the wrapper encounters three consecutive folds where the learning algorithm exceeds the resource constraints, we assumed that the remaining folds would also exceed the constraints, and allowed the wrapper to skip evaluating any outstanding folds.
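A minimal sketch of this fold-wise evaluation with early abort is given below (in Python; ResourceLimitExceeded and the fake evaluator are illustrative stand-ins for Auto-WEKA's actual resource handling):

```python
# A minimal sketch: evaluate folds one at a time and skip the remaining folds
# after three consecutive resource failures.
class ResourceLimitExceeded(Exception):
    """Raised when training exceeds the memory or time limit."""

def cross_validate(config, folds, evaluate):
    errors, consecutive_failures = [], 0
    for fold in folds:
        try:
            errors.append(evaluate(config, fold))
            consecutive_failures = 0
        except ResourceLimitExceeded:
            consecutive_failures += 1
            if consecutive_failures >= 3:          # assume the rest would fail too
                break
    return sum(errors) / len(errors) if errors else None

# Tiny demonstration with a fake evaluator that fails on odd-numbered folds.
def fake_evaluate(config, fold):
    if fold % 2:
        raise ResourceLimitExceeded()
    return 0.1 * fold

print(cross_validate({"algo": "SVM"}, folds=range(10), evaluate=fake_evaluate))
```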

For datasets with a predefined split between training and test data, we used that split. Otherwise, we randomly split the dataset into 70% training and 30% test data (see Tables 4.1 and 4.6). We withheld the test data from all optimization methods; it was only used once in an offline analysis stage to evaluate the models found by the various optimization methods. We denoted datasets with at least 10 000 training data points as 'large' and anything less as 'small'.

We assumed that our users have access to a 4-core machine, and designed our experiments accordingly. For random search, and the Auto-WEKA variants based on SMAC and TPE, we performed 25 runs of each process with different random seeds, then used bootstrap sampling to repeatedly select 4 random runs and report the performance of the run among these four with the best cross-validation performance. The Auto-WEKA I/F-Race variant is already designed to take advantage of multicore machines by evaluating racing candidates in parallel. Thus, we simply ran I/F-Race 25 times with different random seeds, and reported the mean performance of these runs.
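The bootstrap analysis can be sketched as follows (in Python, with synthetic run results standing in for the 25 real optimizer runs):

```python
# A minimal sketch: repeatedly draw 4 of the 25 runs, pick the one with the
# best cross-validation error, and average the test error of the picked runs.
import random

runs = [{"cv_error": random.Random(s).uniform(0.2, 0.4),
         "test_error": random.Random(s + 100).uniform(0.2, 0.4)} for s in range(25)]

def bootstrap_best_of_four(runs, repetitions=10000, seed=0):
    rng = random.Random(seed)
    picked_test_errors = []
    for _ in range(repetitions):
        sample = rng.sample(runs, 4)                       # simulate a 4-core user
        best = min(sample, key=lambda r: r["cv_error"])    # select on CV error only
        picked_test_errors.append(best["test_error"])
    return sum(picked_test_errors) / len(picked_test_errors)

print(bootstrap_best_of_four(runs))
```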


Since both Ex-Def and grid search have a finite number of performance measurements to gather, we did not place any time constraints on their completion (detailed runtimes are provided in Table 4.2). We allocated a total CPU time budget of 30 hours per core for experiments using random search and Auto-WEKA. Both the TPE and SMAC Auto-WEKA variants support setting a time limit directly, while the I/F-Race variant does not. The I/F-Race algorithm defines a budget as a maximum number of function evaluations, and sets a number of other parameters automatically given this budget. We first computed the average evaluation time required by the learning algorithms and hyperparameter settings considered over the course of our random search. We then used this value to determine the expected number of evaluations that the Auto-WEKA I/F-Race variant would be allowed to perform. Furthermore, if the I/F-Race variant of Auto-WEKA exceeded the 30 hour wall time budget, we terminated it, selecting the best found algorithm and hyperparameter settings from the completed iterations.
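A sketch of how such an evaluation budget could be derived is shown below (in Python; the mapping from CPU budget to evaluations and the example numbers are assumptions for illustration, not the values used in our experiments):

```python
# A minimal sketch: convert a wall-clock budget on several cores into an
# expected number of evaluations using the mean evaluation time observed
# during random search. All numbers below are illustrative.
def ifrace_budget(wall_hours, cores, mean_eval_seconds):
    total_cpu_seconds = wall_hours * 3600 * cores
    return int(total_cpu_seconds // mean_eval_seconds)

# e.g. a 30-hour budget on 4 cores with evaluations averaging 10 minutes each
print(ifrace_budget(wall_hours=30, cores=4, mean_eval_seconds=600))   # -> 720
```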

In early experiments, we observed a few cases in which Auto-WEKA's chosen hyperparameter settings, which had excellent training performance, turned out to generalize poorly. To help Auto-WEKA detect such overfitting, we partitioned its training set into two subsets: 70% for use inside the optimizer, and 30% of validation data that we only used after the optimizer finished.

4.2 Classification results

We evaluated Auto-WEKA on 21 prominent benchmark classification datasets (see Table 4.1): 15 sets from the UCI repository [Frank and Asuncion, 2010]; the 'Convex', 'MNIST Basic', and 'Rotated MNIST with Background Images' tasks used in Bergstra and Bengio [2012]; the appentency task from the KDD Cup '09; and two versions of the CIFAR-10 image classification task [Krizhevsky and Hinton, 2009]. CIFAR-10-Small is a subset of CIFAR-10 that includes only the first 10 000 training data points rather than the full 50 000. We optimized classification error rate, with 100% corresponding to classifiers that made no correct predictions. After filtering out invalid learning algorithms, Auto-WEKA searched in a 786-dimensional hyperparameter space.

4.2.1 The importance of solving CASH effectively

To highlight the importance of solving CASH, we looked at the oracle performance of Ex-Def and grid search on our datasets, which was determined by the minimal classification error on the test set when a classifier was built using all training data.


Table 4.1: Classification datasets used. Num Categorical and Num Numeric refer to the number of categorical and numeric attributes of elements in the dataset, respectively.

Name                 Num Categorical  Num Numeric  Num Classes  Num Training  Num Test

Dexter                      0            20 000         2             420         180
GermanCredit               13                 7         2             700         300
Dorothea                    0           100 000         2             805         345
Yeast                       0                 8        10           1 039         445
Amazon                      0            10 000        50           1 050         450
Secom                       0               590         2           1 097         470
Semeion                     0               256        10           1 116         477
Car                         6                 0         4           1 210         518
Madelon                     0               500         2           1 820         780
KR-vs-KP                   36                 0         2           2 238         958
Abalone                     1                 7        28           2 924       1 253
Wine Quality                0                11        11           3 429       1 469
Waveform                    0                40         3           3 500       1 500
Gisette                     0             5 000         2           4 900       2 100
Convex                      0               784         2           8 000      50 000

CIFAR-10-Small              0             3 072        10          10 000      10 000
MNIST Basic                 0               784        10          12 000      50 000
Rot. MNIST + BI             0               784        10          12 000      50 000
Shuttle                    38               192         2          35 000      15 000
KDD09-Appentency            0                 9         7          43 500      14 500
CIFAR-10                    0             3 072        10          50 000      10 000

These performances should be viewed as a best case to be expected, since the oracle is able to peek at the results and select the best (and additionally the worst seen by the oracle) overall. We note that the oracle best performance for Ex-Def provides a lower bound on the classification error that can be achieved via any method that performs only algorithm selection. In Table 4.2 (left), we observed that the gap between many of the best and worst classifiers was huge, for example misclassification rates of 4.93% vs 99.24% on the Dorothea dataset. Even when we artificially restricted the set of classifiers that Ex-Def examined to a hand-selected set of particularly popular algorithms (neural networks, random forests, SVMs, AdaBoost, C4.5 decision trees, logistic regression, and KNN), this gap still exceeded 20% on 14 out of the 21 datasets. Furthermore, there was no single classifier that achieved good performance across all datasets; every method was at least 22% worse than the best for at least one dataset. This highlights just how important selecting the correct learning algorithm can be for achieving good performance.

Table 4.2 (right) shows the oracle performance of grid search, in which the best and worst generalization performance is determined from all the evaluated grid points.


Table 4.2: Oracle performance of Ex-Def and grid search.

                              Oracle Ex-Def                              Oracle Grid Search
Name                 Best (%)  Worst (%)  Gap (%)  Time (Hours)   Best (%)  Worst (%)  Gap (%)  Time (Hours)

Dexter                 7.78      52.78     45.00       9.11         3.89      63.33     59.44       2,710
GermanCredit          26.00      38.00     12.00       0.09        25.00      68.00     43.00         327
Dorothea               4.93      99.24     94.31      26.17         4.64      44.64     40.00       2,622
Yeast                 40.00      68.99     28.99       0.06        36.85      69.89     33.03       35.41
Amazon                28.44      99.33     70.89       5.11        17.56      99.33     81.78       6,993
Secom                  7.87      14.26      6.38       5.83         7.66      92.13     84.47       4,646
Semeion                8.18      92.45     84.28       1.77         5.24      92.45     87.21         913
Car                    0.77      29.15     28.38       0.07         0.00      46.14     46.14       21.36
Madelon               17.05      50.26     33.21       5.92        20.64      62.69     42.05       4,786
KR-vs-KP               0.31      48.96     48.64       0.17         0.21      51.04     50.84       71.11
Abalone               73.18      84.04     10.85       0.39        72.15      92.90     20.75         420
Wine Quality          36.35      60.99     24.64       0.24        32.88      99.39     66.51         217
Waveform              14.27      68.80     54.53       0.39        13.47      68.80     55.33         361
Gisette                2.52      50.91     48.39      18.45         1.81      51.23     49.42      16,916
Convex                25.96      50.00     24.04      14.20        19.94      71.49     51.55      15,747

CIFAR-10-Small        65.91      90.00     24.09      13.61        52.16      90.36     38.20      17,847
MNIST Basic            5.19      88.75     83.56      15.42         2.58      88.75     86.17      14,040
Rot. MNIST + BI       63.14      88.88     25.74       7.86        55.34      93.01     37.67      14,280
Shuttle               0.0138   20.8414   20.8276       3.39       0.0069    89.8207   89.8138       2,036
KDD09-Appentency      1.7400    6.9733    5.2333       7.48       1.6332    54.2400   52.6068       5,741
CIFAR-10              64.27      90.00     25.73       3.65        55.27      90.00     34.73      11,364

Note that the oracle performance of grid search outperformed that of Ex-Def (for example, in the CIFAR-10-Small task, grid search offered a 13% reduction in error over Ex-Def), since grid search is able to perform both algorithm selection and hyperparameter optimization. However, this boost in performance does not come without computational cost. The time columns of Table 4.2 detail the number of CPU hours that were needed in order to complete Ex-Def and grid search. In particular, observe that nearly 14 CPU years were required for the grid search experiments, with all of Gisette, Convex, MNIST Basic, Rot. MNIST + BI and both CIFAR variants requiring over 10 000 CPU hours each. These resource requirements render grid search infeasible for use in most practical applications.

4.2.2 Results for training performance

With 786 hierarchical hyperparameters, Auto-WEKA's combined algorithm / hyperparameter space is enormous. We next studied how effectively the Auto-WEKA variants, using SMAC, TPE and I/F-Race, searched this space to optimize 10-fold cross-validation performance, and compared their performance to that of Ex-Def, grid search and random search. Table 4.3 shows our results.


Table 4.3: Training performance on classification datasets (Error %). Bold entries denote performance statistically indistinguishable from the best, according to a Welch's t test with p = 0.01.

                                                                          Auto-WEKA
Name               Ex-Def   Grid Search   Lim. Grid Search   Rand. Search   I/F-Race     TPE      SMAC

Dexter              10.20       5.07            6.61             10.60         6.53       9.80      6.43
GermanCredit        22.45      20.20           20.28             20.15        21.38      21.26     19.00
Dorothea             6.03       6.73            7.17              8.11         6.30       6.82      5.70
Yeast               39.43      39.71           39.71             38.74        37.91      35.01     35.97
Amazon              43.94      36.88           56.16             59.85        51.95      50.36     47.60
Secom                6.25       6.12            6.24              5.24         5.87       6.21      5.31
Semeion              6.52       4.86            5.14              6.06         5.30       6.76      4.89
Car                  2.71       0.83            0.83              0.53         0.97       0.91      0.57
Madelon             25.98      26.46           29.07             27.95        23.11      24.25     22.24
KR-vs-KP             0.89       0.64            0.64              0.63         0.59       0.43      0.35
Abalone             73.33      72.15           72.19             72.03        72.30      72.14     71.66
Wine Quality        38.94      35.23           35.40             35.36        35.53      35.97     34.63
Waveform            12.73      12.45           12.47             12.43        12.59      12.55     11.99
Gisette              3.62       2.59            5.45              4.84         2.42       3.56      2.42
Convex              28.68      22.36           30.58             33.31        28.49      28.56     26.11

CIFAR-10-Small      66.59      53.64           62.52             67.33        66.67      58.41     58.26
MNIST Basic          5.12       2.51            4.20              5.05         4.84       9.99      3.73
Rot. MNIST + BI     66.15      56.01           62.19             68.62        65.03      73.09     60.17
Shuttle             0.0328     0.0361          0.0368            0.0345       0.0306     0.0251    0.0190
KDD09-Appentency    1.8776     1.8735          1.8772            1.7510       1.8006     1.8776    1.7528
CIFAR-10            65.54      54.04           63.60             69.46        70.55      67.77     61.12

Grid search over the hyperparameters of all base-classifiers yielded better results than Ex-Def in 17/21 cases, underlining the importance of not only choosing the right algorithm but of also setting its hyperparameters well. Recall that grid search often requires a huge amount of time to run, so we also reported the results of a grid search limited to 120 hours of CPU time, the same amount which we gave to all other methods. These results were computed by bootstrap sampling points from the full grid search data until the time required to evaluate the selected points exceeded 120 CPU hours. Random search outperformed limited grid search (which has the same time budget) in 7/21 cases, was statistically indistinguishable using a Welch's t test with p-value 0.01 on 8/21 datasets, and was outperformed on 6/21. Even when comparing against the full grid search, random search still provided better performance on 7/21 datasets. This highlights that even exhaustive grid search with a large time budget is not always optimal, since good hyperparameter settings may lie off the grid discretization.

Auto-WEKA was able to improve on the performance obtained by all three baselines. The best Auto-WEKA method on each dataset outperformed the best baseline in 10/21 cases. However, when we only considered the baselines that were given the same 120 CPU hour budget, the best Auto-WEKA method outperformed the best time-limited baseline in 16/21 cases, and was only worse in 3/21 cases. The Auto-WEKA variant based on SMAC outperformed the TPE variant in 20/21 cases, and outperformed the I/F-Race variant in 15/21 cases. We also note that sometimes the relative improvement over the time-limited baselines was substantial, with relative reductions of the cross-validation error rate exceeding 5% in 7/21 cases, and exceeding 10% in 5/21 cases.

4.2.3 Results for test performance

Auto-WEKA was effective at optimizing its given objective function for classification; however, this is not sufficient to allow us to conclude that it produces models that generalize well. As the number of hyperparameters of a machine learning algorithm grows, so does its potential for overfitting. Our use of cross-validation substantially increases Auto-WEKA's robustness against overfitting, but since its hyperparameter space is much larger than that of standard classification algorithms, it is important to carefully study whether (and to what extent) overfitting remains a problem.

To evaluate generalization, we determined a combination of algorithm and hyperparameter settings $A_{\lambda}$ by running Auto-WEKA as before (cross-validating on the training set), trained $A_{\lambda}$ on the entire training set, and then evaluated the resulting model against the test set. Table 4.4 reports the test set performance obtained with all methods. Broadly speaking, similar trends held as for cross-validation performance: Auto-WEKA outperformed the baselines, while grid search and random search performed better than Ex-Def. However, the performance differences were less pronounced: grid search only yielded better results than Ex-Def in 13/21 cases, with 3/21 indistinguishable cases, and random search in turn outperformed grid search in 4/21 cases while being indistinguishable in 4/21 cases. Against the limited variant of grid search, random search provided better performance in 5/21 cases, and was indistinguishable in 7/21 cases. This is somewhat surprising, since previous work comparing grid and random search for just hyperparameter optimization showed that random search is typically preferable to time-limited grid search [Bergstra and Bengio, 2012]. This may be due to the size of Auto-WEKA's hyperparameter space compared to the (relatively) smaller spaces reported in the literature.

The best variant of Auto-WEKA outperformed the best baseline method in 10/21 cases, and provided worse performance in 7/21 cases.


Table 4.4: Test performance on classification datasets (Error %). Bold entries denote performance statistically indistinguishable from the best, according to a Welch's t test with p = 0.01.

                                                                          Auto-WEKA
Name               Ex-Def   Grid Search   Lim. Grid Search   Rand. Search   I/F-Race     TPE      SMAC

Dexter               8.89       5.00            6.03              9.18         8.11       8.83      8.33
GermanCredit        27.33      26.67           26.90             29.03        28.64      27.54     28.57
Dorothea             6.96       5.80            5.79              5.22         5.86       6.14      5.98
Yeast               40.45      42.47           42.47             43.15        40.04      40.11     39.08
Amazon              28.44      20.00           41.43             41.11        34.52      36.64     32.87
Secom                8.09       8.09            7.96              8.03         8.06       8.10      7.94
Semeion              8.18       6.29            5.61              6.10         6.43       8.25      5.39
Car                  0.77       0.97            0.97              0.01         0.21       0.18      0.29
Madelon             21.38      21.15           23.63             24.29        22.70      21.58     21.84
KR-vs-KP             0.31       1.15            1.15              0.58         0.61       0.54      0.31
Abalone             73.18      73.42           73.65             74.88        73.89      72.94     73.80
Wine Quality        37.51      34.06           33.98             34.41        33.85      33.57     33.52
Waveform            14.40      14.66           14.40             14.27        14.15      14.23     14.20
Gisette              2.81       2.40            4.68              4.62         2.17       3.96      2.24
Convex              25.96      23.45           27.67             31.20        26.34      25.60     23.13

CIFAR-10-Small      65.91      56.94           59.95             66.12        65.66      57.00     56.04
MNIST Basic          5.19       2.64            4.14              5.05         4.97      12.24      3.56
Rot. MNIST + BI     63.14      57.59           60.05             66.40        65.79      70.20     58.03
Shuttle             0.0138     0.0414          0.0340            0.0157       0.0221     0.0144    0.0137
KDD09-Appentency    1.7405     1.7400          1.7395            1.7400       1.7400     1.7381    1.7394
CIFAR-10            64.27      63.13           62.72             69.72        70.13      66.05     59.65

When comparing the best variant of Auto-WEKA against the best time-limited baseline, Auto-WEKA had better performance on 14/21 datasets, and was outperformed on 6/21. Notably, on 11 of the 12 largest datasets, Auto-WEKA outperformed the time-limited baselines, and was never worse; we attribute this to the fact that the risk of overfitting decreases with dataset size. In some cases, Auto-WEKA's performance improvements over the other methods were substantial, with relative reductions of the test error rate exceeding 10% in 3/21 cases.

Amongst the Auto-WEKA variants, SMAC again tended to perform best. Against TPE, SMAC provided better generalization error on 16/21 datasets, while TPE outperformed SMAC on 5/21 datasets. Against I/F-Race, SMAC had better generalization error on 11/21 datasets, and was indistinguishable on the remaining 10/21.

As mentioned earlier, Auto-WEKA only used 70% of its training set during the optimization of cross-validation performance, reserving the remaining 30% for assessing the risk of overfitting. At any point in time, Auto-WEKA's optimization method keeps track of its incumbent hyperparameter configuration. After its optimization procedure has finished, Auto-WEKA examines the trajectory of these incumbents and computes their generalization performance on the withheld 30% validation data.


Table 4.5: Correlation between performance on the withheld 30% validation data and the training data performance. Gap indicates the difference between the mean training performance and mean test performance from Tables 4.3 and 4.4.

                          I/F-Race                         TPE                           SMAC
Name               Test (%)  Gap (%)  Corr.     Test (%)  Gap (%)  Corr.     Test (%)  Gap (%)  Corr.

Dexter               8.11     -1.58    0.35       8.83     +0.97    0.82       8.33     -1.89    0.48
GermanCredit        28.64     -7.26    0.03      27.54     -6.28    0.31      28.57     -9.57    0.11
Dorothea             5.86     +0.44    0.17       6.14     +0.68    0.95       5.98     -0.28    0.33
Yeast               40.04     -2.13   -0.03      40.11     -5.09    0.36      39.08     -3.10    0.52
Amazon              34.52    +17.43    0.63      36.64    +13.72    0.92      32.87    +14.73    0.70
Secom                8.06     -2.19   -0.85       8.10     -1.89   -0.10       7.94     -2.63    0.29
Semeion              6.43     -1.14    0.39       8.25     -1.50    0.84       5.39     -0.50    0.77
Car                  0.21     +0.76    0.19       0.18     +0.73    0.12       0.29     +0.29    0.61
Madelon             22.70     +0.41    0.34      21.58     +2.67    0.44      21.84     +0.40    0.57
KR-vs-KP             0.61     -0.02    0.33       0.54     -0.11    0.22       0.31     +0.04    0.38
Abalone             73.89     -1.59    0.42      72.94     -0.81    0.15      73.80     -2.13    0.12
Wine Quality        33.85     +1.69    0.25      33.57     +2.41    0.73      33.52     +1.11    0.77
Waveform            14.15     -1.56    0.14      14.23     -1.67    0.36      14.20     -2.22    0.13
Gisette              2.17     +0.25   -0.16       3.96     -0.40    0.69       2.24     +0.18    0.71
Convex              26.34     +2.15    0.51      25.60     +2.96    0.98      23.13     +2.98    0.98

CIFAR-10-Small      65.66     +1.01    0.82      57.00     +1.40    0.93      56.04     +2.22    0.80
MNIST Basic          4.97     -0.13    0.86      12.24     -2.24    1.00       3.56     +0.17    0.75
Rot. MNIST + BI     65.79     -0.76    0.98      70.20     +2.90    0.50      58.03     +2.14    0.97
Shuttle             0.0221   +0.0086   0.26      0.0144   +0.0106   0.60      0.0137   +0.0053   0.62
KDD09-Appentency    1.7400   +0.0606  -1.00      1.7381   +0.1394   0.89      1.7394   +0.0134   0.50
CIFAR-10            70.13     +0.42    0.83      66.05     +1.71    0.33      59.65     +1.47    0.95

It then computes the Spearman rank coefficient between the sequence of training performances (evaluated by the optimization method through cross-validation) and this generalization performance. The Spearman rank coefficient makes no assumption about the relationship between the two performance measures (unlike, for example, the Pearson correlation coefficient, which assumes a linear relationship). Since we wanted a measure that could identify situations where Auto-WEKA is making continual improvements in performance, the Spearman rank coefficient is ideal. Table 4.5 shows the average correlation coefficient for each run of Auto-WEKA, alongside the average gap between the cross-validation performance and test performance from Tables 4.3 and 4.4. We note a general trend: as the absolute gap between cross-validation and test performance grows, the correlation coefficient decreases. The GermanCredit dataset is a good example where Auto-WEKA can signal that it only has low confidence in how well its chosen hyperparameter settings will generalize. We do note, however, that this weak signal has to be used with caution: there is no guarantee that large correlation coefficients will yield a small gap or vice versa.
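As an illustration, the correlation itself can be computed as follows (in Python, using scipy; the two trajectories are synthetic placeholders, not results from our experiments):

```python
# A minimal sketch: Spearman rank correlation between the incumbents'
# cross-validation errors and their errors on the withheld validation split.
from scipy.stats import spearmanr

cv_errors         = [0.40, 0.35, 0.30, 0.28, 0.25, 0.24]   # incumbent trajectory
validation_errors = [0.42, 0.36, 0.33, 0.34, 0.31, 0.30]

rho, _ = spearmanr(cv_errors, validation_errors)
print(f"Spearman rank correlation: {rho:.2f}")   # close to 1 => improvements generalize
```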


4.2.4 Selected methods

Figure 4.1 shows the distribution of classifiers chosen by our Auto-WEKA variants, aggregated across runs and datasets. We note that no single classifier clearly dominated all others. The most frequently selected classifiers (SVMs, random forests, and the single-layer perceptron) were each only selected in roughly 13% of all cases, and most other classifiers were selected in at least a few percent of cases. Furthermore, the selected methods differed considerably between large and small datasets, demonstrating the need for dataset-specific methods. For example, large datasets benefited more from meta-methods than small ones, while the choice of a single-layer perceptron was rare on large datasets. Figure 4.2 provides a more detailed breakdown of which classifiers Auto-WEKA selected on each dataset. In particular, it is interesting to see that for each of GermanCredit and Wine Quality, there was one dominant classifier: the single-layer perceptron was the most selected method for GermanCredit, while random forests were almost the only models chosen for Wine Quality. SVMs appeared to do well on medium-sized datasets, as there is a noticeable band between Waveform and MNIST Basic.

A more detailed investigation of the top two meta-methods in Figure 4.3 (left) shows which base-methods were chosen. Note that AdaBoostM1 frequently selected SVMs on small datasets, but never on large ones, while the random and REP trees were often chosen for large datasets. Note that this choice of trees on large datasets mimics a random forest, one of the top-performing base-methods. In the MultiClass classifier, the two most frequently selected methods were logistic regression and SVMs. It is interesting to note that logistic regression, as well as the random tree frequently selected by AdaBoostM1, were not often selected as base classifiers on their own. Figure 4.4 shows that as the size of the dataset increased, Auto-WEKA more often selected tree-based classifiers inside meta-methods than any other family of algorithms, perhaps due to their ability to be trained quickly on large amounts of data, which avoids Auto-WEKA treating the evaluation as a failure. This underlines the importance of searching Auto-WEKA's entire hyperparameter space instead of restricting attention to a small number of a user's favourite base classifiers.

Figure 4.3 (right) provides a breakdown of the feature search and evaluation methods that Auto-WEKA selected. Overall, it selected feature selection methods more often on smaller datasets than on larger ones, and if it did choose to do feature selection, it favoured using the ranker method. All feature evaluators were selected with roughly the same frequency for small datasets; in contrast, if Auto-WEKA performed feature selection for a large dataset, it favoured both the info gain evaluator and the 1-R evaluator. We note that Auto-WEKA's data-dependent choices (based on its internal cross-validation evaluation) allow it to use feature selection as a regularization method for small datasets, while at the same time using all features to construct more complex trained models for large datasets.


Figure 4.1: Distribution of chosen classifiers aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants across all the small and large datasets, ranked on their frequency of being selected. Meta-methods are marked by a ∗ suffix, ensemble methods by a + suffix.


Figure 4.2: Heat map of chosen classifiers aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants for each dataset. A darker colour indicates the method was selected more often. Meta-methods are marked by a ∗ suffix, ensemble methods by a + suffix. Datasets are sorted by size, classifiers are ordered by methodology.


Figure 4.3: Left: distribution of chosen base classifiers for the two most frequently selected meta-methods: AdaBoostM1 and the MultiClass classifier. Right: distribution of chosen feature search and evaluator methods. Both plots are aggregated across all Auto-WEKA variants; None indicates that no feature selection was performed.


Figure 4.4: Heat map of classifiers chosen inside all chosen meta-methods, aggregated across the SMAC, I/F-Race, and TPE Auto-WEKA variants for each dataset. A darker colour indicates the method was selected more often. Datasets are sorted by size, classifiers are ordered by methodology.


Table 4.6: Regression datasets used. Num Categorical and Num Numeric refer to the number of categorical and numeric attributes of elements in the dataset, respectively.

Name                 Num Categorical  Num Numeric  Num Training  Num Test

ForestFires                 2              10            362         155
Crime                       0             126          1 396         598
Quake                       0               3          1 525         653
Abalone                     1               7          2 924       1 253
Parkinson's Motor           1              19          4 113       1 762
Parkinson's Total           1              19          4 113       1 762
Comp-Activ                  0              25          5 529       2 369
Bank                        0              32          5 735       2 457
Pumadyn                     0              32          5 735       2 457
COIL                        2              83          5 822       4 000

House Census                0             137         15 949       6 835
Relation Network            0              22         37 390      16 023
Slice                       0             385         37 450      16 050

4.3 Regression results

We used 13 datasets for evaluating Auto-WEKA's performance on regression tasks (see Table 4.6): 9 datasets from the UCI repository [Frank and Asuncion, 2010] and 4 datasets from the Delve repository [Rasmussen et al., 1997]. We used root mean squared error (RMSE) as the loss function. If a learning algorithm was unable to make any predictions, we reported an RMSE of 10^10 to Auto-WEKA's optimization method. WEKA has far fewer regression algorithms than classification algorithms (20 vs 33), resulting in Auto-WEKA searching over a 472-dimensional hyperparameter space after filtering out invalid methods.

We only used Ex-Def and random search as our baselines, due to the extreme computational requirements of computing the performance of the over 218 000 hyperparameter settings/training and validation data pairs that would be needed to perform grid search over the base regression algorithms.

4.3.1 Results for training performance

Similar to our classification datasets, we studied Auto-WEKA's ability to optimize 10-fold cross-validation RMSE on the training data for our 13 datasets compared against Ex-Def and random search. Table 4.7 details these results.

Similar to the classification datasets, random search outperformed Ex-Def on 9/13 datasets.


Table 4.7: Training performance on regression datasets (RMSE). Bold entries denote performance statistically indistinguishable from the best, according to a Welch's t test with p = 0.01.

                                                  Auto-WEKA
Name                 Ex-Def    Rand. Search   I/F-Race      TPE        SMAC

Forest Fires        26.3215      19.2094      39.0064     32.6181    32.4824
Crime                0.1355       0.1314       0.1337      0.1320     0.1316
Quake                0.1924       0.1830       0.1917      0.1887     0.1884
Abalone              2.1568       2.0843       2.0972      2.0678     2.0881
Parkinson's Motor    0.6720       0.9368       0.6918      0.6806     0.5561
Parkinson's Total    0.7576       0.6545       0.5394      0.5173     0.3075
Comp-Activ           0.1877       0.1451       0.1500      0.1471     0.1449
Bank                 0.0857       0.0846       0.0847      0.0844     0.0845
Pumadyn              0.020368     0.020223     0.082832    0.020302   0.020236
COIL                 0.2304       0.2255       0.2315      0.2289     0.2270

House Census         0.2187       0.2271       0.2203      0.2621     0.2183
Relation Network     0.0270       0.0297       0.0298      0.0281     0.0280
Slice                0.2900       0.5595       2.0966      4.9374     0.4622

However, on the largest dataset, Ex-Def chose an algorithm with substantially better RMSE than those selected by random search. The Slice dataset consists of image slices of a CT scan of many patients and requires algorithms to predict where along the body (from head to toe) a given image slice is from. Intuitively, matching an image slice to other slices that are similar is likely to perform well, and Ex-Def shows this by selecting K-nearest neighbours (KNN) as the best algorithm to use, with an RMSE of 0.2900. There is a large gap to the next best method available to Ex-Def in terms of training performance, the K-Star algorithm with an RMSE of 0.7320. Random search yielded an RMSE of 0.5595, on average better than Ex-Def's second best algorithm.

The best Auto-WEKA variant outperformed the best baseline on 4/13 datasets, was indistinguishable in 4/13 cases, and was outperformed on 5/13: only on the smallest (Forest Fires) and largest (Slice) datasets did the baseline methods have a relative improvement over Auto-WEKA greater than 4% (random search on Forest Fires and Ex-Def on Slice had 41% and 37% improvements over Auto-WEKA, respectively). Auto-WEKA's relative improvements on Parkinson's Motor and Parkinson's Total were substantial, at 17% and 53%, respectively. Amongst the Auto-WEKA variants, SMAC again often found the best hyperparameter settings when compared against TPE (in 10/13 cases) and I/F-Race (in 8/13 cases). TPE outperformed I/F-Race on 3/13 datasets, and was indistinguishable on 8/13, with I/F-Race achieving better performance on the remaining 2/13 datasets.


Table 4.8: Test performance on regression datasets (RMSE). Bold entries denote performance statistically indistinguishable from the best, according to a Welch's t test with p = 0.01.

                                                  Auto-WEKA
Name                 Ex-Def    Rand. Search   I/F-Race      TPE        SMAC

Forest Fires        63.5548      64.1195      63.7933     64.3147    64.2247
Crime                0.1404       0.1401       0.1375      0.1387     0.1371
Quake                0.1776       0.1792       0.1787      0.1787     0.1793
Abalone              2.1307       2.0739       2.2474      2.2689     2.1825
Parkinson's Motor    0.6323       0.7139       0.4306      0.4765     0.4412
Parkinson's Total    0.7999       0.4646       0.3444      0.4348     0.1565
Comp-Activ           0.1560       0.1342       0.1358      0.1345     0.1348
Bank                 0.0869       0.0865       0.0857      0.0858     0.0862
Pumadyn              0.019938     0.019987     0.019890    0.019974   0.019878
COIL                 0.2328       0.2340       0.2315      0.2325     0.2318

House Census         0.2164       0.2250       0.2172      0.2556     0.2155
Relation Network     0.0274       0.0267       0.0271      0.0263     0.0256
Slice                0.1816       0.2979       1.8123      3.4414     0.2974

4.3.2 Results for test performance

Although achieving good results in terms of cross-validation performance on the training data is important for verifying that Auto-WEKA searches the hyperparameter space effectively, due to the repeated use of the training data during the optimization process these results may not be indicative of the true performance of the selected hyperparameter settings; the practicality of Auto-WEKA needs to be determined by measuring the generalization performance on withheld test data. Amongst our baselines, Ex-Def did surprisingly well against random search considering its simplicity, selecting algorithms with better performance on 5/13 datasets, while random search found better hyperparameter settings on 4/13 datasets. However, in 8/13 cases, the relative differences between Ex-Def and random search were under 3%. Just as with the training data, the choice of KNN by Ex-Def on the Slice dataset was substantially better than random search's mean performance.

Auto-WEKA was still preferable to using any of our baseline methods: the best Auto-WEKA method outperformed the best baseline on 7/13 datasets, and was indistinguishable in 2/13 other cases. Again, the Parkinson's Total and Parkinson's Motor datasets showed substantial performance increases for Auto-WEKA over the baselines, with relative improvements of 61% and 32%, respectively. Between the variants of Auto-WEKA, SMAC outperformed TPE on 10/13 datasets, while TPE had better performance in the remaining 3/13 cases.


Table 4.9: Correlation between performance on the withheld 30% validation data and the training data performance. Gap indicates the difference between the mean training performance and mean test performance from Tables 4.7 and 4.8.

                              I/F-Race                              TPE                                SMAC
Name                 Test (RMSE)  Gap (RMSE)  Corr.    Test (RMSE)  Gap (RMSE)  Corr.    Test (RMSE)  Gap (RMSE)  Corr.

Forest Fires           63.7933     -24.7869   -0.12      64.3147     -31.6966    0.02      64.2247     -31.7423   -0.14
Crime                   0.1375      -0.0039    0.18       0.1387      -0.0068    0.60       0.1371      -0.0054    0.85
Quake                   0.1787      +0.0130   -0.08       0.1787      +0.0100   -0.01       0.1793      +0.0091   -0.21
Abalone                 2.2474      -0.1502    0.18       2.2689      -0.2011    0.61       2.1825      -0.0944    0.67
Parkinson's Motor       0.4306      +0.2612    0.34       0.4765      +0.2040    0.86       0.4412      +0.1149    0.82
Parkinson's Total       0.3444      +0.1950    0.26       0.4348      +0.0825    0.87       0.1565      +0.1511    0.91
Comp-Activ              0.1358      +0.0141   -0.14       0.1345      +0.0126    0.51       0.1348      +0.0101    0.75
Bank                    0.0857      -0.0010    0.03       0.0858      -0.0013    0.80       0.0862      -0.0017    0.82
Pumadyn                 0.0199      +0.0629    0.16       0.0200      +0.0003    0.45       0.0199      +0.0004    0.61
COIL                    0.2315      +0.0000   -0.02       0.2325      -0.0037    0.78       0.2318      -0.0047    0.39

House Census            0.2172      +0.0031    0.04       0.2556      +0.0065    1.00       0.2155      +0.0028    0.79
Relation Network        0.0271      +0.0027    0.58       0.0263      +0.0018    0.79       0.0256      +0.0024    0.95
Slice                   1.8123      +0.2843    0.77       3.4414      +1.4959    1.00       0.2974      +0.1648    1.00

I/F-Race outperformed SMAC on 3/13 datasets, was indistinguishable on 6/13, and was outperformed on 4/13 datasets. I/F-Race in turn outperformed TPE on 6/13 datasets, and was indistinguishable on the remaining 7/13. On Parkinson's Total and Slice, the relative improvement of SMAC's performance over I/F-Race was 83% and 54%, respectively, while the best relative improvement that I/F-Race found over SMAC, on Parkinson's Motor, was under 3%.

As for the classification datasets, we withheld 30% of the training data from Auto-WEKA for use in detecting overfitting, by measuring the Spearman rank coefficient between the cross-validation RMSE and the RMSE on the validation data. Table 4.9 provides these results. We observe the same weak signal that a low correlation indicates potential overfitting. It is also interesting to note that the Quake dataset, which appeared to be a difficult dataset for Auto-WEKA, has slightly anti-correlated performance, even though the gap is quite small.

4.3.3 Selected methods

The relative frequencies with which each regression algorithm was selected by Auto-WEKA are shown in Figure 4.5. Again, no single algorithm dominated across all the datasets, but the algorithms selected appear to be much more dependent upon dataset size than we observed for classification. The single-layer perceptron and linear regression base-methods were frequently chosen for small datasets, while they were never selected on large datasets. KNN, M5-Rules and K-Star were each selected by Auto-WEKA for large datasets more than 10% of the time, and were very unlikely to be selected for small ones. Meta-methods were also heavily chosen, with additive regression the most frequently selected learning method for small datasets and locally weighted learning most often chosen for large datasets.

Figure 4.6 shows the distribution of algorithms that were selected for each dataset. Both Parkinson's datasets contain the same set of features, but require predictions of different quantities, so it is interesting to see that similar methods were selected in both cases – most often additive regression. Unlike classification, regression datasets appear more likely to have a single algorithm that performs very well on them: the single-layer perceptron was frequently chosen for the Abalone and Bank datasets, linear regression was often selected for Comp-Activ, and KNN was often chosen on Slice (the best algorithm that Ex-Def selected, and one with an intuitive applicability). This may be due to the relative responsiveness of RMSE vs. misclassification rate: if a regression learning algorithm makes a large mistake on one prediction, the RMSE will increase drastically, while making a mistake on classification results in a fixed penalty, no matter how 'wrong' the prediction is.

Figure 4.7 (left) shows the distribution of the chosen base algorithms inside the two most frequently chosen meta-algorithms, additive regression and bagging. Under both meta-algorithms, the M5P algorithm was often chosen, followed by M5-Rules and the REP tree. Figure 4.8 shows that across all the meta-methods, tree-based algorithms (decision stump, M5P and REP tree) were the most successful, followed by M5-Rules. It is again interesting to note the strong performance of methods comprising many trees. Linear regression was also a common choice, frequently selected on Comp-Activ and Relation Network. Feature selection was not used as often for regression as it was for classification; Figure 4.7 (right) shows feature selection was chosen for use only 40% of the time for small datasets, and just over 20% of the time for large datasets. Each search method was chosen equally often on small datasets (when feature selection was selected), paired with one of the RELIEF, CFS subset, or principal components evaluators, while the selected methods for large datasets only used the ranker search method paired with the RELIEF evaluator. Since tree-based methods are able to select features on which to split, the relative decrease in the use of feature selection preprocessing on large datasets may be explained by the fact that tree-based methods were chosen more frequently for such data.


Figure 4.5: Distribution of chosen regression algorithms aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants across all small and large datasets, ranked on their frequency of being selected. Meta-methods are marked by a ∗ suffix, ensemble methods by a + suffix.


Figure 4.6: Heat map of chosen regression algorithms aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants for each dataset. A darker colour indicates that the method was selected more often. Meta-methods are marked by a ∗ suffix, ensemble methods by a + suffix. Datasets are sorted by size, regression algorithms are ordered by methodology.


Figure 4.7: Left: distribution of chosen base regression algorithms for the two most frequently selected meta-methods: additive regression and bagging. Right: distribution of chosen feature search and evaluator methods. Both plots are aggregated across all Auto-WEKA variants; None indicates that no feature selection was performed.


Figure 4.8: Heat map of regression algorithms chosen inside all chosen meta-methods, aggregated across the SMAC, I/F-Race and TPE Auto-WEKA variants for each dataset. A darker colour indicates that the method was selected more often. Datasets are sorted by size, regression algorithms are ordered by methodology.


4.4 Other modifications of SMAC-based Auto-WEKA

In both our classification and regression experiments, the Auto-WEKA variant using SMAC was able to find hyperparameter configurations that were better than the baselines and the other Auto-WEKA variants more consistently than any other method. We next investigated some different ways of modifying the SMAC-based Auto-WEKA variant in order to see if we could improve the performance of Auto-WEKA. The first three variants we implemented all focused on how the SMAC algorithm generates its estimates of the loss function given a set of training data, while the last modification simply increased the amount of CPU time that we were willing to use for experiments.

4.4.1 Immediate evaluation of all folds

As described in Section 2.2.1, SMAC refines its internal estimate of the cross-validation error by successively evaluating a model on more folds of data. This allows the Auto-WEKA variant using SMAC (hereafter referred to as SMAC for simplicity) to quickly discard configurations that have inferior performance based on evidence from a single fold. However, by the same token, it can also cause SMAC to discard hyperparameter configurations that are better than the current incumbent when error estimates between the folds are very noisy. We thus considered an alternate version of SMAC, which we term SMAC-10-Batch, that requires the evaluation of all 10 cross-validation folds before either rejecting or accepting a new configuration. Observe that this is similar to the way that TPE operates.

Table 4.10 details our comparisons between the SMAC and SMAC-10-Batch variants for the classification datasets. On the training set, SMAC outperformed SMAC-10-Batch on 19/21 datasets, and when comparing generalization performance, SMAC again dominated SMAC-10-Batch on 19/21 datasets. Table 4.11 tells a similar story for the regression datasets: SMAC outperformed SMAC-10-Batch on 10/13 datasets for training data, and on 11/13 datasets for test data. We believe that these differences in performance resulted from the fact that SMAC-10-Batch did not explore as many different hyperparameter configurations as SMAC, given our fixed time budget. We note, however, that in the limit both variants will end up returning the same hyperparameter settings, since they optimize the same function.


Table 4.10: Comparisons of mean performance obtained between the SMAC and SMAC-10-Batch variants on classification datasets. Bold entries denote performance statistically indistinguishable from the best, according to a Welch's t test with p = 0.01.

                     Training Performance (% Error)     Test Performance (% Error)
Name                   SMAC       SMAC-10-Batch           SMAC      SMAC-10-Batch

Dexter                 6.43           7.20                8.33          9.13
GermanCredit          19.00          19.99               28.57         28.39
Dorothea               5.70           6.10                5.98          6.35
Yeast                 35.97          35.18               39.08         39.73
Amazon                47.60          50.06               32.87         33.09
Secom                  5.31           5.32                7.94          8.37
Semeion                4.89           6.09                5.39          5.84
Car                    0.57           0.70                0.29          0.41
Madelon               22.24          23.97               21.84         22.14
KR-vs-KP               0.35           0.52                0.31          0.75
Abalone               71.66          72.23               73.80         73.78
Wine Quality          34.63          35.60               33.52         33.78
Waveform              11.99          12.17               14.20         14.30
Gisette                2.42           2.79                2.24          2.47
Convex                26.11          28.09               23.13         24.94

CIFAR-10-Small        58.26          61.28               56.04         59.66
MNIST Basic            3.73           5.43                3.56          5.32
Rot. MNIST + BI       60.17          62.48               58.03         60.28
Shuttle               0.0190         0.0306              0.0137        0.0147
KDD09-Appentency      1.7528         1.7516              1.7394        1.8532
CIFAR-10              61.12          65.90               59.65         65.90

Table 4.11: Comparisons of mean performance obtained between the SMAC and SMAC-10-Batch variants on regression datasets. Bold entries denote performance statistically indistinguishable from the best, according to a Welch's t test with p = 0.01.

                     Training Performance (RMSE)      Test Performance (RMSE)
Name                   SMAC       SMAC-10-Batch         SMAC      SMAC-10-Batch

Forest Fires         32.4824        30.2953            64.2247       64.3291
Crime                 0.1316         0.1325             0.1371        0.1389
Quake                 0.1884         0.1888             0.1793        0.1786
Abalone               2.0881         1.8770             2.1825        2.3003
Parkinson's Motor     0.5561         0.6789             0.4412        0.4850
Parkinson's Total     0.3075         0.5864             0.1565        0.4965
Comp-Activ            0.1449         0.1449             0.1348        0.1342
Bank                  0.0845         0.0849             0.0862        0.0863
Pumadyn               0.020236       0.020295           0.019878      0.019982
COIL                  0.2270         0.2280             0.2318        0.2321

House Census          0.2183         0.2226             0.2155        0.2194
Relation Network      0.0280         0.0296             0.0256        0.0271
Slice                 0.4622         2.5006             0.2974        1.9412

46

Figure 4.9: Graphical representation of the training data partitioning scheme used by SMAC-Multi-Level.

4.4.2 Multi-level cross-validation

Since it appears that SMAC benefits from being able to quickly reject possible hyperparameter configurations in order to explore more of the hyperparameter space, we next implemented a variant of Auto-WEKA to investigate this further. SMAC-Multi-Level uses a relatively small portion of the training data when considering new hyperparameter configurations, and slowly uses more and more of the data to differentiate between competing configurations. SMAC-Multi-Level breaks the training data up into 4 levels of 10-fold cross-validation, where each level is a random 70% sample from the previous level (and where the largest level consists of all the training data). Figure 4.9 illustrates this process. When SMAC-Multi-Level begins to investigate a new hyperparameter configuration, it uses the 10 cross-validation folds from level 0. Since there is a smaller amount of data to train and evaluate on at lower levels, SMAC-Multi-Level evaluates these folds more quickly than the folds at higher levels, allowing for rapid rejection of poorly performing hyperparameter settings. Once all of the cross-validation folds at one level have been used, SMAC-Multi-Level then uses folds from the next highest level to further refine its estimate of the loss. This variant aims to reduce the risk of overfitting due to memorizing idiosyncrasies in the training data, since the full set of data at the highest level is only used infrequently to make decisions between competing algorithms.
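A minimal sketch of this partitioning scheme is given below (in Python, index-based; the fold construction is an illustrative assumption and does not reproduce Auto-WEKA's exact splitting code):

```python
# A minimal sketch: four nested levels, each a random 70% subsample of the
# previous one, with 10 cross-validation folds generated per level.
import random

def build_levels(n_points, num_levels=4, fraction=0.7, seed=0):
    rng = random.Random(seed)
    levels = [list(range(n_points))]                 # largest level: all training data
    for _ in range(num_levels - 1):
        prev = levels[0]
        levels.insert(0, rng.sample(prev, round(len(prev) * fraction)))
    return levels                                    # levels[0] is the smallest (level 0)

def ten_folds(indices, seed=0):
    shuffled = indices[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::10] for i in range(10)]      # 10 disjoint validation folds

levels = build_levels(n_points=1000)
print([len(level) for level in levels])              # e.g. [343, 490, 700, 1000]
folds_per_level = [ten_folds(level) for level in levels]
```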

Table 4.12 shows the results from our experiments comparing SMAC and SMAC-Multi-Level on the classification datasets. SMAC outperformed its multi-level variant on 19/21 datasets for training performance, and on 18/21 datasets for test performance. The results for the regression datasets, provided in Table 4.13, show that SMAC-Multi-Level was more competitive with SMAC in this case. In terms of training performance, SMAC outperformed SMAC-Multi-Level on 6/13 regression datasets, while the multi-level version outperformed the vanilla version on 7/13 datasets.


Table 4.12: Comparisons of mean performance obtained between the SMAC and SMAC-Multi-Level variants on classification datasets. Bold entries denote performance statistically indistinguishable from the best, according to a Welch's t test with p = 0.01.

                     Training Performance (% Error)     Test Performance (% Error)
Name                   SMAC      SMAC-Multi-Level         SMAC     SMAC-Multi-Level

Dexter                 6.43           7.20                8.33          9.13
GermanCredit          19.00          19.99               28.57         28.39
Dorothea               5.70           6.10                5.98          6.35
Yeast                 35.97          35.18               39.08         39.73
Amazon                47.60          50.06               32.87         33.09
Secom                  5.31           5.32                7.94          8.37
Semeion                4.89           6.09                5.39          5.84
Car                    0.57           0.70                0.29          0.41
Madelon               22.24          23.97               21.84         22.14
KR-vs-KP               0.35           0.52                0.31          0.75
Abalone               71.66          72.23               73.80         73.78
Wine Quality          34.63          35.60               33.52         33.78
Waveform              11.99          12.17               14.20         14.30
Gisette                2.42           2.79                2.24          2.47
Convex                26.11          28.09               23.13         24.94

CIFAR-10-Small        58.26          61.28               56.04         59.66
MNIST Basic            3.73           5.43                3.56          5.32
Rot. MNIST + BI       60.17          62.48               58.03         60.28
Shuttle               0.0190         0.0306              0.0137        0.0147
KDD09-Appentency      1.7528         1.7516              1.7394        1.8532
CIFAR-10              61.12          65.90               59.65         65.90

Table 4.13: Comparison of mean performance obtained by the SMAC and SMAC-Multi-Level variants on regression datasets. Bold entries denote performance not statistically significantly different from the best, according to Welch's t-test with p = 0.01.

Name                 Training Performance (RMSE)       Test Performance (RMSE)
                     SMAC       SMAC-Multi-Level       SMAC       SMAC-Multi-Level
Forest Fires         32.4824    18.2468                64.2247    64.0548
Crime                0.1316     0.1329                 0.1371     0.1386
Quake                0.1884     0.1844                 0.1793     0.1789
Abalone              2.0881     2.0781                 2.1825     2.0721
Parkinson's Motor    0.5561     1.2509                 0.4412     0.5575
Parkinson's Total    0.3075     1.2997                 0.1565     0.1473
Comp-Activ           0.1449     0.1423                 0.1348     0.1339
Bank                 0.0845     0.0841                 0.0862     0.0858
Pumadyn              0.020236   0.020159               0.019878   0.019990
COIL                 0.2270     0.2243                 0.2318     0.2322

House Census         0.2183     0.2196                 0.2155     0.2161
Relation Network     0.0280     0.0330                 0.0256     0.0264
Slice                0.4622     0.8577                 0.2974     0.3311


while SMAC-Multi-Level found better hyperparameter settings on average in 6/13 cases. However, the relative performance difference was less than 1% in 7/13 cases, in 4 of which the multi-level variant performed better. SMAC's largest advantage was a relative improvement of 20% over SMAC-Multi-Level on the Parkinson's Motor dataset. By contrast, SMAC-Multi-Level's largest advantage was a relative 6% improvement over SMAC on the Parkinson's Total dataset.

From these experiments, we conclude that SMAC is a better choice for classification datasets, and has only a slight advantage on regression problems. This may result from the fact that the cross-validation folds at the smallest levels contain too few data points to provide reliable estimates of generalization performance, causing SMAC-Multi-Level to discard good hyperparameter settings too early. A new variant of Auto-WEKA could be designed to evaluate all of the folds in a level at once, which may prevent the quick rejection of good hyperparameter settings, while the overhead of evaluating many folds simultaneously would be mitigated by the small size of the lower levels. Additionally, SMAC-Multi-Level may only be beneficial when SMAC has a tendency to overfit. The smaller regression datasets (which showed more evidence of overfitting by SMAC than the small classification datasets) highlight this, as SMAC-Multi-Level tended to perform better than SMAC in these cases.

4.4.3 Repeated random subsampling validation (RRSV)

Since SMAC only evaluates hyperparameter settings on additional folds of cross-validation data when it needs more information to determine whether a configuration is better or worse than the current best, there is no obstacle to supplying SMAC with many more than 10 folds. While k-fold cross-validation can be used with large values of k, this causes the number of validation data points in each of the k folds to become very small. Such an approach would not benefit SMAC, since new hyperparameter configurations can be rejected based on a single fold. Repeated random subsampling validation [Kohavi, 1995] provides a way to generate many folds of training and validation data while keeping the size of the validation set fixed. Our SMAC-RRSV variant of Auto-WEKA generates each fold by randomly sampling a fixed percentage of the data to be used for training, while the remaining data is reserved for validation. SMAC-RRSV generates 1 000 samples with a 70-30 split between training and validation data.
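As a rough illustration of this fold-generation scheme, here is a Python sketch under the stated 70-30 split and 1 000 repetitions; the function name and the NumPy-based implementation are assumptions for illustration, not Auto-WEKA's Java code.

    import numpy as np

    def rrsv_folds(n_points, n_repeats=1000, train_frac=0.7, seed=0):
        """Repeated random subsampling validation: each repetition draws a fresh
        random 70% of the data for training and holds out the rest for validation."""
        rng = np.random.RandomState(seed)
        cut = int(round(train_frac * n_points))
        folds = []
        for _ in range(n_repeats):
            perm = rng.permutation(n_points)
            folds.append((perm[:cut], perm[cut:]))   # (train_idx, valid_idx)
        return folds

Unlike k-fold cross-validation with a very large k, every fold here keeps 30% of the data for validation, so the size of the validation set stays fixed no matter how many folds are generated.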

Tables 4.14 and 4.15 show the results of our experiments. For training performance, SMAC-RRSV was better than SMAC on only 4/21 classification datasets, and never


Table 4.14: Comparison of mean performance obtained by the SMAC and SMAC-RRSV variants on classification datasets. Bold entries denote performance not statistically significantly different from the best, according to Welch's t-test with p = 0.01.

Name                 Training Performance (% Error)    Test Performance (% Error)
                     SMAC       SMAC-RRSV              SMAC       SMAC-RRSV
Dexter               6.43       8.13                   8.33       7.28
GermanCredit         19.00      21.40                  28.57      28.12
Dorothea             5.70       5.95                   5.98       6.40
Yeast                35.97      38.04                  39.08      39.94
Amazon               47.60      54.90                  32.87      35.24
Secom                5.31       5.12                   7.94       7.87
Semeion              4.89       6.03                   5.39       5.42
Car                  0.57       2.12                   0.29       0.15
Madelon              22.24      26.70                  21.84      24.05
KR-vs-KP             0.35       0.79                   0.31       0.43
Abalone              71.66      72.70                  73.80      73.42
Wine Quality         34.63      37.30                  33.52      33.89
Waveform             11.99      12.55                  14.20      13.95
Gisette              2.42       2.75                   2.24       2.27
Convex               26.11      26.28                  23.13      22.28

CIFAR-10-Small       58.26      57.99                  56.04      63.79
MNIST Basic          3.73       4.75                   3.56       4.24
Rot. MNIST + BI      60.17      63.30                  58.03      59.72
Shuttle              0.0190     0.0357                 0.0137     0.0081
KDD09-Appentency     1.7528     1.7370                 1.7394     1.7398
CIFAR-10             61.12      52.96                  59.65      69.87

Table 4.15: Comparison of mean performance obtained by the SMAC and SMAC-RRSV variants on regression datasets. Bold entries denote performance not statistically significantly different from the best, according to Welch's t-test with p = 0.01.

Name                 Training Performance (RMSE)       Test Performance (RMSE)
                     SMAC       SMAC-RRSV              SMAC       SMAC-RRSV
Forest Fires         32.4824    43.5640                64.2247    63.7669
Crime                0.1316     0.1347                 0.1371     0.1387
Quake                0.1884     0.1898                 0.1793     0.1789
Abalone              2.0881     2.1236                 2.1825     2.0628
Parkinson's Motor    0.5561     0.9236                 0.4412     0.5318
Parkinson's Total    0.3075     0.7893                 0.1565     0.2798
Comp-Activ           0.1449     0.1466                 0.1348     0.1342
Bank                 0.0845     0.0845                 0.0862     0.0865
Pumadyn              0.020236   0.020331               0.019878   0.019955
COIL                 0.2270     0.2280                 0.2318     0.2293

House Census         0.2183     0.2207                 0.2155     0.2170
Relation Network     0.0280     0.0320                 0.0256     0.0286
Slice                0.4622     1.5409                 0.2974     0.9792


on regression. In terms of generalization performance, SMAC outperformed SMAC-RRSV on 13/21 classification datasets (with relative improvements greater than 12% in 4/21 cases) and on 8/13 regression datasets (with relative improvements greater than 17% in 3/13 cases). These experiments indicate that using SMAC with standard 10-fold cross-validation is preferable to using repeated random subsampling validation to estimate the generalization error. This is consistent with results observed in the literature [Kohavi, 1995], where full 10-fold cross-validation was compared against evaluating many RRSV folds.

4.4.4 Longer runtimes

Our Auto-WEKA experiments all assumed that users would be willing to wait 30 hours on a quad-core machine to find good hyperparameter settings for their dataset, but it is not clear that this is the optimal amount of time to use. In particular, since larger datasets require more time to properly evaluate each candidate hyperparameter configuration, SMAC was not able to search as much of the hyperparameter space in these cases. SMAC-Long is a variant of Auto-WEKA that instead used 120 CPU hours per core, requiring a total of 480 CPU hours of computation (4 times as long as our original experimental runtime of 30 hours per core). Tables 4.16 and 4.17 report our results for SMAC-Long.

Since these experiments are effectively continuations of the original SMAC runs, training performance cannot become worse for SMAC-Long. In fact, we observed some substantial improvements, especially on larger datasets. For classification datasets, MNIST-Basic had a relative improvement of 27%, while the Shuttle dataset had a relative improvement of 36%. The largest regression dataset, Slice, also benefited from longer optimization runs, with an improvement of 38%. However, these improvements in training performance did not always translate into improvements on test data: SMAC-Long was better than SMAC on only 12/21 classification datasets and 8/13 regression datasets, with no statistically indistinguishable results for either dataset type. Improvements occurred more often on the larger datasets, with relative improvements of 45% and 39% on the Shuttle and Slice datasets, respectively. However, there were a large number of datasets where the difference between the two variants was minimal: 8/21 of the classification datasets had a relative improvement of less than 3%, while 8/13 of the regression datasets had a relative improvement of less than 1%.

Figure 4.10 shows the trajectory of the training and test performance over time for two representative small datasets, Amazon and Waveform. For the Amazon dataset, the hyperparameter settings that SMAC-Long found improved dramatically until just around the 30 hour mark, at which point only minimal improvements were made during the remaining hours. On the Waveform dataset, SMAC-Long made good improvements until roughly 20 hours, then stagnated for nearly the rest of the optimization time. It eventually managed to find configurations with excellent training performance, but these overfit the data and generalized poorly. Figure 4.11 shows the trajectory on the two largest datasets for classification and regression, CIFAR-10 and Slice. In both cases, SMAC-Long clearly benefited from having more time to optimize the hyperparameters, as it made improvements right up to the 120 hour mark without evidence of overfitting.

These experiments show that SMAC sometimes benefits from longer runtimes, but the results may become more susceptible to overfitting as the runtime lengthens. In future work, it would be sensible for Auto-WEKA to take into account the size of the dataset when determining the time budget, or to use an adaptive approach that determines when to stop the optimization. For example, we might terminate the optimization if the relative improvement of the best hyperparameter settings falls below 10% over 200 newly considered configurations. This in turn raises the problem of automatically setting these thresholds, as too large a threshold would still allow for overfitting, while too small a threshold would result in selected hyperparameter settings with poor performance.
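A minimal sketch of such an adaptive stopping rule, using the illustrative 10% threshold and 200-configuration window mentioned above (the function and its inputs are hypothetical and not part of Auto-WEKA):

    def should_stop(incumbent_errors, window=200, min_rel_improvement=0.10):
        """Stop when the incumbent's validation error has not improved by at least
        min_rel_improvement (relative) over the last `window` evaluated configurations.
        incumbent_errors[i] = best validation error seen after configuration i."""
        if len(incumbent_errors) <= window:
            return False
        old, new = incumbent_errors[-window - 1], incumbent_errors[-1]
        if old <= 0:
            return True               # error is already zero; nothing left to gain
        return (old - new) / old < min_rel_improvement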


Table 4.16: Comparison of mean performance obtained by the SMAC and SMAC-Long variants on classification datasets. Bold entries denote performance not statistically significantly different from the best, according to Welch's t-test with p = 0.01.

Name                 Training Performance (% Error)    Test Performance (% Error)
                     SMAC       SMAC-Long              SMAC       SMAC-Long
Dexter               6.43       5.41                   8.33       6.70
GermanCredit         19.00      18.86                  28.57      29.14
Dorothea             5.70       5.41                   5.98       6.55
Yeast                35.97      29.67                  39.08      38.47
Amazon               47.60      40.36                  32.87      26.32
Secom                5.31       4.44                   7.94       8.15
Semeion              4.89       4.75                   5.39       5.53
Car                  0.57       0.55                   0.29       0.17
Madelon              22.24      19.58                  21.84      20.26
KR-vs-KP             0.35       0.30                   0.31       0.45
Abalone              71.66      71.11                  73.80      74.13
Wine Quality         34.63      34.57                  33.52      33.79
Waveform             11.99      10.31                  14.20      16.11
Gisette              2.42       2.14                   2.24       2.02
Convex               26.11      22.52                  23.13      20.17

CIFAR-10-Small       58.26      55.23                  56.04      53.24
MNIST Basic          3.73       2.72                   3.56       2.74
Rot. MNIST + BI      60.17      47.15                  58.03      56.30
Shuttle              0.0190     0.0121                 0.0137     0.0075
KDD09-Appentency     1.7528     1.6438                 1.7394     1.7400
CIFAR-10             61.12      55.11                  59.65      53.23

Table 4.17: Comparison of mean performance obtained by the SMAC and SMAC-Long variants on regression datasets. Bold entries denote performance not statistically significantly different from the best, according to Welch's t-test with p = 0.01.

Name                 Training Performance (RMSE)       Test Performance (RMSE)
                     SMAC       SMAC-Long              SMAC       SMAC-Long
Forest Fires         32.4824    31.1628                64.2247    64.3557
Crime                0.1316     0.1311                 0.1371     0.1377
Quake                0.1884     0.1874                 0.1793     0.1791
Abalone              2.0881     2.0707                 2.1825     2.0673
Parkinson's Motor    0.5561     0.5129                 0.4412     0.4511
Parkinson's Total    0.3075     0.2316                 0.1565     0.1193
Comp-Activ           0.1449     0.1436                 0.1348     0.1344
Bank                 0.0845     0.0840                 0.0862     0.0855
Pumadyn              0.020236   0.020148               0.019878   0.019819
COIL                 0.2270     0.2262                 0.2318     0.2316

House Census         0.2183     0.2175                 0.2155     0.2156
Relation Network     0.0280     0.0270                 0.0256     0.0798
Slice                0.4622     0.2863                 0.2974     0.1796


Trajectory of Training and Test Performance for Amazon

Trajectory of Training and Test Performance for Waveform

Figure 4.10: Trajectories of training and test performance over time for two small datasets. The vertical black line indicates the original 30 hour time budget. Shaded areas show the 10-90% quantile from the bootstrapped samples.


Trajectory of Training and Test Performance for CIFAR-10

Trajectory of Training and Test Performance for Slice

Figure 4.11: Trajectories of training and test performance over time for two large datasets. The vertical black line indicates the original 30 hour time budget. Shaded areas show the 10-90% quantile from the bootstrapped samples.


Chapter 5

Conclusion and future work

In this work, we have shown that the crucial problem of combined algorithm selection and hyperparameter optimization (CASH) can be solved by a practical, fully automated tool. This is made possible by recent Bayesian optimization techniques that iteratively build models of the algorithm/hyperparameter landscape and leverage these models to identify new, promising points in the space to investigate. We built a tool, Auto-WEKA, that draws on the full range of classification and regression algorithms in WEKA and enables even novice users to build high-quality predictive models for given application scenarios. An extensive empirical comparison on 34 prominent datasets showed that Auto-WEKA outperformed standard algorithm selection and hyperparameter optimization methods such as grid search and random search. We empirically compared several different approaches for using the optimizers SMAC, TPE and I/F-Race to search Auto-WEKA's high-dimensional hyperparameter space. In the end, we recommend an Auto-WEKA variant based on SMAC and using 10-fold cross-validation. We have written a freely downloadable software package to make Auto-WEKA easy for end-users to access; it is available at www.cs.ubc.ca/labs/beta/Projects/autoweka/.

We see several promising avenues for future work. First, Auto-WEKA shows larger improvements on cross-validation (training) performance than on test performance, suggesting that a more sophisticated method for detecting and avoiding overfitting should be investigated. This is particularly important for model-based optimization methods, as hyperparameter settings that appear artificially good tend to result in models that do not accurately reflect the true generalization performance of nearby hyperparameter settings, misleading the optimization process. Second, we see potential value in extending Auto-WEKA to allow parameter sharing between learning algorithms used within ensemble methods, thereby allowing the hyperparameters of the base methods inside an ensemble method to be treated identically inside the model, regardless of which of the five inner base-method slots is being used. Such parameter sharing would greatly reduce the size of the search space and would likely increase how often ensemble methods are selected by Auto-WEKA.

Future researchers could investigate the topology of the hyperparameter space, for example by grouping different algorithms according to their methodology (rule-based, tree-based, etc.). This would allow the optimization methods to learn more abstractly which types of methods perform better on particular datasets. Auto-WEKA's optimization methods also do not take into account the time required to train a new model when selecting the next candidate to evaluate. We could incorporate the approach of Snoek et al. [2011], which uses runtime predictions to avoid evaluating hyperparameter settings that would likely exceed the time limit.

Despite the name of our tool, there is no reason why other machine learning packages could not be supported, allowing for learning algorithms that leverage GPUs, such as Cuda-ConvNet [Krizhevsky et al., 2012], or for dealing with datasets that are too big to fit entirely into memory on a single machine. Finally, we could use Auto-WEKA as an inner loop for training ensembles of machine learning algorithms by iteratively adding methods with maximal marginal contribution. This idea is conceptually related to the Hydra approach for constructing algorithm selectors [Xu et al., 2010].
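As a conceptual sketch of this greedy ensemble-construction idea (hypothetical Python: `candidates` would be models produced by Auto-WEKA runs and `validate` a user-supplied callback returning the validation error of a combined ensemble; neither exists in the current tool):

    def greedy_ensemble(candidates, validate, max_size=5):
        """Repeatedly add the candidate whose inclusion yields the largest drop in
        the ensemble's validation error (its marginal contribution); stop when no
        candidate improves the ensemble or max_size is reached."""
        ensemble, best_err = [], float("inf")
        for _ in range(max_size):
            best_cand, best_cand_err = None, best_err
            for model in candidates:
                err = validate(ensemble + [model])   # e.g. error of the combined prediction
                if err < best_cand_err:
                    best_cand, best_cand_err = model, err
            if best_cand is None:                    # no candidate helps any further
                break
            ensemble.append(best_cand)
            best_err = best_cand_err
        return ensemble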

While the CASH problem can be solved, there are still many avenues for improvement. We believe that this topic deserves the attention of other machine learning researchers. Automated methods should be able to produce trained learning algorithms that make high-quality predictions, thereby making machine learning accessible to a larger demographic.


Bibliography

P. Balaprakash, M. Birattari, and T. Stutzle. Improvement strategies for the F-Race algorithm: Sampling design and iterative refinement. In Hybrid Metaheuristics, pages 108–122. Springer, 2007.

R. Bardenet, M. Brendel, B. Kegl, and M. Sebag. Collaborative hyperparameter tuning. In Proceedings of the 30th International Conference on Machine Learning, pages 199–207, 2013.

Y. Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–1900, 2000.

J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.

J. Bergstra, R. Bardenet, Y. Bengio, and B. Kegl. Algorithms for hyper-parameter optimization. In Proceedings of the Annual Conference on Neural Information Processing Systems, pages 2546–2554, 2011.

M. Birattari, T. Stutzle, L. Paquete, and K. Varrentrapp. A racing algorithm for configuring metaheuristics. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 11–18, 2002.

L. Bottou. On-line Learning in Neural Networks, chapter On-line Learning and Stochastic Approximations, pages 9–42. Cambridge University Press, 1998.

H. Bozdogan. Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3):345–370, 1987.

W. J. Conover. Practical nonparametric statistics. John Wiley & Sons, 1998.

T. Desautels, A. Krause, and J. Burdick. Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization. In Proceedings of the 29th International Conference on Machine Learning, pages 1191–1198, 2012.

J. Dubois-Lacoste, M. Lopez-Ibanez, and T. Stutzle. A hybrid TP+PLS algorithm for bi-objective flow-shop scheduling problems. Computers & Operations Research, 38(8):1219–1236, 2011.


A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, 2009.

X. Guo, J. Yang, C. Wu, C. Wang, and Y. Liang. A novel LS-SVMs hyper-parameter selection based on particle swarm optimization. Neurocomputing, 71(16):3211–3215, 2008.

M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The WEKA data mining software: an update. ACM Special Interest Group on Knowledge Discovery and Data Mining Explorations Newsletter, 11(1):10–18, 2009.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

F. Hutter, H. Hoos, K. Leyton-Brown, and T. Stutzle. ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36(1):267–306, 2009.

F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the Learning and Intelligent Optimization Conference, pages 507–523, 2011.

F. Hutter, H. H. Hoos, and K. Leyton-Brown. Parallel algorithm configuration. In Proceedings of the Learning and Intelligent Optimization Conference, pages 55–70, 2012.

Y. Jin and B. Sendhoff. Pareto-based multiobjective machine learning: An overview and case studies. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(3):397–415, 2008.

D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492, 1998.

R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1137–1145, 1995.

A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, pages 1106–1114, 2012.


R. Leite, P. Brazdil, and J. Vanschoren. Selecting classification algorithms with active testing. In Proceedings of the International Conference on Machine Learning and Data Mining, pages 117–131, 2012.

O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Proceedings of the Annual Conference on Neural Information Processing Systems, pages 59–66, 1994.

B. Pfahringer, H. Bensusan, and C. Giraud-Carrier. Meta-learning by landmarking various learning algorithms. In Proceedings of the International Conference on Machine Learning, pages 743–750, 2000.

C. E. Rasmussen, R. M. Neal, G. E. Hinton, D. van Camp, M. Revow, Z. Ghahramani, R. Kustra, and R. Tibshirani. Delve machine learning repository, 1997. URL http://www.cs.toronto.edu/~delve.

T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Ruckstieß, and J. Schmidhuber. PyBrain. Journal of Machine Learning Research, pages 743–746, 2010.

M. Schonlau, W. J. Welch, and D. R. Jones. Global versus local search in constrained optimization of computer models. Lecture Notes-Monograph Series, 34:11–25, 1998.

G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

J. Snoek, H. Larochelle, and R. Adams. Opportunity cost in Bayesian optimization. In NIPS Workshop on Bayesian Optimization, Sequential Experimental Design, and Bandits, 2011. Published online.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the Annual Conference on Neural Information Processing Systems, pages 2960–2968, 2012.

N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022, 2010.

V. Strijov and G. Weber. Nonlinear regression model generation using hyperparameter optimization. Computers & Mathematics with Applications, 60(4):981–988, 2010.

Q. Sun and B. Pfahringer. Pairwise meta-rules for better meta-learning-based algorithm ranking. Machine Learning, 93(1):141–161, 2013.

C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 847–855, 2013.


T. Van Gestel, J. A. Suykens, B. Baesens, S. Viaene, J. Vanthienen, G. Dedene, B. De Moor, and J. Vandewalle. Benchmarking least squares support vector machine classifiers. Machine Learning, 54(1):5–32, 2004.

R. Vilalta and Y. Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95, Oct. 2002.

L. Xu, H. H. Hoos, and K. Leyton-Brown. Hydra: Automatically configuring algorithms for portfolio-based selection. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pages 210–216, 2010.

M. Zlochin, M. Birattari, N. Meuleau, and M. Dorigo. Model-based search for combinatorial optimization: A critical survey. Annals of Operations Research, 131(1-4):373–395, 2004.


Appendix A

Method Comparison Results

The following tables provide the details of the pairwise comparisons of the empirical performance of each Auto-WEKA method and baseline. For both classification and regression, each entry reports the number of datasets on which the row method's performance is statistically significantly better than the column method's. Statistical significance was determined using Welch's t-test with p = 0.01. For example, in Table A.1 we see that SMAC outperforms limited grid search 20 times, while limited grid search never outperforms SMAC; on one dataset (Dexter) there is no significant difference between SMAC and limited grid search.
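A Python sketch of how such pairwise win counts can be tallied, assuming a hypothetical data layout in which each method maps every dataset name to its list of per-run errors; Welch's t-test is available in SciPy as ttest_ind with equal_var=False.

    from scipy import stats

    def count_wins(results_a, results_b, alpha=0.01):
        """Count datasets on which method A has lower mean error than method B and
        the difference is significant under Welch's t-test (unequal variances)."""
        wins = 0
        for dataset in results_a:
            a, b = results_a[dataset], results_b[dataset]
            _, p = stats.ttest_ind(a, b, equal_var=False)   # Welch's t-test
            if p < alpha and sum(a) / len(a) < sum(b) / len(b):
                wins += 1
        return wins

Applying count_wins to every ordered pair of methods reproduces the structure of the tables below.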


Table A.1: Number of statistically significant wins on training performance of each method (rows) compared against each other method (columns) on classification datasets. Columns appear in the order: Ex-Def, Grid Search, Lim. Grid Search, Rand. Search, SMAC, TPE, I/F-Race, Best of Auto-WEKA, Best Baseline, Best Limited Baseline, SMAC-10-Batch, SMAC-Multi-Level, SMAC-RRSV, SMAC-Long.

Ex-Def                 –  3  7  9  1  6  2  1  0  0  4 11  3  0
Grid Search           17  – 19 12  8 15 13  8  0 10 11 16 15  6
Lim. Grid Search      14  0  –  6  0 10  4  0  0  0  6 14 10  0
Rand. Search          10  7  7  –  2 10  4  2  0  0  5 10  6  0
SMAC                  20 12 20 18  – 20 15  0 10 16 19 19 17  0
TPE                   14  5 11 10  1  –  3  0  4  7  5 14  9  0
I/F-Race              13  4  7  9  0  8  –  0  2  3  2 11  7  0
Best of Auto-WEKA     20 12 20 18  1 20 15  – 10 16 20 19 17  0
Best Baseline         18  8 18 13 10 17 13 10  – 10 15 17 16  6
Best Limited Baseline 15  8 11 10  3 14  8  3  0  – 11 16 13  0
SMAC-10-Batch         17 10 14 13  2 16  9  1  5  8  – 18 12  0
SMAC-Multi-Level      10  4  6  7  2  7  4  2  4  5  3  –  3  0
SMAC-RRSV             18  6 11 12  4 12  4  4  5  8  9 17  –  1
SMAC-Long             21 14 21 20 21 21 21 21 13 20 21 21 20  –

Table A.2: Number of statistically significant wins on test performance of each method (rows) compared against each other method (columns) on classification datasets. Columns appear in the order: Ex-Def, Grid Search, Lim. Grid Search, Rand. Search, SMAC, TPE, I/F-Race, Best of Auto-WEKA, Best Baseline, Best Limited Baseline, SMAC-10-Batch, SMAC-Multi-Level, SMAC-RRSV, SMAC-Long.

Ex-Def                 –  5  9  9  4  9  4  3  0  0 11  6  6  5
Grid Search           13  –  9 13  7 12 10  5  0  9 14 11  9  8
Lim. Grid Search       9  6  –  9  4  8  6  2  2  0  9  8  6  6
Rand. Search           7  4  5  –  2  6  3  2  0  0  5  4  2  5
SMAC                  17 12 16 15  – 16 11  0 10 12 19 18 13  9
TPE                   10  7 12 11  5  –  6  0  4  6 12 12  8  7
I/F-Race               8  3  7  6  0  5  –  0  0  1  7  6  2  2
Best of Auto-WEKA     18 12 18 15  5 14 13  – 10 14 21 18 14  9
Best Baseline         17  6 15 13  9 14 12  7  –  9 17 13 12 11
Best Limited Baseline 11  8 10 13  7 13 10  6  2  – 14 10 10  8
SMAC-10-Batch          9  5 10 11  1  8  4  0  2  4  – 10  7  4
SMAC-Multi-Level      14  8 10 10  3  9  5  2  5  7 10  –  5  3
SMAC-RRSV             15 10 11 13  8 13  8  5  8  7 14 15  –  9
SMAC-Long             16 12 14 14 12 13 13 11  9 12 16 15 12  –


Table A.3: Number of statistically significant wins on training performance of each method (rows) compared against each other method (columns) on regression datasets. Columns appear in the order: Ex-Def, Rand. Search, SMAC, TPE, I/F-Race, Best of Auto-WEKA, Best Baseline, SMAC-10-Batch, SMAC-Multi-Level, SMAC-RRSV, SMAC-Long.

Ex-Def                 –  4  3  5  4  3  0  4  5  6  1
Rand. Search           9  –  3  7  4  3  0  7  6  9  2
SMAC                  10  4  – 10  8  0  3 10  6 13  0
TPE                    8  4  2  –  3  0  2  4  5  9  1
I/F-Race               4  1  0  2  –  0  0  1  3  4  0
Best of Auto-WEKA     10  5  2 10  9  –  4 10  7 13  1
Best Baseline          9  4  5  9  6  5  – 10  7 11  2
SMAC-10-Batch          8  3  2  7  4  2  2  –  5  9  2
SMAC-Multi-Level       8  5  7  8  5  6  5  8  – 10  4
SMAC-RRSV              6  0  0  4  2  0  0  3  3  –  0
SMAC-Long             12  8 13 12 11 12  8 11  9 13  –

Table A.4: Number of statistically significant wins on test performance of each method (rows) compared against each other method (columns) on regression datasets. Columns appear in the order: Ex-Def, Rand. Search, SMAC, TPE, I/F-Race, Best of Auto-WEKA, Best Baseline, SMAC-10-Batch, SMAC-Multi-Level, SMAC-RRSV, SMAC-Long.

Ex-Def                 –  5  4  6  2  3  0  6  4  6  3
Rand. Search           4  –  2  4  1  2  0  2  0  2  1
SMAC                   9  6  – 10  4  0  6 11  7  8  5
TPE                    7  4  3  –  0  0  4  6  5  4  3
I/F-Race               6  6  3  6  –  0  5  7  5  5  2
Best of Auto-WEKA      9  7  4  9  4  –  7 10  8  8  4
Best Baseline          4  5  5  7  2  4  –  6  4  6  3
SMAC-10-Batch          7  3  2  5  0  1  3  –  4  3  3
SMAC-Multi-Level       9  6  6  7  3  3  7  9  –  8  4
SMAC-RRSV              7  7  5  8  1  3  4  9  5  –  6
SMAC-Long             10  7  8 10  4  5  8  9  9  7  –
