
Lecture 17: Bayesian Optimization

• Pure Exploration

• Bayesian Optimization

• Acquisition Functions

• Practical challenges


Recall: Online function approximation

(Figure: a noisy objective function over the input space [0, 1].)

• Sequentially select locations at which to observe a function $f$

• Noisy observations

• Gathering an observation is not free


Recall: Stochastic bandit with structured actions

(Figure: a reward function over the action space [0, 1].)

• Action space $\mathcal{X} \subseteq \mathbb{R}^n$

• Reward function $f : \mathcal{X} \to \mathbb{R}$

• For each round $t$:
  1. Select an action $x_t \in \mathcal{X}$
  2. Observe reward $r_t = f(x_t) + \epsilon_t$, where $\epsilon_t$ is observation noise

Goal: Maximize $\sum_{t=1}^{T} f(x_t)$ → play $x^\star = \arg\max_{x \in \mathcal{X}} f(x)$


Exploration/Exploitation tradeoff

Minimize regret: $R_T = \sum_{t=1}^{T} \left( f(x^\star) - f(x_t) \right)$

• Learning while earning

• Maximize performance during learning

• No dedicated learning phase

• Examples:

– Treatment optimization
– Tuning with A/B testing on customers


Pure Exploration

• Finite budget for exploration (evaluations of f are expensive)

• Dedicated exploration phase

• Exploration does not hurt during this phase

• Examples:

– Tuning with A/B testing on a user study
– Parameter tuning in machine learning algorithms


Example: Tuning GUI with A/B testing

What position of buttons/windows/ads brings more clicks? More downloads? More purchases?

Real customers vs. a user study?


Example: Parameter tuning in ML algorithms

From: altexsoft.com

How many hidden layers? How many units per hidden layer? Learning rate? Regularization?


AutoML

• Automated data preparation/task detection (e.g., binary classification, regression, clustering, ranking)

• Automated feature engineering (e.g., feature selection)

• Automated model selection

• Hyperparameter optimization

• Automated analysis of results


Option 1: Use previous knowledge

• Select parameters as seen in previous papers

• Manual tuning

• Graduate student search

Not really efficient...


Option 2: Grid search

(Figure 1 from Bergstra and Bengio, comparing a grid layout and a random layout of nine trials over an important and an unimportant parameter: "Grid and random search of nine trials for optimizing a function f(x, y) = g(x) + h(y) ≈ g(x) with low effective dimensionality. Above each square g(x) is shown in green, and left of each square h(y) is shown in yellow. With grid search, nine trials only test g(x) in three distinct places. With random search, all nine trials explore distinct values of g. This failure of grid search is the rule rather than the exception in high-dimensional hyper-parameter optimization.")

Bergstra and Bengio, 2012, Random Search for Hyper-Parameter Optimization

Scales exponentially with the number of dimensions. How do you adapt the grid to the function's smoothness?


Option 3: Random search

(Same figure as on the previous slide.)

Bergstra and Bengio, 2012, Random Search for Hyper-Parameter Optimization

Better than grid search, but still expensive to guarantee good coverage


Another option: Bayesian optimization (BO)

• Sequentially gather observations at different locations in the input space

• Exploit the underlying structure of the input space ⇒ Kernel regression

• Bayesian view: Model expectation and uncertainty ⇒ Gaussian Processes

• Once the budget is over, recommend the best location

(Figure: a GP fit to a few observations, showing the true function $f$ and posterior samples.)


Recall: Gaussian Processes

• Distribution over functions

• Tractable Bayesian modeling of functions, even with infinitely many basis functions

• Input space (the search space) $\mathcal{X}$

• Kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$

Every finite set of $N$ points $\{x_1, \ldots, x_N\}$ induces an $N$-dimensional Gaussian distribution, which is the posterior distribution on $\{f(x_1), \ldots, f(x_N)\}$

(Same GP figure as on the previous slide.)


Recall: Gaussian Processes posterior/prediction

• Consider noisy observations $y = f(x) + \epsilon = \phi(x)^\top \theta^\star + \epsilon$

• With Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$

• With Gaussian prior on parameters $\theta^\star \sim \mathcal{N}_d(0, I_d)$

$$[f(x)]_{x \in \mathcal{X}} \sim \mathcal{N}_{|\mathcal{X}|}\left( [f_t(x)]_{x \in \mathcal{X}},\ [k_t(x, x')]_{x, x' \in \mathcal{X}} \right)$$

with

$$f_t(x) = k(x)^\top (K + \sigma^2 I_t)^{-1} y$$

$$k_t(x, x') = k(x, x') - k(x)^\top (K + \sigma^2 I_t)^{-1} k(x') \quad \text{and} \quad s_t^2(x) = k_t(x, x)$$
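As a concrete illustration, here is a minimal NumPy sketch of these posterior formulas. It assumes a kernel function k(A, B) returning the matrix of pairwise kernel values and a zero prior mean; the names are illustrative, not from the slides.

import numpy as np

def gp_posterior(X_query, X_obs, y, k, sigma2):
    # Gram matrix K over observed inputs, and cross-covariances k(x)
    K = k(X_obs, X_obs)                                   # t x t
    K_star = k(X_obs, X_query)                            # t x m
    # A = (K + sigma^2 I_t)^{-1} k(x), shared by mean and covariance
    A = np.linalg.solve(K + sigma2 * np.eye(len(X_obs)), K_star)
    mean = A.T @ y                                        # f_t(x)
    cov = k(X_query, X_query) - K_star.T @ A              # k_t(x, x')
    return mean, np.diag(cov)                             # s_t^2(x) on the diagonal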


BO Historical overview

• Optimal design of experiments (Kirstine Smith, 1918)

• Response surface methods (Box and Wilson, 1951)

• Bayesian optimization (Kushner, 1964, then Mockus, 1978)

• Boost of attention in ML starting 2007

→ It can be used to tune ML hyperparameters!

From: altexsoft.com


Budgeted function optimization setting

(Figure: an objective function over the input space [0, 1].)

• Input space $\mathcal{X} \subseteq \mathbb{R}^n$ (the "action space" of the bandit setting)

• Objective function $f : \mathcal{X} \to \mathbb{R}$ (the "reward function")

• For each round $t = 1 \ldots T$:
  1. Select an input location $x_t \in \mathcal{X}$
  2. Observe output $y_t = f(x_t) + \epsilon_t$, where $\epsilon_t$ is observation noise

• After $T$ rounds: recommend a solution $x_T$

Goal: Minimize $|f(x_T) - f(x^\star)|$ → recommend $x^\star = \arg\max_{x \in \mathcal{X}} f(x)$


Simple regret

Minimize simple regret: $R_T = |f(x_T) - f(x^\star)|$

• You are committing to a single solution xT

• If $x_T$ is bad, you'll do badly forever

• Examples:

– If you select poor hyperparameters for your algorithm
– If you select a poor GUI for your website

Given some previously sampled locations $x_1, \ldots, x_t$ with observations $y_1, \ldots, y_t$, at which location $x_{t+1}$ should I get my next sample? ⇒ With the intent of minimizing $R_{t+1}$


Acquisition function

• Input space $\mathcal{X} \subseteq \mathbb{R}^n$

• Objective function $f : \mathcal{X} \to \mathbb{R}$

• For each round $t = 1 \ldots T$:
  1. Select the input location $x_t \in \mathcal{X}$ that maximizes an acquisition function
  2. Observe output $y_t = f(x_t) + \epsilon_t$, where $\epsilon_t$ is observation noise

• After $T$ rounds: recommend a solution $x_T$

Goal: Minimize $|f(x_T) - f(x^\star)|$ → recommend $x^\star = \arg\max_{x \in \mathcal{X}} f(x)$
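Schematically, the whole loop fits in a few lines. Below is a sketch over a finite candidate set; the surrogate and acquisition interfaces are placeholders for this sketch, not a particular library's API.

import numpy as np

def bayesian_optimization(f, candidates, surrogate, acquisition, T, seed=0):
    rng = np.random.default_rng(seed)
    # Seed with one random evaluation so the surrogate can be fit
    X_obs = [candidates[rng.integers(len(candidates))]]
    y_obs = [f(X_obs[0])]
    for t in range(1, T):
        surrogate.fit(np.array(X_obs), np.array(y_obs))
        scores = acquisition(np.array(candidates), surrogate)  # high = worth sampling
        x_t = candidates[int(np.argmax(scores))]
        X_obs.append(x_t)
        y_obs.append(f(x_t))                                   # noisy observation
    # Once the budget is spent, recommend the best location seen
    return X_obs[int(np.argmax(y_obs))]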


Exploration/Exploitation tradeoff under pure exploration

Acquisition function should be:

• high for points we expect to be better than what we know

• high for points we’re uncertain about

• low for points we know

How different is it from the exploration/exploitation tradeoff when minimizing regret?


Acquisition function in motion

Figure 1: An example of using Bayesian optimization on a toy 1D design problem. The figures show a Gaussian process (GP) approximation of the objective function over four iterations of sampled values of the objective function. The figure also shows the acquisition function in the lower shaded plots. The acquisition is high where the GP predicts a high objective (exploitation) and where the prediction uncertainty is high (exploration); areas with both attributes are sampled first. Note that the area on the far left remains unsampled: while it has high uncertainty, it is (correctly) predicted to offer little improvement over the highest observation.

The posterior captures our updated beliefs about the unknown objective function. One may also interpret this step of Bayesian optimization as estimating the objective function with a surrogate function (also called a response surface), described formally in §2.1 with the posterior mean function of a Gaussian process.

To sample efficiently, Bayesian optimization uses an acquisition function to determine the next location $x_{t+1} \in \mathcal{A}$ to sample. The decision represents an automatic trade-off between exploration (where the objective function is very uncertain) and exploitation (trying values of $x$ where the objective function is expected to be high). This optimization technique has the nice property that it aims to minimize the number of objective function evaluations. Moreover, it is likely to do well even in settings where the objective function has multiple local maxima.

Brochu et al., 2010, A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning


Probability of Improvement (PI)

• Consider that we are searching for $x^\star = \arg\max_{x \in \mathcal{X}} f(x)$

• Noise-free setting

• Let $f^+_{t-1} = \max_{x \in \{x_1, \ldots, x_{t-1}\}} f(x)$

• Probability of improvement at location $x$: $\mathrm{PI}_t(x) = \Pr[f(x) \geq f^+_{t-1}]$, which under the GP posterior equals $\Phi\!\left( \frac{f_{t-1}(x) - f^+_{t-1}}{s_{t-1}(x)} \right)$

Does not consider the magnitude of improvement
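A sketch of this criterion using the closed form above; gpr stands for any GP surrogate exposing a posterior mean and standard deviation, as in the EI code later in the lecture.

import numpy as np
from scipy.stats import norm

def probability_of_improvement(X, f_best, gpr):
    # PI(x) = Phi((mean(x) - f_best) / std(x)); set to 0 where the GP is certain
    mu, sigma = gpr.predict(X, return_std=True)
    return np.where(sigma > 0.0,
                    norm.cdf((mu - f_best) / np.maximum(sigma, 1e-12)),
                    0.0)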


Expected Improvement (EI)

• Consider that we are searching for $x^\star = \arg\max_{x \in \mathcal{X}} f(x)$

• Noise-free setting

• Let $f^+_{t-1} = \max_{x \in \{x_1, \ldots, x_{t-1}\}} f(x)$

• Improvement at location $x$: $I_t(x) = \max\left( f(x) - f^+_{t-1},\ 0 \right)$

• Expected improvement: select $x_t = \arg\max_{x \in \mathcal{X}} \mathbb{E}[I_t(x) \mid x_1, y_1, \ldots, x_{t-1}, y_{t-1}]$


Computing EI

• Recall: Noise-free setting

• Likelihood of an improvement $I$ under the normal posterior distribution, parameterized by the posterior mean $f_{t-1}(x)$ and variance $s^2_{t-1}(x)$:

$$\frac{1}{\sqrt{2\pi}\, s_{t-1}(x)} \exp\left( -\frac{\left( f_{t-1}(x) - f^+_{t-1} - I \right)^2}{2 s^2_{t-1}(x)} \right)$$

Integrating out $I$ (time indices dropped for brevity):

$$\mathbb{E}[I] = \int_0^\infty I\, \frac{1}{\sqrt{2\pi}\, s(x)} \exp\left( -\frac{(f(x) - f^+ - I)^2}{2 s^2(x)} \right) dI = s(x) \left[ \frac{f(x) - f^+}{s(x)}\, \Phi\!\left( \frac{f(x) - f^+}{s(x)} \right) + \phi\!\left( \frac{f(x) - f^+}{s(x)} \right) \right]$$

→ $\Phi$ and $\phi$ are the CDF and PDF of the standard normal distribution


Computing EI in practice

• Observations are noisy → we don't have $f^+_{t-1}$ directly

• Use $f^+_{t-1} = \max_{x \in \{x_1, \ldots, x_{t-1}\}} f_{t-1}(x)$

• Compensate for uncertainty with an exploration parameter $\xi$:

$$Z_t(x) = \begin{cases} \dfrac{f_{t-1}(x) - f^+_{t-1} - \xi}{s_{t-1}(x)} & \text{if } s_{t-1}(x) > 0 \\ 0 & \text{if } s_{t-1}(x) = 0 \end{cases}$$

$$\mathbb{E}[I_t(x)] = \begin{cases} \left( f_{t-1}(x) - f^+_{t-1} - \xi \right) \Phi(Z_t(x)) + s_{t-1}(x)\, \phi(Z_t(x)) & \text{if } s_{t-1}(x) > 0 \\ 0 & \text{if } s_{t-1}(x) = 0 \end{cases}$$

• In practice, ξ = 0.01 (scaled by noise variance if necessary) works well


EI example

import numpy as np
from scipy.stats import norm

def expected_improvement(X, X_sample, gpr, xi=0.01):
    # Posterior mean and standard deviation at the candidate locations X
    f_hat, s_hat = gpr.predict(X, return_std=True)
    # Posterior mean at the locations already sampled; its maximum plays
    # the role of f+ (we do not observe f directly under noise)
    f_hat_sample = gpr.predict(X_sample)
    f_opt = np.max(f_hat_sample)

    improvement = f_hat - f_opt - xi
    with np.errstate(divide="ignore", invalid="ignore"):
        Z = improvement / s_hat
        ei = improvement * norm.cdf(Z) + s_hat * norm.pdf(Z)
    ei[s_hat == 0.0] = 0.0  # no expected improvement where the GP is certain
    return ei
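For instance, with scikit-learn's GaussianProcessRegressor as the surrogate (one possible choice; the toy objective below is made up for illustration), the function plugs in as:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

f = lambda x: np.sin(3 * x) - x**2                 # toy objective on [0, 1]
rng = np.random.default_rng(0)
X_sample = rng.uniform(0, 1, (5, 1))               # locations sampled so far
y_sample = f(X_sample).ravel() + 0.05 * rng.standard_normal(5)

gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=0.05**2)
gpr.fit(X_sample, y_sample)

X = np.linspace(0, 1, 200).reshape(-1, 1)          # candidate grid
ei = expected_improvement(X, X_sample, gpr, xi=0.01)
x_next = X[int(np.argmax(ei))]                     # next location to evaluate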


Kernel/GP-UCB

$$\mathrm{UCB}_t(x) = f_{t-1}(x) + \sqrt{\nu \tau_t}\, s_{t-1}(x)$$

• Select $x_t = \arg\max_{x \in \mathcal{X}} \mathrm{UCB}_t(x)$

• Designed to optimize an exploration/exploitation tradeoff

• Designed to minimize regret, with $\tau_t$ of order $O(\log t)$
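As a sketch, with the same GP surrogate interface as the EI code (the schedule $\beta = \nu \tau_t$ is left to the caller):

import numpy as np

def gp_ucb(X, gpr, beta):
    # UCB_t(x) = posterior mean + sqrt(nu * tau_t) * posterior std
    mu, sigma = gpr.predict(X, return_std=True)
    return mu + np.sqrt(beta) * sigma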

Acquisition function in Bayesian optimization vs policy in bandits?


Acquisition function comparison

Figure 5: Examples of acquisition functions and their settings. The GP posterior is shown at top. The other images show the acquisition functions for that GP. From the top: probability of improvement (Eqn (2)), expected improvement (Eqn (4)), and upper confidence bound (Eqn (5)). The maximum of each function is shown with a triangle marker.

Like other parameterized acquisition models we have seen, the parameter is left to the user. However, an alternative acquisition function has been proposed by Srinivas et al. [2010]. Casting the Bayesian optimization problem as a multi-armed bandit, the acquisition is the instantaneous regret function

$$r(x) = f(x^\star) - f(x).$$

The goal of optimizing in the framework is to find

$$\min \sum_{t=1}^{T} r(x_t) = \max \sum_{t=1}^{T} f(x_t),$$

where $T$ is the number of iterations the optimization is to be run for. Using the upper confidence bound selection criterion with $\kappa_t = \sqrt{\nu \tau_t}$ and the hyperparameter $\nu > 0$, Srinivas et al. define

$$\text{GP-UCB}(x) = \mu(x) + \sqrt{\nu \tau_t}\, \sigma(x). \quad (5)$$

Brochu et al., 2010, A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning


Summary

• Distinguish regret from simple regret (pure exploration)

• Bayesian optimization has many applications, including in ML!

• In practice it is not clear which acquisition function to use

• Bayesian optimization raises an exploration/exploitation challenge with a different aim from bandit optimization

• What if improvement is costly?
  – Maximize expected improvement per second
  – Learn the time function

Any challenges remaining?


Choosing the covariance function

• The kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is also called the covariance function

• The space of functions covered by the GP depends on $k$ (its RKHS)

• Squared exponential (Gaussian) kernel:

$$k_{\mathrm{SE}}(\mathbf{x}, \mathbf{x}') = \exp\left( -\frac{1}{2} r^2(\mathbf{x}, \mathbf{x}') \right) \quad \text{with} \quad r^2(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{n} (x_i - x'_i)^2 / \rho_i^2$$

• Isotropic if $\rho_i = \rho$ for every dimension $i$, otherwise anisotropic

Functions defined by the Gaussian kernel are often too smooth


Matern covariance

• Makes the smoothness explicit through a parameter $\nu$

• $\Gamma$: Gamma function

• $K_\nu$: modified Bessel function of the second kind

• Matérn kernel:

$$k_\nu(\mathbf{x}, \mathbf{x}') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \sqrt{2\nu\, r^2(\mathbf{x}, \mathbf{x}')} \right)^{\!\nu} K_\nu\!\left( \sqrt{2\nu\, r^2(\mathbf{x}, \mathbf{x}')} \right)$$

$$k_{5/2}(\mathbf{x}, \mathbf{x}') = \left( 1 + \sqrt{5\, r^2(\mathbf{x}, \mathbf{x}')} + \frac{5}{3} r^2(\mathbf{x}, \mathbf{x}') \right) \exp\left( -\sqrt{5\, r^2(\mathbf{x}, \mathbf{x}')} \right)$$

with $r^2(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{n} (x_i - x'_i)^2 / \rho_i^2$

• When $\nu \to \infty$, the Matérn kernel converges to the Gaussian kernel
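Both kernels are a few lines of NumPy. A sketch with per-dimension length scales ρ matching the definition of r² above (the function names are mine):

import numpy as np

def r2(X, X2, rho):
    # r^2(x, x') = sum_i (x_i - x'_i)^2 / rho_i^2, for all pairs of rows
    D = (X[:, None, :] - X2[None, :, :]) / rho
    return np.sum(D**2, axis=-1)

def k_se(X, X2, rho):
    return np.exp(-0.5 * r2(X, X2, rho))

def k_matern52(X, X2, rho):
    r = np.sqrt(r2(X, X2, rho))
    return (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)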


Covariance comparison: Posterior

(Figure: GP posteriors on the same data under the Gaussian, Matérn 1/2, and Matérn 5/2 kernels.)


Covariance hyperparameters

• The GP has its own hyperparameters to tune!

• GP hyperparameters influence the posterior → and hence the acquisition function

(Figures from Snoek et al.: (a) GP posterior samples under varying hyperparameters after the same five observations; (b) the corresponding expected improvement acquisition functions, with the maximum of each shown; (c) the integrated expected improvement, with its maximum shown. A second set of panels illustrates the acquisition with pending parallel evaluations: expected improvement conditioned on "fantasy" outcomes, then integrated over them.)

Snoek et al., 2012, Practical Bayesian Optimization of Machine Learning Algorithms


Selecting covariance hyperparameters

• Optimize the marginal likelihood under the Gaussian process:

$$\log \Pr[\mathbf{y}_{1..t} \mid \mathbf{x}_{1..t}, \mu_0, \sigma, \boldsymbol{\rho}] = -\frac{1}{2} (\mathbf{y}_{1..t} - \mu_0)^\top (K_{\boldsymbol{\rho}} + \sigma^2 I_t)^{-1} (\mathbf{y}_{1..t} - \mu_0) - \frac{1}{2} \log \left| K_{\boldsymbol{\rho}} + \sigma^2 I_t \right| - \frac{t}{2} \log 2\pi$$

• Fully Bayesian: integrate the acquisition function over the hyperparameters:

$$\int a(x; \mathbf{x}_{1..t}, \mathbf{y}_{1..t}, \mu_0, \sigma, \boldsymbol{\rho})\, p(\mu_0, \sigma, \boldsymbol{\rho} \mid \mathbf{x}_{1..t}, \mathbf{y}_{1..t})\, d\mu_0\, d\sigma\, d\boldsymbol{\rho}$$

→ Can be approximated with Monte Carlo
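For the point-estimate route, libraries typically maximize this log marginal likelihood for you at fit time; e.g., with scikit-learn (one concrete option, with random restarts against local optima):

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# Amplitude * Matern(length scales rho) + observation-noise term; all of these
# hyperparameters are chosen by maximizing the log marginal likelihood in fit()
kernel = ConstantKernel(1.0) * Matern(length_scale=[1.0, 1.0], nu=2.5) + WhiteKernel(0.1)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
# gpr.fit(X_obs, y_obs); the optimum is stored in gpr.log_marginal_likelihood_value_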


Integrated EI

(Same Snoek et al. figure as on the previous slide: the integrated expected improvement blends the EI acquisition functions obtained under posterior samples of the GP hyperparameters.)

Snoek et al., 2012, Practical Bayesian Optimization of Machine Learning Algorithms


Examples of results

• Optimizing hyperparameters (layer-specific learning rates, weight decay, and a few other parameters) for a CNN on CIFAR-10

• Each function evaluation takes ∼ 1 hour

• Human expert = Alex Krizhevsky (creator of AlexNet)

(Figure 6 from Snoek et al.: minimum validation error on CIFAR-10 versus the number of function evaluations and versus wall-clock time, comparing GP EI MCMC, GP EI Opt, GP EI per Second, GP EI MCMC 3x Parallel, and the human expert. The best hyperparameters found by GP EI MCMC achieve 14.98% test error, over 3% better than the expert and the state of the art at the time.)

Snoek et al., 2012, Practical Bayesian Optimization of Machine Learning Algorithms


CIFAR-10

60,000 32 × 32 color images in 10 different classes


Additional challenges

• How to optimize the acquisition function?

• Parallel evaluations

– Enforce diversity
– Parallelised Bayesian Optimisation via Thompson Sampling (Kandasamy et al., 2018)

• Scaling up? Bayesian Deep Learning

– Prior on neural network weights
– Given training data, compute posterior on weights
→ Obtain a posterior distribution over target functions


Some resources

• Black-box Bayesian optimization using Spearmint:
https://github.com/JasperSnoek/spearmint

• Practical Bayesian optimization of machine learning algorithms (Snoek, Larochelle, and Adams, 2012)
http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms

• Scalable Bayesian optimization using deep neural networks (Snoek et al., 2015)
http://www.jmlr.org/proceedings/papers/v37/snoek15.pdf
