TRANSCRIPT
Scalable Bayesian Optimization using Deep Neural Networks
Jasper Snoek
with
Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary, Prabhat, Ryan P. Adams
Motivation
Bayesian optimization:
• Global optimization of expensive, multi-modal and noisy functions
• E.g. the hyperparameters of machine learning algorithms
• Robots, chemistry, cooking recipes, etc.
Bayesian Optimization for Hyperparameters
Instead of relying on intuition or brute-force strategies:
Perform a regression from the high-level model parameters to the error metric (e.g. classification error)
• Build a statistical model of the function, with a suitable prior – e.g. a Gaussian process
• Use the statistics to tell us:
• Where is the expected minimum of the function?
• What is the expected improvement of trying other parameters?
[Figure: True Function with Three Observations]
[Figure: Bayesian nonlinear regression predictive distributions, with 80%, 90%, and 95% credible intervals]
[Figure: predictive distributions with 80%, 90%, and 95% credible intervals]
How do the predictions compare to the current best?
[Figure: predictive distributions with 80%, 90%, and 95% credible intervals]
Expected Improvement
GPs as Distributions over Functions
[Figure: samples from the GP prior and posterior]
But the computational cost grows cubically in N!
Having a Statistical Framework Helps
• Reason about constraints
• Gramacy et al., 2010; Gardner et al., 2014; Gelbart, Snoek & Adams, 2014; …
• Think about multi-task & transfer across related problems
• Krause & Ong, 2011; Hutter et al., 2011; Bardenet et al., 2013; Swersky, Snoek & Adams, 2013; …
• Run experiments in parallel
• Ginsbourger & Riche, 2010; Hutter et al., 2011; Snoek, Larochelle & Adams, 2012; Frazier et al., 2014; …
• Determine when to stop experiments early
• Swersky, Snoek & Adams, 2014; Domhan et al., 2014
GP-Based Bayesian Optimization
• Gaussian Processes scale poorly: O(N³)
• Due to having to invert the data covariance matrix
• This prevents us from…
• Running hundreds/thousands of experiments in parallel
• Sharing information across many optimizations
• Modeling every epoch of learning (early stopping)
• Having very complex constraint spaces
• Tackling high-dimensional problems
• In order to address more interesting problems, we have to scale it up
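The cubic scaling can be seen directly in code: exact GP inference requires factorizing the N × N covariance matrix. A minimal sketch, with an RBF kernel and illustrative hyperparameters (not the talk's settings):

```python
import numpy as np

def gp_posterior_mean(X, y, X_star, lengthscale=1.0, noise=0.01):
    """Exact GP posterior mean with an RBF kernel.

    The Cholesky factorization of the N x N kernel matrix is the
    O(N^3) step that limits GP-based Bayesian optimization.
    """
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    K = rbf(X, X) + noise * np.eye(len(X))   # N x N covariance
    L = np.linalg.cholesky(K)                # O(N^3): the bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return rbf(X_star, X) @ alpha

# Toy usage: 50 noisy observations of a sine function.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
mu = gp_posterior_mean(X, y, np.array([[0.0]]))
```

Doubling N multiplies the factorization cost by roughly eight, which is why the talk replaces the GP with a model whose expensive step is independent of N.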
Need a Different Model
• Random Forests
• Empirical estimate of uncertainty
• Outperformed by neural nets in general
• Sparse GPs
• Scale better but aren't actually used in practice
• Hard to get to work well; uncertainty estimates are not great
• Bayesian Neural Nets
• Very flexible, powerful models
• Marginalizing all the parameters is prohibitively expensive
Deep Nets for Global Optimization
• A pragmatic Bayesian deep neural net
Bayesian Linear Regression
How does this work?
Expected Improvement depends on the predictive mean and variance of the model
[Figure: predictive distribution and the resulting Expected Improvement]
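Under a Gaussian predictive with mean μ and variance σ², Expected Improvement has a well-known closed form. A minimal sketch for minimization (the function name and values are illustrative):

```python
import math

def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for minimization under a Gaussian predictive N(mu, sigma^2).

    EI(x) = (y_best - mu) * Phi(z) + sigma * phi(z),  z = (y_best - mu) / sigma
    """
    if sigma <= 0.0:
        return max(y_best - mu, 0.0)
    z = (y_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))       # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    return (y_best - mu) * Phi + sigma * phi

# Higher predictive variance at equal mean means more expected improvement.
ei_wide = expected_improvement(0.0, 2.0, 0.0)
ei_narrow = expected_improvement(0.0, 1.0, 0.0)
```

This is why only the predictive mean and variance of the surrogate are needed: any model that supplies those two quantities can drive the same acquisition function.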
m = β K⁻¹ Φᵀ y ∈ R^D
K = β ΦᵀΦ + α² I ∈ R^{D×D}
Φ: last hidden layer of the neural net for the training data (φ(x): the same features for test data)
D ≪ N!
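The Bayesian linear regression on the last hidden layer can be sketched as follows; the values of α and β and the toy features are illustrative, not the talk's settings:

```python
import numpy as np

def blr_posterior(Phi, y, alpha=1.0, beta=100.0):
    """Bayesian linear regression on fixed basis functions Phi (N x D).

    K = beta * Phi^T Phi + alpha^2 * I   (D x D)
    m = beta * K^{-1} Phi^T y            (D,)
    Only a D x D system is solved, so the cost is O(N D^2), not O(N^3).
    """
    N, D = Phi.shape
    K = beta * Phi.T @ Phi + alpha**2 * np.eye(D)
    m = beta * np.linalg.solve(K, Phi.T @ y)
    return m, K

def blr_predict(phi_star, m, K, beta=100.0):
    """Predictive mean and variance at test features phi_star (D,)."""
    mean = phi_star @ m
    var = phi_star @ np.linalg.solve(K, phi_star) + 1.0 / beta
    return mean, var

# Toy check: with Phi = x itself and y = 2x, the posterior weight is near 2.
x = np.linspace(-1, 1, 20)[:, None]
m, K = blr_posterior(x, 2.0 * x[:, 0])
mean, var = blr_predict(np.array([0.5]), m, K)
```

In the talk the features come from the last hidden layer of a trained network, so the surrogate stays as expressive as the net while inference stays cheap in N.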
η(x) = λ + (x − c)ᵀ Λ (x − c)
We set a quadratic prior mean: a bowl centered in the middle of the search region
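A minimal sketch of this quadratic prior mean; the values of λ, c, and Λ below are illustrative:

```python
import numpy as np

def quadratic_prior_mean(x, lam, c, Lam):
    """Quadratic prior mean eta(x) = lam + (x - c)^T Lam (x - c).

    A 'bowl' centered at c (e.g. the middle of the search region) that
    pushes the surrogate upward far from the center, encoding the belief
    that good settings lie away from the boundary.
    """
    d = x - c
    return lam + d @ Lam @ d

c = np.zeros(2)  # center of a [-1, 1]^2 search region
eta_corner = quadratic_prior_mean(np.array([1.0, 1.0]), 0.0, c, np.eye(2))
eta_center = quadratic_prior_mean(c, 0.0, c, np.eye(2))
```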
Constraints
Almost every real problem has complex constraints
• Often unknown a priori
• E.g. training of a model diverging and producing NaNs
• We developed a principled approach to dealing with constraints
• Gelbart, Snoek & Adams. Bayesian Optimization with Unknown Constraints. UAI 2014.
• Need to scale that up as well
Constraints
Use a classification neural net and integrate out the last layer (Laplace Approximation)
Parallelism
[Figure: posterior predictive and Expected Improvement with 3 complete and 2 pending experiments]
With 3 complete and 2 pending, what to do next?
Use posterior predictive to “fantasize” outcomes.
Compute acquisition function (EI) for each predictive fantasy.
Monte Carlo estimate of overall acquisition function.
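The fantasy procedure can be sketched as follows. This is a simplification: it only updates the incumbent per fantasy rather than conditioning the full posterior on the fantasized outcomes, and all names and values are illustrative:

```python
import math, random

def ei(mu, sigma, y_best):
    """Closed-form expected improvement for minimization."""
    if sigma <= 0.0:
        return max(y_best - mu, 0.0)
    z = (y_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (y_best - mu) * Phi + sigma * phi

def mc_ei(mu_x, sigma_x, y_best, pending, n_fantasies=200, seed=0):
    """Monte Carlo EI at a candidate, with experiments still pending.

    `pending` holds the posterior predictive (mu, sigma) at each pending
    point. Each fantasy samples outcomes for the pending points, updates
    the incumbent, and recomputes EI; the average over fantasies is the
    Monte Carlo estimate of the overall acquisition function.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_fantasies):
        fantasized = [rng.gauss(m, s) for m, s in pending]
        total += ei(mu_x, sigma_x, min([y_best] + fantasized))
    return total / n_fantasies

# Candidate at predictive N(0, 1), two pending experiments.
acq = mc_ei(mu_x=0.0, sigma_x=1.0, y_best=0.5, pending=[(0.2, 0.3), (0.8, 0.3)])
```

Since pending experiments can only lower the incumbent, the fantasized acquisition is never larger than EI computed while ignoring them, which is what discourages redundant parallel evaluations.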
Parallelism
[Figure: fantasized samples and the resulting acquisition function]
Sample outputs for both objective and constraint
Monte Carlo Constrained EI
What about all the hyperparameters of this model?
• Integrate out hyperparameters of Bayesian layers
• Use GP Bayesian optimization for the neural net hyperparameters
Putting it all together
Backprop down to the inputs to optimize for the most promising next experiment
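A sketch of the idea on a toy 1-D acquisition function: the talk backpropagates through the network to get exact input gradients, while this illustrative version substitutes finite differences.

```python
def maximize_acquisition(acq, x0, lr=0.1, steps=200, eps=1e-5):
    """Gradient ascent on a 1-D acquisition function.

    In the talk, d(acquisition)/d(input) comes from backprop through the
    neural net surrogate; here we approximate it by central differences.
    """
    x = x0
    for _ in range(steps):
        g = (acq(x + eps) - acq(x - eps)) / (2 * eps)
        x += lr * g
    return x

# Toy acquisition with a single peak at x = 2: ascent should land there.
x_next = maximize_acquisition(lambda x: -(x - 2.0) ** 2, x0=0.0)
```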
How does it scale?
A collection of Bayesian optimization benchmarks (Eggensperger et al.)
How well does it optimize?
Convolutional Networks
• Notoriously hard to tune
• 14 hyperparameters with broad support
• e.g. learning rate, momentum, input dropout, dropout, weight decay, weight initialization, parameters on input transformations, etc.
• Very generic architecture
• Evaluate 40 in parallel on Intel® Xeon Phi™ coprocessors
Convolutional Networks
Achieved “state-of-the-art” within a few sequential steps
Image Caption Generation
Tune the hyperparameters of this model
• MS COCO Benchmark Dataset
• Each experiment takes ~26 hours
• 11 hyperparameters (including categorical)
• Approx. half of the space is invalid
• 500–800 in parallel
Zaremba, Sutskever & Vinyals, 2015
[Figure: validation BLEU-4 score over 2500 optimization iterations]
Sample captions: “A person riding a wave in the ocean”, “A bird sitting on top of a field”, “A horse riding a horse”
Other Interesting Decisions - Neural Net Basis Functions
[Figure: learned basis functions with tanh, ReLU, and tanh + ReLU activations]
Thanks
Oren Rippel (MIT, Harvard)
Kevin Swersky (Toronto)
Ryan P. Adams (Harvard)
Ryan Kiros (Toronto)
Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary (Intel Parallel Labs)
Prabhat (Lawrence Berkeley National Laboratory)