TRANSCRIPT
Scalable Bayesian Optimization using Deep Neural Networks
Jasper Snoek
with
Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary, Prabhat, Ryan P. Adams
Motivation
Bayesian optimization:
• Global optimization of expensive, multi-modal and noisy functions
• E.g. the hyperparameters of machine learning algorithms
• Robots, chemistry, cooking recipes, etc.
Bayesian Optimization for Hyperparameters
Instead of relying on intuition or brute-force strategies:
Perform a regression from the high-level model parameters to the error metric (e.g. classification error)
• Build a statistical model of the function, with a suitable prior – e.g. a Gaussian process
• Use the statistics to tell us:
• Where is the expected minimum of the function?
• What is the expected improvement of trying other parameters?
[Figure: True Function with Three Observations]
[Figure: Bayesian nonlinear regression predictive distributions, with 80%, 90%, and 95% credible intervals]
[Figure: predictive distributions with 80%, 90%, and 95% credible intervals]
How do the predictions compare to the current best?
[Figure: predictive distributions with 80%, 90%, and 95% credible intervals]
Expected Improvement
GPs as Distributions over Functions
[Figure: samples from the GP prior and posterior]
But the computational cost grows cubically in N!
Having a Statistical Framework Helps
• Reason about constraints
• Gramacy et al., 2010; Gardner et al., 2014; Gelbart, Snoek & Adams, 2014; …
• Think about multi-task & transfer across related problems
• Krause & Ong, 2011; Hutter et al., 2011; Bardenet et al., 2013; Swersky, Snoek & Adams, 2013; …
• Run experiments in parallel
• Ginsbourger & Riche, 2010; Hutter et al., 2011; Snoek, Larochelle & Adams, 2012; Frazier et al., 2014; …
• Determine when to stop experiments early
• Swersky, Snoek & Adams, 2014; Domhan et al., 2014
GP-Based Bayesian Optimization
• Gaussian Processes scale poorly: O(N³)
• Due to having to invert the data covariance matrix
• This prevents us from…
• Running hundreds/thousands of experiments in parallel
• Sharing information across many optimizations
• Modeling every epoch of learning (early stopping)
• Having very complex constraint spaces
• Tackling high-dimensional problems
• In order to address more interesting problems, we have to scale it up
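The cubic scaling can be seen directly in code: exact GP inference requires factorizing the N × N covariance matrix. A minimal sketch, with an RBF kernel and illustrative hyperparameters (not the talk's settings):

```python
import numpy as np

def gp_posterior_mean(X, y, X_star, lengthscale=1.0, noise=0.01):
    """Exact GP posterior mean with an RBF kernel.

    The Cholesky factorization of the N x N kernel matrix is the
    O(N^3) step that limits GP-based Bayesian optimization.
    """
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    K = rbf(X, X) + noise * np.eye(len(X))   # N x N covariance
    L = np.linalg.cholesky(K)                # O(N^3): the bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return rbf(X_star, X) @ alpha

# Toy usage: 50 noisy observations of a sine function.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
mu = gp_posterior_mean(X, y, np.array([[0.0]]))
```

Doubling N multiplies the factorization cost by roughly eight, which is why the talk replaces the GP with a model whose expensive step is independent of N.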
Need a Different Model
• Random Forests
• Empirical estimate of uncertainty
• Outperformed by neural nets in general
• Sparse GPs
• Scale better but aren't actually used in practice
• Hard to get to work well; uncertainty estimates are not great
• Bayesian Neural Nets
• Very flexible, powerful models
• Marginalizing all the parameters is prohibitively expensive
Deep Nets for Global Optimization
• A pragmatic Bayesian deep neural net
Bayesian Linear Regression
How does this work?
Expected Improvement depends on the predictive mean and variance of the model
[Figure: predictive distribution and the resulting Expected Improvement]
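Under a Gaussian predictive with mean μ and variance σ², Expected Improvement has a well-known closed form. A minimal sketch for minimization (the function name and values are illustrative):

```python
import math

def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for minimization under a Gaussian predictive N(mu, sigma^2).

    EI(x) = (y_best - mu) * Phi(z) + sigma * phi(z),  z = (y_best - mu) / sigma
    """
    if sigma <= 0.0:
        return max(y_best - mu, 0.0)
    z = (y_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))       # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    return (y_best - mu) * Phi + sigma * phi

# Higher predictive variance at equal mean means more expected improvement.
ei_wide = expected_improvement(0.0, 2.0, 0.0)
ei_narrow = expected_improvement(0.0, 1.0, 0.0)
```

This is why only the predictive mean and variance of the surrogate are needed: any model that supplies those two quantities can drive the same acquisition function.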
m = β K⁻¹ Φᵀ y ∈ R^D
K = β ΦᵀΦ + α² I ∈ R^{D×D}
Φ: last hidden layer of the neural net for the training data (φ(x): the same features for test data)
D ≪ N!
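The Bayesian linear regression on the last hidden layer can be sketched as follows; the values of α and β and the toy features are illustrative, not the talk's settings:

```python
import numpy as np

def blr_posterior(Phi, y, alpha=1.0, beta=100.0):
    """Bayesian linear regression on fixed basis functions Phi (N x D).

    K = beta * Phi^T Phi + alpha^2 * I   (D x D)
    m = beta * K^{-1} Phi^T y            (D,)
    Only a D x D system is solved, so the cost is O(N D^2), not O(N^3).
    """
    N, D = Phi.shape
    K = beta * Phi.T @ Phi + alpha**2 * np.eye(D)
    m = beta * np.linalg.solve(K, Phi.T @ y)
    return m, K

def blr_predict(phi_star, m, K, beta=100.0):
    """Predictive mean and variance at test features phi_star (D,)."""
    mean = phi_star @ m
    var = phi_star @ np.linalg.solve(K, phi_star) + 1.0 / beta
    return mean, var

# Toy check: with Phi = x itself and y = 2x, the posterior weight is near 2.
x = np.linspace(-1, 1, 20)[:, None]
m, K = blr_posterior(x, 2.0 * x[:, 0])
mean, var = blr_predict(np.array([0.5]), m, K)
```

In the talk the features come from the last hidden layer of a trained network, so the surrogate stays as expressive as the net while inference stays cheap in N.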
η(x) = λ + (x − c)ᵀ Λ (x − c)
We set a quadratic prior mean: a bowl centered in the middle of the search region
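A minimal sketch of this quadratic prior mean; the values of λ, c, and Λ below are illustrative:

```python
import numpy as np

def quadratic_prior_mean(x, lam, c, Lam):
    """Quadratic prior mean eta(x) = lam + (x - c)^T Lam (x - c).

    A 'bowl' centered at c (e.g. the middle of the search region) that
    pushes the surrogate upward far from the center, encoding the belief
    that good settings lie away from the boundary.
    """
    d = x - c
    return lam + d @ Lam @ d

c = np.zeros(2)  # center of a [-1, 1]^2 search region
eta_corner = quadratic_prior_mean(np.array([1.0, 1.0]), 0.0, c, np.eye(2))
eta_center = quadratic_prior_mean(c, 0.0, c, np.eye(2))
```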
Constraints
Almost every real problem has complex constraints
• Often unknown a priori
• E.g. training of a model diverging and producing NaNs
• We developed a principled approach to dealing with constraints
• Gelbart, Snoek & Adams. Bayesian Optimization with Unknown Constraints. UAI 2014.
• Need to scale that up as well
Constraints
Use a classification neural net and integrate out the last layer (Laplace Approximation)
Parallelism
[Figure: posterior predictive and Expected Improvement with 3 complete and 2 pending experiments]
With 3 complete and 2 pending, what to do next?
Use posterior predictive to “fantasize” outcomes.
Compute acquisition function (EI) for each predictive fantasy.
Monte Carlo estimate of overall acquisition function.
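The fantasy procedure can be sketched as follows. This is a simplification: it only updates the incumbent per fantasy rather than conditioning the full posterior on the fantasized outcomes, and all names and values are illustrative:

```python
import math, random

def ei(mu, sigma, y_best):
    """Closed-form expected improvement for minimization."""
    if sigma <= 0.0:
        return max(y_best - mu, 0.0)
    z = (y_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (y_best - mu) * Phi + sigma * phi

def mc_ei(mu_x, sigma_x, y_best, pending, n_fantasies=200, seed=0):
    """Monte Carlo EI at a candidate, with experiments still pending.

    `pending` holds the posterior predictive (mu, sigma) at each pending
    point. Each fantasy samples outcomes for the pending points, updates
    the incumbent, and recomputes EI; the average over fantasies is the
    Monte Carlo estimate of the overall acquisition function.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_fantasies):
        fantasized = [rng.gauss(m, s) for m, s in pending]
        total += ei(mu_x, sigma_x, min([y_best] + fantasized))
    return total / n_fantasies

# Candidate at predictive N(0, 1), two pending experiments.
acq = mc_ei(mu_x=0.0, sigma_x=1.0, y_best=0.5, pending=[(0.2, 0.3), (0.8, 0.3)])
```

Since pending experiments can only lower the incumbent, the fantasized acquisition is never larger than EI computed while ignoring them, which is what discourages redundant parallel evaluations.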
Parallelism
[Figure: fantasized samples and the resulting acquisition function]
Sample outputs for both objective and constraint
Monte Carlo Constrained EI
What about all the hyperparameters of this model?
• Integrate out hyperparameters of Bayesian layers
• Use GP Bayesian optimization for the neural net hyperparameters
Putting it all together
Backprop down to the inputs to optimize for the most promising next experiment
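A sketch of the idea on a toy 1-D acquisition function: the talk backpropagates through the network to get exact input gradients, while this illustrative version substitutes finite differences.

```python
def maximize_acquisition(acq, x0, lr=0.1, steps=200, eps=1e-5):
    """Gradient ascent on a 1-D acquisition function.

    In the talk, d(acquisition)/d(input) comes from backprop through the
    neural net surrogate; here we approximate it by central differences.
    """
    x = x0
    for _ in range(steps):
        g = (acq(x + eps) - acq(x - eps)) / (2 * eps)
        x += lr * g
    return x

# Toy acquisition with a single peak at x = 2: ascent should land there.
x_next = maximize_acquisition(lambda x: -(x - 2.0) ** 2, x0=0.0)
```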
How does it scale?
A collection of Bayesian optimization benchmarks (Eggensperger et al.)
How well does it optimize?
Convolutional Networks
• Notoriously hard to tune
• 14 hyperparameters with broad support
• e.g. learning rate, momentum, input dropout, dropout, weight decay, weight initialization, parameters on input transformations, etc.
• Very generic architecture
• Evaluate 40 in parallel on Intel® Xeon Phi™ coprocessors
Convolutional Networks
Achieved “state-of-the-art” within a few sequential steps
Image Caption Generation
Tune the hyperparameters of this model
• MS COCO Benchmark Dataset
• Each experiment takes ~26 hours
• 11 hyperparameters (including categorical)
• Approx. half of the space is invalid
• 500–800 in parallel
Zaremba, Sutskever & Vinyals, 2015
[Figure: validation BLEU-4 score over 2500 optimization iterations]
Sample captions: “A person riding a wave in the ocean”, “A bird sitting on top of a field”, “A horse riding a horse”
Other Interesting Decisions - Neural Net Basis Functions
[Figure: learned basis functions with tanh, ReLU, and tanh + ReLU activations]
Thanks
Oren Rippel (MIT, Harvard)
Kevin Swersky (Toronto)
Ryan P. Adams (Harvard)
Ryan Kiros (Toronto)
Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary (Intel Parallel Labs)
Prabhat (Lawrence Berkeley National Laboratory)