
Page 1

Bayesian Optimization and Automated Machine Learning

Jungtaek Kim ([email protected])

Machine Learning Group, Department of Computer Science and Engineering, POSTECH,

77 Cheongam-ro, Nam-gu, Pohang 37673, Gyeongsangbuk-do, Republic of Korea

June 12, 2018

Page 2

Table of Contents

Bayesian Optimization
- Global Optimization
- Bayesian Optimization
- Background: Gaussian Process Regression
- Acquisition Function
- Synthetic Examples
- bayeso

Automated Machine Learning
- Automated Machine Learning
- Previous Works
- AutoML Challenge 2018
- Automated Machine Learning for Soft Voting in an Ensemble of Tree-based Classifiers
- AutoML Challenge 2018 Result

References

Page 3

Bayesian Optimization

Page 4

Global Optimization

[Figure: illustration of local and global optima, from Wikipedia (https://en.wikipedia.org/wiki/Local_optimum)]

- A method to find the global minimum or maximum of a given target function:

x∗ = arg min_x L(x),  or  x∗ = arg max_x L(x).

Page 5

Target Functions in Bayesian Optimization

- Usually an expensive black-box function.
- Unknown functional form or local geometric features such as saddle points, global optima, and local optima.
- Uncertain function continuity.
- High-dimensional and mixed-variable domain space.

Page 6

Bayesian Approach

- In Bayesian inference, given prior knowledge of the parameters, p(θ | λ), and a likelihood of the dataset conditioned on the parameters, p(D | θ, λ), the posterior distribution is

p(θ | D, λ) = p(D | θ, λ) p(θ | λ) / p(D | λ) = p(D | θ, λ) p(θ | λ) / ∫ p(D | θ, λ) p(θ | λ) dθ,

where θ is a vector of parameters, D is an observed dataset, and λ is a vector of hyperparameters.

- Produces an uncertainty estimate as well as a prediction.
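
To make the formula concrete, here is a minimal conjugate example (not from the slides): a Beta prior over a Bernoulli parameter θ, for which the posterior is available in closed form and yields both a prediction and an uncertainty.

```python
# Conjugate Beta-Bernoulli instance of the posterior update above.
# The hyperparameters lambda = (a, b) parameterize the prior p(theta | lambda).
a, b = 2.0, 2.0                       # prior: theta ~ Beta(a, b)
D = [1, 0, 1, 1, 0, 1]                # observed dataset of Bernoulli outcomes
heads, tails = sum(D), len(D) - sum(D)

# Posterior p(theta | D, lambda) = Beta(a + heads, b + tails); the integral in
# the denominator is absorbed into the Beta normalizing constant.
a_post, b_post = a + heads, b + tails

mean = a_post / (a_post + b_post)     # prediction: posterior mean of theta
var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1.0))
print(f"posterior mean {mean:.3f}, posterior variance {var:.4f}")
```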

Page 7

Bayesian Optimization

- A powerful strategy for finding the extrema of objective functions that are expensive to evaluate,
- where one does not have a closed-form expression for the objective function,
- but where one can obtain observations at sampled values.
- Since we do not know the target function, we optimize an acquisition function instead of the target function itself.
- The acquisition function is computed from the outputs of a Bayesian regression model.

Page 8

Bayesian Optimization

Algorithm 1 Bayesian Optimization

Input: Initial data D_{1:I} = {(x_i, y_i)}_{i=1}^{I}.
1: for t = 1, 2, . . . do
2:   Predict a function f∗(x | D_{1:I+t−1}) considered as the objective function.
3:   Find x_{I+t} that maximizes an acquisition function: x_{I+t} = arg max_x a(x | D_{1:I+t−1}).
4:   Sample the true objective function: y_{I+t} = f(x_{I+t}) + ε_{I+t}.
5:   Update D_{1:I+t} = D_{1:I+t−1} ∪ {(x_{I+t}, y_{I+t})}.
6: end for
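
A minimal Python sketch of this loop (an illustration, not the bayeso implementation): scikit-learn's GaussianProcessRegressor serves as the surrogate model, expected improvement (defined on a later slide) as the acquisition function, and the hypothetical objective f stands in for the expensive black-box function.

```python
# A minimal instance of Algorithm 1: a GP surrogate plus EI over random candidates.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):
    # hypothetical expensive black-box objective (to be minimized)
    return np.sin(3.0 * x) + 0.5 * x ** 2

rng = np.random.default_rng(42)
X = rng.uniform(-2.0, 2.0, size=(3, 1))                # initial data D_{1:I}
y = f(X).ravel()

for t in range(10):
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X, y)
    X_cand = rng.uniform(-2.0, 2.0, size=(1000, 1))    # candidate pool
    mu, sigma = gp.predict(X_cand, return_std=True)    # step 2: predict f*
    imp = y.min() - mu                                 # improvement over f(x+)
    Z = np.divide(imp, sigma, out=np.zeros_like(sigma), where=sigma > 0)
    ei = np.where(sigma > 0, imp * norm.cdf(Z) + sigma * norm.pdf(Z), 0.0)
    x_next = X_cand[np.argmax(ei)]                     # step 3: maximize acquisition
    X = np.vstack([X, x_next])                         # step 5: update the data
    y = np.append(y, f(x_next))                        # step 4: sample the objective

print("best x:", X[np.argmin(y)], "best y:", y.min())
```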

Page 9

Background: Gaussian Process

- A collection of random variables, any finite number of which have a joint Gaussian distribution [Rasmussen and Williams, 2006].
- Generally, a Gaussian process (GP) is written as

f ∼ GP(m(x), k(x, x′)),

where

m(x) = E[f(x)],
k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))].
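
A small sketch of this definition: with m(x) = 0 and the squared-exponential covariance (given on a later slide), any finite grid of inputs induces a joint Gaussian from which sample functions can be drawn.

```python
# Draw sample paths from a zero-mean GP prior on a finite grid of inputs.
import numpy as np

def sqexp(a, b, sigma_f=1.0, ell=1.0):
    # squared-exponential covariance k(x, x'), noise-free part only
    return sigma_f ** 2 * np.exp(-0.5 * (a - b) ** 2 / ell ** 2)

x = np.linspace(-3.0, 3.0, 100)
m = np.zeros_like(x)                       # m(x) = E[f(x)] = 0
K = sqexp(x[:, None], x[None, :])          # Gram matrix K[i, j] = k(x_i, x_j)
rng = np.random.default_rng(0)
# a small diagonal jitter keeps K numerically positive semi-definite
samples = rng.multivariate_normal(m, K + 1e-8 * np.eye(x.size), size=3)
```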

Page 10

Background: Gaussian Process Regression

[Figure: Gaussian process regression example; x on the horizontal axis ranges over [−3, 3] and y on the vertical axis over [−1, 1].]

Page 11

Background: Gaussian Process Regression

- One of the basic covariance functions, the squared-exponential covariance function in one dimension:

k(x, x′) = σ_f² exp(−(x − x′)² / (2l²)) + σ_n² δ_{xx′},

where σ_f is the signal standard deviation, l is the length scale, and σ_n is the noise standard deviation [Rasmussen and Williams, 2006].

- Posterior mean function and covariance function:

µ∗ = K(X∗, X) (K(X, X) + σ_n² I)⁻¹ y,
Σ∗ = K(X∗, X∗) − K(X∗, X) (K(X, X) + σ_n² I)⁻¹ K(X, X∗).
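
The two posterior equations translate directly into NumPy; a sketch under the same squared-exponential kernel, using a linear solve instead of an explicit matrix inverse:

```python
# Posterior mean and covariance of GP regression, as in the equations above.
import numpy as np

def sqexp(a, b, sigma_f=1.0, ell=1.0):
    return sigma_f ** 2 * np.exp(-0.5 * (a - b) ** 2 / ell ** 2)

def gp_posterior(X, y, X_star, sigma_f=1.0, ell=1.0, sigma_n=0.1):
    K = sqexp(X[:, None], X[None, :], sigma_f, ell)               # K(X, X)
    K_s = sqexp(X_star[:, None], X[None, :], sigma_f, ell)        # K(X*, X)
    K_ss = sqexp(X_star[:, None], X_star[None, :], sigma_f, ell)  # K(X*, X*)
    A = K + sigma_n ** 2 * np.eye(X.size)                         # K(X, X) + sigma_n^2 I
    mu_star = K_s @ np.linalg.solve(A, y)                         # posterior mean
    Sigma_star = K_ss - K_s @ np.linalg.solve(A, K_s.T)           # posterior covariance
    return mu_star, Sigma_star

X_train = np.array([-2.0, -1.0, 0.5, 1.5])
y_train = np.sin(X_train)
mu_star, Sigma_star = gp_posterior(X_train, y_train, np.linspace(-3.0, 3.0, 50))
```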

Page 12

Background: Gaussian Process Regression

- If a non-zero mean prior µ(·) is given, the posterior mean and covariance functions become

µ∗ = µ(X∗) + K(X∗, X) (K(X, X) + σ_n² I)⁻¹ (y − µ(X)),
Σ∗ = K(X∗, X∗) − K(X∗, X) (K(X, X) + σ_n² I)⁻¹ K(X, X∗).

Page 13

Acquisition Functions

- A function that acquires the next point to evaluate for an expensive black-box function.
- Traditionally, the probability of improvement (PI) [Kushner, 1964], the expected improvement (EI) [Mockus et al., 1978], and the GP upper confidence bound (GP-UCB) [Srinivas et al., 2010] are used.
- Several functions, such as entropy search [Hennig and Schuler, 2012] and a combination of existing functions [Kim and Choi, 2018b], have recently been proposed.

Page 14

Traditional Acquisition Functions (Minimization Case)

- PI [Kushner, 1964]:

a_PI(x | D, λ) = Φ(Z),

- EI [Mockus et al., 1978]:

a_EI(x | D, λ) = (f(x⁺) − µ(x)) Φ(Z) + σ(x) φ(Z)  if σ(x) > 0,
a_EI(x | D, λ) = 0                                 if σ(x) = 0,

- GP-UCB [Srinivas et al., 2010]:

a_UCB(x | D, λ) = −µ(x) + β σ(x),

where x⁺ is the best point observed so far,

Z = (f(x⁺) − µ(x)) / σ(x)  if σ(x) > 0,  Z = 0  if σ(x) = 0,

and µ(x) := µ(x | D, λ), σ(x) := σ(x | D, λ).
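
In code, the three criteria take a few lines each; a sketch for the minimization case above, where mu and sigma are the GP posterior mean and standard deviation at the query points and y_best stands for f(x⁺):

```python
# PI, EI, and GP-UCB for minimization, given posterior mean/std arrays.
import numpy as np
from scipy.stats import norm

def z_score(y_best, mu, sigma):
    # Z = (f(x+) - mu(x)) / sigma(x) if sigma(x) > 0, else 0
    return np.divide(y_best - mu, sigma, out=np.zeros_like(sigma), where=sigma > 0)

def acq_pi(y_best, mu, sigma):
    return norm.cdf(z_score(y_best, mu, sigma))

def acq_ei(y_best, mu, sigma):
    Z = z_score(y_best, mu, sigma)
    ei = (y_best - mu) * norm.cdf(Z) + sigma * norm.pdf(Z)
    return np.where(sigma > 0, ei, 0.0)

def acq_ucb(mu, sigma, beta=2.0):
    # low posterior mean (we minimize f) plus an exploration bonus
    return -mu + beta * sigma
```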

Page 15

Synthetic Examples

[Figure: six panels, (a) Iteration 1 through (f) Iteration 6; each panel plots the objective y (top) and the acquisition function value (bottom) over x ∈ [−5, 5].]

Figure 1: y = 4.0 cos(x) + 0.1x + 2.0 sin(x) + 0.4(x − 0.5)². EI is used to optimize.
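
For reference, the target of Figure 1 can be written directly in Python and plugged into the loop sketched after Algorithm 1:

```python
import numpy as np

def target(x):
    # the synthetic objective of Figure 1
    return 4.0 * np.cos(x) + 0.1 * x + 2.0 * np.sin(x) + 0.4 * (x - 0.5) ** 2
```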

Page 16

bayeso

- Simple, but essential, Bayesian optimization package.
- Written in Python.
- Licensed under the MIT license.
- https://github.com/jungtaekkim/bayeso

Page 17

Automated Machine Learning

Page 18

Automated Machine Learning

- Attempts to find the optimal machine learning model automatically, without human intervention.
- Usually includes feature transformation, algorithm selection, and hyperparameter optimization.
- Given a training dataset D_train and a validation dataset D_val, the optimal hyperparameter vector λ∗ for an automated machine learning system is

λ∗ = AutoML(D_train, D_val, Λ),

where AutoML is an automated machine learning system and λ ∈ Λ.
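
A toy sketch of this abstract interface (the function names and the search space Lambda are hypothetical): here the inner search is plain random search over Λ, which the systems on the following slides replace with Bayesian optimization and related strategies.

```python
# AutoML(D_train, D_val, Lambda) realized as random search over Lambda.
import random
from sklearn.ensemble import RandomForestClassifier

def automl(X_train, y_train, X_val, y_val, Lambda, n_trials=20):
    best_lam, best_score = None, -float("inf")
    for _ in range(n_trials):
        lam = {key: random.choice(values) for key, values in Lambda.items()}
        model = RandomForestClassifier(**lam).fit(X_train, y_train)
        score = model.score(X_val, y_val)      # validation accuracy
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam                            # lambda*

# example search space:
# Lambda = {"n_estimators": [100, 300, 500], "max_depth": [3, 5, None]}
```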

Page 19

Previous Works

- Bayesian optimization and hyperparameter optimization:
  - GPyOpt [The GPyOpt authors, 2016]
  - SMAC [Hutter et al., 2011]
  - BayesOpt [Martinez-Cantin, 2014]
  - bayeso
  - SigOpt API [Martinez-Cantin et al., 2018]
- Automated machine learning frameworks:
  - auto-sklearn [Feurer et al., 2015]
  - Auto-WEKA [Thornton et al., 2013]
  - Our previous work [Kim et al., 2016]

Page 20

AutoML Challenge 2018

- Two phases: a feedback phase and an AutoML challenge phase.
- The feedback phase provides five datasets for binary classification.
- Given training/validation/test datasets, after a code or prediction file is submitted, the validation measure is posted on the leaderboard.
- The AutoML challenge phase determines the challenge winners by comparing a normalized area under the ROC curve (AUC) metric on blind datasets:

Normalized AUC = 2 · AUC − 1.
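
In code, the metric is one line on top of an ordinary AUC (labels and scores below are hypothetical):

```python
# Normalized AUC = 2 * AUC - 1: 0 for random guessing, 1 for a perfect ranking.
from sklearn.metrics import roc_auc_score

y_true = [0, 1, 1, 0, 1]                  # hypothetical binary labels
y_score = [0.2, 0.8, 0.6, 0.4, 0.9]       # hypothetical predicted probabilities
normalized_auc = 2.0 * roc_auc_score(y_true, y_score) - 1.0
```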

Page 21

AutoML Challenge 2018

Figure 2: Datasets of the feedback phase in AutoML Challenge 2018. Train. #, Valid. #, Test #, Feature #, Chrono., and Budget stand for training dataset size, validation dataset size, test dataset size, number of features, chronological order, and time budget, respectively. The time budget is given in seconds.

Page 22

Background: Soft Majority Voting

- An ensemble method to construct a classifier using a majority vote of k base classifiers.
- Class assignment of the soft majority voting classifier:

c_i = arg max Σ_{j=1}^{k} w_j p_i^{(j)}

for 1 ≤ i ≤ n, where n is the number of instances, arg max returns the index of the maximum value in the given vector, w_j ≥ 0 is the weight of base classifier j, and p_i^{(j)} is the class-probability vector of base classifier j for instance i.
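
A NumPy sketch of this rule with hypothetical probabilities for k = 3 classifiers, n = 2 instances, and c = 2 classes:

```python
# Weighted soft voting: c_i = argmax of sum_j w_j * p_i^(j).
import numpy as np

probs = np.array([                        # shape (k, n, c)
    [[0.7, 0.3], [0.4, 0.6]],             # classifier 1
    [[0.6, 0.4], [0.2, 0.8]],             # classifier 2
    [[0.9, 0.1], [0.5, 0.5]],             # classifier 3
])
w = np.array([0.5, 0.3, 0.2])             # non-negative weights w_j

weighted = np.tensordot(w, probs, axes=1) # sum_j w_j p_i^(j), shape (n, c)
c = weighted.argmax(axis=1)               # class assignment per instance
print(c)                                  # [0 1] for the values above
```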

Page 23

Our AutoML System [Kim and Choi, 2018a]

[Diagram: Dataset → Automated Machine Learning System → Prediction, where the system contains a Voting Classifier over Gradient Boosting, Extra-trees, and Random Forests Classifiers, tuned by Bayesian Optimization.]

Figure 3: Our automated machine learning system. The voting classifier, constructed from three tree-based classifiers (gradient boosting, extra-trees, and random forests), produces predictions; the voting classifier and the tree-based classifiers are iteratively optimized by Bayesian optimization within the given time budget.

Page 24

Our AutoML System [Kim and Choi, 2018a]

- Written in Python.
- Uses scikit-learn and our own Bayesian optimization package.
- Splits the training dataset into training (0.6) and validation (0.4) sets for Bayesian optimization.
- Optimizes six hyperparameters (see the sketch below):
  1. the extra-trees classifier weight relative to the gradient boosting classifier weight in the voting classifier,
  2. the random forests classifier weight relative to the gradient boosting classifier weight in the voting classifier,
  3. the number of estimators of the gradient boosting classifier,
  4. the number of estimators of the extra-trees classifier,
  5. the number of estimators of the random forests classifier,
  6. the maximum depth of the gradient boosting classifier.
- Uses GP-UCB as the acquisition function.
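
A scikit-learn sketch of this configuration; the slides give neither search ranges nor a reference voting weight, so every default value below, and the choice to fix the gradient boosting weight at 1.0 with the two tuned weights relative to it, is an assumption.

```python
# The voting ensemble described above; the six tuned hyperparameters are the
# function arguments (all defaults are assumed, not taken from the slides).
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

def build_ensemble(w_et=1.0, w_rf=1.0, n_gb=100, n_et=100, n_rf=100, depth_gb=3):
    return VotingClassifier(
        estimators=[
            ("gb", GradientBoostingClassifier(n_estimators=n_gb, max_depth=depth_gb)),
            ("et", ExtraTreesClassifier(n_estimators=n_et)),
            ("rf", RandomForestClassifier(n_estimators=n_rf)),
        ],
        voting="soft",                    # soft majority voting
        weights=[1.0, w_et, w_rf],        # per-classifier voting weights
    )

# Each GP-UCB iteration would fit build_ensemble(...) on the 0.6 training split
# and score it on the 0.4 validation split to obtain the next observation.
```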

Page 25

AutoML Challenge 2018 Result

Figure 4: AutoML Challenge 2018 result. A normalized area under the ROC curve (AUC) score (upper cell in each row) is computed for each dataset, and a dataset rank (lower cell in each row) is determined by the numerical order of the normalized AUC scores. Finally, the overall rank is determined by the average rank over the five datasets.

Page 26

References

Page 27

References I

M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems (NIPS), pages 2962–2970, Montreal, Quebec, Canada, 2015.

P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13:1809–1837, 2012.

F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the International Conference on Learning and Intelligent Optimization, pages 507–523, Rome, Italy, 2011.

J. Kim and S. Choi. Automated machine learning for soft voting in an ensemble of tree-based classifiers, 2018a. https://github.com/jungtaekkim/automl-challenge-2018.

J. Kim and S. Choi. Clustering-guided GP-UCB for Bayesian optimization. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Alberta, Canada, 2018b.

J. Kim, J. Jeong, and S. Choi. AutoML Challenge: AutoML framework using random space partitioning optimizer. In International Conference on Machine Learning Workshop on Automatic Machine Learning, New York, New York, USA, 2016.

H. J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, 1964.

Page 28

References II

R. Martinez-Cantin. BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits. Journal of Machine Learning Research, 15:3735–3739, 2014.

R. Martinez-Cantin, K. Tee, and M. McCourt. Practical Bayesian optimization in the presence of outliers. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Playa Blanca, Lanzarote, Canary Islands, 2018.

J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2:117–129, 1978.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the International Conference on Machine Learning (ICML), pages 1015–1022, Haifa, Israel, 2010.

The GPyOpt authors. GPyOpt: A Bayesian optimization framework in Python, 2016. https://github.com/SheffieldML/GPyOpt.

C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 847–855, Chicago, Illinois, USA, 2013.