
Page 1: ORIE 4741: Introduction to AutoML

ORIE 4741: Introduction to AutoML

Chengrun Yang

December 3, 2020

1 / 26

Page 2: ORIE 4741: Introduction to AutoML

About me

I fifth-year PhD student in ECE

I working on AutoML-related problems

I email: [email protected]

2 / 26

Page 3: ORIE 4741: Introduction to AutoML

Outline

Motivation

Some AutoML systems

Demo!

Challenges

3 / 26


Page 5: ORIE 4741: Introduction to AutoML

Machine learning is used everywhere ...

object detection drug discovery

speech recognition social science

5 / 26

Page 6: ORIE 4741: Introduction to AutoML

But there are real pitfalls ...

1. missing values and outliers are prevalent

2. feature engineering can be misleading

3. model training can be expensive

4. model selection can be more expensive

5. generalization can be tricky

6. . . .

6 / 26

Page 7: ORIE 4741: Introduction to AutoML

A subproblem: estimator selection

hyperparameter: a parameter that governs the training process.

Types of hyperparameters: continuous, categorical, ordinal, ...

an estimator: an algorithm with a hyperparameter setting

e.g., ridge regression with λ = 1, decision tree with depth 3

In supervised learning, given a training set {(x_i, y_i)}_{i=1}^n, how would we find a mapping f : X → Y?

I linear regression?

I random forest?

I gradient boosting?

I . . .

I try all the candidates in scikit-learn [PVG+11], or all the available neural network architectures?

7 / 26
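The selection loop above can be sketched in a few lines. Both candidate "estimators" below (a majority-class predictor and 1-nearest-neighbor) and the 1-D dataset are toy stand-ins invented for this sketch, not library code; a real system would score scikit-learn candidates the same way, by hold-out validation.

```python
def majority_class(train, _x):
    """Predict the most common training label, ignoring the input."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def nearest_neighbor(train, x):
    """Predict the label of the closest training point (1-NN)."""
    _, label = min(train, key=lambda point: abs(point[0] - x))
    return label

# 1-D toy data: label is 1 iff the feature is nonnegative
data = [(i / 10, int(i >= 0)) for i in range(-10, 11)]
train, val = data[::2], data[1::2]

candidates = {"majority": majority_class, "1-NN": nearest_neighbor}
scores = {name: sum(f(train, x) == y for x, y in val) / len(val)
          for name, f in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)   # 1-NN separates this toy data; majority gets 0.5
```

The point of AutoML is to replace this brute-force loop, which does not scale past a handful of candidates, with an informed search.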


Page 10: ORIE 4741: Introduction to AutoML

The machine learning pipeline space is huge

a pipeline: a directed graph of learning components

[Figure: an example pipeline. raw dataset → imputer (impute missing entries by mean) → encoder (one-hot encoder) → standardizer (0 mean and unit variance for each feature) → dimensionality reducer (PCA, 25% of components) → estimator (kNN, k = 5) → predictions.]

Data scientists have so many choices to make:

I data imputer: fill in missing values by median? . . .

I encoder: one-hot encode? . . .

I standardizer: rescale each feature? . . .

I dimensionality reducer: PCA, or select by variance? . . .

I estimator: use decision tree or logistic regression? . . .

In this combinatorially large search space:

1. impossible to enumerate all choices on large datasets

2. too expensive on small datasets

3. the best-on-average pipeline does not always perform the best

8 / 26
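A linear pipeline like the one in the figure is just function composition. Every component below is a toy stand-in written for this sketch (a mean imputer, a standardizer, and a nearest-class-mean estimator in place of kNN); the names are illustrative, not a library API.

```python
import numpy as np

def impute_mean(X):
    """Replace NaNs in each column with that column's mean."""
    X = X.copy()
    col_mean = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]
    return X

def standardize(X):
    """0 mean and unit variance for each feature."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def nearest_class_mean(X_train, y_train, X_test):
    """Toy estimator: predict the class whose training mean is closest."""
    classes = np.unique(y_train)
    means = {c: X_train[y_train == c].mean(axis=0) for c in classes}
    return np.array([min(classes, key=lambda c: np.linalg.norm(x - means[c]))
                     for x in X_test])

X = np.array([[1.0, 2.0], [np.nan, 3.0], [5.0, 6.0], [6.0, 7.0]])
y = np.array([0, 0, 1, 1])

Xp = standardize(impute_mean(X))      # imputer -> standardizer
yhat = nearest_class_mean(Xp, y, Xp)  # -> estimator
print(yhat)                           # [0 0 1 1]
```

Each slot in this chain has many possible components and hyperparameters, which is exactly why the full space is combinatorially large.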


Page 12: ORIE 4741: Introduction to AutoML

No Free Lunch

The “no free lunch (NFL)” theorem [Wol96]

There is no one model that works best for every problem.

On 215 midsize OpenML classification datasets:

I The best-on-average pipeline (highest average ranking):

[Figure: the baseline pipeline. raw dataset → imputer (impute missing entries by mode) → encoder (encode categorical as integer) → standardizer (0 mean and unit variance for each feature) → dimensionality reducer (remove features with 0 variance) → estimator (gradient boosting w/ learning rate 0.25 and maximum depth 3) → predictions.]

I The estimator types of best pipelines on individual datasets:

gradient boosting - 38.60%

multilayer perceptron - 20.93%

kNN - 10.23%

adaboost - 8.84%

extra trees - 5.58%

logistic regression - 5.58%

decision tree - 3.72%

random forest - 3.26%

linear SVM - 1.86%

Gaussian naive Bayes - 1.40%

9 / 26

Page 13: ORIE 4741: Introduction to AutoML

Approaches to avoid exhaustive search

1. rule-based search
I grid search
I random search
I genetic programming
I . . .

2. build meta-models!
I on a single dataset: build surrogate models to predict performance of traditional models
  I Gaussian processes
  I reinforcement learning (e.g., multi-armed bandit)
  I experiment design
  I matrix factorization / tensor decomposition
  I . . .
I learning across datasets, a.k.a. meta-learning

10 / 26

Page 14: ORIE 4741: Introduction to AutoML

Grid search and random search

On two hyperparameters:

Image source: Bergstra & Bengio, 2012 [BB12].

I both are completely uninformed
I random search handles unimportant dimensions better
I grid search with M explorations on N hyperparameters: ⌊M^(1/N)⌋ distinct values for each hyperparameter

Poll: the benefit of random search may ___ on a larger number of hyperparameters.

A. increase
B. decrease
C. it depends

11 / 26
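The coverage gap is easy to see numerically. This sketch (illustrative names, hyperparameters assumed uniform on [0, 1]) counts the distinct values each strategy tries along one dimension under the same budget M:

```python
import itertools, random

random.seed(0)
M, N = 16, 2                          # budget M, N hyperparameters
k = int(round(M ** (1 / N)))          # grid points per axis: floor(M^(1/N)) = 4

grid = list(itertools.product([i / (k - 1) for i in range(k)], repeat=N))
rand = [tuple(random.random() for _ in range(N)) for _ in range(M)]

distinct_grid = len({p[0] for p in grid})   # values tried along dimension 0
distinct_rand = len({p[0] for p in rand})
print(distinct_grid, distinct_rand)         # 4 vs 16
```

With the same 16 evaluations, grid search probes only 4 settings of each hyperparameter, while random search probes 16; if one dimension barely matters, random search has still explored the important one far more finely.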


Page 16: ORIE 4741: Introduction to AutoML

Genetic programming

Image source: dotnetlovers.com

“Survival of the fittest”: automatically explore numerous possible pipelines to find the best for the given dataset

12 / 26
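The loop can be shown in miniature. Real genetic programming (e.g., TPOT) evolves whole pipeline trees with crossover; this sketch keeps only the select-and-mutate step, applied to a single numeric hyperparameter with a made-up fitness function.

```python
import random

random.seed(0)
fitness = lambda x: -(x - 3.0) ** 2       # made-up objective; optimum at x = 3

population = [random.uniform(-10, 10) for _ in range(20)]
for _ in range(50):                        # 50 generations
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]            # "survival of the fittest" half
    children = [x + random.gauss(0, 0.5) for x in survivors]  # mutate survivors
    population = survivors + children

best = max(population, key=fitness)
print(round(best, 2))    # near 3
```

Because survivors are kept (elitism), the best candidate never gets worse, and random mutations steadily drift the population toward the optimum.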

Page 17: ORIE 4741: Introduction to AutoML

Bayesian optimization (BO)

BO: a sequential optimization strategy to find the extrema of black-box functions that are expensive to evaluate.

prior + function evaluations = posterior

the most common model: Gaussian processes

[Figure: three iterations (t = 2, 3, 4) of BO on a 1-D objective f(·): the posterior mean µ(·), the posterior uncertainty µ(·) ± σ(·), the observations (x), and the acquisition function u(·), whose maximum picks the new observation x_t.]

Image source: Brochu et al., 2010 [BCDF10].

13 / 26
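A minimal version of this loop fits in a page. The sketch below assumes an RBF kernel, a noise-free 1-D objective, and grid maximization of the expected-improvement acquisition; every constant (kernel width, noise jitter, grid size) is an illustrative choice, not taken from any BO library.

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, ell=0.3):
    """RBF kernel matrix between 1-D point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean and standard deviation at test points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))     # prior + jitter
    Ks = rbf(X, Xs)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y                       # posterior mean
    var = 1.0 - np.sum(Ks * (Kinv @ Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    """EI for minimization: (best - mu) Phi(z) + sigma phi(z)."""
    z = (best - mu) / sigma
    Phi = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2)))
    phi = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return (best - mu) * Phi + sigma * phi

f = lambda x: (x - 0.7) ** 2       # the expensive black box (minimum at 0.7)
grid = np.linspace(0, 1, 201)

X = np.array([0.1, 0.5, 0.9])      # initial observations
y = f(X)
for _ in range(10):                 # prior + evaluations -> posterior, repeat
    mu, sigma = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X, y = np.append(X, x_next), np.append(y, f(x_next))

print(X[np.argmin(y)])             # close to the true minimizer 0.7
```

Note how the acquisition trades off exploitation (low posterior mean) against exploration (high posterior uncertainty), exactly as in the figure.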

Page 18: ORIE 4741: Introduction to AutoML

Multi-armed bandit

14 / 26
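In the bandit view of AutoML, each candidate configuration is an arm, and pulling an arm means training it a little longer. Successive halving, the core subroutine of Hyperband [LJD+18], allocates the budget by repeatedly discarding the worse half; the learning curves below are synthetic stand-ins for validation losses.

```python
def loss(arm, budget):
    """Synthetic learning curve: every arm decays toward its own floor."""
    floors = [0.30, 0.10, 0.25, 0.05, 0.20, 0.15, 0.40, 0.35]
    return floors[arm] + 1.0 / (1 + budget)

def successive_halving(arms, budget=1):
    while len(arms) > 1:
        scores = {a: loss(a, budget) for a in arms}
        arms = sorted(arms, key=scores.get)[: len(arms) // 2]  # drop worse half
        budget *= 2                                # train survivors longer
    return arms[0]

best = successive_halving(list(range(8)))
print(best)   # arm 3 has the lowest floor
```

On these toy curves halving trivially keeps the true best arm; with real, crossing learning curves it can discard late bloomers, which is why Hyperband hedges by running halving at several starting budgets.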

Page 19: ORIE 4741: Introduction to AutoML

Learning vs meta-learning

[Figure: learning vs meta-learning splits. Learning splits a single dataset into training / validation / test. Meta-learning splits a collection of learning instances, each with its own training / validation / test split, into meta-training / meta-validation / meta-test.]

I learning splits datasets

I meta-learning splits learning instances:

I same model, different datasets (“sets of datasets”), e.g., stock market data on different days

I different models, same dataset, e.g., performance of ridge regression at different λ’s

15 / 26

Page 20: ORIE 4741: Introduction to AutoML

Outline

Motivation

Some AutoML systems

Demo!

Challenges

16 / 26

Page 21: ORIE 4741: Introduction to AutoML

Some AutoML systems (for reference)

Optimizing over traditional models:

I hyperparameter optimization frameworks:

I Auto-WEKA [THHLB13]: Bayesian optimization (BO) on conditional search space

I auto-sklearn [FKE+15]: meta-learning + BO

I TPOT [OUA+16]: genetic programming

I Hyperband [LJD+18]: multi-armed bandit

I PMF [FSE18]: matrix factorization + BO

I Oboe [YAKU19]: matrix factorization + experiment design

I AutoGluon [EMS+20]: ensembling

I . . .

Neural architecture search (NAS):

I Google NAS [ZL16]: reinforcement learning

I NASBOT [KNS+18]: BO + optimal transport

I Auto-Keras [JSH19]: BO + network morphism

I AutoML-Zero [RLSL20]: genetic programming

I . . .

17 / 26

Page 22: ORIE 4741: Introduction to AutoML

Commercial AutoML tools

I Google AutoML Vision

I Microsoft Azure AutoML

I Amazon AutoGluon on SageMaker

I H2O AutoML

I . . .

18 / 26

Page 23: ORIE 4741: Introduction to AutoML

Outline

Motivation

Some AutoML systems

Demo!

Challenges

19 / 26


Page 25: ORIE 4741: Introduction to AutoML

Challenge I: overfitting

Recall overfitting: low training error and high test error

More (layers of) learning, more possible overfitting!

[Figure: the learning vs meta-learning split diagram, repeated: learning splits one dataset into training / validation / test, while meta-learning splits learning instances into meta-training / meta-validation / meta-test.]

In AutoML,

I traditional overfitting: the selected models may overfit on the original dataset

I meta-overfitting: the surrogate model may overfit on past learning instances

21 / 26

Page 26: ORIE 4741: Introduction to AutoML

Challenge II: hyper-hyperparameters

hyper-hyperparameters:

hyperparameters of the search rule or meta-model

Example:

I in grid search: selection interval, stopping criteria

I in Gaussian processes: which kernel, kernel parameters, acquisition function, . . .

I in meta-learning: how many datasets to learn from, what models to use for knowledge transfer, . . .

Rationale: make human choices fewer and easier

22 / 26

Page 27: ORIE 4741: Introduction to AutoML

Challenge III: robustness

I traditional robustness: robustness to noise, outliers,adversarial attacks

I meta-robustness: robustness to noisy or adversarialmeta-learning instances

23 / 26

Page 28: ORIE 4741: Introduction to AutoML

Challenge IV: cost

Cost has been a (decisive) factor in the advancement of AutoML:

Google RL-based NAS [ZL16]: 1k GPU days (> $70k on AWS)

→ FBNet [WDZ+19]: 10 GPU days ($700 on AWS)

24 / 26

Page 29: ORIE 4741: Introduction to AutoML

More considerations

I interpretability: how to improve it, or do we really need it?

I baseline: which one to compare to, human common practice, or human “best” practice?

I metrics: how to customize for specific problems?

25 / 26

Page 30: ORIE 4741: Introduction to AutoML

Summary

I AutoML has gained popularity in recent years.

I People try to automate every phase of machine learning.

I Most AutoML frameworks rely on informed search rules or surrogate models.

I On top of the challenges in traditional ML, more may arise in AutoML.

26 / 26

Page 31: ORIE 4741: Introduction to AutoML

References I

James Bergstra and Yoshua Bengio.

Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.

Eric Brochu, Vlad M Cora, and Nando De Freitas.

A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander Smola.

AutoGluon-Tabular: Robust and accurate AutoML for structured data. arXiv preprint arXiv:2003.06505, 2020.

Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter.

Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pages 2962–2970, 2015.

Nicolo Fusi, Rishit Sheth, and Melih Elibol.

Probabilistic matrix factorization for automated machine learning. In Advances in Neural Information Processing Systems, pages 3352–3361, 2018.

Haifeng Jin, Qingquan Song, and Xia Hu.

Auto-Keras: An efficient neural architecture search system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, pages 1946–1956, New York, NY, USA, 2019. ACM.

Page 32: ORIE 4741: Introduction to AutoML

References II

Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric Xing.

Neural architecture search with Bayesian optimisation and optimal transport. arXiv preprint arXiv:1802.07191, 2018.

Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar.

Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018.

Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore.

Automating biomedical data science through tree-based pipeline optimization. In Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 – April 1, 2016, Proceedings, Part I, pages 123–137. Springer International Publishing, 2016.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay.

Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Esteban Real, Chen Liang, David R So, and Quoc V Le.

AutoML-Zero: Evolving machine learning algorithms from scratch. arXiv preprint arXiv:2003.03384, 2020.

Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown.

Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 847–855. ACM, 2013.

Page 33: ORIE 4741: Introduction to AutoML

References III

Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer.

FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019.

David H Wolpert.

The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.

Chengrun Yang, Yuji Akimoto, Dae Won Kim, and Madeleine Udell.

Oboe: Collaborative filtering for AutoML model selection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1173–1183. ACM, 2019.

Barret Zoph and Quoc V Le.

Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.