Pitfalls of AutoML ANALYTICS FRONTIERS CONFERENCE 2019 CLIFF WEAVER

Post on 10-Jan-2020

Page 1: Pitfalls of AutoML · • Will include data prep, industry specific feature engineering, API deployment, monitoring and retraining • Future AutoML will be interactive where the

Pitfalls of AutoML
ANALYTICS FRONTIERS CONFERENCE 2019
CLIFF WEAVER

Speaker Introduction

• Cliff Weaver

• Let’s just say decades of work in technology

• Data obsessed

• R programming enthusiast and evangelist

• Limitless passion to solve problems with data

• Current work includes:
  • Local large company
  • RisMyHammer!
  • Local startups

Agenda

• AutoML – What is it?

• AutoML Options

• What are the pitfalls?

• The Future of AutoML

• Recommendations

• Q&A

Why This Topic?

Why is this topic important to you?

To improve your business by helping you make informed decisions when evaluating AutoML solutions.

• Dodge the avoidable risk

• Plan and implement AutoML thoughtfully

Agenda

• AutoML – What is it?

• AutoML Options

• What are the pitfalls?

• The Future of AutoML

• Recommendations

• Q&A

AutoML – What is it NOT?

Before defining AutoML, distinguish machine learning from data science – it is more than a matter of semantics.

• Machine learning covers data modeling (selecting the best algorithm, tuning its parameters, etc.), which is one part of a larger data science toolkit that also includes data preparation, descriptive analytics, feature engineering, operationalization, etc.

• AutoML is NOT automated data science.

AutoML – What is it?

• AutoML (Automatic Machine Learning) refers to automated methods for model selection and/or hyperparameter optimization.

• The goal of AutoML software is two-fold:
  • Enable non-experts to train high quality machine learning models
  • Improve the efficiency to find optimal solutions to machine learning problems

• AutoML provides:
  • High-level machine learning algorithms to detect the best model or combination of models for a specific problem
  • Methods that automate hyperparameter optimization (relieving the data scientist from repetitive and tedious tasks)

the automation of automating automation 1
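To make "model selection and/or hyperparameter optimization" concrete, here is a minimal sketch using only the Python standard library. The toy data, the two model families (`fit_mean`, `fit_knn`) and the candidate list are all hypothetical illustrations, not the internals of any AutoML product. The loop fits every candidate, scores it on held-out data and keeps the best.

```python
from statistics import mean

# Toy data (hypothetical): y = x^2, split into training and validation sets.
xs = list(range(21))
train = [(x, x * x) for x in xs if x % 2 == 0]
valid = [(x, x * x) for x in xs if x % 2 == 1]


def fit_mean(data):
    """Model family 1: always predict the training mean (no hyperparameters)."""
    avg = mean(y for _, y in data)
    return lambda x: avg


def fit_knn(data, k):
    """Model family 2: average the targets of the k nearest training points."""
    def predict(x):
        nearest = sorted(data, key=lambda point: abs(point[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict


def validation_mse(predict):
    """Score a fitted model on data it did not see during training."""
    return mean((predict(x) - y) ** 2 for x, y in valid)


# The search space: every (model family, hyperparameter setting) to try.
candidates = [("mean", {})] + [("knn", {"k": k}) for k in (1, 2, 3, 5)]
fitters = {"mean": fit_mean, "knn": fit_knn}

results = []
for family, params in candidates:
    model = fitters[family](train, **params)
    results.append((validation_mse(model), family, params))

# Keep the candidate with the lowest validation error.
best_score, best_family, best_params = min(results, key=lambda r: r[0])
print(best_family, best_params, best_score)
```

On this toy data the search selects the nearest-neighbour family with k = 2. Real AutoML tools search far larger spaces and use smarter strategies than exhaustive enumeration, but the fit, score, compare loop has the same shape.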

Hyper-Parameter Tuning

Nearly unlimited options!
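A quick count shows why the options feel nearly unlimited. The grid below is a hypothetical example for a boosted-tree model: the parameter names, value lists, fold count and per-fit training time are all assumptions chosen for illustration, not defaults of any specific library.

```python
from itertools import product
from math import prod

# Hypothetical search grid; names and values are illustrative only.
grid = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
    "max_depth": [2, 4, 6, 8, 10],
    "n_trees": [100, 250, 500, 1000],
    "subsample": [0.6, 0.8, 1.0],
    "min_child_weight": [1, 5, 10],
}

# The number of combinations is the product of the list lengths.
n_combinations = prod(len(values) for values in grid.values())
print(n_combinations)  # 4 * 5 * 4 * 3 * 3 = 720

# Enumerating the grid lazily confirms the count without storing it.
assert sum(1 for _ in product(*grid.values())) == n_combinations

minutes_per_fit = 2  # assumed training time per configuration
cv_folds = 5         # assumed number of cross-validation folds
total_hours = n_combinations * cv_folds * minutes_per_fit / 60
print(total_hours)  # 720 * 5 * 2 / 60 = 120.0
```

Five modest parameters already cost about 120 hours of training under these assumptions, and each additional parameter multiplies the total again.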

What did we learn?

• AutoML is not data science – it helps with just a piece of the overall process

• AutoML saves time by automating model optimization

• Model optimization looks (and is) painful

Agenda

• AutoML – What is it?

• AutoML Options

• What are the pitfalls?

• The Future of AutoML

• Recommendations

• Q&A

AutoML – Options

• Almost every provider in the ML space offers some sort of AutoML solution
  • Some options are FREE, others $$$

• Free solutions are typically provided as packages to be used in R or Python
  • As powerful as the $$$ options without the interface
  • Require data scientists to properly leverage the AutoML packages

• Paid options typically provide
  • Deployment options
  • Inviting interfaces that guide users developing models
  • A platform for monitoring deployed models

AutoML Options

Automl R - free: https://cran.r-project.org/web/packages/AutoML/vignettes/howto_AutoML.pdf
TPOT - free: https://github.com/EpistasisLab/tpot
caretEnsemble - free: https://cran.r-project.org/web/packages/caretEnsemble/vignettes/caretEnsemble-intro.html
Azure Automated Machine Learning - $$$: https://azure.microsoft.com/en-us/blog/announcing-automated-ml-capability-in-azure-machine-learning/
Autoxgboost - free: https://github.com/ja-thomas/autoxgboost
DataRobot - $$$: https://www.datarobot.com/
H2O - free and $$$: https://www.h2o.ai/products/h2o/
Rapidminer Auto Model - $$$: https://rapidminer.com/products/auto-model/
BigML - $$$: https://bigml.com/
Auto_WEKA - free: http://www.cs.ubc.ca/labs/beta/Projects/autoweka/
Auto-sklearn - free: http://AutoML.github.io/auto-sklearn/stable/
Cloud AutoML - $$$: https://cloud.google.com/AutoML/
MLJar - $$$: https://mljar.com/
MLBox - free: https://github.com/AxeldeRomblay/MLBox
Auto-Net 2.0: https://www.AutoML.org/wp-content/uploads/2018/12/autonet-1.pdf
Binah.NOW - $$$: https://www.binah.ai/
AutoKeras - free: https://autokeras.com/
Microsoft Neural Network Intelligence - free: https://github.com/Microsoft/nni
missinglink.ai - $$$: https://missinglink.ai/
DATAIKU - $$$: https://www.dataiku.com/
Uber’s Ludwig - free: https://uber.github.io/ludwig/

What did we learn?

• AutoML is not data science – it helps with just a piece of the overall process

• AutoML saves time by automating model optimization

• Model optimization looks painful!

• Lots of AutoML choices including open source and paid solutions

Agenda

• AutoML – What is it?

• AutoML Options

• What are the pitfalls?

• The Future of AutoML

• Recommendations

• Q&A

AutoML Pitfalls

AutoML is just one part of the data science process

• View the business as a machine

• Understand the drivers

• Measure the drivers

• Uncover problems and opportunities

• Encode algorithms - AutoML

• Measure the results

• Report financial impact

AutoML Pitfalls

• AutoML assists the data scientist only in one step of the overall process.

• While AutoML can save time for the data scientist by running many experiments automagically, it does not address the rest of the data science process tasks.
  • This is often overlooked by organizations overly eager to chase the promise of predictive analytics.

AutoML is NOT automated data science

AutoML Pitfalls

• AutoML acceptance is growing rapidly and an ever-growing number of industries rely on it.

• This success crucially relies on data science experts to perform the following tasks:

• Preprocess and clean the data

• Select and construct appropriate features

• Select an appropriate model family

• Optimize model hyperparameters

• Postprocess machine learning models

• Critically analyze the results obtained

• Determine if the model needs to be re-trained

AutoML Pitfalls

• Businesses adopting AutoML platforms as a solution and not just as a piece of the data science process workflow expose themselves to avoidable financial risk.

• At one extreme, the AutoML algorithm may under-perform others. Model performance is important.

• Skilled interpretation of model results directly impacts financial performance

AutoML Pitfalls – Results Matter

AutoML performance varies greatly across solutions

How do you pick the right one?

Logloss by dataset (lower is better)

Dataset Id   AutoML 1      AutoML 2      AutoML 3
179          0.497789992   0.315297615   0.304903671
38           0.165474774   0.045179262   0.038953435
44           0.383845093   0.136005382   0.125079404
722          0.390824138   0.032291558   0.02981047
737          0.48249143    0.330549118   0.367133044
740          0.314242093   0.225437935   0.214965592
741          0.572952603   0.567157133   0.524730876
819          0.484707493   0.295373383   0.28652554
821          0.560163406   0.254556068   0.241073802
822          0.517901444   0.232634802   0.237353078
823          0.420366818   0.066440903   0.040309454
833          0.442981634   0.394451891   0.376915054
837          0.342793913   0.243088173   0.206475294
843          0.436367941   0.267071446   0.25310065
845          0.455984028   0.314863141   0.255029891
846          0.545753158   0.246755019   0.240192515

Not all AutoML solutions are equal
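The comparison can be tallied in a few lines. The sketch below first defines log loss for the binary case (an assumption; the slide does not say which variant was used), then counts how often each solution has the lowest score. The numbers are my reading of the table on this slide. The row boundaries for dataset ids 38 and 44 are ambiguous in the extracted text, so treat those two rows as assumed.

```python
from math import log


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss: mean of -[y*log(p) + (1-y)*log(1-p)]; lower is better."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) can never occur
        total -= y * log(p) + (1 - y) * log(1 - p)
    return total / len(y_true)


# Logloss per dataset id for (AutoML 1, AutoML 2, AutoML 3), as read from the slide.
scores = {
    179: (0.497789992, 0.315297615, 0.304903671),
    38: (0.165474774, 0.045179262, 0.038953435),
    44: (0.383845093, 0.136005382, 0.125079404),
    722: (0.390824138, 0.032291558, 0.02981047),
    737: (0.48249143, 0.330549118, 0.367133044),
    740: (0.314242093, 0.225437935, 0.214965592),
    741: (0.572952603, 0.567157133, 0.524730876),
    819: (0.484707493, 0.295373383, 0.28652554),
    821: (0.560163406, 0.254556068, 0.241073802),
    822: (0.517901444, 0.232634802, 0.237353078),
    823: (0.420366818, 0.066440903, 0.040309454),
    833: (0.442981634, 0.394451891, 0.376915054),
    837: (0.342793913, 0.243088173, 0.206475294),
    843: (0.436367941, 0.267071446, 0.25310065),
    845: (0.455984028, 0.314863141, 0.255029891),
    846: (0.545753158, 0.246755019, 0.240192515),
}

names = ("AutoML 1", "AutoML 2", "AutoML 3")

# Count the datasets on which each solution achieves the lowest logloss.
wins = {name: 0 for name in names}
for row in scores.values():
    wins[names[row.index(min(row))]] += 1

print(wins)
```

On these 16 datasets AutoML 3 is best 14 times, AutoML 2 twice and AutoML 1 never. On datasets 737 and 822 the ranking of the two leaders flips, which is why a comparison on your own data matters.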

AutoML Pitfalls – Model Interpretation

Max F1 is what AutoML tools provide

An organization evaluates its overtime policy to determine which of several policies will benefit it most.

A new data scientist exclaims, “That’s easy!”

The data scientist feeds the data into an AutoML solution and behold – an answer! The data scientist reports to management that the Max F1 score is 54%.

The answer is wrong. (Missed by ~ $340k)
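Why can a Max F1 answer be wrong for the business? F1 weighs false positives and false negatives equally, while their dollar costs are rarely equal. The sketch below is a hypothetical illustration, not the case from the slide: the scored cases and both cost figures are invented. It finds the classification threshold that maximizes F1 and the one that minimizes cost, then compares them.

```python
# Hypothetical scored cases: (predicted probability, actual outcome).
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 0), (0.3, 1), (0.2, 0), (0.1, 0), (0.05, 0)]

COST_FALSE_NEGATIVE = 50_000  # assumed cost of missing a true case
COST_FALSE_POSITIVE = 5_000   # assumed cost of acting on a false alarm


def confusion(threshold):
    """Count TP, FP, FN when every score >= threshold is flagged positive."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    return tp, fp, fn


def f1(threshold):
    """F1 score at the given threshold."""
    tp, fp, fn = confusion(threshold)
    return 2 * tp / (2 * tp + fp + fn)


def cost(threshold):
    """Total dollar cost of the errors made at the given threshold."""
    _, fp, fn = confusion(threshold)
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE


thresholds = sorted({s for s, _ in scored})
best_f1_threshold = max(thresholds, key=f1)
best_cost_threshold = min(thresholds, key=cost)

print(best_f1_threshold, f1(best_f1_threshold), cost(best_f1_threshold))
print(best_cost_threshold, f1(best_cost_threshold), cost(best_cost_threshold))
```

With these invented numbers, the Max F1 threshold (0.6, F1 of 0.75) costs 55,000, while the cost-minimizing threshold (0.3) has a lower F1 yet costs only 15,000. Choosing the right threshold takes knowledge of the business costs, which the tool does not have.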

What did we learn?

• AutoML is not data science – it helps with just a piece of the overall process

• AutoML saves time by automating model optimization

• Model optimization looks painful!

• Lots of AutoML options, both open source and paid solutions

• There is no easy button for data science

• There can be significant differences between AutoML solutions

• Data Science training is required to interpret AutoML results with business domain expertise

Agenda

• AutoML – What is it?

• AutoML Options

• What are the pitfalls?

• The Future of AutoML

• Recommendations

• Q&A

Future of AutoML

• The future of AutoML is coupled with the future of data science.

• The future of data science will likely follow the path of past innovations:
  • A calculator was once a person
  • Webmaster was once a hot career
  • Microsoft Office was once a typing pool (does anyone but me recall those days?)

• The data science role will change (is changing?), and AutoML is the first visible evidence (data scientists will still have a job; it simply changes).

• Machine learning will become industry-specific, and perhaps so will AutoML.

Future of AutoML

• Data scientists will have different enterprise roles once tools democratize machine learning.

• Entry-level data scientists will interpret data, make it usable and focus on educating end users.

• Advanced data scientists will improve the performance of algorithms and design new generalized approaches.

• Domain expertise with data science skills will become a key role in the future. These data scientists will apply data science techniques and tools in specific verticals like manufacturing, healthcare and finance.

• The data science practice manager will emerge as a role, held by people with proven abilities to build effective teams and the processes that drive the business to greater success.

Future of AutoML

TODAY

Currently, selecting the “best” algorithm to use per dataset requires a level of intuition or expertise about the data. Data scientists leverage their experience to experiment with different combinations of models and hyperparameter values to achieve the highest accuracy.

TOMORROW

AutoML will lessen the data scientist's dependency on intuition by iteratively trying out an algorithm, scoring its performance, and choosing and refining other models. AutoML solutions will evolve into platforms that automate more of the data science workflow.
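The "try, score, choose, refine" cycle can be sketched as a coarse-to-fine search over a single setting. Everything here is a stand-in: `validation_loss` is a hypothetical function playing the role of "train a model and score it", and the search range and round count are arbitrary assumptions.

```python
def validation_loss(log10_learning_rate):
    """Stand-in for 'train a model and score it'; hypothetical, minimised at -2."""
    return (log10_learning_rate + 2.0) ** 2 + 0.1


def coarse_to_fine(low, high, rounds=4, points=5):
    """Try evenly spaced settings, keep the best, then search a narrower range."""
    best, best_loss = None, None
    for _ in range(rounds):
        step = (high - low) / (points - 1)
        trials = [low + i * step for i in range(points)]      # try
        scored = [(validation_loss(t), t) for t in trials]    # score
        best_loss, best = min(scored)                         # choose
        quarter = (high - low) / 4
        low, high = best - quarter, best + quarter            # refine
    return best, best_loss


best_setting, best_loss = coarse_to_fine(-5.0, 0.0)
print(best_setting, best_loss)
```

After four rounds the search settles within 0.05 of the true optimum at -2, using 20 evaluations in total. No intuition about where to look was needed beyond the initial range, which is the dependency the slide expects AutoML to reduce.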

Future of AutoML

• AutoML systems will become mainstream in the machine learning world.
  • Will include data prep, industry-specific feature engineering, API deployment, monitoring and retraining

• Future AutoML will be interactive, with the user and the AutoML system working together

• The AutoML system will learn in real time from the user's experience and adapt its optimization process

• Because of AutoML, the data science role will fade and become a business science role

What did we learn?

• AutoML is not data science – it helps with just a piece of the overall process

• AutoML saves time by automating model optimization

• Model optimization looks painful!

• Lots of AutoML options, both open source and paid solutions

• There is no easy button for data science

• There can be significant differences between AutoML solutions

• Training is required to interpret AutoML results with business domain expertise

• AutoML will evolve dramatically and become a mainstream tool – the future is bright

• The data scientist role will evolve in the coming years – the future is bright!

Agenda

• AutoML – What is it?

• AutoML Options

• What are the pitfalls?

• The Future of AutoML

• Recommendations

• Q&A

Recommendations

• AutoML should be viewed as a tool; it is not a solution or a substitute for data scientists

• Data Scientists must provide guidance and oversight to AutoML implementations

• AutoML solutions will advance quickly – time your investments wisely. (It's going to be exciting.)

• Contrast and compare AutoML solutions – there are differences that matter. Shop around, become informed!

• Start small – your organization is not ready for a big-bang deployment (no one is). Requires process!

Questions?

Cliff Weaver

www.rismyhammer.com

https://www.linkedin.com/in/cliffordweaver/