Automated Machine Learning (AutoML) and Pentaho Caio Moreno de Souza Pentaho Senior Consultant, Hitachi Vantara

Post on 22-Nov-2018



Agenda

We will discuss how Automated Machine Learning (AutoML) and Pentaho, together, can help customers save time when creating a model and deploying it into production.

• Business case for Automated Machine Learning (AutoML) and Pentaho;

• High-level overview of Automated Machine Learning (AutoML);

• Demonstrations (Pentaho + AutoML).

The Perfect Model Does Not Exist

“All models are wrong, but some are useful.”

– GEORGE BOX, 1919-2013

Business Case for AutoML and Pentaho

• Finding the correct machine learning algorithm is not an easy task.

• You need to find a balance between the time you would need to spend and the time you can actually spend on the ML problem.

• To create a good model, you need to know the problem and its variables and instances very well, prepare the data, do feature engineering, and test different algorithms.

• Some data scientists will also say to add a little bit of MAGIC :)

• And, of course, in most cases, a lot of computing power.

Machine Learning High-Level Overview

What is Automated Machine Learning (AutoML)?

Illustration by Shyam Sundar Srinivasan

“Machine learning is very successful, but its successes crucially rely on human machine learning experts, who select appropriate ML architectures (deep learning architectures or more traditional ML workflows) and their hyperparameters. As the complexity of these tasks is often beyond non-experts, the rapid growth of machine learning applications has created a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge. We call the resulting research area that targets progressive automation of machine learning AutoML.”
https://sites.google.com/site/automl2016/

Why Automated Machine Learning (AutoML)?

• The demand for machine learning experts has outpaced the supply. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts and experts alike.

• AutoML software can be used for automating a large part of the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit.
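
The idea of a user-specified time limit can be sketched without any AutoML library. The following is a minimal illustration in plain Python, not a description of how any tool in this deck works internally: it tries random candidate models (here, hypothetical slope/intercept pairs for a straight line) until the time budget is spent and keeps the best one. The function names and the toy data are assumptions made for this example.

```python
import random
import time


def evaluate(candidate, data):
    """Mean squared error of the line y = slope * x + intercept on the data."""
    slope, intercept = candidate
    return sum((slope * x + intercept - y) ** 2 for x, y in data) / len(data)


def timed_random_search(data, time_limit_secs, seed=0):
    """Try random candidates until the time budget runs out; keep the best."""
    rng = random.Random(seed)
    deadline = time.monotonic() + time_limit_secs
    best_candidate, best_error, tried = None, float("inf"), 0
    # Always evaluate at least one candidate, even with a tiny budget.
    while tried == 0 or time.monotonic() < deadline:
        candidate = (rng.uniform(-5, 5), rng.uniform(-5, 5))
        error = evaluate(candidate, data)
        tried += 1
        if error < best_error:
            best_candidate, best_error = candidate, error
    return best_candidate, best_error, tried


# Toy data that follows y = 2x + 1 exactly.
toy_data = [(x, 2 * x + 1) for x in range(10)]
best, error, tried = timed_random_search(toy_data, time_limit_secs=0.2)
print(f"tried {tried} candidates, best {best}, error {error:.4f}")
```

A larger budget lets the loop try more candidates, so the best error can only stay the same or improve. Real AutoML tools typically search far more cleverly than this, but the contract is similar: a time budget goes in, the best model found so far comes out.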

What is NOT Automated Machine Learning (AutoML)?

• AutoML is not automated data science;

• AutoML will not replace data scientists;
  – All the methods of automated machine learning are developed to support data scientists, not to replace them.
  – AutoML is meant to free data scientists from the burden of repetitive and time-consuming tasks (e.g., machine learning pipeline design and hyperparameter optimization) so they can better spend their time on tasks that are much more difficult to automate.

AutoML Tools

• AutoWeka (Open Source)
  – http://www.cs.ubc.ca/labs/beta/Projects/autoweka/

• H2o.ai AutoML (Open Source)
  – https://www.h2o.ai/

• TPOT (Open Source)
  – https://github.com/rhiever/tpot

• AutoSklearn (Open Source)
  – https://github.com/automl/auto-sklearn
  – http://automl.github.io/auto-sklearn/stable/

• machineJS (Open Source)
  – https://github.com/ClimbsRocks/machineJS

PDI + AutoML

Machine Learning with Pentaho in 4 Steps

http://www.pentaho.com/blog/4-steps-machine-learning-pentaho

CRISP-DM

http://www.pentaho.com/blog/4-steps-machine-learning-pentaho

Business Understanding

Data Understanding

Data Preparation

Modeling

Evaluation

Deployment

Data

Use Case: AutoML + Pentaho

• Our users have a well-defined ML problem and the initial version of the dataset (train and test).

• Unfortunately, they haven’t created an ML model yet.

• Also, they have no idea how to create it.

• And they want us to help them create it as soon as possible using only Open Source tools.

The Journey

• If you embark on this journey, you can stay stuck on this problem forever…

…or you can find quick ways to do it in a specified time.

• Customers can then spend enough time later to improve their current model.

• The next steps will be:
  – Hire a data scientist or a team of data scientists;
  – Hire a domain expert for that problem.

Our Goal

• In this specific scenario, our goal will be to help them start the process of creating a dummy model using AutoML.

Create Your First ML Model

1. Define the problem;

2. Analyze and prepare the data;

3. Select algorithms (start simple);

4. Run and evaluate the algorithms;

5. Improve the results with focused experiments;

6. Finalize results with fine tuning.
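
Steps 3 and 4 ("start simple", then "run and evaluate") can be illustrated with a small plain-Python sketch. It compares a majority-class baseline with a 1-nearest-neighbor classifier on a tiny hold-out set. The data, the choice of algorithms and all function names are assumptions made for this illustration; they are not taken from the demo.

```python
from collections import Counter

# Hypothetical toy data, already prepared: ((feature_1, feature_2), label).
train = [((1.0, 1.2), "a"), ((0.9, 1.0), "a"), ((1.1, 0.8), "a"),
         ((3.0, 3.1), "b"), ((3.2, 2.9), "b")]
test = [((1.0, 0.9), "a"), ((3.1, 3.0), "b"), ((2.9, 3.3), "b")]


def majority_baseline(train_rows):
    """Simplest possible model: always predict the most common training label."""
    most_common = Counter(label for _, label in train_rows).most_common(1)[0][0]
    return lambda features: most_common


def nearest_neighbor(train_rows):
    """1-nearest-neighbor: predict the label of the closest training point."""
    def predict(features):
        def squared_distance(row):
            return sum((a - b) ** 2 for a, b in zip(row[0], features))
        return min(train_rows, key=squared_distance)[1]
    return predict


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(features) == label for features, label in rows) / len(rows)


results = {
    "majority baseline": accuracy(majority_baseline(train), test),
    "1-nearest neighbor": accuracy(nearest_neighbor(train), test),
}
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Starting with a trivial baseline gives a reference score: any algorithm tried in steps 5 and 6 should at least beat it.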

Sample Dataset

• More data is better, but more data means more complexity.

• More data means more time that you will have to spend on your problem.

• Why not create a sample dataset?!
  – Create 1 to 20 datasets to test your problem and create your models.
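
One way to create such sample datasets is simple random sampling without replacement. The sketch below uses only the Python standard library; the function name, the row layout and the sample sizes are hypothetical, and in the demo this kind of sampling could equally be done in PDI.

```python
import random


def make_sample_datasets(rows, n_samples, sample_size, seed=42):
    """Draw n_samples random subsets (without replacement) of sample_size rows."""
    if not 1 <= n_samples <= 20:
        raise ValueError("the slide suggests between 1 and 20 sample datasets")
    rng = random.Random(seed)  # fixed seed so the samples are reproducible
    return [rng.sample(rows, sample_size) for _ in range(n_samples)]


# Hypothetical full dataset: 1,000 rows of (id, value).
full_dataset = [(i, i * 0.5) for i in range(1000)]
samples = make_sample_datasets(full_dataset, n_samples=5, sample_size=100)
print(len(samples), "samples of", len(samples[0]), "rows each")
```

Training on several small samples first gives quick feedback; the full dataset can be used once the approach looks promising.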

Demo AutoML + Pentaho

• This presentation aims to demo how AutoML open source tools and Pentaho, together, can help customers save time when creating a model and deploying it into production.

The Power of PDI

• PDI (Pentaho Data Integration) will help data scientists and data engineers with data onboarding, data preparation, data blending, model orchestration (model and predict), and saving and visualizing the data.

Data Onboarding, Data Preparation and Data Blending

• Below we can see a Data Preparation Process using PDI (Pentaho Data Integration);

• ML dataset output: ARFF File (Weka File), CSV (Python, R and Apache Spark MLlib) and Hadoop Output to save the txt file to the Data Lake;
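
For readers who have not seen the ARFF format that Weka reads, the sketch below renders a few rows as ARFF text in plain Python. It only illustrates the file layout (a relation name, attribute declarations, then comma-separated data rows); in the demo the file is produced by PDI. The function name and the sample rows are hypothetical, and the sketch assumes attribute names and values contain no spaces or commas, so no quoting is needed.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes is a list of (name, type) pairs where type is either the
    string "numeric" or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attribute: list the allowed values in braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


attributes = [("sepal_length", "numeric"), ("species", ["setosa", "versicolor"])]
rows = [(5.1, "setosa"), (7.0, "versicolor")]
arff_text = to_arff("iris_sample", attributes, rows)
print(arff_text)
```

The same rows written as plain comma-separated lines with a header would serve as the CSV output for Python, R or Spark MLlib.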

Predicting New Values Using Your Model

Demonstration

Demo Agenda

What we will cover in the demo:

• Data Preparation with PDI;

• Model creation using AutoML Tool;

• Model Deployment with PDI;

Pentaho Data Integration + H2O AutoML

Summary

What we covered today:

• Business case for Automated Machine Learning (AutoML) and Pentaho;

• High-level overview of Automated Machine Learning (AutoML);

• Demonstrations (Pentaho + AutoML).

Next Steps

Want to learn more?

• Talk to me during Pentaho World 2017 or send me an e-mail: caio.moreno@HitachiVantara.com;

• Meet-the-Experts:
  – https://www.pentahoworld.com/meet-the-experts

Appendices

Top Prediction Algorithms

• According to Dataiku, the top prediction algorithms are the ones explained in the image on the right side.

• This image also summarizes the advantages and disadvantages of each algorithm.

Source: https://blog.dataiku.com/machine-learning-explained-algorithms-are-your-friend

Algorithms

The Rexer Analytics Data Science Survey* gives us a good idea of which algorithms have been used over the years.

* Special thanks to Mark Hall (Pentaho) for sharing this document with me. Document available at: http://www.rexeranalytics.com/data-science-survey.html

Core Algorithms

Source: http://www.rexeranalytics.com/files/Rexer_Data_Science_Survey_Highlights_Apr-2016.pdf

Tools

• The huge number of tools increases the complexity.

Source: http://www.rexeranalytics.com/files/Rexer_Data_Science_Survey_Highlights_Apr-2016.pdf

AutoWeka

• AutoWeka
  – Provides automatic selection of models and hyperparameters for WEKA.
  – http://www.cs.ubc.ca/labs/beta/Projects/autoweka/

• Open datasets for AutoWeka
  – http://www.cs.ubc.ca/labs/beta/Projects/autoweka/datasets/

AutoSklearn

• AutoWeka inspired the authors of AutoSklearn;

• AutoSklearn
  – auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.
  – https://github.com/automl/auto-sklearn
  – http://automl.github.io/auto-sklearn/stable/

Types of ML Problems with AutoML

• The types of Machine Learning problems that we can solve using AutoWeka and AutoSklearn are Classification, Regression and Clustering:
  – Classification and Regression are already supported in Auto-sklearn & Auto-WEKA.
  – For clustering, you can use them as long as you have an objective function to optimize.

Automated by TPOT

• TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.

https://github.com/rhiever/tpot

AutoML Tools Installation

Installing AutoWeka

• To install AutoWeka, go to Weka Package Manager > Search for Auto-WEKA and click the “Install” button.

Installing TPOT

• Command to install TPOT
  – $ pip install tpot

• Learn more:
  – http://rhiever.github.io/tpot/installing/

Installing AutoSklearn on Ubuntu

• Use the documentation below to help you:
  – http://automl.github.io/auto-sklearn/stable/

• Run these commands in an Ubuntu terminal:
  – $ conda install gcc swig
  – $ curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
  – $ sudo apt-get install build-essential swig
  – $ pip install -U auto-sklearn

Error AutoSklearn on Ubuntu

• Error reported on June 14th, 2017. Solution sent on the same day.

• Check the GitHub link below to find the solution: https://github.com/automl/auto-sklearn/issues/308

Installing H2O.ai

• To install H2O.ai AutoML, visit the websites:
  – https://blog.h2o.ai/2017/06/automatic-machine-learning/
  – https://www.h2o.ai/

AutoML Demonstration

Using AutoWeka

• timeLimit = You can define the time in minutes that you want AutoWeka to use to run and find the best option.
  – More time = better results

• You can run AutoWeka from the Weka Explorer User Interface

• For better performance, try giving Auto-WEKA more time

• AutoWeka output results

Testing AutoSklearn

• Open Spyder and test the code below:

Source code: http://automl.github.io/auto-sklearn/stable/

Testing AutoSklearn with Iris Dataset
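
The code shown on the original slide is not part of this transcript. As a stand-in, here is a minimal sketch of fitting auto-sklearn on the Iris dataset, assuming auto-sklearn and scikit-learn are installed. The function name and the time limits are illustrative choices, not values from the presentation.

```python
def run_autosklearn_on_iris(time_limit_secs=120):
    """Fit auto-sklearn on Iris and return the hold-out accuracy."""
    # Imported inside the function so that defining it does not require
    # auto-sklearn; calling it does.
    import autosklearn.classification
    import sklearn.datasets
    import sklearn.metrics
    import sklearn.model_selection

    X, y = sklearn.datasets.load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X, y, random_state=1
    )
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=time_limit_secs,  # total search budget in seconds
        per_run_time_limit=30,                    # cap for a single model fit
    )
    automl.fit(X_train, y_train)
    return sklearn.metrics.accuracy_score(y_test, automl.predict(X_test))


# Usage (needs auto-sklearn installed):
#     print("Accuracy:", run_autosklearn_on_iris())
```

As with AutoWeka's timeLimit, the main knob is the time budget: the search runs for roughly `time_limit_secs` seconds before returning.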

Testing H2o.ai AutoML

To test H2O AutoML, you need to install version 3.11.0.3888 or later.
http://h2o-release.s3.amazonaws.com/h2o/rel-vapnik/1/index.html

https://github.com/caiomsouza/machine-learning-orchestration/blob/master/AutoML/src/r/h2o-automl/H20_AutoML_Example.R

# x, y, train and test are defined earlier in the script linked above
aml <- h2o.automl(x = x, y = y, training_frame = train, leaderboard_frame = test, max_runtime_secs = 30)

# View the AutoML Leaderboard
lb <- aml@leaderboard
lb

Demo AutoML (AutoWeka) + Pentaho

• Using AutoWeka from the Weka User Interface, we created a first “dummy” model in 15 minutes.

• AutoWeka will output the best model created in the time specified. This model can then be used to predict new values.

AutoWeka output

No Free Lunch Theorem

https://ti.arc.nasa.gov/m/profile/dhw/papers/78.pdf

http://www.no-free-lunch.org/

http://philosophy.wisc.edu/forster/papers/Krakow.pdf