TRANSCRIPT
Introduction to Machine Learning Summer School, June 18, 2018 - June 29, 2018, Chicago
Instructor: Suriya Gunasekar, TTI Chicago
27 June 2018
Day 8: Ensemble methods, boosting
Topics so far
• Linear regression
• Classification
  o Logistic regression
  o Maximum margin classifiers, kernel trick
  o Generative models
  o Neural networks, backpropagation, NN training – optimization and regularization, special architectures – CNNs, RNNs, encoder-decoder
• Remaining topics
  o Ensemble methods, boosting
  o Unsupervised learning – clustering, dimensionality reduction
  o Review and topics not covered!
Ensemble learning
• Ensemble learning
  o Create a population of base learners f_1, f_2, ..., f_M : 𝒳 → 𝒴
  o Combine the predictors to form a composite predictor
• Example in classification with 𝒴 = {−1, 1}: assign "votes" α_m to each classifier f_m and take a weighted-majority vote
    F(x) = sign(Σ_{m=1}^{M} α_m f_m(x))
  o Individual classifiers can be very simple, e.g., thresholds on single features such as x_1 ≥ 10 or x_j ≤ 5
• Why?
  o more powerful models → reduce bias
    § e.g., a majority vote of linear classifiers can give decision boundaries that are intersections of halfspaces
  o reduce variance
    § averaging classifiers f_1, f_2, ..., f_M trained independently on different i.i.d. datasets S_1, S_2, ..., S_M can reduce the variance of the composite classifier
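To make the weighted-majority vote concrete, here is a minimal sketch in Python. The two toy stumps (stump_a, stump_b), their feature indices/thresholds, and the votes are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

# Two toy "simple" classifiers (decision stumps) mapping x in R^d to {-1, +1}.
# Feature indices and thresholds below are made up for illustration.
def stump_a(x):  # +1 if the first feature is at least 10
    return 1.0 if x[0] >= 10 else -1.0

def stump_b(x):  # +1 if the second feature is at most 5
    return 1.0 if x[1] <= 5 else -1.0

def weighted_majority(x, classifiers, alphas):
    """Composite predictor F(x) = sign(sum_m alpha_m * f_m(x))."""
    score = sum(a * f(x) for f, a in zip(classifiers, alphas))
    return np.sign(score)

# Example: combine the two stumps with votes alpha = (0.7, 0.3)
x = np.array([12.0, 3.0])
print(weighted_majority(x, [stump_a, stump_b], [0.7, 0.3]))  # -> 1.0
```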
Reducing bias using ensembles
Decision trees
• Each non-leaf node tests a binary condition on some feature x_j
  o if the condition is satisfied then go left, else go right
  o leaf nodes have a label (typically the majority class of the training examples reaching that node)
• Classifying a point with a decision tree can be seen as a sequence of classifiers refined as we follow the path to a leaf.
Combining "simple" models
• Smooth-ish tradeoff between bias and complexity
  o start with simple models with large bias and low variance
  o learn more complex classes by composing simple models
• For example, consider classifiers f_1, f_2, ..., f_M each based on only one feature (decision stumps), i.e.,
    f_m(x; θ_m) = 1[x_{k_m} ≥ τ_m], where θ_m = (k_m, τ_m)
• ℋ = {x → majority(α_1 f_1(x; θ_1), α_2 f_2(x; θ_2), ..., α_M f_M(x; θ_M))} contains very complex boundaries
• demo (by Nati Srebro)
• So clearly combining simple classifiers can reduce bias. How do we combine classifiers?
[Figure from the demo; panel title: "reality". Figure credit: Nati Srebro]
Combining "simple" models
• Given a family of models f_1, f_2, ... : 𝒳 → 𝒴, how do we combine them?
• Weighted averaging of models:
  o parameterize the combined classifier using α_m as
      F_α(x) = Σ_{m=1}^{M} α_m f_m(x)
  o minimize the loss over the combined model
      min_α Σ_{i=1}^{N} ℓ(F_α(x^(i)), y^(i))
• Alternative algorithm: greedy approach
  o F_0(x) = 0
  o for each round t = 1, 2, ..., T
    § find the best model to minimize the incremental change from F_{t−1}
        min_{α_t, f_t} Σ_{i=1}^{N} ℓ(F_{t−1}(x^(i)) + α_t f_t(x^(i)), y^(i))
  o Output classifier F_T(x) = Σ_{t=1}^{T} α_t f_t(x)
AdaBoost
Training data S = {(x^(i), y^(i)) : i = 1, 2, ..., N}
• Maintain weights W_i^t for each example (x^(i), y^(i)), initially all W_i^1 = 1/N
• For t = 1, 2, ..., T
  o Normalize weights: D_i^t = W_i^t / Σ_j W_j^t
  o Pick a classifier f_t with weighted loss better (lower) than 0.5, where ε_t = Σ_{i=1}^{N} D_i^t ℓ_{01}(f_t(x^(i)), y^(i))
  o Set α_t = (1/2) log((1 − ε_t)/ε_t)
  o Update weights: W_i^{t+1} = W_i^t exp(−α_t y^(i) f_t(x^(i)))
• Output strong classifier F_T(x) = sign(Σ_t α_t f_t(x))
Example credit: Greg Shakhnarovich
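The slides step through this algorithm on a worked example; as a hedged illustration (not the lecture's demo code), a minimal Python sketch of AdaBoost with decision stumps as the weak learners, on made-up toy data, might look like this:

```python
import numpy as np

def best_stump(X, y, D):
    """Weak learner: exhaustively pick the (+/-1-valued) decision stump
    f(x) = s * sign(x[k] - tau) with the lowest weighted 0-1 loss under D."""
    n, d = X.shape
    best = (np.inf, None)  # (weighted error, (feature, threshold, sign))
    for k in range(d):
        for tau in np.unique(X[:, k]):
            for s in (+1.0, -1.0):
                pred = s * np.where(X[:, k] >= tau, 1.0, -1.0)
                err = np.sum(D * (pred != y))
                if err < best[0]:
                    best = (err, (k, tau, s))
    return best

def adaboost(X, y, T):
    """AdaBoost: returns the chosen stumps and their votes alpha_t."""
    n = len(y)
    W = np.full(n, 1.0 / n)              # example weights W_i^1 = 1/N
    stumps, alphas = [], []
    for t in range(T):
        D = W / W.sum()                  # normalized weights D_i^t
        eps, (k, tau, s) = best_stump(X, y, D)
        if eps >= 0.5:                   # no weak learner better than chance
            break
        eps = max(eps, 1e-12)            # guard against a perfect stump (eps = 0)
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = s * np.where(X[:, k] >= tau, 1.0, -1.0)
        W = W * np.exp(-alpha * y * pred)    # up-weight the mistakes
        stumps.append((k, tau, s))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    """Strong classifier F_T(x) = sign(sum_t alpha_t f_t(x))."""
    score = np.zeros(len(X))
    for (k, tau, s), a in zip(stumps, alphas):
        score += a * s * np.where(X[:, k] >= tau, 1.0, -1.0)
    return np.sign(score)

# Tiny usage example with made-up data (labels in {-1, +1})
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
stumps, alphas = adaboost(X, y, T=5)
print(predict(X, stumps, alphas))   # should match y on this toy set
```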
AdaBoost
• Demo again (code by Nati Srebro)
• What are we doing in AdaBoost?
  o An algorithm for building ensembles
  o Learning sparse linear predictors over a large (possibly infinite) dimensional feature space
    § Sparsity controls complexity
    § The number of iterations controls sparsity ⇒ early stopping acts as regularization
  o Coordinate descent on the exponential loss (briefly, next)
• Variants of AdaBoost
  o FloatBoost: after each round, check whether removing a previously added classifier is helpful.
  o Totally corrective AdaBoost: update the α's for all weak classifiers selected so far by minimizing the loss
Exponential loss
• Exponential loss ℓ(f(x), y) = exp(−f(x) y), another surrogate loss
• Ensemble classifier F_α(x) = sign(Σ_t α_t f_t(x))
• We will not derive it, but one can show that the AdaBoost updates correspond to coordinate descent on ERM with the exponential loss:
    min_α Σ_{i=1}^{N} exp(−Σ_t α_t f_t(x^(i)) y^(i))
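The slide skips the derivation; as a short worked step (standard, not in the transcript), holding the previously chosen terms fixed and minimizing the weighted per-round exponential loss over the new vote α_t alone recovers the AdaBoost choice of α_t:

\[
L(\alpha_t) \;=\; \sum_{i=1}^{N} D_i^t \exp\!\big(-\alpha_t\, y^{(i)} f_t(x^{(i)})\big)
\;=\; (1-\epsilon_t)\, e^{-\alpha_t} \;+\; \epsilon_t\, e^{\alpha_t},
\qquad
\frac{dL}{d\alpha_t} = 0 \;\Rightarrow\; \alpha_t = \tfrac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t},
\]

where the split uses the fact that correctly classified examples (total weight 1 − ε_t) have y^(i) f_t(x^(i)) = +1 and misclassified ones (total weight ε_t) have −1.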
Example: Viola-Jones Face Detector
• Classify each square in an image as "face" or "no-face"
• 𝒳 = patches of 24x24 pixels, say
Slide credit: Nati Srebro
Viola-Jones "Weak Predictors"/Features
ℬ = {1[g_{r,t}(x) < θ] | θ ∈ ℝ, rectangle r in the image, t ∈ {A, B, C, D, Ā, B̄, C̄, D̄}}
where g_{r,t}(x) = sum of "blue" pixels − sum of "red" pixels
[Figure: the four rectangle feature patterns A, B, C, D]
Slide credit: Nati Srebro
Viola-Jones Face Detector
• Simple implementation of boosting using generic (non-face-specific) "weak learners"/features
  o Can also be used for detecting other objects
• Efficient method using dynamic programming and caching to find good weak predictors
• About 1 million possible g_{r,t}, but only very few are used in the returned predictor
• Sparsity:
  ⇒ generalization
  ⇒ prediction speed! (and small memory footprint)
• To run in real time (on a 2001 laptop), use sequential evaluation
  o First evaluate the first few h_t to get a rough prediction
  o Only evaluate additional h_t on patches where the leading ones are promising
Slide credit: Nati Srebro
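One standard efficiency trick in the original Viola-Jones detector (possibly what the slide's "dynamic programming and caching" refers to) is the integral image: precompute cumulative sums once per image so that any rectangle sum, and hence any g_{r,t}(x), costs only four lookups. A minimal sketch (illustrative, not the lecture's code):

```python
import numpy as np

def integral_image(img):
    """Cumulative sums over rows and columns, padded so that ii[r, c] equals
    the sum of img[:r, :c]; built once per image in O(#pixels)."""
    return np.pad(img.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle via 4 lookups into the integral image."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])

# Example: one two-rectangle feature g(x) = sum("blue" half) - sum("red" half)
patch = np.random.rand(24, 24)
ii = integral_image(patch)
g = rect_sum(ii, 0, 12, 24, 12) - rect_sum(ii, 0, 0, 24, 12)
print(abs(g - (patch[:, 12:].sum() - patch[:, :12].sum())) < 1e-9)  # True
```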
Ensembling to reduce variance
Averaging predictors
• Averaging reduces variance: if Z_1, Z_2, ..., Z_M are independent random variables, each with mean μ and variance σ², then
    var((1/M) Σ_{m=1}^{M} Z_m) = σ²/M
• What happens to the mean?
    E[(1/M) Σ_{m=1}^{M} Z_m] = μ
• If we had M models f_1, f_2, ..., f_M trained independently on different i.i.d. datasets S_1, S_2, ..., S_M, then averaging the outputs of the models would
  o reduce variance: the composite will be less sensitive to the specific training data
  o without increasing the bias: on average all classifiers will do as well
• But we have only one dataset! How do we get multiple models?
  o Remember, the models have to be independent!
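A quick empirical check of the σ²/M claim (a toy simulation, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
M, sigma, trials = 10, 2.0, 100_000

# Each trial: average M independent draws with std sigma; compare variances.
Z = rng.normal(loc=0.0, scale=sigma, size=(trials, M))
print(Z[:, 0].var())         # ~ sigma^2       (a single predictor)
print(Z.mean(axis=1).var())  # ~ sigma^2 / M   (average of M predictors)
```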
Bagging: Bootstrap aggregation
• Averaging independent models reduces variance without increasing bias.
• But we don't have independent datasets!
  o Instead, take repeated bootstrap samples from the training set S
• Bootstrap sampling: Given a dataset S = {(x^(i), y^(i)) : i = 1, 2, ..., N}, create S' by drawing N examples at random with replacement from S
• Bagging (a code sketch follows below):
  o Create M bootstrap datasets S_1, S_2, ..., S_M
  o Train distinct models f_m : 𝒳 → 𝒴 by training only on S_m
  o Output final predictor F(x) = (1/M) Σ_{m=1}^{M} f_m(x) (for regression) or F(x) = majority(f_m(x)) (for classification)
Figure credit: David Sontag
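As referenced above, a minimal bagging sketch in Python; the base learner (scikit-learn's DecisionTreeClassifier) and the toy data are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed base learner for illustration

def bagging_fit(X, y, M, rng):
    """Train M trees, each on a bootstrap sample (N draws with replacement)."""
    N = len(y)
    models = []
    for _ in range(M):
        idx = rng.integers(0, N, size=N)          # bootstrap sample S_m
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the M trees (classification version)."""
    votes = np.stack([m.predict(X) for m in models])   # shape (M, n_points)
    return np.sign(votes.mean(axis=0))                 # works for labels in {-1, +1}

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)          # toy nonlinear labels
models = bagging_fit(X, y, M=25, rng=rng)
print((bagging_predict(models, X) == y).mean())         # training accuracy
```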
Bagging
• Most effective when combining high-variance, low-bias predictors
  o unstable non-linear predictors like decision trees
  o the "overfitting quirks" of different trees cancel out
• Not very useful with linear predictors
• Useful property of bagging: "out of bag" (OOB) data
  o for each "bag", treat the examples that didn't make it into the bag as a kind of validation set
  o while learning the predictors, keep track of OOB accuracy
Bagging example
[Figure: output of a single decision tree vs. 100 bagged trees]
Slide/example credit: David Sontag
Random forests
• Ensemble method specifically built for decision trees
• Two sources of randomness
  o Sample bagging: each tree is grown on a bootstrapped training dataset
  o Feature bagging: at each node, the best split is decided over only a random subset of features → increases diversity among the trees
• Algorithm (a usage sketch follows below)
  o Create bootstrapped datasets S_1, S_2, ..., S_M
  o For each m, grow a decision tree T_m by repeating the following at each node until some stopping condition:
    § select K features at random from the d features of x
    § pick the best variable/split threshold among the K selected features
    § split the node into two child nodes based on the above condition
  o Output the majority vote of {T_m}_{m=1}^{M}
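In practice one rarely codes this by hand; as a hedged usage sketch (the data are made up), scikit-learn's RandomForestClassifier exposes the two sources of randomness directly: n_estimators bootstrapped trees and max_features features considered at each split, with oob_score reusing the "out of bag" examples mentioned earlier as a built-in validation set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)       # toy labels, for illustration only

# 100 trees, each grown on a bootstrap sample; sqrt(d) features tried at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                oob_score=True, random_state=0).fit(X, y)
print(forest.oob_score_)                      # OOB accuracy estimate
```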
Ensembles summary
• Reduce bias:
  o build an ensemble of low-variance, high-bias predictors sequentially to reduce bias
  o AdaBoost: binary classification, exponential surrogate loss
• Reduce variance:
  o build an ensemble of high-variance, low-bias predictors in parallel and use randomness and averaging to reduce variance
  o random forests, bagging
• Problems
  o Computationally expensive (at train and test time)
  o Often lose interpretability