TRANSCRIPT
Introduction to Machine Learning Summer School, June 18, 2018 - June 29, 2018, Chicago
Instructor: Suriya Gunasekar, TTI Chicago
27 June 2018
Day 8: Ensemble methods, boosting
Topics so far
• Linear regression
• Classification
  o Logistic regression
  o Maximum margin classifiers, kernel trick
  o Generative models
  o Neural networks, backpropagation, NN training – optimization and regularization, special architectures – CNNs, RNNs, encoder-decoder
• Remaining topics
  o Ensemble methods, boosting
  o Unsupervised learning – clustering, dimensionality reduction
  o Review and topics not covered!
Ensemble learning
• Ensemble learning
  o Create a population of base learners f_1, f_2, ..., f_M : 𝒳 → 𝒴
  o Combine the predictors to form a composite predictor
• Example in classification with 𝒴 = {−1, 1}: assign "votes" α_m to each classifier f_m and take a weighted-majority vote
    F(x) = sign(Σ_{m=1}^{M} α_m f_m(x))
  o Individual classifiers can be very simple, e.g., thresholds on single features such as x_1 ≥ 10 or x_j ≤ 5
• Why?
  o more powerful models → reduce bias
    § e.g., a majority vote of linear classifiers can give decision boundaries that are intersections of halfspaces
  o reduce variance
    § averaging classifiers f_1, f_2, ..., f_M trained independently on different i.i.d. datasets S_1, S_2, ..., S_M can reduce the variance of the composite classifier
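To make the weighted-majority vote concrete, here is a minimal sketch in Python. The two toy stumps (stump_a, stump_b), their feature indices/thresholds, and the votes are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

# Two toy "simple" classifiers (decision stumps) mapping x in R^d to {-1, +1}.
# Feature indices and thresholds below are made up for illustration.
def stump_a(x):  # +1 if the first feature is at least 10
    return 1.0 if x[0] >= 10 else -1.0

def stump_b(x):  # +1 if the second feature is at most 5
    return 1.0 if x[1] <= 5 else -1.0

def weighted_majority(x, classifiers, alphas):
    """Composite predictor F(x) = sign(sum_m alpha_m * f_m(x))."""
    score = sum(a * f(x) for f, a in zip(classifiers, alphas))
    return np.sign(score)

# Example: combine the two stumps with votes alpha = (0.7, 0.3)
x = np.array([12.0, 3.0])
print(weighted_majority(x, [stump_a, stump_b], [0.7, 0.3]))  # -> 1.0
```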
Reducing bias using ensembles
Decision trees
• Each non-leaf node tests a binary condition on some feature x_j
  o if the condition is satisfied then go left, else go right
  o leaf nodes have a label (typically the majority class of the training examples reaching that node)
• Classifying a point with a decision tree can be seen as a sequence of classifiers refined as we follow the path to a leaf.
Combining "simple" models
• Smooth-ish tradeoff between bias and complexity
  o start with simple models with large bias and low variance
  o learn more complex classes by composing simple models
• For example, consider classifiers f_1, f_2, ..., f_M each based on only one feature (decision stumps), i.e.,
    f_m(x; θ_m) = 1[x_{k_m} ≥ τ_m], where θ_m = (k_m, τ_m)
• ℋ = {x → majority(α_1 f_1(x; θ_1), α_2 f_2(x; θ_2), ..., α_M f_M(x; θ_M))} contains very complex boundaries
• demo (by Nati Srebro)
• So clearly combining simple classifiers can reduce bias. How do we combine classifiers?
[Figure from the demo; panel title: "reality". Figure credit: Nati Srebro]
Combining "simple" models
• Given a family of models f_1, f_2, ... : 𝒳 → 𝒴, how do we combine them?
• Weighted averaging of models:
  o parameterize the combined classifier using α_m as
      F_α(x) = Σ_{m=1}^{M} α_m f_m(x)
  o minimize the loss over the combined model
      min_α Σ_{i=1}^{N} ℓ(F_α(x^(i)), y^(i))
• Alternative algorithm: greedy approach
  o F_0(x) = 0
  o for each round t = 1, 2, ..., T
    § find the best model to minimize the incremental change from F_{t−1}
        min_{α_t, f_t} Σ_{i=1}^{N} ℓ(F_{t−1}(x^(i)) + α_t f_t(x^(i)), y^(i))
  o Output classifier F_T(x) = Σ_{t=1}^{T} α_t f_t(x)
AdaBoost
Training data S = {(x^(i), y^(i)) : i = 1, 2, ..., N}
• Maintain weights W_i^t for each example (x^(i), y^(i)), initially all W_i^1 = 1/N
• For t = 1, 2, ..., T
  o Normalize weights: D_i^t = W_i^t / Σ_j W_j^t
  o Pick a classifier f_t with weighted loss better (lower) than 0.5, where ε_t = Σ_{i=1}^{N} D_i^t ℓ_{01}(f_t(x^(i)), y^(i))
  o Set α_t = (1/2) log((1 − ε_t)/ε_t)
  o Update weights: W_i^{t+1} = W_i^t exp(−α_t y^(i) f_t(x^(i)))
• Output strong classifier F_T(x) = sign(Σ_t α_t f_t(x))
Example credit: Greg Shakhnarovich
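The slides step through this algorithm on a worked example; as a hedged illustration (not the lecture's demo code), a minimal Python sketch of AdaBoost with decision stumps as the weak learners, on made-up toy data, might look like this:

```python
import numpy as np

def best_stump(X, y, D):
    """Weak learner: exhaustively pick the (+/-1-valued) decision stump
    f(x) = s * sign(x[k] - tau) with the lowest weighted 0-1 loss under D."""
    n, d = X.shape
    best = (np.inf, None)  # (weighted error, (feature, threshold, sign))
    for k in range(d):
        for tau in np.unique(X[:, k]):
            for s in (+1.0, -1.0):
                pred = s * np.where(X[:, k] >= tau, 1.0, -1.0)
                err = np.sum(D * (pred != y))
                if err < best[0]:
                    best = (err, (k, tau, s))
    return best

def adaboost(X, y, T):
    """AdaBoost: returns the chosen stumps and their votes alpha_t."""
    n = len(y)
    W = np.full(n, 1.0 / n)              # example weights W_i^1 = 1/N
    stumps, alphas = [], []
    for t in range(T):
        D = W / W.sum()                  # normalized weights D_i^t
        eps, (k, tau, s) = best_stump(X, y, D)
        if eps >= 0.5:                   # no weak learner better than chance
            break
        eps = max(eps, 1e-12)            # guard against a perfect stump (eps = 0)
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = s * np.where(X[:, k] >= tau, 1.0, -1.0)
        W = W * np.exp(-alpha * y * pred)    # up-weight the mistakes
        stumps.append((k, tau, s))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    """Strong classifier F_T(x) = sign(sum_t alpha_t f_t(x))."""
    score = np.zeros(len(X))
    for (k, tau, s), a in zip(stumps, alphas):
        score += a * s * np.where(X[:, k] >= tau, 1.0, -1.0)
    return np.sign(score)

# Tiny usage example with made-up data (labels in {-1, +1})
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
stumps, alphas = adaboost(X, y, T=5)
print(predict(X, stumps, alphas))   # should match y on this toy set
```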
AdaBoost
• Demo again (code by Nati Srebro)
• What are we doing in AdaBoost?
  o An algorithm for building ensembles
  o Learning sparse linear predictors over a large (possibly infinite) dimensional feature space
    § Sparsity controls complexity
    § The number of iterations controls sparsity ⇒ early stopping acts as regularization
  o Coordinate descent on the exponential loss (briefly, next)
• Variants of AdaBoost
  o FloatBoost: after each round, check whether removing a previously added classifier is helpful.
  o Totally corrective AdaBoost: update the α's for all weak classifiers selected so far by minimizing the loss
Exponential loss
• Exponential loss ℓ(f(x), y) = exp(−f(x) y), another surrogate loss
• Ensemble classifier F_α(x) = sign(Σ_t α_t f_t(x))
• We will not derive it, but one can show that the AdaBoost updates correspond to coordinate descent on ERM with the exponential loss:
    min_α Σ_{i=1}^{N} exp(−Σ_t α_t f_t(x^(i)) y^(i))
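The slide skips the derivation; as a short worked step (standard, not in the transcript), holding the previously chosen terms fixed and minimizing the weighted per-round exponential loss over the new vote α_t alone recovers the AdaBoost choice of α_t:

\[
L(\alpha_t) \;=\; \sum_{i=1}^{N} D_i^t \exp\!\big(-\alpha_t\, y^{(i)} f_t(x^{(i)})\big)
\;=\; (1-\epsilon_t)\, e^{-\alpha_t} \;+\; \epsilon_t\, e^{\alpha_t},
\qquad
\frac{dL}{d\alpha_t} = 0 \;\Rightarrow\; \alpha_t = \tfrac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t},
\]

where the split uses the fact that correctly classified examples (total weight 1 − ε_t) have y^(i) f_t(x^(i)) = +1 and misclassified ones (total weight ε_t) have −1.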
Example: Viola-Jones Face Detector
• Classify each square in an image as "face" or "no-face"
• 𝒳 = patches of 24x24 pixels, say
Slide credit: Nati Srebro
Viola-Jones "Weak Predictors"/Features
ℬ = {1[g_{r,t}(x) < θ] | θ ∈ ℝ, rectangle r in the image, t ∈ {A, B, C, D, Ā, B̄, C̄, D̄}}
where g_{r,t}(x) = sum of "blue" pixels − sum of "red" pixels
[Figure: the four rectangle feature patterns A, B, C, D]
Slide credit: Nati Srebro
Viola-Jones Face Detector
• Simple implementation of boosting using generic (non-face-specific) "weak learners"/features
  o Can also be used for detecting other objects
• Efficient method using dynamic programming and caching to find good weak predictors
• About 1 million possible g_{r,t}, but only very few are used in the returned predictor
• Sparsity:
  ⇒ generalization
  ⇒ prediction speed! (and small memory footprint)
• To run in real time (on a 2001 laptop), use sequential evaluation
  o First evaluate the first few h_t to get a rough prediction
  o Only evaluate additional h_t on patches where the leading ones are promising
Slide credit: Nati Srebro
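One standard efficiency trick in the original Viola-Jones detector (possibly what the slide's "dynamic programming and caching" refers to) is the integral image: precompute cumulative sums once per image so that any rectangle sum, and hence any g_{r,t}(x), costs only four lookups. A minimal sketch (illustrative, not the lecture's code):

```python
import numpy as np

def integral_image(img):
    """Cumulative sums over rows and columns, padded so that ii[r, c] equals
    the sum of img[:r, :c]; built once per image in O(#pixels)."""
    return np.pad(img.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle via 4 lookups into the integral image."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])

# Example: one two-rectangle feature g(x) = sum("blue" half) - sum("red" half)
patch = np.random.rand(24, 24)
ii = integral_image(patch)
g = rect_sum(ii, 0, 12, 24, 12) - rect_sum(ii, 0, 0, 24, 12)
print(abs(g - (patch[:, 12:].sum() - patch[:, :12].sum())) < 1e-9)  # True
```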
Ensembling to reduce variance
Averaging predictors
• Averaging reduces variance: if Z_1, Z_2, ..., Z_M are independent random variables, each with mean μ and variance σ², then
    var((1/M) Σ_{m=1}^{M} Z_m) = σ²/M
• What happens to the mean?
    E[(1/M) Σ_{m=1}^{M} Z_m] = μ
• If we had M models f_1, f_2, ..., f_M trained independently on different i.i.d. datasets S_1, S_2, ..., S_M, then averaging the outputs of the models would
  o reduce variance: the composite will be less sensitive to the specific training data
  o without increasing the bias: on average all classifiers will do as well
• But we have only one dataset! How do we get multiple models?
  o Remember, the models have to be independent!
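A quick empirical check of the σ²/M claim (a toy simulation, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
M, sigma, trials = 10, 2.0, 100_000

# Each trial: average M independent draws with std sigma; compare variances.
Z = rng.normal(loc=0.0, scale=sigma, size=(trials, M))
print(Z[:, 0].var())         # ~ sigma^2       (a single predictor)
print(Z.mean(axis=1).var())  # ~ sigma^2 / M   (average of M predictors)
```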
Bagging: Bootstrap aggregation
• Averaging independent models reduces variance without increasing bias.
• But we don't have independent datasets!
  o Instead, take repeated bootstrap samples from the training set S
• Bootstrap sampling: Given a dataset S = {(x^(i), y^(i)) : i = 1, 2, ..., N}, create S' by drawing N examples at random with replacement from S
• Bagging (a code sketch follows below):
  o Create M bootstrap datasets S_1, S_2, ..., S_M
  o Train distinct models f_m : 𝒳 → 𝒴 by training only on S_m
  o Output final predictor F(x) = (1/M) Σ_{m=1}^{M} f_m(x) (for regression) or F(x) = majority(f_m(x)) (for classification)
Figure credit: David Sontag
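As referenced above, a minimal bagging sketch in Python; the base learner (scikit-learn's DecisionTreeClassifier) and the toy data are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed base learner for illustration

def bagging_fit(X, y, M, rng):
    """Train M trees, each on a bootstrap sample (N draws with replacement)."""
    N = len(y)
    models = []
    for _ in range(M):
        idx = rng.integers(0, N, size=N)          # bootstrap sample S_m
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the M trees (classification version)."""
    votes = np.stack([m.predict(X) for m in models])   # shape (M, n_points)
    return np.sign(votes.mean(axis=0))                 # works for labels in {-1, +1}

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)          # toy nonlinear labels
models = bagging_fit(X, y, M=25, rng=rng)
print((bagging_predict(models, X) == y).mean())         # training accuracy
```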
Bagging
• Most effective when combining high-variance, low-bias predictors
  o unstable non-linear predictors like decision trees
  o the "overfitting quirks" of different trees cancel out
• Not very useful with linear predictors
• Useful property of bagging: "out of bag" (OOB) data
  o for each "bag", treat the examples that didn't make it into the bag as a kind of validation set
  o while learning the predictors, keep track of OOB accuracy
Bagging example
[Figure: output of a single decision tree vs. 100 bagged trees]
Slide/example credit: David Sontag
Random forests
• Ensemble method specifically built for decision trees
• Two sources of randomness
  o Sample bagging: each tree is grown on a bootstrapped training dataset
  o Feature bagging: at each node, the best split is decided over only a random subset of features → increases diversity among the trees
• Algorithm (a usage sketch follows below)
  o Create bootstrapped datasets S_1, S_2, ..., S_M
  o For each m, grow a decision tree T_m by repeating the following at each node until some stopping condition:
    § select K features at random from the d features of x
    § pick the best variable/split threshold among the K selected features
    § split the node into two child nodes based on the above condition
  o Output the majority vote of {T_m}_{m=1}^{M}
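In practice one rarely codes this by hand; as a hedged usage sketch (the data are made up), scikit-learn's RandomForestClassifier exposes the two sources of randomness directly: n_estimators bootstrapped trees and max_features features considered at each split, with oob_score reusing the "out of bag" examples mentioned earlier as a built-in validation set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)       # toy labels, for illustration only

# 100 trees, each grown on a bootstrap sample; sqrt(d) features tried at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                oob_score=True, random_state=0).fit(X, y)
print(forest.oob_score_)                      # OOB accuracy estimate
```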
Ensembles summary
• Reduce bias:
  o build an ensemble of low-variance, high-bias predictors sequentially to reduce bias
  o AdaBoost: binary classification, exponential surrogate loss
• Reduce variance:
  o build an ensemble of high-variance, low-bias predictors in parallel and use randomness and averaging to reduce variance
  o random forests, bagging
• Problems
  o Computationally expensive (at train and test time)
  o Often lose interpretability