dr. yanjun qi / uva cs 6316 / f16 uva cs 6316/4501 – fall ... · machine learning lecture 16:...

67
UVA CS 6316/4501 – Fall 2016 Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department of Computer Science 11/9/16 Dr. Yanjun Qi / UVA CS 6316 / f16 1

Upload: others

Post on 21-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

UVACS6316/4501–Fall2016

MachineLearning

Lecture16:DecisionTree/RandomForest/Ensemble

Dr.YanjunQi

UniversityofVirginia

DepartmentofComputerScience

11/9/16

Dr.YanjunQi/UVACS6316/f16

1

Page 2: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Wherearewe?èFivemajorsecLonsofthiscourse

q Regression(supervised)q ClassificaLon(supervised)q Unsupervisedmodelsq Learningtheoryq Graphicalmodels

11/9/16 2

Dr.YanjunQi/UVACS6316/f16

Page 3: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

11/9/16 3

hSp://scikit-learn.org/stable/tutorial/machine_learning_map/Dr.YanjunQi/UVACS6316/f16

ChoosingtherightesLmator

Page 4: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

11/9/16 4

Scikit-learn:Regression

LinearmodelfiSedbyminimizingaregularizedempiricallosswithSGD

Dr.YanjunQi/UVACS6316/f16

Page 5: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Scikit-learn:ClassificaLon

11/9/16 5

Linearclassifiers(SVM,logisLcregression…)withSGDtraining.

approximatetheexplicitfeaturemappingsthatcorrespondtocertainkernelsTocombinethe

predicLonsofseveralbaseesLmatorsbuiltwithagivenlearningalgorithminordertoimprovegeneralizability/robustnessoverasingleesLmator.(1)averaging/bagging(2)boosLng

Dr.YanjunQi/UVACS6316/f16

Page 6: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

11/9/16 6

BasicPCA

Bayes-NetHMM

Kmeans+GMM

nextaeerclassificaLon?Dr.YanjunQi/UVACS6316/f16

Page 7: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Today

Ø DecisionTree(DT):Ø TreerepresentaLon

Ø BriefinformaLontheoryØ LearningdecisiontreesØ BaggingØ Randomforests:EnsembleofDTØ Moreaboutensemble11/9/16 7

Dr.YanjunQi/UVACS6316/f16

Page 8: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

A study comparing Classifiers

11/9/16 8

Dr.YanjunQi/UVACS6316/f16

Proceedingsofthe23rdInternaLonalConferenceonMachineLearning(ICML`06).

Page 9: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

A study comparing Classifiers è 11binaryclassificaLonproblems/8metrics

11/9/16 9

Dr.YanjunQi/UVACS6316/f16

Top8Models

Page 10: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Wherearewe?èThreemajorsecLonsforclassificaLon

•  We can divide the large variety of classification approaches into roughly three major types

1. Discriminative - directly estimate a decision rule/boundary - e.g., logistic regression, support vector machine, decisionTree 2. Generative: - build a generative statistical model - e.g., naïve bayes classifier, Bayesian networks 3. Instance based classifiers - Use observation directly (no models) - e.g. K nearest neighbors

11/9/16 10

Dr.YanjunQi/UVACS6316/f16

Page 11: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

ADatasetforclassificaLon

•  Data/points/instances/examples/samples/records:[rows]•  Features/a0ributes/dimensions/independentvariables/covariates/predictors/regressors:[columns,exceptthelast]•  Target/outcome/response/label/dependentvariable:specialcolumntobepredicted[lastcolumn]

11/9/16 11

Output as Discrete Class Label

C1, C2, …, CL

C

CDr.YanjunQi/UVACS6316/f16

Page 12: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Example

•  Example: Play Tennis

C

11/9/16 12

Dr.YanjunQi/UVACS6316/f16

Page 13: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Anatomyofadecisiontree

overcast

high normal falsetrue

sunny rain

No NoYes Yes

Yes

Outlook

HumidityWindy

Eachnodeisatestononefeature/aJribute

PossibleaSributevaluesofthenode

Leavesarethedecisions

11/9/16 13

Dr.YanjunQi/UVACS6316/f16

Page 14: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Anatomyofadecisiontree

overcast

high normal falsetrue

sunny rain

No NoYes Yes

Yes

Outlook

HumidityWindy

EachnodeisatestononeaJribute

PossibleaSributevaluesofthenode

Leavesarethedecisions

Samplesize

Yourdatagetssmaller

11/9/16 14

Dr.YanjunQi/UVACS6316/f16

Page 15: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

ApplyModeltoTestData:To‘playtennis’ornot.

overcast

high normal false true

sunny rain

No No Yes Yes

Yes

Outlook

Humidity Windy

Anewtestexample:(Outlook==rain)and(Windy==false)Passitonthetree->Decisionisyes.

11/9/16 15

Dr.YanjunQi/UVACS6316/f16

Page 16: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

ApplyModeltoTestData:To‘playtennis’ornot.

overcast

high normal false true

sunny rain

No No Yes Yes

Yes

Outlook

Humidity Windy

(Outlook==overcast)->yes(Outlook==rain)and(Windy==false)->yes(Outlook==sunny)and(Humidity=normal)->yes

11/9/16 16

Dr.YanjunQi/UVACS6316/f16

Threecasesof“YES”

Page 17: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Decisiontrees•  DecisiontreesrepresentadisjuncLonofconjuncLonsofconstraintsontheaSributevaluesofinstances.

•  (Outlook ==overcast) •  OR •  ((Outlook==rain) and (Windy==false)) •  OR •  ((Outlook==sunny) and (Humidity=normal)) •  => yes play tennis

11/9/16 17

Dr.YanjunQi/UVACS6316/f16

Page 18: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

RepresentaLon

0 A 1

C B 0 1 1 0

false true false

Y=((AandB)or((notA)andC))

true

11/9/16 18

Dr.YanjunQi/UVACS6316/f16

Page 19: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Sameconcept/differentrepresentaLon

0 A 1

C B 0 1 1 0

false true false

Y=((AandB)or((notA)andC))

true 0 C 1

B A 0 1 0 1

false true false A 0 1

true false 11/9/16 19

Dr.YanjunQi/UVACS6316/f16

Page 20: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

WhichaJributetoselectforspli\ng?

16+16-

8+8-

8+8-

4+4-

4+4-

4+4-

4+4-

2+2-

2+2-

Thisisbadspliqng…

thedistribuLonofeachclass(notaSribute)

11/9/16 20

Dr.YanjunQi/UVACS6316/f16

Page 21: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

HowdowechoosewhichaJributetosplit?

WhichaSributeshouldbeusedfirsttotest?IntuiLvely,youwouldprefertheonethatseparatesthetrainingexamplesasmuchaspossible.

11/9/16 21

Dr.YanjunQi/UVACS6316/f16

Page 22: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Today

Ø DecisionTree(DT):Ø TreerepresentaLon

Ø BriefinformaLontheoryØ LearningdecisiontreesØ BaggingØ Randomforests:EnsembleofDTØ Moreaboutensemble11/9/16 22

Dr.YanjunQi/UVACS6316/f16

Page 23: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

InformaLongainisonecriteriatodecideonwhichaSributeforspliqng

•  Imagine:–  1.Someoneisabouttotellyouyourownname–  2.Youareabouttoobservetheoutcomeofadiceroll–  2.Youareabouttoobservetheoutcomeofacoinflip–  3.Youareabouttoobservetheoutcomeofabiasedcoinflip

•  EachsituaLonhaveadifferentamountofuncertaintyastowhatoutcomeyouwillobserve.

11/9/16 23

Dr.YanjunQi/UVACS6316/f16

Page 24: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

InformaLon•  InformaLon:èReducLoninuncertainty(amountofsurpriseintheoutcome)

2 21( ) log log ( )( )

I E p xp x

= = −

Ø  Observingtheoutcomeofacoinflipishead

Ø  Observetheoutcomeofadiceis6

2log 1/ 2 1I = − =

2log 1/ 6 2.58I = − =

Iftheprobabilityofthiseventhappeningissmallandithappens,theinformaLonislarge.

11/9/16 24

Dr.YanjunQi/UVACS6316/f16

Page 25: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Entropy•  Theexpectedamountofinforma9onwhenobservingthe

outputofarandomvariableX

2( ) ( ( )) ( ) ( ) ( ) log ( )i i i ii i

H X E I X p x I x p x p x= = = −∑ ∑

IftheXcanhave8outcomesandallareequallylikely

2( ) 1/8log 1/8 3i

H X == − =∑

11/9/16 25

Dr.YanjunQi/UVACS6316/f16

Page 26: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Entropy•  Iftherearekpossible

outcomes

•  Equalityholdswhenalloutcomesareequallylikely

•  ThemoretheprobabilitydistribuLonthatdeviatesfromuniformity,thelowertheentropy

2( ) logH X k≤

e.g.forarandombinaryvariable11/9/16 26

Dr.YanjunQi/UVACS6316/f16

Page 27: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

EntropyLowerèbeJerpurity

•  Entropymeasuresthepurity

4+4-

8+0-

ThedistribuLonislessuniformEntropyislowerThenodeispurer

11/9/16 27

Dr.YanjunQi/UVACS6316/f16

Page 28: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Informa_ongain

•  IG(X,Y)=H(Y)-H(Y|X)ReducLoninuncertaintyofYbyknowingafeaturevariableX

InformaLongain:=(informaLonbeforesplit)–(informaLonaeersplit)=entropy(parent)–[averageentropy(children)]

Fixed thelower,thebeSer(childrennodesarepurer)

– ForIG,thehigher,thebeSer=

11/9/16 28

Dr.YanjunQi/UVACS6316/f16

Page 29: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Condi_onalentropy

H (Y ) = − p(yi )log2 p(yi )i∑

H (Y | X ) = p(x j )

j∑ H (Y | X = x j )

= − p(x j )

j∑ p( yi | x j ) log2 p( yi | x j )

i∑

11/9/16 29

Dr.YanjunQi/UVACS6316/f16

H (Y | X = x j ) = − p( yi | x j ) log2 p( yi | x j )

i∑

Page 30: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

ExampleX1 X2 Y Count T T + 2 T F + 2 F T - 5 F F + 1

ASributes Labels

IG(X1,Y) = H(Y) – H(Y|X1) H(Y) = - (5/10) log(5/10) -5/10log(5/10) = 1 H(Y|X1) = P(X1=T)H(Y|X1=T) + P(X1=F) H(Y|X1=F) = 4/10 (1log 1 + 0 log 0) +6/10 (5/6log 5/6 +1/6log1/6)

= 0.39

InformaLongain(X1,Y)=1-0.39=0.61

WhichonedowechooseX1orX2?

11/9/16 30

Dr.YanjunQi/UVACS6316/f16

Page 31: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

ExampleX1 X2 Y Count T T + 2 T F + 2 F T - 5 F F + 1

ASributes Labels

IG(X1,Y) = H(Y) – H(Y|X1) H(Y) = - (5/10) log(5/10) -5/10log(5/10) = 1 H(Y|X1) = P(X1=T)H(Y|X1=T) + P(X1=F) H(Y|X1=F) = 4/10 (1log 1 + 0 log 0) +6/10 (5/6log 5/6 +1/6log1/6)

= 0.39

InformaLongain(X1,Y)=1-0.39=0.61

WhichonedowechooseX1orX2?

11/9/16 31

Dr.YanjunQi/UVACS6316/f16

Page 32: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Whichonedowechoose?

X1 X2 Y Count T T + 2 T F + 2 F T - 5 F F + 1

Information gain (X1,Y)= 0.61 Information gain (X2,Y)= 0.12

PickX1Pick the variable which provides the most information gain about Y

èThenrecursivelychoosenextXionbranches

X1 X2 Y Count T T + 2 T F + 2 F T - 5 F F + 1

Onebranch

Theotherbranch

11/9/16 32

Dr.YanjunQi/UVACS6316/f16

Page 33: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

11/9/16

Dr.YanjunQi/UVACS6316/f16

33

Page 34: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

DecisionTrees•  Caveats:Thenumberofpossiblevaluesinfluencesthe

informaLongain.•  Themorepossiblevalues,thehigherthegain(themorelikelyitisto

formsmall,butpureparLLons)

•  OtherPurity(diversity)measures–  InformaLonGain–  Gini(populaLondiversity)

•  wherepmkisproporLonofclasskatnodem

–  Chi-squareTest

11/9/16 34

Dr.YanjunQi/UVACS6316/f16

Page 35: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Overfi\ng

•  YoucanperfectlyfitDTtoanytrainingdata

•  InstabilityofTrees○  Highvariance(smallchangesintrainingsetwill

resultinchangesoftreemodel)○  HierarchicalstructureèErrorintopsplit

propagatesdown

•  Twoapproaches:–  1.Stopgrowingthetreewhenfurtherspliqngthedatadoesnot

yieldanimprovement–  2.Growafulltree,thenprunethetree,byeliminaLngnodes.

11/9/16 35

Dr.YanjunQi/UVACS6316/f16

Page 36: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

FromESLbookCh9:ClassificaLonandRegressionTrees(CART)●  Par__onfeature

spaceintosetofrectangles

●  Fitsimplemodelin

eachparLLon

Page 37: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Summary:Decisiontrees

•  Non-linearclassifier•  Easytouse•  Easytointerpret•  SuscepLbletooverfiqngbutcanbeavoided.

11/9/16 37

Dr.YanjunQi/UVACS6316/f16

Page 38: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Decision Tree / Random Forest

Greedy to find partitions

Split Purity measure / e.g. IG / cross-entropy / Gini /

Tree Model (s), i.e. space partition

Task

Representation

Score Function

Search/Optimization

Models, Parameters

11/9/16 38

Classification

Partition feature space into set of rectangles, local smoothness

Dr.YanjunQi/UVACS6316/f16

Page 39: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Today

Ø DecisionTree(DT):Ø TreerepresentaLon

Ø BriefinformaLontheoryØ LearningdecisiontreesØ BaggingØ Randomforests:EnsembleofDTØ Moreaboutensemble11/9/16 39

Dr.YanjunQi/UVACS6316/f16

Page 40: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Bagging

•  Baggingorbootstrapaggrega9on•  atechniqueforreducingthevarianceofanesLmatedpredicLonfuncLon.

•  Forinstance,forclassificaLon,acommi0eeoftrees•  Eachtreecastsavoteforthepredictedclass.

11/9/16 40

Dr.YanjunQi/UVACS6316/f16

Page 41: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

BootstrapThebasicidea:randomlydrawdatasetswithreplacement(i.e.allowsduplicates)fromthetrainingdata,eachsamplethesamesizeastheoriginaltrainingset

11/9/16 41

Dr.YanjunQi/UVACS6316/f16

Page 42: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

WithvsWithoutReplacement

•  Bootstrapwithreplacementcankeepthesamplingsizethesameastheoriginalsizeforeveryrepeatedsampling.Thesampleddatagroupsareindependentoneachother.

•  Bootstrapwithoutreplacementcannotkeepthesamplingsizethesameastheoriginalsizeforeveryrepeatedsampling.Thesampleddatagroupsaredependentoneachother.

11/9/16 42

Dr.YanjunQi/UVACS6316/f16

Page 43: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

BaggingNexamples

Createbootstrapsamplesfromthetrainingdata

....…

Mfeatures

11/9/16 43

Dr.YanjunQi/UVACS6316/f16

Page 44: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

BaggingofDTClassifiersNexamples

....…

....…

Takethemajorityvote

Mfeatures

e.g.

i.e.Refitthemodeltoeachbootstrapdataset,andthenexaminethebehaviorovertheBreplicaLons.

11/9/16 44

Dr.YanjunQi/UVACS6316/f16

Page 45: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

BaggingforClassifica_onwith0,1Loss

•  ClassificaLonwith0,1loss–  BaggingagoodclassifiercanmakeitbeJer.

–  Baggingabadclassifiercanmakeitworse.

–  Canunderstandthebaggingeffectintermsofaconsensusofindependentweakleanersandwisdomofcrowds

11/9/16 45

Dr.YanjunQi/UVACS6316/f16

Page 46: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Peculiari_es

•  ModelInstabilityisgoodwhenbagging

–  Themorevariable(unstable)thebasicmodelis,themoreimprovementcanpotenLallybeobtained

–  Low-Variabilitymethods(e.g.LDA)improvelessthanHigh-Variabilitymethods(e.g.decisiontrees)

•  LoadofRedundancy–  Mostpredictorsdoroughly“thesamething”

11/9/16 46

Dr.YanjunQi/UVACS6316/f16

Page 47: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Bagging:ansimulatedexample •  N=30trainingsamples,•  twoclassesandp=5features,•  EachfeatureN(0,1)distribuLonandpairwisecorrelaLon.95•  ResponseYgeneratedaccordingto:•  Testsamplesizeof2000•  FitclassificaLontreestotrainingsetandbootstrapsamples•  B=200

ESLbook/Example8.7.1

Page 48: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

NoLcethebootstraptreesaredifferentthantheoriginaltree

FivefeatureshighlycorrelatedwitheachotherèNocleardifferencewithpickingupwhichfeaturetosplitè Small

changesinthetrainingsetwillresultindifferenttree

è ButthesetreesareactuallyquitesimilarforclassificaLon

ESLbook/Example8.7.1

Page 49: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

•  Consensus:Majorityvote•  Probability:AveragedistribuLonatterminalnodes

ESLbook/Example8.7.1

B

è ForB>30,moretreesdonotimprovethebaggingresults

è Sincethetrees

correlatehighlytoeachotherandgivesimilarclassificaLons

Page 50: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Bagging

•  Slightlyincreasesmodelspace– Cannothelpwheregreaterenlargementofspaceisneeded

•  Baggedtreesarecorrelated– UserandomforesttoreducecorrelaLonbetweentrees

Page 51: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Today

Ø DecisionTree(DT):Ø TreerepresentaLon

Ø BriefinformaLontheoryØ LearningdecisiontreesØ BaggingØ Randomforests:specialensembleofDTØ Moreaboutensemble11/9/16 51

Dr.YanjunQi/UVACS6316/f16

Page 52: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Randomforestclassifier

•  Randomforestclassifier,– anextensiontobagging– whichusesde-correlatedtrees.

11/9/16 52

Dr.YanjunQi/UVACS6316/f16

Page 53: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

RandomForestClassifierNexamples

Createbootstrapsamplesfromthetrainingdata

....…

Mfeatures

11/9/16 53

Dr.YanjunQi/UVACS6316/f16

Page 54: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

RandomForestClassifierNexamples

....…

Mfeatures

Ateachnodewhenchoosingthesplitfeaturechooseonlyamongm<Mfeatures

11/9/16 54

Dr.YanjunQi/UVACS6316/f16

Page 55: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

RandomForestClassifierCreatedecisiontree

fromeachbootstrapsample

Nexamples

....…

....…

Mfeatures

11/9/16 55

Dr.YanjunQi/UVACS6316/f16

Page 56: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

RandomForestClassifierNexamples

....…

....…

Takehemajorityvote

Mfeatures

11/9/16 56

Dr.YanjunQi/UVACS6316/f16

Page 57: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Random Forests

1. ForeachofourBbootstrapsamplesa.  Formatreeinthefollowingmanner

i.  Givenpdimensions,pickmofthemii.  Splitonlyaccordingtothesemdimensions

1.  (wewillNOTconsidertheotherp-mdimensions)

iii.  Repeattheabovestepsi&iiforeachsplit1.  Note:wepickadifferentsetofmdimensionsforeachsplit

onasingletree

Page 58: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

11/9/16 58

Dr.YanjunQi/UVACS6316/f16

Page598-599InESLbook

Page 59: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Random Forests •  Randomforestcanbeviewedasarefinementofbaggingwitha

tweakofdecorrela_ngthetrees:

o  Ateachtreesplit,arandomsubsetofmfeaturesoutofallpfeaturesisdrawntobeconsideredforspliqng

•  SomeguidelinesprovidedbyBreiman,butbecarefultochoosembasedonspecificproblem:

o  m=pamountstobaggingo  m=p/3orlog2(p)forregressiono  m=sqrt(p)forclassificaLon

Page 60: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Why correlated trees are not ideal ?

•  RandomForeststrytoreducecorrelaLonbetweenthetrees.

• Why?

Page 61: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Why correlated trees are not ideal ?

•  Assumingeachtreehasvarianceσ2

•  IftreesareindependentlyidenLcallydistributed,thenaveragevarianceisσ2/B

Page 62: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Why correlated trees are not ideal ?

•  Assumingeachtreehasvarianceσ2

•  IfsimplyidenLcallydistributed,thenaveragevarianceis

ρreferstopairwisecorrelaLon,aposiLvevalue

•  AsB→∞,secondterm→0•  Thus,thepairwisecorrelaLonalwaysaffectsthevariance

Page 63: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Why correlated trees are not ideal ?

•  Howtodeal?o  Ifwereducem(thenumberofdimensionswe

actuallyconsider),o  thenwereducethepairwisetreecorrelaLon

o  Thus,variancewillbereduced.

Page 64: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Today

Ø DecisionTree(DT):Ø TreerepresentaLon

Ø BriefinformaLontheoryØ LearningdecisiontreesØ BaggingØ Randomforests:EnsembleofDTØ Moreensemble

11/9/16 64

Dr.YanjunQi/UVACS6316/f16

Page 65: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

e.g. Ensembles in practice

Oct2006-2009

EachraLng/sample:+<user,movie,dateofgrade,grade>Trainingset(100,480,507raLngs)Qualifyingset(2,817,131raLngs)èwinner

Page 66: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

Ensemble in practice Team“Bellkor'sPragmaLcChaos”defeatedtheteam“ensemble”bysubmiqngjust20minutesearlier!è1milliondollar!

TheensembleteamèblendersofmulLpledifferentmethods

Page 67: Dr. Yanjun Qi / UVA CS 6316 / f16 UVA CS 6316/4501 – Fall ... · Machine Learning Lecture 16: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department

References

q Prof.Tan,Steinbach,Kumar’s“IntroducLontoDataMining”slide

q HasLe,Trevor,etal.Theelementsofsta9s9callearning.Vol.2.No.1.NewYork:Springer,2009.

q Dr.OznurTastan’sslidesaboutRFandDT

11/9/16 67

Dr.YanjunQi/UVACS6316/f16