Computational Learning Theory: Occam’s Razor

Machine Learning, Fall 2017
Slides based on material from Dan Roth, Avrim Blum, Tom Mitchell and others



This lecture: Computational Learning Theory

• The Theory of Generalization
• Probably Approximately Correct (PAC) learning
• Positive and negative learnability results
• Agnostic Learning
• Shattering and the VC dimension


Where are we?

• The Theory of Generalization
  – When can we trust the learning algorithm?
  – What functions can be learned?
  – Batch learning
• Probably Approximately Correct (PAC) learning
• Positive and negative learnability results
• Agnostic Learning
• Shattering and the VC dimension


This section

1. Analyze a simple algorithm for learning conjunctions
2. Define the PAC model of learning
3. Make formal connections to the principle of Occam’s razor


This section

✓ Analyze a simple algorithm for learning conjunctions
✓ Define the PAC model of learning
3. Make formal connections to the principle of Occam’s razor


Occam’s Razor

Named after William of Occam (AD 1300s)

Prefer simpler explanations over more complex ones

“Numquam ponenda est pluralitas sine necessitate”
(Never posit plurality without necessity.)

Historically, a widely prevalent idea across different schools of philosophy


Towards formalizing Occam’s Razor

Claim: The probability that there is a hypothesis h ∈ H that
1. is consistent with m examples, and
2. has err_D(h) > ε
is less than |H|(1 - ε)^m.
(Assuming consistency. That is, a hypothesis that is consistent yet bad.)

Proof: Let h be such a bad hypothesis, with error greater than ε.
The probability that h is consistent with one example is Pr[f(x) = h(x)] < 1 - ε.
The training set consists of m examples drawn independently, so the probability that h is consistent with all m examples is less than (1 - ε)^m.
By the union bound, the probability that some bad hypothesis in H is consistent with m examples is less than |H|(1 - ε)^m.
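
Not from the slides, but a quick way to sanity-check the claim: the Python sketch below (names such as `estimate_consistency_prob` are my own) estimates by simulation the probability that a single hypothesis with true error ε agrees with the target on all m i.i.d. examples, and compares it with (1 - ε)^m; the claim then follows by the union bound over the at most |H| bad hypotheses.

```python
import random

def estimate_consistency_prob(epsilon, m, trials=100_000):
    """Estimate Pr[a hypothesis with true error epsilon agrees with the
    target f on all m i.i.d. examples drawn from D]."""
    consistent = 0
    for _ in range(trials):
        # Each example independently lands in the disagreement region of h
        # with probability epsilon; h stays consistent iff none of them do.
        if all(random.random() >= epsilon for _ in range(m)):
            consistent += 1
    return consistent / trials

if __name__ == "__main__":
    epsilon, m = 0.1, 30
    print("simulated:   ", estimate_consistency_prob(epsilon, m))
    print("(1 - eps)^m: ", (1 - epsilon) ** m)
    # Union bound over H: Pr[some bad h in H is consistent] < |H| * (1 - eps)^m
```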



Occam’s Razor

The probability that there is a hypothesis h ∈ H that is
1. consistent with m examples, and
2. has err_D(h) > ε
is less than |H|(1 - ε)^m.

Just like before, we want to make this probability small, say smaller than δ:
|H|(1 - ε)^m < δ
ln(|H|) + m ln(1 - ε) < ln δ

We know that ln(1 - ε) < -ε. Let’s use this to get a safer (sufficient) condition:
ln(|H|) - mε < ln δ

That is, if m > (1/ε)(ln |H| + ln(1/δ)), then the probability of getting a bad hypothesis is small.
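
As an illustration (this helper and its name `sample_complexity` are mine, not from the slides), the resulting bound can be computed directly: the function returns the smallest integer m with m > (1/ε)(ln |H| + ln(1/δ)).

```python
import math

def sample_complexity(h_size, epsilon, delta):
    """Smallest integer m with m > (ln|H| + ln(1/delta)) / epsilon,
    which guarantees |H| * (1 - epsilon)^m < delta."""
    return math.floor((math.log(h_size) + math.log(1.0 / delta)) / epsilon) + 1

if __name__ == "__main__":
    # Example: conjunctions over n = 10 Boolean variables, |H| <= 3^10
    print(sample_complexity(h_size=3 ** 10, epsilon=0.1, delta=0.05))  # about 140 examples
```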


Occam’s Razor

Let H be any hypothesis space. With probability 1 - δ, a hypothesis h ∈ H that is consistent with a training set of size m will have error < ε on future examples if
m > (1/ε)(ln |H| + ln(1/δ))

This is called Occam’s Razor because it expresses a preference towards smaller hypothesis spaces.

It shows when an m-consistent hypothesis generalizes well (i.e., error < ε).

Complicated/larger hypothesis spaces are not necessarily bad. But simpler ones are unlikely to fool us by being consistent with many examples!

Reading the bound:
1. Expecting lower error increases the sample complexity (i.e., more examples are needed for the guarantee).
2. A larger hypothesis space makes learning harder (i.e., higher sample complexity).
3. Wanting higher confidence in the classifier we produce also increases the sample complexity.
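
To make the three observations concrete, here is a small illustration with made-up parameter values (the numbers, and the helper `m_bound`, are mine; the helper just restates the bound so the snippet is self-contained). Each line varies one quantity in m > (1/ε)(ln |H| + ln(1/δ)) and prints how the required number of examples grows.

```python
import math

def m_bound(h_size, epsilon, delta):
    """Occam bound: smallest integer m with m > (ln|H| + ln(1/delta)) / epsilon."""
    return math.floor((math.log(h_size) + math.log(1.0 / delta)) / epsilon) + 1

# 1. Lower target error eps -> higher sample complexity
print([m_bound(2 ** 20, eps, 0.05) for eps in (0.2, 0.1, 0.05)])
# 2. Larger hypothesis space |H| -> higher sample complexity
print([m_bound(h, 0.1, 0.05) for h in (2 ** 10, 2 ** 20, 2 ** 40)])
# 3. Higher confidence (smaller delta) -> higher sample complexity
print([m_bound(2 ** 20, 0.1, d) for d in (0.1, 0.05, 0.01)])
```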


Consistent Learners and Occam’s Razor

From the definition, we get the following general scheme for PAC learning.

Given a sample D of m examples:
• Find some h ∈ H that is consistent with all m examples
• If m is large enough, a consistent hypothesis must be close enough to f
• Check that m does not have to be too large (i.e., polynomial in the relevant parameters): we showed that the “closeness” guarantee requires
  m > (1/ε)(ln |H| + ln(1/δ))
• Show that the consistent hypothesis h ∈ H can be computed efficiently


We worked out the details for conjunctions:
• The Elimination algorithm finds a hypothesis h that is consistent with the training set (easy to compute).
• We showed directly that if we have sufficiently many examples (polynomial in the parameters), then h is close to the target function.
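
The Elimination algorithm itself is not spelled out on these slides; the sketch below is my reconstruction of the standard version for conjunctions over Boolean features (the names `learn_conjunction` and `predict`, and the toy data, are mine). It starts from the conjunction of all 2n literals and drops every literal falsified by a positive example, yielding a hypothesis consistent with the sample whenever the target really is a conjunction.

```python
def learn_conjunction(examples):
    """Elimination algorithm for conjunctions over Boolean features.

    examples: iterable of (x, y) where x is a tuple of 0/1 feature values and
    y is 1 for positive, 0 for negative. Returns the surviving literals as a
    set of (index, polarity) pairs; polarity True means x_i, False means NOT x_i.
    """
    examples = list(examples)
    n = len(examples[0][0])
    # Start with every literal: x_i and NOT x_i for each feature i.
    literals = {(i, True) for i in range(n)} | {(i, False) for i in range(n)}
    for x, y in examples:
        if y == 1:
            # A positive example eliminates every literal it falsifies.
            literals = {(i, pol) for (i, pol) in literals if bool(x[i]) == pol}
    return literals

def predict(literals, x):
    """h(x) = 1 iff every surviving literal is satisfied by x."""
    return int(all(bool(x[i]) == pol for (i, pol) in literals))

if __name__ == "__main__":
    # Hypothetical toy data labeled by the target: x0 AND NOT x2
    data = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 1, 1), 0)]
    h = learn_conjunction(data)
    print(sorted(h))                          # surviving literals
    print([predict(h, x) for x, _ in data])   # consistent with the sample
```

Since each variable can appear positively, negatively, or not at all, |H| is on the order of 3^n, so ln |H| = O(n) and the bound m > (1/ε)(ln |H| + ln(1/δ)) is polynomial in n, 1/ε, and ln(1/δ), matching the claim above.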


Exercises

We have seen the decision tree learning algorithm. Suppose our problem has n binary features. What is the size of the hypothesis space?

Are decision trees efficiently PAC learnable?