Antoine Cornuéjols
AgroParisTech – INRA MIA 518

Course: Learning theory and advanced Machine Learning
The course
• Documents
  – The book: "L'apprentissage artificiel. Deep Learning, concepts et algorithmes", A. Cornuéjols, L. Miclet & V. Barra, Eyrolles, 3rd ed., 2018
  – The slides + additional information at:
    http://www2.agroparistech.fr/ufr-info/membres/cornuejols/Teaching/Master-AIC/M2-AIC-advanced-ML.html
Outline of the course
• Building an inductive criterion: semi-supervised learning; learning sparse models
• Induction: how does it work? Which guarantees can we get? The no-free-lunch theorem
• Online learning: theory (new inductive criteria); in practice (heuristic inductive criteria), e.g. early classification of time series and LUPI
• Transfer learning: scenarios; which information to exchange? how to obtain guarantees?
• Ensemble methods: what kinds of algorithms? which information to exchange? and in the unsupervised case?
Course organization
• 6 courses
• 1 seminar-like session: discussion of papers
• 5 quizzes (5 × 5 = 25%)
• Project: 50%
  – 19/12/2019: description of the chosen project (2 pages)
  – 31/01/2020: mid-term report (5 to 8 pages)
  – 28/02/2020: final report (10 pages strict, ICML paper format)
• Critical review of papers: 25%
A. Cornuéjols
AgroParisTech – INRA MIA 518

Reflections on INDUCTION-S
http://www.agroparistech.fr/ufr-info/membres/cornuejols/Teaching/Master-AIC/M2-AIC-advanced-ML.html
Outline
1. Inductions
2. The statistical theory of learning
3. Other scenarios
4. The no-free-lunch theorem
5. Explanation-Based Learning: what kind of validation?
6. Questions
Supervised induction
Learning by heart
When there are few data points
• Learning a table
Example x1 x2 x3 x4 Label
1 0 0 1 0 0
2 0 1 0 0 0
3 0 0 1 1 1
4 1 0 0 1 1
5 0 1 1 0 0
6 1 1 0 0 0
7 0 1 0 1 0
When there is a huge number of data points
• Learning a function f: x → y
But how? Which function?
Supervised learning: simple or not so simple?
• Examples are described using: number (1 or 2); size (small or large); shape (circle or square); color (red or green)
• They belong either to class '+' or to class '-'

One example that tells a lot…
Description               Your answer   True answer
1 large red square                      -
1 large green square                    +
2 small red squares                     +
2 large red circles                     -
1 large green circle                    +
1 small red circle                      +
1 small green square                    -
1 small red square                      +
2 large green squares                   +
Yet another exercise
• Examples are described using: number (1 or 2); size (small or large); shape (circle or square); color (red or green)
• They belong either to class '+' or to class '-'
Description               Your prediction   True class
1 large red square                          -
1 large green square                        +
2 small red squares                         +
2 large red circles                         -
1 large green circle                        +
1 small red circle                          +

One example that tells a lot…
• How many possible functions altogether from X to Y?  $2^{2^4} = 2^{16} = 65{,}536$
• How many functions remain after 6 training examples?  $2^{16-6} = 2^{10} = 1{,}024$
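This counting argument is easy to check computationally. Below is a minimal Python sketch (my own, not from the course) that enumerates all $2^{16}$ Boolean functions over the four binary descriptors and counts those consistent with a small training set; each distinct example halves the count. The training pairs are taken from the truth table shown earlier, purely as an illustration.

```python
from itertools import product

# All 2^4 = 16 possible descriptions over 4 binary descriptors,
# and all 2^16 = 65,536 labelings (functions X -> Y).
inputs = list(product([0, 1], repeat=4))
functions = product([0, 1], repeat=len(inputs))

# Hypothetical training set: three (description, label) pairs.
train = [((0, 0, 1, 0), 0), ((0, 1, 0, 0), 0), ((0, 0, 1, 1), 1)]

def consistent(f):
    # f assigns a label to every description, in the order of `inputs`.
    return all(f[inputs.index(x)] == y for x, y in train)

remaining = sum(1 for f in functions if consistent(f))
print(remaining)  # 2^(16 - 3) = 8192: each distinct example halves the set
```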
• Examples are described using: number (1 or 2); size (small or large); shape (circle or square); color (red or green)

One example that tells a lot…

Description               Your prediction   True class
1 large red square                          -
1 large green square                        +
2 small red squares                         +
2 large red circles                         -
1 large green circle                        +
1 small red circle                          +
1 small green square                        -
1 small red square                          +
2 large green squares                       +
2 small green squares                       +
2 small red circles                         +
1 small green circle                        -
2 large green circles                       -
2 small green circles                       +
1 large red circle                          -
2 large red squares                         ?

How many remaining functions after these 15 training examples? $2^{16-15} = 2$: the label of "2 large red squares" is still undetermined.
• Examples are described using: number (1 or 2); size (small or large); shape (circle or square); color (red or green)

Description               Your prediction   True class
1 large red square                          -
1 large green square                        +
2 small red squares                         +
2 large red circles                         -
1 large green circle                        +
1 small red circle                          +

One example that tells a lot…
• How many possible functions with 2 descriptors from X to Y?  $2^{2^2} = 2^4 = 16$
• How many functions remain after 3 training examples?  $2^{4-3} = 2^1 = 2$
Induction: an impossible game?
• A bias is needed
• Types of bias
  – Representation bias (declarative)
  – Search bias (procedural)
Interpretation – completion of percepts
!"#$%&'()%*+,-./01%-2%3#4/1-5-64/701%85697,5%-:%#0;9.2%$97-<2/=-23%>?@%#91/11-A%
&)B'CB&'()%
ED%
6R(&,(-%"U8"-,"%&(7&'!"&"%%"-'.+#.%'&+%%81$'.(-&.-&;i&
!/2*29924#&'(&9#* '##&#"32*/'9*/'3* #1$*0'>$%* -$492Z"24-29*1/&-/*8%2'#(@*-$4#%&J"#23*#$*9/'?2*#/2*?'##2%4*%2-$84&#&$4*V2(3*&4*#/2*?'9#*A21*32-'329B**
!! &#* /'9* (23* #/2* -$00"4&#@* #$* A$-"9* 0'&4(@* $4* 7"+'8)"_*",'()&)"$)"%"-'+'.(-%+* 1/2%2+* 2'-/* $J>2-#* &9* 329-%&J23* &4* #2%09* $A* '*C2-#$%*$A*4"02%&-'(*'##%&J"#29*'43*&9*#/2%2A$%2*0'??23*#$*'*?$&4#*&4*'*S"-(&32'4*G82$02#%&-I*C2-#$%*9?'-2**
!! &#*/'9*(23*%292'%-/2%9*#$*0'&4#'&4*'*)"98,'.(-.%'&$(%.'.(-+*1/2%2J@*$J>2-#9*'%2*9224* &4* &9$('#&$4*'43*1/&-/*#/2%2A$%2*#2439*#$*$C2%($$.*#/2*%$(2*$A*-$4#2a#"'(+*$%*%2('#&$4'(+*&4A$%0'#&$4*
Y(-'"='&!"#$%&]&
!"#$%&'()%*+,-./01%-2%3#4/1-5-64/701%85697,5%-:%#0;9.2%$97-<2/=-23%>?@%#91/11-A%
&)B'CB&'()%
ED%
6R(&,(-%"U8"-,"%&(7&'!"&"%%"-'.+#.%'&+%%81$'.(-&.-&;i&
!/2*29924#&'(&9#* '##&#"32*/'9*/'3* #1$*0'>$%* -$492Z"24-29*1/&-/*8%2'#(@*-$4#%&J"#23*#$*9/'?2*#/2*?'##2%4*%2-$84&#&$4*V2(3*&4*#/2*?'9#*A21*32-'329B**
!! &#* /'9* (23* #/2* -$00"4&#@* #$* A$-"9* 0'&4(@* $4* 7"+'8)"_*",'()&)"$)"%"-'+'.(-%+* 1/2%2+* 2'-/* $J>2-#* &9* 329-%&J23* &4* #2%09* $A* '*C2-#$%*$A*4"02%&-'(*'##%&J"#29*'43*&9*#/2%2A$%2*0'??23*#$*'*?$&4#*&4*'*S"-(&32'4*G82$02#%&-I*C2-#$%*9?'-2**
!! &#*/'9*(23*%292'%-/2%9*#$*0'&4#'&4*'*)"98,'.(-.%'&$(%.'.(-+*1/2%2J@*$J>2-#9*'%2*9224* &4* &9$('#&$4*'43*1/&-/*#/2%2A$%2*#2439*#$*$C2%($$.*#/2*%$(2*$A*-$4#2a#"'(+*$%*%2('#&$4'(+*&4A$%0'#&$4*
Y(-'"='&!"#$%&]&
!"#$%&'()%*+,-./01%-2%3#4/1-5-64/701%85697,5%-:%#0;9.2%$97-<2/=-23%>?@%#91/11-A%
&)B'CB&'()%
ED%
6R(&,(-%"U8"-,"%&(7&'!"&"%%"-'.+#.%'&+%%81$'.(-&.-&;i&
!/2*29924#&'(&9#* '##&#"32*/'9*/'3* #1$*0'>$%* -$492Z"24-29*1/&-/*8%2'#(@*-$4#%&J"#23*#$*9/'?2*#/2*?'##2%4*%2-$84&#&$4*V2(3*&4*#/2*?'9#*A21*32-'329B**
!! &#* /'9* (23* #/2* -$00"4&#@* #$* A$-"9* 0'&4(@* $4* 7"+'8)"_*",'()&)"$)"%"-'+'.(-%+* 1/2%2+* 2'-/* $J>2-#* &9* 329-%&J23* &4* #2%09* $A* '*C2-#$%*$A*4"02%&-'(*'##%&J"#29*'43*&9*#/2%2A$%2*0'??23*#$*'*?$&4#*&4*'*S"-(&32'4*G82$02#%&-I*C2-#$%*9?'-2**
!! &#*/'9*(23*%292'%-/2%9*#$*0'&4#'&4*'*)"98,'.(-.%'&$(%.'.(-+*1/2%2J@*$J>2-#9*'%2*9224* &4* &9$('#&$4*'43*1/&-/*#/2%2A$%2*#2439*#$*$C2%($$.*#/2*%$(2*$A*-$4#2a#"'(+*$%*%2('#&$4'(+*&4A$%0'#&$4*
Y(-'"='&!"#$%&]&
Interpretation – completion of percepts
Optical illusions
Induction and its illusions
Illustration
Clustering
Induction everywhere
The role of induction
• [Leslie Valiant, "Probably Approximately Correct. Nature's Algorithms for Learning and Prospering in a Complex World", Basic Books, 2013]

«From this, we have to conclude that generalization or induction is a pervasive phenomenon (…). It is as routine and reproducible a phenomenon as objects falling under gravity. It is reasonable to expect a quantitative scientific explanation of this highly reproducible phenomenon.»
The role of induction
• [Edwin T. Jaynes, "Probability theory. The logic of science", Cambridge U. Press, 2003], p. 3

«We are hardly able to get through one waking hour without facing some situation (e.g. will it rain or won't it?) where we do not have enough information to permit deductive reasoning; but still we must decide immediately. In spite of its familiarity, the formation of plausible conclusions is a very subtle process.»
Sequences
• 1 1 2 3 5 8 13 21 …
• 1 2 3 5 …
• 1  11  21  1211  111221  312211 …
  – How?
  – Why would induction be possible at all?
  – Should an additional example increase our confidence in the induced rule?
  – How many examples are needed?
Supervised induction
• How to choose the decision function?
[Figure: data points in the (x, y) plane]
Interrogations
Each time: specific cases ⇒ a general law, or adaptation to a new case
1. How is this generalization justified?
2. Can we guarantee anything?
What kind of theoretical guarantees on induction can we get?
Analysis of the perceptron
The perceptron

[Figure: a linear unit i with inputs x(1), …, x(d), weights w_{1i}, …, w_{di}, and a bias neuron x(0) = 1 carrying the bias weight w_{0i}]

$$\sigma(i) = \sum_{j=0}^{d} w_{ji}\, x(j)$$

– Rosenblatt (1958–1962)
The perceptron: a linear discriminant
[Figure: separating hyperplane with normal vector w]
The perceptron
• Learning the weights
  – Principle (Hebb's rule): in case of success, add to each weight (connection) some value proportional to the input and the output
  – Perceptron's rule: learn only in case of failure
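A minimal sketch (my own, not the course's code) of this mistake-driven update: the weights change only when the current weights misclassify an example, by adding the input scaled by the desired output. The toy AND-like dataset is an assumption for illustration.

```python
import numpy as np

def perceptron_train(X, y, lr=1.0, max_epochs=100):
    """Rosenblatt's rule: update w only on a failure (misclassified example).
    X: (m, d) array of inputs; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0  # bias weight w0, fed by the constant bias neuron x(0) = 1
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # failure: wrong side of (or on) the hyperplane
                w += lr * yi * xi        # add the input, scaled by the desired output
                b += lr * yi
                mistakes += 1
        if mistakes == 0:                # converged on the training sample
            break
    return w, b

# Toy linearly separable data (AND-like concept)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, +1])
print(perceptron_train(X, y))
```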
Properties that are remarkable!!
• Convergence in a finite number of steps
  – Independently of the number of examples
  – Independently of the distribution of examples
  – (Quasi) independently of the dimension of the input space
… if there exists at least one linear separator of the examples!!!
Guarantees about generalization??
• Theorems about the performance with respect to the training sample
• But what about future examples?
The Perceptron
– Rosenblatt (1958–1962)
PAC learning
Probably Approximately Correct
Target class: rectangles in R²
• Sample
  – Positive instances ($P^{+}_{X}$)
  – Negative instances ($P^{-}_{X}$)
[Figure: positive and negative points in the (x, y) plane]
Target class: unknown
• What do we want to learn?
A decision function (prediction)
[Figure: a decision boundary in the (x, y) plane]
Target class: unknown
• How to learn?
Target class: rectangles in R²
• How to learn?
  – If I know that the target concept is a rectangle
Target class: rectangles in R²
• How to learn?
  – If I know that the target concept is a rectangle
Most general hypotheses
Target class: rectangles in R²
• How to learn?
  – If I know that the target concept is a rectangle
Most specific hypotheses
Target class: rectangles in R²
• How to learn?
  – Choice of one hypothesis h
Version space
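For rectangles, the most specific consistent hypothesis is simply the bounding box of the positive instances. A small sketch of this (the sample points are made up for illustration):

```python
import numpy as np

def most_specific_rectangle(positive_points):
    """Tightest axis-parallel rectangle containing all positive instances:
    the most specific hypothesis in the version space."""
    P = np.asarray(positive_points)
    return P.min(axis=0), P.max(axis=0)   # lower-left and upper-right corners

def predict(rect, point):
    lo, hi = rect
    return bool(np.all(lo <= point) and np.all(point <= hi))

positives = [(1.0, 2.0), (2.5, 3.0), (2.0, 1.5)]   # hypothetical positive sample
h = most_specific_rectangle(positives)
print(h, predict(h, np.array([2.0, 2.0])))          # the point falls inside -> True
```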
Target class: rectangles in R²
• Learning: choice of h
  – Which performance can we expect?
The statistical theory of learning
Which performance?
• Cost for a prediction error
  – The loss function $\ell(h(\mathbf{x}), y)$
• Which expected cost if I choose h?
  – The «real risk» (or true risk):
$$R(h) = \int_{\mathcal{X} \times \mathcal{Y}} \ell\big(h(\mathbf{x}), y\big)\, p_{XY}(\mathbf{x}, y)\, d\mathbf{x}\, dy$$
The statistical theory of learning
• Which expected cost when h is chosen?
  – Assuming that there is no training error on S
The «empirical risk»:
$$R_{\mathrm{emp}}(h) = \frac{1}{m} \sum_{i=1}^{m} \ell\big(h(\mathbf{x}_i), y_i\big)$$
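In code, the empirical risk is just the average loss over the sample. A direct transcription of the formula with the 0-1 loss (the threshold classifier h below is a made-up example):

```python
import numpy as np

def empirical_risk(h, X, y):
    """R_emp(h) = (1/m) sum_i loss(h(x_i), y_i), here with the 0-1 loss."""
    return float(np.mean([h(x) != yi for x, yi in zip(X, y)]))

# Hypothetical threshold classifier on one-dimensional inputs
h = lambda x: int(x > 0.5)
X = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([0, 1, 1, 1])
print(empirical_risk(h, X, y))   # one mistake out of four -> 0.25
```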
Statistical theory of learning: the ERM
• Learning strategy:
  – Select a hypothesis with zero empirical risk (no training error)
  – What generalization performance can we expect for h?
[Figure: a hypothesis h consistent with the sample vs. the target function f]
Statistical theory of learning: the ERM
  – Select a hypothesis with zero empirical risk (no training error)
  – What generalization performance can we expect for h?
  – What is the probability of getting error $R(h) > \varepsilon$?
[Figure: the error region $h \,\Delta\, f$, where hypothesis h and target f disagree]
Central interrogation: the inductive principle
• The empirical risk minimization principle (ERM) … is it sound?
  – If I choose h such that $R_{\mathrm{emp}}(h) \approx 0$:
  – Is h good with respect to the real risk $R(h)$?
  – Could I have done much better?
The statistical theory of learning
The first step: one hypothesis
StatisticalstudyforONEhypothesis
– Choseonehypothesisofnulempiricalrisk(noerroronthetrainingsetS)
– Whichperformancecanweexpectforh?
– WhatistheriskofhavingR(h)>ε?
x
y
f
h
h � f
x
y
f
h
Statistical study for ONE hypothesis
• Assume that h is such that $R(h) \ge \varepsilon$ (h is «bad»), where $R(h) = P_X(h \,\Delta\, f)$
• What is the probability that h is nonetheless selected?
  – Each example «falls» outside the error region $h \,\Delta\, f$ with probability at most $1 - \varepsilon$, so after one example: $P\big[R_{\mathrm{emp}}(h) = 0\big] \le 1 - \varepsilon$
  – After m examples (i.i.d.): $P^m\big[R_{\mathrm{emp}}(h) = 0\big] \le (1 - \varepsilon)^m$
• We want: $\forall \varepsilon, \delta \in [0, 1]: \; P^m\big[R(h) \ge \varepsilon\big] \le \delta$
Statistical study for ONE hypothesis
• We want: $\forall \varepsilon, \delta \in [0, 1]: \; P^m\big[R(h) \ge \varepsilon\big] \le \delta$
It suffices that: $(1 - \varepsilon)^m \le \delta$
Since $1 - \varepsilon < e^{-\varepsilon}$: $\quad e^{-\varepsilon m} \le \delta$
Hence: $-\varepsilon m \le \ln(\delta)$, i.e.
$$m \ge \frac{\ln(1/\delta)}{\varepsilon}$$
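This bound translates into a one-line sample-size calculator; a sketch (the ε and δ values are just an illustration):

```python
import math

def m_one_hypothesis(eps, delta):
    """Smallest m with (1 - eps)^m <= exp(-eps * m) <= delta,
    i.e. m >= ln(1/delta) / eps."""
    return math.ceil(math.log(1.0 / delta) / eps)

print(m_one_hypothesis(0.05, 0.01))   # 93 examples suffice for eps = 5%, delta = 1%
```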
The statistical theory of learning
The second step: which hypothesis in the crowd?
Statistical study for |H| hypotheses
• What is the probability that I chose one hypothesis $h_{\mathrm{err}}$ of real risk > ε and that I do not realize it after m examples?
• Probability of survival of $h_{\mathrm{err}}$ after 1 example: $\le 1 - \varepsilon$
• Probability of survival of $h_{\mathrm{err}}$ after m examples: $\le (1 - \varepsilon)^m$
• Probability of survival of at least one such hypothesis in H:
  – We use the union bound: $\le |\mathcal{H}|\,(1 - \varepsilon)^m$
• We want the probability that there remains at least one hypothesis of real risk > ε in the version space to be bounded by δ: $|\mathcal{H}|\,(1 - \varepsilon)^m \le \delta$
The «PAC learning» analysis
• We get: $P^m\big[\exists h \in \mathcal{H}: R_{\mathrm{emp}}(h) = 0 \text{ and } R(h) > \varepsilon\big] \le |\mathcal{H}|\,(1 - \varepsilon)^m \le \delta$ as soon as $m \ge \frac{1}{\varepsilon}\big(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\big)$
• Realizable case: there exists at least one function h of risk 0

⇒ The Empirical Risk Minimization principle is sound only if there are constraints on the hypothesis space
PAC learning: definition
• Worst-case analysis
  – Against all distributions P
  – For any target hypothesis in a class of hypotheses
• Notion of computational complexity

Given 0 < δ, ε < 1, a concept class C is learnable by a polynomial-time algorithm A if, for any distribution P of samples and any concept c ∈ C, there exists a polynomial p(·, ·, ·) such that A will produce, with probability at least 1 − δ, a hypothesis h ∈ C whose error is at most ε when given at least p(·, 1/δ, 1/ε) independent random examples drawn according to P.

[Valiant, 1984]
The statistical theory of learning
Uniform convergence bounds (for the unrealizable case)
Generalizing the law of large numbers: uniform convergence

Theorem 1 (Hoeffding's inequality). If the $\ell_i$ are random variables drawn independently from the same distribution and taking their values in the interval $[a, b]$, then:
$$P\left[\,\left|\frac{1}{m}\sum_{i=1}^{m} \ell_i - \mathbb{E}(\ell)\right| \ge \varepsilon\,\right] \;\le\; 2\,\exp\!\left(-\,\frac{2\,m\,\varepsilon^2}{(b-a)^2}\right)$$

Applied to the empirical risk and the real risk, this inequality gives:
$$P\big[\,|R_{\mathrm{emp}}(h) - R(h)| \ge \varepsilon\,\big] \;\le\; 2\,\exp\!\left(-\,\frac{2\,m\,\varepsilon^2}{(b-a)^2}\right) \qquad (1)$$
if the loss function $\ell$ is defined on the interval $[a, b]$.

$$P^m\big[\exists h \in \mathcal{H}: R(h) - R_{\mathrm{emp}}(h) > \varepsilon\big] \;\le\; \sum_{i=1}^{|\mathcal{H}|} P^m\big[R(h_i) - R_{\mathrm{emp}}(h_i) > \varepsilon\big] \;\le\; |\mathcal{H}|\,\exp(-2\,m\,\varepsilon^2) = \delta$$
assuming here that the loss function $\ell$ takes its values in the interval $[0, 1]$.

« H finite »
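Hoeffding's inequality is easy to check by simulation for 0-1 losses. A sketch with made-up parameters (sample size, deviation, true risk): the observed deviation probability stays well below the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps, mu, trials = 100, 0.1, 0.3, 100_000   # sample size, deviation, true risk

# Empirical risks of `trials` independent samples of m Bernoulli(mu) losses
emp_risks = rng.binomial(1, mu, size=(trials, m)).mean(axis=1)
observed = np.mean(np.abs(emp_risks - mu) >= eps)
bound = 2 * np.exp(-2 * m * eps**2)           # losses in [0, 1], so (b - a)^2 = 1
print(observed, bound)                        # ~0.03 <= ~0.27
```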
Bounding the true risk with the empirical risk + …
• H finite, realizable case:
$$\forall h \in \mathcal{H},\; \forall \delta \le 1: \quad P^m\!\left[\, R(h) \le R_{\mathrm{emp}}(h) + \frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{m} \,\right] > 1 - \delta$$
• H finite, non-realizable case:
$$\forall h \in \mathcal{H},\; \forall \delta \le 1: \quad P^m\!\left[\, R(h) \le R_{\mathrm{emp}}(h) + \sqrt{\frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{2\,m}} \,\right] > 1 - \delta$$
To sum up: for |H| finite
• Realizable case:
$$\varepsilon = \frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{m} \qquad \text{and} \qquad m \ge \frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{\varepsilon}$$
• Non-realizable case:
$$\varepsilon = \sqrt{\frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{2\,m}} \qquad \text{and} \qquad m \ge \frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{2\,\varepsilon^2}$$
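Both sample-complexity formulas are directly computable. A sketch, using as |H| the $2^{16}$ Boolean functions of the earlier exercise (my choice of numbers, for illustration only):

```python
import math

def m_realizable(H_size, eps, delta):
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def m_non_realizable(H_size, eps, delta):
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * eps**2))

H = 2**16                              # e.g. all Boolean functions on 4 binary descriptors
print(m_realizable(H, 0.05, 0.01))     # 314 examples
print(m_non_realizable(H, 0.05, 0.01)) # 3140 examples: much more demanding
```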
|H| infinite!!
• Effective dimension of H = the Vapnik-Chervonenkis dimension
  – A combinatorial criterion
  – The size of the largest set of points (in general position) that can be labeled in every possible way by hypotheses drawn from H:
$$d_{VC}(\mathcal{H}) = \max\big\{\, m : \Pi_{\mathcal{H}}(m) = 2^m \,\big\}$$
where $\Pi_{\mathcal{H}}(m)$ is the growth function.
• Bound on the true risk:
$$\forall h \in \mathcal{H},\; \forall \delta \le 1: \quad P^m\!\left[\, R(h) \le R_{\mathrm{emp}}(h) + \sqrt{\frac{8\, d_{VC}(\mathcal{H}) \log\frac{2\,e\,m}{d_{VC}(\mathcal{H})} + 8 \log\frac{4}{\delta}}{m}} \,\right] > 1 - \delta$$
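The VC bound can be evaluated numerically. A sketch with illustrative numbers (zero empirical risk, $d_{VC} = 3$ as for linear separators in the plane):

```python
import math

def vc_bound(emp_risk, m, d_vc, delta):
    """R(h) <= R_emp(h) + sqrt((8 d ln(2em/d) + 8 ln(4/delta)) / m)."""
    slack = math.sqrt((8 * d_vc * math.log(2 * math.e * m / d_vc)
                       + 8 * math.log(4 / delta)) / m)
    return emp_risk + slack

for m in (1_000, 10_000, 100_000):
    print(m, round(vc_bound(0.0, m, 3, 0.05), 3))   # slack shrinks roughly as 1/sqrt(m)
```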
VC dim: illustrations
• $d_{VC}$(linear separators in the plane) = ?
[Figure (a)–(c): small sets of '+' and '-' points separated by a line]
• $d_{VC}$(axis-parallel rectangles) = ?
[Figure (a)–(d): small sets of '+' and '-' points captured by a rectangle]
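Shattering can be checked by brute force on tiny point sets. The sketch below (my own; random search over separators is crude but sufficient at this scale) confirms that 3 points in general position are shattered by lines while these 4 points are not, consistent with $d_{VC} = 3$ for linear separators in the plane:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def realizable(points, labels, tries=20_000):
    """Is there a linear separator sign(w.x + b) producing these labels?
    Random search: enough for tiny 2-d examples like these."""
    for _ in range(tries):
        w, b = rng.normal(size=2), rng.normal()
        if all(np.sign(w @ p + b) == l for p, l in zip(points, labels)):
            return True
    return False

def shattered(points):
    # Every one of the 2^n labelings must be realizable.
    return all(realizable(points, ls) for ls in product([-1, 1], repeat=len(points)))

three = [np.array(p) for p in [(0., 0.), (1., 0.), (0., 1.)]]
four = three + [np.array((1., 1.))]
print(shattered(three), shattered(four))   # True False (the XOR labeling fails)
```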
Lesson
• You cannot guarantee anything about induction
• Even if you assume that the world is stationary and the examples are i.i.d.
• Unless there are (severe) constraints on the hypothesis space

But wait…?
The SuperVision network
Image classification with deep convolutional neural networks:
• 7 hidden "weight" layers
• 650K neurons
• 60M parameters
• 630M connections
• Rectified Linear Units, overlapping pooling, dropout trick
• Randomly extracted 224×224 patches for more data
http://image-net.org/challenges/LSVRC/2012/supervision.pdf
GoogleNet
• A Meccano of neural networks

1×1 convolutions may seem trivial, since they do not reduce the dimension of the input, but their non-linearity makes the detected features more complex and thus lets the network see more complex patterns. Network in Network also introduced networks made entirely of convolutional layers, replacing the classification layers by 1×1 filters (Figure 10).

Figure 10. The Network in Network module [33]

GoogleNet [58] is one of the most widely used architectures (with AlexNet) owing to its performance. Developed by Google and winner of ILSVRC 2014, the model stands out by its complexity (22 layers versus 8 for AlexNet) and its use of inception modules (Figure 11). The inception module (Figure 12) is a configuration that applies several filters of different sizes in parallel. This parallelism and the multiplicity of filters make it possible to learn several feature-extraction logics, from fine details with the 1×1 filters up to larger shapes with the 5×5 filters.

Figure 11. Architecture of the GoogleNet network [58]
Troubling findings
• A paper: C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals (ICLR, May 2017), "Understanding deep learning requires rethinking generalization"
• Extensive experiments on the classification of images:
  – AlexNet (>1,000,000 parameters) + 2 other architectures
  – The CIFAR-10 dataset:
    • 60,000 images categorized in 10 classes (50,000 for training and 10,000 for testing)
    • Images: 32×32 pixels in 3 color channels

Again, on intuitive grounds we expect that in order to make good predictions we need to select a hypothesis class F that is appropriate for the problem at hand. More precisely, we should use some prior knowledge about the nature of the link between the features x and the target y to choose which functions the class F should possess. For instance, if for any reason we know that with high probability the relation between x and y is approximately linear, we had better choose F to contain only such functions f_w(x) = w · x. In the most general setting this relationship is encoded in a complicated and unknown probability distribution P on labeled observations (x, y). In many cases all we know is that the relation between x and y has some smoothness properties.

The set of techniques that data scientists use to adapt the hypothesis class F to a specific problem is known as regularization. Some of these are explicit, in the sense that they constrain estimators f in some way, as we shall describe in section 2. Some are implicit, meaning that it is the dynamics of the algorithm that walks its way through the set F in search of a good f (typically using stochastic gradient descent) that provides the regularization. Some of these regularization techniques actually pertain more to art than to mathematics, as they rely more on experience and intuition than on theorems.

Figure 1: The architecture of AlexNet, which is one of the networks used by the authors in [1]

Deep Learning is a very popular class of machine learning models, roughly inspired by biology, that are particularly well suited for tackling complex, AI-like tasks such as image classification, NLP or automatic translation. Roughly speaking, these models are defined by stacking layers that each combine linear combinations of the input with non-linear activation functions (and perhaps some regularization). We won't enter into defining them in detail here, as many excellent textbooks [3, 4] will do the job. Figure 1 shows the architecture of AlexNet, a deep network used in the experiment [1]. For our purpose, which is a discussion of the issue of generalization and regularization, suffice it to say here that these Deep Learning problems share the following facts:
• The number n of samples available for training these networks is typically much smaller than the number k of parameters w = (w_1, …, w_k) that define the functions f_w ∈ F.¹
• The probability distribution P(x, y) is impossible to describe in any sensible way in practice. For concreteness, think of x as the pixels of an …

¹ The number of parameters k of a Deep Learning network such as AlexNet can be over a hundred million, while being trained on "only" a few million images in ImageNet.
Troubling findings
Experiments
1. Original dataset without modification
• Results?
  – Training accuracy = 100%; Test accuracy = 89%
  – Speed of convergence: ~5,000 steps
Troubling findings
Experiments
1. Original dataset without modification
• Results?
  – Training accuracy = 100%; Test accuracy = 89%
  – Speed of convergence: ~5,000 steps

This is the expected behavior if the capacity of the hypothesis space is limited, i.e. if the system cannot fit any (arbitrary) training data:
$$\forall h \in \mathcal{H},\; \forall \delta \le 1: \quad P^m\!\left[\, R(h) \le R_{\mathrm{emp}}(h) + 2\,\widehat{\mathrm{Rad}}_m(\mathcal{H}) + 3\sqrt{\frac{\ln(2/\delta)}{m}} \,\right] > 1 - \delta$$
Troubling findings
Experiments
1. Original dataset without modification
• Results?
  – Training accuracy = 100%; Test accuracy = 89%
  – Speed of convergence: ~5,000 steps
2. Random labels
  – Training accuracy = 100%!!??; Test accuracy = 9.8%
  – Speed of convergence: similar behavior (~10,000 steps)
!!!
Troubling findings
Experiments
1. Original dataset without modification
• Results?
  – Training accuracy = 100%; Test accuracy = 89%
  – Speed of convergence: ~5,000 steps
2. Random labels
  – Training accuracy = 100%!!??; Test accuracy = 9.8%
  – Speed of convergence: similar behavior (~10,000 steps)
3. Random pixels
  – Training accuracy = 100%!!??; Test accuracy ~10%
  – Speed of convergence: similar behavior (~10,000 steps)

Now, we are in trouble!!
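The qualitative phenomenon is easy to reproduce with any learner of unbounded capacity. A toy sketch (a 1-nearest-neighbour memorizer on random data, my own stand-in for the paper's networks): training accuracy is 100% by construction, while test accuracy sits at chance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random "images" and random labels: there is no signal to generalize from.
X_train, y_train = rng.normal(size=(500, 32)), rng.integers(0, 10, 500)
X_test, y_test = rng.normal(size=(200, 32)), rng.integers(0, 10, 200)

def one_nn(x):
    """A pure memorizer: copy the label of the nearest training point."""
    return y_train[np.argmin(np.linalg.norm(X_train - x, axis=1))]

train_acc = np.mean([one_nn(x) == yi for x, yi in zip(X_train, y_train)])
test_acc = np.mean([one_nn(x) == yi for x, yi in zip(X_test, y_test)])
print(train_acc, test_acc)   # 1.0 on the training set, ~0.1 (chance) on the test set
```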
Troubling findings
• Deep NNs can accommodate ANY training set: the Rademacher complexity term $\widehat{\mathrm{Rad}}_m(\mathcal{H})$ can grow without limit!!
$$\forall h \in \mathcal{H},\; \forall \delta \le 1: \quad P^m\!\left[\, R(h) \le R_{\mathrm{emp}}(h) + 2\,\widehat{\mathrm{Rad}}_m(\mathcal{H}) + 3\sqrt{\frac{\ln(2/\delta)}{m}} \,\right] > 1 - \delta$$

But then, why are deep NNs so good on image classification tasks?