Antoine Cornuéjols
AgroParisTech – INRA MIA 518

Course: Learning theory and advanced Machine Learning
The course
• Documents
  – The book: "L'apprentissage artificiel. Deep Learning, concepts et algorithmes", A. Cornuéjols, L. Miclet & V. Barra, Eyrolles, 3rd ed., 2018
  – The slides + additional information at:
    http://www2.agroparistech.fr/ufr-info/membres/cornuejols/Teaching/Master-AIC/M2-AIC-advanced-ML.html
Outline of the course
• Building an inductive criterion: semi-supervised learning; learning sparse models
• Induction: how does it work? Which guarantees can we get? The no-free-lunch theorem
• Online learning: theory (new inductive criteria); in practice (heuristic inductive criteria), e.g. early classification of time series and LUPI
• Transfer learning: scenarios; which information to exchange? how to obtain guarantees?
• Ensemble methods: what kinds of algorithms? which information to exchange? and in the unsupervised case?
Course organization
• 6 courses
• 1 seminar-like session: discussion of papers
• 5 quizzes (5 × 5 = 25%)
• Project: 50%
  – 19/12/2019: description of the chosen project (2 pages)
  – 31/01/2020: mid-term report (5 to 8 pages)
  – 28/02/2020: final report (10 pages strict, ICML paper format)
• Critical review of papers: 25%
A. Cornuéjols
AgroParisTech – INRA MIA 518

Reflections on INDUCTION-S
http://www.agroparistech.fr/ufr-info/membres/cornuejols/Teaching/Master-AIC/M2-AIC-advanced-ML.html
Outline
1. Inductions
2. The statistical theory of learning
3. Other scenarios
4. The no-free-lunch theorem
5. Explanation-Based Learning: what kind of validation?
6. Questions
Supervised induction
Learning by heart
When there are few data points
• Learning a table
Example x1 x2 x3 x4 Label
1 0 0 1 0 0
2 0 1 0 0 0
3 0 0 1 1 1
4 1 0 0 1 1
5 0 1 1 0 0
6 1 1 0 0 0
7 0 1 0 1 0
When there is a huge number of data points
• Learning a function f: x → y
But how? Which function?
Supervised learning: simple or not so simple?
• Examples are described using: number (1 or 2); size (small or large); shape (circle or square); color (red or green)
• They belong either to class '+' or to class '-'

One example that tells a lot…
Description               Your answer   True answer
1 large red square                      -
1 large green square                    +
2 small red squares                     +
2 large red circles                     -
1 large green circle                    +
1 small red circle                      +
1 small green square                    -
1 small red square                      +
2 large green squares                   +
Yet another exercise
• Examples are described using: number (1 or 2); size (small or large); shape (circle or square); color (red or green)
• They belong either to class '+' or to class '-'
Description               Your prediction   True class
1 large red square                          -
1 large green square                        +
2 small red squares                         +
2 large red circles                         -
1 large green circle                        +
1 small red circle                          +

One example that tells a lot…
• How many possible functions altogether from X to Y?  $2^{2^4} = 2^{16} = 65{,}536$
• How many functions remain after 6 training examples?  $2^{16-6} = 2^{10} = 1{,}024$
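This counting argument is easy to check computationally. Below is a minimal Python sketch (my own, not from the course) that enumerates all $2^{16}$ Boolean functions over the four binary descriptors and counts those consistent with a small training set; each distinct example halves the count. The training pairs are taken from the truth table shown earlier, purely as an illustration.

```python
from itertools import product

# All 2^4 = 16 possible descriptions over 4 binary descriptors,
# and all 2^16 = 65,536 labelings (functions X -> Y).
inputs = list(product([0, 1], repeat=4))
functions = product([0, 1], repeat=len(inputs))

# Hypothetical training set: three (description, label) pairs.
train = [((0, 0, 1, 0), 0), ((0, 1, 0, 0), 0), ((0, 0, 1, 1), 1)]

def consistent(f):
    # f assigns a label to every description, in the order of `inputs`.
    return all(f[inputs.index(x)] == y for x, y in train)

remaining = sum(1 for f in functions if consistent(f))
print(remaining)  # 2^(16 - 3) = 8192: each distinct example halves the set
```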
• Examples are described using: number (1 or 2); size (small or large); shape (circle or square); color (red or green)

One example that tells a lot…

Description               Your prediction   True class
1 large red square                          -
1 large green square                        +
2 small red squares                         +
2 large red circles                         -
1 large green circle                        +
1 small red circle                          +
1 small green square                        -
1 small red square                          +
2 large green squares                       +
2 small green squares                       +
2 small red circles                         +
1 small green circle                        -
2 large green circles                       -
2 small green circles                       +
1 large red circle                          -
2 large red squares                         ?

How many remaining functions after these 15 training examples? $2^{16-15} = 2$: the label of "2 large red squares" is still undetermined.
• Examples are described using: number (1 or 2); size (small or large); shape (circle or square); color (red or green)

Description               Your prediction   True class
1 large red square                          -
1 large green square                        +
2 small red squares                         +
2 large red circles                         -
1 large green circle                        +
1 small red circle                          +

One example that tells a lot…
• How many possible functions with 2 descriptors from X to Y?  $2^{2^2} = 2^4 = 16$
• How many functions remain after 3 training examples?  $2^{4-3} = 2^1 = 2$
Induction: an impossible game?
• A bias is needed
• Types of bias
  – Representation bias (declarative)
  – Search bias (procedural)
Interpretation – completion of percepts
!"#$%&'()%*+,-./01%-2%3#4/1-5-64/701%85697,5%-:%#0;9.2%$97-<2/=-23%>?@%#91/11-A%
&)B'CB&'()%
ED%
6R(&,(-%"U8"-,"%&(7&'!"&"%%"-'.+#.%'&+%%81$'.(-&.-&;i&
!/2*29924#&'(&9#* '##&#"32*/'9*/'3* #1$*0'>$%* -$492Z"24-29*1/&-/*8%2'#(@*-$4#%&J"#23*#$*9/'?2*#/2*?'##2%4*%2-$84&#&$4*V2(3*&4*#/2*?'9#*A21*32-'329B**
!! &#* /'9* (23* #/2* -$00"4&#@* #$* A$-"9* 0'&4(@* $4* 7"+'8)"_*",'()&)"$)"%"-'+'.(-%+* 1/2%2+* 2'-/* $J>2-#* &9* 329-%&J23* &4* #2%09* $A* '*C2-#$%*$A*4"02%&-'(*'##%&J"#29*'43*&9*#/2%2A$%2*0'??23*#$*'*?$&4#*&4*'*S"-(&32'4*G82$02#%&-I*C2-#$%*9?'-2**
!! &#*/'9*(23*%292'%-/2%9*#$*0'&4#'&4*'*)"98,'.(-.%'&$(%.'.(-+*1/2%2J@*$J>2-#9*'%2*9224* &4* &9$('#&$4*'43*1/&-/*#/2%2A$%2*#2439*#$*$C2%($$.*#/2*%$(2*$A*-$4#2a#"'(+*$%*%2('#&$4'(+*&4A$%0'#&$4*
Y(-'"='&!"#$%&]&
!"#$%&'()%*+,-./01%-2%3#4/1-5-64/701%85697,5%-:%#0;9.2%$97-<2/=-23%>?@%#91/11-A%
&)B'CB&'()%
ED%
6R(&,(-%"U8"-,"%&(7&'!"&"%%"-'.+#.%'&+%%81$'.(-&.-&;i&
!/2*29924#&'(&9#* '##&#"32*/'9*/'3* #1$*0'>$%* -$492Z"24-29*1/&-/*8%2'#(@*-$4#%&J"#23*#$*9/'?2*#/2*?'##2%4*%2-$84&#&$4*V2(3*&4*#/2*?'9#*A21*32-'329B**
!! &#* /'9* (23* #/2* -$00"4&#@* #$* A$-"9* 0'&4(@* $4* 7"+'8)"_*",'()&)"$)"%"-'+'.(-%+* 1/2%2+* 2'-/* $J>2-#* &9* 329-%&J23* &4* #2%09* $A* '*C2-#$%*$A*4"02%&-'(*'##%&J"#29*'43*&9*#/2%2A$%2*0'??23*#$*'*?$&4#*&4*'*S"-(&32'4*G82$02#%&-I*C2-#$%*9?'-2**
!! &#*/'9*(23*%292'%-/2%9*#$*0'&4#'&4*'*)"98,'.(-.%'&$(%.'.(-+*1/2%2J@*$J>2-#9*'%2*9224* &4* &9$('#&$4*'43*1/&-/*#/2%2A$%2*#2439*#$*$C2%($$.*#/2*%$(2*$A*-$4#2a#"'(+*$%*%2('#&$4'(+*&4A$%0'#&$4*
Y(-'"='&!"#$%&]&
!"#$%&'()%*+,-./01%-2%3#4/1-5-64/701%85697,5%-:%#0;9.2%$97-<2/=-23%>?@%#91/11-A%
&)B'CB&'()%
ED%
6R(&,(-%"U8"-,"%&(7&'!"&"%%"-'.+#.%'&+%%81$'.(-&.-&;i&
!/2*29924#&'(&9#* '##&#"32*/'9*/'3* #1$*0'>$%* -$492Z"24-29*1/&-/*8%2'#(@*-$4#%&J"#23*#$*9/'?2*#/2*?'##2%4*%2-$84&#&$4*V2(3*&4*#/2*?'9#*A21*32-'329B**
!! &#* /'9* (23* #/2* -$00"4&#@* #$* A$-"9* 0'&4(@* $4* 7"+'8)"_*",'()&)"$)"%"-'+'.(-%+* 1/2%2+* 2'-/* $J>2-#* &9* 329-%&J23* &4* #2%09* $A* '*C2-#$%*$A*4"02%&-'(*'##%&J"#29*'43*&9*#/2%2A$%2*0'??23*#$*'*?$&4#*&4*'*S"-(&32'4*G82$02#%&-I*C2-#$%*9?'-2**
!! &#*/'9*(23*%292'%-/2%9*#$*0'&4#'&4*'*)"98,'.(-.%'&$(%.'.(-+*1/2%2J@*$J>2-#9*'%2*9224* &4* &9$('#&$4*'43*1/&-/*#/2%2A$%2*#2439*#$*$C2%($$.*#/2*%$(2*$A*-$4#2a#"'(+*$%*%2('#&$4'(+*&4A$%0'#&$4*
Y(-'"='&!"#$%&]&
Interpretation – completion of percepts
Optical illusions
Induction and its illusions
Illustration
Clustering
Induction everywhere
The role of induction
• [Leslie Valiant, "Probably Approximately Correct. Nature's Algorithms for Learning and Prospering in a Complex World", Basic Books, 2013]

«From this, we have to conclude that generalization or induction is a pervasive phenomenon (…). It is as routine and reproducible a phenomenon as objects falling under gravity. It is reasonable to expect a quantitative scientific explanation of this highly reproducible phenomenon.»
The role of induction
• [Edwin T. Jaynes, "Probability theory. The logic of science", Cambridge U. Press, 2003], p. 3

«We are hardly able to get through one waking hour without facing some situation (e.g. will it rain or won't it?) where we do not have enough information to permit deductive reasoning; but still we must decide immediately. In spite of its familiarity, the formation of plausible conclusions is a very subtle process.»
Sequences
• 1 1 2 3 5 8 13 21 …
• 1 2 3 5 …
• 1  11  21  1211  111221  312211 …
  – How?
  – Why would induction be possible at all?
  – Should an additional example increase our confidence in the induced rule?
  – How many examples are needed?
Supervised induction
• How to choose the decision function?
[Figure: data points in the (x, y) plane]
Interrogations
Each time: specific cases ⇒ a general law, or adaptation to a new case
1. How is this generalization justified?
2. Can we guarantee anything?
What kind of theoretical guarantees on induction can we get?
Analysis of the perceptron
The perceptron

[Figure: a linear unit i with inputs x(1), …, x(d), weights w_{1i}, …, w_{di}, and a bias neuron x(0) = 1 carrying the bias weight w_{0i}]

$$\sigma(i) = \sum_{j=0}^{d} w_{ji}\, x(j)$$

– Rosenblatt (1958–1962)
The perceptron: a linear discriminant
[Figure: separating hyperplane with normal vector w]
The perceptron
• Learning the weights
  – Principle (Hebb's rule): in case of success, add to each weight (connection) some value proportional to the input and the output
  – Perceptron's rule: learn only in case of failure
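A minimal sketch (my own, not the course's code) of this mistake-driven update: the weights change only when the current weights misclassify an example, by adding the input scaled by the desired output. The toy AND-like dataset is an assumption for illustration.

```python
import numpy as np

def perceptron_train(X, y, lr=1.0, max_epochs=100):
    """Rosenblatt's rule: update w only on a failure (misclassified example).
    X: (m, d) array of inputs; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0  # bias weight w0, fed by the constant bias neuron x(0) = 1
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # failure: wrong side of (or on) the hyperplane
                w += lr * yi * xi        # add the input, scaled by the desired output
                b += lr * yi
                mistakes += 1
        if mistakes == 0:                # converged on the training sample
            break
    return w, b

# Toy linearly separable data (AND-like concept)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, +1])
print(perceptron_train(X, y))
```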
Properties that are remarkable!!
• Convergence in a finite number of steps
  – Independently of the number of examples
  – Independently of the distribution of examples
  – (Quasi) independently of the dimension of the input space
… if there exists at least one linear separator of the examples!!!
Guarantees about generalization??
• Theorems about the performance with respect to the training sample
• But what about future examples?
The Perceptron
– Rosenblatt (1958–1962)
PAC learning
Probably Approximately Correct
Target class: rectangles in R²
• Sample
  – Positive instances ($P^{+}_{X}$)
  – Negative instances ($P^{-}_{X}$)
[Figure: positive and negative points in the (x, y) plane]
Target class: unknown
• What do we want to learn?
A decision function (prediction)
[Figure: a decision boundary in the (x, y) plane]
Target class: unknown
• How to learn?
Target class: rectangles in R²
• How to learn?
  – If I know that the target concept is a rectangle
Target class: rectangles in R²
• How to learn?
  – If I know that the target concept is a rectangle
Most general hypotheses
Target class: rectangles in R²
• How to learn?
  – If I know that the target concept is a rectangle
Most specific hypotheses
Target class: rectangles in R²
• How to learn?
  – Choice of one hypothesis h
Version space
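For rectangles, the most specific consistent hypothesis is simply the bounding box of the positive instances. A small sketch of this (the sample points are made up for illustration):

```python
import numpy as np

def most_specific_rectangle(positive_points):
    """Tightest axis-parallel rectangle containing all positive instances:
    the most specific hypothesis in the version space."""
    P = np.asarray(positive_points)
    return P.min(axis=0), P.max(axis=0)   # lower-left and upper-right corners

def predict(rect, point):
    lo, hi = rect
    return bool(np.all(lo <= point) and np.all(point <= hi))

positives = [(1.0, 2.0), (2.5, 3.0), (2.0, 1.5)]   # hypothetical positive sample
h = most_specific_rectangle(positives)
print(h, predict(h, np.array([2.0, 2.0])))          # the point falls inside -> True
```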
Target class: rectangles in R²
• Learning: choice of h
  – Which performance can we expect?
The statistical theory of learning
Which performance?
• Cost for a prediction error
  – The loss function $\ell(h(\mathbf{x}), y)$
• Which expected cost if I choose h?
  – The «real risk» (or true risk):
$$R(h) = \int_{\mathcal{X} \times \mathcal{Y}} \ell\big(h(\mathbf{x}), y\big)\, p_{XY}(\mathbf{x}, y)\, d\mathbf{x}\, dy$$
The statistical theory of learning
• Which expected cost when h is chosen?
  – Assuming that there is no training error on S
The «empirical risk»:
$$R_{\mathrm{emp}}(h) = \frac{1}{m} \sum_{i=1}^{m} \ell\big(h(\mathbf{x}_i), y_i\big)$$
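In code, the empirical risk is just the average loss over the sample. A direct transcription of the formula with the 0-1 loss (the threshold classifier h below is a made-up example):

```python
import numpy as np

def empirical_risk(h, X, y):
    """R_emp(h) = (1/m) sum_i loss(h(x_i), y_i), here with the 0-1 loss."""
    return float(np.mean([h(x) != yi for x, yi in zip(X, y)]))

# Hypothetical threshold classifier on one-dimensional inputs
h = lambda x: int(x > 0.5)
X = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([0, 1, 1, 1])
print(empirical_risk(h, X, y))   # one mistake out of four -> 0.25
```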
Statistical theory of learning: the ERM
• Learning strategy:
  – Select a hypothesis with zero empirical risk (no training error)
  – What generalization performance can we expect for h?
[Figure: a hypothesis h consistent with the sample vs. the target function f]
Statistical theory of learning: the ERM
  – Select a hypothesis with zero empirical risk (no training error)
  – What generalization performance can we expect for h?
  – What is the probability of getting error $R(h) > \varepsilon$?
[Figure: the error region $h \,\Delta\, f$, where hypothesis h and target f disagree]
Central interrogation: the inductive principle
• The empirical risk minimization principle (ERM) … is it sound?
  – If I choose h such that $R_{\mathrm{emp}}(h) \approx 0$:
  – Is h good with respect to the real risk $R(h)$?
  – Could I have done much better?
The statistical theory of learning
The first step: one hypothesis
StatisticalstudyforONEhypothesis
– Choseonehypothesisofnulempiricalrisk(noerroronthetrainingsetS)
– Whichperformancecanweexpectforh?
– WhatistheriskofhavingR(h)>ε?
x
y
f
h
h � f
x
y
f
h
Statistical study for ONE hypothesis
• Assume that h is such that $R(h) \ge \varepsilon$ (h is «bad»), where $R(h) = P_X(h \,\Delta\, f)$
• What is the probability that h is nonetheless selected?
  – Each example «falls» outside the error region $h \,\Delta\, f$ with probability at most $1 - \varepsilon$, so after one example: $P\big[R_{\mathrm{emp}}(h) = 0\big] \le 1 - \varepsilon$
  – After m examples (i.i.d.): $P^m\big[R_{\mathrm{emp}}(h) = 0\big] \le (1 - \varepsilon)^m$
• We want: $\forall \varepsilon, \delta \in [0, 1]: \; P^m\big[R(h) \ge \varepsilon\big] \le \delta$
Statistical study for ONE hypothesis
• We want: $\forall \varepsilon, \delta \in [0, 1]: \; P^m\big[R(h) \ge \varepsilon\big] \le \delta$
It suffices that: $(1 - \varepsilon)^m \le \delta$
Since $1 - \varepsilon < e^{-\varepsilon}$: $\quad e^{-\varepsilon m} \le \delta$
Hence: $-\varepsilon m \le \ln(\delta)$, i.e.
$$m \ge \frac{\ln(1/\delta)}{\varepsilon}$$
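This bound translates into a one-line sample-size calculator; a sketch (the ε and δ values are just an illustration):

```python
import math

def m_one_hypothesis(eps, delta):
    """Smallest m with (1 - eps)^m <= exp(-eps * m) <= delta,
    i.e. m >= ln(1/delta) / eps."""
    return math.ceil(math.log(1.0 / delta) / eps)

print(m_one_hypothesis(0.05, 0.01))   # 93 examples suffice for eps = 5%, delta = 1%
```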
The statistical theory of learning
The second step: which hypothesis in the crowd?
Statistical study for |H| hypotheses
• What is the probability that I chose one hypothesis $h_{\mathrm{err}}$ of real risk > ε and that I do not realize it after m examples?
• Probability of survival of $h_{\mathrm{err}}$ after 1 example: $\le 1 - \varepsilon$
• Probability of survival of $h_{\mathrm{err}}$ after m examples: $\le (1 - \varepsilon)^m$
• Probability of survival of at least one such hypothesis in H:
  – We use the union bound: $\le |\mathcal{H}|\,(1 - \varepsilon)^m$
• We want the probability that there remains at least one hypothesis of real risk > ε in the version space to be bounded by δ: $|\mathcal{H}|\,(1 - \varepsilon)^m \le \delta$
The «PAC learning» analysis
• We get: $P^m\big[\exists h \in \mathcal{H}: R_{\mathrm{emp}}(h) = 0 \text{ and } R(h) > \varepsilon\big] \le |\mathcal{H}|\,(1 - \varepsilon)^m \le \delta$ as soon as $m \ge \frac{1}{\varepsilon}\big(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\big)$
• Realizable case: there exists at least one function h of risk 0

⇒ The Empirical Risk Minimization principle is sound only if there are constraints on the hypothesis space
PAC learning: definition
• Worst-case analysis
  – Against all distributions P
  – For any target hypothesis in a class of hypotheses
• Notion of computational complexity

Given 0 < δ, ε < 1, a concept class C is learnable by a polynomial-time algorithm A if, for any distribution P of samples and any concept c ∈ C, there exists a polynomial p(·, ·, ·) such that A will produce, with probability at least 1 − δ, a hypothesis h ∈ C whose error is at most ε when given at least p(·, 1/δ, 1/ε) independent random examples drawn according to P.

[Valiant, 1984]
The statistical theory of learning
Uniform convergence bounds (for the unrealizable case)
Generalizing the law of large numbers: uniform convergence

Theorem 1 (Hoeffding's inequality). If the $\ell_i$ are random variables drawn independently from the same distribution and taking their values in the interval $[a, b]$, then:
$$P\left[\,\left|\frac{1}{m}\sum_{i=1}^{m} \ell_i - \mathbb{E}(\ell)\right| \ge \varepsilon\,\right] \;\le\; 2\,\exp\!\left(-\,\frac{2\,m\,\varepsilon^2}{(b-a)^2}\right)$$

Applied to the empirical risk and the real risk, this inequality gives:
$$P\big[\,|R_{\mathrm{emp}}(h) - R(h)| \ge \varepsilon\,\big] \;\le\; 2\,\exp\!\left(-\,\frac{2\,m\,\varepsilon^2}{(b-a)^2}\right) \qquad (1)$$
if the loss function $\ell$ is defined on the interval $[a, b]$.

$$P^m\big[\exists h \in \mathcal{H}: R(h) - R_{\mathrm{emp}}(h) > \varepsilon\big] \;\le\; \sum_{i=1}^{|\mathcal{H}|} P^m\big[R(h_i) - R_{\mathrm{emp}}(h_i) > \varepsilon\big] \;\le\; |\mathcal{H}|\,\exp(-2\,m\,\varepsilon^2) = \delta$$
assuming here that the loss function $\ell$ takes its values in the interval $[0, 1]$.

« H finite »
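Hoeffding's inequality is easy to check by simulation for 0-1 losses. A sketch with made-up parameters (sample size, deviation, true risk): the observed deviation probability stays well below the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps, mu, trials = 100, 0.1, 0.3, 100_000   # sample size, deviation, true risk

# Empirical risks of `trials` independent samples of m Bernoulli(mu) losses
emp_risks = rng.binomial(1, mu, size=(trials, m)).mean(axis=1)
observed = np.mean(np.abs(emp_risks - mu) >= eps)
bound = 2 * np.exp(-2 * m * eps**2)           # losses in [0, 1], so (b - a)^2 = 1
print(observed, bound)                        # ~0.03 <= ~0.27
```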
Bounding the true risk with the empirical risk + …
• H finite, realizable case:
$$\forall h \in \mathcal{H},\; \forall \delta \le 1: \quad P^m\!\left[\, R(h) \le R_{\mathrm{emp}}(h) + \frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{m} \,\right] > 1 - \delta$$
• H finite, non-realizable case:
$$\forall h \in \mathcal{H},\; \forall \delta \le 1: \quad P^m\!\left[\, R(h) \le R_{\mathrm{emp}}(h) + \sqrt{\frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{2\,m}} \,\right] > 1 - \delta$$
To sum up: for |H| finite
• Realizable case:
$$\varepsilon = \frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{m} \qquad \text{and} \qquad m \ge \frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{\varepsilon}$$
• Non-realizable case:
$$\varepsilon = \sqrt{\frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{2\,m}} \qquad \text{and} \qquad m \ge \frac{\log|\mathcal{H}| + \log\frac{1}{\delta}}{2\,\varepsilon^2}$$
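Both sample-complexity formulas are directly computable. A sketch, using as |H| the $2^{16}$ Boolean functions of the earlier exercise (my choice of numbers, for illustration only):

```python
import math

def m_realizable(H_size, eps, delta):
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def m_non_realizable(H_size, eps, delta):
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * eps**2))

H = 2**16                              # e.g. all Boolean functions on 4 binary descriptors
print(m_realizable(H, 0.05, 0.01))     # 314 examples
print(m_non_realizable(H, 0.05, 0.01)) # 3140 examples: much more demanding
```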
|H| infinite!!
• Effective dimension of H = the Vapnik-Chervonenkis dimension
  – A combinatorial criterion
  – The size of the largest set of points (in general position) that can be labeled in every possible way by hypotheses drawn from H:
$$d_{VC}(\mathcal{H}) = \max\big\{\, m : \Pi_{\mathcal{H}}(m) = 2^m \,\big\}$$
where $\Pi_{\mathcal{H}}(m)$ is the growth function.
• Bound on the true risk:
$$\forall h \in \mathcal{H},\; \forall \delta \le 1: \quad P^m\!\left[\, R(h) \le R_{\mathrm{emp}}(h) + \sqrt{\frac{8\, d_{VC}(\mathcal{H}) \log\frac{2\,e\,m}{d_{VC}(\mathcal{H})} + 8 \log\frac{4}{\delta}}{m}} \,\right] > 1 - \delta$$
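The VC bound can be evaluated numerically. A sketch with illustrative numbers (zero empirical risk, $d_{VC} = 3$ as for linear separators in the plane):

```python
import math

def vc_bound(emp_risk, m, d_vc, delta):
    """R(h) <= R_emp(h) + sqrt((8 d ln(2em/d) + 8 ln(4/delta)) / m)."""
    slack = math.sqrt((8 * d_vc * math.log(2 * math.e * m / d_vc)
                       + 8 * math.log(4 / delta)) / m)
    return emp_risk + slack

for m in (1_000, 10_000, 100_000):
    print(m, round(vc_bound(0.0, m, 3, 0.05), 3))   # slack shrinks roughly as 1/sqrt(m)
```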
VC dim: illustrations
• $d_{VC}$(linear separators in the plane) = ?
[Figure (a)–(c): small sets of '+' and '-' points separated by a line]
• $d_{VC}$(axis-parallel rectangles) = ?
[Figure (a)–(d): small sets of '+' and '-' points captured by a rectangle]
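Shattering can be checked by brute force on tiny point sets. The sketch below (my own; random search over separators is crude but sufficient at this scale) confirms that 3 points in general position are shattered by lines while these 4 points are not, consistent with $d_{VC} = 3$ for linear separators in the plane:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def realizable(points, labels, tries=20_000):
    """Is there a linear separator sign(w.x + b) producing these labels?
    Random search: enough for tiny 2-d examples like these."""
    for _ in range(tries):
        w, b = rng.normal(size=2), rng.normal()
        if all(np.sign(w @ p + b) == l for p, l in zip(points, labels)):
            return True
    return False

def shattered(points):
    # Every one of the 2^n labelings must be realizable.
    return all(realizable(points, ls) for ls in product([-1, 1], repeat=len(points)))

three = [np.array(p) for p in [(0., 0.), (1., 0.), (0., 1.)]]
four = three + [np.array((1., 1.))]
print(shattered(three), shattered(four))   # True False (the XOR labeling fails)
```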
Lesson
• You cannot guarantee anything about induction
• Even if you assume that the world is stationary and the examples are i.i.d.
• Unless there are (severe) constraints on the hypothesis space

But wait…?
The SuperVision network
Image classification with deep convolutional neural networks:
• 7 hidden "weight" layers
• 650K neurons
• 60M parameters
• 630M connections
• Rectified Linear Units, overlapping pooling, dropout trick
• Randomly extracted 224×224 patches for more data
http://image-net.org/challenges/LSVRC/2012/supervision.pdf
GoogleNet
• A Meccano of neural networks

1×1 convolutions may seem trivial, since they do not reduce the dimension of the input, but their non-linearity makes the detected features more complex and thus lets the network see more complex patterns. Network in Network also introduced networks made entirely of convolutional layers, replacing the classification layers by 1×1 filters (Figure 10).

Figure 10. The Network in Network module [33]

GoogleNet [58] is one of the most widely used architectures (with AlexNet) owing to its performance. Developed by Google and winner of ILSVRC 2014, the model stands out by its complexity (22 layers versus 8 for AlexNet) and its use of inception modules (Figure 11). The inception module (Figure 12) is a configuration that applies several filters of different sizes in parallel. This parallelism and the multiplicity of filters make it possible to learn several feature-extraction logics, from fine details with the 1×1 filters up to larger shapes with the 5×5 filters.

Figure 11. Architecture of the GoogleNet network [58]
Troubling findings
• A paper: C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals (ICLR, May 2017), "Understanding deep learning requires rethinking generalization"
• Extensive experiments on the classification of images:
  – AlexNet (>1,000,000 parameters) + 2 other architectures
  – The CIFAR-10 dataset:
    • 60,000 images categorized in 10 classes (50,000 for training and 10,000 for testing)
    • Images: 32×32 pixels in 3 color channels

Again, on intuitive grounds we expect that in order to make good predictions we need to select a hypothesis class F that is appropriate for the problem at hand. More precisely, we should use some prior knowledge about the nature of the link between the features x and the target y to choose which functions the class F should possess. For instance, if for any reason we know that with high probability the relation between x and y is approximately linear, we had better choose F to contain only such functions f_w(x) = w · x. In the most general setting this relationship is encoded in a complicated and unknown probability distribution P on labeled observations (x, y). In many cases all we know is that the relation between x and y has some smoothness properties.

The set of techniques that data scientists use to adapt the hypothesis class F to a specific problem is known as regularization. Some of these are explicit, in the sense that they constrain estimators f in some way, as we shall describe in section 2. Some are implicit, meaning that it is the dynamics of the algorithm that walks its way through the set F in search of a good f (typically using stochastic gradient descent) that provides the regularization. Some of these regularization techniques actually pertain more to art than to mathematics, as they rely more on experience and intuition than on theorems.

Figure 1: The architecture of AlexNet, which is one of the networks used by the authors in [1]

Deep Learning is a very popular class of machine learning models, roughly inspired by biology, that are particularly well suited for tackling complex, AI-like tasks such as image classification, NLP or automatic translation. Roughly speaking, these models are defined by stacking layers that each combine linear combinations of the input with non-linear activation functions (and perhaps some regularization). We won't enter into defining them in detail here, as many excellent textbooks [3, 4] will do the job. Figure 1 shows the architecture of AlexNet, a deep network used in the experiment [1]. For our purpose, which is a discussion of the issue of generalization and regularization, suffice it to say here that these Deep Learning problems share the following facts:
• The number n of samples available for training these networks is typically much smaller than the number k of parameters w = (w_1, …, w_k) that define the functions f_w ∈ F.¹
• The probability distribution P(x, y) is impossible to describe in any sensible way in practice. For concreteness, think of x as the pixels of an …

¹ The number of parameters k of a Deep Learning network such as AlexNet can be over a hundred million, while being trained on "only" a few million images in ImageNet.
Troubling findings
Experiments
1. Original dataset without modification
• Results?
  – Training accuracy = 100%; Test accuracy = 89%
  – Speed of convergence: ~5,000 steps
Troubling findings
Experiments
1. Original dataset without modification
• Results?
  – Training accuracy = 100%; Test accuracy = 89%
  – Speed of convergence: ~5,000 steps

This is the expected behavior if the capacity of the hypothesis space is limited, i.e. if the system cannot fit any (arbitrary) training data:
$$\forall h \in \mathcal{H},\; \forall \delta \le 1: \quad P^m\!\left[\, R(h) \le R_{\mathrm{emp}}(h) + 2\,\widehat{\mathrm{Rad}}_m(\mathcal{H}) + 3\sqrt{\frac{\ln(2/\delta)}{m}} \,\right] > 1 - \delta$$
Troubling findings
Experiments
1. Original dataset without modification
• Results?
  – Training accuracy = 100%; Test accuracy = 89%
  – Speed of convergence: ~5,000 steps
2. Random labels
  – Training accuracy = 100%!!??; Test accuracy = 9.8%
  – Speed of convergence: similar behavior (~10,000 steps)
!!!
Troubling findings
Experiments
1. Original dataset without modification
• Results?
  – Training accuracy = 100%; Test accuracy = 89%
  – Speed of convergence: ~5,000 steps
2. Random labels
  – Training accuracy = 100%!!??; Test accuracy = 9.8%
  – Speed of convergence: similar behavior (~10,000 steps)
3. Random pixels
  – Training accuracy = 100%!!??; Test accuracy ~10%
  – Speed of convergence: similar behavior (~10,000 steps)

Now, we are in trouble!!
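The qualitative phenomenon is easy to reproduce with any learner of unbounded capacity. A toy sketch (a 1-nearest-neighbour memorizer on random data, my own stand-in for the paper's networks): training accuracy is 100% by construction, while test accuracy sits at chance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random "images" and random labels: there is no signal to generalize from.
X_train, y_train = rng.normal(size=(500, 32)), rng.integers(0, 10, 500)
X_test, y_test = rng.normal(size=(200, 32)), rng.integers(0, 10, 200)

def one_nn(x):
    """A pure memorizer: copy the label of the nearest training point."""
    return y_train[np.argmin(np.linalg.norm(X_train - x, axis=1))]

train_acc = np.mean([one_nn(x) == yi for x, yi in zip(X_train, y_train)])
test_acc = np.mean([one_nn(x) == yi for x, yi in zip(X_test, y_test)])
print(train_acc, test_acc)   # 1.0 on the training set, ~0.1 (chance) on the test set
```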
Troubling findings
• Deep NNs can accommodate ANY training set: the Rademacher complexity term $\widehat{\mathrm{Rad}}_m(\mathcal{H})$ can grow without limit!!
$$\forall h \in \mathcal{H},\; \forall \delta \le 1: \quad P^m\!\left[\, R(h) \le R_{\mathrm{emp}}(h) + 2\,\widehat{\mathrm{Rad}}_m(\mathcal{H}) + 3\sqrt{\frac{\ln(2/\delta)}{m}} \,\right] > 1 - \delta$$

But then, why are deep NNs so good on image classification tasks?