
CS 343: Artificial Intelligence

Deep Learning

Prof. Scott Niekum — The University of Texas at Austin
[These slides based on those of Dan Klein, Pieter Abbeel, and Anca Dragan for CS 188 Intro to AI at UC Berkeley. All CS 188 materials are available at http://ai.berkeley.edu.]

Please fill out course evals online!

Review: Linear Classifiers

Feature Vectors

Hello,

Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just

# free : 2   YOUR_NAME : 0   MISSPELLED : 2   FROM_FRIEND : 0   ...   →   SPAM or +

PIXEL-7,12 : 1   PIXEL-7,13 : 0   ...   NUM_LOOPS : 1   ...   →   “2”

Some (Simplified) Biology

▪ Very loose inspiration: human neurons

Linear Classifiers

▪ Inputs are feature values
▪ Each feature has a weight
▪ Sum is the activation

▪ If the activation is:
  ▪ Positive, output +1
  ▪ Negative, output -1

[Diagram: inputs f1, f2, f3 weighted by w1, w2, w3, summed (Σ), then thresholded (>0?)]
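As a concrete illustration, here is a minimal sketch of this linear classifier in Python (the weights and the feature ordering are made up for the example, not taken from the slides):

```python
import numpy as np

def activation(weights, features):
    """Weighted sum of the feature values: sum_i w_i * f_i."""
    return np.dot(weights, features)

def classify(weights, features):
    """Output +1 if the activation is positive, -1 if it is negative."""
    return 1 if activation(weights, features) > 0 else -1

# Features from the spam example above: (# free, YOUR_NAME, MISSPELLED, FROM_FRIEND)
f = np.array([2.0, 0.0, 2.0, 0.0])
w = np.array([1.0, -1.0, 1.0, -3.0])   # hypothetical weights
print(classify(w, f))                  # -> 1, i.e. SPAM
```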

Non-Linearity

Non-Linear Separators

▪ Data that is linearly separable works out great for linear decision rules:

▪ But what are we going to do if the dataset is just too hard?

▪ How about… mapping data to a higher-dimensional space:

[Figure: 1-D data on the x axis that is not linearly separable becomes separable after mapping each point x to (x, x²)]

This and next slide adapted from Ray Mooney, UT

Non-Linear Separators

▪ General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)
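A tiny sketch of this idea, assuming the quadratic map suggested by the figure (x mapped to (x, x²)); the data and the separator below are invented for illustration:

```python
import numpy as np

def phi(x):
    """Map a 1-D point into a higher-dimensional feature space."""
    return np.array([x, x ** 2])

# Points near zero are one class, points far from zero are the other --
# not linearly separable on the line, but separable after the mapping.
xs     = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
labels = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

# In the mapped space the linear rule "x^2 - 1 > 0" separates the classes.
w, b = np.array([0.0, 1.0]), -1.0
preds = np.array([1 if np.dot(w, phi(x)) + b > 0 else -1 for x in xs])
print(np.all(preds == labels))   # -> True
```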

Computer Vision

Object Detection

Manual Feature Design

Features and Generalization

[Dalal and Triggs, 2005]

Features and Generalization

Image / HoG

Manual Feature Design → Deep Learning

▪ Manual feature design requires:
  ▪ Domain-specific expertise
  ▪ Domain-specific effort

▪ What if we could learn the features, too?

▪ Deep Learning

Perceptron

[Diagram: inputs f1, f2, f3 weighted by w1, w2, w3, summed (Σ), then thresholded (>0?)]

Two-Layer Perceptron Network

[Diagram: inputs f1, f2, f3 feed three hidden units, each computing a thresholded weighted sum (weights w11…w33, Σ, >0?); the hidden outputs feed a final unit with weights w1, w2, w3, Σ, >0?]
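A minimal sketch of the forward pass through such a two-layer perceptron network, with hypothetical weights just to make the shapes concrete:

```python
import numpy as np

def step(z):
    """The '>0?' threshold, applied elementwise."""
    return (z > 0).astype(float)

def two_layer_perceptron(f, W_hidden, w_out):
    """f: 3 input features; W_hidden: weights of 3 hidden units; w_out: output weights."""
    hidden = step(W_hidden @ f)                 # each hidden unit thresholds its weighted sum
    return 1 if np.dot(w_out, hidden) > 0 else -1

W_hidden = np.array([[ 1.0, -1.0,  0.5],        # hypothetical weights w11..w33
                     [-0.5,  1.0,  1.0],
                     [ 0.2,  0.3, -1.0]])
w_out = np.array([1.0, 1.0, -2.0])              # hypothetical weights w1..w3
print(two_layer_perceptron(np.array([1.0, 0.0, 1.0]), W_hidden, w_out))   # -> 1
```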

N-Layer Perceptron Network

[Diagram: the same construction repeated for N layers; each layer is a bank of thresholded weighted sums (Σ, >0?) feeding the next]

Performance

[Graph: benchmark performance over successive years, with AlexNet marking the jump from hand-designed features to deep learning. Graph credit Matt Zeiler, Clarifai]

Speech Recognition

[Graph: speech recognition performance over time. Graph credit Matt Zeiler, Clarifai]

N-Layer Perceptron Network

[Diagram repeated from above: N layers of thresholded weighted sums (Σ, >0?)]

Local Search

▪ Simple, general idea:
  ▪ Start wherever
  ▪ Repeat: move to the best neighboring state
  ▪ If no neighbors better than current, quit
  ▪ Neighbors = small perturbations of w (a rough sketch follows below)

▪ Properties
  ▪ Plateaus and local optima

How to escape plateaus and find a good local optimum? How to deal with very large parameter vectors? E.g., …
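A rough sketch of that local-search idea applied directly to a weight vector (the objective below is a toy stand-in for classification accuracy):

```python
import numpy as np

def local_search(objective, w, n_neighbors=20, step=0.1, max_iters=1000, seed=0):
    """Hill climbing: move to the best small perturbation of w; quit when none is better."""
    rng = np.random.default_rng(seed)
    best = objective(w)
    for _ in range(max_iters):
        neighbors = [w + step * rng.standard_normal(w.shape) for _ in range(n_neighbors)]
        scores = [objective(n) for n in neighbors]
        if max(scores) <= best:      # plateau or local optimum: no neighbor improves
            return w
        best, w = max(scores), neighbors[int(np.argmax(scores))]
    return w

# Toy objective with a single maximum at w = (1, 2)
w_star = local_search(lambda w: -np.sum((w - np.array([1.0, 2.0])) ** 2), np.zeros(2))
print(np.round(w_star, 1))           # ends up near [1. 2.]
```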

Perceptron

[Diagram: perceptron with inputs f1, f2, f3, weights w1, w2, w3, Σ, >0?]

▪ Objective: Classification Accuracy

▪ Issue: many plateaus! How to measure incremental progress toward a correct label?

Soft-Max

▪ Score for y = +1: $w \cdot f(x)$    Score for y = -1: $-w \cdot f(x)$

▪ Probability of label: $P(y = +1 \mid f(x); w) = \dfrac{e^{w \cdot f(x)}}{e^{w \cdot f(x)} + e^{-w \cdot f(x)}}$

▪ Objective: maximize the likelihood of the training labels, $\max_w \prod_i P(y^{(i)} \mid f(x^{(i)}); w)$

▪ Log: $\max_w \sum_i \log P(y^{(i)} \mid f(x^{(i)}); w)$
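A small sketch of this objective in code, assuming the two-class soft-max with scores w·f(x) and −w·f(x) as written above; the data is invented:

```python
import numpy as np

def prob_positive(w, f):
    """P(y = +1 | f; w) under the two-class soft-max."""
    z = np.dot(w, f)
    return np.exp(z) / (np.exp(z) + np.exp(-z))

def log_likelihood(w, features, labels):
    """Sum of log P(y_i | f_i; w) over the training data; labels are +1 or -1."""
    total = 0.0
    for f, y in zip(features, labels):
        p = prob_positive(w, f)
        total += np.log(p if y == +1 else 1.0 - p)
    return total

features = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
labels   = [+1, -1]
print(log_likelihood(np.array([0.5, -0.5]), features, labels))   # a smooth function of w
```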

Two-Layer Neural Network

[Diagram: same two-layer architecture as before, but the final unit outputs its weighted sum (fed to the soft-max) instead of a hard threshold; the hidden units still threshold (>0?)]

N-Layer Neural Network

[Diagram: the N-layer version: repeated layers of units, each computing a weighted sum (Σ) followed by a threshold (>0?)]

Our Status

▪ Our objective
  ▪ Changes smoothly with changes in w
  ▪ Doesn't suffer from the same plateaus as the perceptron network

▪ Challenge: how to find a good w?
  ▪ Equivalently: minimize the negative log-likelihood

1-D Optimization

▪ Could evaluate $g(w_0 + h)$ and $g(w_0 - h)$
▪ Then step in best direction

▪ Or, evaluate the derivative: $\dfrac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \dfrac{g(w_0 + h) - g(w_0 - h)}{2h}$

▪ Tells which direction to step in
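Both options in a few lines of Python, for a toy 1-D objective g (any differentiable function would do):

```python
def g(w):
    """Toy 1-D objective with its maximum at w = 3."""
    return -(w - 3.0) ** 2

def numerical_derivative(g, w0, h=1e-5):
    """Central-difference estimate of dg/dw at w0."""
    return (g(w0 + h) - g(w0 - h)) / (2 * h)

w0 = 1.0
print(g(w0 + 0.1) > g(w0 - 0.1))      # True: the +w direction looks better
print(numerical_derivative(g, w0))    # ~4.0: positive, so step in the +w direction
```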

2-D Optimization

Source: Thomas Jungblut's Blog

▪ Idea:
  ▪ Start somewhere
  ▪ Repeat: take a step in the steepest descent direction

Steepest Descent

Figure source: Mathworks

What is the Steepest Descent Direction?

▪ Steepest direction = direction of the gradient

Optimization Procedure 1: Gradient Descent

▪ Init: $w$

▪ For i = 1, 2, …:  $w \leftarrow w - \alpha \, \nabla_w g(w)$

▪ $\alpha$: learning rate --- tweaking parameter that needs to be chosen carefully

▪ How? Try multiple choices
▪ Crude rule of thumb: update should change $w$ by about 0.1–1%
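A minimal gradient descent loop following this procedure, on a toy quadratic where the gradient is available in closed form (for the network, the gradient comes from backpropagation, covered below):

```python
import numpy as np

target = np.array([1.0, -2.0])

def grad(w):
    """Gradient of the toy objective g(w) = ||w - target||^2."""
    return 2 * (w - target)

w = np.zeros(2)                  # init
alpha = 0.1                      # learning rate, chosen by trial and error
for i in range(100):
    w = w - alpha * grad(w)      # step against the gradient to minimize g
print(np.round(w, 3))            # -> [ 1. -2.]
```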

[Slide credit: Fei-Fei Li, Andrej Karpathy & Justin Johnson, Lecture 6, 25 Jan 2016]

Suppose the loss function is steep vertically but shallow horizontally:

Q: What is the trajectory along which we converge towards the minimum with gradient descent?
A: Very slow progress along the flat direction, jitter along the steep one.

Optimization Procedure 2: Momentum

▪ Gradient Descent
  ▪ Init: $w$
  ▪ For i = 1, 2, …:  $w \leftarrow w - \alpha \, \nabla_w g(w)$

▪ Momentum
  ▪ Init: $w$, $v = 0$
  ▪ For i = 1, 2, …:  $v \leftarrow \mu v - \alpha \, \nabla_w g(w)$,  $w \leftarrow w + v$

- Physical interpretation as ball rolling down the loss function + friction (mu coefficient).
- mu = usually ~0.5, 0.9, or 0.99 (sometimes annealed over time, e.g. from 0.5 -> 0.99)
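The momentum update on the toy quadratic from the previous sketch; v acts as a velocity that damps the jitter along steep directions:

```python
import numpy as np

target = np.array([1.0, -2.0])

def grad(w):
    """Gradient of the same toy objective g(w) = ||w - target||^2."""
    return 2 * (w - target)

w, v = np.zeros(2), np.zeros(2)      # init
alpha, mu = 0.1, 0.9                 # learning rate and friction coefficient
for i in range(200):
    v = mu * v - alpha * grad(w)     # velocity update
    w = w + v                        # position update
print(np.round(w, 3))                # -> [ 1. -2.]
```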

Suppose the loss function is steep vertically but shallow horizontally:

Q: What is the trajectory along which we converge towards the minimum with momentum?

How do we actually compute the gradient w.r.t. the weights?

Backpropagation!

Backpropagation Learning

15-486/782: Artificial Neural Networks
David S. Touretzky
Fall 2006


LMS / Widrow-Hoff Rule

Works fine for a single layer of trainable weights.

What about multi-layer networks?

[Diagram: linear unit with inputs x_i, weights w_i, summation, output y]

$$\Delta w_i = -\eta \, (y - d) \, x_i$$
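A few lines showing this delta rule in action on one training pattern (a sketch; eta is the learning rate, and the data is invented):

```python
import numpy as np

def lms_update(w, x, d, eta=0.1):
    """One LMS / Widrow-Hoff step: w_i <- w_i - eta * (y - d) * x_i."""
    y = np.dot(w, x)                 # output of the linear unit
    return w - eta * (y - d) * x

w = np.zeros(2)
x, d = np.array([1.0, 2.0]), 1.0     # one input pattern and its desired output
for _ in range(50):
    w = lms_update(w, x, d)
print(np.dot(w, x))                  # -> ~1.0: the unit now produces the desired output
```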


With Linear Units, Multiple Layers Don't Add Anything

[Diagram: input x (4-vector) → V (3×4 matrix) → U (2×3 matrix) → output y]

y = U × (V × x) = (U × V) × x, where U × V is a 2×4 matrix

Linear operators are closed under composition: equivalent to a single layer of weights W = U × V.

But with non-linear units, extra layers add computational power.
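A quick numerical check of this claim (shapes follow the slide: U is 2×3, V is 3×4):

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.normal(size=(2, 3))
V = rng.normal(size=(3, 4))
x = rng.normal(size=4)

# Two linear layers applied in sequence equal one layer with W = U x V (a 2x4 matrix).
print(np.allclose(U @ (V @ x), (U @ V) @ x))   # -> True
```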


What Can Be Done with Non-Linear (e.g., Threshold) Units?

1 layer of trainable weights: separating hyperplane


2 layers of trainable weights: convex polygon region


3 layers of trainable weights: composition of polygons (convex and non-convex regions)


How Do We Train A Multi-Layer Network?

[Diagram: two-layer network; at the output unit, Error = d − y, at the hidden units, Error = ???]

Can't use the perceptron training algorithm because we don't know the 'correct' outputs for hidden units.


How Do We Train A Multi-Layer Network?

Define sum-squared error:

$$E = \frac{1}{2} \sum_p (d_p - y_p)^2$$

Use gradient descent error minimization:

$$\Delta w_{ij} = -\eta \, \frac{\partial E}{\partial w_{ij}}$$

Works if the nonlinear transfer function is differentiable.


Deriving the LMS or “Delta” Rule As Gradient Descent Learning

$$y = \sum_i w_i x_i \qquad E = \frac{1}{2} \sum_p (d_p - y_p)^2 \qquad \frac{dE}{dy} = y - d$$

$$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{\partial y}{\partial w_i} = (y - d) \, x_i$$

$$\Delta w_i = -\eta \, \frac{\partial E}{\partial w_i} = -\eta \, (y - d) \, x_i$$

[Diagram: linear unit with inputs x_i, weights w_i, output y]

How do we extend this to two layers?


Switch to Smooth Nonlinear Units

$$\mathrm{net}_j = \sum_i w_{ij} \, y_i \qquad y_j = g(\mathrm{net}_j)$$

Common choices for g:

$$g(x) = \frac{1}{1 + e^{-x}}, \qquad g'(x) = g(x)\,\bigl(1 - g(x)\bigr)$$

$$g(x) = \tanh(x), \qquad g'(x) = 1/\cosh^2(x)$$

g must be differentiable.
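The two transfer functions and their derivatives in code, checked against a central-difference estimate (a quick sanity check, not part of the original slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    return sigmoid(x) * (1.0 - sigmoid(x))     # g'(x) = g(x) (1 - g(x))

def tanh_deriv(x):
    return 1.0 / np.cosh(x) ** 2               # g'(x) = 1 / cosh^2(x)

x, h = 0.7, 1e-5
print(np.isclose(sigmoid_deriv(x), (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)))   # True
print(np.isclose(tanh_deriv(x),    (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)))   # True
```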


Gradient Descent with Nonlinear Units

$$y = g(\mathrm{net}) = \tanh\Bigl(\sum_i w_i x_i\Bigr)$$

$$\frac{dE}{dy} = y - d, \qquad \frac{dy}{d\,\mathrm{net}} = 1/\cosh^2(\mathrm{net}), \qquad \frac{\partial\,\mathrm{net}}{\partial w_i} = x_i$$

$$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{dy}{d\,\mathrm{net}} \cdot \frac{\partial\,\mathrm{net}}{\partial w_i} = \frac{y - d}{\cosh^2\bigl(\sum_i w_i x_i\bigr)} \cdot x_i$$

[Diagram: unit computing tanh(Σ_i w_i x_i) from inputs x_i with weights w_i, output y]

Now We Can Use The Chain Rule

[Diagram: layers y_i → (weights w_ij) → y_j → (weights w_jk) → y_k]

$$\frac{\partial E}{\partial y_k} = y_k - d_k$$

$$\delta_k = \frac{\partial E}{\partial \mathrm{net}_k} = (y_k - d_k) \cdot g'(\mathrm{net}_k)$$

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial w_{jk}} = \frac{\partial E}{\partial \mathrm{net}_k} \cdot y_j$$

$$\frac{\partial E}{\partial y_j} = \sum_k \frac{\partial E}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial y_j}$$

$$\delta_j = \frac{\partial E}{\partial \mathrm{net}_j} = \frac{\partial E}{\partial y_j} \cdot g'(\mathrm{net}_j)$$

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial \mathrm{net}_j} \cdot y_i$$


Weight Updates

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial w_{jk}} = \delta_k \cdot y_j \qquad\qquad \frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial \mathrm{net}_j} \cdot \frac{\partial \mathrm{net}_j}{\partial w_{ij}} = \delta_j \cdot y_i$$

$$\Delta w_{jk} = -\eta \cdot \frac{\partial E}{\partial w_{jk}} \qquad\qquad \Delta w_{ij} = -\eta \cdot \frac{\partial E}{\partial w_{ij}}$$
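Putting the pieces together, a compact sketch of backpropagation for a two-layer network with tanh units; the layer sizes, data, and learning rate are arbitrary choices for illustration:

```python
import numpy as np

def g(x):        return np.tanh(x)
def g_prime(x):  return 1.0 / np.cosh(x) ** 2

def backprop_step(x, d, W1, W2, eta=0.1):
    """One forward/backward pass plus weight update; returns the current error E."""
    # Forward pass
    net_j = W1 @ x;   y_j = g(net_j)            # hidden layer
    net_k = W2 @ y_j; y_k = g(net_k)            # output layer
    # Backward pass: deltas from the chain rule
    delta_k = (y_k - d) * g_prime(net_k)        # output units
    delta_j = (W2.T @ delta_k) * g_prime(net_j) # hidden units
    # Weight updates: delta_w = -eta * dE/dw
    W2 -= eta * np.outer(delta_k, y_j)
    W1 -= eta * np.outer(delta_j, x)
    return 0.5 * np.sum((d - y_k) ** 2)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, d = np.array([0.5, -1.0]), np.array([0.8])
for step in range(200):
    E = backprop_step(x, d, W1, W2)
print(round(E, 6))                              # error shrinks toward zero
```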


Deep learning is everywhere

[Krizhevsky 2012]

Classification / Retrieval


[Faster R-CNN: Ren, He, Girshick, Sun 2015]

Detection / Segmentation

[Farabet et al., 2012]

Deep learning is everywhere


NVIDIA Tegra X1

self-driving cars

Deep learning is everywhere


[Toshev, Szegedy 2014]

[Mnih 2013]

Deep learning is everywhere


[Ciresan et al. 2013] [Sermanet et al. 2011] [Ciresan et al.]

Deep learning is everywhere

[Vinyals et al., 2015]

Image Captioning
