
Page 1

CS 343: Artificial Intelligence

Deep Learning

Prof. Scott Niekum — The University of Texas at Austin
[These slides based on those of Dan Klein, Pieter Abbeel, Anca Dragan for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Page 2

Please fill out course evals online!

Page 3

Review: Linear Classifiers

Page 4

Feature Vectors

Hello,

Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just

  # free      : 2
  YOUR_NAME   : 0
  MISSPELLED  : 2
  FROM_FRIEND : 0
  ...
→ SPAM or +

  PIXEL-7,12  : 1
  PIXEL-7,13  : 0
  ...
  NUM_LOOPS   : 1
  ...
→ “2”

Page 5

Some (Simplified) Biology

▪ Very loose inspiration: human neurons

Page 6

Linear Classifiers

▪ Inputs are feature values
▪ Each feature has a weight
▪ Sum is the activation

▪ If the activation is:
  ▪ Positive, output +1
  ▪ Negative, output -1

[Diagram: feature values f1, f2, f3, weighted by w1, w2, w3, summed (Σ), then thresholded (>0?)]
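A minimal sketch of this decision rule in Python (the weights and feature values below are made up, loosely following the spam example above):

import numpy as np

def classify(f, w):
    activation = np.dot(w, f)          # weighted sum of feature values
    return +1 if activation > 0 else -1

# hypothetical features [free, YOUR_NAME, MISSPELLED, FROM_FRIEND] and weights
f = np.array([2.0, 0.0, 2.0, 0.0])
w = np.array([4.0, -1.0, 1.0, -3.0])
print(classify(f, w))                  # +1, i.e. SPAM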

Page 7

Non-Linearity

Page 8

Non-Linear Separators

▪ Data that is linearly separable works out great for linear decision rules:

▪ But what are we going to do if the dataset is just too hard?

▪ How about… mapping data to a higher-dimensional space:

[Figure: 1-D data along the x axis that is not linearly separable becomes separable when mapped to the (x, x²) plane]

This and next slide adapted from Ray Mooney, UT

Page 9

Non-Linear Separators

▪ General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)
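For instance, a minimal sketch (toy 1-D data, not from the slides) of the quadratic mapping pictured on the previous slide:

import numpy as np

# class +1 near the origin, class -1 far from it: not linearly separable in x alone
x = np.array([-3.0, -2.0, -0.5, 0.5, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, -1, -1])

# map x -> (x, x^2); in the new space the classes are split by the line x2 = 1
phi = np.stack([x, x**2], axis=1)
pred = np.where(phi[:, 1] < 1.0, 1, -1)
print((pred == y).all())   # True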

Page 10

Computer Vision

Page 11

Object Detection

Page 12

Manual Feature Design

Page 13

Features and Generalization

[DalalandTriggs,2005]

Page 14

Features and Generalization

Image HoG

Page 15

Manual Feature Design → Deep Learning

▪ Manual feature design requires:

  ▪ Domain-specific expertise

  ▪ Domain-specific effort

▪ What if we could learn the features, too?

▪ Deep Learning

Page 16

Perceptron

[Diagram: feature values f1, f2, f3, weighted by w1, w2, w3, summed (Σ), then thresholded (>0?)]

Page 17

Two-Layer Perceptron Network

[Diagram: features f1, f2, f3 feed three perceptron units (weights w11…w33, each a Σ followed by >0?); their outputs feed a final perceptron unit (weights w1, w2, w3, Σ, then >0?)]

Page 18

N-Layer Perceptron Network

[Diagram: features f1, f2, f3 feed repeated layers of perceptron units (each a Σ followed by >0?), ending in a single thresholded output unit]

Page 19

Performance

graph credit Matt Zeiler, Clarifai

Page 20

Performance

graph credit Matt Zeiler, Clarifai

Page 21

Performance

graph credit Matt Zeiler, Clarifai

AlexNet

Page 22

Performance

graph credit Matt Zeiler, Clarifai

AlexNet

Page 23

Performance

graph credit Matt Zeiler, Clarifai

AlexNet

Page 24

Speech Recognition

graph credit Matt Zeiler, Clarifai

Page 25

N-Layer Perceptron Network

[Diagram: features f1, f2, f3 feed repeated layers of perceptron units (each a Σ followed by >0?), ending in a single thresholded output unit]

Page 26

Local Search

▪ Simple, general idea:
  ▪ Start wherever
  ▪ Repeat: move to the best neighboring state
  ▪ If no neighbors better than current, quit
  ▪ Neighbors = small perturbations of w

▪ Properties
  ▪ Plateaus and local optima

How to escape plateaus and find a good local optimum? How to deal with very large parameter vectors? E.g.,
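A minimal sketch of this idea, with random perturbations of w as the neighbors (the objective g and step size are placeholders):

import numpy as np

def local_search(g, w, step=0.1, n_neighbors=20, max_iters=1000):
    # repeatedly move to the best neighboring weight vector; stop at a local optimum
    rng = np.random.default_rng(0)
    for _ in range(max_iters):
        neighbors = w + step * rng.standard_normal((n_neighbors, w.size))
        best = max(neighbors, key=g)
        if g(best) <= g(w):        # no neighbor is better than the current state
            return w
        w = best
    return w

# toy objective with its maximum at w = [3, 3]
g = lambda w: -np.sum((w - 3.0) ** 2)
print(local_search(g, np.zeros(2)))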

Page 27

Perceptron

[Diagram: feature values f1, f2, f3, weighted by w1, w2, w3, summed (Σ), then thresholded (>0?)]

▪ Objective: Classification Accuracy

▪ Issue: many plateaus! How to measure incremental progress toward a correct label?

Page 28

Soft-Max

▪ Score for y = 1:    Score for y = -1:

▪ Probability of label:

▪ Objective:

▪ Log:
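The formulas on this slide did not survive extraction. A sketch of the standard two-class soft-max, assuming the scores are the activation w·f(x) for y = +1 and its negation for y = -1:

\[
P(y = +1 \mid x; w) = \frac{e^{w \cdot f(x)}}{e^{w \cdot f(x)} + e^{-w \cdot f(x)}},
\qquad
P(y = -1 \mid x; w) = \frac{e^{-w \cdot f(x)}}{e^{w \cdot f(x)} + e^{-w \cdot f(x)}}
\]

\[
\max_{w} \prod_i P\big(y^{(i)} \mid x^{(i)}; w\big)
\quad \Longleftrightarrow \quad
\max_{w} \sum_i \log P\big(y^{(i)} \mid x^{(i)}; w\big)
\]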

Page 29

Two-Layer Neural Network

[Diagram: features f1, f2, f3 feed three hidden units (weights w11…w33, each a Σ followed by >0?); their outputs feed a final output unit (weights w1, w2, w3, Σ)]

Page 30

N-Layer Neural Network

[Diagram: features f1, f2, f3 feed repeated layers of units (each a Σ followed by >0?), ending in a single output unit]

Page 31

Our Status

▪ Our objective
  ▪ Changes smoothly with changes in w
  ▪ Doesn't suffer from the same plateaus as the perceptron network

▪ Challenge: how to find a good w?

▪ Equivalently:

Page 32

1-d optimization

▪ Could evaluate the objective on either side of the current point
  ▪ Then step in the best direction

▪ Or, evaluate the derivative:

▪ Tells which direction to step in
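A minimal sketch of both options for a 1-D objective g (the toy objective and step size h are made up):

def g(w):
    return -(w - 2.0) ** 2             # toy objective, maximized at w = 2

w, h = 0.0, 1e-4

# Option 1: evaluate g on either side, then step toward the larger value
left, right = g(w - h), g(w + h)
direction = 1.0 if right > left else -1.0

# Option 2: estimate the derivative; its sign gives the uphill direction
deriv = (g(w + h) - g(w - h)) / (2 * h)
print(direction, deriv)                # both point toward w = 2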

Page 33

2-D Optimization

Source: Thomas Jungblut’s Blog

Page 34

Steepest Descent

▪ Idea:

  ▪ Start somewhere
  ▪ Repeat: take a step in the steepest descent direction

Figure source: Mathworks

Page 35

What is the Steepest Descent Direction?

Page 36

What is the Steepest Descent Direction?

▪ Steepest direction = direction of the gradient

Page 37

Optimization Procedure 1: Gradient Descent

▪ Init:

▪ For i = 1, 2, …

▪ Learning rate: a tweaking parameter that needs to be chosen carefully

  ▪ How? Try multiple choices
  ▪ Crude rule of thumb: update changes w about 0.1–1%
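A minimal sketch of the update loop (the toy objective, its gradient, and the learning rate are placeholders):

import numpy as np

def gradient_descent(grad_g, w0, learning_rate=0.1, iters=100):
    # init at w0; for i = 1, 2, ... take a step against the gradient
    w = w0.copy()
    for _ in range(iters):
        w -= learning_rate * grad_g(w)
    return w

# toy objective g(w) = ||w - 3||^2, so grad g(w) = 2 (w - 3); minimum at [3, 3]
grad_g = lambda w: 2 * (w - 3.0)
print(gradient_descent(grad_g, np.zeros(2)))   # ~[3, 3]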

Page 38

Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6, 25 Jan 2016 (slide 38)

Suppose loss function is steep vertically but shallow horizontally:

Q: What is the trajectory along which we converge towards the minimum with Gradient Descent?

Page 39

Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6, 25 Jan 2016 (slide 39)

Suppose loss function is steep vertically but shallow horizontally:

Q: What is the trajectory along which we converge towards the minimum with Gradient Descent?

Page 40

Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6, 25 Jan 2016 (slide 40)

Suppose loss function is steep vertically but shallow horizontally:

Q: What is the trajectory along which we converge towards the minimum with Gradient Descent? very slow progress along flat direction, jitter along steep one

Page 41

Optimization Procedure 2: Momentum

▪ Gradient Descent

  ▪ Init:
  ▪ For i = 1, 2, …

▪ Momentum

  ▪ Init:
  ▪ For i = 1, 2, …

- Physical interpretation as ball rolling down the loss function + friction (mu coefficient).
- mu = usually ~0.5, 0.9, or 0.99 (sometimes annealed over time, e.g. from 0.5 -> 0.99)
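The update equations are missing from the extraction; a minimal sketch of the usual momentum rule (velocity v, friction coefficient mu, placeholder learning rate):

import numpy as np

def momentum_descent(grad_g, w0, learning_rate=0.1, mu=0.9, iters=200):
    # v accumulates a decaying sum of past gradients: a ball rolling downhill with friction
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(iters):
        v = mu * v - learning_rate * grad_g(w)
        w += v
    return w

grad_g = lambda w: 2 * (w - 3.0)               # same toy objective as above
print(momentum_descent(grad_g, np.zeros(2)))   # ~[3, 3]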

Page 42

Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6, 25 Jan 2016 (slide 42)

Suppose loss function is steep vertically but shallow horizontally:

Q: What is the trajectory along which we converge towards the minimum with Momentum?

Page 43

How do we actually compute gradient w.r.t. weights?

Backpropagation!

Page 44

1

Backpropagation Learning

15-486/782: Artificial Neural Networks
David S. Touretzky

Fall 2006

Page 45

2

LMS / Widrow-Hoff Rule

Works fine for a single layer of trainable weights.

What about multi-layer networks?

[Diagram: inputs x_i, weights w_i, summation unit Σ, output y]

$\Delta w_i = -\eta\,(y - d)\,x_i$
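A minimal numeric sketch of one such update (the data and the learning rate eta are made up):

import numpy as np

x = np.array([1.0, 0.5, -1.0])    # inputs x_i
w = np.array([0.2, -0.4, 0.1])    # current weights w_i
d, eta = 1.0, 0.1                 # desired output, learning rate

y = np.dot(w, x)                  # linear unit output
w += -eta * (y - d) * x           # Delta w_i = -eta (y - d) x_i
print(y, w)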

Page 46

3

With Linear Units, Multiple Layers Don't Add Anything

U: 2×3 matrix
V: 3×4 matrix

y = U (V x) = (U V) x, where U V is a single 2×4 matrix

Linear operators are closed under composition. Equivalent to a single layer of weights W = U V.

But with non-linear units, extra layers add computational power.
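A quick numeric check of this point (random matrices, just for illustration):

import numpy as np

rng = np.random.default_rng(0)
U = rng.standard_normal((2, 3))
V = rng.standard_normal((3, 4))
x = rng.standard_normal(4)

# two linear layers collapse into one: U(Vx) == (UV)x
print(np.allclose(U @ (V @ x), (U @ V) @ x))          # True

# with a non-linearity in between, they do not (in general)
print(np.allclose(U @ np.tanh(V @ x), (U @ V) @ x))   # False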

Page 47

4

What Can be Done with Non-Linear (e.g., Threshold) Units?

1 layer of trainable weights

separating hyperplane

Page 48

5

2 layers of trainable weights

convex polygon region

Page 49

6

3 layers of trainable weights

composition of polygons: non-convex regions

Page 50

7

How Do We Train A Multi-Layer Network?

Error at the output layer: d − y
Error at the hidden layer: ???

Can't use perceptron training algorithm because we don't know the 'correct' outputs for hidden units.

Page 51

8

How Do We Train A Multi-Layer Network?

Define sum-squared error:

$E = \frac{1}{2} \sum_p (d_p - y_p)^2$

Use gradient descent error minimization:

$\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}}$

Works if the nonlinear transfer function is differentiable.

Page 52

9

Deriving the LMS or “Delta” Rule As Gradient Descent Learning

$y = \sum_i w_i x_i$

$E = \frac{1}{2} \sum_p (d_p - y_p)^2
\qquad
\frac{dE}{dy} = y - d$

$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{\partial y}{\partial w_i} = (y - d)\, x_i$

$\Delta w_i = -\eta\, \frac{\partial E}{\partial w_i} = -\eta\,(y - d)\, x_i$

[Diagram: inputs x_i, weights w_i, output y]

How do we extend this to two layers?

Page 53

10

Switch to Smooth Nonlinear Units

$\mathrm{net}_j = \sum_i w_{ij}\, y_i
\qquad
y_j = g(\mathrm{net}_j)$

Common choices for g:

$g(x) = \frac{1}{1 + e^{-x}}, \qquad g'(x) = g(x)\,\big(1 - g(x)\big)$

$g(x) = \tanh(x), \qquad g'(x) = 1/\cosh^2(x)$

g must be differentiable
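A small sketch of these two choices and their derivatives:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    g = sigmoid(x)
    return g * (1.0 - g)             # g'(x) = g(x)(1 - g(x))

def tanh_prime(x):
    return 1.0 / np.cosh(x) ** 2     # g'(x) = 1 / cosh^2(x)

x = np.linspace(-2.0, 2.0, 5)
print(sigmoid(x), sigmoid_prime(x))
print(np.tanh(x), tanh_prime(x))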

Page 54

11

Gradient Descent with Nonlinear Units

$y = g(\mathrm{net}) = \tanh\!\Big(\sum_i w_i x_i\Big)$

$\frac{dE}{dy} = y - d,
\qquad
\frac{dy}{d\,\mathrm{net}} = 1/\cosh^2(\mathrm{net}),
\qquad
\frac{\partial\,\mathrm{net}}{\partial w_i} = x_i$

$\frac{\partial E}{\partial w_i}
= \frac{dE}{dy} \cdot \frac{dy}{d\,\mathrm{net}} \cdot \frac{\partial\,\mathrm{net}}{\partial w_i}
= \frac{y - d}{\cosh^2\!\big(\sum_i w_i x_i\big)} \cdot x_i$

[Diagram: inputs x_i, weights w_i, tanh unit, output y]

Page 55

12

Now We Can Use The Chain Rule

[Diagram: layer outputs y_i → weights w_ij → y_j → weights w_jk → y_k]

$\frac{\partial E}{\partial y_k} = y_k - d_k$

$\delta_k = \frac{\partial E}{\partial \mathrm{net}_k} = (y_k - d_k)\cdot g'(\mathrm{net}_k)$

$\frac{\partial E}{\partial w_{jk}}
= \frac{\partial E}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial w_{jk}}
= \frac{\partial E}{\partial \mathrm{net}_k} \cdot y_j$

$\frac{\partial E}{\partial y_j}
= \sum_k \frac{\partial E}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial y_j}$

$\delta_j = \frac{\partial E}{\partial \mathrm{net}_j}
= \frac{\partial E}{\partial y_j} \cdot g'(\mathrm{net}_j)$

$\frac{\partial E}{\partial w_{ij}}
= \frac{\partial E}{\partial \mathrm{net}_j} \cdot y_i$


Page 56

13

Weight Updates

$\frac{\partial E}{\partial w_{jk}}
= \frac{\partial E}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial w_{jk}}
= \delta_k \cdot y_j$

$\frac{\partial E}{\partial w_{ij}}
= \frac{\partial E}{\partial \mathrm{net}_j} \cdot \frac{\partial \mathrm{net}_j}{\partial w_{ij}}
= \delta_j \cdot y_i$

$\Delta w_{jk} = -\eta \cdot \frac{\partial E}{\partial w_{jk}}
\qquad
\Delta w_{ij} = -\eta \cdot \frac{\partial E}{\partial w_{ij}}$
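Putting the pieces together, a minimal sketch of these updates for one training example in a two-layer tanh network (shapes, data, and learning rate are made up; 1/cosh² is the tanh derivative g' used above):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)            # input activities y_i
d = np.array([1.0, -1.0])             # desired outputs d_k
W1 = rng.standard_normal((3, 4))      # weights w_ij (input -> hidden)
W2 = rng.standard_normal((2, 3))      # weights w_jk (hidden -> output)
eta = 0.1

# forward pass
net_j = W1 @ x;   y_j = np.tanh(net_j)
net_k = W2 @ y_j; y_k = np.tanh(net_k)

# backward pass: deltas from the chain rule
delta_k = (y_k - d) / np.cosh(net_k) ** 2          # delta_k = (y_k - d_k) g'(net_k)
delta_j = (W2.T @ delta_k) / np.cosh(net_j) ** 2   # delta_j = (sum_k delta_k w_jk) g'(net_j)

# weight updates: Delta w = -eta * dE/dw
W2 -= eta * np.outer(delta_k, y_j)                 # dE/dw_jk = delta_k * y_j
W1 -= eta * np.outer(delta_j, x)                   # dE/dw_ij = delta_j * y_i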

Page 57

Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6, 25 Jan 2016 (slide 57)

Deep learning is everywhere

[Krizhevsky 2012]

Classification Retrieval

Page 58

Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6, 25 Jan 2016 (slide 58)

[Faster R-CNN: Ren, He, Girshick, Sun 2015]

Detection Segmentation

[Farabet et al., 2012]

Deep learning is everywhere

Page 59

Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6, 25 Jan 2016 (slide 59)

NVIDIA Tegra X1

self-driving cars

Deep learning is everywhere

Page 60

Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6, 25 Jan 2016 (slide 60)

[Toshev, Szegedy 2014]

[Mnih 2013]

Deep learning is everywhere

Page 61

Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 6, 25 Jan 2016 (slide 61)

[Ciresan et al. 2013] [Sermanet et al. 2011] [Ciresan et al.]

Deep learning is everywhere

Page 62

[Vinyals et al., 2015]

Image Captioning