
CS 343: Artificial Intelligence

Deep Learning

Prof. Scott Niekum — The University of Texas at Austin
[These slides based on those of Dan Klein, Pieter Abbeel, and Anca Dragan for CS 188 Intro to AI at UC Berkeley. All CS 188 materials are available at http://ai.berkeley.edu.]

Please fill out course evals online!

Review: Linear Classifiers

Feature Vectors

Hello,

Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just

# free : 2   YOUR_NAME : 0   MISSPELLED : 2   FROM_FRIEND : 0   ...   →   SPAM or +

PIXEL-7,12 : 1   PIXEL-7,13 : 0   ...   NUM_LOOPS : 1   ...   →   “2”

Some (Simplified) Biology

▪ Very loose inspiration: human neurons

Linear Classifiers

▪ Inputs are feature values
▪ Each feature has a weight
▪ Sum is the activation

▪ If the activation is:
  ▪ Positive, output +1
  ▪ Negative, output -1

[Diagram: inputs f1, f2, f3 weighted by w1, w2, w3, summed (Σ), then thresholded (>0?)]
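As a concrete illustration, here is a minimal sketch of this linear classifier in Python (the weights and the feature ordering are made up for the example, not taken from the slides):

```python
import numpy as np

def activation(weights, features):
    """Weighted sum of the feature values: sum_i w_i * f_i."""
    return np.dot(weights, features)

def classify(weights, features):
    """Output +1 if the activation is positive, -1 if it is negative."""
    return 1 if activation(weights, features) > 0 else -1

# Features from the spam example above: (# free, YOUR_NAME, MISSPELLED, FROM_FRIEND)
f = np.array([2.0, 0.0, 2.0, 0.0])
w = np.array([1.0, -1.0, 1.0, -3.0])   # hypothetical weights
print(classify(w, f))                  # -> 1, i.e. SPAM
```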

Non-Linearity

Non-Linear Separators

▪ Data that is linearly separable works out great for linear decision rules:

▪ But what are we going to do if the dataset is just too hard?

▪ How about… mapping data to a higher-dimensional space:

[Figure: 1-D data on the x axis that is not linearly separable becomes separable after mapping each point x to (x, x²)]

This and next slide adapted from Ray Mooney, UT

Non-Linear Separators

▪ General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)
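A tiny sketch of this idea, assuming the quadratic map suggested by the figure (x mapped to (x, x²)); the data and the separator below are invented for illustration:

```python
import numpy as np

def phi(x):
    """Map a 1-D point into a higher-dimensional feature space."""
    return np.array([x, x ** 2])

# Points near zero are one class, points far from zero are the other --
# not linearly separable on the line, but separable after the mapping.
xs     = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
labels = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

# In the mapped space the linear rule "x^2 - 1 > 0" separates the classes.
w, b = np.array([0.0, 1.0]), -1.0
preds = np.array([1 if np.dot(w, phi(x)) + b > 0 else -1 for x in xs])
print(np.all(preds == labels))   # -> True
```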

Computer Vision

Object Detection

Manual Feature Design

Features and Generalization

[Dalal and Triggs, 2005]

Features and Generalization

Image / HoG

Manual Feature Design → Deep Learning

▪ Manual feature design requires:
  ▪ Domain-specific expertise
  ▪ Domain-specific effort

▪ What if we could learn the features, too?

▪ Deep Learning

Perceptron

[Diagram: inputs f1, f2, f3 weighted by w1, w2, w3, summed (Σ), then thresholded (>0?)]

Two-Layer Perceptron Network

[Diagram: inputs f1, f2, f3 feed three hidden units, each computing a thresholded weighted sum (weights w11…w33, Σ, >0?); the hidden outputs feed a final unit with weights w1, w2, w3, Σ, >0?]
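A minimal sketch of the forward pass through such a two-layer perceptron network, with hypothetical weights just to make the shapes concrete:

```python
import numpy as np

def step(z):
    """The '>0?' threshold, applied elementwise."""
    return (z > 0).astype(float)

def two_layer_perceptron(f, W_hidden, w_out):
    """f: 3 input features; W_hidden: weights of 3 hidden units; w_out: output weights."""
    hidden = step(W_hidden @ f)                 # each hidden unit thresholds its weighted sum
    return 1 if np.dot(w_out, hidden) > 0 else -1

W_hidden = np.array([[ 1.0, -1.0,  0.5],        # hypothetical weights w11..w33
                     [-0.5,  1.0,  1.0],
                     [ 0.2,  0.3, -1.0]])
w_out = np.array([1.0, 1.0, -2.0])              # hypothetical weights w1..w3
print(two_layer_perceptron(np.array([1.0, 0.0, 1.0]), W_hidden, w_out))   # -> 1
```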

N-Layer Perceptron Network

[Diagram: the same construction repeated for N layers; each layer is a bank of thresholded weighted sums (Σ, >0?) feeding the next]

Performance

[Graph: benchmark performance over successive years, with AlexNet marking the jump from hand-designed features to deep learning. Graph credit Matt Zeiler, Clarifai]

Speech Recognition

[Graph: speech recognition performance over time. Graph credit Matt Zeiler, Clarifai]

N-Layer Perceptron Network

[Diagram repeated from above: N layers of thresholded weighted sums (Σ, >0?)]

Local Search

▪ Simple, general idea:
  ▪ Start wherever
  ▪ Repeat: move to the best neighboring state
  ▪ If no neighbors better than current, quit
  ▪ Neighbors = small perturbations of w (a rough sketch follows below)

▪ Properties
  ▪ Plateaus and local optima

How to escape plateaus and find a good local optimum? How to deal with very large parameter vectors? E.g., …
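A rough sketch of that local-search idea applied directly to a weight vector (the objective below is a toy stand-in for classification accuracy):

```python
import numpy as np

def local_search(objective, w, n_neighbors=20, step=0.1, max_iters=1000, seed=0):
    """Hill climbing: move to the best small perturbation of w; quit when none is better."""
    rng = np.random.default_rng(seed)
    best = objective(w)
    for _ in range(max_iters):
        neighbors = [w + step * rng.standard_normal(w.shape) for _ in range(n_neighbors)]
        scores = [objective(n) for n in neighbors]
        if max(scores) <= best:      # plateau or local optimum: no neighbor improves
            return w
        best, w = max(scores), neighbors[int(np.argmax(scores))]
    return w

# Toy objective with a single maximum at w = (1, 2)
w_star = local_search(lambda w: -np.sum((w - np.array([1.0, 2.0])) ** 2), np.zeros(2))
print(np.round(w_star, 1))           # ends up near [1. 2.]
```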

Perceptron

[Diagram: perceptron with inputs f1, f2, f3, weights w1, w2, w3, Σ, >0?]

▪ Objective: Classification Accuracy

▪ Issue: many plateaus! How to measure incremental progress toward a correct label?

Soft-Max

▪ Score for y = +1: $w \cdot f(x)$    Score for y = -1: $-w \cdot f(x)$

▪ Probability of label: $P(y = +1 \mid f(x); w) = \dfrac{e^{w \cdot f(x)}}{e^{w \cdot f(x)} + e^{-w \cdot f(x)}}$

▪ Objective: maximize the likelihood of the training labels, $\max_w \prod_i P(y^{(i)} \mid f(x^{(i)}); w)$

▪ Log: $\max_w \sum_i \log P(y^{(i)} \mid f(x^{(i)}); w)$
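A small sketch of this objective in code, assuming the two-class soft-max with scores w·f(x) and −w·f(x) as written above; the data is invented:

```python
import numpy as np

def prob_positive(w, f):
    """P(y = +1 | f; w) under the two-class soft-max."""
    z = np.dot(w, f)
    return np.exp(z) / (np.exp(z) + np.exp(-z))

def log_likelihood(w, features, labels):
    """Sum of log P(y_i | f_i; w) over the training data; labels are +1 or -1."""
    total = 0.0
    for f, y in zip(features, labels):
        p = prob_positive(w, f)
        total += np.log(p if y == +1 else 1.0 - p)
    return total

features = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
labels   = [+1, -1]
print(log_likelihood(np.array([0.5, -0.5]), features, labels))   # a smooth function of w
```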

Two-Layer Neural Network

[Diagram: same two-layer architecture as before, but the final unit outputs its weighted sum (fed to the soft-max) instead of a hard threshold; the hidden units still threshold (>0?)]

N-Layer Neural Network

[Diagram: the N-layer version: repeated layers of units, each computing a weighted sum (Σ) followed by a threshold (>0?)]

Our Status

▪ Our objective
  ▪ Changes smoothly with changes in w
  ▪ Doesn't suffer from the same plateaus as the perceptron network

▪ Challenge: how to find a good w?
  ▪ Equivalently: minimize the negative log-likelihood

1-D Optimization

▪ Could evaluate $g(w_0 + h)$ and $g(w_0 - h)$
▪ Then step in best direction

▪ Or, evaluate the derivative: $\dfrac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \dfrac{g(w_0 + h) - g(w_0 - h)}{2h}$

▪ Tells which direction to step in
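Both options in a few lines of Python, for a toy 1-D objective g (any differentiable function would do):

```python
def g(w):
    """Toy 1-D objective with its maximum at w = 3."""
    return -(w - 3.0) ** 2

def numerical_derivative(g, w0, h=1e-5):
    """Central-difference estimate of dg/dw at w0."""
    return (g(w0 + h) - g(w0 - h)) / (2 * h)

w0 = 1.0
print(g(w0 + 0.1) > g(w0 - 0.1))      # True: the +w direction looks better
print(numerical_derivative(g, w0))    # ~4.0: positive, so step in the +w direction
```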

2-D Optimization

Source: Thomas Jungblut's Blog

▪ Idea:
  ▪ Start somewhere
  ▪ Repeat: take a step in the steepest descent direction

Steepest Descent

Figure source: Mathworks

What is the Steepest Descent Direction?

▪ Steepest direction = direction of the gradient

Optimization Procedure 1: Gradient Descent

▪ Init: $w$

▪ For i = 1, 2, …:  $w \leftarrow w - \alpha \, \nabla_w g(w)$

▪ $\alpha$: learning rate --- tweaking parameter that needs to be chosen carefully

▪ How? Try multiple choices
▪ Crude rule of thumb: update should change $w$ by about 0.1–1%
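A minimal gradient descent loop following this procedure, on a toy quadratic where the gradient is available in closed form (for the network, the gradient comes from backpropagation, covered below):

```python
import numpy as np

target = np.array([1.0, -2.0])

def grad(w):
    """Gradient of the toy objective g(w) = ||w - target||^2."""
    return 2 * (w - target)

w = np.zeros(2)                  # init
alpha = 0.1                      # learning rate, chosen by trial and error
for i in range(100):
    w = w - alpha * grad(w)      # step against the gradient to minimize g
print(np.round(w, 3))            # -> [ 1. -2.]
```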

[Slide credit: Fei-Fei Li, Andrej Karpathy & Justin Johnson, Lecture 6, 25 Jan 2016]

Suppose the loss function is steep vertically but shallow horizontally:

Q: What is the trajectory along which we converge towards the minimum with gradient descent?
A: Very slow progress along the flat direction, jitter along the steep one.

Optimization Procedure 2: Momentum

▪ Gradient Descent
  ▪ Init: $w$
  ▪ For i = 1, 2, …:  $w \leftarrow w - \alpha \, \nabla_w g(w)$

▪ Momentum
  ▪ Init: $w$, $v = 0$
  ▪ For i = 1, 2, …:  $v \leftarrow \mu v - \alpha \, \nabla_w g(w)$,  $w \leftarrow w + v$

- Physical interpretation as ball rolling down the loss function + friction (mu coefficient).
- mu = usually ~0.5, 0.9, or 0.99 (sometimes annealed over time, e.g. from 0.5 -> 0.99)
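The momentum update on the toy quadratic from the previous sketch; v acts as a velocity that damps the jitter along steep directions:

```python
import numpy as np

target = np.array([1.0, -2.0])

def grad(w):
    """Gradient of the same toy objective g(w) = ||w - target||^2."""
    return 2 * (w - target)

w, v = np.zeros(2), np.zeros(2)      # init
alpha, mu = 0.1, 0.9                 # learning rate and friction coefficient
for i in range(200):
    v = mu * v - alpha * grad(w)     # velocity update
    w = w + v                        # position update
print(np.round(w, 3))                # -> [ 1. -2.]
```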

Suppose the loss function is steep vertically but shallow horizontally:

Q: What is the trajectory along which we converge towards the minimum with momentum?

How do we actually compute the gradient w.r.t. the weights?

Backpropagation!

Backpropagation Learning

15-486/782: Artificial Neural Networks
David S. Touretzky
Fall 2006


LMS / Widrow-Hoff Rule

Works fine for a single layer of trainable weights.

What about multi-layer networks?

[Diagram: linear unit with inputs x_i, weights w_i, summation, output y]

$$\Delta w_i = -\eta \, (y - d) \, x_i$$
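A few lines showing this delta rule in action on one training pattern (a sketch; eta is the learning rate, and the data is invented):

```python
import numpy as np

def lms_update(w, x, d, eta=0.1):
    """One LMS / Widrow-Hoff step: w_i <- w_i - eta * (y - d) * x_i."""
    y = np.dot(w, x)                 # output of the linear unit
    return w - eta * (y - d) * x

w = np.zeros(2)
x, d = np.array([1.0, 2.0]), 1.0     # one input pattern and its desired output
for _ in range(50):
    w = lms_update(w, x, d)
print(np.dot(w, x))                  # -> ~1.0: the unit now produces the desired output
```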


With Linear Units, Multiple Layers Don't Add Anything

[Diagram: input x (4-vector) → V (3×4 matrix) → U (2×3 matrix) → output y]

y = U × (V × x) = (U × V) × x, where U × V is a 2×4 matrix

Linear operators are closed under composition: equivalent to a single layer of weights W = U × V.

But with non-linear units, extra layers add computational power.
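A quick numerical check of this claim (shapes follow the slide: U is 2×3, V is 3×4):

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.normal(size=(2, 3))
V = rng.normal(size=(3, 4))
x = rng.normal(size=4)

# Two linear layers applied in sequence equal one layer with W = U x V (a 2x4 matrix).
print(np.allclose(U @ (V @ x), (U @ V) @ x))   # -> True
```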


What Can Be Done with Non-Linear (e.g., Threshold) Units?

1 layer of trainable weights: separating hyperplane


2 layers of trainable weights: convex polygon region


3 layers of trainable weights: composition of polygons (convex and non-convex regions)


How Do We Train A Multi-Layer Network?

[Diagram: two-layer network; at the output unit, Error = d − y, at the hidden units, Error = ???]

Can't use the perceptron training algorithm because we don't know the 'correct' outputs for hidden units.


How Do We Train A Multi-Layer Network?

Define sum-squared error:

$$E = \frac{1}{2} \sum_p (d_p - y_p)^2$$

Use gradient descent error minimization:

$$\Delta w_{ij} = -\eta \, \frac{\partial E}{\partial w_{ij}}$$

Works if the nonlinear transfer function is differentiable.


Deriving the LMS or “Delta” Rule As Gradient Descent Learning

$$y = \sum_i w_i x_i \qquad E = \frac{1}{2} \sum_p (d_p - y_p)^2 \qquad \frac{dE}{dy} = y - d$$

$$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{\partial y}{\partial w_i} = (y - d) \, x_i$$

$$\Delta w_i = -\eta \, \frac{\partial E}{\partial w_i} = -\eta \, (y - d) \, x_i$$

[Diagram: linear unit with inputs x_i, weights w_i, output y]

How do we extend this to two layers?


Switch to Smooth Nonlinear Units

$$\mathrm{net}_j = \sum_i w_{ij} \, y_i \qquad y_j = g(\mathrm{net}_j)$$

Common choices for g:

$$g(x) = \frac{1}{1 + e^{-x}}, \qquad g'(x) = g(x)\,\bigl(1 - g(x)\bigr)$$

$$g(x) = \tanh(x), \qquad g'(x) = 1/\cosh^2(x)$$

g must be differentiable.
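The two transfer functions and their derivatives in code, checked against a central-difference estimate (a quick sanity check, not part of the original slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    return sigmoid(x) * (1.0 - sigmoid(x))     # g'(x) = g(x) (1 - g(x))

def tanh_deriv(x):
    return 1.0 / np.cosh(x) ** 2               # g'(x) = 1 / cosh^2(x)

x, h = 0.7, 1e-5
print(np.isclose(sigmoid_deriv(x), (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)))   # True
print(np.isclose(tanh_deriv(x),    (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)))   # True
```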


Gradient Descent with Nonlinear Units

$$y = g(\mathrm{net}) = \tanh\Bigl(\sum_i w_i x_i\Bigr)$$

$$\frac{dE}{dy} = y - d, \qquad \frac{dy}{d\,\mathrm{net}} = 1/\cosh^2(\mathrm{net}), \qquad \frac{\partial\,\mathrm{net}}{\partial w_i} = x_i$$

$$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{dy}{d\,\mathrm{net}} \cdot \frac{\partial\,\mathrm{net}}{\partial w_i} = \frac{y - d}{\cosh^2\bigl(\sum_i w_i x_i\bigr)} \cdot x_i$$

[Diagram: unit computing tanh(Σ_i w_i x_i) from inputs x_i with weights w_i, output y]

Now We Can Use The Chain Rule

[Diagram: layers y_i → (weights w_ij) → y_j → (weights w_jk) → y_k]

$$\frac{\partial E}{\partial y_k} = y_k - d_k$$

$$\delta_k = \frac{\partial E}{\partial \mathrm{net}_k} = (y_k - d_k) \cdot g'(\mathrm{net}_k)$$

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial w_{jk}} = \frac{\partial E}{\partial \mathrm{net}_k} \cdot y_j$$

$$\frac{\partial E}{\partial y_j} = \sum_k \frac{\partial E}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial y_j}$$

$$\delta_j = \frac{\partial E}{\partial \mathrm{net}_j} = \frac{\partial E}{\partial y_j} \cdot g'(\mathrm{net}_j)$$

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial \mathrm{net}_j} \cdot y_i$$


Weight Updates

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial w_{jk}} = \delta_k \cdot y_j \qquad\qquad \frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial \mathrm{net}_j} \cdot \frac{\partial \mathrm{net}_j}{\partial w_{ij}} = \delta_j \cdot y_i$$

$$\Delta w_{jk} = -\eta \cdot \frac{\partial E}{\partial w_{jk}} \qquad\qquad \Delta w_{ij} = -\eta \cdot \frac{\partial E}{\partial w_{ij}}$$
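Putting the pieces together, a compact sketch of backpropagation for a two-layer network with tanh units; the layer sizes, data, and learning rate are arbitrary choices for illustration:

```python
import numpy as np

def g(x):        return np.tanh(x)
def g_prime(x):  return 1.0 / np.cosh(x) ** 2

def backprop_step(x, d, W1, W2, eta=0.1):
    """One forward/backward pass plus weight update; returns the current error E."""
    # Forward pass
    net_j = W1 @ x;   y_j = g(net_j)            # hidden layer
    net_k = W2 @ y_j; y_k = g(net_k)            # output layer
    # Backward pass: deltas from the chain rule
    delta_k = (y_k - d) * g_prime(net_k)        # output units
    delta_j = (W2.T @ delta_k) * g_prime(net_j) # hidden units
    # Weight updates: delta_w = -eta * dE/dw
    W2 -= eta * np.outer(delta_k, y_j)
    W1 -= eta * np.outer(delta_j, x)
    return 0.5 * np.sum((d - y_k) ** 2)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, d = np.array([0.5, -1.0]), np.array([0.8])
for step in range(200):
    E = backprop_step(x, d, W1, W2)
print(round(E, 6))                              # error shrinks toward zero
```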


Deep learning is everywhere

[Krizhevsky 2012]

Classification / Retrieval


[Faster R-CNN: Ren, He, Girshick, Sun 2015]

Detection / Segmentation

[Farabet et al., 2012]

Deep learning is everywhere


NVIDIA Tegra X1

self-driving cars

Deep learning is everywhere


[Toshev, Szegedy 2014]

[Mnih 2013]

Deep learning is everywhere


[Ciresan et al. 2013] [Sermanet et al. 2011] [Ciresan et al.]

Deep learning is everywhere

[Vinyals et al., 2015]

Image Captioning
