
Deep Learning for Vision

Xiaoming Liu

Slides adapted from Adam Coates


What do we want ML to do?

•  Given an image, predict complex high-level patterns:

[Figure: example tasks: object recognition ("Cat"), detection, and segmentation. Martin et al., 2001]


How is ML done?

•  Machine learning often uses a common pipeline with hand-designed feature extraction.
•  The final ML algorithm learns to make decisions starting from the higher-level representation.
•  Sometimes there are layers of increasingly high-level abstractions.
   –  Constructed using prior knowledge about the problem domain.

[Diagram: Feature Extraction → Machine Learning Algorithm → "Cat"?, with Prior Knowledge / Experience feeding the feature extraction stage.]


“Deep Learning”

•  Deep Learning:
   –  Train multiple layers of features/abstractions from data.
   –  Try to discover a representation that makes decisions easy.

[Diagram: Low-level Features → Mid-level Features → High-level Features → Classifier → "Cat"?; the representation becomes more abstract at each layer.]

Deep Learning: train layers of features so that the classifier works well.


“Deep Learning”

•  Why do we want "deep learning"?
   –  Some decisions require many stages of processing.
      •  It is easy to invent cases where a "deep" model is compact but a shallow model is very large / inefficient.
   –  We already, intuitively, hand-engineer "layers" of representation.
      •  Let's replace this with something automated!
   –  Algorithms scale well with data and computing power.
      •  In practice, one of the most consistently successful ways to get good results in ML.
   –  Can try to take advantage of unlabeled data to learn representations before the task.


Have we been here before?

•  Yes.
   –  Basic ideas are common to past ML and neural networks research.
      •  Supervised learning is straightforward.
      •  Standard ML development strategies are still relevant.
      •  Some knowledge carries over from problem domains.

•  No.
   –  Faster computers; more data.
   –  Better optimizers; better initialization schemes.
      •  The "unsupervised pre-training" trick [Hinton et al., 2006; Bengio et al., 2006].
   –  Lots of empirical evidence about what works.
      •  Made useful by the ability to "mix and match" components. [See, e.g., Jarrett et al., ICCV 2009]


Real impact

•  DL systems are high performers in many tasks over many domains.

Image recognition [e.g., Krizhevsky et al., 2012]
Speech recognition [e.g., Heigold et al., 2013]
NLP [e.g., Socher et al., ICML 2011; Collobert & Weston, ICML 2008]

[Slide credit: Honglak Lee]


Outline

•  ML refresher / crash course
   –  Logistic regression
   –  Optimization
   –  Features
•  Supervised deep learning
   –  Neural network models
   –  Back-propagation
   –  Training procedures
•  Supervised DL for images
   –  Neural network architectures for images
   –  Application to Image-Net
•  References / Resources

MACHINE LEARNING REFRESHER

Crash Course


Supervised Learning

•  Given labeled training examples: {(x(i), y(i)), i = 1, …, m}.

•  For instance: x(i) = vector of pixel intensities; y(i) = object class ID.

•  Goal: find f(x) to predict y from x on the training data.
   –  Hopefully, the learned predictor also works on "test" data.

[Figure: an image as a vector of pixel intensities (255, 98, 93, 87, …) → f(x) → y = 1 ("Cat").]


Logistic Regression

•  Simple binary classification algorithm.
   –  Start with a function of the form:

         f(x) = σ(θᵀx) = 1 / (1 + exp(−θᵀx))

   –  Interpretation: f(x) is the probability that y = 1.
      •  The sigmoid "nonlinearity" σ squashes the linear function to [0, 1].
   –  Find the choice of θ that minimizes the objective (the negative log-likelihood):

         J(θ) = −Σᵢ [ y(i) log f(x(i)) + (1 − y(i)) log(1 − f(x(i))) ]

[Figure: the sigmoid function, rising from 0 to 1.]


Optimization

•  How do we tune θ to minimize J(θ)?
•  One algorithm: gradient descent.
   –  Compute the gradient ∇θ J(θ).
   –  Follow the gradient "downhill": θ ← θ − α ∇θ J(θ), for a step size α.
•  Stochastic Gradient Descent (SGD): take each step using the gradient from only a small batch of examples.
   –  Scales to larger datasets. [Bottou & LeCun, 2005]

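To make the refresher concrete, here is a minimal NumPy sketch of logistic regression trained with (stochastic) gradient descent. The function and variable names are ours, not from the slides; the objective is the negative log-likelihood J(θ) above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    # f(x) = sigmoid(theta^T x), applied to each row of X
    return sigmoid(X @ theta)

def loss(theta, X, y):
    # Negative log-likelihood J(theta) for binary labels y in {0, 1}
    f = predict(theta, X)
    return -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))

def gradient(theta, X, y):
    # For the sigmoid + log-loss pair, grad J(theta) = X^T (f - y)
    return X.T @ (predict(theta, X) - y)

def sgd_step(theta, X_batch, y_batch, alpha=0.1):
    # One SGD step: follow the gradient "downhill" using a small batch
    return theta - alpha * gradient(theta, X_batch, y_batch)
```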

Is this enough?

•  The loss is convex → we always find the minimum.
•  Works for simple problems:
   –  Classify digits as 0 or 1 using pixel intensity.
   –  Certain pixels are highly informative, e.g., the center pixel.
•  Fails for even slightly harder problems.
   –  Is this a coffee mug?


Why is vision so hard?

[Figure: a "Coffee Mug" image represented as raw pixel intensities.]

Pixel intensity is a very poor representation.


[Figure: training images plotted by the intensities of two pixels; + = Coffee Mug, − = Not Coffee Mug.]

Why is vision so hard?

[Figure: the same two-pixel plot with a new test image at pixel intensities [72, 160]; nearby training examples include both + and −.]


Why is vision so hard?

[Figure: in pixel-intensity space, + (Coffee Mug) and − (Not Coffee Mug) examples are heavily intermixed, so a learning algorithm cannot reliably answer "Is this a Coffee Mug?" from raw pixels.]


Features

[Figure: the same examples re-plotted by two features, "cylinder?" and "handle?"; the + and − examples become cleanly separable, and the learning algorithm can now answer "Is this a Coffee Mug?".]


Features

•  Features are usually hard-wired transformations built into the system.
   –  Formally, a function that maps the raw input to a "higher level" representation.
   –  Completely static, so just substitute φ(x) for x and do logistic regression as before.

Where do we get good features?


Features

•  Huge investment has been devoted to building application-specific feature representations.
   –  Find higher-level patterns so that the final decision is easy to learn with an ML algorithm.

Examples: Object Bank [Li et al., 2010]; Super-pixels [Gould et al., 2008; Ren & Malik, 2003]; SIFT [Lowe, 1999]; Spin Images [Johnson & Hebert, 1999]

SUPERVISED DEEP LEARNING

Extension to neural networks


Basic idea

•  We saw how to do supervised learning when the "features" φ(x) are fixed.
   –  Let's extend to the case where the features are given by tunable functions with their own parameters.

[Diagram: the inputs to the classifier are "features", one for each row of a weight matrix W; the outer part of the function is the same as logistic regression.]


Basic idea

•  To do supervised learning for two-class classification, minimize the same objective J(θ) as before.

•  Same as logistic regression, but now f(x) has multiple stages ("layers", "modules"):

[Diagram: x → intermediate representation ("features") → prediction for y.]


Neural network

•  This model is a sigmoid "neural network":

[Diagram: each "neuron" applies the sigmoid to a weighted sum of its inputs; computation flows from input to output ("forward prop").]
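As a concrete (illustrative, not from the slides) sketch, forward propagation for a one-hidden-layer sigmoid network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b, w, c):
    # Hidden layer: tunable "features" phi(x); one feature per row of W.
    h = sigmoid(W @ x + b)
    # Output unit: ordinary logistic regression on the learned features.
    f = sigmoid(w @ h + c)
    return f, h
```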


Neural network

•  Can stack up several layers: the network must learn multiple stages of internal "representation".


Back-propagation

•  Minimize the same objective J(θ) as before.

•  To minimize J(θ) we need gradients with respect to all of the parameters.
   –  Then use the gradient descent algorithm as before.

•  The formula for the gradient of the output layer's parameters can be found by hand (same as before); but what about the inner weight matrices W?


The Chain Rule

•  Suppose we have a module y = f(x; W) inside the network.

•  If we know the gradient of the loss J with respect to the module's output y, the chain rule gives the gradient with respect to its input:

      ∂J/∂x = (∂y/∂x)ᵀ ∂J/∂y        (∂y/∂x is the Jacobian matrix)

   Similarly for W:

      ∂J/∂W = (∂y/∂W)ᵀ ∂J/∂y

•  Given the gradient with respect to the output, we can build a new "module" that finds the gradient with respect to the inputs.


The Chain Rule

•  Easy to build a toolkit of known rules for computing gradients, given the gradient with respect to each module's output.
   –  Automated differentiation! E.g., Theano [Bergstra et al., 2010].

[Table: common functions with their gradients w.r.t. inputs and w.r.t. parameters.]


Back-propagation

•  Can re-apply the chain rule to get gradients for all intermediate values and parameters.

"Backward" modules for each forward stage.
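A sketch of back-propagation for the one-hidden-layer network above, re-applying the chain rule one stage at a time. It assumes the log-loss objective and uses σ'(z) = σ(z)(1 − σ(z)); names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward(x, y, W, b, w, c):
    # Forward pass, keeping intermediate values.
    h = sigmoid(W @ x + b)        # hidden activations
    f = sigmoid(w @ h + c)        # predicted P(y = 1 | x)

    # Backward pass: each stage turns the gradient w.r.t. its output
    # into gradients w.r.t. its inputs and its parameters.
    df = f - y                    # dJ/d(output pre-activation) for log-loss + sigmoid
    dw = df * h                   # dJ/dw
    dc = df                       # dJ/dc
    dh = df * w                   # dJ/dh: chain rule through the output unit
    dz = dh * h * (1.0 - h)       # through the hidden sigmoid: sigma' = h(1 - h)
    dW = np.outer(dz, x)          # dJ/dW
    db = dz                       # dJ/db
    return dW, db, dw, dc
```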


Example

•  Given the gradient of the loss with respect to the network's output, compute the gradient with respect to an inner weight matrix W, using several items from our table.


Training Procedure

•  Collect labeled training data.
   –  For SGD: randomly shuffle after each epoch!
•  For each batch of examples:
   –  Compute the gradient w.r.t. all parameters in the network.
   –  Make a small update to the parameters.
•  Repeat until convergence. (A sketch of the loop follows.)
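A hedged sketch of that loop (the `sgd_step` from the refresher, or any other update rule, can serve as `step`; batch size and epoch count are placeholders):

```python
import numpy as np

def train(X, y, params, step, alpha=0.1, batch_size=64, epochs=10):
    # Generic SGD loop: shuffle every epoch, update on small batches.
    m = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(m)   # randomly shuffle after each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            params = step(params, X[idx], y[idx], alpha)
    return params
```

In practice "repeat until convergence" means monitoring the training or validation loss rather than fixing `epochs` in advance.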


Training Procedure

•  Historically, this has not worked so easily.
   –  Non-convex: local minima; convergence criteria.
   –  Optimization becomes difficult with many stages.
      •  The "vanishing gradient problem".
   –  Hard to diagnose and debug malfunctions.
•  Many things turn out to matter:
   –  Choice of nonlinearities.
   –  Initialization of parameters.
   –  Optimizer parameters: step size, schedule.


Nonlinearities

•  The choice of functions inside the network matters.
   –  The sigmoid function turns out to be difficult.
   –  Some other choices are often used (defined in the sketch below):

[Plots: tanh(z), saturating at −1 and 1; ReLU(z) = max{0, z}; abs(z).]

"Rectified Linear Unit" → increasingly popular. [Nair & Hinton, 2010]
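For reference, the three nonlinearities in NumPy (function names are ours):

```python
import numpy as np

def tanh(z):
    return np.tanh(z)              # saturates at -1 and 1

def relu(z):
    return np.maximum(0.0, z)      # "Rectified Linear Unit": max{0, z}

def abs_unit(z):
    return np.abs(z)               # absolute-value rectification
```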


Initialization

•  Usually small random values.
   –  Try to choose them so that the typical input to a neuron avoids the saturating / non-differentiable regions.
   –  Has to be random, otherwise every neuron will be equal!
   –  Occasionally inspect units for saturation / blowup.
   –  Larger values may give faster convergence, but worse models!

•  Initialization schemes for particular units (a sketch follows):
   –  tanh units: Unif[−r, r]; sigmoid units: Unif[−4r, 4r]. See [Glorot et al., AISTATS 2010].

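A sketch of the cited scheme, assuming the Glorot & Bengio (2010) choice r = sqrt(6 / (fan_in + fan_out)):

```python
import numpy as np

def init_weights(fan_in, fan_out, unit="tanh"):
    # Small random values chosen so typical pre-activations avoid the
    # saturating regions of the nonlinearity [Glorot et al., AISTATS 2010].
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if unit == "sigmoid":
        r *= 4.0                   # sigmoid units use Unif[-4r, 4r]
    return np.random.uniform(-r, r, size=(fan_out, fan_in))
```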

Optimization: Step sizes

•  Choose the SGD step size carefully.
   –  A factor of ~2 can make a difference.
•  Strategies:
   –  Brute force: try many; pick the one with the best result.
   –  Racing: pick the size with the best error on validation data after T steps.
      •  Not always accurate if T is too small.
•  Step size schedules (sketched below):
   –  Fixed: same step size throughout.
   –  Step
   –  Multi-step
   –  Inverse

Bengio, 2012: "Practical Recommendations for Gradient-Based Training of Deep Architectures"
Hinton, 2010: "A Practical Guide to Training Restricted Boltzmann Machines"
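The slide gives only the schedule names; here is a sketch following common (Caffe-style) definitions, with all constants as placeholders:

```python
def step_size(alpha0, t, schedule="fixed", gamma=0.1, step=10000, power=0.75):
    # alpha0: initial step size; t: iteration counter.
    if schedule == "fixed":        # same step size throughout
        return alpha0
    if schedule == "step":         # multiply by gamma every `step` iterations
        return alpha0 * gamma ** (t // step)
    if schedule == "multistep":    # multiply by gamma at hand-picked milestones
        milestones = [60000, 80000]
        return alpha0 * gamma ** sum(t >= m for m in milestones)
    if schedule == "inverse":      # alpha0 / (1 + gamma * t)^power
        return alpha0 / (1.0 + gamma * t) ** power
    raise ValueError(schedule)
```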


Optimization: Momentum

•  Keep a "smooth" estimate of the gradient from several steps of SGD:

      v ← µv − α ∇θ J(θ);    θ ← θ + v

•  A little bit like second-order information.
   –  High-curvature directions cancel out.
   –  Low-curvature directions "add up" and accelerate.

Bengio, 2012: "Practical Recommendations for Gradient-Based Training of Deep Architectures"
Hinton, 2010: "A Practical Guide to Training Restricted Boltzmann Machines"
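A sketch of the classical momentum update (standard form; the coefficient µ and step size are placeholders):

```python
def momentum_step(theta, velocity, grad, alpha=0.01, mu=0.9):
    # Running, exponentially weighted estimate of the gradient:
    # high-curvature directions cancel across steps, low-curvature
    # directions accumulate and accelerate.
    velocity = mu * velocity - alpha * grad
    return theta + velocity, velocity
```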


Other factors

•  A "weight decay" penalty can help.
   –  Add a small penalty on the squared weight magnitude (a sketch follows).

•  For modest datasets, L-BFGS or second-order methods are easier to use than SGD.
   –  See, e.g., Martens & Sutskever, ICML 2011.
   –  Can be crudely extended to the mini-batch case if batches are large. [Le et al., ICML 2011]
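Weight decay as a one-line gradient modification, sketched under the convention that the penalty (λ/2)·‖θ‖² is added to the loss:

```python
def grad_with_weight_decay(grad, theta, lam=1e-4):
    # The squared-magnitude penalty contributes lam * theta to the
    # gradient, gently shrinking the weights at every step.
    return grad + lam * theta
```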

SUPERVISED DL FOR VISION

Application


Working with images

•  Major factors:
   –  Choose the functional form of the network to roughly match the computations we need to represent.
      •  E.g., "selective" features and "invariant" features.
   –  Try to exploit knowledge of images to accelerate training or improve performance.

•  Generally, try to avoid wiring detailed visual knowledge into the system; prefer to learn it.


Local connectivity

•  Neural network view of a single neuron, fully connected to the input image: an extremely large number of connections.
   →  More parameters to train.
   →  Higher computational expense.
   →  Turns out not to be helpful in practice.


Local connectivity

•  Reduce parameters with local connections.
   –  The weight vector is a spatially localized "filter".


Local connectivity

•  Sometimes think of neurons as viewing small adjacent windows.
   –  Specify connectivity by the size ("receptive field" size) and spacing ("step" or "stride") of the windows.
      •  Typical RF size = 3 to 20 pixels.
      •  Typical step size = 1 pixel, up to the RF size.


Local connectivity

•  The spatial organization of the filters means the output features can also be organized like an image.
   –  The X, Y dimensions correspond to the X, Y position of the neuron's window.
   –  "Channels" are different features extracted from the same spatial location. (Also called "feature maps", or just "maps".)

[Figure: 1-dimensional example; a 1D input maps to outputs laid out by X spatial location and "channel" (map) index.]


Local connectivity

•  We can treat the output of a layer like an image and re-use the same tricks.

[Figure: the same 1-dimensional construction applied to the previous layer's output maps.]


Weight-Tying

•  Even with local connections, we may still have too many weights.
   –  Trick: constrain some weights to be equal if we know that some parts of the input should learn the same kinds of features.
   –  Images tend to be "stationary": different patches tend to have similar low-level structure.
      •  So constrain the weights used at different spatial positions to be equal.


Weight-Tying

•  Before, we could have neurons with different weights at different locations. We can reduce parameters by constraining them to be equal.

[Figure: 1-dimensional example with tied weights; the same filter slides across all X spatial locations within each map.]

•  This is sometimes called a "convolutional" network: each unique filter is spatially convolved with the input to produce the responses for its map (a 1-D sketch follows). [LeCun et al., 1989; LeCun et al., 2004]
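A 1-dimensional sketch matching the figures above: local receptive fields, a stride, and tied weights together amount to convolving each filter with the input (shapes and names are illustrative):

```python
import numpy as np

def conv1d_layer(x, filters, stride=1):
    # x: 1-D input signal.
    # filters: shape (num_maps, rf_size); one spatially localized
    #          filter per output channel ("map").
    num_maps, rf = filters.shape
    starts = list(range(0, len(x) - rf + 1, stride))
    out = np.empty((num_maps, len(starts)))
    for j, p in enumerate(starts):
        window = x[p:p + rf]          # local "receptive field" at position p
        out[:, j] = filters @ window  # tied weights: same filters everywhere
    return out                        # shape: (num_maps, num_positions)
```

For example, filters of shape (4, 5) applied to a length-20 input with stride 1 give a (4, 16) output: 4 maps, one response per window position.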


Pooling

•  Functional layers designed to represent invariant features.
•  Usually locally connected, with specific nonlinearities.
   –  Combined with convolution, this corresponds to hard-wired translation invariance.
•  Usually fix the weights to a local box or Gaussian filter.
   –  Easy to represent max-, average-, or 2-norm pooling (sketched below).

[Scherer et al., ICANN 2010] [Boureau et al., ICML 2010]
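A matching 1-D pooling sketch (max, average, or 2-norm over small windows; names are ours):

```python
import numpy as np

def pool1d(feature_map, size=2, stride=2, kind="max"):
    # Pool small neighborhoods of a feature map to build locally
    # translation-invariant features.
    starts = range(0, len(feature_map) - size + 1, stride)
    windows = [feature_map[p:p + size] for p in starts]
    if kind == "max":
        return np.array([w.max() for w in windows])
    if kind == "average":
        return np.array([w.mean() for w in windows])
    if kind == "l2":
        return np.array([np.sqrt(np.sum(w ** 2)) for w in windows])
    raise ValueError(kind)
```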


Batch Normalization

•  Reduces internal covariate shift during training.
•  Normalizes spatially over all samples in each batch (a sketch in code follows the equations).

[Figure: a batch of m feature maps, Sample 1 … Sample m.]

For a batch B of m samples with W × H spatial positions per map:

   µ_B ← (1 / (W·H·m)) Σ_{w,h,i} x_{w,h,i}

   σ²_B ← (1 / (W·H·m)) Σ_{w,h,i} (x_{w,h,i} − µ_B)²

   x̂_{w,h,i} ← (x_{w,h,i} − µ_B) / √(σ²_B + ε)

   y_{w,h,i} ← γ·x̂_{w,h,i} + β ≡ BN_{γ,β}(x_{w,h,i})
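A per-channel sketch of these equations in NumPy (training-time statistics only; the running averages used at inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: one channel over a batch, shape (m, H, W).
    # Normalize spatially over all samples in the batch, then apply
    # the learned scale gamma and shift beta.
    mu = x.mean()                          # mean over batch and spatial dims
    var = x.var()                          # variance over batch and spatial dims
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # BN_{gamma,beta}(x)
```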


Application: Image-Net

•  System from Krizhevsky et al., NIPS 2012:
   –  Convolutional neural network.
   –  Max-pooling.
   –  Rectified linear units (ReLU).
   –  Local response normalization.
   –  Local connectivity.


Application: Image-Net

•  Top result in LSVRC 2012: ~85% top-5 accuracy.

What's an Agaric!?


More applications

•  Segmentation: predict the classes of pixels / super-pixels.

•  Detection: combine classifiers with a sliding-window architecture.
   –  Economical when used with convolutional nets.

•  Robotic grasping. [Lenz et al., RSS 2013]

[Figures: Ciresan et al., NIPS 2012; Farabet et al., ICML 2012; Pierre Sermanet (2010): http://www.youtube.com/watch?v=f9CuzqI1SkE]

Go

AlphaGo beat Lee Sedol, one of the world's top players, 4-1 in March 2016 (win-win-win-loss-win). Games are available at https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol


How does AlphaGo work?

(For all the details, see their paper: www.nature.com/nature/journal/v529/n7587/pdf/nature16961.pdf)

•  Combines two techniques:
   –  Monte Carlo Tree Search (MCTS)
      •  Historical advantages of MCTS over minimax-search approaches:
         –  Does not require an evaluation function.
         –  Typically works better with large branching factors.
      •  Improved Go programs by ~10 kyu around 2006.
   –  Deep Learning
      •  "Value networks" to evaluate board positions, and
      •  "Policy networks" to select moves.


AlphaGo hardware

•  Evaluating policy and value networks requires several orders of magnitude more computation than traditional search heuristics.

•  AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs and computes policy and value networks in parallel on GPUs.

•  The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs.

•  (They also implemented a distributed version.)


Speech Recognition

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition", arXiv:1412.5567v2, 2014.

Input: acoustic features (spectrograms)
Output: characters (and space), plus a null symbol (~)
No phoneme or lexicon models (no OOV problem)


Resources

Tutorials:

Stanford Deep Learning tutorial: http://ufldl.stanford.edu/wiki

Deep Learning tutorials list: http://deeplearning.net/tutorials

IPAM DL/UFL Summer School: http://www.ipam.ucla.edu/programs/gss2012/

ICML 2012 Representation Learning Tutorial: http://www.iro.umontreal.ca/~bengioy/talks/deep-learning-tutorial-2012.html


References

http://www.stanford.edu/~acoates/bmvc2013refs.pdf

Overviews:

Yoshua Bengio, "Practical Recommendations for Gradient-Based Training of Deep Architectures"

Yoshua Bengio & Yann LeCun, "Scaling Learning Algorithms towards AI"

Yoshua Bengio, Aaron Courville & Pascal Vincent, "Representation Learning: A Review and New Perspectives"

Software:

Theano GPU library: http://deeplearning.net/software/theano
SPAMS toolkit: http://spams-devel.gforge.inria.fr/
