Deep Convolutional Networks & Computer Vision J. Sullivan, H. Azizpour, A. S. Razavian, A. Maki and S. Carlsson Computer Vision Group, KTH. March 10, 2015


Page 1:

Deep Convolutional Networks & Computer Vision

J. Sullivan, H. Azizpour, A. S. Razavian, A. Maki and S. Carlsson

Computer Vision Group,

KTH.

March 10, 2015

Page 2:

What has Deep Learning done for Computer Vision?

Deep Learning has resulted in

1. much better automatic

- visual image classification and

- object detection,

2. much more powerful generic image representations.

Page 3:

What have ConvNets done for Computer Vision?

ConvNets have resulted in

1. much better automatic

- visual image classification and

- object detection,

2. much more powerful generic image representations.

Page 4:

Image Classification Task: ILSVRC (Task 2: Classification)

[Examples: for each image the system outputs five labels, e.g. scale, T-shirt, steel drum, drumstick, mud turtle — marked ✓ when the true label (steel drum) is among them, ✗ when it is not (e.g. giant panda instead).]

$$\text{Error} = \frac{1}{100{,}000} \sum_{i=1}^{100{,}000} \mathbb{1}(\text{incorrect on image } i)$$

Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013
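A small aside (my addition, not from the talk): the evaluation rule above is easy to state in numpy, with `guesses` holding each system's five label guesses per image:

```python
import numpy as np

# guesses[i] holds a system's 5 label guesses for image i;
# truth[i] is the ground-truth label. The error is the fraction of
# images whose true label is not among the guesses.
guesses = np.array([[3, 7, 2, 9, 1],
                    [4, 0, 8, 6, 5]])
truth = np.array([7, 2])

error = np.mean([t not in g for g, t in zip(guesses, truth)])
print(error)   # 0.5 -- label 2 is missing from the second row
```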

Page 5:

ConvNets → much better image classification

[Bar chart — classification error (%) by year: 2010: 28.2, 2011: 25.8, 2012: 16.4, 2013: 11.7, 2014: 6.7]

Performance of winning entry in ILSVRC competitions (2010-14).

Red indicates when deep ConvNets were introduced.

Page 6:

Pascal VOC: Object Detection

PASCAL VOC 2005–2012

Tasks: Classification (e.g. person, motorcycle), Detection, Segmentation, Action (e.g. riding bicycle).

20 object classes, 22,591 images.

Everingham, Van Gool, Williams, Winn and Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV 2010.

Page 7:

ConvNets → much better object detection

[Plot — accuracy (10–80) vs. year (2007–2015), per class (plant, person, chair, cat, car, aeroplane) and for all classes, with the deep-learning entries marked]

Progress of object detection for the Pascal VOC 2007 challenge.

Page 8:

ConvNets → much better image representation

Task                          Best state of the art   ConvNet off-the-shelf + Linear SVM
Object Classification         71.1                    77.2
Scene Classification          64                      69
Bird Subcategorization        56.8                    61.8
Flowers Recognition           80.7                    86.8
Human Attribute Detection     69.9                    73
Object Attribute Detection    89.5                    91.4
Paris Buildings Retrieval     74.9                    79.5
Oxford Buildings Retrieval    81.7                    68
Sculptures Retrieval          45.4                    42.3
Scene Image Retrieval         81.9                    84.3
Object Instance Retrieval     89.3                    91.1

Source: CNN Features off-the-shelf: an Astounding Baseline for Recognition, A. Sharif Razavian et al., arXiv, March 2014.

Page 9:

Reason for jump in performance:

Learn feature hierarchies from the data

Page 10:

Modern Visual Recognition Systems

1. Training Phase

- Gather labelled training data.

- Extract a feature representation for each training example.

- Construct a decision boundary.

2. Test Phase

- Extract feature representation from the test example.

- Compare to the learnt decision boundary.

Page 11:

Modern Visual Recognition Systems

1. Training Phase

- Gather labelled training data.

- Extract a feature representation for each training example.

- Construct a decision boundary.

2. Test Phase

- Extract feature representation from the test example.

- Compare to the learnt decision boundary.

It’s just supervised learning.

Page 12:

Is it a bike or a face?

Page 13:

Construct a decision boundary

[Figure: feature points of the two classes separated by a decision boundary]

Page 14:

The two extremes of feature extraction

[Two panels: "Ideal features" vs. "Far from ideal"]

Page 15:

The two extremes of feature extraction

[Two panels: "Ideal features" vs. "Far from ideal"]

Supervised Deep Learning allows you to learn ideal features.

Page 16:

Learning Representations/Features

Traditional Pattern Recognition: fixed/handcrafted feature extraction

input → Feature Extractor → Trainable Classifier

Modern Pattern Recognition: unsupervised mid-level features

input → Feature Extractor → Mid-level Features → Trainable Classifier

Deep Learning: train hierarchical representations

input → Low-level Features → Mid-level Features → High-level Features → Trainable Classifier

Source: Talk Computer Perception with Deep Learning by Yann LeCun

Page 17:

Key Properties of Deep Learning

Provides a mechanism to:

• Learn a highly non-linear function.

• Learn it from data.

• Build feature hierarchies

- Distributed representations

- Compositionality

• Perform end-to-end learning.

Page 18:

How? Convolutional Networks

Page 19:

Convolutional Networks

• Are deployed in many practical applications: image recognition, speech recognition, Google's and Baidu's photo taggers.

• Have won several competitions: ImageNet, Kaggle Facial Expression, Kaggle Multimodal Learning, German Traffic Signs, Connectomics, Handwriting, ...

• Are applicable to array data where nearby values are correlated: images, sound, time-frequency representations, video, volumetric images, RGB-Depth images, ...

Source: Talk Computer Perception with Deep Learning by Yann LeCun

Page 20:

Convolutional Network

The Convolutional Net Model (multistage Hubel–Wiesel system):

[Diagram: multiple convolutions ("simple cells") alternating with pooling/subsampling ("complex cells"), producing retinotopic feature maps.]

• Training is supervised, with stochastic gradient descent.

• LeCun et al. '89, '98.

Source: Talk Computer Perception with Deep Learning by Yann LeCun

Page 21:

ConvNets: History

• Fukushima 1980: designed a network with the same basic structure but did not train it by backpropagation.

• LeCun from late 80s: figured out backpropagation for ConvNets, popularized and deployed ConvNets for OCR applications etc.

• Poggio from 1999: same basic structure, but learning is restricted to the top layer (k-means at the second stage).

• LeCun from 2006: unsupervised feature learning.

• DiCarlo from 2008: large scale experiments, normalization layer.

• LeCun from 2009: harsher non-linearities, normalization layer, learning unsupervised and supervised.

• Mallat from 2011: provides a theory behind the architecture.

• Hinton 2012: use bigger nets, GPUs, more data, purely supervised.

Page 22:

[Timeline image: Convolutional Neural Net 1988 → 1998 → 2012]

Q: Did we make any progress since then?

A: The main reasons for the breakthrough are data and GPUs, but we have also made networks deeper and more non-linear.

Reasons for breakthrough now:

• Data and GPUs,

• Networks have been made deeper.

Page 23:

Modern Convolutional Network

[AlexNet 2012 diagram: input image 224×224×3 → convolutional layers 55×55×48 → 27×27×128 → 13×13×192 → 13×13×192 → 13×13×128 → fully connected layers: dense 4096 → dense 4096 → dense 1000 → output]
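As an aside (my addition, not from the slides): the spatial sizes in a diagram like this follow from the standard output-size formula for a strided convolution or pooling stage. The 11×11 kernel, stride 4, padding 2 numbers below are AlexNet's first-layer settings from Krizhevsky et al.'s paper:

```python
def out_size(n, kernel, stride, pad):
    """Spatial output size of a convolution/pooling stage
    applied to an n x n input."""
    return (n + 2 * pad - kernel) // stride + 1

# AlexNet's first convolutional layer: 224x224 input, 11x11 kernels,
# stride 4, padding 2  ->  55x55 response maps.
print(out_size(224, kernel=11, stride=4, pad=2))   # 55
```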

Page 24:

Convolutional Networks for RGB Images: The Basic Operations

Page 25:

Convolution Operation

• Input:

- a set of 2d feature maps $x_{1:m} = \{x_1, \ldots, x_m\}$

- each $x_i$ has size $W \times W$

• Convolutional Parameters:

- a set of 2d convolutional kernels $k_{1:m} = \{k_1, \ldots, k_m\}$

- each $k_i$ has size $(2w+1) \times (2w+1)$, and

- a bias term $b$

[Diagram: input image 224×224×3 → convolution response maps 224×224×48]

Page 26:

Convolution Operation

Convolutional Operator:

• Define $\mathrm{conv}(\cdot,\cdot,\cdot)$, the convolution of $x_{1:m}$ with $k_{1:m}$, as:

$$\mathrm{conv}(x_{1:m}, k_{1:m}, b) = \sum_{i=1}^{m} (x_i * k_i) + b$$

where the 2d convolution $x_i * k_i$ returns a 2d map whose $(x,y)$th entry is

$$(x_i * k_i)_{x,y} = \sum_{x'=-w}^{w} \sum_{y'=-w}^{w} k_{i,\,x'+w+1,\,y'+w+1}\; x_{i,\,x+x',\,y+y'}$$

[Diagram: input image 224×224×3 → convolution response maps 224×224×48]
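To make the definition concrete, here is a minimal numpy sketch (my addition): plain cross-correlation with zero padding, summed over the input maps. Real implementations add strides and use optimized routines:

```python
import numpy as np

def conv2d_same(x, k):
    """Per-entry formula from the slide: sum over the kernel window of
    k[x'+w, y'+w] * x[x+x', y+y'], zero-padded so a W x W map stays W x W."""
    w = k.shape[0] // 2
    xp = np.pad(x, w)                           # zero-pad the borders
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(k * xp[i:i + k.shape[0], j:j + k.shape[1]])
    return out

def conv(x_maps, kernels, b):
    """conv(x_{1:m}, k_{1:m}, b) = sum_i (x_i * k_i) + b."""
    return sum(conv2d_same(x, k) for x, k in zip(x_maps, kernels)) + b
```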

Page 27:

Remember 2D convolution

[Diagram: image f in the spatial domain, origin at the top-left; a 3×3 filter with coefficients w(−1,−1) … w(1,1) slides over the pixels f(x−1,y−1) … f(x+1,y+1) of the image section under the filter.]

$$g(x, y) = \sum_{s=-a}^{a} \sum_{t=-b}^{b} w(s, t)\, f(x+s,\, y+t)$$
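As a sanity check (my addition, assuming SciPy is available), g(x, y) above is exactly what scipy.signal.correlate2d computes:

```python
import numpy as np
from scipy.signal import correlate2d

f = np.random.rand(8, 8)      # image
w = np.random.rand(3, 3)      # 3x3 filter coefficients (a = b = 1)

# g(x, y) = sum_{s,t} w(s, t) f(x+s, y+t); mode='same' keeps the
# output the size of f, zero-padding the borders.
g = correlate2d(f, w, mode='same')
```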

Page 28:

Next: non-linear activation, then max-pool

Create a new 2d feature map by applying two more operators:

$$\tilde{x} = \mathrm{pool}(\sigma(\mathrm{conv}(x_{1:m}, k_{1:m}, b)))$$

where

- $\sigma(\cdot)$ is a non-linear function, typically $\sigma(x) = \max(0, x)$,

- $\mathrm{pool}(\cdot)$ represents a local max-pooling operator.

[Diagram: input image 224×224×3 → activation response maps 224×224×48 → max-pooled response maps 55×55×48]
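A matching numpy sketch of the two operators (my addition); note the slides' 224 → 55 reduction comes from strided operations in the actual network, while this sketch just uses a generic window size and stride:

```python
import numpy as np

def relu(x):
    """sigma(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def max_pool(x, size=2, stride=2):
    """Local max-pooling: the max over size x size windows, moved by
    `stride` (a minimal sketch; real nets may use overlapping windows)."""
    rows = (x.shape[0] - size) // stride + 1
    cols = (x.shape[1] - size) // stride + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

# One full stage, as on the slide (conv as sketched earlier):
# x_tilde = max_pool(relu(conv(x_maps, kernels, b)))
```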

Page 29:

From one convolutional layer to the next

• At convolutional layer $l$ we have a set of 2d feature maps

$$x^{(l)}_{1:m_l} = \{x^{(l)}_1, \ldots, x^{(l)}_{m_l}\}$$

• We have multiple sets of convolutional kernels $k^{(l+1)}_{j,1:m_l}$, for $j = 1, \ldots, m_{l+1}$.

• For each kernel set $k^{(l+1)}_{j,1:m_l}$, create a new 2d feature map:

$$x^{(l+1)}_j = \mathrm{pool}\!\left(\sigma\!\left(\mathrm{conv}\!\left(x^{(l)}_{1:m_l},\, k^{(l+1)}_{j,1:m_l},\, b^{(l+1)}_j\right)\right)\right)$$
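Composing the earlier sketches gives the whole layer-to-layer step (my addition); `kernel_sets[j]` plays the role of $k^{(l+1)}_{j,1:m_l}$:

```python
def conv_layer(x_maps, kernel_sets, biases):
    """For each kernel set j, produce the new feature map
    x^{(l+1)}_j = pool(sigma(conv(x^{(l)}_{1:m_l}, k^{(l+1)}_{j,1:m_l}, b_j))),
    reusing the conv/relu/max_pool sketches above."""
    return [max_pool(relu(conv(x_maps, ks, b)))
            for ks, b in zip(kernel_sets, biases)]
```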

Page 30:

Convolve → Activation → Max-pool

[Diagram: layer 1 output 55×55×48 → activation response maps 55×55×128 → max-pooled response maps 27×27×128]

For $j = 1, \ldots, m_{l+1}$:

- Convolve the current response maps with $k^{(l+1)}_{j,1:m_l}$:

$$\hat{x}^{(l+1)}_j = \mathrm{conv}\!\left(x^{(l)}_{1:m_l},\, k^{(l+1)}_{j,1:m_l},\, b^{(l+1)}_j\right)$$

- Non-linear activation:

$$z^{(l+1)}_j = \sigma\!\left(\hat{x}^{(l+1)}_j\right)$$

- Max-pool:

$$x^{(l+1)}_j = \mathrm{pool}\!\left(z^{(l+1)}_j\right)$$

Page 31:

AlexNet 2012

[AlexNet diagram: input image 224×224×3 → convolutional layers 55×55×48 → 27×27×128 → 13×13×192 → 13×13×192 → 13×13×128 → fully connected layers: dense 4096 → dense 4096 → dense 1000 → output]

Page 32:

1st fully connected layer

For $j = 1, \ldots, m_{l_c+1}$:

$$x^{(l_c+1)}_j = \max\!\left(\sum_{i=1}^{m_{l_c}} w^{(l_c+1)}_{j,i} \cdot x^{(l_c)}_i + b^{(l_c+1)}_j,\; 0\right)$$

[AlexNet diagram as on page 31]
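Flattening the maps turns this layer into one matrix product followed by a ReLU; a minimal numpy sketch (my addition), where `W_list[j][i]` stands for $w^{(l_c+1)}_{j,i}$:

```python
import numpy as np

def first_fc_layer(x_maps, W_list, b):
    """x_j = max( sum_i <w_ji, x_i> + b_j, 0 ), with each w_ji the same
    size as the feature map x_i; flattened, it is one matrix product."""
    x = np.concatenate([m.ravel() for m in x_maps])            # all maps as one vector
    W = np.stack([np.concatenate([w.ravel() for w in W_j])     # one row per unit j
                  for W_j in W_list])
    return np.maximum(0.0, W @ x + b)
```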

Page 33:

2nd fully connected layer

For $j = 1, \ldots, m_{l_c+2}$:

$$x^{(l_c+2)}_j = \max\!\left(w^{(l_c+2)}_j \cdot x^{(l_c+1)} + b^{(l_c+2)}_j,\; 0\right)$$

[AlexNet diagram as on page 31]

Page 34:

Output layer: soft-max operator

• For $j = 1, \ldots, C$:

$$o'_j = w^{(l_c+3)}_j \cdot x^{(l_c+2)} + b^{(l_c+3)}_j, \qquad o_r = \frac{\exp(o'_r)}{\sum_{j=1}^{C} \exp(o'_j)}$$

[AlexNet diagram as on page 31]
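A small numpy sketch of the soft-max (my addition); subtracting the max first is the usual overflow guard and leaves the result unchanged:

```python
import numpy as np

def softmax(o_prime):
    """o_r = exp(o'_r) / sum_j exp(o'_j)."""
    z = np.exp(o_prime - np.max(o_prime))
    return z / np.sum(z)

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))   # probabilities over the C classes, summing to 1
```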

Page 35:

Parameters of the model

• Filter parameters:

- Convolutional layers: for $l = 1, \ldots, l_c$: kernels $k^{(l)}_{j,1:m_{l-1}}$, $1 \le j \le m_l$, where each $k^{(l)}_{j,i}$ has size $w_l \times w_l$.

- Fully connected layers:

* First fully connected layer: $w^{(l_c+1)}_{j,i}$, $1 \le j \le m_{l_c+1}$, $1 \le i \le m_{l_c}$, where each $w^{(l_c+1)}_{j,i}$ and $x^{(l_c)}_i$ have equal size.

* Subsequent fully connected layers: for $l = l_c+2, \ldots, l_c+L$: $w^{(l)}_j$, $1 \le j \le m_l$, where each $w^{(l)}_j$ has size $m_{l-1}$.

Page 36:

Training Convolutional Networks

• Set-up — supervised learning:

- For an RGB image $x$, set the ConvNet's first set of 2d feature maps to

$$x^{(0)}_{1:3} = \{x_{\text{red channel}},\, x_{\text{green channel}},\, x_{\text{blue channel}}\}$$

- We have a set $D$ of labelled training images, i.e. many pairs $(x, y)$.

- To learn the network's parameters, we must link the value of

$$\mathcal{W} = \{\mathcal{W}_{\text{convolutional}},\, \mathcal{W}_{\text{fully connected}}\}$$

to the network's prediction performance on $D$.

Page 37:

Training ConvNets: Measuring Performance

- Remember that a ConvNet represents a function

$$f_{\text{ConvNet}} : [0,1]^{W \times W \times 3} \times \mathbb{R}^p \to [0,1]^M$$

so for an input $x$ the function predicts its label: $f_{\text{ConvNet}}(x; \mathcal{W}) = \hat{y}$.

- Use a loss function to measure the error in $f_{\text{ConvNet}}(x; \mathcal{W})$'s predicted label for an input $x$ in $D$.

- The loss function typically has the property that $L(y, \hat{y})$ increases as $\|y - \hat{y}\|$ increases.

- The cross-entropy loss is frequently used:

$$L(y, f_{\text{ConvNet}}(x; \mathcal{W})) = -\sum_{j=1}^{M} y_j \log(\hat{y}_j)$$
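The loss is a one-liner once the soft-max output is available; a sketch (my addition), with a small epsilon guarding log(0):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """L(y, y_hat) = -sum_j y_j log(y_hat_j), with y one-hot and
    y_hat the soft-max output."""
    return -np.sum(y * np.log(y_hat + eps))

y = np.array([0.0, 1.0, 0.0])         # true label: class 2, one-hot
y_hat = np.array([0.1, 0.7, 0.2])     # network's predicted probabilities
print(cross_entropy(y, y_hat))        # -log(0.7) ~ 0.357
```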

Page 38:

Training ConvNets: The Optimization Problem

• Define the performance of a network with parameters $\mathcal{W}$ on $D$ as

$$E(D, \mathcal{W}) = \frac{1}{|D|} \sum_{(x,y) \in D} L(y, f_{\text{ConvNet}}(x; \mathcal{W}))$$

• The learning problem is to find the $\mathcal{W}$ that minimizes this:

$$\min_{\mathcal{W}} E(D, \mathcal{W})$$

• How do we do the optimization?

Page 39:

Training ConvNets: How to Optimize

Our optimization problem:

$$\min_{\mathcal{W}} E(D, \mathcal{W}) = \min_{\mathcal{W}} \frac{1}{|D|} \sum_{(x,y) \in D} L(y, f_{\text{ConvNet}}(x; \mathcal{W}))$$

• Initialize the network's parameters randomly to get $\mathcal{W}^{(0)}$.

• Update the parameters using mini-batch Stochastic Gradient Descent (SGD):

- At iteration $t$, randomly choose a small subset $D^{(t)}$ of $D$.

- Perform the update with learning rate $\alpha^{(t)}$:

$$\mathcal{W}^{(t+1)} = \mathcal{W}^{(t)} - \alpha^{(t)} \left.\nabla_{\mathcal{W}} E(D^{(t)}, \mathcal{W})\right|_{\mathcal{W}^{(t)}}$$

• This procedure finds a local minimum of $E(D, \mathcal{W})$. Is this good enough?
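In code the update rule is a short loop (my addition); `grad_E` is a hypothetical placeholder for the gradient that backpropagation supplies, and W is kept as one flat parameter vector for simplicity:

```python
import numpy as np

def sgd(W, data, grad_E, alpha=0.01, batch_size=128, iters=1000):
    """W^(t+1) = W^(t) - alpha * grad E(D^(t), W)|_{W^(t)},
    with D^(t) a random mini-batch (constant learning rate here)."""
    rng = np.random.default_rng(0)
    for t in range(iters):
        idx = rng.choice(len(data), size=batch_size)   # pick D^(t)
        batch = [data[i] for i in idx]
        W = W - alpha * grad_E(batch, W)               # grad_E: not implemented here
    return W
```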


Page 41:

Intuition about the hardness of training a deep ConvNet with backpropagation

Next slides: Deep Learning for Vision: Tricks of the Trade, M. Ranzato, BAVM, Oct '13

Page 42:

ConvNets: till 2012

[Sketch: loss as a function of a parameter, with many local minima]

Common wisdom: training does not work because we “get stuck in local minima”

Page 43:

ConvNets: today

[Sketch: loss as a function of a parameter]

Local minima are all similar, there are long plateaus, and it can take a long time to break symmetries.

[Annotations: the input/output is invariant to permutations of hidden units (weights w); breaking ties between parameters; saturating units ($w^T x$)]

Page 44:

Like walking on a ridge between valleys

Page 45:

ConvNets: today

[Sketch: loss as a function of a parameter]

Local minima are all similar, there are long plateaus, and it can take a long time to break symmetries.

Optimization is not the real problem when:
– the dataset is large,
– units do not saturate too much,
– there is a normalization layer.

Page 46:

ConvNets: today

[Sketch: loss as a function of a parameter]

Today's belief is that the challenge is about:
– generalization: how many training samples does it take to fit 1B parameters? How many parameters/samples to model spaces with 1M dimensions?
– scalability

Page 47:

Regularization during training is very important

Avoid overfitting by:

• Training with large labelled datasets.

• Augment training sets with random jitterings of input.

• Train for a long time with small learning rates.

• Dropout (idea from Geoff Hinton): only classify with, and update, a random subset of the network at each training iteration.

• Be vigilant! When training, constantly monitor performance on a validation set.
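A minimal sketch of the dropout idea (my addition; the 1/(1-p) "inverted dropout" rescaling is one common variant, not necessarily the original formulation):

```python
import numpy as np

def dropout(x, p=0.5, rng=np.random.default_rng()):
    """Training time: zero each activation with probability p and rescale
    the survivors by 1/(1-p) so the expected activation is unchanged.
    At test time the layer is simply the identity."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)
```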

Page 48:

Source for many labelled images:

ImageNet

Page 49:

ImageNet: Large scale visual recognition challenge (2010–13)

• Classification (+ Localization) Challenge

- 1000 object classes

- 1,431,167 images

• Detection Challenge

- 200 object classes

- 456,191 images

Source: http://image-net.org/challenges/LSVRC/2013

Page 50:

[Example ILSVRC images: backpack, flute, strawberry, traffic light, bathing cap, matchstick, racket, sea lion]

Page 51:

Variety of object classes in ILSVRC

[Figure from Russakovsky, Deng, Huang, Berg, Fei-Fei; see http://image-net.org/challenges/LSVRC/2012/analysis]

Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013

Page 52:

Variety of object classes in ILSVRC (DET and CLS-LOC)

Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013

Page 53:

A revolution in computer vision

Page 54:

ConvNet of Krizhevsky, Sutskever, Hinton 2012

ImageNet Classification with Deep Convolutional Neural Networks (NIPS '12)

[AlexNet diagram as on page 31]

Page 55:

Image Classification: Dramatic ILSVRC Results since 2012

• ImageNet Large Scale Visual Recognition Challenge

• 1000 categories, 1.3 million (×10 with data augmentation) labeled training samples

[Bar chart — classification error (%) by year: 2010: 28.2, 2011: 25.8, 2012: 16.4, 2013: 11.7, 2014: 6.7]

Page 56:

ConvNet of Krizhevsky, Sutskever, Hinton 2012

• Method: large convolutional net

- 60M parameters

- Trained with backprop on GPU

- Trained “with all the tricks Yann came up with in the last 20 years, plus dropout” (Hinton, NIPS 2012)

- Rectification, contrast normalization,...

- Softmax output function

• Error Rate on ImageNet: 15% (correct class not in top 5)

• Previous state of the art: 25% error

• Deployed in Google+ Photo Tagging in May 2013

Page 57:

First layer filters learnt

[Image: the first-layer filters learnt by the network of Krizhevsky, Sutskever & Hinton 2012 — slide by Y. LeCun.]

Method: large convolutional net — 650K neurons, 832M synapses, 60M parameters. Trained with backprop on GPU, “with all the tricks Yann came up with in the last 20 years, plus dropout” (Hinton, NIPS 2012); rectification, contrast normalization, ...

Error rate: 15% (whenever the correct class isn't in the top 5). Previous state of the art: 25% error.

A REVOLUTION IN COMPUTER VISION: acquired by Google in Jan 2013; deployed in Google+ Photo Tagging in May 2013.

Page 58:

AlexNet: Object Recognition Results

[Example recognition results from Krizhevsky, Sutskever, Hinton 2012; slide by Y. LeCun]

Page 59:

AlexNet: Object Recognition Results

[Example recognition results from Krizhevsky, Sutskever, Hinton 2012; slide by Y. LeCun]

Page 60:

AlexNet: Object Recognition Results

[Example recognition results from Krizhevsky, Sutskever, Hinton 2012; slide by Y. LeCun]

Page 61:

Leader Board from ImageNet LSVRC-2014

Name                     Institution                 ConvNet   Error (%)
GoogLeNet                Google                      yes       6.656
VGG                      Oxford University           yes       7.337
MSRA Visual Computing    Microsoft Research Asia     yes       8.060
Andrew Howard            consultant                  yes       8.111
DeeperVision             company                     yes       9.058

For more details check out

http://www.image-net.org/challenges/LSVRC/2014/results.php

Page 62:

ConvNet features are generic

Page 63:

CNN Features: an Astounding Baseline for Recognition

• CNN Features off-the-shelf: an Astounding Baseline for Recognition, A. Sharif Razavian, H. Azizpour, J. Sullivan and S. Carlsson, CVPR Workshop on Deep Learning 2014.

• The paper's experimental evaluation shows: replace the handcrafted feature pipeline used for many tasks

[Pipeline diagram: Image + Part Annotations → Learn Normalized Pose → Extract Features (RGB, gradient, LBP) → Strong DPM → SVM]

with ConvNet features from one large ConvNet trained on ImageNet, and IMPROVE RESULTS.
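A hedged sketch of the paper's recipe (my addition): treat a pre-trained ConvNet as a fixed feature extractor and put a linear SVM on top. `extract_cnn_features` is a stand-in for a forward pass through an ImageNet-trained network, and the data below are dummies:

```python
import numpy as np
from sklearn.svm import LinearSVC

def extract_cnn_features(image):
    """Placeholder: in the paper this is a forward pass through an
    ImageNet-trained ConvNet, reading off a late layer (e.g. 4096-d)."""
    return np.random.rand(4096)

# Dummy stand-ins for a small labelled target-task dataset.
images = [np.random.rand(224, 224, 3) for _ in range(20)]
labels = np.array([0, 1] * 10)

features = np.stack([extract_cnn_features(im) for im in images])
clf = LinearSVC(C=1.0).fit(features, labels)
print(clf.predict(features[:4]))
```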

Page 64:

What we mean by a ConvNet Feature

[AlexNet diagram as on page 31]

Page 65:

ConvNets → much better image representation

Task                          Best state of the art   ConvNet off-the-shelf + Linear SVM
Object Classification         71.1                    77.2
Scene Classification          64                      69
Bird Subcategorization        56.8                    61.8
Flowers Recognition           80.7                    86.8
Human Attribute Detection     69.9                    73
Object Attribute Detection    89.5                    91.4
Paris Buildings Retrieval     74.9                    79.5
Oxford Buildings Retrieval    81.7                    68
Sculptures Retrieval          45.4                    42.3
Scene Image Retrieval         81.9                    84.3
Object Instance Retrieval     89.3                    91.1

Source: CNN Features off-the-shelf: an Astounding Baseline for Recognition, A. Sharif Razavian et al., arXiv, March 2014.

Page 66:

How to optimize ConvNet representations for transfer learning

Page 67:

When do I need transfer learning?

• Currently ImageNet is one of the few sources of large labelled training data for computer vision.

• What do I do if

- I have a visual recognition task that differs from image classification, and

- I have limited labelled training data,

but I still want to use a deep ConvNet representation?

Page 68:

Factors that influence a ConvNet's representation

[Diagram —

Training of Source ConvNet from scratch: Random ConvNet → (backprop with source images & labels; factors: network architecture? source task? early stopping?) → Source ConvNet.

+

Exploiting the Source ConvNet for the Target Task: target image → Source ConvNet → target ConvNet representation (factors: layer? dim. reduction? spatial pooling? fine-tuning?) → SVM → target label.]

Page 69:

Can order visual recognition tasks relative to ImageNet

Task's distance from ImageNet increases →

Image Classification    Attribute Detect.    Fine-grained Recog.   Compositional      Instance Retrieval
PASCAL VOC Object       H3D human attrib.    Cat&Dog breeds        VOC Human Act.     Holiday scenes
MIT 67 Indoor Scenes    Object attrib.       Bird subordinate      Stanford 40 Act.   Paris buildings
SUN 397 Scene           SUN scene attrib.    102 Flowers           Visual Phrases     Sculptures

Page 70:

Best practices for a ConvNet rep. for transfer learning

Target task (distance from the source task, ImageNet, increases left to right): ImageNet · · · Fine-grained recognition · · · Instance retrieval

Factor            Recommendation
Early stopping    Don't do it
Fine-tuning       Yes; more improvement with more labelled data
Network depth     As deep as possible (1)
Network width     Wider near ImageNet → moderately wide for instance retrieval
Dim. reduction    Original dim near ImageNet → reduced dim for instance retrieval
Rep. layer        Later layers near ImageNet → earlier layers for instance retrieval

(1) In general the network should be as deep as possible, but in the final experiments a couple of the instance retrieval tasks defied this advice!

Page 71:

Gains to be made by optimizing these parameters for a task

[Bar chart (scores 40–100) per task — VOC, MIT Scenes, SUN, Scene Attr., Obj. Attr., Human Attr., Pet breed, Bird Subord., Flowers, VOC Action, Stfd. Action, Vis. Phrase, Holidays, UKB, Oxford, Paris, Sculpture — comparing: best non-ConvNet, Deep Standard, Deep Optimal]

Page 72:

From the first fully connected layer one can regress to local spatial information
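A sketch of what such a regression looks like (my addition): closed-form ridge regression from first-fully-connected-layer features to spatial targets, e.g. landmark coordinates; the arrays below are dummy stand-ins:

```python
import numpy as np

def fit_ridge(X, Y, lam=1e-3):
    """Ridge regression with a bias column:
    W = (Xb^T Xb + lam*I)^{-1} Xb^T Y, where Xb = [X, 1]."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    d = Xb.shape[1]
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(d), Xb.T @ Y)

feats = np.random.rand(100, 4096)       # first-FC-layer features, one row per image
targets = np.random.rand(100, 10)       # e.g. 5 (x, y) landmark coordinates
W = fit_ridge(feats, targets)
pred = np.append(feats[0], 1.0) @ W     # predicted landmarks for one image
```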

Page 73:

Facial Landmarks via Linear Regression

Page 74:

RGB via Linear Regression

Page 75:

Segmentation via Linear Regression

Page 76:

Segmentation via Linear Regression

Page 77:

Segmentation via Linear Regression

Page 78:

Segmentation via Linear Regression