Deep Convolutional Networks & Computer Vision J. Sullivan, H. Azizpour, A. S. Razavian, A. Maki and S. Carlsson Computer Vision Group, KTH. March 10, 2015


Page 1:

Deep Convolutional Networks & Computer Vision

J. Sullivan, H. Azizpour, A. S. Razavian, A. Maki and S. Carlsson

Computer Vision Group,

KTH.

March 10, 2015

Page 2:

What has Deep Learning done for Computer Vision?

Deep Learning has resulted in

1. much better automatic

- visual image classification and

- object detection,

2. much more powerful generic image representations.

Page 3:

What have ConvNets done for Computer Vision?

ConvNets have resulted in

1. much better automatic

- visual image classification and

- object detection,

2. much more powerful generic image representations.

Page 4:

Image Classification Task: ILSVRC (Task 2: Classification)

[Examples: for each image the system outputs five labels, e.g. scale, T-shirt, steel drum, drumstick, mud turtle — marked ✓ when the true label (steel drum) is among them, ✗ when it is not (e.g. giant panda instead).]

$$\text{Error} = \frac{1}{100{,}000} \sum_{i=1}^{100{,}000} \mathbb{1}(\text{incorrect on image } i)$$

Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013
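A small aside (my addition, not from the talk): the evaluation rule above is easy to state in numpy, with `guesses` holding each system's five label guesses per image:

```python
import numpy as np

# guesses[i] holds a system's 5 label guesses for image i;
# truth[i] is the ground-truth label. The error is the fraction of
# images whose true label is not among the guesses.
guesses = np.array([[3, 7, 2, 9, 1],
                    [4, 0, 8, 6, 5]])
truth = np.array([7, 2])

error = np.mean([t not in g for g, t in zip(guesses, truth)])
print(error)   # 0.5 -- label 2 is missing from the second row
```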

Page 5:

ConvNets → much better image classification

[Bar chart — classification error (%) by year: 2010: 28.2, 2011: 25.8, 2012: 16.4, 2013: 11.7, 2014: 6.7]

Performance of winning entry in ILSVRC competitions (2010-14).

Red indicates when deep ConvNets were introduced.

Page 6:

Pascal VOC: Object Detection

PASCAL VOC 2005–2012

Tasks: Classification (e.g. person, motorcycle), Detection, Segmentation, Action (e.g. riding bicycle).

20 object classes, 22,591 images.

Everingham, Van Gool, Williams, Winn and Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV 2010.

Page 7:

ConvNets → much better object detection

[Plot — accuracy (10–80) vs. year (2007–2015), per class (plant, person, chair, cat, car, aeroplane) and for all classes, with the deep-learning entries marked]

Progress of object detection for the Pascal VOC 2007 challenge.

Page 8:

ConvNets → much better image representation

Task                          Best state of the art   ConvNet off-the-shelf + Linear SVM
Object Classification         71.1                    77.2
Scene Classification          64                      69
Bird Subcategorization        56.8                    61.8
Flowers Recognition           80.7                    86.8
Human Attribute Detection     69.9                    73
Object Attribute Detection    89.5                    91.4
Paris Buildings Retrieval     74.9                    79.5
Oxford Buildings Retrieval    81.7                    68
Sculptures Retrieval          45.4                    42.3
Scene Image Retrieval         81.9                    84.3
Object Instance Retrieval     89.3                    91.1

Source: CNN Features off-the-shelf: an Astounding Baseline for Recognition, A. Sharif Razavian et al., arXiv, March 2014.

Page 9:

Reason for jump in performance:

Learn feature hierarchies from the data

Page 10:

Modern Visual Recognition Systems

1. Training Phase

- Gather labelled training data.

- Extract a feature representation for each training example.

- Construct a decision boundary.

2. Test Phase

- Extract feature representation from the test example.

- Compare to the learnt decision boundary.

Page 11:

Modern Visual Recognition Systems

1. Training Phase

- Gather labelled training data.

- Extract a feature representation for each training example.

- Construct a decision boundary.

2. Test Phase

- Extract feature representation from the test example.

- Compare to the learnt decision boundary.

It’s just supervised learning.

Page 12:

Is it a bike or a face?

Page 13:

Construct a decision boundary

[Figure: feature points of the two classes separated by a decision boundary]

Page 14:

The two extremes of feature extraction

[Two panels: "Ideal features" vs. "Far from ideal"]

Page 15:

The two extremes of feature extraction

[Two panels: "Ideal features" vs. "Far from ideal"]

Supervised Deep Learning allows you to learn ideal features.

Page 16:

Learning Representations/Features

Traditional Pattern Recognition: fixed/handcrafted feature extraction

input → Feature Extractor → Trainable Classifier

Modern Pattern Recognition: unsupervised mid-level features

input → Feature Extractor → Mid-level Features → Trainable Classifier

Deep Learning: train hierarchical representations

input → Low-level Features → Mid-level Features → High-level Features → Trainable Classifier

Source: Talk Computer Perception with Deep Learning by Yann LeCun

Page 17:

Key Properties of Deep Learning

Provides a mechanism to:

• Learn a highly non-linear function.

• Learn it from data.

• Build feature hierarchies

- Distributed representations

- Compositionality

• Perform end-to-end learning.

Page 18:

How? Convolutional Networks

Page 19:

Convolutional Networks

• Are deployed in many practical applications: image recognition, speech recognition, Google's and Baidu's photo taggers.

• Have won several competitions: ImageNet, Kaggle Facial Expression, Kaggle Multimodal Learning, German Traffic Signs, Connectomics, Handwriting, ...

• Are applicable to array data where nearby values are correlated: images, sound, time-frequency representations, video, volumetric images, RGB-Depth images, ...

Source: Talk Computer Perception with Deep Learning by Yann LeCun

Page 20:

Convolutional Network

The Convolutional Net Model (multistage Hubel–Wiesel system):

[Diagram: multiple convolutions ("simple cells") alternating with pooling/subsampling ("complex cells"), producing retinotopic feature maps.]

• Training is supervised, with stochastic gradient descent.

• LeCun et al. '89, '98.

Source: Talk Computer Perception with Deep Learning by Yann LeCun

Page 21:

ConvNets: History

• Fukushima 1980: designed a network with the same basic structure but did not train it by backpropagation.

• LeCun from late 80s: figured out backpropagation for ConvNets, popularized and deployed ConvNets for OCR applications etc.

• Poggio from 1999: same basic structure, but learning is restricted to the top layer (k-means at the second stage).

• LeCun from 2006: unsupervised feature learning.

• DiCarlo from 2008: large scale experiments, normalization layer.

• LeCun from 2009: harsher non-linearities, normalization layer, learning unsupervised and supervised.

• Mallat from 2011: provides a theory behind the architecture.

• Hinton 2012: use bigger nets, GPUs, more data, purely supervised.

Page 22:

[Timeline image: Convolutional Neural Net 1988 → 1998 → 2012]

Q: Did we make any progress since then?

A: The main reasons for the breakthrough are data and GPUs, but we have also made networks deeper and more non-linear.

Reasons for breakthrough now:

• Data and GPUs,

• Networks have been made deeper.

Page 23:

Modern Convolutional Network

[AlexNet 2012 diagram: input image 224×224×3 → convolutional layers 55×55×48 → 27×27×128 → 13×13×192 → 13×13×192 → 13×13×128 → fully connected layers: dense 4096 → dense 4096 → dense 1000 → output]
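As an aside (my addition, not from the slides): the spatial sizes in a diagram like this follow from the standard output-size formula for a strided convolution or pooling stage. The 11×11 kernel, stride 4, padding 2 numbers below are AlexNet's first-layer settings from Krizhevsky et al.'s paper:

```python
def out_size(n, kernel, stride, pad):
    """Spatial output size of a convolution/pooling stage
    applied to an n x n input."""
    return (n + 2 * pad - kernel) // stride + 1

# AlexNet's first convolutional layer: 224x224 input, 11x11 kernels,
# stride 4, padding 2  ->  55x55 response maps.
print(out_size(224, kernel=11, stride=4, pad=2))   # 55
```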

Page 24:

Convolutional Networks for RGB Images: The Basic Operations

Page 25:

Convolution Operation

• Input:

- a set of 2d feature maps $x_{1:m} = \{x_1, \ldots, x_m\}$

- each $x_i$ has size $W \times W$

• Convolutional Parameters:

- a set of 2d convolutional kernels $k_{1:m} = \{k_1, \ldots, k_m\}$

- each $k_i$ has size $(2w+1) \times (2w+1)$, and

- a bias term $b$

[Diagram: input image 224×224×3 → convolution response maps 224×224×48]

Page 26:

Convolution Operation

Convolutional Operator:

• Define $\mathrm{conv}(\cdot,\cdot,\cdot)$, the convolution of $x_{1:m}$ with $k_{1:m}$, as:

$$\mathrm{conv}(x_{1:m}, k_{1:m}, b) = \sum_{i=1}^{m} (x_i * k_i) + b$$

where the 2d convolution $x_i * k_i$ returns a 2d map whose $(x,y)$th entry is

$$(x_i * k_i)_{x,y} = \sum_{x'=-w}^{w} \sum_{y'=-w}^{w} k_{i,\,x'+w+1,\,y'+w+1}\; x_{i,\,x+x',\,y+y'}$$

[Diagram: input image 224×224×3 → convolution response maps 224×224×48]
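To make the definition concrete, here is a minimal numpy sketch (my addition): plain cross-correlation with zero padding, summed over the input maps. Real implementations add strides and use optimized routines:

```python
import numpy as np

def conv2d_same(x, k):
    """Per-entry formula from the slide: sum over the kernel window of
    k[x'+w, y'+w] * x[x+x', y+y'], zero-padded so a W x W map stays W x W."""
    w = k.shape[0] // 2
    xp = np.pad(x, w)                           # zero-pad the borders
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(k * xp[i:i + k.shape[0], j:j + k.shape[1]])
    return out

def conv(x_maps, kernels, b):
    """conv(x_{1:m}, k_{1:m}, b) = sum_i (x_i * k_i) + b."""
    return sum(conv2d_same(x, k) for x, k in zip(x_maps, kernels)) + b
```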

Page 27:

Remember 2D convolution

[Diagram: image f in the spatial domain, origin at the top-left; a 3×3 filter with coefficients w(−1,−1) … w(1,1) slides over the pixels f(x−1,y−1) … f(x+1,y+1) of the image section under the filter.]

$$g(x, y) = \sum_{s=-a}^{a} \sum_{t=-b}^{b} w(s, t)\, f(x+s,\, y+t)$$
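As a sanity check (my addition, assuming SciPy is available), g(x, y) above is exactly what scipy.signal.correlate2d computes:

```python
import numpy as np
from scipy.signal import correlate2d

f = np.random.rand(8, 8)      # image
w = np.random.rand(3, 3)      # 3x3 filter coefficients (a = b = 1)

# g(x, y) = sum_{s,t} w(s, t) f(x+s, y+t); mode='same' keeps the
# output the size of f, zero-padding the borders.
g = correlate2d(f, w, mode='same')
```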

Page 28:

Next: non-linear activation, then max-pool

Create a new 2d feature map by applying two more operators:

$$\tilde{x} = \mathrm{pool}(\sigma(\mathrm{conv}(x_{1:m}, k_{1:m}, b)))$$

where

- $\sigma(\cdot)$ is a non-linear function, typically $\sigma(x) = \max(0, x)$,

- $\mathrm{pool}(\cdot)$ represents a local max-pooling operator.

[Diagram: input image 224×224×3 → activation response maps 224×224×48 → max-pooled response maps 55×55×48]
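A matching numpy sketch of the two operators (my addition); note the slides' 224 → 55 reduction comes from strided operations in the actual network, while this sketch just uses a generic window size and stride:

```python
import numpy as np

def relu(x):
    """sigma(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def max_pool(x, size=2, stride=2):
    """Local max-pooling: the max over size x size windows, moved by
    `stride` (a minimal sketch; real nets may use overlapping windows)."""
    rows = (x.shape[0] - size) // stride + 1
    cols = (x.shape[1] - size) // stride + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

# One full stage, as on the slide (conv as sketched earlier):
# x_tilde = max_pool(relu(conv(x_maps, kernels, b)))
```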

Page 29:

From one convolutional layer to the next

• At convolutional layer $l$ we have a set of 2d feature maps

$$x^{(l)}_{1:m_l} = \{x^{(l)}_1, \ldots, x^{(l)}_{m_l}\}$$

• We have multiple sets of convolutional kernels $k^{(l+1)}_{j,1:m_l}$, for $j = 1, \ldots, m_{l+1}$.

• For each kernel set $k^{(l+1)}_{j,1:m_l}$, create a new 2d feature map:

$$x^{(l+1)}_j = \mathrm{pool}\!\left(\sigma\!\left(\mathrm{conv}\!\left(x^{(l)}_{1:m_l},\, k^{(l+1)}_{j,1:m_l},\, b^{(l+1)}_j\right)\right)\right)$$
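Composing the earlier sketches gives the whole layer-to-layer step (my addition); `kernel_sets[j]` plays the role of $k^{(l+1)}_{j,1:m_l}$:

```python
def conv_layer(x_maps, kernel_sets, biases):
    """For each kernel set j, produce the new feature map
    x^{(l+1)}_j = pool(sigma(conv(x^{(l)}_{1:m_l}, k^{(l+1)}_{j,1:m_l}, b_j))),
    reusing the conv/relu/max_pool sketches above."""
    return [max_pool(relu(conv(x_maps, ks, b)))
            for ks, b in zip(kernel_sets, biases)]
```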

Page 30:

Convolve → Activation → Max-pool

[Diagram: layer 1 output 55×55×48 → activation response maps 55×55×128 → max-pooled response maps 27×27×128]

For $j = 1, \ldots, m_{l+1}$:

- Convolve the current response maps with $k^{(l+1)}_{j,1:m_l}$:

$$\hat{x}^{(l+1)}_j = \mathrm{conv}\!\left(x^{(l)}_{1:m_l},\, k^{(l+1)}_{j,1:m_l},\, b^{(l+1)}_j\right)$$

- Non-linear activation:

$$z^{(l+1)}_j = \sigma\!\left(\hat{x}^{(l+1)}_j\right)$$

- Max-pool:

$$x^{(l+1)}_j = \mathrm{pool}\!\left(z^{(l+1)}_j\right)$$

Page 31:

AlexNet 2012

[AlexNet diagram: input image 224×224×3 → convolutional layers 55×55×48 → 27×27×128 → 13×13×192 → 13×13×192 → 13×13×128 → fully connected layers: dense 4096 → dense 4096 → dense 1000 → output]

Page 32:

1st fully connected layer

For $j = 1, \ldots, m_{l_c+1}$:

$$x^{(l_c+1)}_j = \max\!\left(\sum_{i=1}^{m_{l_c}} w^{(l_c+1)}_{j,i} \cdot x^{(l_c)}_i + b^{(l_c+1)}_j,\; 0\right)$$

[AlexNet diagram as on page 31]
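Flattening the maps turns this layer into one matrix product followed by a ReLU; a minimal numpy sketch (my addition), where `W_list[j][i]` stands for $w^{(l_c+1)}_{j,i}$:

```python
import numpy as np

def first_fc_layer(x_maps, W_list, b):
    """x_j = max( sum_i <w_ji, x_i> + b_j, 0 ), with each w_ji the same
    size as the feature map x_i; flattened, it is one matrix product."""
    x = np.concatenate([m.ravel() for m in x_maps])            # all maps as one vector
    W = np.stack([np.concatenate([w.ravel() for w in W_j])     # one row per unit j
                  for W_j in W_list])
    return np.maximum(0.0, W @ x + b)
```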

Page 33:

2nd fully connected layer

For $j = 1, \ldots, m_{l_c+2}$:

$$x^{(l_c+2)}_j = \max\!\left(w^{(l_c+2)}_j \cdot x^{(l_c+1)} + b^{(l_c+2)}_j,\; 0\right)$$

[AlexNet diagram as on page 31]

Page 34:

Output layer: soft-max operator

• For $j = 1, \ldots, C$:

$$o'_j = w^{(l_c+3)}_j \cdot x^{(l_c+2)} + b^{(l_c+3)}_j, \qquad o_r = \frac{\exp(o'_r)}{\sum_{j=1}^{C} \exp(o'_j)}$$

[AlexNet diagram as on page 31]
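A small numpy sketch of the soft-max (my addition); subtracting the max first is the usual overflow guard and leaves the result unchanged:

```python
import numpy as np

def softmax(o_prime):
    """o_r = exp(o'_r) / sum_j exp(o'_j)."""
    z = np.exp(o_prime - np.max(o_prime))
    return z / np.sum(z)

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))   # probabilities over the C classes, summing to 1
```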

Page 35:

Parameters of the model

• Filter parameters:

- Convolutional layers: for $l = 1, \ldots, l_c$: kernels $k^{(l)}_{j,1:m_{l-1}}$, $1 \le j \le m_l$, where each $k^{(l)}_{j,i}$ has size $w_l \times w_l$.

- Fully connected layers:

* First fully connected layer: $w^{(l_c+1)}_{j,i}$, $1 \le j \le m_{l_c+1}$, $1 \le i \le m_{l_c}$, where each $w^{(l_c+1)}_{j,i}$ and $x^{(l_c)}_i$ have equal size.

* Subsequent fully connected layers: for $l = l_c+2, \ldots, l_c+L$: $w^{(l)}_j$, $1 \le j \le m_l$, where each $w^{(l)}_j$ has size $m_{l-1}$.

Page 36:

Training Convolutional Networks

• Set-up — supervised learning:

- For an RGB image $x$, set the ConvNet's first set of 2d feature maps to

$$x^{(0)}_{1:3} = \{x_{\text{red channel}},\, x_{\text{green channel}},\, x_{\text{blue channel}}\}$$

- We have a set $D$ of labelled training images, i.e. many pairs $(x, y)$.

- To learn the network's parameters, we must link the value of

$$\mathcal{W} = \{\mathcal{W}_{\text{convolutional}},\, \mathcal{W}_{\text{fully connected}}\}$$

to the network's prediction performance on $D$.

Page 37:

Training ConvNets: Measuring Performance

- Remember that a ConvNet represents a function

$$f_{\text{ConvNet}} : [0,1]^{W \times W \times 3} \times \mathbb{R}^p \to [0,1]^M$$

so for an input $x$ the function predicts its label: $f_{\text{ConvNet}}(x; \mathcal{W}) = \hat{y}$.

- Use a loss function to measure the error in $f_{\text{ConvNet}}(x; \mathcal{W})$'s predicted label for an input $x$ in $D$.

- The loss function typically has the property that $L(y, \hat{y})$ increases as $\|y - \hat{y}\|$ increases.

- The cross-entropy loss is frequently used:

$$L(y, f_{\text{ConvNet}}(x; \mathcal{W})) = -\sum_{j=1}^{M} y_j \log(\hat{y}_j)$$
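The loss is a one-liner once the soft-max output is available; a sketch (my addition), with a small epsilon guarding log(0):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """L(y, y_hat) = -sum_j y_j log(y_hat_j), with y one-hot and
    y_hat the soft-max output."""
    return -np.sum(y * np.log(y_hat + eps))

y = np.array([0.0, 1.0, 0.0])         # true label: class 2, one-hot
y_hat = np.array([0.1, 0.7, 0.2])     # network's predicted probabilities
print(cross_entropy(y, y_hat))        # -log(0.7) ~ 0.357
```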

Page 38:

Training ConvNets: The Optimization Problem

• Define the performance of a network with parameters $\mathcal{W}$ on $D$ as

$$E(D, \mathcal{W}) = \frac{1}{|D|} \sum_{(x,y) \in D} L(y, f_{\text{ConvNet}}(x; \mathcal{W}))$$

• The learning problem is to find the $\mathcal{W}$ that minimizes this:

$$\min_{\mathcal{W}} E(D, \mathcal{W})$$

• How do we do the optimization?

Page 39:

Training ConvNets: How to Optimize

Our optimization problem:

$$\min_{\mathcal{W}} E(D, \mathcal{W}) = \min_{\mathcal{W}} \frac{1}{|D|} \sum_{(x,y) \in D} L(y, f_{\text{ConvNet}}(x; \mathcal{W}))$$

• Initialize the network's parameters randomly to get $\mathcal{W}^{(0)}$.

• Update the parameters using mini-batch Stochastic Gradient Descent (SGD):

- At iteration $t$, randomly choose a small subset $D^{(t)}$ of $D$.

- Perform the update with learning rate $\alpha^{(t)}$:

$$\mathcal{W}^{(t+1)} = \mathcal{W}^{(t)} - \alpha^{(t)} \left.\nabla_{\mathcal{W}} E(D^{(t)}, \mathcal{W})\right|_{\mathcal{W}^{(t)}}$$

• This procedure finds a local minimum of $E(D, \mathcal{W})$. Is this good enough?
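In code the update rule is a short loop (my addition); `grad_E` is a hypothetical placeholder for the gradient that backpropagation supplies, and W is kept as one flat parameter vector for simplicity:

```python
import numpy as np

def sgd(W, data, grad_E, alpha=0.01, batch_size=128, iters=1000):
    """W^(t+1) = W^(t) - alpha * grad E(D^(t), W)|_{W^(t)},
    with D^(t) a random mini-batch (constant learning rate here)."""
    rng = np.random.default_rng(0)
    for t in range(iters):
        idx = rng.choice(len(data), size=batch_size)   # pick D^(t)
        batch = [data[i] for i in idx]
        W = W - alpha * grad_E(batch, W)               # grad_E: not implemented here
    return W
```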


Page 41:

Intuition about the hardness of training a deep ConvNet with backpropagation

Next slides: Deep Learning for Vision: Tricks of the Trade, M. Ranzato, BAVM, Oct '13

Page 42:

ConvNets: till 2012

[Sketch: loss as a function of a parameter, with many local minima]

Common wisdom: training does not work because we “get stuck in local minima”

Page 43:

ConvNets: today

[Sketch: loss as a function of a parameter]

Local minima are all similar, there are long plateaus, and it can take a long time to break symmetries.

[Annotations: the input/output is invariant to permutations of hidden units (weights w); breaking ties between parameters; saturating units ($w^T x$)]

Page 44:

Like walking on a ridge between valleys

Page 45:

ConvNets: today

[Sketch: loss as a function of a parameter]

Local minima are all similar, there are long plateaus, and it can take a long time to break symmetries.

Optimization is not the real problem when:
– the dataset is large,
– units do not saturate too much,
– there is a normalization layer.

Page 46:

ConvNets: today

[Sketch: loss as a function of a parameter]

Today's belief is that the challenge is about:
– generalization: how many training samples does it take to fit 1B parameters? How many parameters/samples to model spaces with 1M dimensions?
– scalability

Page 47:

Regularization during training is very important

Avoid overfitting by:

• Training with large labelled datasets.

• Augment training sets with random jitterings of input.

• Train for a long time with small learning rates.

• Dropout (idea from Geoff Hinton): only classify with, and update, a random subset of the network at each training iteration.

• Be vigilant! When training, constantly monitor performance on a validation set.
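A minimal sketch of the dropout idea (my addition; the 1/(1-p) "inverted dropout" rescaling is one common variant, not necessarily the original formulation):

```python
import numpy as np

def dropout(x, p=0.5, rng=np.random.default_rng()):
    """Training time: zero each activation with probability p and rescale
    the survivors by 1/(1-p) so the expected activation is unchanged.
    At test time the layer is simply the identity."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)
```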

Page 48:

Source for many labelled images:

ImageNet

Page 49:

ImageNet: Large scale visual recognition challenge (2010–13)

• Classification (+ Localization) Challenge

- 1000 object classes

- 1,431,167 images

• Detection Challenge

- 200 object classes

- 456,191 images

Source: http://image-net.org/challenges/LSVRC/2013

Page 50:

[Example ILSVRC images: backpack, flute, strawberry, traffic light, bathing cap, matchstick, racket, sea lion]

Page 51:

Variety of object classes in ILSVRC

[Figure from Russakovsky, Deng, Huang, Berg, Fei-Fei; see http://image-net.org/challenges/LSVRC/2012/analysis]

Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013

Page 52:

Variety of object classes in ILSVRC (DET and CLS-LOC)

Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013

Page 53:

A revolution in computer vision

Page 54:

ConvNet of Krizhevsky, Sutskever, Hinton 2012

ImageNet Classification with Deep Convolutional Neural Networks (NIPS '12)

[AlexNet diagram as on page 31]

Page 55:

Image Classification: Dramatic ILSVRC Results since 2012

• ImageNet Large Scale Visual Recognition Challenge

• 1000 categories, 1.3 million (×10 with data augmentation) labeled training samples

[Bar chart — classification error (%) by year: 2010: 28.2, 2011: 25.8, 2012: 16.4, 2013: 11.7, 2014: 6.7]

Page 56:

ConvNet of Krizhevsky, Sutskever, Hinton 2012

• Method: large convolutional net

- 60M parameters

- Trained with backprop on GPU

- Trained “with all the tricks Yann came up with in the last 20 years, plus dropout” (Hinton, NIPS 2012)

- Rectification, contrast normalization,...

- Softmax output function

• Error Rate on ImageNet: 15% (correct class not in top 5)

• Previous state of the art: 25% error

• Deployed in Google+ Photo Tagging in May 2013

Page 57:

First layer filters learnt

[Image: the first-layer filters learnt by the network of Krizhevsky, Sutskever & Hinton 2012 — slide by Y. LeCun.]

Method: large convolutional net — 650K neurons, 832M synapses, 60M parameters. Trained with backprop on GPU, “with all the tricks Yann came up with in the last 20 years, plus dropout” (Hinton, NIPS 2012); rectification, contrast normalization, ...

Error rate: 15% (whenever the correct class isn't in the top 5). Previous state of the art: 25% error.

A REVOLUTION IN COMPUTER VISION: acquired by Google in Jan 2013; deployed in Google+ Photo Tagging in May 2013.

Page 58:

AlexNet: Object Recognition Results

[Example recognition results from Krizhevsky, Sutskever, Hinton 2012; slide by Y. LeCun]

Page 59:

AlexNet: Object Recognition Results

[Example recognition results from Krizhevsky, Sutskever, Hinton 2012; slide by Y. LeCun]

Page 60:

AlexNet: Object Recognition Results

[Example recognition results from Krizhevsky, Sutskever, Hinton 2012; slide by Y. LeCun]

Page 61:

Leader Board from ImageNet LSVRC-2014

Name                     Institution                 ConvNet   Error (%)
GoogLeNet                Google                      yes       6.656
VGG                      Oxford University           yes       7.337
MSRA Visual Computing    Microsoft Research Asia     yes       8.060
Andrew Howard            consultant                  yes       8.111
DeeperVision             company                     yes       9.058

For more details check out

http://www.image-net.org/challenges/LSVRC/2014/results.php

Page 62:

ConvNet features are generic

Page 63:

CNN Features: an Astounding Baseline for Recognition

• CNN Features off-the-shelf: an Astounding Baseline for Recognition, A. Sharif Razavian, H. Azizpour, J. Sullivan and S. Carlsson, CVPR Workshop on Deep Learning 2014.

• The paper's experimental evaluation shows: replace the handcrafted feature pipeline used for many tasks

[Pipeline diagram: Image + Part Annotations → Learn Normalized Pose → Extract Features (RGB, gradient, LBP) → Strong DPM → SVM]

with ConvNet features from one large ConvNet trained on ImageNet, and IMPROVE RESULTS.
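A hedged sketch of the paper's recipe (my addition): treat a pre-trained ConvNet as a fixed feature extractor and put a linear SVM on top. `extract_cnn_features` is a stand-in for a forward pass through an ImageNet-trained network, and the data below are dummies:

```python
import numpy as np
from sklearn.svm import LinearSVC

def extract_cnn_features(image):
    """Placeholder: in the paper this is a forward pass through an
    ImageNet-trained ConvNet, reading off a late layer (e.g. 4096-d)."""
    return np.random.rand(4096)

# Dummy stand-ins for a small labelled target-task dataset.
images = [np.random.rand(224, 224, 3) for _ in range(20)]
labels = np.array([0, 1] * 10)

features = np.stack([extract_cnn_features(im) for im in images])
clf = LinearSVC(C=1.0).fit(features, labels)
print(clf.predict(features[:4]))
```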

Page 64:

What we mean by a ConvNet Feature

[AlexNet diagram as on page 31]

Page 65:

ConvNets → much better image representation

Task                          Best state of the art   ConvNet off-the-shelf + Linear SVM
Object Classification         71.1                    77.2
Scene Classification          64                      69
Bird Subcategorization        56.8                    61.8
Flowers Recognition           80.7                    86.8
Human Attribute Detection     69.9                    73
Object Attribute Detection    89.5                    91.4
Paris Buildings Retrieval     74.9                    79.5
Oxford Buildings Retrieval    81.7                    68
Sculptures Retrieval          45.4                    42.3
Scene Image Retrieval         81.9                    84.3
Object Instance Retrieval     89.3                    91.1

Source: CNN Features off-the-shelf: an Astounding Baseline for Recognition, A. Sharif Razavian et al., arXiv, March 2014.

Page 66:

How to optimize ConvNet representations for transfer learning

Page 67:

When do I need transfer learning?

• Currently ImageNet is one of the few sources of large labelled training data for computer vision.

• What do I do if

- I have a visual recognition task that differs from image classification, and

- I have limited labelled training data,

but I still want to use a deep ConvNet representation?

Page 68:

Factors that influence a ConvNet's representation

[Diagram —

Training of Source ConvNet from scratch: Random ConvNet → (backprop with source images & labels; factors: network architecture? source task? early stopping?) → Source ConvNet.

+

Exploiting the Source ConvNet for the Target Task: target image → Source ConvNet → target ConvNet representation (factors: layer? dim. reduction? spatial pooling? fine-tuning?) → SVM → target label.]

Page 69:

Can order visual recognition tasks relative to ImageNet

Task's distance from ImageNet increases →

Image Classification    Attribute Detect.    Fine-grained Recog.   Compositional      Instance Retrieval
PASCAL VOC Object       H3D human attrib.    Cat&Dog breeds        VOC Human Act.     Holiday scenes
MIT 67 Indoor Scenes    Object attrib.       Bird subordinate      Stanford 40 Act.   Paris buildings
SUN 397 Scene           SUN scene attrib.    102 Flowers           Visual Phrases     Sculptures

Page 70:

Best practices for a ConvNet rep. for transfer learning

Target task (distance from the source task, ImageNet, increases left to right): ImageNet · · · Fine-grained recognition · · · Instance retrieval

Factor            Recommendation
Early stopping    Don't do it
Fine-tuning       Yes; more improvement with more labelled data
Network depth     As deep as possible (1)
Network width     Wider near ImageNet → moderately wide for instance retrieval
Dim. reduction    Original dim near ImageNet → reduced dim for instance retrieval
Rep. layer        Later layers near ImageNet → earlier layers for instance retrieval

(1) In general the network should be as deep as possible, but in the final experiments a couple of the instance retrieval tasks defied this advice!

Page 71:

Gains to be made by optimizing these parameters for a task

[Bar chart (scores 40–100) per task — VOC, MIT Scenes, SUN, Scene Attr., Obj. Attr., Human Attr., Pet breed, Bird Subord., Flowers, VOC Action, Stfd. Action, Vis. Phrase, Holidays, UKB, Oxford, Paris, Sculpture — comparing: best non-ConvNet, Deep Standard, Deep Optimal]

Page 72:

From the first fully connected layer one can regress to local spatial information
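A sketch of what such a regression looks like (my addition): closed-form ridge regression from first-fully-connected-layer features to spatial targets, e.g. landmark coordinates; the arrays below are dummy stand-ins:

```python
import numpy as np

def fit_ridge(X, Y, lam=1e-3):
    """Ridge regression with a bias column:
    W = (Xb^T Xb + lam*I)^{-1} Xb^T Y, where Xb = [X, 1]."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    d = Xb.shape[1]
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(d), Xb.T @ Y)

feats = np.random.rand(100, 4096)       # first-FC-layer features, one row per image
targets = np.random.rand(100, 10)       # e.g. 5 (x, y) landmark coordinates
W = fit_ridge(feats, targets)
pred = np.append(feats[0], 1.0) @ W     # predicted landmarks for one image
```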

Page 73:

Facial Landmarks via Linear Regression

Page 74:

RGB via Linear Regression

Page 75:

Segmentation via Linear Regression

Page 76:

Segmentation via Linear Regression

Page 77:

Segmentation via Linear Regression

Page 78:

Segmentation via Linear Regression