Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

@DocXavi | Deep Learning for Computer Vision: Image Analytics, 5 May 2016, Xavier Giró-i-Nieto, Master en Creació Multimedia


TRANSCRIPT

Page 1: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

DocXavi

Deep Learning for Computer Vision: Image Analytics, 5 May 2016

Xavier Giró-i-Nieto

Master en Creació Multimedia

2

Densely linked slides

3

Introduction: Xavier Giro-i-Nieto

• Associate Professor at Universitat Politecnica de Catalunya (UPC). Web: https://imatge.upc.edu/web/people/xavier-giro

4

Acknowledgments

5

Acknowledgments

One lecture organized in three parts

6

Deep ConvNets for Recognition for: Images (global), Objects (local), Video (2D+T)

One lecture organized in three parts

7

Deep ConvNets for Recognition for: Images (global), Objects (local), Video (2D+T)

Previously, before deep learning

8 Slide credit: Jose M. Àlvarez

Dog

9 Slide credit: Jose M. Àlvarez

Dog

Learned Representation

Previously, before deep learning

Outline for Part I Image Analytics

10

Dog

Learned Representation

Part I End-to-end learning (E2E)

11

Learned Representation

Part I End-to-end learning (E2E)

Task A (e.g. image classification)

Outline for Part I Image Analytics

12

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-the-shelf features

Outline for Part I Image Analytics

13

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-the-shelf features

Outline for Part I Image Analytics

E2E Classification Supervised learning

14

Training: Manual Annotations → Model

Test: New Image → Automatic classification

E2E Classification LeNet-5

15

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

“ConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser. Open a tab and you’re training. No software requirements, no compilers, no installations, no GPUs, no sweat.”

E2E Classification Databases

18

Li Fei-Fei, “How we’re teaching computers to understand pictures,” TED Talks 2014

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]

19

E2E Classification Databases

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories). Images: 2.5M train, 20.5k validation, 41k test

21

E2E Classification ImageNet ILSVRC

1000 object classes (categories). Images: 1.2M train, 100k test

E2E Classification ImageNet ILSVRC

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)
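Since each ILSVRC entry predicts 5 classes per image, the standard metric is the top-5 error rate; a minimal NumPy sketch (variable names and the random data are illustrative, not from the challenge toolkit):

import numpy as np

def top5_error(scores, labels):
    # scores: (N, 1000) class scores, labels: (N,) ground-truth class indices
    top5 = np.argsort(-scores, axis=1)[:, :5]        # the 5 highest-scoring classes per image
    hits = (top5 == labels[:, None]).any(axis=1)     # a hit if the ground truth is among them
    return 1.0 - hits.mean()

scores = np.random.randn(10, 1000)                   # toy scores for 10 images
labels = np.random.randint(0, 1000, size=10)
print(top5_error(scores, labels))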

Image Classification 2012


Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2014). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web] 23

E2E Classification ILSVRC

E2E Classification AlexNet (Supervision)

24 Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)

Orange

A. Krizhevsky, I. Sutskever, G.E. Hinton, “ImageNet classification with deep convolutional neural networks.” Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25 Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26 Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

Rectified Linear Unit (non-linearity)

f(x) = max(0, x)

Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)
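A quick illustration of the ReLU non-linearity defined above, as a NumPy sketch (purely illustrative):

import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise to an activation map
    return np.maximum(0.0, x)

feature_map = np.array([[-1.5, 0.2],
                        [3.0, -0.7]])
print(relu(feature_map))   # negative activations are clamped to zero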

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSVRC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014 (pp. 818-833). Springer International Publishing.

“A convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite.”

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014 (pp. 818-833). Springer International Publishing.

DeconvNet / ConvNet

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

“To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer.”

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

“(i) Unpool: In the convnet, the max pooling operation is non-invertible; however, we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables.”
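A minimal NumPy sketch of the “switch” idea for a single 2x2 pooling region (an illustration of the quoted mechanism, not the authors' code): pooling records where the maximum came from, and unpooling places the pooled value back at that location with zeros elsewhere.

import numpy as np

region = np.array([[1.0, 3.0],
                   [0.5, 2.0]])                              # one 2x2 pooling region

switch = np.unravel_index(np.argmax(region), region.shape)   # location of the maximum (the "switch")
pooled = region[switch]                                      # max-pooled value

unpooled = np.zeros_like(region)                             # approximate inverse: zeros everywhere...
unpooled[switch] = pooled                                    # ...except at the recorded location
print(pooled, unpooled)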

41

E2E Classification Visualize ZF


“(ii) Rectification: The convnet uses ReLU non-linearities, which rectify the feature maps, thus ensuring the feature maps are always positive.”

42

E2E Classification Visualize ZF

“(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.”
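A rough NumPy/SciPy sketch of the quoted filtering step (illustrative only): the convnet layer is a cross-correlation with a learned filter, and the deconvnet applies the same filter flipped vertically and horizontally.

import numpy as np
from scipy.signal import correlate2d

feature_map = np.random.randn(8, 8)
filt = np.random.randn(3, 3)                                  # a learned filter (random stand-in)

forward = correlate2d(feature_map, filt, mode="same")         # convnet layer: cross-correlation
flipped = filt[::-1, ::-1]                                    # filter flipped vertically and horizontally
backward = correlate2d(forward, flipped, mode="same")         # deconvnet step with the "transposed" filter
print(backward.shape)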


43

E2E Classification Visualize ZF


44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. Corresponding image patches.

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014 (pp. 818-833). Springer International Publishing. 49

E2E Classification Visualize ZF

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014 (pp. 818-833). Springer International Publishing. 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) result in more distinctive features and fewer “dead” features.

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50%) of the intermediate neurons

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580
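A minimal sketch of dropout at training time with a 50% drop rate (illustrative; the "inverted" variant below rescales activations during training, whereas the original paper instead scales the weights at test time):

import numpy as np

def dropout(activations, p_drop=0.5, training=True):
    if not training:
        return activations                                    # the full network is used at test time
    mask = np.random.rand(*activations.shape) >= p_drop       # zero out ~50% of the outputs
    return activations * mask / (1.0 - p_drop)                # rescale so the expected value is unchanged

hidden = np.random.randn(4, 6)
print(dropout(hidden))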


E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]


Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSVRC

ImageNet Classification 2013

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]


Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSVRC

E2E Classification

58

E2E Classification GoogLeNet

59 Movie: Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper: overfitting due to the increased number of parameters, and inefficient computation if most weights end up close to zero

Solution? Sparsity. How? Inception modules.

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions perform dimensionality reduction (c3 < c2) and add rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
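A sketch of that ordering in PyTorch (channel sizes are made up, not the actual GoogLeNet configuration): a cheap 1x1 convolution first shrinks the channel dimension, and the expensive 3x3 convolution then runs on the reduced tensor.

import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)                            # input tensor with 256 channels (made-up size)

reduce_1x1 = nn.Conv2d(256, 64, kernel_size=1)             # cheap 1x1 reduction: 256 -> 64 channels
conv_3x3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)    # expensive convolution on the reduced tensor

y = conv_3x3(torch.relu(reduce_1x1(x)))
print(y.shape)                                             # torch.Size([1, 128, 28, 28])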

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces some spatial invariance, and adding it as an alternative parallel path has proven beneficial

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

...and no fully connected layers are needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73 NVIDIA, “NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge” (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No pooling between some convolutional layers. Convolution stride of 1 (no skipping).
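The benefit of the 3x3 stacks can be checked with a quick parameter count (a rough sketch assuming C input and C output channels, biases ignored): two stacked 3x3 convolutions cover a 5x5 receptive field with fewer parameters than a single 5x5 convolution, plus an extra non-linearity.

C = 64                                    # assume C input channels and C output channels
params_5x5 = 5 * 5 * C * C                # one 5x5 convolution
params_two_3x3 = 2 * (3 * 3 * C * C)      # two stacked 3x3 convolutions (same 5x5 receptive field)
print(params_5x5, params_two_3x3)         # 102400 vs 73728: ~28% fewer parameters, one extra ReLU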

E2E Classification

79

3.6% top-5 error… with 152 layers

E2E Classification ResNet

80 He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” arXiv preprint arXiv:1512.03385 (2015). [slides]

E2E Classification ResNet

81

Deeper networks (34 layers vs. 18) are more difficult to train

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” arXiv preprint arXiv:1512.03385 (2015). [slides]

Thin curves: training error. Bold curves: validation error.

E2E Classification ResNet

82

Residual learning: reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” arXiv preprint arXiv:1512.03385 (2015). [slides]
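A minimal residual block sketched in PyTorch (simplified: the published blocks also use batch normalization and projection shortcuts when dimensions change):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        residual = self.conv2(torch.relu(self.conv1(x)))   # F(x): the residual function
        return torch.relu(x + residual)                    # output = F(x) + x (identity shortcut)

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)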

E2E Classification ResNet

83 He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” arXiv preprint arXiv:1512.03385 (2015). [slides]

84

E2E Classification Humans

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]

85

E2E Classification Humans

“Is this a Border terrier?”

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]

Crowdsource loss (0.3). More than 5 object classes.

E2E Classification Humans

87 Andrej Karpathy, “What I learned from competing against a computer on ImageNet” (2014)

Test data collection from one human [interface]

E2E Classification Humans

88 Andrej Karpathy, “What I learned from competing against a computer on ImageNet” (2014)

Test data collection from one human [interface]

“Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?”

E2E Classification Humans

E2E Classification Humans

89 NVIDIA, “Mocha.jl: Deep Learning for Julia” (2015)

ResNet

90

Let’s play a game

91

92

What have you seen?

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

99

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

100

Dataset: TRAIN / VALIDATION / TEST
SALICON: 10,000 / 5,000 / 5,000 (large scale)
iSUN: 6,000 / 926 / 2,000 (large scale)
CAT2000 [Borji'15]: 2,000 / - / 2,000
MIT300 [Judd'12]: 300 / - / -

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 → 2304 = 48x48

IMAGE INPUT (RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 → 2304 = 48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 → 2304 = 48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 → 2304 = 48x48
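A rough PyTorch sketch of this shallow layout (96x96 RGB input, 3 convolutional layers, 2 dense layers, and a 2304-dimensional output reshaped to a 48x48 map that is then upsampled and filtered); the kernel and layer widths below are placeholders rather than the published configuration, and the original was not implemented in PyTorch.

import torch
import torch.nn as nn

shallow_net = nn.Sequential(
    nn.Conv2d(3, 32, 5), nn.ReLU(), nn.MaxPool2d(2),     # 3 conv layers (widths are placeholders)
    nn.Conv2d(32, 64, 3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 64, 3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 10 * 10, 4608), nn.ReLU(),            # 2 dense layers
    nn.Linear(4608, 2304),                               # 2304 = 48x48 saliency values
)

x = torch.randn(1, 3, 96, 96)                            # 96x96 RGB input
saliency = shallow_net(x).view(1, 48, 48)                # reshape to a 2D map (then upsample + filter)
print(saliency.shape)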

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function: Mean Square Error (MSE)

Weight initialization: Gaussian distribution

Learning rate: 0.03 to 0.0001

Mini batch size: 128

Training time: 7h (SALICON), 4h (iSUN)

Acceleration: SGD + Nesterov momentum (0.9)

Regularisation: Maxout norm

GPU: NVidia GTX 980
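A hedged PyTorch sketch of that training setup (the original used a different framework, and the learning-rate schedule and max-norm/maxout details are not reproduced here):

import torch
import torch.nn as nn

model = nn.Linear(2304, 2304)                     # stand-in for the saliency network sketched above
criterion = nn.MSELoss()                          # loss function: mean square error between maps
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.03,              # decayed from 0.03 towards 0.0001 during training
                            momentum=0.9,
                            nesterov=True)        # SGD with Nesterov momentum

for inputs, targets in [(torch.randn(128, 2304), torch.randn(128, 2304))]:   # mini-batches of 128
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()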

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database.

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

115

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part I’: End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit: Victor Campos, “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)

118

E2E Fine-tuning

Slide credit: Victor Campos, “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)

Fine-tuning a pre-trained network
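A common fine-tuning recipe sketched with PyTorch/torchvision (illustrative; the cited works used Caffe): load ImageNet-pretrained weights, replace the final classifier with a head for the new task, optionally freeze the early layers, and train with a small learning rate.

import torch.nn as nn
import torchvision.models as models
from torch.optim import SGD

model = models.vgg16(pretrained=True)             # ImageNet weights (newer torchvision uses weights=...)
for param in model.features.parameters():
    param.requires_grad = False                   # optionally freeze the convolutional layers

model.classifier[-1] = nn.Linear(4096, 2)         # new head for the target task (e.g. 2 sentiment classes)
optimizer = SGD(model.classifier[-1].parameters(), lr=1e-3, momentum=0.9)   # small learning rate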

119

E2E Fine-tuning Sentiments

CNN

Campos, Victor, Amaia Salvador, Xavier Giro-i-Nieto, and Brendan Jou. “Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction.” In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia, pp. 57-62. ACM, 2015.

120

Campos, Victor, Xavier Giro-i-Nieto, and Brendan Jou, “From pixels to sentiments” (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A. Salvador, M. Zeppelzauer, D. Manchon-Vizuete, A. Calafell-Orós, and X. Giró-i-Nieto, “Cultural Event Recognition with Visual ConvNets and Temporal Models,” in CVPR ChaLearn Looking at People Workshop 2015, 2015. [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
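A sketch of the off-the-shelf idea with torchvision's AlexNet (illustrative): run images through a network pretrained on ImageNet, keep an intermediate fully connected activation (an fc7-like descriptor), L2-normalize it, and compare descriptors with a standard similarity.

import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.alexnet(pretrained=True).eval()    # fixed feature extractor (newer torchvision: weights=...)
images = torch.randn(2, 3, 224, 224)              # two (fake) preprocessed images

with torch.no_grad():
    feats = model.avgpool(model.features(images)).flatten(1)   # pooled convolutional features
    fc7 = model.classifier[:5](feats)                           # penultimate fully connected activation

fc7 = F.normalize(fc7)                            # L2-normalize the descriptors
print((fc7[0] @ fc7[1]).item())                   # cosine similarity between the two images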

127

Off-The-Shelf (OTS) Features

Babenko, Artem, et al. “Neural codes for image retrieval.” Computer Vision – ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization
Accuracy: +5%

Three representative architectures considered: AlexNet (fast, 5 days of training), ZF, and OverFeat (slow, 3 weeks of training), on an NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels (FK) vs. ConvNets (CNN)
Color vs. Gray Scale (GS): Accuracy -2.5%

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes: Accuracy -2%, Size x32 smaller

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. “Neural codes for image retrieval.” Computer Vision – ECCV 2014. Springer International Publishing, 2014. 584-599.

135

OTS Retrieval

Oxford Buildings, Inria Holidays, UKB

136

OTS Retrieval

Pooled from the network of Krizhevsky et al., pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. “Neural codes for image retrieval.” Computer Vision – ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful but not superior (w.r.t. FV, VLAD, Sparse Coding)

138 Babenko, Artem, et al. “Neural codes for image retrieval.” Computer Vision – ECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial search (extracting N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprint

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños, M., Mestre, R., Talavera, E., Giró-i-Nieto, X., Radeva, P. “Visual Summary of Egocentric Photostreams by Representative Keyframes.” In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX), Turin, Italy, 2015.

Clustering based on Euclidean distance over FC7 features from AlexNet
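A sketch of that step under stated assumptions (scikit-learn k-means, which uses Euclidean distance, as a stand-in for the clustering actually used; the features are fake): cluster the FC7 descriptors and keep the photo closest to each centroid as the representative keyframe.

import numpy as np
from sklearn.cluster import KMeans

fc7_features = np.random.randn(500, 4096)         # one FC7 descriptor per egocentric photo (fake data)

kmeans = KMeans(n_clusters=8, n_init=10).fit(fc7_features)     # Euclidean k-means over the descriptors
# representative keyframe per cluster: the photo closest to each centroid
dists = np.linalg.norm(fc7_features[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)
keyframes = dists.argmin(axis=0)
print(keyframes)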

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 2: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

2

Densely linked slides

3

IntroductionXavier Giro-i-Nieto

bull Web httpsimatgeupceduwebpeoplexavier-giroAssociate Professor at Universitat Politecnica de Catalunya (UPC)

4

Acknowledgments

5

Acknowledgments

One lecture organized in three parts

6

Images (global) Objects (local)

Deep ConvNets for Recognition for

Video (2D+T)

One lecture organized in three parts

7

Images (global) Objects (local)

Deep ConvNets for Recognition for

Video (2D+T)

Previously before deep learning

8Slide credit Jose M Agravelvarez

Dog

9Slide credit Jose M Agravelvarez

Dog

Learned Representation

Previously before deep learning

Outline for Part I Image Analytics

10

Dog

Learned Representation

Part I End-to-end learning (E2E)

11

Learned Representation

Part I End-to-end learning (E2E)

Task A(eg image classification)

Outline for Part I Image Analytics

12

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

13

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

E2E Classification Supervised learning

14

Manual Annotations Model

New ImageAutomatic

classification

Training

Test

Anchor

E2E Classification LeNet-5

15

LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo

E2E Classification Databases

18

Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 3: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

3

IntroductionXavier Giro-i-Nieto

bull Web httpsimatgeupceduwebpeoplexavier-giroAssociate Professor at Universitat Politecnica de Catalunya (UPC)

4

Acknowledgments

5

Acknowledgments

One lecture organized in three parts

6

Images (global) Objects (local)

Deep ConvNets for Recognition for

Video (2D+T)

One lecture organized in three parts

7

Images (global) Objects (local)

Deep ConvNets for Recognition for

Video (2D+T)

Previously before deep learning

8Slide credit Jose M Agravelvarez

Dog

9Slide credit Jose M Agravelvarez

Dog

Learned Representation

Previously before deep learning

Outline for Part I Image Analytics

10

Dog

Learned Representation

Part I End-to-end learning (E2E)

11

Learned Representation

Part I End-to-end learning (E2E)

Task A(eg image classification)

Outline for Part I Image Analytics

12

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

13

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

E2E Classification Supervised learning

14

Manual Annotations Model

New ImageAutomatic

classification

Training

Test

Anchor

E2E Classification LeNet-5

15

LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo

E2E Classification Databases

18

Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D & Fergus R (2014) Visualizing and understanding convolutional networks In Computer Vision–ECCV 2014 (pp 818-833) Springer International Publishing

"A convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite."

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D & Fergus R (2014) Visualizing and understanding convolutional networks In Computer Vision–ECCV 2014 (pp 818-833) Springer International Publishing

DeconvNet / ConvNet

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

"To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer."

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

"(i) Unpool: In the convnet, the max pooling operation is non-invertible; however, we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables."
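A hedged NumPy sketch of the switch idea: max pooling records where each maximum came from, and unpooling places the pooled values back at those locations, with zeros elsewhere. The 2x2 pooling region and array shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """k x k max pooling that also returns the argmax 'switches'."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    switches = np.zeros_like(pooled, dtype=int)
    for i in range(0, h, k):
        for j in range(0, w, k):
            patch = x[i:i+k, j:j+k]
            pooled[i//k, j//k] = patch.max()
            switches[i//k, j//k] = patch.argmax()   # position inside the k x k region
    return pooled, switches

def unpool(pooled, switches, k=2):
    """Approximate inverse: put each max back where it came from, zeros elsewhere."""
    h, w = pooled.shape
    out = np.zeros((h * k, w * k))
    for i in range(h):
        for j in range(w):
            di, dj = divmod(switches[i, j], k)
            out[i*k + di, j*k + dj] = pooled[i, j]
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
p, s = max_pool_with_switches(x)
print(unpool(p, s))
```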

41

E2E Classification Visualize ZF


"(ii) Rectification: The convnet uses ReLU non-linearities, which rectify the feature maps, thus ensuring the feature maps are always positive."

42

E2E Classification Visualize ZF

"(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally."


43

E2E Classification Visualize ZF

"(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally."


44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. Corresponding image patches.

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D & Fergus R (2014) Visualizing and understanding convolutional networks In Computer Vision–ECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D & Fergus R (2014) Visualizing and understanding convolutional networks In Computer Vision–ECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) result in more distinctive features and fewer "dead" features.

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50%) of each intermediate layer's neurons

Hinton G E Srivastava N Krizhevsky A Sutskever I & Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv:1207.0580
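A minimal NumPy sketch of dropout as described above; the 50% rate and the "inverted" rescaling at training time are common conventions rather than the exact formulation in the cited paper.

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, seed=0):
    # During training, zero out each unit with probability p_drop and rescale
    # the survivors so the expected activation stays the same ("inverted" dropout).
    if not training:
        return activations
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones(8)
print(dropout(h))                   # roughly half the units zeroed, survivors scaled to 2.0
print(dropout(h, training=False))   # identity at test time
```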


E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]


Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSVRC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]


Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSVRC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper: overfitting due to the increased number of parameters; inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions do dimensionality reduction (c3 < c2) and account for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
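A back-of-the-envelope sketch (plain Python, with an assumed feature-map size and channel counts, not GoogLeNet's actual configuration) of why the 1x1 reductions pay off before an expensive 5x5 convolution:

```python
def conv_macs(h, w, c_in, c_out, k):
    # Multiply-accumulates for a k x k convolution on an h x w x c_in input
    return h * w * c_in * c_out * k * k

H = W = 28                           # assumed feature-map size
C_IN, C_RED, C_OUT = 256, 64, 128    # assumed channel counts

direct = conv_macs(H, W, C_IN, C_OUT, 5)
reduced = conv_macs(H, W, C_IN, C_RED, 1) + conv_macs(H, W, C_RED, C_OUT, 5)
print(f"direct 5x5:   {direct:,} MACs")
print(f"1x1 then 5x5: {reduced:,} MACs  ({direct / reduced:.1f}x cheaper)")
```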

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces some spatial invariance and has proven to have a beneficial effect when added as an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73 NVIDIA "NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge" (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
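A small illustrative calculation of the 3x3-stack idea: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution (three cover a 7x7) with fewer parameters and extra non-linearities in between. The channel count C is an assumption.

```python
def conv_params(c, k, layers=1):
    # Weights of `layers` stacked k x k convolutions with C input and C output channels
    return layers * (k * k * c * c)

C = 256  # assumed channel count
print("one 5x5  :", conv_params(C, 5))             # 25 * C^2
print("two 3x3  :", conv_params(C, 3, layers=2))   # 18 * C^2
print("one 7x7  :", conv_params(C, 7))             # 49 * C^2
print("three 3x3:", conv_params(C, 3, layers=3))   # 27 * C^2
```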

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No pooling between some convolutional layers. Convolution stride of 1 (no skipping).

E2E Classification

79

3.6% top-5 error... with 152 layers

E2E Classification ResNet

80 He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv:1512.03385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 layers vs. 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv:1512.03385 (2015) [slides]

Thin curves: training error. Bold curves: validation error.

E2E Classification ResNet

82

Residual learning: reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv:1512.03385 (2015) [slides]
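A minimal NumPy sketch of the residual idea, y = F(x) + x; the two fully connected layers below stand in for the real block, which uses convolutions and batch normalization.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    # A plain stack would return relu(W2 @ relu(W1 @ x));
    # the residual block adds the identity shortcut instead.
    f = W2 @ relu(W1 @ x)      # residual function F(x)
    return relu(f + x)         # F(x) + x, then non-linearity

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)
W1 = rng.standard_normal((d, d)) * 0.1
W2 = rng.standard_normal((d, d)) * 0.1
print(residual_block(x, W1, W2))
```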

E2E Classification ResNet

83 He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv:1512.03385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]

85

E2E Classification Humans

"Is this a Border terrier?"

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]

Crowdsourcing loss (0.3%): more than 5 object classes

E2E Classification Humans

87 Andrej Karpathy "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

E2E Classification Humans

88 Andrej Karpathy "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

"Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?"

E2E Classification Humans

E2E Classification Humans

89 NVIDIA "Mocha.jl: Deep Learning for Julia" (2015)

ResNet

90

Let's play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

Dataset              TRAIN    VALIDATION    TEST
SALICON              10,000   5,000         5,000
iSUN                 6,000    926           2,000
CAT2000 [Borji'15]   2,000    -             2,000
MIT300 [Judd'12]     300      -             -

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2304=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2304=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2304=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2304=48x48
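Putting the diagram together, here is a hedged PyTorch-style sketch of the overall shape (96x96 RGB input, three conv+pool stages, two dense layers, a 2304-dim output reshaped to a 48x48 map, then upsampled and filtered back to the input resolution). Filter counts and kernel sizes are illustrative guesses; the original network was not implemented in PyTorch and its exact hyper-parameters are not given in these slides.

```python
import torch
import torch.nn as nn

class ShallowSaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(  # 3 conv layers (assumed sizes)
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),    # 96 -> 48
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
        )
        self.regressor = nn.Sequential(  # 2 dense layers
            nn.Flatten(),
            nn.Linear(128 * 12 * 12, 4096), nn.ReLU(),
            nn.Linear(4096, 48 * 48),    # 2304 outputs = 48x48 saliency map
        )

    def forward(self, x):                # x: (N, 3, 96, 96)
        sal = self.regressor(self.features(x)).view(-1, 1, 48, 48)
        # "Upsample + filter" back to the input resolution
        return nn.functional.interpolate(sal, size=(96, 96), mode="bilinear",
                                         align_corners=False)

net = ShallowSaliencyNet()
print(net(torch.zeros(1, 3, 96, 96)).shape)  # torch.Size([1, 1, 96, 96])
```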

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function: Mean Square Error (MSE)

Weight initialization: Gaussian distribution

Learning rate: 0.03 to 0.0001

Mini batch size: 128

Training time: 7h (SALICON), 4h (iSUN)

Acceleration: SGD + Nesterov momentum (0.9)

Regularisation: Maxout norm

GPU: NVidia GTX 980
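A hedged, self-contained sketch of the training recipe listed above (MSE loss, SGD with Nesterov momentum 0.9, mini-batches of 128); the stand-in model and the fixed learning rate are placeholders for the real network and schedule.

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be the shallow saliency network sketched above
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 96 * 96, 48 * 48))
criterion = nn.MSELoss()                          # pixel-wise mean squared error
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, nesterov=True)

def train_step(images, target_maps):
    """One mini-batch update. images: (128, 3, 96, 96); target_maps: (128, 48*48) in [0, 1]."""
    optimizer.zero_grad()
    loss = criterion(model(images), target_maps)
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_step(torch.rand(128, 3, 96, 96), torch.rand(128, 48 * 48)))
```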

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database.

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part I': End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)
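A hedged sketch of the generic fine-tuning recipe: load a network pre-trained on ImageNet, replace the last classifier layer for the new task, optionally freeze the early layers, and train with a small learning rate. torchvision's VGG-16 is used only for concreteness; the cited works used other frameworks, and the layer split, class count, and learning rate are illustrative assumptions.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

NUM_CLASSES = 2                          # e.g. positive / negative sentiment (assumed)

model = models.vgg16(pretrained=True)    # weights learned on ImageNet (Domain A)

# Freeze the convolutional feature extractor (keep the generic low-level filters)
for param in model.features.parameters():
    param.requires_grad = False

# Replace the last fully connected layer for the new task (Domain B)
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)

# Only the trainable parameters are updated, with a small learning rate
optimizer = optim.SGD([p for p in model.parameters() if p.requires_grad],
                      lr=1e-3, momentum=0.9)
```

Depending on the torchvision version, a `weights=...` argument may be required instead of `pretrained=True`.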

118

E2E Fine-tuning

Slide credit Victor Campos "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou "From pixels to sentiments" (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Orós A and Giró-i-Nieto X "Cultural Event Recognition with Visual ConvNets and Temporal Models" in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
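A hedged sketch of the off-the-shelf idea: take a network pre-trained for classification, keep the activations of a late hidden layer (here a 4096-d fc7-style vector from VGG-16's penultimate layer), L2-normalize them, and use plain nearest-neighbour ranking for retrieval or any other task. The model and layer choice are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

base = models.vgg16(pretrained=True)
base.eval()
# Keep everything up to the penultimate fully connected layer (a 4096-d descriptor)
feature_extractor = nn.Sequential(base.features, base.avgpool, nn.Flatten(),
                                  *list(base.classifier.children())[:-1])

@torch.no_grad()
def describe(batch):                     # batch: (N, 3, 224, 224), ImageNet-normalized
    feats = feature_extractor(batch)
    return feats / feats.norm(dim=1, keepdim=True)   # L2-normalize the descriptors

# Retrieval = cosine similarity between a query descriptor and a database of descriptors
def rank(query_desc, database_desc):
    return (database_desc @ query_desc.T).squeeze(1).argsort(descending=True)
```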

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization

Accuracy: +5%

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy: -2.5%

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2%

Size: x32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful but not superior (w.r.t. FV, VLAD, Sparse Coding)

138 Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños M Mestre R Talavera E Giró-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
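A hedged sketch of that step: cluster the per-frame descriptors (e.g. AlexNet FC7 vectors) with k-means under Euclidean distance and keep, for each cluster, the frame closest to its centroid as the representative keyframe. Feature extraction is assumed to have happened already; the number of clusters is an arbitrary choice here.

```python
import numpy as np
from sklearn.cluster import KMeans

def summarize(fc7_features, n_keyframes=5, seed=0):
    """fc7_features: (n_frames, 4096) array of per-frame descriptors."""
    km = KMeans(n_clusters=n_keyframes, random_state=seed, n_init=10)
    labels = km.fit_predict(fc7_features)
    keyframes = []
    for c in range(n_keyframes):
        members = np.where(labels == c)[0]
        dists = np.linalg.norm(fc7_features[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(members[np.argmin(dists)])   # frame closest to the centroid
    return sorted(keyframes)

# Example with random stand-in features for a 200-frame photostream
feats = np.random.default_rng(0).standard_normal((200, 4096)).astype(np.float32)
print(summarize(feats))
```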

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 4: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

4

Acknowledgments

5

Acknowledgments

One lecture organized in three parts

6

Images (global) Objects (local)

Deep ConvNets for Recognition for

Video (2D+T)

One lecture organized in three parts

7

Images (global) Objects (local)

Deep ConvNets for Recognition for

Video (2D+T)

Previously before deep learning

8Slide credit Jose M Agravelvarez

Dog

9Slide credit Jose M Agravelvarez

Dog

Learned Representation

Previously before deep learning

Outline for Part I Image Analytics

10

Dog

Learned Representation

Part I End-to-end learning (E2E)

11

Learned Representation

Part I End-to-end learning (E2E)

Task A(eg image classification)

Outline for Part I Image Analytics

12

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

13

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

E2E Classification Supervised learning

14

Manual Annotations Model

New ImageAutomatic

classification

Training

Test

Anchor

E2E Classification LeNet-5

15

LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo

E2E Classification Databases

18

Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 5: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

5

Acknowledgments

One lecture organized in three parts

6

Images (global) Objects (local)

Deep ConvNets for Recognition for

Video (2D+T)

One lecture organized in three parts

7

Images (global) Objects (local)

Deep ConvNets for Recognition for

Video (2D+T)

Previously before deep learning

8Slide credit Jose M Agravelvarez

Dog

9Slide credit Jose M Agravelvarez

Dog

Learned Representation

Previously before deep learning

Outline for Part I Image Analytics

10

Dog

Learned Representation

Part I End-to-end learning (E2E)

11

Learned Representation

Part I End-to-end learning (E2E)

Task A(eg image classification)

Outline for Part I Image Analytics

12

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

13

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

E2E Classification Supervised learning

14

Manual Annotations Model

New ImageAutomatic

classification

Training

Test

Anchor

E2E Classification LeNet-5

15

LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo

E2E Classification Databases

18

Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces some spatial invariance, and adding it as an alternative parallel path has proven beneficial
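Putting the previous slides together, a hedged sketch of an Inception-style module with its four parallel paths (1x1, 1x1 then 3x3, 1x1 then 5x5, and 3x3 max pooling then 1x1); the channel counts are illustrative assumptions, not necessarily the ones used in the paper.

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, c_in, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        self.p1 = nn.Sequential(nn.Conv2d(c_in, c1, 1), nn.ReLU())
        self.p2 = nn.Sequential(nn.Conv2d(c_in, c3r, 1), nn.ReLU(),
                                nn.Conv2d(c3r, c3, 3, padding=1), nn.ReLU())
        self.p3 = nn.Sequential(nn.Conv2d(c_in, c5r, 1), nn.ReLU(),
                                nn.Conv2d(c5r, c5, 5, padding=2), nn.ReLU())
        self.p4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(c_in, cp, 1), nn.ReLU())

    def forward(self, x):
        # The four paths run in parallel and are concatenated along the channel axis.
        return torch.cat([self.p1(x), self.p2(x), self.p3(x), self.p4(x)], dim=1)

m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)              # (1, 256, 28, 28)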

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers are needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA “NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge” (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
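A small sketch of why the 3x3 stacks pay off: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, with two non-linearities and fewer parameters (the channel count C = 64 is an arbitrary assumption).

import torch
import torch.nn as nn

C = 64
stacked_3x3 = nn.Sequential(nn.Conv2d(C, C, 3, padding=1), nn.ReLU(),
                            nn.Conv2d(C, C, 3, padding=1), nn.ReLU())
single_5x5 = nn.Conv2d(C, C, 5, padding=2)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stacked_3x3), count(single_5x5))     # ~74k vs ~102k parameters

x = torch.randn(1, C, 32, 32)
print(stacked_3x3(x).shape, single_5x5(x).shape) # same output resolution for both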

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No pooling between some convolutional layers. Convolution stride of 1 (no skipping)

E2E Classification

79

3.6% top-5 error… with 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 layers vs 18 layers) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves: training error. Bold curves: validation error

E2E Classification ResNet

82

Residual learning reformulates the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
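A hedged sketch of a basic residual block: the stacked layers learn the residual F(x) and the block outputs F(x) + x through an identity shortcut (batch norm placement and sizes follow common practice and are assumptions, not necessarily the exact paper configuration).

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
        return F.relu(residual + x)                                        # F(x) + x

block = BasicBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)     # (1, 64, 56, 56)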

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

“Is this a Border terrier?”

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (0.3%). More than 5 object classes

E2E Classification Humans

87Andrej Karpathy “What I learned from competing against a computer on ImageNet” (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy “What I learned from competing against a computer on ImageNet” (2014)

Test data collection from one human [interface]

“Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?”

E2E Classification Humans

E2E Classification Humans

89NVIDIA “Mocha.jl: Deep Learning for Julia” (2015)

ResNet

90

Let’s play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

Dataset              TRAIN   VALIDATION   TEST
SALICON [Jiang’15]   10000   5000         5000
iSUN [Xu’15]         6000    926          2000
CAT2000 [Borji’15]   2000    -            2000
MIT300 [Judd’12]     300     -            -

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2304 = 48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2304 = 48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2304 = 48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2304 = 48x48
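The slides above sketch the JuntingNet layout (96x96 RGB input, 3 conv layers, 2 dense layers, a flattened 48x48 output map). A rough PyTorch sketch in that spirit, with assumed filter sizes and channel counts rather than the published ones:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),   # 96x96 -> 48x48
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 48x48 -> 24x24
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24x24 -> 12x12
    nn.Flatten(),
    nn.Linear(64 * 12 * 12, 4096), nn.ReLU(),
    nn.Linear(4096, 48 * 48), nn.Sigmoid(),                       # 2304 = 48x48 saliency values
)

crops = torch.randn(8, 3, 96, 96)                 # a batch of 96x96 RGB crops
saliency = model(crops).view(-1, 1, 48, 48)       # 2D maps, later upsampled + filtered
print(saliency.shape)                             # (8, 1, 48, 48)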

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 0.03 to 0.0001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD + Nesterov momentum (0.9) (training loop sketched below)

Regularisation Maxout norm

GPU NVidia GTX 980
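A minimal training-loop sketch following that recipe (MSE loss, SGD with Nesterov momentum 0.9, learning rate decayed from 0.03, mini-batches of 128); the data and the one-layer placeholder network are stand-ins, not the real SALICON pipeline.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data and model; substitute the conv net sketched above and the real saliency data.
data = TensorDataset(torch.randn(256, 3, 96, 96), torch.rand(256, 48 * 48))
loader = DataLoader(data, batch_size=128, shuffle=True)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 96 * 96, 48 * 48), nn.Sigmoid())

criterion = nn.MSELoss()                                     # pixel-wise mean square error
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, nesterov=True)     # SGD + Nesterov momentum (0.9)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.8)  # decay 0.03 towards 1e-4

for epoch in range(5):                                       # a few epochs, just to exercise the loop
    for crops, fixation_maps in loader:
        optimizer.zero_grad()
        loss = criterion(model(crops), fixation_maps)
        loss.backward()
        optimizer.step()
    scheduler.step()
    print(epoch, loss.item())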

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part I’ End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)

Fine-tuning a pre-trained network
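A hedged sketch of the usual recipe, assuming torchvision and an ImageNet-pretrained VGG-16: keep the learned representation, replace the last classifier layer for the new task, optionally freeze the convolutional layers, and train with a small learning rate.

import torch.nn as nn
import torchvision.models as models
from torch.optim import SGD

model = models.vgg16(pretrained=True)              # weights learned on ImageNet (Domain A)
for p in model.features.parameters():              # optionally freeze the convolutional layers
    p.requires_grad = False

model.classifier[-1] = nn.Linear(4096, 2)          # new head for Domain B (e.g. 2 sentiment classes)

optimizer = SGD((p for p in model.parameters() if p.requires_grad),
                lr=1e-3, momentum=0.9)             # small learning rate for fine-tuning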

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou “From pixels to sentiments” (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Orós A and Giró-i-Nieto X “Cultural Event Recognition with Visual ConvNets and Temporal Models” in CVPR ChaLearn Looking at People Workshop 2015, 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
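A hedged sketch of the idea, assuming torchvision and an ImageNet-pretrained AlexNet: drop the final classification layer and reuse the penultimate (fc7-like) 4096-D activation as a generic descriptor for another task.

import torch
import torch.nn as nn
import torchvision.models as models

alexnet = models.alexnet(pretrained=True).eval()    # trained end-to-end for Task A (ImageNet)
alexnet.classifier = nn.Sequential(*list(alexnet.classifier.children())[:-1])  # drop the last layer

with torch.no_grad():
    descriptor = alexnet(torch.randn(1, 3, 224, 224))   # stand-in for a preprocessed image
descriptor = descriptor / descriptor.norm()              # L2-normalize before reusing it (Task B)
print(descriptor.shape)                                  # (1, 4096) off-the-shelf feature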

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization of the features

Accuracy: +5%

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy: -2.5%

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2%

Size: x32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Features pooled from the network of Krizhevsky et al, pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful but not superior (wrt FV, VLAD, Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones
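An illustrative sketch of retrieval with convolutional features (not the exact pipeline of the paper): max-pool the last convolutional maps into a compact descriptor, L2-normalize it, and rank a gallery by cosine similarity.

import torch
import torch.nn.functional as F
import torchvision.models as models

cnn = models.vgg16(pretrained=True).features.eval()      # convolutional layers only

def describe(batch):
    with torch.no_grad():
        fmap = cnn(batch)                                 # (N, 512, H', W') feature maps
    desc = F.adaptive_max_pool2d(fmap, 1).flatten(1)      # spatial max-pooling -> (N, 512)
    return F.normalize(desc, dim=1)                       # L2-normalized descriptors

query = describe(torch.randn(1, 3, 224, 224))             # stand-ins for real images
gallery = describe(torch.randn(10, 3, 224, 224))
scores = gallery @ query.t()                              # cosine similarity (unit-norm vectors)
print(scores.squeeze(1).argsort(descending=True))         # ranked retrieval list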

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial search (extract N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños M Mestre R Talavera E Giró-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
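A minimal sketch of that idea with scikit-learn; the FC7 matrix is a random stand-in and the number of clusters is an arbitrary assumption.

import numpy as np
from sklearn.cluster import KMeans

fc7 = np.random.rand(200, 4096)            # stand-in for AlexNet FC7 features of 200 egocentric photos

kmeans = KMeans(n_clusters=8, n_init=10).fit(fc7)    # k-means uses Euclidean distance in FC7 space
dists = kmeans.transform(fc7)                        # (200, 8) distances to each cluster centre
keyframes = dists.argmin(axis=0)                     # photo closest to each centre = representative keyframe
print(keyframes)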

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 6: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

One lecture organized in three parts

6

Images (global) Objects (local)

Deep ConvNets for Recognition for

Video (2D+T)

One lecture organized in three parts

7

Images (global) Objects (local)

Deep ConvNets for Recognition for

Video (2D+T)

Previously before deep learning

8Slide credit Jose M Agravelvarez

Dog

9Slide credit Jose M Agravelvarez

Dog

Learned Representation

Previously before deep learning

Outline for Part I Image Analytics

10

Dog

Learned Representation

Part I End-to-end learning (E2E)

11

Learned Representation

Part I End-to-end learning (E2E)

Task A(eg image classification)

Outline for Part I Image Analytics

12

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

13

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

E2E Classification Supervised learning

14

Manual Annotations Model

New ImageAutomatic

classification

Training

Test

Anchor

E2E Classification LeNet-5

15

LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo

E2E Classification Databases

18

Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 7: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

One lecture organized in three parts

7

Images (global) Objects (local)

Deep ConvNets for Recognition for

Video (2D+T)

Previously before deep learning

8Slide credit Jose M Agravelvarez

Dog

9Slide credit Jose M Agravelvarez

Dog

Learned Representation

Previously before deep learning

Outline for Part I Image Analytics

10

Dog

Learned Representation

Part I End-to-end learning (E2E)

11

Learned Representation

Part I End-to-end learning (E2E)

Task A(eg image classification)

Outline for Part I Image Analytics

12

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

13

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

E2E Classification Supervised learning

14

Manual Annotations Model

New ImageAutomatic

classification

Training

Test

Anchor

E2E Classification LeNet-5

15

LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo

E2E Classification Databases

18

Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function: Mean Square Error (MSE)
Weight initialization: Gaussian distribution
Learning rate: 0.03 to 0.0001
Mini batch size: 128
Training time: 7h (SALICON), 4h (iSUN)
Acceleration: SGD + Nesterov momentum (0.9)
Regularisation: Maxout norm
GPU: NVidia GTX 980

(A training-loop sketch with these settings follows below.)
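A hedged sketch of this training setup in PyTorch (MSE loss, SGD with Nesterov momentum 0.9, mini-batches of 128, learning rate decayed from 0.03 towards 0.0001); the data loader, number of epochs and exact decay schedule are assumptions, not the ones used in the paper:

```python
import torch
import torch.nn as nn

def train(model, train_loader, num_epochs=20):
    """train_loader is assumed to yield (images, saliency_maps) mini-batches of 128."""
    criterion = nn.MSELoss()                                # mean square error over the 48x48 maps
    optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                                momentum=0.9, nesterov=True)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # decay towards 1e-4

    for epoch in range(num_epochs):
        for images, saliency_maps in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), saliency_maps)  # Euclidean / MSE error
            loss.backward()                                 # back-propagation
            optimizer.step()
        scheduler.step()
    return model
```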

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database.

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNet | Ground Truth | Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNet | Ground Truth | Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNet | Ground Truth | Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNet | Ground Truth | Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

118

E2E Fine-tuning

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network
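For illustration, a hedged PyTorch/torchvision sketch of this recipe: load a network pre-trained on domain A (ImageNet), swap the last layer for the new task, and start by training only the new layer. The VGG-16 loading call, the two-class output and the learning rate are assumptions, not the exact setup of the works cited below.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Network pre-trained on domain A (ImageNet).
net = models.vgg16(pretrained=True)

# Freeze the transferred layers so only the new head is updated at first.
for param in net.parameters():
    param.requires_grad = False

# Replace the final classifier layer for the target domain B
# (e.g. binary visual sentiment prediction -> 2 outputs).
in_features = net.classifier[-1].in_features
net.classifier[-1] = nn.Linear(in_features, 2)

# Optimize only the new layer; earlier layers can later be unfrozen
# (usually with a smaller learning rate) for full fine-tuning.
optimizer = torch.optim.SGD(net.classifier[-1].parameters(), lr=1e-3, momentum=0.9)
```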

119

E2E Fine-tuning Sentiments

CNN

Campos Victor, Amaia Salvador, Xavier Giro-i-Nieto and Brendan Jou, Diving Deep into Sentiment Understanding: Fine-tuned CNNs for Visual Sentiment Prediction, In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia, pp 57-62, ACM 2015

120

Campos Victor, Xavier Giro-i-Nieto and Brendan Jou, "From pixels to sentiments" (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A. Salvador, Zeppelzauer M, Manchon-Vizuete D, Calafell-Orós A and Giró-i-Nieto X, "Cultural Event Recognition with Visual ConvNets and Temporal Models", in CVPR ChaLearn Looking at People Workshop 2015, 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
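A hedged sketch of the idea with torchvision's AlexNet (the pretrained loading call and the chosen layer are assumptions): a forward hook grabs the activations of an intermediate fully connected layer, which can then be reused as an off-the-shelf descriptor for a different task such as retrieval or clustering.

```python
import torch
import torchvision.models as models

net = models.alexnet(pretrained=True).eval()

activations = {}
def save_output(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# classifier[4] is the second fully connected layer ("fc7", 4096-D) in torchvision's AlexNet.
net.classifier[4].register_forward_hook(save_output("fc7"))

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed input image
    net(image)

descriptor = activations["fc7"]           # 4096-D off-the-shelf visual descriptor
```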

127

Off-The-Shelf (OTS) Features

Babenko Artem, et al. Neural codes for image retrieval. Computer Vision–ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization

Accuracy: +5%

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Figure: classification accuracy of Fisher Kernels (FK) vs ConvNets (CNN), with Color vs Gray Scale (GS) input. Accuracy drop with gray scale: around -2.5%.

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2%

Size: ×32 reduction

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem, et al. Neural codes for image retrieval. Computer Vision–ECCV 2014. Springer International Publishing, 2014. 584-599

135

OTS Retrieval

Oxford Buildings, INRIA Holidays, UKB

136

OTS Retrieval

Pooled from the network of Krizhevsky et al., pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem, et al. Neural codes for image retrieval. Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful but not superior (w.r.t. FV, VLAD, Sparse Coding)

138 Babenko Artem, et al. Neural codes for image retrieval. Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al., A baseline for visual instance retrieval with deep convolutional networks, ICLR 2015 (139)

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al., A baseline for visual instance retrieval with deep convolutional networks, ICLR 2015 (140)

Spatial Search (extract N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints
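A hedged NumPy sketch of instance retrieval with such convolutional descriptors: max-pool each feature map over its spatial locations, L2-normalize, and rank the database by cosine similarity; spatial search would simply repeat this over several predefined sub-windows per image (the window layout is assumed and not shown).

```python
import numpy as np

def conv_descriptor(feature_map):
    """feature_map: (channels, h, w) conv activations -> L2-normalized global descriptor."""
    v = feature_map.reshape(feature_map.shape[0], -1).max(axis=1)   # spatial max-pooling
    return v / (np.linalg.norm(v) + 1e-12)

def rank_database(query_map, database_maps):
    """Return database indices sorted from most to least similar to the query."""
    q = conv_descriptor(query_map)
    db = np.stack([conv_descriptor(m) for m in database_maps])
    return np.argsort(-(db @ q))          # cosine similarity: vectors are unit length

# Toy usage with random activations standing in for real conv features.
query = np.random.rand(512, 14, 14)
database = [np.random.rand(512, 14, 14) for _ in range(5)]
print(rank_database(query, database))
```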

Razavian et al., A baseline for visual instance retrieval with deep convolutional networks, ICLR 2015 (141)

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños M, Mestre R, Talavera E, Giró-i-Nieto X, Radeva P, Visual Summary of Egocentric Photostreams by Representative Keyframes, In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015, Turin, Italy, 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
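A hedged sketch of that selection step with scikit-learn (assumed available): cluster the FC7 descriptors with k-means (Euclidean distance) and keep, for each cluster, the photo closest to its centroid as the representative keyframe; the number of clusters is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_keyframes(fc7_features, n_clusters=5):
    """fc7_features: (n_images, 4096) array of FC7 descriptors; returns one index per cluster."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(fc7_features)
    keyframes = []
    for c in range(n_clusters):
        members = np.where(kmeans.labels_ == c)[0]
        dists = np.linalg.norm(fc7_features[members] - kmeans.cluster_centers_[c], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))   # photo closest to the centroid
    return keyframes

# Toy usage with random descriptors standing in for real FC7 features.
features = np.random.rand(100, 4096)
print(representative_keyframes(features))
```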

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 8: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

Previously before deep learning

8Slide credit Jose M Agravelvarez

Dog

9Slide credit Jose M Agravelvarez

Dog

Learned Representation

Previously before deep learning

Outline for Part I Image Analytics

10

Dog

Learned Representation

Part I End-to-end learning (E2E)

11

Learned Representation

Part I End-to-end learning (E2E)

Task A(eg image classification)

Outline for Part I Image Analytics

12

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

13

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

E2E Classification Supervised learning

14

Manual Annotations Model

New ImageAutomatic

classification

Training

Test

Anchor

E2E Classification LeNet-5

15

LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo

E2E Classification Databases

18

Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 9: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

9Slide credit Jose M Agravelvarez

Dog

Learned Representation

Previously before deep learning

Outline for Part I Image Analytics

10

Dog

Learned Representation

Part I End-to-end learning (E2E)

11

Learned Representation

Part I End-to-end learning (E2E)

Task A(eg image classification)

Outline for Part I Image Analytics

12

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

13

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

E2E Classification Supervised learning

14

Manual Annotations Model

New ImageAutomatic

classification

Training

Test

Anchor

E2E Classification LeNet-5

15

LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo

E2E Classification Databases

18

Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. Neural codes for image retrieval. Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful, but not superior with respect to FV, VLAD or Sparse Coding

138

Babenko, Artem, et al. Neural codes for image retrieval. Computer Vision–ECCV 2014

OTS Retrieval FC layers
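A hedged sketch of retrieval with such neural codes: L2-normalize the fully connected descriptors and rank the database by Euclidean distance to the query (equivalent to cosine similarity on normalized codes). The descriptors themselves are assumed to come from a helper like the extract_fc7 sketch above:

import numpy as np

def rank_database(query_code, database_codes):
    # query_code: (D,) descriptor; database_codes: (N, D) matrix, one row per database image.
    q = query_code / np.linalg.norm(query_code)
    db = database_codes / np.linalg.norm(database_codes, axis=1, keepdims=True)
    distances = np.linalg.norm(db - q, axis=1)  # Euclidean distance on the unit sphere
    return np.argsort(distances)                # indices of database images, best match first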

Razavian et al. A baseline for visual instance retrieval with deep convolutional networks. ICLR 2015

139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al. A baseline for visual instance retrieval with deep convolutional networks. ICLR 2015

140

Spatial Search (extracting N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints

Razavian et al. A baseline for visual instance retrieval with deep convolutional networks. ICLR 2015

141

OTS Retrieval Conv layers
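A common way to turn convolutional feature maps into retrieval descriptors is channel-wise max-pooling over the spatial positions; the spatial search above then repeats the same pooling over a predefined grid of sub-windows, which is where the extra computation and memory come from. The pooling choice and grid below are assumptions for illustration:

import torch

def conv_descriptor(feature_map):
    # feature_map: (C, H, W) activations from a convolutional layer.
    return torch.amax(feature_map, dim=(1, 2))   # (C,) global descriptor

def spatial_search_descriptors(feature_map, grid=2):
    # N local descriptors from predefined locations: the whole map plus a grid of sub-windows.
    C, H, W = feature_map.shape
    descriptors = [conv_descriptor(feature_map)]
    hs, ws = H // grid, W // grid
    for i in range(grid):
        for j in range(grid):
            window = feature_map[:, i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]
            descriptors.append(conv_descriptor(window))
    return torch.stack(descriptors)              # (1 + grid * grid, C)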

142

OTS Summarization

Bolaños, M., Mestre, R., Talavera, E., Giró-i-Nieto, X., Radeva, P. Visual Summary of Egocentric Photostreams by Representative Keyframes. In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX), Turin, Italy, 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
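A minimal stand-in for the summarization step, assuming plain k-means from scikit-learn: cluster the FC7 codes with Euclidean distance and keep, for each cluster, the frame whose code lies closest to the centroid (the published pipeline is more elaborate than this sketch):

import numpy as np
from sklearn.cluster import KMeans

def representative_keyframes(fc7_features, n_keyframes=5):
    # fc7_features: (N, 4096) matrix, one FC7 descriptor per frame of the photostream.
    kmeans = KMeans(n_clusters=n_keyframes, n_init=10).fit(fc7_features)
    keyframe_ids = []
    for c in range(n_keyframes):
        members = np.where(kmeans.labels_ == c)[0]
        dists = np.linalg.norm(fc7_features[members] - kmeans.cluster_centers_[c], axis=1)
        keyframe_ids.append(int(members[np.argmin(dists)]))
    return keyframe_ids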

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 11: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

11

Learned Representation

Part I End-to-end learning (E2E)

Task A(eg image classification)

Outline for Part I Image Analytics

12

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

13

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

E2E Classification Supervised learning

14

Manual Annotations Model

New ImageAutomatic

classification

Training

Test

Anchor

E2E Classification LeNet-5

15

LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo

E2E Classification Databases

18

Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 12: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

12

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

13

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

E2E Classification Supervised learning

14

Manual Annotations Model

New ImageAutomatic

classification

Training

Test

Anchor

E2E Classification LeNet-5

15

LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo

E2E Classification Databases

18

Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

"(i) Unpool: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables."
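A minimal NumPy sketch of those switch variables, assuming a single feature map and 2x2 non-overlapping pooling (names and sizes are illustrative, not the paper's code):

```python
import numpy as np

def maxpool_with_switches(fmap, size=2):
    """2x2 non-overlapping max pooling that also records where each maximum came from."""
    h, w = fmap.shape
    pooled = np.zeros((h // size, w // size))
    switches = np.zeros_like(fmap, dtype=bool)  # True at the location of each maximum
    for i in range(0, h, size):
        for j in range(0, w, size):
            block = fmap[i:i+size, j:j+size]
            r, c = np.unravel_index(np.argmax(block), block.shape)
            pooled[i // size, j // size] = block[r, c]
            switches[i + r, j + c] = True
    return pooled, switches

def unpool(pooled, switches, size=2):
    """Approximate inverse: place each pooled value back at its recorded location."""
    upsampled = np.kron(pooled, np.ones((size, size)))  # copy values into each block
    return upsampled * switches                          # keep them only at the maxima

fmap = np.random.rand(4, 4)
pooled, switches = maxpool_with_switches(fmap)
print(unpool(pooled, switches))
```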

41

E2E Classification Visualize ZF


"(ii) Rectification: The convnet uses ReLU non-linearities, which rectify the feature maps, thus ensuring the feature maps are always positive."

42

E2E Classification Visualize ZF

"(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally."


43

E2E Classification Visualize ZF

"(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally."


44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. Corresponding image patches.

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D & Fergus R (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp 818-833). Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D & Fergus R (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp 818-833). Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) result in more distinctive features and fewer "dead" features.

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50%) of the intermediate neurons.

Hinton G E, Srivastava N, Krizhevsky A, Sutskever I & Salakhutdinov R R (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580
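The idea above in a minimal NumPy sketch; this uses the common "inverted dropout" formulation (rescaling the kept units at training time), which differs slightly from the cited paper but has the same effect at test time:

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True, rng=np.random.default_rng(0)):
    """Zero a random fraction of units and rescale the survivors (inverted dropout)."""
    if not training:
        return activations
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask / (1.0 - drop_prob)

h = np.ones((2, 8))
print(dropout(h))  # roughly half the entries are zero, the rest scaled by 2
```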


E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S & Fei-Fei L (2015). Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575 [web]

-5%

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSVRC

ImageNet Classification 2013

Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S & Fei-Fei L (2015). Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575 [web]

-5%

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSVRC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper: overfitting due to the increased number of parameters; inefficient computation if most weights end up close to zero.

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions do dimensionality reduction (c3 < c2) and account for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
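Since a 1x1 convolution is just a per-pixel linear map across channels, it can be sketched in a few lines of NumPy (the channel counts are illustrative):

```python
import numpy as np

def conv1x1(fmaps, weights):
    """1x1 convolution: mix channels independently at every spatial position.
    fmaps:   (c2, H, W) input feature maps
    weights: (c3, c2)   with c3 < c2 for dimensionality reduction
    """
    out = np.einsum('oc,chw->ohw', weights, fmaps)  # (c3, H, W)
    return np.maximum(0, out)                       # followed by a ReLU

x = np.random.rand(64, 28, 28)       # c2 = 64 channels
w = np.random.randn(16, 64) * 0.01   # reduce to c3 = 16 channels
print(conv1x1(x, w).shape)           # (16, 28, 28)
```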

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces some spatial invariance and has proven beneficial when added as an alternative parallel path
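Putting the parallel paths of the Inception module together, a minimal PyTorch-style sketch with 1x1, 3x3, 5x5 and max-pooling branches concatenated along the channel axis; the filter counts are illustrative placeholders, not GoogLeNet's actual configuration:

```python
import torch
import torch.nn as nn

class InceptionLikeBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / pool branches, concatenated along channels."""
    def __init__(self, c_in):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, 16, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(c_in, 16, 1),           # 1x1 reduction first
                                nn.ReLU(),
                                nn.Conv2d(16, 32, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(c_in, 8, 1),
                                nn.ReLU(),
                                nn.Conv2d(8, 16, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(c_in, 16, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 64, 28, 28)
print(InceptionLikeBlock(64)(x).shape)  # torch.Size([1, 80, 28, 28])
```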

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

...and no fully connected layers are needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73 NVIDIA, "NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge" (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
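Why 3x3 stacks: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, but add an extra non-linearity and use fewer weights (for C input and output channels, 2 x 9 x C^2 = 18C^2 weights versus 25C^2); three stacked 3x3 layers similarly replace a 7x7 (27C^2 versus 49C^2 weights).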

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No pooling between some convolutional layers. Convolution stride of 1 (no skipping).

E2E Classification

79

3.6% top-5 error... with 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves: training error. Bold curves: validation error.

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
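A minimal NumPy sketch of the residual idea, with an identity shortcut around a two-layer transform F; the shapes and initialization below are illustrative:

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = x + F(x): the layers learn the residual F instead of the full mapping."""
    f = np.maximum(0, x @ w1)    # first layer + ReLU (residual branch)
    f = f @ w2                   # second layer of the residual branch
    return np.maximum(0, x + f)  # identity shortcut, then ReLU

x = np.random.rand(4, 32)
w1 = np.random.randn(32, 32) * 0.1
w2 = np.random.randn(32, 32) * 0.1
print(residual_block(x, w1, w2).shape)  # (4, 32)
```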

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S & Fei-Fei L (2015). Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575 [web]

85

E2E Classification Humans

"Is this a Border terrier?"

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S & Fei-Fei L (2015). Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S & Fei-Fei L (2015). Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575 [web]

Crowdsource loss (0.3%). More than 5 object classes.

E2E Classification Humans

87 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

E2E Classification Humans

88 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

"Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?"

E2E Classification Humans

E2E Classification Humans

89 NVIDIA, "Mocha.jl: Deep Learning for Julia" (2015)

ResNet

90

Let's play a game

91

92

What have you seen?

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

Dataset: TRAIN / VALIDATION / TEST
SALICON: 10,000 / 5,000 / 5,000
iSUN: 6,000 / 926 / 2,000
CAT2000 [Borji'15]: 2,000 / - / 2,000
MIT300 [Judd'12]: 300 / - / -

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96, 2304 = 48x48

IMAGE INPUT (RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96, 2304 = 48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96, 2304 = 48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96, 2304 = 48x48
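A rough sketch of a network with the shape described on these slides: 96x96 RGB input, 3 convolutional layers, 2 dense layers, and a 2304-dimensional output reshaped into a 48x48 saliency map. The filter counts and kernel sizes are illustrative placeholders, not the exact JuntingNet configuration:

```python
import torch
import torch.nn as nn

class ShallowSaliencyNet(nn.Module):
    """96x96 RGB in -> 3 conv layers -> 2 dense layers -> 48x48 saliency map."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),   # 96 -> 48
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 48 -> 24
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 4096), nn.ReLU(),
            nn.Linear(4096, 48 * 48), nn.Sigmoid(),   # 2304 values in [0, 1]
        )

    def forward(self, x):
        return self.regressor(self.features(x)).view(-1, 48, 48)

print(ShallowSaliencyNet()(torch.randn(1, 3, 96, 96)).shape)  # torch.Size([1, 48, 48])
```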

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function: Mean Square Error (MSE)
Weight initialization: Gaussian distribution
Learning rate: 0.03 to 0.0001
Mini-batch size: 128
Training time: 7h (SALICON), 4h (iSUN)
Acceleration: SGD + Nesterov momentum (0.9)
Regularisation: Maxout norm
GPU: NVidia GTX 980
(a minimal sketch of this setup follows below)
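To make the list above concrete, here is a minimal NumPy sketch of one training step with an MSE loss and an SGD update with Nesterov-style momentum, applied to a single toy linear layer; it only illustrates the listed hyperparameters and is not the original training code:

```python
import numpy as np

def mse(pred, target):
    # Mean Square Error, the loss listed on the slide
    return np.mean((pred - target) ** 2)

rng = np.random.default_rng(0)
n_in, n_out = 256, 48 * 48                      # toy input size; 2304 = 48x48 output map
W = rng.normal(scale=0.01, size=(n_in, n_out))  # Gaussian weight initialization
velocity = np.zeros_like(W)
lr, momentum = 0.03, 0.9                        # learning rate and Nesterov momentum

x = rng.random((128, n_in))                     # mini-batch of 128 examples
y = rng.random((128, n_out))                    # flattened 48x48 ground-truth maps

pred_ahead = x @ (W + momentum * velocity)      # Nesterov look-ahead
grad = 2 * x.T @ (pred_ahead - y) / (x.shape[0] * n_out)
velocity = momentum * velocity - lr * grad      # SGD + momentum update
W += velocity

print(mse(x @ W, y))
```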

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database.

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

118

E2E Fine-tuning

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network
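A minimal PyTorch sketch of the recipe: load a network pre-trained on a source domain, freeze the early layers, and replace the final classifier for the new task. The model choice and layer indices are illustrative and not the setup of the works cited on the following slides:

```python
import torch.nn as nn
from torchvision import models

num_classes = 2                                   # e.g. positive / negative sentiment
model = models.vgg16(weights="IMAGENET1K_V1")     # start from ImageNet weights

for param in model.features.parameters():        # freeze the convolutional layers
    param.requires_grad = False

model.classifier[6] = nn.Linear(4096, num_classes)  # re-learn only the last layer
# Training then proceeds as usual, typically with a smaller learning rate.
```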

119

E2E Fine-tuning Sentiments

CNN

Campos Victor, Amaia Salvador, Xavier Giro-i-Nieto and Brendan Jou. Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction. In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia, pp 57-62. ACM 2015

120

Campos Victor, Xavier Giro-i-Nieto and Brendan Jou, "From pixels to sentiments" (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador, Zeppelzauer M, Manchon-Vizuete D, Calafell-Orós A and Giró-i-Nieto X, "Cultural Event Recognition with Visual ConvNets and Temporal Models", in CVPR ChaLearn Looking at People Workshop 2015, 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II, Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
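A minimal PyTorch sketch of the idea: run images through a pre-trained network, keep an intermediate (FC7-like) activation as the descriptor, and compare descriptors directly, here with cosine similarity. The model and layer choice are illustrative, not tied to a specific paper:

```python
import numpy as np
import torch
from torchvision import models

model = models.alexnet(weights="IMAGENET1K_V1").eval()
extractor = torch.nn.Sequential(*list(model.classifier.children())[:-1]).eval()  # drop final FC

def describe(batch):                      # batch: (N, 3, 224, 224) preprocessed images
    with torch.no_grad():
        feats = model.avgpool(model.features(batch)).flatten(1)
        return extractor(feats).numpy()   # 4096-d FC7-like descriptors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d = describe(torch.randn(2, 3, 224, 224))
print(cosine(d[0], d[1]))
```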

127

Off-The-Shelf (OTS) Features

Babenko Artem et al. Neural codes for image retrieval. Computer Vision–ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization

Accuracy: +5%
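A short illustration of the L2-normalization step applied to the CNN descriptors before the classifier (a sketch, not the paper's pipeline):

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    # scale each descriptor to unit Euclidean length before training the classifier
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)

X = np.random.rand(5, 4096)
print(np.linalg.norm(l2_normalize(X), axis=1))  # all ~1.0
```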

Three representative architectures considered: AlexNet, ZF, OverFeat.

Training time: 5 days (fast) to 3 weeks (slow) on an NVIDIA GTX Titan GPU.

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy: -2.5%

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2%

Size: x32 smaller

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al. Neural codes for image retrieval. Computer Vision–ECCV 2014. Springer International Publishing, 2014, 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network of Krizhevsky et al., pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al. Neural codes for image retrieval. Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful but not superior (w.r.t. FV, VLAD, Sparse Coding)

138 Babenko Artem et al. Neural codes for image retrieval. Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al. A baseline for visual instance retrieval with deep convolutional networks. ICLR 2015. 139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al. A baseline for visual instance retrieval with deep convolutional networks. ICLR 2015. 140

Spatial Search (extract N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints

Razavian et al. A baseline for visual instance retrieval with deep convolutional networks. ICLR 2015. 141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños M, Mestre R, Talavera E, Giró-i-Nieto X, Radeva P. Visual Summary of Egocentric Photostreams by Representative Keyframes. In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX), Turin, Italy, 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
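A minimal sketch of that step with scikit-learn, assuming the FC7 descriptors have already been extracted; the number of clusters and array sizes are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

fc7 = np.random.rand(500, 4096)            # one 4096-d FC7 descriptor per photo
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(fc7)

# pick the photo closest to each cluster centre as its representative keyframe
dists = kmeans.transform(fc7)              # (n_photos, n_clusters) Euclidean distances
keyframes = dists.argmin(axis=0)
print(keyframes)
```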

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 13: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

13

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-the-shelf features

Outline for Part I Image Analytics

E2E Classification Supervised learning

14

Manual Annotations Model

New ImageAutomatic

classification

Training

Test

Anchor

E2E Classification LeNet-5

15

LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo

E2E Classification Databases

18

Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 14: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification Supervised learning

14

Manual Annotations Model

New ImageAutomatic

classification

Training

Test

Anchor

E2E Classification LeNet-5

15

LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo

E2E Classification Databases

18

Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

"Is this a Border terrier?"

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsourcing loss (0.3%): more than 5 object classes

E2E Classification Humans

87 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

E2E Classification Humans

88 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

"Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?"

E2E Classification Humans

E2E Classification Humans

89 NVIDIA, "Mocha.jl: Deep Learning for Julia" (2015)

ResNet

90

Let's play a game

91

92

What have you seen?

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

97

Eye Tracker / Mouse Click

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN / VALIDATION / TEST

SALICON [Jiang'15]: 10,000 / 5,000 / 5,000

iSUN [Xu'15]: 6,000 / 926 / 2,000

CAT2000 [Borji'15]: 2,000 / - / 2,000

MIT300 [Judd'12]: 300 / - / -

(SALICON and iSUN are the large-scale datasets)

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

[Architecture figure: IMAGE INPUT (RGB, 96x96) -> ConvNet -> 2D map (2304 = 48x48) -> upsample + filter]

E2E Saliency JuntingNet

103

[Same architecture figure, highlighting the 3 CONV LAYERS: input 96x96, 2D map 2304 = 48x48, upsample + filter]

E2E Saliency JuntingNet

104

[Same architecture figure, highlighting the 2 DENSE LAYERS: input 96x96, 2D map 2304 = 48x48, upsample + filter]

E2E Saliency JuntingNet

105

[Same architecture figure, highlighting the final upsample + filter: input 96x96, 2D map 2304 = 48x48]
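
Putting the previous slides together, a rough sketch of such a network could look as follows. The slides only give the input size (96x96 RGB), the number of layers (3 convolutional, 2 dense) and the output (a 2304 = 48x48 map), so the filter sizes, channel counts, pooling and final sigmoid below are assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class JuntingNetSketch(nn.Module):
    """Sketch following the slides: 96x96 RGB input, 3 conv layers, 2 dense
    layers, output of 2304 = 48x48 saliency values. Filter sizes, channel
    counts, pooling and the final sigmoid are assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(inplace=True), nn.MaxPool2d(2),   # 96 -> 48
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),  # 48 -> 24
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2)) # 24 -> 12
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 12 * 12, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 48 * 48), nn.Sigmoid())

    def forward(self, x):
        # the 2304-dim output is reshaped into the 48x48 map; it is later
        # upsampled and filtered to the original image resolution
        return self.regressor(self.features(x)).view(-1, 1, 48, 48)
```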

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function: Mean Square Error (MSE)

Weight initialization: Gaussian distribution

Learning rate: 0.03 to 0.0001

Mini batch size: 128

Training time: 7h (SALICON), 4h (iSUN)

Acceleration: SGD + Nesterov momentum (0.9)

Regularisation: Maxout norm

GPU: NVidia GTX 980
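
A hedged sketch of this training setup, reusing the JuntingNetSketch class from the architecture sketch above; the data loader, learning-rate schedule and number of epochs are stand-ins, since the slides only report the loss, optimizer, batch size and total training time.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: the slides use SALICON / iSUN images and saliency maps instead.
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 3, 96, 96), torch.rand(256, 1, 48, 48)),
    batch_size=128, shuffle=True)           # mini batch size 128, as in the slides

model = JuntingNetSketch()                  # from the architecture sketch above
criterion = nn.MSELoss()                    # mean square error loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, nesterov=True)  # SGD + Nesterov momentum (0.9)
# decays the learning rate roughly from 0.03 towards 0.0001; the exact schedule
# is an assumption, the slides only give the two endpoints
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.75)

num_epochs = 20                             # assumption; only total training time is reported
for epoch in range(num_epochs):
    for images, saliency_maps in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), saliency_maps)
        loss.backward()
        optimizer.step()
    scheduler.step()
```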

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database.

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)
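
A minimal sketch of the usual fine-tuning recipe, assuming torchvision's pre-trained VGG-16 as the starting point (the slides do not prescribe a specific framework): freeze the early layers, replace the final classifier, and retrain only the new or unfrozen parameters.

```python
import torch.nn as nn
from torchvision import models

# Start from a network pre-trained on ImageNet (assumed example: VGG-16).
model = models.vgg16(pretrained=True)

# Freeze the early convolutional layers, which carry generic features...
for param in model.features.parameters():
    param.requires_grad = False

# ...and replace the last classification layer to match the new task,
# e.g. a hypothetical binary visual-sentiment label.
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 2)

# Only the new (and any unfrozen) parameters are then trained, typically with
# a smaller learning rate than when training from scratch.
```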

118

E2E Fine-tuning

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos, Victor, Xavier Giro-i-Nieto, and Brendan Jou, "From pixels to sentiments" (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive / True negative / False positive / False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A. Salvador, Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A., and Giró-i-Nieto, X., "Cultural Event Recognition with Visual ConvNets and Temporal Models", in CVPR ChaLearn Looking at People Workshop 2015, 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
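
For instance, a descriptor can be read off the penultimate fully connected layer of a pre-trained network. A sketch assuming torchvision's AlexNet (any pre-trained CNN would do):

```python
import torch
import torch.nn as nn
from torchvision import models

# Assumed example: AlexNet pre-trained on ImageNet; any pre-trained CNN works.
alexnet = models.alexnet(pretrained=True)
alexnet.eval()

# Keep everything up to the penultimate fully connected layer (an FC7-like code).
feature_extractor = nn.Sequential(
    alexnet.features, alexnet.avgpool, nn.Flatten(),
    *list(alexnet.classifier.children())[:-1])

with torch.no_grad():
    descriptor = feature_extractor(torch.randn(1, 3, 224, 224))  # stand-in image
print(descriptor.shape)  # torch.Size([1, 4096]); usable for retrieval, SVMs, clustering
```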

127

Off-The-Shelf (OTS) Features

Babenko, Artem, et al. Neural codes for image retrieval. Computer Vision - ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization

Accuracy: +5%

Three representative architectures considered: AlexNet, ZF, OverFeat

Training time: from 5 days (fast) to 3 weeks (slow) on an NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels (FK) vs. ConvNets (CNN)

Color vs. Gray Scale (GS): accuracy -2.5%

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2%

Size: x32 smaller

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. Neural codes for image retrieval. Computer Vision - ECCV 2014, Springer International Publishing, 2014, 584-599

135

OTS Retrieval

Oxford Buildings, INRIA Holidays, UKB

136

OTS Retrieval

Features pooled from the network of Krizhevsky et al., pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. Neural codes for image retrieval. Computer Vision - ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful but not superior (w.r.t. FV, VLAD, Sparse Coding)
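
A sketch of the retrieval pipeline these neural codes enable: L2-normalize the descriptors and rank the database by Euclidean distance to the query. The arrays below are random stand-ins for descriptors extracted as in the previous sketch.

```python
import numpy as np

# Stand-ins for neural codes extracted with a pre-trained CNN (see sketch above).
database = np.random.rand(1000, 4096).astype(np.float32)
query = np.random.rand(4096).astype(np.float32)

# L2-normalize and rank by Euclidean distance to the query.
database /= np.linalg.norm(database, axis=1, keepdims=True)
query /= np.linalg.norm(query)

distances = np.linalg.norm(database - query, axis=1)
ranking = np.argsort(distances)   # most similar database images first
print(ranking[:10])
```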

138 Babenko, Artem, et al. Neural codes for image retrieval. Computer Vision - ECCV 2014

OTS Retrieval FC layers

139. Razavian et al., A baseline for visual instance retrieval with deep convolutional networks, ICLR 2015

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

140. Razavian et al., A baseline for visual instance retrieval with deep convolutional networks, ICLR 2015

Spatial Search (extract N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints

141. Razavian et al., A baseline for visual instance retrieval with deep convolutional networks, ICLR 2015

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños, M., Mestre, R., Talavera, E., Giró-i-Nieto, X., Radeva, P. Visual Summary of Egocentric Photostreams by Representative Keyframes. In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015, Turin, Italy, 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
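
A possible sketch of that idea, with random stand-ins for the FC7 descriptors: cluster the photostream with k-means (which minimizes Euclidean distances) and keep the photo closest to each centroid as a keyframe. The number of clusters is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-ins for the FC7 descriptors of an egocentric photostream (one per photo).
fc7 = np.random.rand(500, 4096).astype(np.float32)

# Cluster the stream (k-means minimizes Euclidean distances to the centroids)
# and keep the photo closest to each centroid as a representative keyframe.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(fc7)
keyframes = [int(np.argmin(np.linalg.norm(fc7 - c, axis=1)))
             for c in kmeans.cluster_centers_]
print(keyframes)
```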

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 15: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification LeNet-5

15

LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo

E2E Classification Databases

18

Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 16: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification LeNet-5

16

Demo 3D Visualization of a Convolutional Neural Network

Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015

E2E Classification Similar to LeNet-5

17

Demo Classify MNIST digits with a Convolutional Neural Network

ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo

E2E Classification Databases

18

Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil
Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
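
As a hedged illustration of the crop-and-flip style of augmentation studied in this paper (the exact pipeline of the paper is not reproduced here), a torchvision transform chain could look like this; the 224x224 crop size and ImageNet normalization statistics are the usual choices, assumed here.

from torchvision import transforms

# Random crops and horizontal flips applied at training time.
augment = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])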

132

OTS Classification Return of devil

Fisher Kernels (FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy: -25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2

Size: x32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings, Inria Holidays, UKB

136

OTS Retrieval

Descriptors pooled from the network of Krizhevsky et al., pretrained on ImageNet images

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful, but not superior with respect to FV, VLAD and Sparse Coding
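
A minimal retrieval sketch with such neural codes, assuming NumPy and a precomputed matrix of per-image descriptors (all names here are illustrative): descriptors are L2-normalized and database images are ranked by cosine similarity to the query.

import numpy as np

def retrieve(query_code, db_codes, top_k=10):
    # query_code: (D,) descriptor; db_codes: (N, D) matrix, one row per database image.
    q = query_code / np.linalg.norm(query_code)
    db = db_codes / np.linalg.norm(db_codes, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity of each image to the query
    return np.argsort(-sims)[:top_k]   # indices of the most similar images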

138
Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015
139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015
140

Spatial Search (extracting N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints
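
A sketch of how a convolutional layer can be turned into a retrieval descriptor, assuming PyTorch/torchvision (the pooling choice is illustrative, not the exact method of the paper): the last convolutional feature map is max-pooled over all spatial locations; applying the same pooling over a grid of sub-windows would yield the N local descriptors used by the spatial search variant.

import torch
from torchvision import models

vgg = models.vgg16(pretrained=True).eval()

def conv_descriptor(batch):
    # batch: (N, 3, H, W) normalized images.
    with torch.no_grad():
        fmap = vgg.features(batch)       # (N, 512, h, w) maps from the last conv block
        desc = fmap.amax(dim=(2, 3))     # max over spatial locations -> one 512-D descriptor
    return desc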

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015
141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños M Mestre R Talavera E Giró-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
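
A hedged sketch of that selection step, assuming scikit-learn and an (n_frames, 4096) matrix of FC7 features (variable names and the number of clusters are illustrative): frames are clustered with k-means on Euclidean distance and the member closest to each centroid is kept as the representative keyframe.

import numpy as np
from sklearn.cluster import KMeans

def representative_keyframes(fc7_feats, n_keyframes=5):
    km = KMeans(n_clusters=n_keyframes, n_init=10, random_state=0).fit(fc7_feats)
    keyframes = []
    for c in range(n_keyframes):
        members = np.where(km.labels_ == c)[0]
        # Pick the frame whose FC7 descriptor is closest to the cluster centroid.
        dists = np.linalg.norm(fc7_feats[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(members[dists.argmin()]))
    return sorted(keyframes)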

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 19: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

19

E2E Classification Databases

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]


Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]


Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59 Movie: Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper: overfitting due to the increased number of parameters, and inefficient computation if most weights end up close to zero.

Solution? Sparsity. How? Inception modules.

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63 Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64 Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions do dimensionality reduction (c3 < c2) and account for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet
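A hedged sketch of such a module in PyTorch (the branch widths below are illustrative assumptions, not the exact GoogLeNet configuration):

```python
# Inception-style module: 1x1 convolutions reduce channels before the
# expensive 3x3 and 5x5 convolutions; a 3x3 max-pooling path runs in parallel.
import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    def __init__(self, c_in, c1, c3_red, c3, c5_red, c5, c_pool):
        super().__init__()
        self.branch1 = nn.Conv2d(c_in, c1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(c_in, c3_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(
            nn.Conv2d(c_in, c5_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(c_in, c_pool, kernel_size=1))

    def forward(self, x):
        # All branches preserve the spatial size, so their outputs concatenate.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

y = InceptionSketch(192, 64, 96, 128, 16, 32, 32)(torch.randn(1, 192, 28, 28))
print(y.shape)  # (1, 256, 28, 28)
```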

69 Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces some spatial invariance and has proven beneficial when added as an alternative parallel path.

E2E Classification GoogLeNet

71

Two softmax classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time, and no fully connected layers are needed.

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73 NVIDIA “NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge” (2015)

E2E Classification GoogLeNet

74 Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75 Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76 Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
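The standard argument for the 3x3 stacks can be checked with a quick parameter count (a sketch assuming C input and C output channels and no biases):

```python
# Two stacked 3x3 convolutions see a 5x5 receptive field, three see a 7x7,
# with fewer weights and extra non-linearities in between.
C = 256
params_5x5 = 5 * 5 * C * C               # one 5x5 layer
params_two_3x3 = 2 * (3 * 3 * C * C)     # two stacked 3x3 layers
params_7x7 = 7 * 7 * C * C               # one 7x7 layer
params_three_3x3 = 3 * (3 * 3 * C * C)   # three stacked 3x3 layers
print(params_two_3x3 / params_5x5)       # 0.72
print(params_three_3x3 / params_7x7)     # ~0.55
```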

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No pooling between some convolutional layers. Convolution stride of 1 (no skipping).

E2E Classification

79

3.6% top-5 error… with 152 layers

E2E Classification ResNet

80 He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 layers vs. 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves: training error. Bold curves: validation error.

E2E Classification ResNet

82

Residual learning: reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet
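A minimal sketch of a residual block in PyTorch (illustrative only; the paper's blocks also use batch normalization and projection shortcuts when dimensions change):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        # The stacked layers learn the residual F(x); the identity shortcut
        # adds the input back, so the block outputs F(x) + x.
        residual = self.conv2(F.relu(self.conv1(x)))
        return F.relu(residual + x)

print(ResidualBlock(64)(torch.randn(1, 64, 56, 56)).shape)  # (1, 64, 56, 56)
```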

83 He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

“Is this a Border terrier?”

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (0.3%). More than 5 object classes.

E2E Classification Humans

87 Andrej Karpathy “What I learned from competing against a computer on ImageNet” (2014)

Test data collection from one human [interface]

E2E Classification Humans

88 Andrej Karpathy “What I learned from competing against a computer on ImageNet” (2014)

Test data collection from one human [interface]

“Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?”

E2E Classification Humans

E2E Classification Humans

89 NVIDIA “Mocha.jl: Deep Learning for Julia” (2015)

ResNet

90

Let’s play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN / VALIDATION / TEST

SALICON (large scale): 10000 / 5000 / 5000

iSUN (large scale): 6000 / 926 / 2000

CAT2000 [Borji’15]: 2000 / - / 2000

MIT300 [Judd’12]: 300 / - / -

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 / 2304 = 48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 / 2304 = 48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 / 2304 = 48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 / 2304 = 48x48

E2E Saliency JuntingNet
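A sketch of this shallow architecture in PyTorch (the filter sizes, widths and the 4096-unit hidden layer are assumptions for illustration, not the exact JuntingNet configuration):

```python
# 96x96 RGB input -> 3 conv layers -> 2 dense layers -> 2304 = 48x48 saliency map.
import torch
import torch.nn as nn

class ShallowSaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 12 * 12, 4096), nn.ReLU(),
            nn.Linear(4096, 48 * 48))   # one value per pixel of the 48x48 map

    def forward(self, x):
        return self.regressor(self.features(x)).view(-1, 1, 48, 48)

print(ShallowSaliencyNet()(torch.randn(1, 3, 96, 96)).shape)  # (1, 1, 48, 48)
```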

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 0.03 to 0.0001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD + Nesterov momentum (0.9)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
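The training recipe above can be sketched as follows (PyTorch is an assumption, since the original model was trained with different tooling; the learning-rate decay schedule is also an assumption):

```python
# MSE loss, SGD with Nesterov momentum 0.9, mini-batches of 128,
# learning rate decayed from 0.03 towards 0.0001.
import torch

model = ShallowSaliencyNet()                # from the sketch above
criterion = torch.nn.MSELoss()              # Euclidean / mean square error loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

def train_step(images, saliency_maps):
    optimizer.zero_grad()
    loss = criterion(model(images), saliency_maps)
    loss.backward()
    optimizer.step()
    return loss.item()
```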

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database.

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)

Fine-tuning a pre-trained network
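A minimal sketch of the procedure (torchvision's ImageNet-pretrained VGG-16 and a two-class target task are assumptions here):

```python
# Load a pre-trained network, replace its last layer for the new task,
# and optionally freeze the early layers.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 2                                  # e.g. positive / negative sentiment
model = models.vgg16(pretrained=True)            # weights learned on ImageNet
model.classifier[6] = nn.Linear(4096, num_classes)

for param in model.features.parameters():        # freeze the convolutional layers
    param.requires_grad = False

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3, momentum=0.9)                       # small learning rate for fine-tuning
```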

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou “From pixels to sentiments” (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Orós A and Giró-i-Nieto X “Cultural Event Recognition with Visual ConvNets and Temporal Models” in CVPR ChaLearn Looking at People Workshop 2015, 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126 Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
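For instance, a hedged sketch of extracting fc7-style descriptors with torchvision's AlexNet and feeding them to a linear SVM (the layer choice and classifier are assumptions, in the spirit of the cited papers):

```python
import torch
from torchvision import models
from sklearn.svm import LinearSVC

alexnet = models.alexnet(pretrained=True).eval()

def fc7_descriptor(batch):
    """Return 4096-d activations of the penultimate fully connected layer."""
    with torch.no_grad():
        x = alexnet.features(batch)
        x = alexnet.avgpool(x).flatten(1)
        x = alexnet.classifier[:6](x)   # stop before the final 1000-way layer
    return x.numpy()

# descriptors = fc7_descriptor(images)
# classifier = LinearSVC().fit(descriptors, labels)   # reuse for any new task
```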

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization

Accuracy: +5%

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy: -2.5%

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2%

Size: x32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network of Krizhevsky et al., pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful but not superior (w.r.t. FV, VLAD, Sparse Coding)

138 Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015 139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers
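One simple way to turn a convolutional layer into a global image descriptor is to max-pool each feature map over its spatial dimensions (a sketch; the network and layer choice are assumptions rather than the exact setup of the paper):

```python
import torch
from torchvision import models

vgg = models.vgg16(pretrained=True).eval()

def conv_descriptor(batch):
    with torch.no_grad():
        fmaps = vgg.features(batch)      # (N, 512, H', W') conv feature maps
        desc = fmaps.amax(dim=(2, 3))    # spatial max-pooling -> (N, 512)
        return torch.nn.functional.normalize(desc, dim=1)  # L2-normalize
```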

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015 140

Spatial Search (extract N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015 141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños M Mestre R Talavera E Giró-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
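A sketch of that summarization step with k-means over the FC7 descriptors (scikit-learn and the keyframe-selection rule are assumptions, not the paper's exact method):

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_keyframes(fc7_features, n_keyframes=5):
    """Cluster FC7 descriptors and keep the frame closest to each centroid."""
    km = KMeans(n_clusters=n_keyframes, n_init=10).fit(fc7_features)
    keyframes = []
    for c in range(n_keyframes):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(fc7_features[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))
    return sorted(keyframes)
```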

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 20: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

20

Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]

E2E Classification Databases

205 scene classes (categories) Images

25M train 205k validation 41k test

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 21: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

21

E2E Classification ImageNet ILSRVC

1000 object classes (categories) Images

12 M train 100k test

E2E Classification ImageNet ILSRVC

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Predict 5 classes

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Orós A and Giró-i-Nieto X "Cultural Event Recognition with Visual ConvNets and Temporal Models" in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
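
A hedged sketch of that recipe: freeze an ImageNet-pretrained CNN, read out an intermediate activation (an fc7-style 4096-d vector here) and train any off-the-shelf classifier on top. torchvision and scikit-learn are assumptions for illustration, not the tooling used in the cited papers.

```python
# Off-the-shelf features: the pre-trained CNN is only used as a fixed feature extractor.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import LinearSVC

alexnet = models.alexnet(pretrained=True).eval()
preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def fc7(pil_image):
    x = preprocess(pil_image).unsqueeze(0)
    with torch.no_grad():
        conv = alexnet.features(x).flatten(1)            # (1, 256*6*6)
        return alexnet.classifier[:6](conv)[0].numpy()   # 4096-d fc7 activation

# e.g. X = numpy.stack([fc7(img) for img in train_images])
#      clf = LinearSVC().fit(X, train_labels)            # any task: VOC classes, scenes, ...
```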

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization

Accuracy: +5
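
The L2-normalization step is a one-liner over a matrix of descriptors (one row per image); the +5 figure is the paper's reported gain, not something this snippet reproduces.

```python
import numpy as np

def l2_normalize(X, eps=1e-10):
    # divide each descriptor (row) by its Euclidean norm before training the classifier
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
```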

Three representative architectures considered

AlexNet

ZF

OverFeat

Training time: 5 days (fast) to 3 weeks (slow) on an NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels (FK) vs. ConvNets (CNN), Color vs. Gray Scale (GS)

Accuracy: -2.5

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2

Size: ×32 smaller
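
A sketch of that idea under an assumed PyTorch/torchvision setup: shrink the last hidden fully connected layer to a small bottleneck (e.g. 128-d instead of 4096-d, a ×32 reduction) and retrain, accepting a small accuracy drop in exchange for compact descriptors.

```python
# Dimensionality reduction by retraining the last layers with a low-dimensional bottleneck.
import torch.nn as nn
import torchvision.models as models

model = models.alexnet(pretrained=True)
model.classifier[4] = nn.Linear(4096, 128)    # fc7 shrunk from 4096-d to a 128-d code (x32 smaller)
model.classifier[6] = nn.Linear(128, 1000)    # output layer rebuilt on top of the bottleneck
# retrain these layers (as in the fine-tuning sketch above); the 128-d activation
# then serves as the compact image descriptor.
```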

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings, INRIA Holidays, UKB

136

OTS Retrieval

Pooled from the network of Krizhevsky et al., pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful but not superior (w.r.t. FV, VLAD, Sparse Coding)
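
In practice the retrieval pipeline with such "neural codes" is very short: L2-normalize the fully connected activations and rank the database by dot product (cosine similarity) against the query. A minimal sketch, reusing the fc7 extractor assumed earlier:

```python
import numpy as np

def rank_database(query_desc, db_descs):
    # query_desc: (D,) fc descriptor; db_descs: (N, D) matrix of database descriptors
    q = query_desc / np.linalg.norm(query_desc)
    D = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    return np.argsort(-(D @ q))        # database indices, most similar first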

138 Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints
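
A sketch of both ideas (assumed torchvision setup, illustrative crop grid): spatially max-pool a convolutional feature map into one global descriptor, and optionally repeat the extraction over a few predefined crops, which is the spatial search that trades extra computation and memory for better retrieval.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

vgg = models.vgg16(pretrained=True).eval()

def conv_descriptor(x):                      # x: (1, 3, H, W), already preprocessed
    with torch.no_grad():
        fmap = vgg.features(x)               # (1, 512, H/32, W/32) conv feature map
        return F.normalize(F.adaptive_max_pool2d(fmap, 1).flatten(1), dim=1)  # (1, 512)

def spatial_search_descriptors(x, n=2):      # full image + n*n overlapping 2/3-size crops
    _, _, H, W = x.shape
    crops = [x] + [x[:, :, i*H//(n+1):(i+2)*H//(n+1), j*W//(n+1):(j+2)*W//(n+1)]
                   for i in range(n) for j in range(n)]
    return torch.cat([conv_descriptor(c) for c in crops])   # (1 + n*n, 512) local descriptors
```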

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños M Mestre R Talavera E Giró-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
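
A sketch of that step: cluster the fc7 descriptors of the photostream with a Euclidean-distance clustering (k-means is used here as a stand-in; the slide only fixes the distance and the features) and keep, per cluster, the image closest to the centre as the representative keyframe.

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_keyframes(fc7_descriptors, k=5):
    X = np.stack(fc7_descriptors)                         # (num_images, 4096)
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    keyframes = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))  # index of the image nearest the centre
    return sorted(keyframes)
```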

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 23: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

Slide credit Rob Fergus (NYU)

Image Classifcation 2012

-98

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23

E2E Classification ILSRVC

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 24: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification AlexNet (Supervision)

24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

Orange

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions perform dimensionality reduction (c3 < c2) and are each followed by a rectified linear unit (ReLU), adding non-linearity

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)
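
A quick parameter count shows why the 1x1 reduction (c3 < c2) pays off before a large convolution; the channel widths below are made-up examples, not the ones used in the papers.

# 5x5 convolution applied directly to 256 input channels, 128 outputs (biases ignored):
direct = 5 * 5 * 256 * 128                            # 819,200 weights

# Same input/output, but reducing first to 64 channels with a 1x1 convolution:
reduced = 1 * 1 * 256 * 64 + 5 * 5 * 64 * 128         # 221,184 weights

print(direct, reduced, round(direct / reduced, 1))    # ~3.7x fewer parameters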

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet
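
A hedged PyTorch sketch of such an Inception-style module, with the 1x1 reductions placed before the 3x3 and 5x5 convolutions and a parallel pooling path; the branch widths are illustrative, not GoogLeNet's exact ones.

import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    """Four parallel paths whose outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, c1, r3, c3, r5, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Sequential(            # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, r3, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(r3, c3, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(            # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, r5, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(r5, c5, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(        # parallel 3x3 max-pooling path
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
y = InceptionSketch(192, 64, 96, 128, 16, 32, 32)(x)
print(y.shape)  # torch.Size([1, 256, 28, 28])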

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces some spatial invariance and has proven beneficial when added as an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers are needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73 NVIDIA "NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge" (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks
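
The idea behind the 3x3 stacks (the figure is omitted here): two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, with fewer parameters and one extra non-linearity. A back-of-the-envelope count with an assumed channel width:

C = 256                            # example channel width, not VGG's exact numbers
two_3x3 = 2 * (3 * 3 * C * C)      # 1,179,648 weights
one_5x5 = 5 * 5 * C * C            # 1,638,400 weights
print(two_3x3 / one_5x5)           # 0.72 -> about 28% fewer parameters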

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers. Convolution strides of 1 (no skipping).

E2E Classification

79

3.6% top-5 error... with 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 layers vs 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves: training error. Bold curves: validation error.

E2E Classification ResNet

82

Residual learning: reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet
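
A hedged PyTorch sketch of that reformulation: the block only learns the residual F(x) and outputs F(x) + x through an identity shortcut; the depth and width here are illustrative, not ResNet's exact configuration.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(                     # the residual function F(x)
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)             # identity shortcut: F(x) + x

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])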

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]

85

E2E Classification Humans

"Is this a Border terrier?"

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]

Crowdsource loss (0.3%). More than 5 object classes

E2E Classification Humans

87 Andrej Karpathy "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

E2E Classification Humans

88 Andrej Karpathy "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

"Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?"

E2E Classification Humans

E2E Classification Humans

89 NVIDIA "Mocha.jl: Deep Learning for Julia" (2015)

ResNet

90

Let's play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

Dataset: TRAIN / VALIDATION / TEST

SALICON (large-scale): 10000 / 5000 / 5000

iSUN (large-scale): 6000 / 926 / 2000

CAT2000 [Borji'15]: 2000 / - / 2000

MIT300 [Judd'12]: 300 / - / -

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2304=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2304=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2304=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2304=48x48

E2E Saliency JuntingNet
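
Putting the previous slides together, a hedged PyTorch sketch of a shallow saliency network of this shape: 96x96 RGB input, 3 convolutional layers, 2 dense layers and a 2304-dimensional output reshaped to a 48x48 map. Filter sizes and layer widths are placeholders, not the exact values of the paper.

import torch
import torch.nn as nn

class ShallowSaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(              # 3 conv layers (placeholder widths)
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.regressor = nn.Sequential(             # 2 dense layers
            nn.Flatten(),
            nn.Linear(128 * 12 * 12, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 48 * 48), nn.Sigmoid())  # 2304 values in [0, 1]

    def forward(self, x):                           # x: (N, 3, 96, 96)
        return self.regressor(self.features(x)).view(-1, 1, 48, 48)

saliency = ShallowSaliencyNet()(torch.randn(2, 3, 96, 96))
print(saliency.shape)  # torch.Size([2, 1, 48, 48])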

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 0.03 to 0.0001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD + Nesterov momentum (0.9)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
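
The recipe above, transcribed as a hedged PyTorch training-loop sketch: MSE loss, SGD with Nesterov momentum 0.9 and a learning rate decayed from 0.03 towards 0.0001. The random tensors stand in for a mini-batch of 128 SALICON/iSUN image-map pairs, and ShallowSaliencyNet refers to the sketch a few slides above.

import torch
import torch.nn as nn

model = ShallowSaliencyNet()                      # sketch defined a few slides above
criterion = nn.MSELoss()                          # Mean Square Error loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

images = torch.randn(128, 3, 96, 96)              # stand-in mini-batch of 128 images
targets = torch.rand(128, 1, 48, 48)              # stand-in ground-truth saliency maps

for step in range(10):
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()                              # decays the learning rate each step
    print(step, loss.item())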

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database.

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part I': End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)
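
One common way to implement this recipe, sketched with torchvision as an example (this is an assumption, not the exact setup of the works that follow): load a network pre-trained on domain A (ImageNet), replace its last classifier layer for the new task, and optionally freeze the early layers so only the later ones are updated.

import torch.nn as nn
from torchvision import models

num_new_classes = 2                                  # e.g. positive / negative sentiment

model = models.vgg16(pretrained=True)                # weights learned on ImageNet (domain A)

for param in model.features.parameters():            # freeze the convolutional layers
    param.requires_grad = False

# Replace the last fully connected layer to match the new task (domain B).
model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_new_classes)

# Only the remaining trainable parameters are passed to the optimizer.
params_to_update = [p for p in model.parameters() if p.requires_grad]

Alternatively, all layers can be left trainable, with a smaller learning rate for the pre-trained ones.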

118

E2E Fine-tuning

Slide credit Victor Campos "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou "From pixels to sentiments" (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Orós A and Giró-i-Nieto X "Cultural Event Recognition with Visual ConvNets and Temporal Models" in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
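
A hedged sketch of the off-the-shelf recipe with torchvision's AlexNet as a stand-in: run the images through the pre-trained network, keep the activations of the penultimate fully connected layer as 4096-D descriptors, and feed them to any classical model (SVM, nearest neighbours, clustering).

import torch
import torch.nn as nn
from torchvision import models

alexnet = models.alexnet(pretrained=True).eval()

# Keep everything except the final classification layer -> 4096-D descriptors.
feature_extractor = nn.Sequential(alexnet.features, alexnet.avgpool, nn.Flatten(),
                                  *list(alexnet.classifier.children())[:-1])

with torch.no_grad():
    images = torch.randn(8, 3, 224, 224)             # stand-in for a real batch
    descriptors = feature_extractor(images)          # shape (8, 4096)

# The descriptors can now be reused for any task, e.g. an SVM or a retrieval index.
print(descriptors.shape)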

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier L2-normalization: Accuracy +5%

Three representative architectures considered: AlexNet (fast: 5 days of training), ZF, and OverFeat (slow: 3 weeks), on an NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels (FK) vs ConvNets (CNN); Color vs Gray Scale (GS): Accuracy -2.5%

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy -2% for a size ×32 smaller

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network of Krizhevsky et al, pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful, but not superior (w.r.t. FV, VLAD, Sparse Coding)

138 Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015 139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers
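
Whatever layer the descriptors come from, the retrieval step itself is simple. A minimal NumPy sketch, with random stand-ins for the query and database descriptors: L2-normalize and rank the database by cosine similarity.

import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

database = l2_normalize(np.random.randn(1000, 512))   # e.g. pooled conv-layer descriptors
query = l2_normalize(np.random.randn(512))

scores = database @ query                             # cosine similarity after L2 norm
ranking = np.argsort(-scores)                         # most similar database images first
print(ranking[:10])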

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015 140

Spatial Search (extract N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015 141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños M Mestre R Talavera E Giró-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
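
A hedged sketch of that summarization step with scikit-learn: k-means is used here as the clustering choice and the FC7 matrix is a random stand-in, so this illustrates the idea rather than the paper's exact pipeline.

import numpy as np
from sklearn.cluster import KMeans

fc7 = np.random.randn(500, 4096)                      # FC7 descriptors of 500 frames (stand-in)

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(fc7)

# Representative keyframe per cluster: the frame closest (Euclidean) to the centroid.
keyframes = []
for c in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    dists = np.linalg.norm(fc7[members] - kmeans.cluster_centers_[c], axis=1)
    keyframes.append(members[np.argmin(dists)])

print(sorted(keyframes))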

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giró-i-Nieto

Page 25: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 26: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network of Krizhevsky et al., pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful, but not superior to FV, VLAD or Sparse Coding

138 Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers
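A sketch of one common way to turn a convolutional feature map into a compact global descriptor for retrieval (spatial max-pooling), assuming PyTorch; the exact pooling and whitening choices in these papers differ:

import torch
import torchvision.models as models

vgg = models.vgg16(pretrained=True).features.eval()   # convolutional part only

def conv_descriptor(batch):
    # batch: (N, 3, H, W); returns one 512-d vector per image by
    # max-pooling the last conv feature map over all spatial positions.
    with torch.no_grad():
        fmap = vgg(batch)                  # (N, 512, H', W')
        return fmap.amax(dim=(2, 3))       # (N, 512)

desc = conv_descriptor(torch.randn(1, 3, 224, 224))
print(desc.shape)                          # torch.Size([1, 512])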

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extracting N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños M, Mestre R, Talavera E, Giró-i-Nieto X, Radeva P. Visual Summary of Egocentric Photostreams by Representative Keyframes. In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015, Turin, Italy, 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
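A hypothetical sketch with scikit-learn, assuming the FC7 descriptors have already been extracted for every image of the photostream:

import numpy as np
from sklearn.cluster import KMeans

# fc7: (num_frames, 4096) array of AlexNet FC7 descriptors, one row per image
fc7 = np.random.rand(500, 4096).astype(np.float32)

kmeans = KMeans(n_clusters=8, random_state=0).fit(fc7)   # Euclidean distance

# Pick as keyframe the image closest to each cluster centre
keyframes = []
for c in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    dists = np.linalg.norm(fc7[members] - kmeans.cluster_centers_[c], axis=1)
    keyframes.append(members[np.argmin(dists)])
print(sorted(keyframes))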

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 27: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

27Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

Rectified Linear Unit (non-linearity)

f(x) = max(0, x)

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)
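In code, the ReLU is a single element-wise operation; a NumPy illustration:

import numpy as np

def relu(x):
    return np.maximum(0, x)   # f(x) = max(0, x), applied element-wise

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))   # [0. 0. 0. 1.5 3.]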

31

Dot Product

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D & Fergus R (2014) Visualizing and understanding convolutional networks In Computer Vision–ECCV 2014 (pp 818-833) Springer International Publishing

"A convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite"

Zeiler Matthew D, Graham W Taylor and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D & Fergus R (2014) Visualizing and understanding convolutional networks In Computer Vision–ECCV 2014 (pp 818-833) Springer International Publishing

DeconvNet / ConvNet

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

"To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer"

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

"(i) Unpool: In the convnet, the max pooling operation is non-invertible; however, we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables"

41

E2E Classification Visualize ZF
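A minimal PyTorch sketch of the switch idea (using the modern max_pool2d / max_unpool2d API, not the authors' original code):

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 4)

# Forward: max pooling, remembering where each maximum came from ("switches")
pooled, switches = F.max_pool2d(x, kernel_size=2, return_indices=True)

# Approximate inverse: place each pooled value back at its recorded location,
# filling the rest of the pooling region with zeros
unpooled = F.max_unpool2d(pooled, switches, kernel_size=2)

print(x.shape, pooled.shape, unpooled.shape)   # (1,1,4,4) (1,1,2,2) (1,1,4,4)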

XX

"(ii) Rectification: The convnet uses ReLU non-linearities, which rectify the feature maps thus ensuring the feature maps are always positive"

42

E2E Classification Visualize ZF

"(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally"

XX

XX

43

E2E Classification Visualize ZF

"(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally"

XX

XX

44

E2E Classification Visualize ZF
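A PyTorch sketch of this filtering step: a transposed convolution with the same weight tensor applies the (flipped) filters back towards pixel space; purely illustrative, with random filters:

import torch
import torch.nn.functional as F

image = torch.randn(1, 3, 8, 8)
weight = torch.randn(16, 3, 3, 3)                  # 16 learned 3x3 filters

# Convnet filtering step
fmap = F.conv2d(image, weight, padding=1)          # (1, 16, 8, 8)

# Deconvnet filtering step: transposed convolution with the *same* filters,
# equivalent to convolving with vertically/horizontally flipped copies
recon = F.conv_transpose2d(fmap, weight, padding=1)   # back to (1, 3, 8, 8)
print(recon.shape)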

45

Top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach

Corresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D & Fergus R (2014) Visualizing and understanding convolutional networks In Computer Vision–ECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D & Fergus R (2014) Visualizing and understanding convolutional networks In Computer Vision–ECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) result in more distinctive features and fewer "dead" features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50%) of the intermediate neurons

Hinton G E, Srivastava N, Krizhevsky A, Sutskever I & Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv:1207.0580


E2E Classification Dropout ZF
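A tiny NumPy sketch of (inverted) dropout at training time with a 50% drop probability; at test time the layer is just the identity:

import numpy as np

def dropout(activations, p_drop=0.5, rng=None):
    # Zero each unit with probability p_drop; rescale the survivors so the
    # expected activation stays the same (inverted dropout).
    rng = rng or np.random.default_rng(0)
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones((2, 8))
print(dropout(h))   # roughly half the entries are 0, the rest are 2.0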

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]


Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]


Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper: overfitting, due to the increased number of parameters; inefficient computation, if most weights end up close to zero

Solution? Sparsity

How? Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions do dimensionality reduction (c3 < c2) and account for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces some spatial invariance and has proven beneficial by adding an alternative parallel path

E2E Classification GoogLeNet
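A compact PyTorch sketch of an Inception-style module with made-up channel sizes: 1x1 convolutions reduce depth before the 3x3 and 5x5 branches, and the 3x3 max-pooling path runs in parallel:

import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    def __init__(self, c_in):
        super().__init__()
        self.branch1 = nn.Conv2d(c_in, 64, kernel_size=1)        # 1x1
        self.branch3 = nn.Sequential(                            # 1x1 reduce, then 3x3
            nn.Conv2d(c_in, 96, kernel_size=1), nn.ReLU(),
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(                            # 1x1 reduce, then 5x5
            nn.Conv2d(c_in, 16, kernel_size=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(                        # 3x3 max pool, then 1x1
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(c_in, 32, kernel_size=1))

    def forward(self, x):
        outs = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat(outs, dim=1)    # concatenate along the channel axis

y = InceptionSketch(192)(torch.randn(1, 192, 28, 28))
print(y.shape)    # torch.Size([1, 256, 28, 28])  (64+128+32+32 channels)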

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

...and no fully connected layers are needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73 NVIDIA, "NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge" (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No pooling between some convolutional layers. Convolution strides of 1 (no skipping)
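A sketch of the 3x3-stack idea, assuming PyTorch: two stacked 3x3 convolutions cover a 5x5 receptive field with fewer parameters and one extra non-linearity:

import torch.nn as nn

channels = 64

# Two stacked 3x3 convolutions (stride 1, no pooling in between): units of the
# second layer have a 5x5 receptive field on the input of the block.
stack_3x3 = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
)

single_5x5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stack_3x3), count(single_5x5))   # 73856 vs 102464 parameters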

E2E Classification

79

3.6% top-5 error... with 152 layers

E2E Classification ResNet

80 He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv:1512.03385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 layers vs 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv:1512.03385 (2015) [slides]

Thin curves: training error. Bold curves: validation error

E2E Classification ResNet

82

Residual learning: reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv:1512.03385 (2015) [slides]

E2E Classification ResNet
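A minimal PyTorch sketch of a residual block: the stacked layers learn F(x) and the block outputs F(x) + x through the identity shortcut:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(residual + x)   # F(x) + x: only the residual is learned

out = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
print(out.shape)   # torch.Size([1, 64, 56, 56])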

83 He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv:1512.03385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]

85

E2E Classification Humans

"Is this a Border terrier?"

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]

Crowdsource loss (0.3%): more than 5 object classes

E2E Classification Humans

87 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

E2E Classification Humans

88 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

"Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?"

E2E Classification Humans

E2E Classification Humans

89 NVIDIA, "Mocha.jl: Deep Learning for Julia" (2015)

ResNet

90

Let's play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel O'Connor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel O'Connor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

Dataset              TRAIN   VALIDATION   TEST
SALICON              10000   5000         5000
iSUN                 6000    926          2000
CAT2000 [Borji'15]   2000    -            2000
MIT300 [Judd'12]     300     -            -

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel O'Connor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2304 = 48x48

IMAGE INPUT (RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2304 = 48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2304 = 48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2304 = 48x48

E2E Saliency JuntingNet
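A rough PyTorch sketch in the spirit of this shallow network (the exact kernel sizes and channel counts of JuntingNet differ; only the 96x96 RGB input, the 3 conv + 2 dense structure and the 48x48 = 2304 output are mirrored):

import torch
import torch.nn as nn

class ShallowSaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(                       # 3 conv layers
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.regressor = nn.Sequential(                      # 2 dense layers
            nn.Flatten(),
            nn.Linear(128 * 12 * 12, 4096), nn.ReLU(),
            nn.Linear(4096, 48 * 48),                        # 2304 saliency values
        )

    def forward(self, x):                                    # x: (N, 3, 96, 96)
        sal = self.regressor(self.features(x))
        return sal.view(-1, 1, 48, 48)                       # coarse 2D saliency map

print(ShallowSaliencyNet()(torch.randn(1, 3, 96, 96)).shape)  # (1, 1, 48, 48)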

106

E2E Saliency JuntingNet

107

Loss function: Mean Square Error (MSE)

Weight initialization: Gaussian distribution

Learning rate: 0.03 to 0.0001

Mini batch size: 128

Training time: 7h (SALICON), 4h (iSUN)

Acceleration: SGD + Nesterov momentum (0.9)

Regularisation: Maxout norm

GPU: NVidia GTX 980
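Most of this recipe maps onto a few lines of (hypothetical, PyTorch) code; the learning-rate decay from 0.03 to 0.0001 would be handled by a scheduler:

import torch

model = torch.nn.Linear(10, 1)                    # stand-in for the saliency net
criterion = torch.nn.MSELoss()                    # Euclidean / MSE loss
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.03, momentum=0.9, nesterov=True)

x, target = torch.randn(128, 10), torch.randn(128, 1)   # mini-batch of 128
optimizer.zero_grad()
loss = criterion(model(x), target)
loss.backward()
optimizer.step()
print(loss.item())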

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel O'Connor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel O'Connor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel O'Connor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel O'Connor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel O'Connor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel O'Connor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel O'Connor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel O'Connor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel O'Connor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part I': End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)
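A generic fine-tuning sketch, assuming torchvision rather than the Caffe setup used in these works: load weights learned on domain A, replace the last layer for domain B, and train with a small learning rate, optionally freezing the early layers:

import torch
import torch.nn as nn
import torchvision.models as models

num_classes_B = 2                                    # e.g. positive / negative sentiment

model = models.vgg16(pretrained=True)                # weights learned on domain A (ImageNet)
for p in model.features.parameters():
    p.requires_grad = False                          # optionally freeze the conv layers

# Replace the 1000-way ImageNet classifier with a layer sized for domain B
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, num_classes_B)

# Small learning rate so the pre-trained weights are only gently adjusted
optimizer = torch.optim.SGD(filter(lambda p: p.requires_grad, model.parameters()),
                            lr=1e-3, momentum=0.9)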

118

E2E Fine-tuning

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 28: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

28Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 29: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

29Image credit Deep learning Tutorial (Stanford University)

E2E Classification AlexNet (Supervision)

30

RectifiedLinearUnit (non-linearity)

f(x) = max(0x)

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

31

Dot Product

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)

E2E Classification AlexNet (Supervision)

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2304=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2304=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2304=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2304=48x48
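The pipeline above (96x96 RGB input, 3 convolutional layers, 2 dense layers, a 2304-unit output reshaped into a 48x48 map that is then upsampled and filtered back to image size) can be sketched as a small network. The filter counts, kernel sizes and output non-linearity below are illustrative placeholders, not the published configuration:

```python
import torch
import torch.nn as nn

class ShallowSaliencyNet(nn.Module):
    """96x96 RGB -> 3 conv layers -> 2 dense layers -> 48x48 saliency map.
    Widths and kernels are assumptions for illustration only."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),   # 96 -> 48
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 48 -> 24
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 4608), nn.ReLU(),
            nn.Linear(4608, 48 * 48), nn.Sigmoid(),   # 2304 outputs = 48x48 map in [0, 1]
        )

    def forward(self, x):
        return self.regressor(self.features(x)).view(-1, 1, 48, 48)

print(ShallowSaliencyNet()(torch.randn(2, 3, 96, 96)).shape)  # (2, 1, 48, 48)
```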

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function: Mean Square Error (MSE)

Weight initialization: Gaussian distribution

Learning rate: 0.03 to 0.0001

Mini batch size: 128

Training time: 7h (SALICON), 4h (iSUN)

Acceleration: SGD + Nesterov momentum (0.9)

Regularisation: Maxout norm

GPU: NVidia GTX 980
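A sketch of this training setup in PyTorch, for illustration only: it reproduces the MSE loss, the batch size and the SGD + Nesterov momentum (0.9) configuration from the list above; the stand-in model, the learning-rate decay schedule, and the omitted Gaussian initialization and norm constraint are assumptions.

```python
import torch

# stand-in model: any network mapping a 96x96 RGB batch to a 48x48 map would do
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 96 * 96, 48 * 48))
criterion = torch.nn.MSELoss()                       # pixel-wise mean square error
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, nesterov=True)
# the slides only give the learning-rate range (0.03 -> 0.0001);
# the exponential decay below is an assumed schedule
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

def train_step(images, target_maps):                 # mini-batches of size 128
    optimizer.zero_grad()
    loss = criterion(model(images).view(-1, 48 * 48),
                     target_maps.view(-1, 48 * 48))
    loss.backward()
    optimizer.step()
    return loss.item()
```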

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database.

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part I': End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

118

E2E Fine-tuning

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network
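In code, fine-tuning amounts to loading the pre-trained weights, replacing the task-specific output layer, and continuing training, usually with a smaller learning rate on the pre-trained layers. A hedged torchvision sketch: the 2-way sentiment head and the layer-wise learning rates are illustrative choices, not the exact recipe of the works cited below.

```python
import torch
import torchvision

# network pre-trained on domain A (ImageNet classification)
model = torchvision.models.vgg16(pretrained=True)

# surgery: replace the 1000-way ImageNet classifier with a head for domain B
# (e.g. 2-way visual sentiment prediction -- a hypothetical target task)
model.classifier[6] = torch.nn.Linear(4096, 2)

# fine-tune: small learning rate on the pre-trained layers, larger on the new head
optimizer = torch.optim.SGD([
    {"params": model.features.parameters(),   "lr": 1e-4},
    {"params": model.classifier.parameters(), "lr": 1e-3},
], momentum=0.9)
```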

119

E2E Fine-tuning Sentiments

CNN

Campos, Victor, Amaia Salvador, Xavier Giro-i-Nieto, and Brendan Jou. "Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction." In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia, pp. 57-62. ACM 2015

120

Campos, Victor, Xavier Giro-i-Nieto, and Brendan Jou. "From pixels to sentiments" (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A. Salvador, Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A., and Giró-i-Nieto, X., "Cultural Event Recognition with Visual ConvNets and Temporal Models", in CVPR ChaLearn Looking at People Workshop 2015, 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126. Razavian, Ali, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. "CNN features off-the-shelf: an astounding baseline for recognition." CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
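For example, a pre-trained network can be frozen and used only to read out an intermediate activation as a descriptor. A sketch with torchvision's AlexNet reading out FC7 (illustrative; not the exact pipeline of the papers cited around here):

```python
import torch
import torchvision

# pre-trained CNN used only as a fixed feature extractor (no further training)
cnn = torchvision.models.alexnet(pretrained=True).eval()

def fc7_features(batch):
    """Return 4096-d FC7 activations to be used as off-the-shelf descriptors."""
    with torch.no_grad():
        x = cnn.avgpool(cnn.features(batch)).flatten(1)
        # run the classifier only up to (and including) the FC7 layer
        for layer in list(cnn.classifier)[:-1]:
            x = layer(x)
    return x

print(fc7_features(torch.randn(4, 3, 224, 224)).shape)  # (4, 4096)
```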

127

Off-The-Shelf (OTS) Features

Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization

Accuracy: +5
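The L2-normalization step itself is a one-line operation on the descriptors (a small sketch):

```python
import torch

def l2_normalize(descriptors, eps=1e-12):
    """Scale each CNN descriptor to unit Euclidean norm before the classifier."""
    return descriptors / descriptors.norm(dim=1, keepdim=True).clamp_min(eps)

x = torch.randn(4, 4096)
print(l2_normalize(x).norm(dim=1))   # all ones
```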

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels (FK) vs. ConvNets (CNN); Gray Scale (GS) vs. Color: Accuracy -2.5

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2

Size: x32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014. Springer International Publishing, 2014. 584-599

135

OTS Retrieval

Oxford Buildings, Inria Holidays, UKB

136

OTS Retrieval

Pooled from the network of Krizhevsky et al., pretrained with images from ImageNet.
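Retrieval with such "neural codes" then reduces to nearest-neighbour search: L2-normalize the descriptors and rank the database by distance to the query. A sketch (the exact distance and post-processing choices vary across the cited papers):

```python
import torch

def rank_by_neural_codes(query_code, database_codes):
    """Rank database images by Euclidean distance between L2-normalised
    CNN descriptors ("neural codes"); smallest distance first."""
    q = query_code / query_code.norm()
    db = database_codes / database_codes.norm(dim=1, keepdim=True)
    distances = (db - q).norm(dim=1)
    return torch.argsort(distances)

database = torch.randn(1000, 4096)   # e.g. FC7 codes of the image collection
query = torch.randn(4096)
print(rank_by_neural_codes(query, database)[:10])   # top-10 retrieved indices
```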

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful but not superior (w.r.t. FV, VLAD, Sparse Coding)

138. Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al. "A baseline for visual instance retrieval with deep convolutional networks." ICLR 2015. 139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al. "A baseline for visual instance retrieval with deep convolutional networks." ICLR 2015. 140

Spatial Search (extract N local descriptors from predefined locations) increases performance at a computational cost.

Medium memory footprints
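A minimal sketch of both ideas: a global descriptor obtained by spatially max-pooling the last convolutional feature maps, and a spatial-search variant that extracts one such descriptor per predefined crop. The pooling operator and the crop grid are illustrative assumptions, not the exact recipe of Razavian et al.

```python
import torch
import torchvision

cnn = torchvision.models.alexnet(pretrained=True).eval()

def conv_descriptor(batch):
    """Global descriptor from the last conv layer: spatial max-pooling
    over the feature maps (one simple variant, chosen for illustration)."""
    with torch.no_grad():
        maps = cnn.features(batch)          # (N, 256, 6, 6) for 224x224 input
    return maps.amax(dim=(2, 3))            # (N, 256)

def spatial_search_descriptors(image, crops):
    """Descriptors for N predefined crops of one image (spatial search).
    Each crop is given as (top, bottom, left, right) pixel coordinates."""
    return torch.cat([conv_descriptor(image[:, :, t:b, l:r])
                      for (t, b, l, r) in crops])

img = torch.randn(1, 3, 224, 224)
print(conv_descriptor(img).shape)                                   # (1, 256)
print(spatial_search_descriptors(img, [(0, 224, 0, 224),
                                       (0, 128, 0, 128)]).shape)    # (2, 256)
```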

Razavian et al. "A baseline for visual instance retrieval with deep convolutional networks." ICLR 2015. 141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños, M., Mestre, R., Talavera, E., Giró-i-Nieto, X., Radeva, P. "Visual Summary of Egocentric Photostreams by Representative Keyframes." In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015, Turin, Italy, 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
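A sketch of that summarization step: cluster the FC7 descriptors of the photostream and keep, for each cluster, the frame closest to the centroid. The clustering algorithm (k-means) and the per-cluster selection rule are illustrative assumptions, not necessarily the paper's exact choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_keyframes(fc7, n_clusters=5):
    """Pick one representative frame per cluster of FC7 descriptors:
    cluster in Euclidean space, keep the frame closest to each centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(fc7)
    keyframes = []
    for c, centroid in enumerate(km.cluster_centers_):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(fc7[members] - centroid, axis=1)
        keyframes.append(int(members[np.argmin(dists)]))
    return sorted(keyframes)

fc7 = np.random.rand(200, 4096).astype(np.float32)   # one descriptor per frame
print(representative_keyframes(fc7))
```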

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 32: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Slide credit Rob Fergus (NYU)

32

E2E Classification ImageNet ILSRVC

The development of better convnets is reduced to trial-and-error

33

E2E Classification Visualize ZF

Visualization can help in proposing better architectures

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo

Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011

34

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing

Dec

onvN

etC

onvN

et

35

E2E Classification Visualize ZF

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces some spatial invariance and has proven beneficial when added as an alternative parallel path.
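Putting the branches together, a toy Inception-style block in PyTorch (an illustrative sketch with made-up channel counts, not the exact GoogLeNet code): 1x1 convolutions reduce channels before the 3x3 and 5x5 paths, and the parallel 3x3 max-pooling path is concatenated with the other branches.

```python
import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    def __init__(self, c_in, c1, c3_red, c3, c5_red, c5, c_pool):
        super().__init__()
        # 1x1 branch.
        self.branch1 = nn.Conv2d(c_in, c1, kernel_size=1)
        # 1x1 reduction before the expensive 3x3 convolution.
        self.branch3 = nn.Sequential(
            nn.Conv2d(c_in, c3_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1))
        # 1x1 reduction before the expensive 5x5 convolution.
        self.branch5 = nn.Sequential(
            nn.Conv2d(c_in, c5_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2))
        # Parallel 3x3 max-pooling path, followed by a 1x1 projection.
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(c_in, c_pool, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# Example: 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels.
block = InceptionSketch(192, 64, 96, 128, 16, 32, 32)
y = block(torch.randn(1, 192, 28, 28))   # -> (1, 256, 28, 28)
```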

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

...and no fully connected layers are needed.

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73 NVIDIA, "NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge" (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No pooling between some convolutional layers. Convolution stride of 1 (no skipping).
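As a hedged PyTorch illustration of the VGG design choice (not the paper's code): two stacked 3x3 convolutions with stride 1 and no pooling in between cover the same 5x5 receptive field as a single 5x5 convolution, with fewer parameters and an extra non-linearity.

```python
import torch.nn as nn

c = 64  # illustrative channel count
# Stack of two 3x3 convolutions, stride 1, ReLU in between.
stack_3x3 = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(c, c, kernel_size=3, padding=1), nn.ReLU(inplace=True))
# Single 5x5 convolution with the same receptive field.
single_5x5 = nn.Conv2d(c, c, kernel_size=5, padding=2)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(stack_3x3), params(single_5x5))   # 2*(9*c*c + c) vs 25*c*c + c
```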

E2E Classification

79

3.6% top-5 error... with 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 layers vs. 18) are more difficult to train.

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves: training error. Bold curves: validation error.

E2E Classification ResNet

82

Residual learning: reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
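A minimal PyTorch sketch of a basic residual block (illustrative only; `BasicResidualBlock` and its layer sizes are assumptions, not the paper's exact configuration): the stacked layers learn F(x) and the block returns F(x) + x through an identity shortcut.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # The "residual branch" F(x): two 3x3 convolutions with batch norm.
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Output is the residual plus the identity shortcut.
        return self.relu(self.body(x) + x)
```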

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

"Is this a Border terrier?"

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

E2E Classification Humans

88 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

"Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?"

E2E Classification Humans

E2E Classification Humans

89 NVIDIA, "Mocha.jl: Deep Learning for Julia" (2015)

ResNet

90

Let's play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

Dataset: TRAIN / VALIDATION / TEST

SALICON: 10000 / 5000 / 5000

iSUN: 6000 / 926 / 2000

CAT2000 [Borji'15]: 2000 / - / 2000

MIT300 [Judd'12]: 300 / - / -

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 → 2304 = 48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 → 2304 = 48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 → 2304 = 48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 → 2304 = 48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function: Mean Square Error (MSE)

Weight initialization: Gaussian distribution

Learning rate: 0.03 to 0.0001

Mini batch size: 128

Training time: 7h (SALICON), 4h (iSUN)

Acceleration: SGD + Nesterov momentum (0.9)

Regularisation: Maxout norm

GPU: NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
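A rough PyTorch re-implementation sketch of the shallow network and training recipe summarized above (the original was not written in PyTorch, and the filter sizes and channel counts here are assumptions rather than the paper's exact values):

```python
import torch
import torch.nn as nn

class ShallowSaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # 3 convolutional layers, each followed by ReLU and 2x2 max pooling.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2))
        # 2 dense layers that regress the 48x48 = 2304 saliency values.
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 12 * 12, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 48 * 48), nn.Sigmoid())

    def forward(self, x):                          # x: (N, 3, 96, 96) RGB input
        return self.regressor(self.features(x)).view(-1, 1, 48, 48)

model = ShallowSaliencyNet()
criterion = nn.MSELoss()                           # pixel-wise MSE, as above
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,  # initial lr; decayed to 1e-4 (not shown)
                            momentum=0.9, nesterov=True)
saliency = model(torch.randn(2, 3, 96, 96))        # -> (2, 1, 48, 48) maps
```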

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database.

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part I': End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)
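A hedged sketch of the recipe with a modern PyTorch/torchvision API (shown only for illustration, not the setup used in these works): load an ImageNet pre-trained model, replace its last layer for the new task, and give the pre-trained layers a smaller learning rate than the new head.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.vgg16(weights="IMAGENET1K_V1")       # pre-trained on ImageNet
num_new_classes = 2                                  # e.g., positive / negative sentiment
model.classifier[6] = nn.Linear(4096, num_new_classes)   # new last layer for the target task

optimizer = optim.SGD([
    # Gentle updates for the pre-trained convolutional layers.
    {"params": model.features.parameters(), "lr": 1e-4},
    # The new head uses the (larger) base learning rate below.
    {"params": model.classifier.parameters()},
], lr=1e-3, momentum=0.9)
```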

118

E2E Fine-tuning

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos, Victor, Xavier Giro-i-Nieto, and Brendan Jou, "From pixels to sentiments" (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

Salvador, A., Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A. and Giró-i-Nieto, X., "Cultural Event Recognition with Visual ConvNets and Temporal Models", in CVPR ChaLearn Looking at People Workshop 2015, 2015. [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
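For example, a hedged sketch using a modern torchvision API (not tied to any of the cited papers): take the FC7 activations of a pre-trained AlexNet as a 4096-dimensional descriptor and compare images by cosine similarity.

```python
import torch
import torch.nn.functional as F
from torchvision import models

cnn = models.alexnet(weights="IMAGENET1K_V1")
# Everything up to FC7 (the second 4096-d fully connected layer + ReLU).
extractor = torch.nn.Sequential(
    cnn.features, cnn.avgpool, torch.nn.Flatten(),
    *list(cnn.classifier)[:6]).eval()

with torch.no_grad():
    feats = extractor(torch.randn(2, 3, 224, 224))   # two images -> (2, 4096)
    feats = F.normalize(feats, dim=1)                # L2-normalize descriptors
    similarity = feats @ feats.t()                   # cosine similarities
```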

127

Off-The-Shelf (OTS) Features

Babenko, Artem, et al. Neural codes for image retrieval. Computer Vision–ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization

Accuracy: +5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy: -2.5

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2

Size: x32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. Neural codes for image retrieval. Computer Vision–ECCV 2014. Springer International Publishing, 2014. 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network of Krizhevsky et al., pretrained with images from ImageNet.

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. Neural codes for image retrieval. Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful, but not superior with respect to FV, VLAD or Sparse Coding.

138 Babenko, Artem, et al. Neural codes for image retrieval. Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptors from predefined locations) increases performance at a computational cost.

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños, M., Mestre, R., Talavera, E., Giró-i-Nieto, X., Radeva, P. Visual Summary of Egocentric Photostreams by Representative Keyframes. In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015, Turin, Italy, 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
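A possible sketch of that pipeline, assuming the FC7 descriptors are already extracted into an (N, 4096) array; `representative_keyframes` is a hypothetical helper, not the authors' code:

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_keyframes(fc7_features, n_keyframes=5):
    # Cluster the descriptors with k-means (Euclidean distance).
    kmeans = KMeans(n_clusters=n_keyframes, n_init=10).fit(fc7_features)
    # Distances of every image to every centroid: (N, k).
    distances = kmeans.transform(fc7_features)
    # Keep the image closest to each centroid as the cluster's keyframe.
    return [int(np.argmin(distances[:, k])) for k in range(n_keyframes)]

keyframe_ids = representative_keyframes(np.random.rand(200, 4096).astype(np.float32))
```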

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto


Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 36: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

36

E2E Classification Visualize ZF

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 37: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

37

E2E Classification Visualize ZF

38

E2E Classification Visualize ZF

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños M, Mestre R, Talavera E, Giró-i-Nieto X, Radeva P. Visual Summary of Egocentric Photostreams by Representative Keyframes. In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015, Turin, Italy, 2015.

Clustering based on Euclidean distance over FC7 features from AlexNet
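A minimal sketch of this summarization step, assuming the FC7 descriptors are already extracted (e.g. with the hypothetical fc7_descriptor helper above): k-means with Euclidean distance, keeping per cluster the image closest to the centroid as the representative keyframe.

# Minimal sketch of the summarization step: k-means (Euclidean distance) over
# FC7 descriptors, keeping per cluster the image closest to the centroid.
import numpy as np
from sklearn.cluster import KMeans

def representative_keyframes(descriptors, n_keyframes=5):
    km = KMeans(n_clusters=n_keyframes, n_init=10, random_state=0).fit(descriptors)
    keyframes = []
    for c in range(n_keyframes):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(descriptors[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))
    return sorted(keyframes)

descriptors = np.random.rand(200, 4096)   # stand-in for real FC7 features
print(representative_keyframes(descriptors))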

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 38: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

38

E2E Classification Visualize ZF

"To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer."

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

"(i) Unpool: In the convnet, the max pooling operation is non-invertible; however, we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables."
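The "switches" are exactly the argmax locations of the forward max pooling. A minimal PyTorch sketch of the mechanism (an illustration, not the original implementation):

# Minimal sketch of "unpooling with switches": keep the argmax locations from the
# forward max-pool and use them to place values back when inverting.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.rand(1, 3, 8, 8)           # a toy feature map
pooled, switches = pool(x)           # switches record where each maximum came from
approx = unpool(pooled, switches)    # approximate inverse: maxima restored, rest set to zero
print(pooled.shape, approx.shape)    # (1, 3, 4, 4) and (1, 3, 8, 8)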

41

E2E Classification Visualize ZF

"(ii) Rectification: The convnet uses ReLU non-linearities, which rectify the feature maps thus ensuring the feature maps are always positive."

42

E2E Classification Visualize ZF

"(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally."

43

E2E Classification Visualize ZF

"(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally."
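A minimal sketch of the filtering step: feeding the rectified maps through a transposed convolution that reuses the forward weights, which is equivalent to convolving with the same filters flipped vertically and horizontally (illustrative PyTorch, not the original code).

# Minimal sketch of the "filtering" step: reusing the learned convolution weights
# in a transposed convolution, which is equivalent to convolving with the same
# filters flipped vertically and horizontally.
import torch
import torch.nn.functional as F

weights = torch.rand(16, 3, 5, 5)           # a forward conv layer: 16 filters of 5x5 over 3 channels
feature_maps = torch.rand(1, 16, 32, 32)    # rectified maps coming down the deconvnet

# conv_transpose2d with the same weight tensor maps the 16 feature maps back to
# the 3-channel space of the layer below.
reconstruction = F.conv_transpose2d(feature_maps, weights, padding=2)
print(reconstruction.shape)                 # torch.Size([1, 3, 32, 32])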

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach

Corresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833). Springer International Publishing. 49

E2E Classification Visualize ZF

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833). Springer International Publishing. 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) result in more distinctive features and fewer "dead" features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50%) of the intermediate neurons

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
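A minimal sketch of dropout as implemented in current frameworks (inverted dropout: the surviving activations are rescaled at training time so that nothing changes at test time); illustrative only, not the original 2012 implementation.

# Minimal sketch of dropout: at training time each activation is zeroed with
# probability p and the survivors are rescaled; at test time nothing changes.
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # roughly half the entries are 0, the rest are scaled to 2.0
drop.eval()
print(drop(x))   # identity at test time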

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper: overfitting, due to the increased number of parameters, and inefficient computation if most weights end up close to zero

Solution: Sparsity

How? Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions do dimensionality reduction (c3 < c2) and account for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN, the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet, the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
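A minimal sketch of an Inception-style module with 1x1 reductions before the 3x3 and 5x5 branches and a parallel 3x3 max-pooling path; the channel sizes below are illustrative, not GoogLeNet's exact configuration.

# Minimal sketch of an Inception-style module: 1x1 "reduction" convolutions before
# the expensive 3x3 and 5x5 branches, plus a parallel 3x3 max-pooling path.
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, c_in, c1, c3_red, c3, c5_red, c5, c_pool):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(c_in, c1, 1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(c_in, c3_red, 1), nn.ReLU(inplace=True),          # reduce...
                                nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True)) # ...then 3x3
        self.b5 = nn.Sequential(nn.Conv2d(c_in, c5_red, 1), nn.ReLU(inplace=True),          # reduce...
                                nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True)) # ...then 5x5
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),                       # parallel pooling path
                                nn.Conv2d(c_in, c_pool, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(m(torch.rand(1, 192, 28, 28)).shape)    # torch.Size([1, 256, 28, 28])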

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces some spatial invariance and has proven to have a beneficial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two softmax classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time...

...and no fully connected layers are needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73 NVIDIA, "NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge" (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No pooling between some convolutional layers. Convolution strides of 1 (no skipping)
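A minimal sketch of the 3x3 stacking argument: two stacked 3x3 convolutions with stride 1 and no pooling in between cover the same 5x5 receptive field as a single 5x5 convolution, with fewer parameters and one extra non-linearity.

# Minimal sketch of the 3x3 stacking argument: two 3x3 convolutions (stride 1,
# no pooling in between) cover a 5x5 receptive field with fewer parameters and
# one extra non-linearity compared to a single 5x5 convolution.
import torch.nn as nn

C = 64
stack_3x3 = nn.Sequential(
    nn.Conv2d(C, C, 3, stride=1, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, 3, stride=1, padding=1), nn.ReLU(inplace=True))
single_5x5 = nn.Conv2d(C, C, 5, stride=1, padding=2)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stack_3x3), count(single_5x5))   # 73856 vs 102464 parameters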

E2E Classification

79

3.6% top-5 error... with 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves: training error. Bold curves: validation error

E2E Classification ResNet

82

Residual learning: reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
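A minimal sketch of a basic residual block: the stacked layers learn the residual F(x), and the identity shortcut adds the input back, so the block outputs F(x) + x (illustrative PyTorch, not the paper's exact block).

# Minimal sketch of a basic residual block: the stacked convolutions learn the
# residual F(x), and the identity shortcut adds the input back, so the block
# outputs F(x) + x.
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)    # residual plus identity shortcut

block = BasicResidualBlock(64)
print(block(torch.rand(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])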

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575 [web]

85

E2E Classification Humans

"Is this a Border terrier?"

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575 [web]

Crowdsource loss (0.3%). More than 5 object classes

E2E Classification Humans

87 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

E2E Classification Humans

88 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

"Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?"

E2E Classification Humans

E2E Classification Humans

89 NVIDIA, "Mocha.jl: Deep Learning for Julia" (2015)

ResNet

90

Let's play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

Dataset: TRAIN / VALIDATION / TEST

SALICON (large-scale): 10000 / 5000 / 5000

iSUN (large-scale): 6000 / 926 / 2000

CAT2000 [Borji'15]: 2000 / - / 2000

MIT300 [Judd'12]: 300 / - / -

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96, 2304 = 48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96, 2304 = 48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96, 2304 = 48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96, 2304 = 48x48

E2E Saliency JuntingNet
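A minimal sketch of a shallow saliency network in the spirit of these slides: 3 convolutional layers plus 2 dense layers mapping a 96x96 RGB image to a 48x48 (2304-value) saliency map. Filter counts and sizes are assumptions for illustration, not the published JuntingNet configuration.

# Minimal sketch of a shallow saliency network in the spirit of these slides:
# 3 conv layers + 2 dense layers mapping a 96x96 RGB image to a 48x48 map (2304 values).
import torch
import torch.nn as nn

class ShallowSaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(inplace=True), nn.MaxPool2d(2),   # 96 -> 48
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),  # 48 -> 24
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2)) # 24 -> 12
        self.dense = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 12 * 12, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 48 * 48), nn.Sigmoid())    # 2304 outputs = 48x48 saliency map

    def forward(self, x):
        return self.dense(self.conv(x)).view(-1, 1, 48, 48)

net = ShallowSaliencyNet()
print(net(torch.rand(1, 3, 96, 96)).shape)   # torch.Size([1, 1, 48, 48])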

106

E2E Saliency JuntingNet

107

Loss function: Mean Square Error (MSE)

Weight initialization: Gaussian distribution

Learning rate: 0.03 to 0.0001

Mini batch size: 128

Training time: 7h (SALICON), 4h (iSUN)

Acceleration: SGD + Nesterov momentum (0.9)

Regularisation: Maxout norm

GPU: NVidia GTX 980
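A minimal sketch of that training recipe (MSE loss, SGD with Nesterov momentum 0.9, learning rate decayed from 0.03, mini-batches of 128), with a toy stand-in model and random tensors in place of the real network and datasets; the original model was trained with a different framework.

# Minimal sketch of the training recipe listed above: MSE loss, SGD with Nesterov
# momentum 0.9, learning rate decayed from 0.03, mini-batches of 128. A tiny
# stand-in model and random tensors replace the real network and datasets.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 96 * 96, 48 * 48))   # stand-in model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # decay towards 0.0001

for step in range(3):                          # toy loop standing in for real epochs
    images = torch.rand(128, 3, 96, 96)        # mini-batch of 128 images
    targets = torch.rand(128, 48 * 48)         # ground-truth saliency maps, flattened
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(step, loss.item())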

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part I': End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)
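A minimal fine-tuning sketch: load a pretrained network, replace its final classification layer for the new domain, optionally freeze the early layers, and train the remaining weights with a small learning rate. torchvision's VGG-16 is used here as a stand-in for the networks in the slides, and the two-class output is only an example.

# Minimal fine-tuning sketch: start from a pretrained network, replace the final
# classification layer for the new domain, freeze the early layers, and train the
# rest with a small learning rate. torchvision's VGG-16 is only a stand-in here.
import torch
import torch.nn as nn
import torchvision.models as models

num_new_classes = 2                         # e.g. positive / negative sentiment
vgg = models.vgg16(pretrained=True)

for p in vgg.features.parameters():         # freeze the convolutional layers
    p.requires_grad = False

vgg.classifier[6] = nn.Linear(4096, num_new_classes)    # new final layer for Domain B

optimizer = torch.optim.SGD(
    (p for p in vgg.parameters() if p.requires_grad),
    lr=1e-3, momentum=0.9)                  # small learning rate: only nudge the weights
criterion = nn.CrossEntropyLoss()
# the training loop over the new domain's data would go here:
# loss = criterion(vgg(images), labels); loss.backward(); optimizer.step()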

118

E2E Fine-tuning

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos, Victor, Amaia Salvador, Xavier Giro-i-Nieto, and Brendan Jou. Diving Deep into Sentiment Understanding: Fine-tuned CNNs for Visual Sentiment Prediction. In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia, pp. 57-62. ACM, 2015.

120

Campos, Victor, Xavier Giro-i-Nieto, and Brendan Jou. "From pixels to sentiments" (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive / True negative / False positive / False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A. Salvador, Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A., and Giró-i-Nieto, X., "Cultural Event Recognition with Visual ConvNets and Temporal Models", in CVPR ChaLearn Looking at People Workshop 2015, 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 39: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo

39

E2E Classification Visualize ZF

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 40: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

40

E2E Classification Visualize ZF

ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo

41

E2E Classification Visualize ZF

XX

ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo

42

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

43

E2E Classification Visualize ZF

ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally

XX

XX

44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision - ECCV 2014. Springer International Publishing, 2014. 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network of Krizhevsky et al., pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision - ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful but not superior (w.r.t. FV, VLAD, Sparse Coding)

138 Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision - ECCV 2014

OTS Retrieval FC layers

Razavian et al. "A baseline for visual instance retrieval with deep convolutional networks." ICLR 2015. 139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers
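
A rough sketch of such a convolutional descriptor (spatially max-pooled VGG-16 conv features, L2-normalized and ranked with cosine similarity; details differ from the cited papers):

import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg16(pretrained=True).eval()

def conv_descriptor(image_batch):
    # Max-pool the last convolutional feature map over space: one 512-d vector per image.
    with torch.no_grad():
        fmap = vgg.features(image_batch)        # (N, 512, H', W')
        desc = torch.amax(fmap, dim=(2, 3))     # (N, 512)
    return F.normalize(desc, dim=1)             # L2-normalize for cosine similarity

query = conv_descriptor(torch.randn(1, 3, 224, 224))
database = conv_descriptor(torch.randn(10, 3, 224, 224))
ranking = (database @ query.t()).squeeze(1).argsort(descending=True)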

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial search (extracting N local descriptors from predefined locations) increases performance, at a computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños, M., Mestre, R., Talavera, E., Giró-i-Nieto, X., Radeva, P. "Visual Summary of Egocentric Photostreams by Representative Keyframes." In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015, Turin, Italy, 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
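
A hedged sketch of that clustering step with scikit-learn k-means (a stand-in for the exact method in the paper; fc7_features would come from an FC7 extractor like the one sketched earlier):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# One 4096-d AlexNet FC7 descriptor per egocentric photo (random stand-in data here).
fc7_features = np.random.rand(500, 4096).astype(np.float32)

kmeans = KMeans(n_clusters=10, random_state=0).fit(fc7_features)   # Euclidean distance
# Pick the photo closest to each cluster centre as a representative keyframe.
keyframe_ids, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, fc7_features)
print(sorted(keyframe_ids.tolist()))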

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 41: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

"(i) Unpool: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables."

41

E2E Classification Visualize ZF
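
As a rough illustration of this unpooling-with-switches idea (my sketch in PyTorch, not the original deconvnet code):

import torch
import torch.nn as nn

# (i) Unpool: max pooling records the switch locations of the maxima,
# and MaxUnpool2d places the pooled values back at those locations
# (an approximate inverse of the pooling operation).
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 3, 8, 8)                 # dummy feature map
pooled, switches = pool(x)                  # switches = where each maximum came from
approx_inverse = unpool(pooled, switches)   # zeros everywhere except at the maxima
print(approx_inverse.shape)                 # torch.Size([1, 3, 8, 8])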


"(ii) Rectification: The convnet uses ReLU non-linearities, which rectify the feature maps, thus ensuring the feature maps are always positive."

42

E2E Classification Visualize ZF

"(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally."


43

E2E Classification Visualize ZF
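
A minimal sketch of the rectification and transposed-filtering steps of the deconvnet pass, assuming conv holds the learned filters of the corresponding layer (illustrative shapes, not the ZF network):

import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2, bias=False)

def deconv_step(feature_maps):
    rectified = F.relu(feature_maps)   # (ii) rectification: keep the maps positive
    # (iii) filtering: apply the transposed version of the same learned filters
    return F.conv_transpose2d(rectified, conv.weight, padding=2)

maps = torch.randn(1, 16, 32, 32)      # activations of this layer
back_projected = deconv_step(maps)     # projected back towards the input space
print(back_projected.shape)            # torch.Size([1, 3, 32, 32])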

"(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally."


44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach.

Corresponding image patches.

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014 (pp. 818-833). Springer International Publishing. 49

E2E Classification Visualize ZF

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014 (pp. 818-833). Springer International Publishing. 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs. 4) and filter size (7x7 vs. 11x11) result in more distinctive features and fewer "dead" features.

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai, without the aliasing artifacts caused by the stride of 4 used in AlexNet.

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50%) of each layer's intermediate neurons.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580


E2E Classification Dropout ZF
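
A minimal sketch of (inverted) dropout on an intermediate activation (not the original cuda-convnet implementation):

import torch

def dropout(activations, p=0.5, training=True):
    # Zero a fraction p of the outputs and rescale the survivors,
    # so nothing changes at test time.
    if not training or p == 0.0:
        return activations
    mask = (torch.rand_like(activations) > p).float()
    return activations * mask / (1.0 - p)

fc7 = torch.randn(128, 4096)    # a batch of intermediate (fully connected) activations
out = dropout(fc7, p=0.5)       # roughly half of the units are silenced each pass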

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper: overfitting, due to the increased number of parameters, and inefficient computation, if most weights end up close to zero.

Solution: sparsity.

How: Inception modules.

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions perform dimensionality reduction (c3 < c2) and are followed by rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
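
A simplified Inception-style block in that spirit (channel counts are illustrative, not the ones from the paper):

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    # Four parallel paths: 1x1, 1x1->3x3, 1x1->5x5 and 3x3 max pool -> 1x1,
    # concatenated along the channel axis. The 1x1 convolutions reduce the
    # number of channels before the expensive 3x3 and 5x5 convolutions.
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

y = InceptionBlock(192)(torch.randn(1, 192, 28, 28))
print(y.shape)   # torch.Size([1, 192, 28, 28]): 64 + 64 + 32 + 32 output channels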

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces some spatial invariance and has proven beneficial as an additional parallel path.

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

...and no fully connected layers are needed.

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73 NVIDIA, "NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge" (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
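
As a back-of-the-envelope note (mine, not the slide's): two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, but with 2 x (3 x 3 x C x C) = 18C^2 weights instead of 25C^2, and with an extra non-linearity in between.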

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No pooling between some convolutional layers. Convolution stride of 1 (no skipping).

E2E Classification

79

3.6% top-5 error... with 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 layers vs. 18) are more difficult to train.

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves: training error. Bold curves: validation error.

E2E Classification ResNet

82

Residual learning: reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
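
A minimal sketch of such a residual block (identity shortcut only; batch normalization and the bottleneck design are omitted):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # The stacked layers learn a residual F(x) and the block outputs F(x) + x,
    # i.e. a function with reference to its input rather than an unreferenced one.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        residual = self.conv2(F.relu(self.conv1(x)))
        return F.relu(residual + x)   # identity shortcut connection

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))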

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

"Is this a Border terrier?"

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsourcing loss (0.3%): more than 5 object classes

E2E Classification Humans

87 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

E2E Classification Humans

88 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

"Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?"

E2E Classification Humans

E2E Classification Humans

89 NVIDIA, "Mocha.jl: Deep Learning for Julia" (2015)

ResNet

90

Let's play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

Dataset: TRAIN / VALIDATION / TEST

Large scale:
SALICON: 10,000 / 5,000 / 5,000
iSUN: 6,000 / 926 / 2,000

CAT2000 [Borji'15]: 2,000 / - / 2,000
MIT300 [Judd'12]: 300 / - / -

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 input, 2304 = 48x48 output

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 input, 2304 = 48x48 output

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 input, 2304 = 48x48 output

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 input, 2304 = 48x48 output

E2E Saliency JuntingNet
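
Putting those pieces together, a rough re-implementation sketch (the original network was built with Lasagne/Theano; the intermediate layer sizes are assumptions, only the 96x96 input and the 48x48 = 2304 output follow the slides):

import torch
import torch.nn as nn

class ShallowSaliencyNet(nn.Module):
    # 96x96 RGB input -> 3 conv layers -> 2 dense layers -> 48x48 saliency map.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3), nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 10 * 10, 4608), nn.ReLU(inplace=True),
            nn.Linear(4608, 48 * 48), nn.Sigmoid())   # one value per output pixel

    def forward(self, x):
        return self.regressor(self.features(x)).view(-1, 1, 48, 48)

saliency = ShallowSaliencyNet()(torch.randn(1, 3, 96, 96))
print(saliency.shape)   # torch.Size([1, 1, 48, 48]); upsample + filter to image size afterwards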

106

E2E Saliency JuntingNet

107

Loss function: Mean Square Error (MSE)

Weight initialization: Gaussian distribution

Learning rate: 0.03 to 0.0001

Mini batch size: 128

Training time: 7h (SALICON), 4h (iSUN)

Acceleration: SGD + Nesterov momentum (0.9)

Regularisation: Maxout norm

GPU: NVidia GTX 980
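
A hedged sketch of that training setup in PyTorch (the original used Theano/Lasagne; the maxout/norm regularisation is omitted, the data loader below is a stand-in for the real SALICON/iSUN loaders, and ShallowSaliencyNet is the network sketched above):

import torch

model = ShallowSaliencyNet()
criterion = torch.nn.MSELoss()                       # mean square error loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # 0.03 -> ~0.0001

# Stand-in for a loader yielding mini-batches of 128 (image, ground-truth map) pairs.
train_loader = [(torch.randn(128, 3, 96, 96), torch.rand(128, 1, 48, 48))]

for epoch in range(100):
    for images, fixation_maps in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), fixation_maps)
        loss.backward()
        optimizer.step()
    scheduler.step()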

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database.

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)
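
A hedged sketch of that recipe with torchvision's VGG-16: freeze the pre-trained convolutional layers (domain A) and retrain a new final layer for the target task (the 2-class output for sentiment prediction is an assumption):

import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16(pretrained=True)       # weights learned on ImageNet

for param in model.features.parameters():
    param.requires_grad = False             # freeze the convolutional layers

# Replace the last classifier layer with one sized for the new task.
model.classifier[6] = nn.Linear(in_features=4096, out_features=2)

# Only the unfrozen (new) parameters are updated during fine-tuning.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3, momentum=0.9)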

118

E2E Fine-tuning

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN



Page 44: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

"(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally." (Zeiler & Fergus, 2014)
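As a rough illustration of that filtering-inversion step (a sketch only, not the authors' code; the filter shape and stride below are made-up AlexNet-like values), a learned convolution can be approximately inverted by applying its transposed filters to the rectified maps:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 224, 224)        # hypothetical input image batch
weight = torch.randn(96, 3, 11, 11)    # hypothetical learned conv1 filters (out, in, kH, kW)

# Forward pass: convolve the previous layer's maps with the learned filters.
feat = F.conv2d(x, weight, stride=4, padding=2)

# Deconvnet step: rectify the maps, then apply the *transposed* filters to project
# the activations back towards pixel space (an approximate inversion, not an exact one).
rectified = F.relu(feat)
reconstruction = F.conv_transpose2d(rectified, weight, stride=4, padding=2)
print(reconstruction.shape)            # 3 channels again, roughly at input resolution
```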


44

E2E Classification Visualize ZF

45

Top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. Corresponding image patches.

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D & Fergus R (2014) Visualizing and understanding convolutional networks In Computer Vision–ECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D & Fergus R (2014) Visualizing and understanding convolutional networks In Computer Vision–ECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) result in more distinctive features and fewer "dead" features.

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a random portion (typically 50%) of the intermediate neurons during training.

Hinton G E Srivastava N Krizhevsky A Sutskever I & Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv:1207.0580
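As a minimal sketch of the idea (the layer sizes below are arbitrary, not taken from the slides), dropout is a single extra layer between fully connected layers:

```python
import torch.nn as nn

# Hypothetical classifier head with dropout between fully connected layers.
# During training each neuron's output is zeroed with probability p=0.5;
# at test time the layer is a pass-through (activations are rescaled during training).
classifier = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),
)
```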


E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) ImageNet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]


Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) ImageNet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]


Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper: overfitting due to the increased number of parameters, and inefficient computation if most weights end up close to zero.

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions do dimensionality reduction (c3 < c2) and are followed by rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
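For illustration (the channel counts here are arbitrary examples, not values from the paper), a 1x1 convolution mixes channels at each spatial position and can shrink the channel dimension before the expensive convolutions:

```python
import torch
import torch.nn as nn

# Hypothetical feature map with c2 = 256 channels.
x = torch.randn(1, 256, 28, 28)

# 1x1 convolution: reduces 256 -> 64 channels (c3 < c2) at every spatial position,
# followed by a ReLU non-linearity.
reduce = nn.Sequential(nn.Conv2d(256, 64, kernel_size=1), nn.ReLU())
y = reduce(x)
print(y.shape)  # torch.Size([1, 64, 28, 28]) - same spatial size, fewer channels
```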

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces some spatial invariance, and adding it as an alternative parallel path has proven beneficial.
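Putting the pieces together, an Inception-style module can be sketched as below. This is an illustrative simplification with arbitrary channel counts, not the exact GoogLeNet configuration: 1x1 reductions before the 3x3 and 5x5 convolutions, plus a parallel 3x3 max-pooling path, all concatenated along the channel dimension.

```python
import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    """Simplified Inception module; channel counts are illustrative only."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, 1)                           # plain 1x1 path
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(),  # 1x1 reduction...
                                     nn.Conv2d(96, 128, 3, padding=1))    # ...before the 3x3
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),  # 1x1 reduction...
                                     nn.Conv2d(16, 32, 5, padding=2))     # ...before the 5x5
        self.branch_pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),  # parallel pooling path
                                         nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        # Concatenate the four parallel branches along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

y = InceptionSketch(192)(torch.randn(1, 192, 28, 28))
print(y.shape)  # torch.Size([1, 256, 28, 28]) = 64 + 128 + 32 + 32 channels
```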

E2E Classification GoogLeNet

71

Two softmax classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

...and no fully connected layers are needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73 NVIDIA "NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge" (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
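The motivation for stacks of 3x3 convolutions can be checked with a quick parameter count (the channel count below is an arbitrary example): two stacked 3x3 layers cover the same 5x5 receptive field as a single 5x5 layer, with fewer weights and an extra non-linearity in between.

```python
C = 256  # input and output channels (example value)

params_one_5x5 = 5 * 5 * C * C          # a single 5x5 convolution
params_two_3x3 = 2 * (3 * 3 * C * C)    # two stacked 3x3 convolutions (same 5x5 receptive field)

print(params_one_5x5, params_two_3x3)   # 1638400 vs 1179648: about 28% fewer parameters
```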

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No pooling between some convolutional layers. Convolution stride of 1 (no skipping).

E2E Classification

79

3.6% top-5 error... with 152 layers

E2E Classification ResNet

80 He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv:1512.03385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 layers vs 18) are more difficult to train.

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv:1512.03385 (2015) [slides]

Thin curves: training error. Bold curves: validation error.

E2E Classification ResNet

82

Residual learning reformulates the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv:1512.03385 (2015) [slides]
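A minimal residual-block sketch (identity shortcut only; the channel count is an arbitrary example) makes the reformulation concrete: the stacked layers learn a residual F(x) and the block outputs F(x) + x.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlockSketch(nn.Module):
    """Basic residual block: the convolutions learn a residual F(x), added back to the input x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + x)   # identity shortcut: output is F(x) + x

y = ResidualBlockSketch(64)(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```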

E2E Classification ResNet

83 He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv:1512.03385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) ImageNet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]

85

E2E Classification Humans

"Is this a Border terrier?"

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) ImageNet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S & Fei-Fei L (2015) ImageNet large scale visual recognition challenge arXiv preprint arXiv:1409.0575 [web]

Crowdsourcing loss (0.3%). More than 5 object classes.

E2E Classification Humans

87 Andrej Karpathy "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

E2E Classification Humans

88 Andrej Karpathy "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

"Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?"

E2E Classification Humans

E2E Classification Humans

89 NVIDIA "Mocha.jl: Deep Learning for Julia" (2015)

ResNet

90

Let's play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

Dataset: TRAIN / VALIDATION / TEST

SALICON: 10000 / 5000 / 5000

iSUN: 6000 / 926 / 2000

CAT2000 [Borji'15]: 2000 / - / 2000

MIT300 [Judd'12]: 300 / - / -

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2304=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2304=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2304=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2304=48x48
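The exact layer configuration is given in the paper; purely as an illustration of the overall shape (the filter counts and kernel sizes below are assumptions, not the published values), a shallow network of this kind maps a 96x96 RGB input through three convolutional stages and two dense layers to a 2304-dimensional output, reshaped into a 48x48 saliency map:

```python
import torch
import torch.nn as nn

# Shallow saliency-network sketch; filter counts and kernel sizes are illustrative guesses.
model = nn.Sequential(
    nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),    # 96x96 -> 48x48
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48x48 -> 24x24
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24x24 -> 12x12
    nn.Flatten(),
    nn.Linear(128 * 12 * 12, 4096), nn.ReLU(),                     # dense layer 1
    nn.Linear(4096, 48 * 48),                                      # dense layer 2 -> 2304 outputs
)

saliency = model(torch.randn(1, 3, 96, 96)).view(1, 48, 48)  # reshape to the 2D saliency map
```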

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 0.03 to 0.0001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD + Nesterov momentum (0.9)

Regularisation Maxout norm

GPU NVidia GTX 980
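A minimal sketch of that training recipe (MSE loss, SGD with Nesterov momentum 0.9, a learning rate decayed from 0.03, mini-batches of 128). The names `model`, `loader` and `num_epochs` are assumed to be defined elsewhere (e.g. the network sketched above and a loader of image / ground-truth map pairs), and the exact decay schedule is an assumption:

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()                                    # Mean Square Error loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,    # start at 0.03...
                            momentum=0.9, nesterov=True)    # ...with Nesterov momentum 0.9
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # decay towards 1e-4

for epoch in range(num_epochs):
    for images, maps in loader:          # mini-batches of 128 (image, saliency map) pairs
        optimizer.zero_grad()
        loss = criterion(model(images).view_as(maps), maps)
        loss.backward()
        optimizer.step()
    scheduler.step()
```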

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Back-propagation with the Euclidean distance: training curve (loss vs. number of iterations / training time) for the SALICON database.

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part I': End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network
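In practice, fine-tuning means loading weights pre-trained on domain A (e.g. ImageNet), replacing the last layer for the new task in domain B, and training with a small learning rate, optionally freezing the early layers. A generic sketch (using torchvision's pre-trained AlexNet purely for illustration; the two-class output and learning rate are assumptions):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Load a network pre-trained on ImageNet (domain A).
net = models.alexnet(pretrained=True)

# Optionally freeze the early (generic) convolutional layers so only later ones adapt.
for param in net.features.parameters():
    param.requires_grad = False

# Replace the last classifier layer for the new task (domain B), e.g. 2 sentiment classes.
net.classifier[6] = nn.Linear(4096, 2)

# Fine-tune with a small learning rate so the pre-trained weights are only gently adjusted.
optimizer = torch.optim.SGD(filter(lambda p: p.requires_grad, net.parameters()),
                            lr=1e-3, momentum=0.9)
```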

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou "From pixels to sentiments" (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Orós A and Giró-i-Nieto X "Cultural Event Recognition with Visual ConvNets and Temporal Models" in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
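A common way to obtain such descriptors (a generic sketch, not the exact setup of any cited paper): run a pre-trained network and keep the activations of an intermediate layer, e.g. the penultimate fully connected layer of torchvision's AlexNet, used here purely for illustration.

```python
import torch
import torchvision.models as models

net = models.alexnet(pretrained=True).eval()

# Keep everything up to (and including) the second fully connected layer (FC7-like),
# dropping the final 1000-way classification layer.
feature_extractor = torch.nn.Sequential(net.features, net.avgpool, torch.nn.Flatten(),
                                        *list(net.classifier.children())[:-1])

with torch.no_grad():
    descriptor = feature_extractor(torch.randn(1, 3, 224, 224))  # a 4096-D off-the-shelf descriptor
print(descriptor.shape)  # torch.Size([1, 4096])
```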

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization of the CNN features

Accuracy: +5%
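L2-normalizing each descriptor before feeding the classifier is a one-liner (numpy sketch; array sizes are arbitrary):

```python
import numpy as np

def l2_normalize(x, eps=1e-10):
    """Scale each row (one descriptor per row) to unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

features = np.random.rand(10, 4096)   # 10 hypothetical CNN descriptors
features = l2_normalize(features)     # each row now has unit L2 norm, ready for the classifier
```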

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy: -2.5%

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2%

Size: 32x smaller

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Features pooled from the network of Krizhevsky et al., pre-trained on ImageNet images.

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful but not superior (w.r.t. FV, VLAD, Sparse Coding)

138 Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015 139

Convolutional layers have shown better performance than fully connected ones
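One simple way to turn a convolutional feature map into a retrieval descriptor (a sketch of the general idea, not the exact method of the cited paper; sizes are arbitrary) is to max-pool each channel over all spatial positions and L2-normalize the result:

```python
import torch

# Hypothetical conv-layer activations for one image: 512 channels on a 37x37 grid.
conv_maps = torch.randn(1, 512, 37, 37)

# Max-pool over the spatial dimensions: one value per channel -> a 512-D descriptor.
descriptor = conv_maps.amax(dim=(2, 3))
descriptor = descriptor / descriptor.norm(dim=1, keepdim=True)  # L2-normalize for cosine matching
print(descriptor.shape)  # torch.Size([1, 512])
```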

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015 140

Spatial Search (extract N local descriptors from predefined locations) increases performance at a computational cost.

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015 141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños M Mestre R Talavera E Giró-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
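As an illustration of that step (the paper's exact clustering may differ; this only shows the Euclidean-distance idea, and the feature matrix and number of clusters below are assumptions), a representative keyframe per cluster can be picked with k-means over the FC7 descriptors:

```python
import numpy as np
from sklearn.cluster import KMeans

fc7 = np.random.rand(500, 4096)   # hypothetical FC7 features for 500 egocentric photos

# Euclidean k-means over the descriptors (8 clusters chosen arbitrarily here).
kmeans = KMeans(n_clusters=8, n_init=10).fit(fc7)

# Representative keyframe per cluster: the image whose descriptor is closest to each centre.
keyframes = [int(np.argmin(np.linalg.norm(fc7 - c, axis=1))) for c in kmeans.cluster_centers_]
print(keyframes)
```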

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 45: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

45

Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space

using our deconvolutional network approachCorresponding image patches

E2E Classification Visualize ZF

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 46: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

46

E2E Classification Visualize ZF

47

E2E Classification Visualize ZF

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 48: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

48

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 49: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49

E2E Classification Visualize ZF

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 50: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50

E2E Classification Visualize ZF

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 layers vs. 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves: training error. Bold curves: validation error.

E2E Classification ResNet

82

Residual learning: reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
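A minimal residual-block sketch of that idea (identity shortcut only; layer sizes, and the use of batch normalization, are assumptions of this sketch rather than a full reproduction of the paper):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block sketch: the stacked layers learn F(x) and the
    block outputs F(x) + x, so it only has to model the residual."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(residual + x)   # identity shortcut: output F(x) + x

x = torch.randn(2, 64, 56, 56)
print(ResidualBlock(64)(x).shape)        # torch.Size([2, 64, 56, 56])
```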

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

"Is this a Border terrier?"

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsourcing loss (0.3%): more than 5 object classes

E2E Classification Humans

87 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

E2E Classification Humans

88 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

"Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?"

E2E Classification Humans

E2E Classification Humans

89 NVIDIA, "Mocha.jl: Deep Learning for Julia" (2015)

ResNet

90

Let's play a game

91

92

What have you seen?

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

Dataset: TRAIN / VALIDATION / TEST
SALICON (large scale): 10,000 / 5,000 / 5,000
iSUN (large scale): 6,000 / 926 / 2,000
CAT2000 [Borji'15]: 2,000 / - / 2,000
MIT300 [Judd'12]: 300 / - / -

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 → 2304 = 48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 → 2304 = 48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 → 2304 = 48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 → 2304 = 48x48
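A rough PyTorch re-sketch of such a shallow architecture follows (the original was not implemented in PyTorch, and the exact filter sizes and channel counts here are guesses): a 96x96 RGB input passes through three convolution + pooling stages and two dense layers, and the final 2304-dimensional output is reshaped into a 48x48 saliency map.

```python
import torch
import torch.nn as nn

class ShallowSaliencyNet(nn.Module):
    """Sketch of a shallow saliency network: 3 conv stages + 2 dense layers,
    96x96 RGB in, 48x48 (= 2304 values) saliency map out. Layer sizes are illustrative."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(inplace=True), nn.MaxPool2d(2),    # 96 -> 48
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2))  # 24 -> 12
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 12 * 12, 4608), nn.ReLU(inplace=True),
            nn.Linear(4608, 48 * 48), nn.Sigmoid())   # one value per output pixel

    def forward(self, x):
        return self.regressor(self.features(x)).view(-1, 1, 48, 48)  # reshape to a 2D map

x = torch.randn(4, 3, 96, 96)
print(ShallowSaliencyNet()(x).shape)   # torch.Size([4, 1, 48, 48])
```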

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function: Mean Square Error (MSE)

Weight initialization: Gaussian distribution

Learning rate: 0.03 to 0.0001

Mini-batch size: 128

Training time: 7h (SALICON), 4h (iSUN)

Acceleration: SGD + Nesterov momentum (0.9)

Regularisation: maxout norm

GPU: NVIDIA GTX 980
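A sketch of a training step under settings like those above (the placeholder model, the random tensors standing in for the SALICON/iSUN loaders, and the exponential learning-rate schedule are all assumptions of this sketch):

```python
import torch
import torch.nn as nn

# Placeholder model; only the optimisation settings matter here.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 96 * 96, 48 * 48), nn.Sigmoid())
criterion = nn.MSELoss()                                     # pixel-wise mean square error
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, nesterov=True)     # SGD + Nesterov momentum (0.9)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)  # decay toward ~1e-4

for step in range(3):                                        # a few dummy iterations
    images = torch.randn(128, 3, 96, 96)                     # mini-batch size 128
    targets = torch.rand(128, 48 * 48)                       # ground-truth saliency maps in [0, 1]
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(step, float(loss))
```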

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database.

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part I': End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network
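A generic fine-tuning sketch with torchvision (an assumption of this sketch, not the authors' Caffe setup): load ImageNet-pretrained weights, replace the last classification layer for the new domain, optionally freeze the earlier layers, and update what remains with a small learning rate. The 2-class head and the placeholder batch are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone (downloads weights on first use).
model = models.resnet18(pretrained=True)

# Optionally freeze the early, generic layers...
for param in model.parameters():
    param.requires_grad = False

# ...and replace the final classifier with one sized for the new task (e.g. 2 sentiment classes).
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head (and any unfrozen layers) is updated, with a small learning rate.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3, momentum=0.9)

images = torch.randn(4, 3, 224, 224)           # placeholder batch
labels = torch.randint(0, 2, (4,))
optimizer.zero_grad()
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()
optimizer.step()
```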

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou "From pixels to sentiments" (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive / True negative / False positive / False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Orós A and Giró-i-Nieto X "Cultural Event Recognition with Visual ConvNets and Temporal Models" in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
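For instance, activations from the penultimate layer of a pretrained network can be stored and reused as image descriptors. A minimal sketch, using torchvision rather than the Caffe/MatConvNet setups of the cited papers (the preprocessed random batch is a placeholder):

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained backbone used purely as a feature extractor (no training here).
backbone = models.resnet18(pretrained=True)
backbone.fc = nn.Identity()        # drop the ImageNet classifier, keep the 512-d penultimate features
backbone.eval()

images = torch.randn(8, 3, 224, 224)          # placeholder batch of preprocessed images
with torch.no_grad():
    feats = backbone(images)                  # (8, 512) off-the-shelf descriptors
print(feats.shape)

# These vectors can now feed any task: an SVM / linear classifier, nearest-neighbour
# retrieval, clustering, etc., without touching the network weights.
```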

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization of the features → Accuracy +5%
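That normalization step is a one-liner; a sketch (the descriptor matrix here is a random placeholder standing in for CNN features):

```python
import torch
import torch.nn.functional as F

feats = torch.randn(8, 512)                 # placeholder CNN descriptors
feats_l2 = F.normalize(feats, p=2, dim=1)   # unit-length rows before the linear classifier / SVM
print(feats_l2.norm(dim=1))                 # all ones
```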

Three representative architectures considered: AlexNet, ZF and OverFeat. Training times range from 5 days (fast) to 3 weeks (slow) on an NVIDIA GTX Titan GPU.

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
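A typical flip-and-crop augmentation pipeline of the kind studied there, sketched with torchvision transforms (the exact augmentation settings in the paper differ; sizes here are the common 256/224 choice):

```python
from PIL import Image
from torchvision import transforms

# Random crops and horizontal flips at training time; a single centre crop at test time.
train_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
test_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

img = Image.new("RGB", (320, 240))     # placeholder image
print(train_tf(img).shape)             # torch.Size([3, 224, 224])
```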

132

OTS Classification Return of devil

Fisher Kernels (FK) and ConvNets (CNN): switching from Color to Gray Scale (GS) costs about 2.5% accuracy.

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes: Accuracy -2% for a size x32 smaller.

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network of Krizhevsky et al., pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful but not superior (w.r.t. FV, VLAD, Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones
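A sketch of how such a convolutional descriptor can be built for retrieval (max-pooling each feature map into one value is one simple option; the cited papers evaluate several pooling and spatial-search variants, and the backbone and batch here are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

# Keep only the convolutional trunk of a pretrained network.
cnn = models.resnet18(pretrained=True)
conv_trunk = nn.Sequential(*list(cnn.children())[:-2])   # drop global pooling + fc
conv_trunk.eval()

images = torch.randn(8, 3, 224, 224)                     # placeholder preprocessed images
with torch.no_grad():
    fmap = conv_trunk(images)                            # (8, 512, 7, 7) feature maps
    desc = fmap.amax(dim=(2, 3))                         # max-pool each map -> (8, 512) descriptor
    desc = F.normalize(desc, dim=1)                      # L2-normalise for cosine similarity

query, database = desc[:1], desc[1:]
scores = database @ query.t()                            # cosine similarities, ranked for retrieval
print(scores.squeeze(1).argsort(descending=True))
```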

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial search (extracting N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños M Mestre R Talavera E Giró-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
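A sketch of that clustering step (the FC7-style descriptors are random placeholders, and scikit-learn's Euclidean k-means stands in for whichever clustering variant was actually used): cluster the features, then keep the photo closest to each centroid as the representative keyframe.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder FC7-style descriptors for 200 egocentric photos (4096-d, as in AlexNet).
feats = np.random.rand(200, 4096).astype(np.float32)

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(feats)   # Euclidean k-means

# For each cluster, pick the photo closest to the centroid as its representative keyframe.
dists = np.linalg.norm(feats - kmeans.cluster_centers_[kmeans.labels_], axis=1)
keyframes = [np.where(kmeans.labels_ == c)[0][np.argmin(dists[kmeans.labels_ == c])]
             for c in range(8)]
print(keyframes)
```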

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giró-i-Nieto

Page 51: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

51

The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features

AlexNet (Layer 1) Clarifai (Layer 1)

E2E Classification Visualize ZF

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 52: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

52

Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet

AlexNet (Layer 2) Clarifai (Layer 2)

E2E Classification Visualize ZF

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 53: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

53

Regularization with dropout

Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron

Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580

Chicago

E2E Classification Dropout ZF

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part I': End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

118

E2E Fine-tuning

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network
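As a hedged illustration of what fine-tuning a pre-trained network involves in practice (an assumed setup, not the exact recipe of the works cited on the following slides): load a backbone pre-trained on ImageNet, freeze most of its convolutional layers, replace the classifier head with one sized for the new task, and retrain with a small learning rate. VGG16 from keras.applications and a binary sentiment target are assumed here purely for the example.

```python
# Minimal fine-tuning sketch: freeze early layers, swap the head, retrain gently.
from keras.applications import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense
from keras.optimizers import SGD

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers[:-4]:          # keep roughly the last conv block trainable
    layer.trainable = False

x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
out = Dense(2, activation='softmax')(x)  # e.g. positive / negative sentiment
model = Model(inputs=base.input, outputs=out)

# Small learning rate so the pre-trained weights are only gently adjusted.
model.compile(optimizer=SGD(lr=1e-4, momentum=0.9),
              loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(train_images, train_labels, batch_size=32, epochs=5)  # your data here
```

Deciding how many layers to unfreeze is precisely the layer-wise question studied in the sentiment prediction work credited above.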

119

E2E Fine-tuning Sentiments

CNN

Campos, Victor, Amaia Salvador, Xavier Giro-i-Nieto and Brendan Jou, "Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction", in Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia, pp. 57-62, ACM 2015

120

Campos, Victor, Xavier Giro-i-Nieto and Brendan Jou, "From pixels to sentiments" (Submitted)

Long, Jonathan, Evan Shelhamer and Trevor Darrell, "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015

E2E Fine-tuning Sentiments

True positive / True negative / False positive / False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

Salvador, A., Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A. and Giró-i-Nieto, X., "Cultural Event Recognition with Visual ConvNets and Temporal Models", in CVPR ChaLearn Looking at People Workshop 2015, 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

125

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126 Razavian, Ali, Hossein Azizpour, Josephine Sullivan and Stefan Carlsson, "CNN features off-the-shelf: an astounding baseline for recognition", CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
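A hedged sketch of this off-the-shelf usage (the cited works pooled features from the network of Krizhevsky et al.; VGG16 is assumed here only because Keras ships it pre-trained): keep the activations of a fully connected layer as a generic descriptor and reuse them for another task, for instance nearest-neighbour retrieval by cosine similarity.

```python
# Extract a fully connected activation from a pre-trained CNN and reuse it
# as a generic image descriptor (illustrative setup, random stand-in images).
import numpy as np
from keras.applications import VGG16
from keras.applications.vgg16 import preprocess_input
from keras.models import Model

base = VGG16(weights='imagenet', include_top=True)
# 'fc2' is the 4096-d penultimate fully connected layer of Keras' VGG16.
extractor = Model(inputs=base.input, outputs=base.get_layer('fc2').output)

def describe(images):
    """images: float32 array of shape (n, 224, 224, 3), RGB values in 0-255."""
    return extractor.predict(preprocess_input(images.copy()))

# Toy retrieval: rank a small database by cosine similarity to a query.
db = describe(np.random.rand(10, 224, 224, 3).astype('float32') * 255)
q = describe(np.random.rand(1, 224, 224, 3).astype('float32') * 255)
db_n = db / np.linalg.norm(db, axis=1, keepdims=True)
q_n = q / np.linalg.norm(q)
ranking = np.argsort(-db_n.dot(q_n.ravel()))   # most similar images first
print(ranking)
```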

127

Off-The-Shelf (OTS) Features

Babenko, Artem, et al., "Neural codes for image retrieval", Computer Vision – ECCV 2014

128

OTS Classification Razavian

Razavian, Ali, Hossein Azizpour, Josephine Sullivan and Stefan Carlsson, "CNN features off-the-shelf: an astounding baseline for recognition", CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A., "Return of the devil in the details: Delving deep into convolutional nets", BMVC 2014

Classifier: L2-normalization of the features (see the sketch below)

Accuracy: +5%

Three representative architectures considered (all trained on an NVIDIA GTX Titan GPU):

CNN-F (fast, AlexNet-like): about 5 days of training

CNN-M (medium, ZF-like)

CNN-S (slow, OverFeat-like): about 3 weeks of training
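The sketch referenced above, with stand-in data (so the +5% figure will not reproduce here; it only shows where the L2-normalization sits in the pipeline): the CNN descriptors are L2-normalised before the linear classifier is trained, and the same normalisation is applied to the test features.

```python
# L2-normalise CNN descriptors before a linear classifier (stand-in features).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)
features = rng.rand(200, 4096)         # stand-in for CNN descriptors
labels = rng.randint(0, 2, size=200)   # stand-in for Pascal VOC-style labels

clf_raw = LinearSVC().fit(features, labels)                       # no normalisation
clf_l2 = LinearSVC().fit(normalize(features, norm='l2'), labels)  # L2-normalised
# At test time the same normalisation must be applied to the query features:
# clf_l2.predict(normalize(test_features, norm='l2'))
```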

130

OTS Classification Return of devil

Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A., "Return of the devil in the details: Delving deep into convolutional nets", BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A., "Return of the devil in the details: Delving deep into convolutional nets", BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels (FK) vs. ConvNets (CNN), trained on Color vs. Gray Scale (GS) images

Accuracy: -2.5% when using gray scale instead of color

133

OTS Classification Return of devil

Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A., "Return of the devil in the details: Delving deep into convolutional nets", BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2%

Size: x32 smaller

134

OTS Classification Return of devil

Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A., "Return of the devil in the details: Delving deep into convolutional nets", BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al., "Neural codes for image retrieval", Computer Vision – ECCV 2014, Springer International Publishing, 2014, pp. 584-599

135

OTS Retrieval

Oxford Buildings, INRIA Holidays, UKB

136

OTS Retrieval

Pooled from the network of Krizhevsky et al., pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al., "Neural codes for image retrieval", Computer Vision – ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful, but not superior to FV, VLAD or Sparse Coding.

138 Babenko, Artem, et al., "Neural codes for image retrieval", Computer Vision – ECCV 2014

OTS Retrieval FC layers

139 Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks", ICLR 2015

Convolutional layers have shown better performance than fully connected ones
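A hedged sketch in the spirit of this result (an assumed setup, not the paper's exact pipeline): discard the fully connected layers and max-pool the last convolutional feature maps over their spatial positions to obtain a compact, L2-normalised image descriptor.

```python
# Conv-layer descriptor by global spatial max pooling (illustrative setup).
import numpy as np
from keras.applications import VGG16
from keras.applications.vgg16 import preprocess_input

base = VGG16(weights='imagenet', include_top=False)   # convolutional layers only

def conv_descriptor(images):
    """images: (n, H, W, 3) float32 RGB in 0-255; returns (n, 512) descriptors."""
    maps = base.predict(preprocess_input(images.copy()))       # (n, h, w, 512) maps
    desc = maps.max(axis=(1, 2))                                # global max pooling
    return desc / np.linalg.norm(desc, axis=1, keepdims=True)   # L2-normalise

d = conv_descriptor(np.random.rand(2, 224, 224, 3).astype('float32') * 255)
print(d.shape)  # (2, 512)
```

The spatial search mentioned below would repeat this pooling over several predefined sub-windows of the image, keeping one descriptor per window.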

OTS Retrieval Conv layers

OTS Retrieval Conv layers

140 Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks", ICLR 2015

Spatial Search (extract N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints

141 Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks", ICLR 2015

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños, M., Mestre, R., Talavera, E., Giró-i-Nieto, X. and Radeva, P., "Visual Summary of Egocentric Photostreams by Representative Keyframes", in IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015, Turin, Italy

Clustering based on Euclidean distance over FC7 features from AlexNet
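A minimal sketch of that summarisation step (assumed pipeline; the features below are random stand-ins for AlexNet FC7 activations): cluster the per-photo descriptors with k-means under the Euclidean distance and keep, for each cluster, the photo closest to its centroid as the representative keyframe.

```python
# Cluster per-photo CNN features and pick one representative keyframe per cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
fc7_features = rng.rand(500, 4096)     # one 4096-d descriptor per photo (stand-in)

k = 8                                  # number of keyframes in the summary
km = KMeans(n_clusters=k, random_state=0).fit(fc7_features)

keyframes = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(fc7_features[members] - km.cluster_centers_[c], axis=1)
    keyframes.append(members[np.argmin(dists)])   # photo closest to the centroid
print(sorted(keyframes))
```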

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giró-i-Nieto

Page 54: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

54

E2E Classification Visualize ZF

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 55: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

55

E2E Classification Ensembles ZF

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 56: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

56

E2E Classification ImageNet ILSRVC

ImageNet Classification 2013

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

-5

Slide credit Rob Fergus (NYU)

57

E2E Classification ImageNet ILSRVC

E2E Classification

58

E2E Classification GoogLeNet

59Movie Inception (2010)

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 60: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification GoogLeNet

60

22 layers but 12 times fewer parameters than AlexNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification GoogLeNet

61

Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero

Solution

Sparsity

How

Inception modules

E2E Classification GoogLeNet

62

E2E Classification GoogLeNet

63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou "From pixels to sentiments" (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive / True negative / False positive / False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Orós A and Giró-i-Nieto X "Cultural Event Recognition with Visual ConvNets and Temporal Models" in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel O'Connor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel O'Connor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

125

[Diagram: Part I End-to-end learning (E2E) trains a Learned Representation for Task A (e.g. image classification); Part II Off-The-Shelf features (OTS) reuses it for Task B (e.g. image retrieval)]

Outline for Part I Image Analytics

126

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
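A minimal sketch of that idea: forward an image through a pre-trained AlexNet and keep the FC7 activations (the 4096-d layer just before the final classifier) as an L2-normalized descriptor. The preprocessing values are the standard ImageNet ones and the image path is a placeholder.

```python
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

alexnet = models.alexnet(pretrained=True).eval()

def fc7_descriptor(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        x = alexnet.avgpool(alexnet.features(x)).flatten(1)
        x = alexnet.classifier[:6](x)             # stop at FC7 (4096-d), before the last layer
    return torch.nn.functional.normalize(x, dim=1)   # L2-normalize the off-the-shelf descriptor

descriptor = fc7_descriptor("query.jpg")          # 1 x 4096 visual descriptor
```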

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization

Accuracy: +5%

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU
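As a small illustration of the "L2-normalize before the classifier" point above, a scikit-learn sketch that normalizes CNN descriptors and fits a linear SVM on top; the descriptors and labels are random placeholders.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

X = np.random.rand(100, 4096)          # placeholder CNN descriptors (n_samples x 4096)
y = np.random.randint(0, 2, size=100)  # placeholder binary labels

X_l2 = normalize(X, norm="l2")         # L2-normalize each descriptor
clf = LinearSVC(C=1.0).fit(X_l2, y)    # linear classifier on the normalized features
```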

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
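As an illustration of the augmentation setting studied in the paper, a short torchvision sketch that samples random crops and horizontal flips of each training image; the crop size and the number of augmented samples per image are assumptions.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),            # random crop of the (resized) training image
    transforms.RandomHorizontalFlip(),     # random mirror
    transforms.ToTensor(),
])
```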

132

OTS Classification Return of devil

Fisher Kernels (FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy: -2.5% (gray scale vs color)

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2%

Size: ×32 smaller
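A hedged sketch of this dimensionality reduction with torchvision's AlexNet: the last hidden layer is replaced by a much narrower one (4096 to 128, i.e. descriptors ×32 smaller) and the network is then retrained on the target data. The 128-d width is an illustrative assumption.

```python
import torch.nn as nn
from torchvision import models

model = models.alexnet(pretrained=True)
model.classifier[4] = nn.Linear(4096, 128)   # retrain the last hidden layer at a smaller size
model.classifier[6] = nn.Linear(128, 1000)   # re-attach the classifier on top of the 128-d codes
# ... fine-tune on the target data; the resulting descriptors are 32x more compact.
```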

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings, INRIA Holidays, UKB

136

OTS Retrieval

Features pooled from the network of Krizhevsky et al., pretrained on images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful, but not superior to FV, VLAD or Sparse Coding
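A minimal sketch of retrieval with such neural codes: database images are ranked by Euclidean distance between L2-normalized descriptors. Here query (1 × D) and database (N × D) are placeholder arrays, e.g. FC7 descriptors from the extraction sketch above.

```python
import numpy as np

def rank_database(query, database):
    """Return database indices sorted from best to worst match."""
    dists = np.linalg.norm(database - query, axis=1)   # Euclidean distance to every image
    return np.argsort(dists)

ranking = rank_database(np.random.rand(1, 4096), np.random.rand(1000, 4096))
```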

138

Babenko Artem et al Neural codes for image retrieval Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015

139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015

140

Spatial Search (extracting N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints
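A minimal sketch of both ideas: max-pool the conv5 feature map of a pre-trained AlexNet into a global descriptor, and also pool a few predefined sub-windows as a crude spatial search. The 2×2 grid of 3×3 windows over the 6×6 map is an assumption.

```python
import torch
from torchvision import models

alexnet = models.alexnet(pretrained=True).eval()

def conv_descriptors(image_batch):
    with torch.no_grad():
        fmap = alexnet.features(image_batch)                     # N x 256 x 6 x 6 conv feature map
    global_desc = torch.amax(fmap, dim=(2, 3))                   # N x 256 global max-pooled descriptor
    local_descs = [torch.amax(fmap[:, :, i:i + 3, j:j + 3], dim=(2, 3))
                   for i in (0, 3) for j in (0, 3)]              # 4 local descriptors (spatial search)
    norm = torch.nn.functional.normalize
    return norm(global_desc, dim=1), [norm(d, dim=1) for d in local_descs]
```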

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015

141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños M, Mestre R, Talavera E, Giró-i-Nieto X, Radeva P. Visual Summary of Egocentric Photostreams by Representative Keyframes. In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015, Turin, Italy, 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
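One possible implementation of that step, assuming k-means as the Euclidean clustering and picking, for each cluster, the image closest to its centroid as the representative keyframe. fc7_features is an (n_images × 4096) array such as the FC7 descriptors sketched earlier, and the number of clusters k is arbitrary here.

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_keyframes(fc7_features, k=5):
    kmeans = KMeans(n_clusters=k, n_init=10).fit(fc7_features)
    keyframes = []
    for c, centroid in enumerate(kmeans.cluster_centers_):
        members = np.where(kmeans.labels_ == c)[0]               # images assigned to cluster c
        dists = np.linalg.norm(fc7_features[members] - centroid, axis=1)
        keyframes.append(int(members[np.argmin(dists)]))         # closest image to the centroid
    return keyframes
```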

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 64: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification GoogLeNet

64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet (NiN)

65

3x3 and 5x5 convolutions deal with different scales

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

66

1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

67

In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions

Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]

E2E Classification GoogLeNet (NiN)

E2E Classification GoogLeNet

68

In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto. Shallow and Deep Convolutional Networks for Saliency Prediction. CVPR 2016

100

Datasets: TRAIN / VALIDATION / TEST
SALICON [Jiang'15] (large scale): 10,000 / 5,000 / 5,000
iSUN [Xu'15] (large scale): 6,000 / 926 / 2,000
CAT2000 [Borji'15]: 2,000 / - / 2,000
MIT300 [Judd'12]: 300 / - / -

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto. Shallow and Deep Convolutional Networks for Saliency Prediction. CVPR 2016

101

E2E Saliency JuntingNet

102

Architecture diagram (highlighted stage: IMAGE INPUT, RGB, 96x96). The network outputs 2,304 = 48x48 values, reshaped into a 2D map and then upsampled + filtered.

E2E Saliency JuntingNet

103

Architecture diagram (highlighted stage: 3 CONV LAYERS). Input 96x96; output 2,304 = 48x48 values, reshaped into a 2D map and then upsampled + filtered.

E2E Saliency JuntingNet

104

Architecture diagram (highlighted stage: 2 DENSE LAYERS). Input 96x96; output 2,304 = 48x48 values, reshaped into a 2D map and then upsampled + filtered.

E2E Saliency JuntingNet

105

Architecture diagram (full pipeline): 96x96 RGB input → 3 conv layers → 2 dense layers → 2,304 = 48x48 values, reshaped into a 2D map and then upsampled + filtered.
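
A minimal sketch of a network with this shape (96x96 RGB input, 3 conv layers, 2 dense layers, a 2,304-dimensional output reshaped into a 48x48 map), assuming PyTorch; the filter counts, kernel sizes and pooling are illustrative guesses, not the exact JuntingNet parameters:

```python
import torch
import torch.nn as nn

class ShallowSaliencyNet(nn.Module):
    """96x96 RGB input -> 3 conv layers -> 2 dense layers -> 48x48 map.
    Filter counts, kernel sizes and pooling are illustrative guesses."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5), nn.ReLU(inplace=True), nn.MaxPool2d(2),    # 96 -> 92 -> 46
            nn.Conv2d(32, 64, 3), nn.ReLU(inplace=True), nn.MaxPool2d(2),   # 46 -> 44 -> 22
            nn.Conv2d(64, 128, 3), nn.ReLU(inplace=True), nn.MaxPool2d(2),  # 22 -> 20 -> 10
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 10 * 10, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 48 * 48), nn.Sigmoid(),     # 2,304 values = 48x48 saliency map
        )

    def forward(self, x):
        saliency = self.regressor(self.features(x))
        return saliency.view(-1, 1, 48, 48)             # reshape into the 2D map

x = torch.randn(2, 3, 96, 96)
print(ShallowSaliencyNet()(x).shape)                    # torch.Size([2, 1, 48, 48])
```

The predicted 48x48 map would then be upsampled and filtered (e.g. with a Gaussian blur) to the resolution of the input image.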

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function: Mean Square Error (MSE)

Weight initialization: Gaussian distribution

Learning rate: 0.03 to 0.0001

Mini batch size: 128

Training time: 7h (SALICON), 4h (iSUN)

Acceleration: SGD + Nesterov momentum (0.9)

Regularisation: Maxout norm

GPU: NVIDIA GTX 980
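
The hyper-parameters above can be wired together as in the following minimal sketch, assuming PyTorch, a placeholder model, and an assumed `train_loader` that yields batches of 128 (image, saliency map) pairs; the exact schedule between 0.03 and 0.0001 is not given on the slide, so an exponential decay is used here as a stand-in:

```python
import torch
import torch.nn as nn

# Placeholder model; in practice this would be the saliency network itself
# (e.g. the ShallowSaliencyNet sketch above).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 96 * 96, 48 * 48), nn.Sigmoid())

criterion = nn.MSELoss()                                  # Mean Square Error loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,  # SGD + Nesterov momentum 0.9
                            momentum=0.9, nesterov=True)
# The slide only gives the 0.03 -> 0.0001 range; exponential decay is one option.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

def train_one_epoch(train_loader):
    """One pass over mini-batches of 128 (image, saliency map) pairs."""
    model.train()
    for images, saliency_maps in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), saliency_maps.flatten(1))
        loss.backward()                                   # back-propagate the Euclidean error
        optimizer.step()
    scheduler.step()                                      # anneal the learning rate each epoch
```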

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto. Shallow and Deep Convolutional Networks for Saliency Prediction. CVPR 2016

108

Training curve for the SALICON database: back-propagation with the Euclidean (MSE) loss; x-axis: number of iterations (training time).

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto. Shallow and Deep Convolutional Networks for Saliency Prediction. CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto. Shallow and Deep Convolutional Networks for Saliency Prediction. CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto. Shallow and Deep Convolutional Networks for Saliency Prediction. CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto. Shallow and Deep Convolutional Networks for Saliency Prediction. CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto. Shallow and Deep Convolutional Networks for Saliency Prediction. CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto. Shallow and Deep Convolutional Networks for Saliency Prediction. CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto. Shallow and Deep Convolutional Networks for Saliency Prediction. CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto. Shallow and Deep Convolutional Networks for Saliency Prediction. CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)
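
A minimal sketch of the fine-tuning recipe, assuming PyTorch/torchvision: load weights pre-trained on ImageNet (Domain A), freeze the convolutional layers, and replace the last classifier layer with a new one for the target domain. The 10-class head and the learning rate are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models

# Minimal fine-tuning sketch (an illustration, not the lecture's exact recipe).
num_domain_b_classes = 10                      # hypothetical target task

net = models.vgg16(pretrained=True)            # downloads ImageNet-trained weights
for param in net.features.parameters():
    param.requires_grad = False                # freeze the convolutional layers

net.classifier[6] = nn.Linear(4096, num_domain_b_classes)   # new task-specific head

optimizer = torch.optim.SGD(
    [p for p in net.parameters() if p.requires_grad],       # train only the new/unfrozen layers
    lr=1e-3, momentum=0.9)
```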

118

E2E Fine-tuning

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor, Xavier Giro-i-Nieto and Brendan Jou, "From pixels to sentiments" (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive / True negative / False positive / False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A. Salvador, Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A. and Giró-i-Nieto, X., "Cultural Event Recognition with Visual ConvNets and Temporal Models", in CVPR ChaLearn Looking at People Workshop 2015. [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto. Shallow and Deep Convolutional Networks for Saliency Prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto. Shallow and Deep Convolutional Networks for Saliency Prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

125

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126 Razavian, Ali, Hossein Azizpour, Josephine Sullivan and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
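
For example, the activations of a late fully connected layer (an fc7-like "neural code") can be read out and used as a generic descriptor. A minimal sketch assuming PyTorch/torchvision and a pre-trained AlexNet (the preprocessing values are the standard ImageNet ones; the `describe` helper is hypothetical):

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Minimal sketch: use the activations of the penultimate fully connected layer
# (fc7) of a pre-trained AlexNet as an off-the-shelf 4096-D descriptor.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

net = models.alexnet(pretrained=True).eval()
fc7_extractor = torch.nn.Sequential(
    net.features, net.avgpool, torch.nn.Flatten(),
    *list(net.classifier.children())[:-1],    # keep everything up to fc7, drop the classifier
)

def describe(path):
    """Return an off-the-shelf fc7 descriptor for the image at `path`."""
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return fc7_extractor(x).squeeze(0)    # shape: (4096,)
```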

127

Off-The-Shelf (OTS) Features

Babenko, Artem, et al. Neural codes for image retrieval. Computer Vision–ECCV 2014

128

OTS Classification Razavian

Razavian, Ali, Hossein Azizpour, Josephine Sullivan and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization

Accuracy: +5%

Three representative architectures considered: AlexNet, ZF, OverFeat

Training times: from 5 days (fast) to 3 weeks (slow) on an NVIDIA GTX Titan GPU
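
As an illustration of the L2-normalization step listed above (normalize each CNN descriptor before the linear classifier), here is a minimal sketch assuming NumPy and scikit-learn; `X_train` and `y_train` are random placeholders for real descriptors and labels:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Minimal sketch of the recipe: L2-normalize every CNN descriptor before
# training the linear classifier.
def l2_normalize(X, eps=1e-10):
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)

X_train = np.random.rand(100, 4096)            # placeholder descriptors
y_train = np.random.randint(0, 2, size=100)    # placeholder binary labels

clf = LinearSVC(C=1.0)
clf.fit(l2_normalize(X_train), y_train)
```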

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
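
A minimal sketch of the data-augmentation idea (random crops and horizontal flips of the training images), assuming torchvision transforms; the exact augmentation scheme evaluated in the paper differs:

```python
from torchvision import transforms

# Minimal sketch of training-time augmentation: random crops and horizontal
# flips of each image.
augment = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
```

At test time, predictions can additionally be averaged over several crops and their flips.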

132

OTS Classification Return of devil

Fisher Kernels (FK) vs. ConvNets (CNN), trained on Color vs. Gray Scale (GS) images

Accuracy: -2.5% with gray scale

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2%

Size: x32 smaller

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. Neural codes for image retrieval. Computer Vision–ECCV 2014. Springer International Publishing, 2014. 584-599

135

OTS Retrieval

Oxford Buildings, INRIA Holidays, UKB

136

OTS Retrieval

Pooled from the network of Krizhevsky et al., pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. Neural codes for image retrieval. Computer Vision–ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful, but not superior to FV, VLAD or Sparse Coding.
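
A minimal retrieval sketch with such "neural codes", assuming NumPy: descriptors are L2-normalized and database images are ranked by cosine similarity to the query (the random arrays are placeholders for real CNN descriptors):

```python
import numpy as np

# Minimal "neural codes" retrieval sketch: L2-normalize the FC descriptors and
# rank database images by cosine similarity (dot product) to the query.
def rank_by_neural_codes(query, database):
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    similarities = db @ q                  # cosine similarity to every database image
    return np.argsort(-similarities)       # indices sorted from most to least similar

database = np.random.rand(1000, 4096).astype(np.float32)
query = np.random.rand(4096).astype(np.float32)
print(rank_by_neural_codes(query, database)[:10])   # top-10 retrieved images
```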

138 Babenko, Artem, et al. Neural codes for image retrieval. Computer Vision–ECCV 2014

OTS Retrieval FC layers

Razavian et al. A baseline for visual instance retrieval with deep convolutional networks. ICLR 2015 (139)

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al. A baseline for visual instance retrieval with deep convolutional networks. ICLR 2015 (140)

Spatial Search (extract N local descriptors from predefined locations) increases performance at a computational cost.

Medium memory footprints
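
A minimal sketch of the spatial-search idea, assuming NumPy and an assumed `describe_crop` function that returns an L2-normalized conv-feature descriptor for an image crop: each image is described by the full frame plus a grid of crops, and the image-to-image distance is the best (minimum) distance over all pairs of locations:

```python
import numpy as np

# Minimal spatial-search sketch: N local descriptors per image, taken at
# predefined locations (full frame plus a 2x2 grid of crops); the distance
# between two images is the minimum distance over all pairs of locations.
def grid_crops(image, grid=2):
    h, w = image.shape[:2]
    crops = [image]                                        # full frame
    for i in range(grid):
        for j in range(grid):
            crops.append(image[i * h // grid:(i + 1) * h // grid,
                               j * w // grid:(j + 1) * w // grid])
    return crops

def spatial_search_distance(query_img, db_img, describe_crop):
    q = np.stack([describe_crop(c) for c in grid_crops(query_img)])
    d = np.stack([describe_crop(c) for c in grid_crops(db_img)])
    pairwise = 2.0 - 2.0 * q @ d.T       # squared L2 distance for unit-norm vectors
    return pairwise.min()                # best-matching pair of locations
```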

Razavian et al. A baseline for visual instance retrieval with deep convolutional networks. ICLR 2015 (141)

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños M., Mestre R., Talavera E., Giró-i-Nieto X., Radeva P. Visual Summary of Egocentric Photostreams by Representative Keyframes. In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015, Turin, Italy, 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
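
A minimal sketch of this summarization step, assuming NumPy and scikit-learn: cluster the FC7 descriptors with (Euclidean) k-means and keep, for each cluster, the image closest to the centroid as the representative keyframe (the random features and the number of clusters are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

# Minimal sketch: Euclidean k-means over FC7 descriptors, one keyframe per
# cluster (the image closest to the cluster centroid).
def representative_keyframes(features, n_clusters=5):
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    keyframes = []
    for c in range(n_clusters):
        members = np.where(kmeans.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - kmeans.cluster_centers_[c], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))   # index of the representative image
    return sorted(keyframes)

features = np.random.rand(200, 4096)     # placeholder FC7 descriptors (e.g. from AlexNet)
print(representative_keyframes(features))
```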

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto


109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 69: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification GoogLeNet

69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 70: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification GoogLeNet

70

3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 71: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification GoogLeNet

71

Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time

and no fully connected layers needed

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 72: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification GoogLeNet

72

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S. & Fei-Fei, L. (2015), "ImageNet large scale visual recognition challenge", arXiv preprint arXiv:1409.0575 [web]

Crowdsource loss (0.3%): more than 5 object classes

E2E Classification Humans

87 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

E2E Classification Humans

88 Andrej Karpathy, "What I learned from competing against a computer on ImageNet" (2014)

Test data collection from one human [interface]

"Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?"

E2E Classification Humans

E2E Classification Humans

89 NVIDIA, "Mocha.jl: Deep Learning for Julia" (2015)

ResNet

90

Let's play a game

91

92

What have you seen?

93

Tower

94

Tower, House

95

Tower, House

Rocks

96

E2E Saliency

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit: Junting Pan, "Visual Saliency Prediction using Deep Learning Techniques" (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

100

Dataset: TRAIN / VALIDATION / TEST

SALICON [Jiang'15]: 10,000 / 5,000 / 5,000 (large scale)

iSUN [Xu'15]: 6,000 / 926 / 2,000 (large scale)

CAT2000 [Borji'15]: 2,000 / - / 2,000

MIT300 [Judd'12]: 300 / - / -

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 → 2304 = 48x48

IMAGE INPUT (RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 → 2304 = 48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 → 2304 = 48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 → 2304 = 48x48
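
Putting the previous slides together, a JuntingNet-style shallow network can be sketched as a 96x96 RGB input, three convolutional layers, two dense layers, and a 2304-dimensional output reshaped into a 48x48 saliency map. The filter counts and kernel sizes below are assumptions for illustration, not the exact configuration used by the authors.

```python
# Rough sketch of a shallow saliency network: 96x96 RGB in, three conv
# layers, two dense layers, 2304 = 48x48 saliency values out.
import torch
import torch.nn as nn

class ShallowSaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),   # 96 -> 48
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 48 -> 24
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 24 -> 12
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 12 * 12, 4096), nn.ReLU(),
            nn.Linear(4096, 48 * 48), nn.Sigmoid(),   # 2304 saliency values in [0, 1]
        )

    def forward(self, x):
        y = self.regressor(self.features(x))
        return y.view(-1, 1, 48, 48)   # 2D map, upsampled and filtered afterwards

net = ShallowSaliencyNet()
saliency = net(torch.randn(1, 3, 96, 96))   # -> (1, 1, 48, 48)
```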

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function: Mean Square Error (MSE)

Weight initialization: Gaussian distribution

Learning rate: 0.03 to 0.0001

Mini batch size: 128

Training time: 7h (SALICON), 4h (iSUN)

Acceleration: SGD + Nesterov momentum (0.9)

Regularisation: Maxout norm

GPU: NVIDIA GTX 980
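
A minimal training sketch with the settings listed above (MSE loss, SGD with Nesterov momentum 0.9, learning rate decayed from 0.03, mini-batches of 128). It reuses the ShallowSaliencyNet sketch from the previous block; the data loader and the decay schedule are placeholders standing in for the SALICON/iSUN pipelines.

```python
# Training-loop sketch matching the listed hyperparameters; not the
# authors' original code.
import torch
import torch.nn as nn

model = ShallowSaliencyNet()                 # defined in the previous sketch
criterion = nn.MSELoss()                     # Euclidean/MSE loss on the 48x48 maps
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.7)  # decay towards 1e-4

num_epochs = 16
# placeholder data: random images and maps standing in for SALICON/iSUN batches of 128
loader = [(torch.randn(128, 3, 96, 96), torch.rand(128, 1, 48, 48)) for _ in range(4)]

for epoch in range(num_epochs):
    for images, gt_maps in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), gt_maps)  # back-propagate the Euclidean distance
        loss.backward()
        optimizer.step()
    scheduler.step()
```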

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance. Training curve for the SALICON database.

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part I': End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

118

E2E Fine-tuning

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network
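
In code, fine-tuning can be sketched as: load weights pre-trained on Domain A (e.g. ImageNet), swap the last layer for the Domain B label space, optionally freeze the lower layers, and train with a small learning rate. The two-class head below is an illustrative choice, not the exact setup of the cited works.

```python
# Fine-tuning sketch: reuse ImageNet weights, replace the last layer,
# and only update the unfrozen parameters.
import torch
import torch.nn as nn
import torchvision.models as models

model = models.vgg16(pretrained=True)       # weights learned on Domain A (ImageNet)

for param in model.features.parameters():   # optionally freeze the convolutional layers
    param.requires_grad = False

model.classifier[6] = nn.Linear(4096, 2)    # new last layer for Domain B (e.g. positive/negative)

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-3, momentum=0.9)                  # smaller learning rate than training from scratch
```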

119

E2E Fine-tuning Sentiments

CNN

Campos, Victor, Amaia Salvador, Xavier Giro-i-Nieto and Brendan Jou, "Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction", In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia, pp. 57-62, ACM 2015

120

Campos, Victor, Xavier Giro-i-Nieto and Brendan Jou, "From pixels to sentiments" (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive / True negative / False positive / False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A. Salvador, Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A. and Giró-i-Nieto, X., "Cultural Event Recognition with Visual ConvNets and Temporal Models", in CVPR ChaLearn Looking at People Workshop 2015, 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor and Xavier Giro-i-Nieto, "Shallow and Deep Convolutional Networks for Saliency Prediction", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

125

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126 Razavian, Ali, Hossein Azizpour, Josephine Sullivan and Stefan Carlsson, "CNN features off-the-shelf: an astounding baseline for recognition", CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
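
A sketch of the off-the-shelf recipe: run images through a network trained for Task A, read out an intermediate activation (an fc7-like 4096-D vector here) and feed it to a simple classifier for Task B. The layer indices follow torchvision's AlexNet, and the images and labels below are random placeholders for the target task's data.

```python
# Off-the-shelf features: a frozen pre-trained net as a descriptor extractor,
# plus a linear SVM for the new task.
import torch
import torch.nn as nn
import torchvision.models as models
from sklearn.svm import LinearSVC

alexnet = models.alexnet(pretrained=True).eval()
fc7 = nn.Sequential(*list(alexnet.classifier.children())[:-1])  # drop the 1000-way layer

def describe(images):
    """images: tensor (N, 3, 224, 224) -> (N, 4096) descriptors."""
    with torch.no_grad():
        x = alexnet.features(images)
        x = alexnet.avgpool(x)
        return fc7(torch.flatten(x, 1)).numpy()

train_images = torch.randn(8, 3, 224, 224)   # placeholder batch for Task B
train_labels = [0, 1, 0, 1, 0, 1, 0, 1]      # placeholder labels

descriptors = describe(train_images)
classifier = LinearSVC().fit(descriptors, train_labels)
```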

127

Off-The-Shelf (OTS) Features

Babenko, Artem, et al., "Neural codes for image retrieval", Computer Vision - ECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization

Accuracy: +5%

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy: -2.5%

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2%

Size: x32
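
One way to picture this: keep the pre-trained layers fixed, shrink the penultimate (descriptor) layer to, say, 128 dimensions (roughly a x32 reduction from 4096) and retrain the classifier on top of it. The sizes and the AlexNet layer indices below are illustrative assumptions, not the exact network from the paper.

```python
# Dimensionality-reduction sketch: retrain a smaller last hidden layer on top
# of a frozen pre-trained network.
import torch
import torch.nn as nn
import torchvision.models as models

model = models.alexnet(pretrained=True)
for p in model.parameters():
    p.requires_grad = False                 # keep the pre-trained layers fixed

model.classifier[4] = nn.Linear(4096, 128)  # shrink the descriptor layer (4096 -> 128)
model.classifier[6] = nn.Linear(128, 1000)  # retrain the classifier on the smaller code

optimizer = torch.optim.SGD(
    list(model.classifier[4].parameters()) + list(model.classifier[6].parameters()),
    lr=0.01, momentum=0.9)
```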

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al., "Neural codes for image retrieval", Computer Vision - ECCV 2014, Springer International Publishing, 2014, 584-599

135

OTS Retrieval

Oxford Buildings, Inria Holidays, UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al., "Neural codes for image retrieval", Computer Vision - ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful but not superior (w.r.t. FV, VLAD, Sparse Coding)
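
The retrieval use of these "neural codes" can be sketched in a few lines: L2-normalise the fully connected descriptors and rank the database by Euclidean distance to the query. Pure NumPy; the random vectors below stand in for descriptors extracted as in the previous sketches.

```python
# Retrieval-by-ranking sketch with L2-normalised FC descriptors.
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

database = l2_normalize(np.random.rand(1000, 4096))   # placeholder neural codes
query = l2_normalize(np.random.rand(1, 4096))

distances = np.linalg.norm(database - query, axis=1)  # Euclidean distance to the query
ranking = np.argsort(distances)                       # best matches first
print(ranking[:10])
```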

138 Babenko, Artem, et al., "Neural codes for image retrieval", Computer Vision - ECCV 2014

OTS Retrieval FC layers

Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks", ICLR 2015 (139)

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks", ICLR 2015 (140)

Spatial Search (extract N local descriptors from predefined locations) increases performance at a computational cost.

Medium memory footprints

Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks", ICLR 2015 (141)
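
A sketch of convolutional-layer descriptors with a simple spatial search: pool the feature map over a few predefined sub-windows so each image yields N local descriptors, then score a pair of images by their best-matching locations. The window grid and the max-pooling choice are assumptions for illustration, not the exact protocol of the paper.

```python
# Conv-layer descriptors + spatial search sketch.
import torch
import torch.nn.functional as F
import torchvision.models as models

features = models.vgg16(pretrained=True).features.eval()

def local_descriptors(image):
    """image: (1, 3, H, W) -> (N, C) max-pooled descriptors from sub-windows."""
    with torch.no_grad():
        fmap = features(image)                       # (1, C, h, w) conv feature map
    _, C, h, w = fmap.shape
    windows = [fmap,                                 # whole image
               fmap[:, :, :h // 2, :], fmap[:, :, h // 2:, :],   # top / bottom halves
               fmap[:, :, :, :w // 2], fmap[:, :, :, w // 2:]]   # left / right halves
    descs = [F.adaptive_max_pool2d(win, 1).flatten(1) for win in windows]
    return F.normalize(torch.cat(descs, dim=0), dim=1)           # (N = 5, C)

q = local_descriptors(torch.randn(1, 3, 224, 224))
d = local_descriptors(torch.randn(1, 3, 224, 224))
score = (q @ d.t()).max()   # best-matching pair of locations (cosine similarity)
```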

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños, M., Mestre, R., Talavera, E., Giró-i-Nieto, X., Radeva, P., "Visual Summary of Egocentric Photostreams by Representative Keyframes", IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX), Turin, Italy, 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
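
The keyframe-selection step can be sketched as k-means (Euclidean distance) over the FC7 descriptors, keeping for each cluster the photo closest to the centroid as the representative keyframe. The descriptors below are random placeholders; in practice they would be extracted as in the off-the-shelf sketch above, and the number of clusters is an assumption.

```python
# Keyframe selection sketch: k-means over FC7 descriptors, one
# representative photo per cluster.
import numpy as np
from sklearn.cluster import KMeans

fc7_descriptors = np.random.rand(500, 4096)   # placeholder: one row per photo

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(fc7_descriptors)

keyframes = []
for c in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    dists = np.linalg.norm(fc7_descriptors[members] - kmeans.cluster_centers_[c], axis=1)
    keyframes.append(members[np.argmin(dists)])   # photo closest to the centroid

print(sorted(keyframes))
```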

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 73: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification GoogLeNet

73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 74: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification GoogLeNet

74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 75: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification VGG

75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 76: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification VGG

76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG 3x3 Stacks

77

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

E2E Classification VGG

78

Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]

No poolings between some convolutional layers Convolution strides of 1 (no skipping)

E2E Classification

79

36 top 5 errorhellipwith 152 layers

E2E Classification ResNet

80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

81

Deeper networks (34 is deeper than 18) are more difficult to train

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

Thin curves training errorBold curves validation error

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O’Connor and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

109

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O’Connor and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

110

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet iSUN

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O’Connor and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O’Connor and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

112

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O’Connor and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

113

JuntingNet / Ground Truth / Pixels

E2E Saliency JuntingNet SALICON

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O’Connor and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O’Connor and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

115

E2E Saliency JuntingNet

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O’Connor and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” CVPR 2016.

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part I’ End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit: Victor Campos, “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)

118

E2E Fine-tuning

Slide credit: Victor Campos, “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)

Fine-tuning a pre-trained network
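
A minimal sketch of the fine-tuning idea, assuming an ImageNet-pretrained VGG-16 from torchvision re-targeted to a new task. Which layers to slow down or freeze, and the learning rates, are illustrative choices here, not the exact recipes of the works on the following slides.

```python
# Fine-tuning sketch: reuse Domain A (ImageNet) weights, retrain for Domain B.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 2                                      # e.g. positive / negative sentiment
model = models.vgg16(pretrained=True)                # representation learned on Domain A
model.classifier[6] = nn.Linear(4096, num_classes)   # replace the last layer for Domain B

optimizer = torch.optim.SGD([
    {"params": model.features.parameters(), "lr": 1e-4},    # small updates for pretrained convs
    {"params": model.classifier.parameters(), "lr": 1e-3},  # larger updates for the new head
], momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Alternatively, freeze the convolutional layers entirely and only train the new head:
# for p in model.features.parameters():
#     p.requires_grad = False
```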

119

E2E Fine-tuning Sentiments

CNN

Campos, Victor, Amaia Salvador, Xavier Giro-i-Nieto, and Brendan Jou. “Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction.” In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia, pp. 57-62. ACM, 2015.

120

Campos, Victor, Xavier Giro-i-Nieto, and Brendan Jou. “From pixels to sentiments” (Submitted)

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully Convolutional Networks for Semantic Segmentation.” CVPR 2015.

E2E Fine-tuning Sentiments

True positive / True negative / False positive / False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A. Salvador, Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A., and Giró-i-Nieto, X. “Cultural Event Recognition with Visual ConvNets and Temporal Models”, in CVPR ChaLearn Looking at People Workshop 2015, 2015. [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O’Connor and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O’Connor and Xavier Giro-i-Nieto. “Shallow and Deep Convolutional Networks for Saliency Prediction.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

125

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126 Razavian, Ali, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. “CNN features off-the-shelf: an astounding baseline for recognition.” CVPRW 2014.

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
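
For instance, a sketch of extracting such a descriptor with an ImageNet-pretrained AlexNet from torchvision (the cited works used Caffe models; “FC7” here refers to the penultimate fully connected layer, and the preprocessing constants are the standard ImageNet statistics):

```python
# Off-the-shelf descriptor sketch: keep the FC7 activations of a pretrained AlexNet.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

alexnet = models.alexnet(pretrained=True).eval()
fc7_extractor = nn.Sequential(                       # everything up to (and including) FC7
    alexnet.features, alexnet.avgpool, nn.Flatten(),
    *list(alexnet.classifier.children())[:-1],       # drop the final 1000-way classifier
)

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def describe(path):
    """Return a 4096-D off-the-shelf descriptor for one image file."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return fc7_extractor(x).squeeze(0)           # shape: (4096,)
```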

127

Off-The-Shelf (OTS) Features

Babenko, Artem, et al. “Neural codes for image retrieval.” Computer Vision–ECCV 2014.

128

OTS Classification Razavian

Razavian, Ali, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. “CNN features off-the-shelf: an astounding baseline for recognition.” CVPRW 2014.

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A. “Return of the devil in the details: Delving deep into convolutional nets.” BMVC 2014.

Classifier: L2-normalization

Accuracy: +5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A. “Return of the devil in the details: Delving deep into convolutional nets.” BMVC 2014.

131

OTS Classification Return of devil: Data augmentation

Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A. “Return of the devil in the details: Delving deep into convolutional nets.” BMVC 2014.
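
As an illustration of this kind of training-time augmentation (random crops and horizontal flips); the exact crop/flip policy and the additional test-time augmentation studied in the paper are not reproduced here.

```python
# Training-time data augmentation sketch with torchvision transforms.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),               # random crop, rescaled to the input size
    transforms.RandomHorizontalFlip(),               # mirror the image with probability 0.5
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```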

132

OTS Classification Return of devil

Fisher Kernels (FK) vs ConvNets (CNN)

Color vs Gray Scale (GS)

Accuracy: -2.5

133

OTS Classification Return of devil

Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A. “Return of the devil in the details: Delving deep into convolutional nets.” BMVC 2014.

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2

Size: x32

134

OTS Classification Return of devil

Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A. “Return of the devil in the details: Delving deep into convolutional nets.” BMVC 2014.

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. “Neural codes for image retrieval.” Computer Vision–ECCV 2014. Springer International Publishing, 2014. 584-599.

135

OTS Retrieval

Oxford Buildings, INRIA Holidays, UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al., pretrained with images from ImageNet.

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. “Neural codes for image retrieval.” Computer Vision–ECCV 2014.

Off-the-shelf CNN descriptors from fully connected layers prove useful, but not superior (w.r.t. FV, VLAD, Sparse Coding)
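
A sketch of that retrieval setting, assuming the hypothetical describe() helper from the earlier off-the-shelf sketch and a placeholder list of database images: L2-normalize the FC descriptors, then rank database images by similarity to the query.

```python
# FC-descriptor retrieval sketch: L2-normalize, then rank by cosine similarity to the query.
import torch
import torch.nn.functional as F

db_paths = ["db/0001.jpg", "db/0002.jpg", "db/0003.jpg"]     # placeholder database images
db = F.normalize(torch.stack([describe(p) for p in db_paths]), dim=1)

def retrieve(query_path, k=2):
    q = F.normalize(describe(query_path), dim=0)
    scores = db @ q                                          # cosine similarity after L2 norm
    ranked = torch.argsort(scores, descending=True)[:k].tolist()
    return [(db_paths[i], float(scores[i])) for i in ranked]
```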

138 Babenko, Artem, et al. “Neural codes for image retrieval.” Computer Vision–ECCV 2014.

OTS Retrieval FC layers

Razavian et al. “A baseline for visual instance retrieval with deep convolutional networks.” ICLR 2015.

139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al. “A baseline for visual instance retrieval with deep convolutional networks.” ICLR 2015.

140

Spatial search (extract N local descriptors from predefined locations) increases performance at a computational cost

Medium memory footprints
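
A sketch of a convolutional-layer descriptor in this spirit: a single global max-pool over the last VGG-16 conv feature map gives one value per channel. A spatial-search variant would pool over several predefined sub-regions and keep one descriptor per region; those region layouts are design choices of the paper and are not reproduced here.

```python
# Conv-layer descriptor sketch: max-pool the last conv feature map over space.
import torch
from torchvision import models

vgg = models.vgg16(pretrained=True).eval()

def conv_descriptor(image_tensor):
    """image_tensor: (1, 3, H, W), already preprocessed; returns a 512-D descriptor."""
    with torch.no_grad():
        fmap = vgg.features(image_tensor)            # (1, 512, H/32, W/32)
    return torch.amax(fmap, dim=(2, 3)).squeeze(0)   # one max per channel
```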

Razavian et al. “A baseline for visual instance retrieval with deep convolutional networks.” ICLR 2015.

141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños, M., Mestre, R., Talavera, E., Giró-i-Nieto, X., Radeva, P. “Visual Summary of Egocentric Photostreams by Representative Keyframes.” In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015, Turin, Italy, 2015.

Clustering based on Euclidean distance over FC7 features from AlexNet
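
A sketch of that summarization step, assuming the hypothetical describe() FC7 extractor from the earlier sketch and scikit-learn's k-means; the number of clusters and the rule for picking a representative per cluster are illustrative assumptions.

```python
# Keyframe selection sketch: k-means on FC7 descriptors, keep the image nearest each centroid.
import numpy as np
from sklearn.cluster import KMeans

def summarize(image_paths, k=5):
    feats = np.stack([describe(p).numpy() for p in image_paths])    # (N, 4096) descriptors
    km = KMeans(n_clusters=k, n_init=10).fit(feats)
    keyframes = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(image_paths[members[np.argmin(dists)]])    # closest to the centroid
    return keyframes
```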

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 82: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification ResNet

82

Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 83: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification ResNet

83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 84: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

84

E2E Classification Humans

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 85: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

85

E2E Classification Humans

ldquoIs this a Border terrierrdquo

Crowdsourcing

Yes

No

Binary ground truth annotation from the crowd

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Classifier: L2-normalization

Accuracy: +5%

Three representative architectures considered: AlexNet, ZF and OverFeat

Training time: from 5 days (fast) to 3 weeks (slow) on an NVIDIA GTX Titan GPU
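For reference, the L2-normalization of the descriptors mentioned above amounts to a one-line operation on the feature matrix; a minimal NumPy helper:

```python
import numpy as np

def l2_normalize(features, eps=1e-10):
    """Row-wise L2 normalization of an (N, D) matrix of CNN descriptors."""
    return features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
```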

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014


131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
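The kind of data augmentation evaluated here (random crops and horizontal flips of the training images) can be written, for instance, with torchvision transforms; the crop sizes and normalization statistics below are the usual ImageNet values, used purely for illustration:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),            # random crops of the training image
    transforms.RandomHorizontalFlip(),     # random horizontal flips
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```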

132

OTS Classification Return of devil

Fisher Kernels (FK) vs. ConvNets (CNN), trained on Color vs. Gray Scale (GS) images

Accuracy: -2.5% with gray scale

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy: -2%

Size: x32 smaller
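A minimal sketch of the idea, using a VGG-16 stand-in for illustration (the cited paper works with its own CNN architectures): the last fully connected descriptor layer is replaced by a much narrower one and retrained, shrinking the descriptor from 4096 to, e.g., 128 dimensions.

```python
import torch.nn as nn
from torchvision import models

num_classes = 1000                                  # or the target dataset's class count
model = models.vgg16(pretrained=True)
model.classifier[3] = nn.Linear(4096, 128)          # narrow fc7: 4096 -> 128 (x32 smaller)
model.classifier[6] = nn.Linear(128, num_classes)   # classification layer adapted to the new width
# the two new layers are then retrained (earlier layers kept fixed or fine-tuned)
```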

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al., "Neural codes for image retrieval", Computer Vision – ECCV 2014, Springer International Publishing, 2014, pp. 584-599

135

OTS Retrieval

Oxford Buildings, INRIA Holidays, UKB

136

OTS Retrieval

Features pooled from the network of Krizhevsky et al., pre-trained on ImageNet images
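A minimal sketch of retrieval with such "neural codes": describe every image with an FC-layer activation (for instance the fc7 extractor sketched earlier), L2-normalize, and rank the database by Euclidean distance to the query; the helper below is illustrative.

```python
import numpy as np

def rank_database(query_desc, db_descs):
    """query_desc: (D,) array, db_descs: (N, D) array; returns indices, closest first."""
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    distances = np.linalg.norm(db - q, axis=1)   # Euclidean distance on unit-norm codes
    return np.argsort(distances)
```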

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al., "Neural codes for image retrieval", Computer Vision – ECCV 2014

Off-the-shelf CNN descriptors from fully connected layers prove useful, but not superior to FV, VLAD or Sparse Coding

138
Babenko, Artem, et al., "Neural codes for image retrieval", Computer Vision – ECCV 2014

OTS Retrieval FC layers

Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks", ICLR 2015
139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks", ICLR 2015
140

Spatial search (extracting N local descriptors from predefined locations) increases performance, at a computational cost

Medium memory footprints
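A rough sketch of the conv-feature spatial search (the window grid, pooling and distance choices below are illustrative assumptions, not the exact procedure of Razavian et al.): pool the conv5 feature map over a few predefined windows to obtain local descriptors, then score an image pair by its best window-to-window match.

```python
import numpy as np
import torch
from torchvision import models

conv_layers = models.vgg16(pretrained=True).features.eval()

def local_descriptors(image):                     # image: (1, 3, H, W), ImageNet-normalized
    with torch.no_grad():
        fmap = conv_layers(image)[0]              # (512, h, w) conv5 feature map
    _, h, w = fmap.shape
    windows = [(0, h, 0, w),                      # whole image
               (0, h // 2, 0, w // 2), (0, h // 2, w // 2, w),
               (h // 2, h, 0, w // 2), (h // 2, h, w // 2, w)]
    descs = []
    for r0, r1, c0, c1 in windows:                # max-pool each window -> one 512-d descriptor
        d = fmap[:, r0:r1, c0:c1].amax(dim=(1, 2)).numpy()
        descs.append(d / (np.linalg.norm(d) + 1e-10))
    return np.stack(descs)                        # (n_windows, 512) local descriptors

def image_distance(query_descs, db_descs):
    """Smallest distance over all window pairs (lower = more similar)."""
    dists = np.linalg.norm(query_descs[:, None, :] - db_descs[None, :, :], axis=-1)
    return dists.min()
```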

Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks", ICLR 2015
141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños, M., Mestre, R., Talavera, E., Giró-i-Nieto, X. and Radeva, P., "Visual Summary of Egocentric Photostreams by Representative Keyframes", in IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX), Turin, Italy, 2015

Clustering based on Euclidean distance over FC7 features from AlexNet
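A minimal sketch of the keyframe-selection step, using k-means as one concrete clustering choice over the FC7 descriptors (the published system's clustering details may differ): cluster the per-image descriptors and keep, for each cluster, the image closest to its centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_keyframes(fc7_features, n_clusters=5):
    """fc7_features: (N, 4096) array of per-image descriptors; returns keyframe indices."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(fc7_features)
    keyframes = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(fc7_features[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))
    return sorted(keyframes)
```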

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 86: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

86

Annotation Problems

Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28

Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]

Crowdsource loss (03) More than 5 objects classes

E2E Classification Humans

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 87: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

E2E Classification Humans

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 88: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)

Test data collection from one human [interface]

ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo

E2E Classification Humans

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 89: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

E2E Classification Humans

89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)

ResNet

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 90: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

9090

Letrsquos play a game

91

92

What have you seen

93

Tower

94

Tower House

95

Tower House

Rocks

96

E2E Saliency

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

97

Eye Tracker Mouse Click

Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)

E2E Saliency

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto


Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 98: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

98

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 99: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

99

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 100: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

100

TRAIN VALIDATION TEST

10000 5000 5000

6000 926 2000

CAT2000 [Borjirsquo15] 2000 - 2000

MIT300 [Juddrsquo12] 300 - -LargeScale

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 101: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

101

E2E Saliency JuntingNet

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 102: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

102

Upsample + filter

2D map

96x96 2340=48x48

IMAGE INPUT(RGB)

E2E Saliency JuntingNet

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 103: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

103

Upsample + filter

2D map

96x96 2340=48x48

3 CONV LAYERS

E2E Saliency JuntingNet

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain B: Fine-tuned

Learned Representation

Part I': End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain A: Learned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

118

E2E Fine-tuning

Slide credit: Victor Campos, "Layer-wise CNN surgery for Visual Sentiment Prediction" (ETSETB 2015)

Fine-tuning a pre-trained network
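A minimal sketch of the fine-tuning recipe, assuming a recent torchvision and its ImageNet-pretrained AlexNet as a stand-in backbone: the pre-trained layers are reused (optionally frozen) and only the last fully connected layer is replaced to match the new task before retraining with a small learning rate.

```python
import torch.nn as nn
from torchvision import models

def finetune_alexnet(num_classes, freeze_features=True):
    # Start from a network pre-trained on Domain A (ImageNet).
    model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    if freeze_features:
        for p in model.features.parameters():
            p.requires_grad = False                # keep the generic convolutional filters fixed
    # Replace the final 1000-way layer with one sized for Domain B (e.g. visual sentiment classes)
    # and retrain, typically with a small learning rate.
    model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_classes)
    return model
```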

119

E2E Fine-tuning Sentiments

CNN

Campos, Victor, Amaia Salvador, Xavier Giro-i-Nieto, and Brendan Jou. "Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction." In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia, pp. 57-62. ACM, 2015.

120

Campos, Victor, Xavier Giro-i-Nieto, and Brendan Jou. "From pixels to sentiments" (Submitted).

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive / True negative / False positive / False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A. Salvador, Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A., and Giró-i-Nieto, X., "Cultural Event Recognition with Visual ConvNets and Temporal Models", in CVPR ChaLearn Looking at People Workshop 2015, 2015. [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A (e.g. image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B (e.g. image retrieval): Part II, Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126

Razavian, Ali, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. "CNN features off-the-shelf: an astounding baseline for recognition." CVPRW 2014.

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task
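A sketch of what "off-the-shelf" means in practice, assuming torchvision's ImageNet-pretrained AlexNet as a stand-in for the network of Krizhevsky et al.: the forward pass is stopped at FC7 and the 4096-dimensional activation is used as a generic image descriptor for other tasks.

```python
import torch
import torch.nn as nn
from torchvision import models

def make_fc7_extractor():
    alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
    fc7 = nn.Sequential(*list(alexnet.classifier.children())[:-1])  # drop the final 1000-way layer

    def extract(batch):
        """batch: (N, 3, 224, 224) ImageNet-normalized images -> (N, 4096) FC7 descriptors."""
        with torch.no_grad():
            x = torch.flatten(alexnet.avgpool(alexnet.features(batch)), 1)
            return fc7(x)

    return extract
```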

127

Off-The-Shelf (OTS) Features

Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014.

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

L2-normalization of the features before the classifier: accuracy +5%.
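That gain comes from normalizing the descriptors before the linear classifier; a small scikit-learn sketch, where `features` and `labels` are assumed to be NumPy arrays of off-the-shelf descriptors and class labels.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_linear_classifier(features, labels, C=1.0):
    # L2-normalize each CNN descriptor before fitting the linear classifier.
    X = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-10)
    return LinearSVC(C=C).fit(X, labels)
```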

Three representative architectures considered: AlexNet, ZF, OverFeat.

Training time: from 5 days (fast) to 3 weeks (slow) on an NVIDIA GTX Titan GPU.

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

131

OTS Classification Return of devil: Data augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
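A sketch of this kind of training-time augmentation written with torchvision transforms (random crops plus horizontal flips); the exact cropping and flip-averaging strategy evaluated in the paper is not reproduced here.

```python
from torchvision import transforms

# Random crops and horizontal flips at training time; a single centre crop at test time.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```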

132

OTS Classification Return of devil

Fisher Kernels (FK) vs. ConvNets (CNN), trained on Color vs. Gray Scale (GS) input: accuracy -2.5% with gray scale.

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes: accuracy -2% for a size reduction of x32.

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014. Springer International Publishing, 2014. 584-599.

135

OTS Retrieval

Oxford Buildings, INRIA Holidays, UKB

136

OTS Retrieval

Pooled from the network of Krizhevsky et al., pretrained on ImageNet images.

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014.

Off-the-shelf CNN descriptors from fully connected layers prove useful, but not superior to classic aggregated descriptors (w.r.t. FV, VLAD, Sparse Coding).

138

Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014.

OTS Retrieval FC layers
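A sketch of retrieval with such neural codes: FC-layer descriptors (for example from the FC7 extractor above) are L2-normalized, optionally PCA-compressed, and ranked by Euclidean distance. The array shapes and the PCA dimension are assumptions, not values from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

def l2n(x, eps=1e-10):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def build_codes(fc_descriptors, dim=128):
    """Compress (N, 4096) FC descriptors into short, L2-normalized retrieval codes."""
    pca = PCA(n_components=dim).fit(l2n(fc_descriptors))
    return l2n(pca.transform(l2n(fc_descriptors))), pca

def rank(query_code, database_codes):
    # Smaller Euclidean distance between unit-norm codes = more similar image.
    return np.argsort(np.linalg.norm(database_codes - query_code, axis=1))
```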

Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks." ICLR 2015.

139

Convolutional layers have shown better performance than fully connected ones.

OTS Retrieval Conv layers
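One simple way to use convolutional activations for retrieval is to max-pool the feature map over its spatial locations into a single global descriptor; a NumPy sketch assuming a (C, H, W) activation layout.

```python
import numpy as np

def pool_conv_map(feature_map, pooling="max"):
    """(C, H, W) conv activations -> L2-normalized (C,) global descriptor."""
    flat = feature_map.reshape(feature_map.shape[0], -1)
    desc = flat.max(axis=1) if pooling == "max" else flat.sum(axis=1)
    return desc / (np.linalg.norm(desc) + 1e-10)
```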

OTS Retrieval Conv layers

Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks." ICLR 2015.

140

Spatial search (extracting N local descriptors from predefined locations) increases retrieval performance at an additional computational cost.

Medium memory footprint.

Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks." ICLR 2015.

141

OTS Retrieval Conv layers
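A sketch of the spatial-search idea: each image is represented by N local descriptors computed on predefined sub-windows, and the distance between two images is the best match over all window pairs, which is where the extra computational cost and the medium memory footprint come from. `describe` stands for any descriptor function, e.g. the conv-pooling sketch above; the window grid is an assumption.

```python
import numpy as np

def predefined_windows(image, grid=2):
    """Yield the full image plus an overlapping grid x grid set of sub-windows."""
    h, w = image.shape[:2]
    yield image
    step_h, step_w = h // (grid + 1), w // (grid + 1)
    for i in range(grid):
        for j in range(grid):
            yield image[i * step_h: (i + 2) * step_h, j * step_w: (j + 2) * step_w]

def spatial_search_distance(query_image, db_image, describe):
    q = np.stack([describe(win) for win in predefined_windows(query_image)])  # N local descriptors
    d = np.stack([describe(win) for win in predefined_windows(db_image)])
    # Best (minimum) distance over all window pairs: better matching, but N^2 comparisons per image pair.
    return np.linalg.norm(q[:, None, :] - d[None, :, :], axis=-1).min()
```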

142

OTS Summarization

Bolaños, M., Mestre, R., Talavera, E., Giró-i-Nieto, X., and Radeva, P. "Visual Summary of Egocentric Photostreams by Representative Keyframes." In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX), Turin, Italy, 2015.

Clustering based on Euclidean distance over FC7 features from AlexNet
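A sketch of that summarization step: FC7 descriptors of the photostream are clustered with Euclidean k-means (used here as a simple stand-in for the paper's clustering pipeline) and the frame closest to each centroid is kept as the representative keyframe.

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_keyframes(fc7_features, n_clusters=5):
    """fc7_features: (N, 4096) AlexNet FC7 descriptors, one per photo -> sorted keyframe indices."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(fc7_features)
    keyframes = []
    for k in range(n_clusters):
        members = np.where(km.labels_ == k)[0]
        dists = np.linalg.norm(fc7_features[members] - km.cluster_centers_[k], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))  # photo closest to the cluster centroid
    return sorted(keyframes)
```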

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 104: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

104

Upsample + filter

2D map

96x96 2340=48x48

2 DENSE LAYERS

E2E Saliency JuntingNet

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 105: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

105

Upsample + filter

2D map

96x96 2340=48x48

E2E Saliency JuntingNet

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 106: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

106

E2E Saliency JuntingNet

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 107: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

107

Loss function Mean Square Error (MSE)

Weight initialization Gaussian distribution

Learning rate 003 to 00001

Mini batch size 128

Training time 7h (SALICON) 4h (iSUN)

Acceleration SGD+ nesterov momentum (09)

Regularisation Maxout norm

GPU NVidia GTX 980

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 108: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

108

Number of iterations (Training time)

Back-propagation with the Euclidean distance Training curve for the SALICON database

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 109: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

109

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

110

JuntingNetGround TruthPixels

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

111

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet iSUN

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

112

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

113

JuntingNetGround TruthPixels

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

114

Results from CVPR LSUN Challenge 2015

E2E Saliency JuntingNet SALICON

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

115

E2E Saliency JuntingNet

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016

116

Part I End-to-end learning (E2E)

Domain BFine-tuned

Learned Representation

Part Irsquo End-to-End Fine-Tuning (FT)

Part I End-to-end learning (E2E)

Domain ALearned Representation

Part I End-to-end learning (E2E)

Transfer

Outline for Part I Image Analytics

117

E2E Fine-tuning

Fine-tuning a pre-trained network

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 118: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

118

E2E Fine-tuning

Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)

Fine-tuning a pre-trained network

119

E2E Fine-tuning Sentiments

CNN

Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015

120

Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

E2E Fine-tuning Sentiments

True positive True negative False positive False negative

Visualizations with fully convolutional networks

121

E2E Fine-tuning Cultural events

ChaLearn Workshop

A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]

122

VGG + Fine tuned

E2E Fine-tuning Saliency prediction

123

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

From scratch

VGG + Fine tuned

124

E2E Fine-tuning Saliency prediction

Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016

125

Task A(eg image classification)

Learned Representation

Part I End-to-end learning (E2E)

Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)

Outline for Part I Image Analytics

126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Off-The-Shelf (OTS) Features

Intermediate features can be used as regular visual descriptors for any task

127

Off-The-Shelf (OTS) Features

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

128

OTS Classification Razavian

Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014

Pascal VOC 2007

129

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

ClassifierL2-normalization

Accuracy+5

Three representative architectures considered

AlexNet

ZF

OverFeat

5 days (fast)

3 weeks (slow)

NVIDIA GTX Titan GPU

130

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

F

C

131

OTS Classification Return of devilData augmentation

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

132

OTS Classification Return of devil

Fisher Kernels

(FK)

ConvNets (CNN)

Color

Gray Scale (GS)

Accuracy-25

133

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Dimensionality reduction by retraining the last layer to smaller sizes

Accuracy-2

Sizex32

134

OTS Classification Return of devil

Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al. "A baseline for visual instance retrieval with deep convolutional networks." ICLR 2015.
139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al. "A baseline for visual instance retrieval with deep convolutional networks." ICLR 2015.
140

Spatial Search (extracting N local descriptors from predefined locations) increases performance at a computational cost.

Medium memory footprints
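A hedged sketch of the spatial-search idea: each image is described by N local descriptors max-pooled at predefined locations of a convolutional feature map, and a pair of images is scored by its best-matching pair of locations (the grid size and pooling are illustrative assumptions, not the paper's exact protocol).

```python
# Hedged sketch: spatial search over local descriptors taken from a conv feature map.
import numpy as np

def local_descriptors(conv_map, grid=2):
    """Max-pool a conv feature map (C, H, W) over a grid x grid set of regions."""
    C, H, W = conv_map.shape
    descs = []
    for i in range(grid):
        for j in range(grid):
            region = conv_map[:, i*H//grid:(i+1)*H//grid, j*W//grid:(j+1)*W//grid]
            d = region.max(axis=(1, 2))               # C-dimensional max-pooled descriptor
            descs.append(d / (np.linalg.norm(d) + 1e-8))
    return np.stack(descs)                            # (grid*grid, C)

def spatial_search_distance(query_map, db_map):
    """Minimum descriptor distance over all query/database location pairs."""
    q, d = local_descriptors(query_map), local_descriptors(db_map)
    dists = np.linalg.norm(q[:, None, :] - d[None, :, :], axis=2)
    return dists.min()

dist = spatial_search_distance(np.random.rand(512, 14, 14), np.random.rand(512, 14, 14))
```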

Razavian et al. "A baseline for visual instance retrieval with deep convolutional networks." ICLR 2015.
141

OTS Retrieval Conv layers

142

OTS Summarization

Bolaños, M., Mestre, R., Talavera, E., Giró-i-Nieto, X., Radeva, P. "Visual Summary of Egocentric Photostreams by Representative Keyframes." In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX), Turin, Italy, 2015.

Clustering based on Euclidean distance over FC7 features from AlexNet
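A hedged sketch of that clustering step with scikit-learn: k-means (Euclidean distance) over FC7 descriptors, keeping the image closest to each cluster centre as a representative keyframe (the number of clusters is illustrative).

```python
# Hedged sketch: k-means over FC7 descriptors to pick representative keyframes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

fc7 = np.random.rand(500, 4096)                        # placeholder FC7 descriptors of a photostream
kmeans = KMeans(n_clusters=10, random_state=0).fit(fc7)

# Index of the image nearest to each centroid = representative keyframe
keyframe_idx = pairwise_distances_argmin(kmeans.cluster_centers_, fc7)
```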

143

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto


142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 135: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

Ranking

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599

135

OTS Retrieval

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 136: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

Oxford Buildings Inria Holidays UKB

136

OTS Retrieval

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 137: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

Pooled from the network from Krizhevsky et al pretrained with images from ImageNet

137

OTS Retrieval FC layers

Summary of the paper by Amaia Salvador on Bitsearch

Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 138: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)

138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014

OTS Retrieval FC layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 139: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139

Convolutional layers have shown better performance than fully connected ones

OTS Retrieval Conv layers

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 140: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

OTS Retrieval Conv layers

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140

Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 141: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

Medium memory footprints

Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141

OTS Retrieval Conv layers

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 142: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

142

OTS Summarization

Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015

Clustering based on Euclidean distance over FC7 features from AlexNet

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 143: Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016

143

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto