deep learning for computer vision (1/4): image analytics @ lasalle 2016
TRANSCRIPT
DocXavi
Deep Learning for Computer VisionImage Analytics 5 May 2016
Xavier Giroacute-i-Nieto
Master en Creacioacute Multimedia
2
Densely linked slides
3
IntroductionXavier Giro-i-Nieto
bull Web httpsimatgeupceduwebpeoplexavier-giroAssociate Professor at Universitat Politecnica de Catalunya (UPC)
4
Acknowledgments
5
Acknowledgments
One lecture organized in three parts
6
Images (global) Objects (local)
Deep ConvNets for Recognition for
Video (2D+T)
One lecture organized in three parts
7
Images (global) Objects (local)
Deep ConvNets for Recognition for
Video (2D+T)
Previously before deep learning
8Slide credit Jose M Agravelvarez
Dog
9Slide credit Jose M Agravelvarez
Dog
Learned Representation
Previously before deep learning
Outline for Part I Image Analytics
10
Dog
Learned Representation
Part I End-to-end learning (E2E)
11
Learned Representation
Part I End-to-end learning (E2E)
Task A(eg image classification)
Outline for Part I Image Analytics
12
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
13
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
E2E Classification Supervised learning
14
Manual Annotations Model
New ImageAutomatic
classification
Training
Test
Anchor
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
2
Densely linked slides
3
IntroductionXavier Giro-i-Nieto
bull Web httpsimatgeupceduwebpeoplexavier-giroAssociate Professor at Universitat Politecnica de Catalunya (UPC)
4
Acknowledgments
5
Acknowledgments
One lecture organized in three parts
6
Images (global) Objects (local)
Deep ConvNets for Recognition for
Video (2D+T)
One lecture organized in three parts
7
Images (global) Objects (local)
Deep ConvNets for Recognition for
Video (2D+T)
Previously before deep learning
8Slide credit Jose M Agravelvarez
Dog
9Slide credit Jose M Agravelvarez
Dog
Learned Representation
Previously before deep learning
Outline for Part I Image Analytics
10
Dog
Learned Representation
Part I End-to-end learning (E2E)
11
Learned Representation
Part I End-to-end learning (E2E)
Task A(eg image classification)
Outline for Part I Image Analytics
12
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
13
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
E2E Classification Supervised learning
14
Manual Annotations Model
New ImageAutomatic
classification
Training
Test
Anchor
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
3
IntroductionXavier Giro-i-Nieto
bull Web httpsimatgeupceduwebpeoplexavier-giroAssociate Professor at Universitat Politecnica de Catalunya (UPC)
4
Acknowledgments
5
Acknowledgments
One lecture organized in three parts
6
Images (global) Objects (local)
Deep ConvNets for Recognition for
Video (2D+T)
One lecture organized in three parts
7
Images (global) Objects (local)
Deep ConvNets for Recognition for
Video (2D+T)
Previously before deep learning
8Slide credit Jose M Agravelvarez
Dog
9Slide credit Jose M Agravelvarez
Dog
Learned Representation
Previously before deep learning
Outline for Part I Image Analytics
10
Dog
Learned Representation
Part I End-to-end learning (E2E)
11
Learned Representation
Part I End-to-end learning (E2E)
Task A(eg image classification)
Outline for Part I Image Analytics
12
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
13
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
E2E Classification Supervised learning
14
Manual Annotations Model
New ImageAutomatic
classification
Training
Test
Anchor
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
4
Acknowledgments
5
Acknowledgments
One lecture organized in three parts
6
Images (global) Objects (local)
Deep ConvNets for Recognition for
Video (2D+T)
One lecture organized in three parts
7
Images (global) Objects (local)
Deep ConvNets for Recognition for
Video (2D+T)
Previously before deep learning
8Slide credit Jose M Agravelvarez
Dog
9Slide credit Jose M Agravelvarez
Dog
Learned Representation
Previously before deep learning
Outline for Part I Image Analytics
10
Dog
Learned Representation
Part I End-to-end learning (E2E)
11
Learned Representation
Part I End-to-end learning (E2E)
Task A(eg image classification)
Outline for Part I Image Analytics
12
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
13
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
E2E Classification Supervised learning
14
Manual Annotations Model
New ImageAutomatic
classification
Training
Test
Anchor
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
5
Acknowledgments
One lecture organized in three parts
6
Images (global) Objects (local)
Deep ConvNets for Recognition for
Video (2D+T)
One lecture organized in three parts
7
Images (global) Objects (local)
Deep ConvNets for Recognition for
Video (2D+T)
Previously before deep learning
8Slide credit Jose M Agravelvarez
Dog
9Slide credit Jose M Agravelvarez
Dog
Learned Representation
Previously before deep learning
Outline for Part I Image Analytics
10
Dog
Learned Representation
Part I End-to-end learning (E2E)
11
Learned Representation
Part I End-to-end learning (E2E)
Task A(eg image classification)
Outline for Part I Image Analytics
12
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
13
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
E2E Classification Supervised learning
14
Manual Annotations Model
New ImageAutomatic
classification
Training
Test
Anchor
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
One lecture organized in three parts
6
Images (global) Objects (local)
Deep ConvNets for Recognition for
Video (2D+T)
One lecture organized in three parts
7
Images (global) Objects (local)
Deep ConvNets for Recognition for
Video (2D+T)
Previously before deep learning
8Slide credit Jose M Agravelvarez
Dog
9Slide credit Jose M Agravelvarez
Dog
Learned Representation
Previously before deep learning
Outline for Part I Image Analytics
10
Dog
Learned Representation
Part I End-to-end learning (E2E)
11
Learned Representation
Part I End-to-end learning (E2E)
Task A(eg image classification)
Outline for Part I Image Analytics
12
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
13
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
E2E Classification Supervised learning
14
Manual Annotations Model
New ImageAutomatic
classification
Training
Test
Anchor
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
One lecture organized in three parts
7
Images (global) Objects (local)
Deep ConvNets for Recognition for
Video (2D+T)
Previously before deep learning
8Slide credit Jose M Agravelvarez
Dog
9Slide credit Jose M Agravelvarez
Dog
Learned Representation
Previously before deep learning
Outline for Part I Image Analytics
10
Dog
Learned Representation
Part I End-to-end learning (E2E)
11
Learned Representation
Part I End-to-end learning (E2E)
Task A(eg image classification)
Outline for Part I Image Analytics
12
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
13
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
E2E Classification Supervised learning
14
Manual Annotations Model
New ImageAutomatic
classification
Training
Test
Anchor
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Previously before deep learning
8Slide credit Jose M Agravelvarez
Dog
9Slide credit Jose M Agravelvarez
Dog
Learned Representation
Previously before deep learning
Outline for Part I Image Analytics
10
Dog
Learned Representation
Part I End-to-end learning (E2E)
11
Learned Representation
Part I End-to-end learning (E2E)
Task A(eg image classification)
Outline for Part I Image Analytics
12
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
13
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
E2E Classification Supervised learning
14
Manual Annotations Model
New ImageAutomatic
classification
Training
Test
Anchor
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
9Slide credit Jose M Agravelvarez
Dog
Learned Representation
Previously before deep learning
Outline for Part I Image Analytics
10
Dog
Learned Representation
Part I End-to-end learning (E2E)
11
Learned Representation
Part I End-to-end learning (E2E)
Task A(eg image classification)
Outline for Part I Image Analytics
12
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
13
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
E2E Classification Supervised learning
14
Manual Annotations Model
New ImageAutomatic
classification
Training
Test
Anchor
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Outline for Part I Image Analytics
10
Dog
Learned Representation
Part I End-to-end learning (E2E)
11
Learned Representation
Part I End-to-end learning (E2E)
Task A(eg image classification)
Outline for Part I Image Analytics
12
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
13
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
E2E Classification Supervised learning
14
Manual Annotations Model
New ImageAutomatic
classification
Training
Test
Anchor
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
11
Learned Representation
Part I End-to-end learning (E2E)
Task A(eg image classification)
Outline for Part I Image Analytics
12
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
13
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
E2E Classification Supervised learning
14
Manual Annotations Model
New ImageAutomatic
classification
Training
Test
Anchor
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
12
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
13
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
E2E Classification Supervised learning
14
Manual Annotations Model
New ImageAutomatic
classification
Training
Test
Anchor
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
13
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-the-shelf features
Outline for Part I Image Analytics
E2E Classification Supervised learning
14
Manual Annotations Model
New ImageAutomatic
classification
Training
Test
Anchor
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification Supervised learning
14
Manual Annotations Model
New ImageAutomatic
classification
Training
Test
Anchor
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification LeNet-5
15
LeCun Y Bottou L Bengio Y amp Haffner P (1998) Gradient-based learning applied to document recognition Proceedings of the IEEE 86(11) 2278-2324
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification LeNet-5
16
Demo 3D Visualization of a Convolutional Neural Network
Harley Adam W An Interactive Node-Link Visualization of Convolutional Neural Networks In Advances in Visual Computing pp 867-877 Springer International Publishing 2015
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification Similar to LeNet-5
17
Demo Classify MNIST digits with a Convolutional Neural Network
ldquoConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser Open a tab and youre training No software requirements no compilers no installations no GPUs no sweatrdquo
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification Databases
18
Li Fei-Fei ldquoHow wersquore teaching computers to understand picturesrdquo TEDTalks 2014
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
19
E2E Classification Databases
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
20
Zhou Bolei Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva Learning deep features for scene recognition using places database In Advances in neural information processing systems pp 487-495 2014 [web]
E2E Classification Databases
205 scene classes (categories) Images
25M train 205k validation 41k test
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
21
E2E Classification ImageNet ILSRVC
1000 object classes (categories) Images
12 M train 100k test
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification ImageNet ILSRVC
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Predict 5 classes
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Slide credit Rob Fergus (NYU)
Image Classifcation 2012
-98
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2014) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web] 23
E2E Classification ILSRVC
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification AlexNet (Supervision)
24Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
Orange
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
25Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
26Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
27Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
28Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
29Image credit Deep learning Tutorial (Stanford University)
E2E Classification AlexNet (Supervision)
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
30
RectifiedLinearUnit (non-linearity)
f(x) = max(0x)
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
31
Dot Product
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB-UPC 2015)
E2E Classification AlexNet (Supervision)
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Slide credit Rob Fergus (NYU)
32
E2E Classification ImageNet ILSRVC
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
The development of better convnets is reduced to trial-and-error
33
E2E Classification Visualize ZF
Visualization can help in proposing better architectures
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
ldquoA convnet model that uses the same components (filtering pooling) but in reverse so instead of mapping pixels to features does the oppositerdquo
Zeiler Matthew D Graham W Taylor and Rob Fergus Adaptive deconvolutional networks for mid and high level feature learning Computer Vision (ICCV) 2011 IEEE International Conference on IEEE 2011
34
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing
Dec
onvN
etC
onvN
et
35
E2E Classification Visualize ZF
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
36
E2E Classification Visualize ZF
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
37
E2E Classification Visualize ZF
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
38
E2E Classification Visualize ZF
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
ldquoTo examine a given convnet activation we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layerrdquo
39
E2E Classification Visualize ZF
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
40
E2E Classification Visualize ZF
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
ldquo(i) Unpool In the convnet the max pooling operation is non-invertible however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variablesrdquo
41
E2E Classification Visualize ZF
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
XX
ldquo(ii) Rectification The convnet uses ReLU non-linearities which rectify the feature maps thus ensuring the feature maps are always positiverdquo
42
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
43
E2E Classification Visualize ZF
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
ldquo(iii) Filtering The convnet uses learned filters to convolve the feature maps from the previous layer To approximately invert this the deconvnet uses transposed versions of the same filters (as other autoencoder models such as RBMs) but applied to the rectified maps not the output of the layer beneath In practice this means flipping each filter vertically and horizontally
XX
XX
44
E2E Classification Visualize ZF
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
45
Top 9 activations in a random subset of feature maps across the validation data projected down to pixel space
using our deconvolutional network approachCorresponding image patches
E2E Classification Visualize ZF
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
46
E2E Classification Visualize ZF
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
47
E2E Classification Visualize ZF
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
48
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 49
E2E Classification Visualize ZF
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Zeiler M D amp Fergus R (2014) Visualizing and understanding convolutional networks In Computer VisionndashECCV 2014 (pp 818-833) Springer International Publishing 50
E2E Classification Visualize ZF
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
51
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer ldquodead features
AlexNet (Layer 1) Clarifai (Layer 1)
E2E Classification Visualize ZF
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
52
Cleaner features in Clarifai without the aliasing artifacts caused by the stride 4 used in AlexNet
AlexNet (Layer 2) Clarifai (Layer 2)
E2E Classification Visualize ZF
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
53
Regularization with dropout
Reduction of overfitting by setting to zero the output of a portion (typically 50) of each intermediate neuron
Hinton G E Srivastava N Krizhevsky A Sutskever I amp Salakhutdinov R R (2012) Improving neural networks by preventing co-adaptation of feature detectors arXiv preprint arXiv12070580
Chicago
E2E Classification Dropout ZF
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
54
E2E Classification Visualize ZF
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
55
E2E Classification Ensembles ZF
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
56
E2E Classification ImageNet ILSRVC
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
ImageNet Classification 2013
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
-5
Slide credit Rob Fergus (NYU)
57
E2E Classification ImageNet ILSRVC
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification
58
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification GoogLeNet
59Movie Inception (2010)
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification GoogLeNet
60
22 layers but 12 times fewer parameters than AlexNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification GoogLeNet
61
Challenges of going deeper Overfitting due to the increase amount of parameters Inefficient computation if most weights end up close to zero
Solution
Sparsity
How
Inception modules
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification GoogLeNet
62
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification GoogLeNet
63Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification GoogLeNet
64Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification GoogLeNet (NiN)
65
3x3 and 5x5 convolutions deal with different scales
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
66
1x1 convolutions does dimensionality reduction (c3ltc2) and accounts for rectified linear units (ReLU)
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
67
In NiN the Cascaded 1x1 Convolutions compute reductions after the convolutions
Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014 [Slides]
E2E Classification GoogLeNet (NiN)
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification GoogLeNet
68
In GoogLeNet the Cascaded 1x1 Convolutions compute reductions before the expensive 3x3 and 5x5 convolutions
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification GoogLeNet
69Lin Min Qiang Chen and Shuicheng Yan Network in network ICLR 2014
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification GoogLeNet
70
3x3 max pooling introduces somewhat spatial invariance and has proven a benefitial effect by adding an alternative parallel path
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification GoogLeNet
71
Two Softmax Classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time
and no fully connected layers needed
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification GoogLeNet
72
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification GoogLeNet
73NVIDIA ldquoNVIDIA and IBM CLoud Support ImageNet Large Scale Visual Recognition Challengerdquo (2015)
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification GoogLeNet
74Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions CVPR 2015 [video] [slides] [poster]
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification VGG
75Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification VGG
76Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification VGG 3x3 Stacks
77
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification VGG
78
Simonyan Karen and Andrew Zisserman Very deep convolutional networks for large-scale image recognition International Conference on Learning Representations (2015) [video] [slides] [project]
No poolings between some convolutional layers Convolution strides of 1 (no skipping)
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification
79
36 top 5 errorhellipwith 152 layers
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification ResNet
80He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification ResNet
81
Deeper networks (34 is deeper than 18) are more difficult to train
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
Thin curves training errorBold curves validation error
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification ResNet
82
Residual learning reformulate the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification ResNet
83He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Deep Residual Learning for Image Recognition arXiv preprint arXiv151203385 (2015) [slides]
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
84
E2E Classification Humans
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
85
E2E Classification Humans
ldquoIs this a Border terrierrdquo
Crowdsourcing
Yes
No
Binary ground truth annotation from the crowd
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
86
Annotation Problems
Carlier Axel Amaia Salvador Ferran Cabezas Xavier Giro-i-Nieto Vincent Charvillat and Oge Marques Assessment of crowdsourcing and gamification loss in user-assisted object segmentation Multimedia Tools and Applications (2015) 1-28
Russakovsky O Deng J Su H Krause J Satheesh S Ma S amp Fei-Fei L (2015) Imagenet large scale visual recognition challenge arXiv preprint arXiv14090575 [web]
Crowdsource loss (03) More than 5 objects classes
E2E Classification Humans
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
87Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
E2E Classification Humans
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
88Andrej Karpathy ldquoWhat I learned from competing against a computer on ImageNetrdquo (2014)
Test data collection from one human [interface]
ldquoAww a cute dog Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is rdquo
E2E Classification Humans
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
E2E Classification Humans
89NVIDIA ldquoMochajl Deep Learning for Juliardquo (2015)
ResNet
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
9090
Letrsquos play a game
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
91
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
92
What have you seen
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
93
Tower
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
94
Tower House
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
95
Tower House
Rocks
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
96
E2E Saliency
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
97
Eye Tracker Mouse Click
Slide credit Junting Pan ldquoVisual Saliency Prediction using Deep Learning Techniquesrdquo (ETSETB 2015)
E2E Saliency
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
98
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
99
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
100
TRAIN VALIDATION TEST
10000 5000 5000
6000 926 2000
CAT2000 [Borjirsquo15] 2000 - 2000
MIT300 [Juddrsquo12] 300 - -LargeScale
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
101
E2E Saliency JuntingNet
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
102
Upsample + filter
2D map
96x96 2340=48x48
IMAGE INPUT(RGB)
E2E Saliency JuntingNet
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
103
Upsample + filter
2D map
96x96 2340=48x48
3 CONV LAYERS
E2E Saliency JuntingNet
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
104
Upsample + filter
2D map
96x96 2340=48x48
2 DENSE LAYERS
E2E Saliency JuntingNet
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
105
Upsample + filter
2D map
96x96 2340=48x48
E2E Saliency JuntingNet
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
106
E2E Saliency JuntingNet
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
107
Loss function Mean Square Error (MSE)
Weight initialization Gaussian distribution
Learning rate 003 to 00001
Mini batch size 128
Training time 7h (SALICON) 4h (iSUN)
Acceleration SGD+ nesterov momentum (09)
Regularisation Maxout norm
GPU NVidia GTX 980
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
108
Number of iterations (Training time)
Back-propagation with the Euclidean distance Training curve for the SALICON database
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
109
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
110
JuntingNetGround TruthPixels
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
111
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet iSUN
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
112
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
113
JuntingNetGround TruthPixels
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
114
Results from CVPR LSUN Challenge 2015
E2E Saliency JuntingNet SALICON
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
115
E2E Saliency JuntingNet
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction CVPR 2016
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
116
Part I End-to-end learning (E2E)
Domain BFine-tuned
Learned Representation
Part Irsquo End-to-End Fine-Tuning (FT)
Part I End-to-end learning (E2E)
Domain ALearned Representation
Part I End-to-end learning (E2E)
Transfer
Outline for Part I Image Analytics
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
117
E2E Fine-tuning
Fine-tuning a pre-trained network
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
118
E2E Fine-tuning
Slide credit Victor Campos ldquoLayer-wise CNN surgery for Visual Sentiment Predictionrdquo (ETSETB 2015)
Fine-tuning a pre-trained network
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
119
E2E Fine-tuning Sentiments
CNN
Campos Victor Amaia Salvador Xavier Giro-i-Nieto and Brendan Jou Diving Deep into Sentiment Understanding Fine-tuned CNNs for Visual Sentiment Prediction In Proceedings of the 1st International Workshop on Affect amp Sentiment in Multimedia pp 57-62 ACM 2015
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
120
Campos Victor Xavier Giro-i-Nieto and Brendan Jou ldquoFrom pixels to sentimentsrdquo (Submitted)
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
E2E Fine-tuning Sentiments
True positive True negative False positive False negative
Visualizations with fully convolutional networks
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
121
E2E Fine-tuning Cultural events
ChaLearn Workshop
A Salvador Zeppelzauer M Manchon-Vizuete D Calafell-Oroacutes A and Giroacute-i-Nieto X ldquoCultural Event Recognition with Visual ConvNets and Temporal Modelsrdquo in CVPR ChaLearn Looking at People Workshop 2015 2015 [slides]
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
122
VGG + Fine tuned
E2E Fine-tuning Saliency prediction
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
123
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
From scratch
VGG + Fine tuned
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
124
E2E Fine-tuning Saliency prediction
Junting Pan Kevin McGuinness Elisa Sayrol Noel OConnor and Xavier Giro-i-Nieto Shallow and Deep Convolutional Networks for Saliency Prediction In Proceedings of the IEEE International Conference on Computer Vision 2016
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
125
Task A(eg image classification)
Learned Representation
Part I End-to-end learning (E2E)
Task B(eg image retrieval)Part II Off-The-Shelf features (OTS)
Outline for Part I Image Analytics
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
126Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Off-The-Shelf (OTS) Features
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Intermediate features can be used as regular visual descriptors for any task
127
Off-The-Shelf (OTS) Features
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
128
OTS Classification Razavian
Razavian Ali Hossein Azizpour Josephine Sullivan and Stefan Carlsson CNN features off-the-shelf an astounding baseline for recognition CVPRW 2014
Pascal VOC 2007
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
129
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
ClassifierL2-normalization
Accuracy+5
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Three representative architectures considered
AlexNet
ZF
OverFeat
5 days (fast)
3 weeks (slow)
NVIDIA GTX Titan GPU
130
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
F
C
131
OTS Classification Return of devilData augmentation
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
132
OTS Classification Return of devil
Fisher Kernels
(FK)
ConvNets (CNN)
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Color
Gray Scale (GS)
Accuracy-25
133
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Dimensionality reduction by retraining the last layer to smaller sizes
Accuracy-2
Sizex32
134
OTS Classification Return of devil
Chatfield K Simonyan K Vedaldi A and Zisserman A Return of the devil in the details Delving deep into convolutional nets BMVC 2014
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Ranking
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014 Springer International Publishing 2014 584-599
135
OTS Retrieval
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Oxford Buildings Inria Holidays UKB
136
OTS Retrieval
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Pooled from the network from Krizhevsky et al pretrained with images from ImageNet
137
OTS Retrieval FC layers
Summary of the paper by Amaia Salvador on Bitsearch
Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Off-the-shelf CNN descriptors from fully connected layers show useful but not superior (wrt FV VLAD Sparse Coding)
138Babenko Artem et al Neural codes for image retrieval Computer VisionndashECCV 2014
OTS Retrieval FC layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015139
Convolutional layers have shown better performance than fully connected ones
OTS Retrieval Conv layers
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
OTS Retrieval Conv layers
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015140
Spatial Search (extract N local descriptor from predefined locations) increases performance at computational cost
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Medium memory footprints
Razavian et al A baseline for visual instance retrieval with deep convolutional networks ICLR 2015141
OTS Retrieval Conv layers
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
142
OTS Summarization
Bolantildeos M Mestre R Talavera E Giroacute-i-Nieto X Radeva P Visual Summary of Egocentric Photostreams by Representative Keyframes In IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015 Turin Italy 2015
Clustering based on Euclidean distance over FC7 features from AlexNet
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
143
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto