l13 modern cnns training - cc.gatech.edu · case studies •there are several generations of...

CS 4803 / 7643: Deep Learning

Zsolt KiraGeorgia Tech

Topics: – (Finish) Modern CNNs– Detection & Segmentation architectures

Administrivia• PS2/HW2 out today!

– Due: 03/03, 11:55pm– Less involved than PS1/HW1

• Please send project proposal!– Get feedback before next week

(C) Dhruv Batra and Zsolt Kira 2

Last Time

Handwriting Recognition Example

Translation Invariance

Some Rotation Invariance

Some Scale Invariance

Case Studies• There are several generations of ConvNets

– 2012 – 2014: AlexNet, ZNet, VGGNet• Conv-Relu, Pooling, Fully connected, Softmax• Deeper ones (VGGNet) tend to do better

– 2014• Fully-convolutional networks for semantic segmentation• Matrix outputs rather than just one probability distribution

– 2014-2016• Fully-convolutional networks for classification• Less parameters, faster than comparable Gen1 networks• GoogleNet, ResNet

– 2014-2016• Detection layers (proposals)• Caption generation (combine with RNNs for language)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

AlexNet: 60M params

ZNet: 75M

VGG: 138M

GoogleNet: 5M

Importance of Depth

• After a while, adding depth decreases performance

• At first, vanishing/exploding gradients

• normalized initialization• Batch normalization• 2nd order methods

• Then, optimization limitation– Deeper network should

be able to mimic shallow ones

Localization and Detection

Class ScoresCat: 0.9Dog: 0.05Car: 0.01...

So far: Image Classification

This image is CC0 public domain Vector:4096

Fully-Connected:4096 to 1000

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Computer Vision Tasks

Classification + Localization

CLS - ImageNet

Idea 1: Localization as Regression

Per-Class vs. Class Agnostic

Where to attach?

Multiple Objects

Human Pose Estimation

Sliding Window: Overfeat

Why aren’t boxes across grid?Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Detection as Classification

R-CNN Training

Datasets

Evaluation Metric

R-CNN Results

Region of Interest (ROI) Pooling

Fast R-CNN Results

Slide Credit: Yacov Hel-Or

https://www.youtube.com/watch?v=KYNDzlcQMWA

Joint Coordinates

Also used for pose estimation

Semantic Segmentation

Label each pixel in the image with a category label

Don’t differentiate instances, only care about pixels

This image is CC0 public domain

Semantic Segmentation Idea: Sliding Window

Full image

Extract patchClassify center pixel with CNN

Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014

Semantic Segmentation Idea: Sliding Window

Full image

Extract patchClassify center pixel with CNN

GrassProblem: Very inefficient! Not reusing shared features between overlapping patches Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013

Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014

Semantic Segmentation Idea: Fully Convolutional

Input:3 x H x W

Convolutions:D x H x W

Conv Conv Conv Conv

Scores:C x H x W

argmax

Predictions:H x W

Design a network as a bunch of convolutional layers to make predictions for pixels all at once!

Input:3 x H x W

Convolutions:D x H x W

Conv Conv Conv Conv

Scores:C x H x W

argmax

Predictions:H x W

Design a network as a bunch of convolutional layers to make predictions for pixels all at once!

Problem: convolutions at original image resolution will be very expensive ...

Dilated Convolutions

(C) Dhruv Batra 98Figure Credit: Dumoulin and Visin, https://arxiv.org/pdf/1603.07285.pdf

Dilated Convolutions (Examples)

(C) Dhruv Batra 99

(C) Dhruv Batra 100Figure Credit: Yu and Koltun, ICLR16

Different Idea: Still Classify, Change Test Time

• Can run on an image of any size!

(C) Dhruv Batra 101Figure Credit: [Long, Shelhamer, Darrell CVPR15]

Better Semantic Segmentation Idea: Fully Convolutional with Bottleneck

Input:3 x H x W Predictions:

Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

High-res:D1 x H/2 x W/2

Med-res:D2 x H/4 x W/4

Low-res:D3 x H/4 x W/4

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

Downsampling:Pooling, strided convolution

Upsampling:???

In-Network upsampling: “Unpooling”

Input: 2 x 2 Output: 4 x 4

1 1 2 2

3 3 4 4

Nearest Neighbor

1 0 2 0

0 0 0 0

3 0 4 0

0 0 0 0

“Bed of Nails”

In-Network upsampling: “Max Unpooling”

Input: 4 x 4

1 2 6 3

3 5 2 1

1 2 2 1

7 3 4 8

0 0 2 0

0 1 0 0

0 0 0 0

3 0 0 4

Max UnpoolingUse positions from pooling layer

Max PoolingRemember which element was max!

… Rest of the network

Output: 2 x 2

Corresponding pairs of downsampling and upsampling layers

Learnable Upsampling: Transpose Convolution

Recall:Typical 3 x 3 convolution, stride 1 pad 1

Recall: Normal 3 x 3 convolution, stride 1 pad 1

Dot product between filter and input

Filter moves 2 pixels in the input for every one pixel in the output

Stride gives ratio between movement in input and output

3 x 3 transpose convolution, stride 2 pad 1

Input gives weight for filter

Sum where output overlaps

Filter moves 2 pixels in the output for every one pixel in the input

Stride gives ratio between movement in output and input

Sum where output overlaps

Filter moves 2 pixels in the output for every one pixel in the input

Stride gives ratio between movement in output and input

Other names:-Deconvolution (bad)-Upconvolution-Fractionally stridedconvolution-Backward stridedconvolution

Transpose Convolution: 1D Example

az + bx

Input FilterOutput

Output contains copies of the filter weighted by the input, summing at where at overlaps in the output

Need to crop one pixel from output to make output exactly 2x input

Transposed Convolution• https://distill.pub/2016/deconv-checkerboard/

(C) Dhruv Batra 117

Downsampling:Pooling, strided convolution

Upsampling:Unpooling or strided transpose convolution

skip layers

interp + sum

dense output119

end-to-end, joint learningof semantics and location

Slide credit: Jonathan Long

Next Time• Guest lecture by Zhile Ren (Dhruv Batra post-doc) on 3D

l13 modern cnns training - cc.gatech.edu · case studies •there are several generations of...

Documents

matlab expo 2018 munich...best documentation ever! $$$...

convolutional neural networks: a comprehensive survey ·...

weakly supervised object detection (wsod) · diba cvpr 2017...

xbee™ znet 2.5/xbee-pro™ znet 2.5 oem rf modules...

g2denet: global gaussian distribution embedding network...

alexnet -...

a quick introduction to znet technologies pvt. ltd

visualizing and comparing alexnet and vgg using ... ›...

lecture 9: cnn architectures - artificial...

1 autozone znet manual - weeblyaustinnichols.weebly.com ›...

[商品カタログ]znet town...

vapor-phase metalation by atomic layer deposition in a...

xbee znet 2.5 (formerly series 2) - 1 mw, wire antenna

4. xbee znet 2.5 networks

brain tumor and intracranial haemorrhage feature...

ml with tensorflow-new - github pages · fei-fei li &...

real time programme znet android for smartphone › docs ›...

benchmarking deep learning frameworks for the ... ·...

the alexnet-resnet-inception network for classifying fruit...

post-event report... · rigalo roshan saltwater cables...