
Convolutional Neural Networks (CNN)

Prof. Seungchul Lee

Industrial AI Lab.

Convolution


• Integral (or sum) of the product of the two signals after one is reversed and shifted

• Cross correlation and convolution


1D Convolution

• (actually cross-correlation)

Source: Dr. Francois Fleuret at EPFL

Input (length W = 10):   1 3 2 3 0 -1 1 2 2 1

Kernel (length w = 4):   1 3 0 -1

Output (length L = W - w + 1 = 7):   7 9 12 2 -5 0 6

The kernel slides across the input one position at a time; each output value is the dot product of the kernel with the input window beneath it.
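The sliding dot product above can be sketched in a few lines of NumPy (`corr1d` is a hypothetical helper name, not from the lecture):

```python
import numpy as np

def corr1d(x, k):
    """Valid 1D cross-correlation: slide the kernel over the input
    and take a dot product at each position (L = W - w + 1 outputs)."""
    W, w = len(x), len(k)
    return np.array([np.dot(x[i:i + w], k) for i in range(W - w + 1)])

x = np.array([1, 3, 2, 3, 0, -1, 1, 2, 2, 1])  # input, W = 10
k = np.array([1, 3, 0, -1])                    # kernel, w = 4

print(corr1d(x, k))  # [ 7  9 12  2 -5  0  6], length L = 7
```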

Example: 1D Convolution


De-noising a Piecewise Smooth Signal

• Moving average (MA) filter
  – A moving average is the unweighted mean of the previous m data points

• Convolution with the kernel [1/m, 1/m, ⋯, 1/m]

• Low-pass filter in the time domain
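A minimal sketch of MA de-noising, assuming a synthetic piecewise-constant step signal with added Gaussian noise (the signal and noise level here are illustrative, not from the slides):

```python
import numpy as np

# Moving-average de-noising: convolve a noisy piecewise-constant signal
# with the length-m kernel [1/m, ..., 1/m] (a low-pass filter).
m = 5
kernel = np.ones(m) / m

rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(50), np.ones(50)])   # piecewise smooth step
noisy = clean + 0.1 * rng.standard_normal(clean.size)

smoothed = np.convolve(noisy, kernel, mode="same")

# Away from the step, the residual noise variance drops (roughly by a factor of m)
print(np.var((noisy - clean)[10:40]), ">", np.var((smoothed - clean)[10:40]))
```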


De-noising a Piecewise Smooth Signal


Edge Detection


Smoothing and Detection of Abrupt Changes


Images


Images Are Numbers

Source: 6.S191 Intro. to Deep Learning at MIT

Images

Original image (R, G, B channels) → Gray image
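Converting a color image to gray is a weighted sum of the three channels; a common convention (assumed here, the Rec. 601 luminance weights) is the following sketch:

```python
import numpy as np

# A color image is an H x W x 3 array of numbers; a gray image is H x W.
def rgb_to_gray(img):
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b  # weights sum to 1

img = np.zeros((2, 2, 3))
img[0, 0] = [1.0, 1.0, 1.0]   # one white pixel

gray = rgb_to_gray(img)
print(gray[0, 0])  # 1.0 -- white stays white since the weights sum to 1
```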

2D Convolution


Convolution on Image (= Convolution in 2D)

• Filter (or Kernel)
  – Discrete convolution can be viewed as element-wise multiplication by a matrix
  – Modify or enhance an image by filtering
  – Filter images to emphasize certain features or remove other features
  – Filtering includes smoothing, sharpening, and edge enhancement


Image Kernel Output

Convolution on Image (= Convolution in 2D)


Convolution on Image


Image · Kernel · Output

Example kernels:

1 0 1        1 1 1
1 0 1        0 0 0
1 0 1        1 1 1
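A 2D convolution on an image works the same way as the 1D case, just sliding the kernel over rows and columns; a minimal sketch using the first example kernel above (`corr2d` and the small binary image are illustrative):

```python
import numpy as np

def corr2d(image, kernel):
    """Valid 2D cross-correlation: at each position, multiply the kernel
    element-wise with the underlying image patch and sum."""
    H, W = image.shape
    h, w = kernel.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

kernel = np.array([[1, 0, 1],
                   [1, 0, 1],
                   [1, 0, 1]])

image = np.array([[1, 1, 1, 0, 0],   # a small binary image (hypothetical)
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

print(corr2d(image, kernel))  # 3 x 3 output: (5-3+1) x (5-3+1)
```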

Convolution on Image


Gaussian Filter: Blurring


How to Find the Right Kernels

• We have seen many different kernels, each producing a specific effect on an image

• Let's apply the opposite approach

• We are not designing the kernel, but learning the kernel from data

• A feature extractor can be learned from data using a deep learning framework


Learning Visual Features


Convolutional Neural Networks (CNN)

• Motivation
  – A bird occupies a local area and looks the same in different parts of an image. We should construct neural networks that exploit these properties.


ANN Structure for Object Detection in Image

• Does not seem like the best approach

• Does not make use of the fact that we are dealing with images

Fully Connected Neural Network

• Input
  – 2D image
  – Vector of pixel values

• Fully connected
  – Connects each neuron in the hidden layer to all neurons in the input layer
  – No spatial information
  – The spatial organization of the input is destroyed by flattening
  – And many, many parameters!

• How can we use spatial structure in the input to inform the architecture of the network?

Source: 6.S191 Intro. to Deep Learning at MIT

Convolution Mask + Neural Network


Locality

• Locality: objects tend to have a local spatial support

  – fully-connected layer → locally-connected layer

We are not designing the kernel, but learning the kernel from data → learning the feature extractor from data

ω1 ω2 ω3
ω4 ω5 ω6
ω7 ω8 ω9

Deep Artificial Neural Networks

• Universal function approximator
  – Simple nonlinear neurons
  – Linearly connected networks

• Hidden layers
  – Autonomous feature learning

Convolutional Neural Networks

• Structure
  – Weight sharing
  – Local connectivity
  – Typically sparse interactions

• Optimization
  – Smaller search space

Multiple Filters (or Kernels)


Channels

• Colored image = tensor of shape (height, width, channels)

• Convolutions are usually computed for each channel and summed

• Kernel size, a.k.a. receptive field (usually 1, 3, 5, 7, or 11)

Source: Dr. Francois Fleuret at EPFL

Multi-channel 2D Convolution

Source: Dr. Francois Fleuret at EPFL

Input: W × H × C    Kernel: w × h × c (with c = C)    Output: (W - w + 1) × (H - h + 1) × 1

The kernel covers all C input channels at once and slides only across rows and columns, producing a single output channel.

Multi-channel and Multi-kernel 2D Convolution

Source: Dr. Francois Fleuret at EPFL

Input: W × H × C    Kernels: D kernels, each w × h × c    Output: (W - w + 1) × (H - h + 1) × D
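The multi-channel, multi-kernel operation can be sketched directly in NumPy (stride 1, no padding; `conv_layer` is a hypothetical helper, written as loops for clarity rather than speed):

```python
import numpy as np

# Multi-channel, multi-kernel 2D convolution (valid, stride 1):
# input (H, W, C), D kernels of shape (h, w, C) -> output (H-h+1, W-w+1, D).
def conv_layer(x, kernels, bias):
    H, W, C = x.shape
    D, h, w, _ = kernels.shape
    out = np.zeros((H - h + 1, W - w + 1, D))
    for d in range(D):
        for i in range(H - h + 1):
            for j in range(W - w + 1):
                # sum over the h x w window AND over all C channels
                out[i, j, d] = np.sum(x[i:i + h, j:j + w, :] * kernels[d]) + bias[d]
    return out

x = np.random.default_rng(0).standard_normal((5, 5, 3))   # H = W = 5, C = 3
kernels = np.ones((4, 3, 3, 3))                           # D = 4 kernels of 3 x 3 x 3
bias = np.zeros(4)

print(conv_layer(x, kernels, bias).shape)  # (3, 3, 4) = (H-h+1, W-w+1, D)
```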

Dealing with Shapes

• Activations or feature maps shape
  – Input: (Wi, Hi, C)
  – Output: (Wo, Ho, D)

• Kernel (or filter) shape: (w, h, C, D)
  – w × h : kernel size
  – C : input channels
  – D : output channels

• Number of parameters: (w × h × C + 1) × D, where the +1 per output channel is the bias

Source: Dr. Francois Fleuret at EPFL
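The parameter count above is simple arithmetic; a quick sketch (the 3×3 kernel / 64-filter example is illustrative):

```python
# Number of parameters of a conv layer: (w * h * C + 1) * D,
# where the +1 per output channel is the bias term.
def conv_params(w, h, C, D):
    return (w * h * C + 1) * D

# e.g. a 3x3 kernel over a C = 3 input producing D = 64 feature maps:
print(conv_params(3, 3, 3, 64))  # (9 * 3 + 1) * 64 = 1792
```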

Multi-channel 2D Convolution

• The kernel is not swept across channels, just across rows and columns

• Note that a convolution preserves the signal's support structure
  – A 1D signal is converted into a 1D signal, a 2D signal into a 2D signal, and neighboring parts of the input signal influence neighboring parts of the output signal

• We usually refer to one of the channels generated by a convolution layer as an activation map

• The sub-area of an input map that influences a component of the output is called the receptive field of the latter


Padding and Stride


Strides

• Strides: increment step size for the convolution operator

• Reduces the size of the output map


Example with kernel size 3×3 and a stride of 2 (image in blue)

Source: https://github.com/vdumoulin/conv_arithmetic

Padding

• Padding: artificially fill the borders of the image

• Useful to keep the spatial dimension constant across filters

• Useful with strides and large receptive fields

• Usually filled with 0s

Source: https://github.com/vdumoulin/conv_arithmetic
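The effect of padding and stride on the output size follows one standard formula (assumed here; `out_size` is a hypothetical helper):

```python
# Output spatial size for input width W, kernel w, padding p, stride s
# (floor division handles windows that don't fit exactly).
def out_size(W, w, p, s):
    return (W - w + 2 * p) // s + 1

print(out_size(5, 3, 1, 2))   # 3: a 5x5 input, pad 1, stride 2, 3x3 kernel
print(out_size(5, 3, 1, 1))   # 5: pad 1 keeps a 5x5 input at 5x5 for a 3x3 kernel
print(out_size(10, 4, 0, 1))  # 7: the 1D example, L = W - w + 1
```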

Padding and Stride

• Here with 5 × 5 × C as input, a padding of (1, 1), a stride of (2, 2), and a kernel of size 3 × 3 × C

Source: Dr. Francois Fleuret at EPFL

Input → Output: the padded 7 × 7 input is scanned with a stride of 2, producing a 3 × 3 output.

Nonlinear Activation Function


Pooling


Pooling

• Compute the maximum value in a sliding window (max pooling)

• Reduces spatial resolution for faster computation

• Achieves invariance to local translation

• Max pooling introduces invariances
  – Pooling size: 2 × 2
  – No parameters: max or average of 2 × 2 units
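Max pooling over non-overlapping blocks is a one-liner; a sketch on the lecture's 1D example (`max_pool1d` is a hypothetical helper name):

```python
import numpy as np

def max_pool1d(x, size):
    """Max over non-overlapping blocks of length `size` (stride = size)."""
    n = len(x) // size
    return np.array([max(x[i * size:(i + 1) * size]) for i in range(n)])

x = [1, 3, 2, 3, 0, -1, 1, 2, 2, 1]
print(max_pool1d(x, 2))  # [3 3 0 2 2]
```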

Pooling

• The most standard type of pooling is max-pooling, which computes max values over non-overlapping blocks

• For instance, in 1D with a window of size 2:

Source: Dr. Francois Fleuret at EPFL

Input:  1 3 2 3 0 -1 1 2 2 1

Output: 3 3 0 2 2

• Such an operation aims at grouping several activations into a single "more meaningful" one

• Average pooling computes average values per block instead of max values

Pooling: Invariance

• Pooling provides invariance to any permutation inside one of the cells

• More practically, it provides a pseudo-invariance to deformations that result in local translations

Source: Dr. Francois Fleuret at EPFL

Multi-channel Pooling

Source: Dr. Francois Fleuret at EPFL

Pooling is applied to each channel independently: an r × s window slides over each of the C channels of the input, so the output has the same number of channels C, with reduced spatial size.

Inside the Convolution Layer Block


Conv blocks

Classic ConvNet Architecture

• Input

• Conv blocks
  – Convolution + activation (ReLU)
  – Convolution + activation (ReLU)
  – ...
  – Max pooling

• Output
  – Fully connected layers
  – Softmax
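The block structure above can be sketched with tf.keras (assuming TensorFlow 2.x; the layer widths and the 28×28 MNIST-style input are illustrative choices, not taken from the slides):

```python
import tensorflow as tf

# One conv block (conv + relu, conv + relu, max pooling),
# followed by the fully-connected + softmax output.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),           # e.g. MNIST digits
    tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
    tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(2),                    # the block ends in max pooling
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),      # fully connected layers
    tf.keras.layers.Dense(10, activation="softmax"),    # class probabilities
])

model.summary()
```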

CNNs for Classification: Feature Learning

• Learn features in input image through convolution

• Introduce non-linearity through activation function (real-world data is non-linear!)

• Reduce dimensionality and preserve spatial invariance with pooling

Source: 6.S191 Intro. to Deep Learning at MIT

CNNs for Classification: Class Probabilities

• CONV and POOL layers output high-level features of input

• Fully connected layer uses these features for classifying input image

• Express output as probability of image belonging to a particular class

Source: 6.S191 Intro. to Deep Learning at MIT

CNNs: Training with Backpropagation

• Learn weights for convolutional filters and fully connected layers

• Backpropagation: cross-entropy loss

Source: 6.S191 Intro. to Deep Learning at MIT

CNN in TensorFlow


Lab: CNN with TensorFlow

• MNIST example

• To classify handwritten digits


CNN Structure


Weights, Biases and Placeholder


Build a Model

• Convolution layers

1) The layer performs several convolutions to produce a set of linear activations

2) Each linear activation is run through a nonlinear activation function

3) Pooling is used to modify the output of the layer further

• Fully connected layers
  – Simple multi-layer perceptrons (MLP)


Convolution

• First, the layer performs several convolutions to produce a set of linear activations

– Filter size: 3 × 3

– Stride: the stride of the sliding window for each dimension of the input

– Padding: allows us to control the output size independently of the kernel width

  • 'SAME' : zero padding

  • 'VALID' : no padding

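TensorFlow's documented output sizes for the two padding modes can be checked with a small sketch (`same_out` / `valid_out` are hypothetical helper names):

```python
import math

# TensorFlow's padding conventions (stride s, kernel width w, input width W):
#   'SAME'  : zero-pads so that out = ceil(W / s)
#   'VALID' : no padding,        out = ceil((W - w + 1) / s)
def same_out(W, s):
    return math.ceil(W / s)

def valid_out(W, w, s):
    return math.ceil((W - w + 1) / s)

print(same_out(28, 1))      # 28 -- 'SAME' keeps the spatial size at stride 1
print(valid_out(28, 3, 1))  # 26 -- 'VALID' shrinks by w - 1
print(same_out(28, 2))      # 14
```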

Activation

• Second, each linear activation is run through a nonlinear activation function


Pooling

• Third, use pooling to modify the output of the layer further

– Compute the maximum value in a sliding window (max pooling)

– Pooling size: 2 × 2


Second Convolution Layer


Fully Connected Layer

• Fully connected layer

– Input is typically a flattened feature vector

– Then, apply softmax for multiclass classification problems

– The output of the softmax function is a categorical probability distribution; it tells you the probability that each of the classes is the true one
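Softmax itself is two lines of NumPy (a sketch; the max subtraction is a standard numerical-stability trick):

```python
import numpy as np

# Softmax turns a vector of class scores into a categorical distribution
# (subtracting the max first avoids overflow in exp).
def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # probabilities are positive and sum to 1
```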


Loss and Optimizer

• Loss

– Classification: Cross entropy

– Equivalent to applying logistic regression

• Optimizer

– GradientDescentOptimizer

– AdamOptimizer: the most popular optimizer
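For a single example, cross-entropy reduces to minus the log-probability assigned to the true class; a minimal sketch (the probability vector and class indices are hypothetical):

```python
import numpy as np

# Cross-entropy loss for one example: -log of the probability the
# network assigned to the true class index y.
def cross_entropy(probs, y):
    return -np.log(probs[y])

probs = np.array([0.7, 0.2, 0.1])   # e.g. a softmax output
print(cross_entropy(probs, 0))      # small loss: confident and correct
print(cross_entropy(probs, 2))      # large loss: true class got low probability
```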


Optimization


Test or Evaluation

CNN Implemented in an Embedded System
