
Page 1: Introduction to Convolutional Neural Networks (web.stanford.edu/class/cs231a/lectures/intro_cnn.pdf, 2018-02-25)

Introduction to Convolutional Neural Networks

2018 / 02 / 23

Page 2:

Buzzword: CNN

Convolutional neural networks (CNN, ConvNet) are a class of deep, feed-forward (not recurrent) artificial neural networks applied to analyzing visual imagery.

Page 3:

Buzzword: CNN

● Convolution

From Wikipedia:
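The definition that followed here appeared as a figure; Wikipedia's standard convolution formula, in continuous and discrete form, is:

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
\qquad
(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]
```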

Page 4:

Buzzword: CNN

● Neural Networks

Page 5:

Background: Visual Signal Perception

Page 6:

Background: Signal Relay

Starting from the V1 primary visual cortex, the visual signal is transmitted upward, becoming more complicated and abstract.

Page 7:

Background: Neural Networks

Expressing the equations in matrix form, we have:
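The matrix-form equations appeared as a figure; a standard reconstruction of a fully connected layer in matrix form is:

```latex
\mathbf{a}^{(l)} = \sigma\!\left( W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} \right)
```

where $W^{(l)}$ is the layer's weight matrix, $\mathbf{b}^{(l)}$ its bias vector, and $\sigma$ an elementwise activation function.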

Page 8:

Page 9:

Neural Networks for Images

For computer vision, why can't we just flatten the image and feed it through a neural network?

Page 10:

Neural Networks for Images

Images are high-dimensional vectors, so it would take a huge number of parameters to characterize the network.
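To make the parameter blow-up concrete, here is a quick back-of-the-envelope count for one fully connected layer on a raw image (the 224×224×3 input and 1000 hidden units are illustrative sizes, not from the slides):

```python
# Parameter count for a single fully connected layer on a flattened image.
height, width, channels = 224, 224, 3
hidden_units = 1000

input_dim = height * width * channels      # 150,528 input values
weights = input_dim * hidden_units         # one weight per (input, unit) pair
biases = hidden_units
total = weights + biases
print(total)  # 150529000 -- over 150 million parameters for one layer
```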

Page 11:

Convolutional Neural Networks

To address this problem, bionic convolutional neural networks were proposed to reduce the number of parameters and adapt the network architecture specifically to vision tasks.

Convolutional neural networks are usually composed of a set of layers that can be grouped by their functionality.

Page 12:

Sample Architecture

Page 13:

Convolution Layer

● The process is a 2D convolution on the inputs.

● The “dot products” between weights and inputs are “integrated” across “channels”.

● Filter weights are shared across receptive fields. The filter has the same number of channels as the input volume, and the output volume has the same "depth" as the number of filters.
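The shape behaviour described in these bullets can be sketched with a naive loop-based convolution (stride 1, no padding); all names and sizes here are illustrative:

```python
import numpy as np

def conv2d(inputs, filters):
    """Naive 2D convolution. inputs: (H, W, C_in); filters: (K, K, C_in, C_out)."""
    H, W, C_in = inputs.shape
    K, _, _, C_out = filters.shape
    out = np.zeros((H - K + 1, W - K + 1, C_out))
    for f in range(C_out):                 # each filter yields one output channel
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                # dot product over the receptive field, summed across channels
                out[i, j, f] = np.sum(inputs[i:i+K, j:j+K, :] * filters[:, :, :, f])
    return out

x = np.random.rand(8, 8, 3)        # 3-channel input
w = np.random.rand(3, 3, 3, 5)     # 5 filters, each matching the 3 input channels
y = conv2d(x, w)
print(y.shape)  # (6, 6, 5): output depth equals the number of filters
```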

Page 14:

Page 15:
Page 16:

Convolution Layer

Page 17:

Activation Layer

● Used to increase the non-linearity of the network without affecting the receptive fields of the conv layers

● ReLU is preferred, as it results in faster training

● Leaky ReLU addresses the "dying ReLU" problem (zero gradient for negative inputs)

Other types: Leaky ReLU, Randomized Leaky ReLU, Parameterized ReLU, Exponential Linear Units (ELU), Scaled Exponential Linear Units, tanh, hardtanh, softtanh, softsign, softmax, softplus, ...
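A few of the activations listed above can be sketched in a couple of lines (the alpha values are the usual defaults, chosen here for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small negative slope keeps a non-zero gradient for negative inputs
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # [0.0, 0.0, 3.0]
print(leaky_relu(x))  # [-0.02, 0.0, 3.0]
```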

Page 18:

Softmax

● A special kind of activation layer, usually applied to the final FC layer's outputs

● Can be viewed as a fancy normalizer (a.k.a. Normalized exponential function)

● Produces a discrete probability distribution vector

● Very convenient when combined with cross-entropy loss

Given a sample input vector x and weight vectors {w_j}, the predicted probability of y = j is P(y = j | x) = exp(xᵀw_j) / Σ_k exp(xᵀw_k).
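A minimal softmax sketch over a vector of class scores (the scores here are made up for illustration):

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs.sum())  # 1.0 -- a valid discrete probability distribution
```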

Page 19:

Pooling Layer

● Convolutional layers provide activation maps.

● Pooling layer applies non-linear downsampling on activation maps.

● Pooling is aggressive (it discards information); the trend is to use smaller filter sizes and abandon pooling
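The non-linear downsampling can be sketched as 2×2 max pooling with stride 2 on a single activation map (the input values are illustrative):

```python
import numpy as np

def max_pool(a, size=2):
    """Max pooling with stride == window size on a 2D activation map."""
    H, W = a.shape
    out = np.zeros((H // size, W // size))
    for i in range(0, H - size + 1, size):
        for j in range(0, W - size + 1, size):
            out[i // size, j // size] = a[i:i+size, j:j+size].max()
    return out

a = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 1., 2., 3.],
              [4., 5., 6., 7.]])
print(max_pool(a))  # [[4. 8.] [9. 7.]] -- each 2x2 block reduced to its max
```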

Page 20:

Pooling Layer

Page 21:

FC Layer

● A regular neural network

● Can be viewed as the final learning phase, which maps extracted visual features to desired outputs

● Usually well suited to classification/encoding tasks

● The common output is a vector, which is then passed through softmax to represent the confidence of classification

● The outputs can also be used as a "bottleneck"

In the example above, the FC layer generates a number which is then passed through a sigmoid to represent grasp success probability.

Page 22:

Loss Layer

● L1, L2 loss

● Cross-entropy loss (works well for classification, e.g., image classification)

● Hinge loss

● Huber loss: more resilient to outliers, with smooth gradient

● Mean squared error (works well for regression tasks, e.g., behavioral cloning)

Binary case: L = −[y log(p) + (1 − y) log(1 − p)]

General case: L = −Σ_j y_j log(p_j)
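The binary and general cross-entropy cases can be sketched directly from the formulas (the clipping epsilon is a standard numerical guard, added here for illustration):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """y in {0, 1}; p is the predicted probability of class 1."""
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy(y_onehot, probs, eps=1e-12):
    """General case: one-hot label against a probability vector."""
    probs = np.clip(probs, eps, 1.0)
    return -np.sum(y_onehot * np.log(probs))

# A confident correct prediction gives a small loss:
loss = cross_entropy(np.array([0., 1., 0.]), np.array([0.05, 0.9, 0.05]))
print(loss)  # == -log(0.9), about 0.105
```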

Page 23:

Regularization

● L1 / L2

● Dropout

● Batch norm

● Gradient clipping

● Max norm constraint

Used to prevent overfitting, along with a huge amount of training data.

Page 24:

Dropout

● During training, randomly drop activations (each is kept with probability p)

● During testing, use all activations but scale them by p

● Effectively prevents overfitting by reducing the correlation between neurons
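The two bullets above can be sketched as a single function with a training flag (this follows the convention in the text, where p is the keep probability):

```python
import numpy as np

def dropout(activations, p, training):
    """Keep each activation with probability p during training;
    scale by p at test time so expected activations match."""
    if training:
        mask = np.random.rand(*activations.shape) < p  # keep with probability p
        return activations * mask
    return activations * p

np.random.seed(0)
a = np.ones(10)
print(dropout(a, p=0.5, training=True))   # roughly half the units zeroed
print(dropout(a, p=0.5, training=False))  # all units present, scaled by 0.5
```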

Page 25:

Batch Normalization

● Makes networks robust to bad initialization of weights

● Usually inserted right before activation layers

● Reduces covariate shift by normalizing and scaling inputs

● The scale and shift parameters are trainable, to avoid losing the stability of the network
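The forward pass of batch normalization can be sketched as follows; gamma (scale) and beta (shift) are the trainable parameters mentioned above, and eps is the usual numerical guard:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)                  # per-feature statistics over the batch
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

np.random.seed(1)
x = np.random.randn(32, 4) * 10 + 5        # badly scaled inputs
out = batch_norm(x, gamma=1.0, beta=0.0)
print(out.mean(axis=0))  # ~0 for every feature
print(out.std(axis=0))   # ~1 for every feature
```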

Page 26:

Example: ResNet

● Residual Network, by Kaiming He et al. (2015)

● Heavy usage of “skip connections” which are similar to RNN Gated Recurrent Units (GRU)

● Commonly used as a visual feature extractor in all kinds of learning tasks: ResNet50, ResNet101, ResNet152

● 3.57% top-5 error on ImageNet, surpassing human-level performance
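The skip connection at the heart of ResNet can be sketched in a few lines; F(x) here is a toy linear map for illustration, whereas real residual blocks use convolutions and batch norm:

```python
import numpy as np

def residual_block(x, W):
    """y = F(x) + x: the block learns only the residual F(x)."""
    fx = np.maximum(0.0, x @ W)   # F(x): a toy transformation with ReLU
    return fx + x                 # skip connection adds the input back

np.random.seed(2)
x = np.random.randn(2, 4)
W = np.zeros((4, 4))              # if F(x) = 0, the block is the identity
print(residual_block(x, W) - x)   # all zeros: the skip path passes x through
```

This identity-by-default behaviour is why very deep residual networks remain trainable: a block that learns nothing still propagates the signal (and its gradient) unchanged.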

Page 27:

Applications

A CNN can be viewed as a fancy feature extractor, just like SIFT, SURF, etc.

Page 28:

CNN & Vision: RedEye

https://roblkw.com/papers/likamwa2016redeye-isca.pdf

Two keys: Add noise, save bandwidth/energy

Page 29:

CNN & Robotics: RL Example

Usually used with a Multi-Layer Perceptron (MLP, which can be viewed as a fancy term for non-trivial neural networks) for policy networks.

Page 30:

Software 2.0

Quotes from Andrej Karpathy:

Software 1.0 is what we’re all familiar with — it is written in languages such as Python, C++, etc. It consists of explicit instructions to the computer written by a programmer. By writing each line of code, the programmer is identifying a specific point in program space with some desirable behavior.

Software 2.0 is written in neural network weights. No human is involved in writing this code because there are a lot of weights (typical networks might have millions). Instead, we specify some constraints on the behavior of a desirable program (e.g., a dataset of input output pairs of examples) and use the computational resources at our disposal to search the program space for a program that satisfies the constraints.

Page 31:

Is CNN the Answer?

● Capsule networks?

● Is end-to-end not the right way?