What is Deep Learning?
Activation of Action Potentials
Artificial Neural Network (ANN)
Sigmoid Activation Function
Multi-Layer Neural Networks
• Nonlinear classifier
• Training: find network weights w to minimize the error between the true training labels y_i and the estimated labels f_w(x_i):

  E(\mathbf{w}) = \sum_{i=1}^{N} \bigl(y_i - f_{\mathbf{w}}(x_i)\bigr)^2

• Minimization can be done by gradient descent, provided f is differentiable
• This training method is called back-propagation (a minimal sketch follows)
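A minimal NumPy sketch of this training rule, assuming a one-layer sigmoid unit f_w(x) = sigmoid(w·x) and made-up OR-style data (neither is from the slides):

```python
import numpy as np

# Squared-error training of a one-layer sigmoid unit by gradient descent.
# E(w) = sum_i (y_i - f_w(x_i))^2, with f_w(x) = sigmoid(w . x).
# Data is a made-up OR problem; the last input column is a constant bias.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([0., 1., 1., 1.])    # true training labels y_i
w = np.zeros(3)                   # network weights to learn
lr = 0.5                          # learning rate (chosen arbitrarily)

for step in range(2000):
    f = sigmoid(X @ w)            # estimated labels f_w(x_i)
    # dE/dw = -2 * sum_i (y_i - f_i) * f_i * (1 - f_i) * x_i
    grad = -2.0 * ((y - f) * f * (1 - f)) @ X
    w -= lr * grad                # gradient descent step

print("learned w:", w)
print("E(w):", np.sum((y - sigmoid(X @ w)) ** 2))
```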
Shallow Network vs. Deep Network
# of hidden layers <= 1 (i.e., shallow network) vs. # of hidden layers >= 2 (i.e., deep network)
Training: Forward-propagation
http://www.slideshare.net/keepurcalm/intro-to-deep-learning-autoencoders
Three-layer neural network: two inputs & one output; weights on the connections
Training via Backpropagation (1): error = target − output
weight updates
Training via Backpropagation (2): final weight updates (worked sketch below)
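A minimal sketch of one forward pass and the resulting backpropagation weight updates for a tiny two-input, one-output sigmoid network like the one on the slide; the inputs, target, initial weights, and learning rate are all invented for illustration:

```python
import numpy as np

# One backpropagation step for a 2-input, 2-hidden-unit, 1-output
# sigmoid network. All numbers below are made up for illustration.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.1, 0.9])          # two inputs
target = 1.0                      # desired output
W1 = np.array([[0.2, -0.3],       # hidden weights (2 hidden x 2 inputs)
               [0.4,  0.1]])
w2 = np.array([0.5, -0.2])        # output weights (1 output x 2 hidden)
lr = 0.5                          # learning rate

# Forward-propagation
h = sigmoid(W1 @ x)               # hidden activations
out = sigmoid(w2 @ h)             # network output

# Backpropagation: error = target - output, pushed back layer by layer
delta_out = (target - out) * out * (1 - out)   # output error signal
delta_h = delta_out * w2 * h * (1 - h)         # hidden error signals

# Final weight updates (descent on the squared error)
w2 += lr * delta_out * h
W1 += lr * np.outer(delta_h, x)

print("output before update:", out)
```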
Difficulty of Training Deep Neural Networks (or Multi-Layer NNs)
• Vanishing gradient problem (see the sketch after this list)
  • Problem with nonlinear activation functions
  • The gradient (error signal) decreases exponentially with the number of layers, so the front layers train very slowly
• Over-fitting
  • Given limited amounts of labeled data, training via backpropagation does not work well
• Local minima
  • Difficulty in optimization
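A minimal sketch illustrating the vanishing-gradient claim: each sigmoid layer multiplies the backpropagated error by sigmoid'(z) ≤ 0.25, so the signal shrinks roughly exponentially with depth (the pre-activations below are made up):

```python
import numpy as np

# Product of sigmoid derivatives along a deep chain of layers.
# sigmoid'(z) = s(1-s) <= 0.25, so the backpropagated error signal
# shrinks roughly exponentially with the number of layers.

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
grad = 1.0
for layer in range(1, 21):
    z = rng.normal()              # made-up pre-activation at this layer
    grad *= sigmoid_grad(z)       # chain rule: one factor per layer
    if layer % 5 == 0:
        print(f"after {layer:2d} layers: error signal ~ {grad:.2e}")
```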
Traditional Solution: Feature Extraction
• Supervised learning
• Pipeline: Image/Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class
• Features are not learned but hand-designed by humans; only the classifier is trainable
Facial Feature Extraction
New Solutions for Deep Nets
• Vanishing gradient problem
  • Solved by a new non-linear activation function: the rectified linear unit (ReLU) (2010, 2011)
• Over-fitting
  • Solved by new regularization methods such as dropout (Hinton et al., 2012); see the sketch after this list
• Local minima
  • Addressed by the behavior of high-dimensional non-convex optimization: local minima are all similar
  • Local minima are good and close to the global minimum
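A minimal sketch of dropout in its common "inverted" form, applied to a made-up activation vector; Hinton et al. (2012) instead rescale weights at test time, but the effect is equivalent:

```python
import numpy as np

# Inverted dropout: during training, zero each activation with
# probability p and rescale the survivors, so nothing changes at test
# time. The activation vector below is made up for illustration.

rng = np.random.default_rng(0)

def dropout(h, p, train=True):
    if not train:
        return h                          # test time: use all units as-is
    mask = (rng.random(h.shape) >= p)     # keep each unit with prob 1-p
    return h * mask / (1.0 - p)           # rescale so E[output] is unchanged

h = np.array([0.3, 1.2, 0.7, 0.0, 2.1])  # hidden activations (made up)
print(dropout(h, p=0.5))
```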
Supervised vs. Unsupervised :: Shallow vs. Deep
• Supervised learning for a shallow net:
  Image/Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class
• Unsupervised learning for a deep net:
  Image/Video Pixels → Layer 1 → … → Layer N → Simple classifier → Object Class
Deep learning: “Deep” architecture
Deep Learning: Multiple Levels of Feature Representation
What is Deep Learning?
• Deep learning is training a deep network.
• Deep learning = hierarchical learning
• Replaces (supervised) handcrafted features with unsupervised feature learning and hierarchical feature extraction
• The key to a deep net = its weights
• Various deep learning architectures: deep belief network (DBN), convolutional neural network (CNN), recurrent neural network (RNN)
• Application areas: visual object recognition, object detection, speech recognition, bioinformatics, etc.
Three Well-known Deep Learning Algorithms
1. Deep Belief Network (DBN)
2. Convolutional Neural Network (CNN)
3. Recurrent Neural Network (RNN)
What is CNN?
Visual Processing in the Brain
Hierarchical Visual Representation
Convolutional Neural Networks (CNN, ConvNet)
• Neural network with a specialized connectivity structure
• Stacks multiple stages of feature extractors
• Higher stages compute more global, more invariant features
• Classification layer at the end
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11): 2278–2324, 1998.
• Feed-forward feature extraction:
  1. Convolve input with learned filters
  2. Non-linearity
  3. Spatial pooling
  4. Normalization
• Supervised training of convolutional filters by back-propagating classification error
Pipeline: Input Image → Convolution (Learning) → Non-linearity → Spatial pooling → Normalization
Convolutional Neural Networks: Basic Structure
Feature maps
1. Convolution
• Dependencies are local
• Translation invariant
• Few parameters (filter weights)
• Stride can be greater than 1 (faster, less memory)
Figure: convolution applied across an input feature map
Figure and content source: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
The process of convolving a 5x5 image with a 3x3 kernel (a sketch follows below).
Different features can be obtained depending on the kernel coefficients. Conventionally, the coefficients are fixed for a specific purpose, but the kernels used in a CNN determine their optimal coefficients through learning.
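A minimal sketch of the operation described above: convolving a 5x5 image with a 3x3 kernel ("valid" region, stride 1) to produce a 3x3 feature map. As in most CNN libraries, the kernel is applied without flipping (cross-correlation); the image values and the example kernel are made up:

```python
import numpy as np

# Valid-mode 2D convolution (cross-correlation, as used in CNNs).
# A 5x5 image with a 3x3 kernel and stride 1 yields a 3x3 feature map.

def conv2d(image, kernel, stride=1):
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)  # multiply-accumulate
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # made-up 5x5 input
kernel = np.array([[1, 0, -1],                    # fixed example kernel;
                   [1, 0, -1],                    # in a CNN these weights
                   [1, 0, -1]], dtype=float)      # are learned
print(conv2d(image, kernel))                      # 3x3 feature map
```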
2. Non-Linearity
• Per-element (independent)
• Options:
  • Tanh
  • Sigmoid: 1/(1+exp(-x))
  • Rectified linear unit (ReLU)
    – Simplifies backpropagation
    – Makes learning faster
    – Avoids saturation issues
    → Preferred option (comparison sketch below)
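A minimal sketch of the three activation options and their derivatives (the derivatives are what matter during backpropagation); the sample inputs are arbitrary:

```python
import numpy as np

# The three per-element activation options and their derivatives.
# Note ReLU's derivative is exactly 1 for x > 0: no saturation there.

def tanh(x):       return np.tanh(x)
def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def relu(x):       return np.maximum(0.0, x)

def tanh_grad(x):    return 1.0 - np.tanh(x) ** 2
def sigmoid_grad(x): s = sigmoid(x); return s * (1 - s)
def relu_grad(x):    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # arbitrary sample inputs
for name, f, g in [("tanh", tanh, tanh_grad),
                   ("sigmoid", sigmoid, sigmoid_grad),
                   ("ReLU", relu, relu_grad)]:
    print(name, f(x).round(3), "grad:", g(x).round(3))
```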
3. Spatial Pooling
• Sum or max
• Non-overlapping / overlapping regions
• Role of pooling:
  • Invariance to small transformations
  • Larger receptive fields (see more of the input)
Subsampling = Pooling
Conventional sub-sampling usually either picks pixels at fixed positions or takes the average of the pixels within the sub-sampling window.
Sub-sampling in a CNN instead uses max-pooling, which, much like a neuron, passes on only the strongest signal and ignores the rest (see the sketch below).
Figure and content source: http://blog.naver.com/laonple/220608018546
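A minimal sketch of non-overlapping 2x2 max-pooling, applied to a made-up 4x4 feature map:

```python
import numpy as np

# Non-overlapping 2x2 max-pooling: each window passes on only its
# strongest signal, halving each spatial dimension.

def max_pool(fmap, size=2):
    h, w = fmap.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            out[i // size, j // size] = fmap[i:i+size, j:j+size].max()
    return out

fmap = np.array([[1, 3, 2, 0],     # made-up 4x4 feature map
                 [4, 2, 1, 1],
                 [0, 1, 5, 2],
                 [2, 0, 1, 3]], dtype=float)
print(max_pool(fmap))              # [[4. 2.] [2. 5.]]
```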
4. Normalization
• Within or across feature maps
• Before or after spatial pooling
Figure: feature maps before and after contrast normalization (a minimal sketch follows)
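A minimal sketch of contrast normalization within a single feature map (subtract the mean, divide by the standard deviation); real CNNs often normalize over local windows or across maps, and the values here are made up:

```python
import numpy as np

# Whole-map contrast normalization: zero mean, unit standard deviation.
# A small epsilon avoids division by zero on constant maps.

def contrast_normalize(fmap, eps=1e-5):
    return (fmap - fmap.mean()) / (fmap.std() + eps)

fmap = np.array([[10.0, 12.0],     # made-up feature map
                 [14.0, 200.0]])
print(contrast_normalize(fmap))
```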
CNN Applications
• Handwritten text/digits
  • MNIST (0.17% error [Ciresan et al. 2011])
  • Arabic & Chinese [Ciresan et al. 2012]
• Simpler recognition benchmarks
  • CIFAR-10 (9.3% error [Wan et al. 2013])
  • Traffic sign recognition
    – 0.56% error vs. 1.16% for humans [Ciresan et al. 2011]
• But until recently, less good at more complex datasets
  • Caltech-101/256 (few training examples)
Brain vs. CNN
Deep Learning Resources
• Google TensorFlow: https://www.tensorflow.org/
• UC Berkeley Caffe: http://caffe.berkeleyvision.org/
• MATLAB Toolbox:
  • https://kr.mathworks.com/discovery/deep-learning.html
  • https://kr.mathworks.com/matlabcentral/fileexchange/38310-deep-learning-toolbox
• Microsoft Cognitive Toolkit (CNTK): https://github.com/Microsoft/CNTK/wiki/KDD-2016-Tutorial