Introduction to Neural Networks
TRANSCRIPT
Pictures are taken from
http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html
http://research.microsoft.com/~cmbishop/PRML/index.htm
By Nobel Khandaker
INTRODUCTION TO NEURAL NETWORKS
Neural Networks – An Introduction
Overview of Neural Networks
Origin, Definitions, examples
Basic building blocks of Neural Networks
Perceptrons, Sigmoids
Gradient Descent Algorithm
BACKPROPAGATION Algorithm
What is a Neural Network? - I
A general, practical method for learning real-valued, discrete-valued and vector-valued functions from examples
Uses of Neural Networks:
Recognizing handwritten characters (Microsoft uses ANN)
Recognizing spoken words
Recognizing human faces
Interpreting visual scenes
Learning robot control strategies
What is a Neural Network? - II
Neural Network is a set of connected INPUT/OUTPUT UNITS, where each connection has a WEIGHT associated with it.
Neural Network learning is also called CONNECTIONIST learning due to the connections between units.
It is a case of SUPERVISED, INDUCTIVE or CLASSIFICATION learning.
A Neural Network learns by adjusting the weights so as to correctly classify the training data and hence, after the testing phase, to classify unknown data.
Example of Neural Network
Use of Neural Network
The ALVINN system uses Neural Networks to steer an autonomous vehicle (at 70 mph)
The Neural Network uses the camera input to determine the steering direction
Invention of Neural Networks
Biological learning systems are built of very complex webs of interconnected neurons, e.g., the human brain
Your brain takes about 10^-1 seconds to recognize your mother
Neural networks are built using densely interconnected sets of simple units
Each unit takes a number of real-valued inputs and produces a single real-valued output
Strengths and Weaknesses of Neural Networks - I
Strengths
Can handle complex data (i.e., problems with many parameters)
Can handle noise in the training data
Prediction accuracy is generally high
Neural Networks are robust and work well even when training examples contain errors
Neural Networks can handle missing data well
Strengths and Weaknesses of NNs - II
Neural Network implementations are slow in the training phase
A major disadvantage of neural networks lies in their knowledge representation:
Acquired knowledge, in the form of a network of units connected by weighted links, is difficult for humans to interpret.
This factor has motivated research in extracting the knowledge embedded in trained neural networks and representing it in the form of symbolic rules
Perceptron
Use of Perceptron
Say +1 represents TRUE and -1 represents FALSE
How can we set the weights of a perceptron to represent AND?
w0 = -0.8, w1 = w2 = 0.5
Name a Boolean function that cannot be represented by a single perceptron:
XOR
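As a quick sketch (assuming the usual threshold form o = sgn(w0 + w1*x1 + w2*x2)), the stated weights can be checked directly:

```python
# A perceptron with the stated weights: outputs +1 if w0 + w1*x1 + w2*x2 > 0, else -1.
def perceptron(x1, x2, w0=-0.8, w1=0.5, w2=0.5):
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

# With +1 for TRUE and -1 for FALSE, only the input (1, 1) clears the threshold,
# so the perceptron computes AND.
truth_table = {(x1, x2): perceptron(x1, x2) for x1 in (-1, 1) for x2 in (-1, 1)}
```

No such weight setting exists for XOR, since XOR's positive and negative examples cannot be separated by a single line.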
Perceptron Training Rule - I
Problem: Determine the weight vector that causes the perceptron to produce correct output for the training examples.
Several algorithms exist:
Perceptron Rule
Delta Rule
Both of these algorithms are guaranteed to converge
For perceptron rule, training examples are assumed to be linearly separable
Perceptron Training Rule - II
The perceptron rule updates each weight as: w_i ← w_i + η(t − o)x_i
Learning will converge if:
the training examples are linearly separable
η is sufficiently small
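A minimal sketch of the perceptron rule w_i ← w_i + η(t − o)x_i; the ±1 encoding and the AND training set are illustrative assumptions:

```python
def train_perceptron(examples, eta=0.1, epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i."""
    w = [0.0, 0.0, 0.0]                 # w[0] is the bias weight (its input is fixed at 1)
    for _ in range(epochs):
        converged = True
        for (x1, x2), t in examples:
            o = 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else -1
            if o != t:                  # update only on misclassified examples
                converged = False
                for i, xi in enumerate((1, x1, x2)):
                    w[i] += eta * (t - o) * xi
        if converged:                   # a full pass with no errors: done
            break
    return w

# AND with +/-1 encoding is linearly separable, so the rule converges:
data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
w = train_perceptron(data)
```

If the data were not linearly separable (e.g., XOR), this loop would never reach a full error-free pass, which is why the delta rule below is needed.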
Gradient Descent and Delta Rule - I
How can we train perceptrons when the training examples are not linearly separable?
Use the delta rule
Key idea of the delta rule:
Use gradient descent to search the hypothesis space to find the weights that best fit the training examples
Gradient Descent and Delta Rule -II
D – set of training examples
t_d – target output for training example d
o_d – output of the linear unit for training example d
E(w) = 1/2 Σ_{d∈D} (t_d − o_d)^2 – the training error to be minimized
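The delta rule can be sketched as batch gradient descent on E(w) = 1/2 Σ_d (t_d − o_d)^2 for a linear unit o = w0 + w1*x1 + w2*x2; the learning rate and the (non-thresholded) AND data are illustrative:

```python
def delta_rule(examples, eta=0.05, epochs=500):
    """Batch gradient descent on E(w) = 1/2 * sum_d (t_d - o_d)^2
    for a linear unit o = w0 + w1*x1 + w2*x2."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        delta = [0.0, 0.0, 0.0]
        for (x1, x2), t in examples:
            o = w[0] + w[1] * x1 + w[2] * x2       # linear unit, no threshold
            for i, xi in enumerate((1.0, x1, x2)):
                delta[i] += eta * (t - o) * xi      # -eta * dE/dw_i, accumulated over D
        w = [wi + di for wi, di in zip(w, delta)]   # one batch update per epoch
    return w

# For the AND data with +/-1 encoding the least-squares fit is w = (-0.5, 0.5, 0.5):
data = [((-1.0, -1.0), -1.0), ((-1.0, 1.0), -1.0), ((1.0, -1.0), -1.0), ((1.0, 1.0), 1.0)]
w = delta_rule(data)
```

Unlike the perceptron rule, this converges (to the best-fit weights) whether or not the examples are linearly separable.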
Gradient Descent and Delta Rule - III
The plane of the weights w0, w1 represents the entire hypothesis space
The vertical axis represents the error E
The gradient descent search determines the weight vector that minimizes E
Gradient Descent Algorithm
Multilayer Networks
Single perceptrons can only express linear decision surfaces
Multilayer networks can express non-linear decision surfaces
We need a network that can represent highly non-linear functions
We can use sigmoid units.
Example of a Multilayer Network
The network was trained to recognize 1 of 10 vowel sounds
The network inputs consist of F1 and F2, obtained from spectral analysis of the sound
The network prediction is the output whose value is highest
Decision regions of a multilayer feed-forward network
Sigmoid Units
A sigmoid unit computes its output o as:
o = σ(w · x), where σ(y) = 1 / (1 + e^(−y))
The range of the output function is (0, 1)
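A minimal sketch of a sigmoid unit (bias folded in as weights[0], matching the perceptron convention used earlier):

```python
import math

def sigmoid(y):
    """Logistic function: sigma(y) = 1 / (1 + e^(-y)); output lies in (0, 1).
    Its derivative, sigma(y) * (1 - sigma(y)), is what BACKPROPAGATION uses."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit(weights, inputs):
    """o = sigma(w . x), with weights[0] acting as the bias weight."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(net)
```

Because σ is smooth (unlike the perceptron's hard threshold), the network's output is differentiable in the weights, which is what makes gradient descent applicable.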
BACKPROPAGATION Algorithm - I
Backpropagation(training_examples, η, n_in, n_out, n_hidden)
Each training example is a pair ⟨x, t⟩, where
x denotes the vector of network input values
t denotes the vector of target network output values
η = learning rate
n_in = number of network inputs
BACKPROPAGATION Algorithm - II
Backpropagation(training_examples, η, n_in, n_out, n_hidden)
n_out = number of network outputs
n_hidden = number of units in the hidden layer
x_ji denotes the input from unit i to unit j
w_ji denotes the weight from unit i to unit j
Since this is a network with multiple output units, the error function is defined as:
E(w) = 1/2 Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)^2
BACKPROPAGATION Algorithm - III
Create a feed-forward network with n_in inputs, n_hidden hidden units, and n_out output units
Initialize all network weights to small random numbers (e.g., between -0.05 and 0.05)
Until the termination condition is met, Do
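The steps above can be sketched as the stochastic-gradient version for one hidden layer of sigmoid units; the AND training task, the fixed epoch count as the termination condition, and the helper names are illustrative assumptions:

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def predict(w_hidden, w_out, x):
    """Forward pass; w[j][0] is the bias weight of unit j."""
    h = [sigmoid(wj[0] + sum(wi * xi for wi, xi in zip(wj[1:], x))) for wj in w_hidden]
    return [sigmoid(wk[0] + sum(wi * hi for wi, hi in zip(wk[1:], h))) for wk in w_out]

def backpropagation(training_examples, eta, n_in, n_hidden, n_out, epochs=5000, seed=0):
    rnd = random.Random(seed)
    # initialize all weights to small random numbers in (-0.05, 0.05)
    w_hidden = [[rnd.uniform(-0.05, 0.05) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[rnd.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):                 # "until the termination condition is met"
        for x, t in training_examples:
            # forward pass
            h = [sigmoid(wj[0] + sum(wi * xi for wi, xi in zip(wj[1:], x))) for wj in w_hidden]
            o = [sigmoid(wk[0] + sum(wi * hi for wi, hi in zip(wk[1:], h))) for wk in w_out]
            # output error terms: delta_k = o_k * (1 - o_k) * (t_k - o_k)
            d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # hidden error terms: delta_j = h_j * (1 - h_j) * sum_k w_kj * delta_k
            d_hid = [hj * (1 - hj) * sum(w_out[k][j + 1] * d_out[k] for k in range(n_out))
                     for j, hj in enumerate(h)]
            # weight updates: w_ji <- w_ji + eta * delta_j * x_ji
            for k in range(n_out):
                for i, xi in enumerate([1.0] + h):
                    w_out[k][i] += eta * d_out[k] * xi
            for j in range(n_hidden):
                for i, xi in enumerate([1.0] + list(x)):
                    w_hidden[j][i] += eta * d_hid[j] * xi
    return w_hidden, w_out

# Illustrative task: learn AND with a 0/1 encoding.
data = [((0, 0), (0,)), ((0, 1), (0,)), ((1, 0), (0,)), ((1, 1), (1,))]
w_h, w_o = backpropagation(data, eta=0.3, n_in=2, n_hidden=2, n_out=1)
```

The weight update per example is exactly stochastic gradient descent on the per-example squared error, which is why the errors must be propagated from the outputs back through the hidden layer.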
BACKPROPAGATION Algorithm - IV
The BACKPROPAGATION algorithm uses a gradient descent search through the space of possible network weights, iteratively reducing E
Gradient descent may get trapped in any one of the local minima
It is only guaranteed to converge to some local minimum of E
However, in practice, the BACKPROPAGATION algorithm performs well
Gradient descent over complex error surfaces is poorly understood
BACKPROPAGATION Algorithm - V
No methods exist to predict with certainty when local minima will cause difficulties
Heuristics used to alleviate the problem of local minima:
Train multiple networks on the same data, but initialize each network with different random weights
Use stochastic gradient descent
Add a momentum term to the weight-update rule
Example of BACKPROPAGATION - I
Example of BACKPROPAGATION - II
Example of BACKPROPAGATION - III
[Figure: a neural network for simulating the AND function — input units, hidden units, and one output unit (units numbered 1–6), with all connection weights set to 0.5.]
Example of BACKPROPAGATION - III
The given network was trained using initial weights randomly set between (-1.0, 1.0)
Learning rate η = 0.3
(x, y) = (number of iterations of the outer loop, sum of squared errors)
Example of BACKPROPAGATION - IV
Evolution of the hidden layers
(x, y) = (number of iterations of the outer loop, hidden unit values)
Example of BACKPROPAGATION - V
Evolution of individual weights
(x, y) = (number of iterations of the outer loop, weights from the inputs to one hidden unit)
Representational Power of Feedforward Networks
Set of functions that can be represented:
Boolean functions
Number of hidden units required grows exponentially with the number of network inputs in the worst case
Continuous functions
Every bounded continuous function can be approximated with a network of two layers
Arbitrary functions
Any arbitrary function can be approximated to an arbitrary accuracy by a network of three layers
Regularization - I
The number of inputs and outputs in a network is determined by the dimensionality of the data and the number of classes
The number of hidden units (M) is a free parameter that can be adjusted to give the best predictive performance
M also determines the number of weights and biases in the network
A sub-optimal value of M can result in under-fitting or over-fitting
Regularization - II
Examples of two-layer networks trained on 10 data points drawn from the sinusoidal data set
Regularization - III
How can we control the complexity of a neural network to avoid over-fitting?
We can choose a relatively large value of M and then control the complexity by adding a regularizer term
A simple regularizer is:
Ẽ(w) = E(w) + (λ/2) wᵀw
This regularizer is also known as weight decay
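A minimal sketch of how the weight-decay term enters training; E and grad_E stand for any underlying error function and its gradient (the function names are illustrative):

```python
def regularized_error(E, weights, lam):
    """E~(w) = E(w) + (lam / 2) * w . w  -- the weight-decay regularizer."""
    return E + 0.5 * lam * sum(w * w for w in weights)

def gradient_step(weights, grad_E, eta, lam):
    """dE~/dw_i = dE/dw_i + lam * w_i, so every gradient step also
    shrinks ('decays') each weight toward zero."""
    return [w - eta * (g + lam * w) for w, g in zip(weights, grad_E)]
```

Even when the data gradient is zero, each step multiplies the weights by (1 − ηλ), which is where the name "weight decay" comes from.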
Regularization - IV
Problem: The simple weight-decay regularizer is inconsistent with the scaling properties of network mappings
Solution: A regularizer invariant under linear transformations:
(λ1/2) Σ_{w∈W1} w² + (λ2/2) Σ_{w∈W2} w²
W_i – the set of weights in the i-th layer
This regularizer remains unchanged under such transformations provided λ1 → a^(1/2) λ1 and λ2 → c^(1/2) λ2
Invariances - I
Predictions of a classifier should remain invariant under certain transformations of the input variables
Example: in handwritten character recognition:
Each character should be classified correctly irrespective of its position (translation invariance)
Each character should be classified correctly irrespective of its size (scale invariance)
A neural network can learn such invariances given a sufficient number of training examples
Invariances - II
What if we do not have enough training examples?
Augment the training set using replicas of the training patterns
Example: make multiple copies of the training set for the character recognition problem, where each character is shifted to a different position
Add a regularization term to the error function that penalizes changes in the model output when the input is transformed
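The first idea (translated replicas) can be sketched for tiny binary character images; representing a character as a 2-D 0/1 grid is an illustrative assumption:

```python
def shifted_copies(image, shifts=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """Augment a 2-D 0/1 'character' image with translated replicas.
    Pixels shifted outside the grid are dropped; vacated cells become 0."""
    rows, cols = len(image), len(image[0])
    copies = []
    for dr, dc in shifts:
        out = [[0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    out[rr][cc] = image[r][c]
        copies.append(out)
    return copies

# A 2x2 "image" with one on-pixel, shifted down by one row:
img = [[0, 1],
       [0, 0]]
augmented = shifted_copies(img)
```

Each replica keeps the original label, so the network sees the same character at several positions and is pushed toward translation invariance.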
Invariances - III
Synthetic warping of a handwritten digit. The top-right digits show the warped versions of the input digit (left), generated using random displacements smoothed by Gaussians of width 0.01, 30, and 60. The corresponding displacement fields are shown in the bottom-right row.
Bayesian Neural Networks
Laplace approximation for a Bayesian neural network with 8 hidden units and a single output unit
Conclusion
What have we learned about Neural Networks?
What is a Neural Network – Definition, Examples
Strengths and weaknesses of Neural Networks
Basic building blocks – Perceptrons, Sigmoids
Perceptron Training Rules – Delta Rules, Gradient Descent
Multilayer Networks
BACKPROPAGATION Algorithm – description, example
Regularization
Invariances
Bayesian Neural Networks