TRANSCRIPT
Human Brain
• The brain is a highly complex, non-linear, and parallel computer, composed of some 10^11 neurons that are densely connected (~10^4 connections per neuron). We have just begun to understand how the brain works...
• A neuron is much slower (10^-3 sec) than a silicon logic gate (10^-9 sec); however, the massive interconnection between neurons makes up for the comparatively slow rate.
Human Brain
• Plasticity: Some of the neural structure of the brain is present at birth, while other parts are developed through learning, especially in early stages of life, to adapt to the environment (new inputs).
Biological Neuron
– dendrites: nerve fibres carrying electrical signals to the cell
– cell body: computes a non-linear function of its inputs
– axon: single long fiber that carries the electrical signal from the cell body to other neurons
– synapse: the point of contact between the axon of one cell and the dendrite of another, regulating a chemical connection whose strength affects the input to the cell.
Inspiration from Neurobiology
• A neuron: many-inputs / one-output unit
• output can be excited or not excited
• incoming signals from other neurons determine if the neuron shall excite ("fire")
• Output subject to attenuation in the synapses, which are junction parts of the neuron
[Figures: a neuron on a microchip; a biological neuron; a photomicrograph of neurons; a biological neuro-signal]
Neural networks
• Neural network: information processing paradigm inspired by biological nervous systems, such as our brain
• Structure: large number of highly interconnected processing elements (neurons) working together
• Like people, they learn from experience (by example)
Neural networks
• Neural networks are configured for a specific application, such as pattern recognition or data classification, through a learning process
• In a biological system, learning involves adjustments to the synaptic connections between neurons
• The same holds for artificial neural networks (ANNs)
Where can neural network systems help
• when we can't formulate an algorithmic solution.
• when we can get lots of examples of the behavior we require.
‘learning from experience’
• when we need to pick out the structure from existing data.
Complicated Example: Categorising Vehicles
• Input to the function: pixel data from vehicle images
• Output: a number: 1 for a car; 2 for a bus; 3 for a tank
[Figure: four example vehicle images with outputs 3, 2, 1, 1]
Artificial Neuron – Feed Forward
[Figure: neuron i with inputs x1, ..., xj, ..., xn and weights wi1, ..., wij, ..., win]
(1) Summation: Ii = Σj wij xj
(2) Transfer: yi = f(Ii)
Transfer Functions
Artificial Neuron – Error Backward
[Figure: neuron i as above; the output error E is propagated backward through yi to adjust the weights wij]
Perceptron
[Figure: input layer with units X1, X2, X3 feeding a single output unit Y in the output layer]
Perceptron (cont.)
• The Perceptron was introduced by Rosenblatt in 1957
• He introduced the idea of training
• A Perceptron is a linear threshold gate
Given a classification problem, try to find a perceptron to fit it:
Find a vector of weights wi and a threshold θ such that:
output = 1 if Σ wi xi ≥ θ
output = 0 otherwise
Perceptron – Feed Forward
[Figure: neuron i with inputs x1, ..., xj, ..., xn and weights wi1, ..., wij, ..., win]
(1) Summation: Ii = Σj wij xj
(2) Transfer: yi = f(Ii)
Perceptron – Error Backward
[Figure: neuron i as above; the output error E is propagated backward through yi to adjust the weights wij]
Perceptron : Weight Adjustment
The Perceptron learning rule:
If the perceptron gives the correct answer, do nothing
If the perceptron gives the wrong answer, adjust the weights and threshold “in the right direction”, so that it eventually gives the right answer.
Perceptron : Training Algorithm
1- Initial weights: w = (w0, w1, w2, ..., wn), chosen arbitrarily; each input is extended to (1, x1, x2, ..., xn) so that w0 plays the role of the threshold
2- While there is a sample (x1, x2, ..., xn) that is not correctly classified:
   Update the weights: wi ← wi + η [ d(x1, ..., xn) - a(x1, ..., xn) ] xi
   where a(x1, ..., xn) is the perceptron's actual output and d(x1, ..., xn) is the desired output
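The training loop can be sketched as follows. The tiny AND-style dataset, the learning rate, and the epoch limit are illustrative assumptions; the first component of each input is the clamped x0 = 1 so that the threshold is learned as w0:

```python
# Perceptron training rule: w_i <- w_i + eta * (d - a) * x_i
def train_perceptron(samples, eta=0.1, epochs=100):
    n = len(samples[0][0])
    w = [0.0] * n  # initial weights (here simply zero)
    for _ in range(epochs):
        misclassified = False
        for x, d in samples:
            # Actual output a: linear threshold at 0 (threshold folded into w0).
            a = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            if a != d:
                misclassified = True
                w = [wi + eta * (d - a) * xi for wi, xi in zip(w, x)]
        if not misclassified:
            break  # every sample classified correctly
    return w

# AND function; leading 1 in each input is the fixed bias input x0.
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = train_perceptron(data)
print(w)
```

Because AND is linearly separable, the loop terminates with weights that classify all four samples correctly.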
Perceptron : Error value
d(x1, ..., xn) - a(x1, ..., xn) =
   0  when the correct answer is given
   1  when w1x1 + w2x2 + ... + wnxn < θ but (x1, ..., xn) is in the set
  -1  when w1x1 + w2x2 + ... + wnxn ≥ θ but (x1, ..., xn) is not in the set
Perceptron : Learning Rate
• η is the rate at which the training rule converges toward the correct solution.
• Typically η <= 1
• Too small an η produces slow convergence.
• Too large an η can cause oscillations in the process.
Multi-layer Perceptron
Output layer
Hidden layer
Input layer
[Figure: successive layers labelled Layer S-1, Layer S, Layer S+1]
Function Learning
• Map categorisation learning to a numerical problem
  – Each category is given a number
  – Or a range of real-valued numbers (e.g., 0.5 - 0.9)
• Function learning examples
  – Input = 1, 2, 3, 4; Output = 1, 4, 9, 16
  – Here the concept to learn is squaring integers
  – Input = [1,2,3], [2,3,4], [3,4,5], [4,5,6]; Output = 1, 5, 11, 19
  – Here the concept is: [a,b,c] -> a*c - b
• The calculation is more complicated than in the first example
• Neural networks:
  – The calculation is much more complicated in general
  – But it is still just a numerical calculation
Example Perceptron
• Categorisation of 2x2 pixel black & white images
– Into “bright” and “dark”
• Representation of this rule:
– If it contains 2, 3 or 4 white pixels, it is “bright”
– If it contains 0 or 1 white pixels, it is “dark”
• Perceptron architecture:
– Four input units, one for each pixel
– Each input unit takes the value +1 for a white pixel and -1 for a black pixel
Example Perceptron
• Example calculation: x1 = -1, x2 = 1, x3 = 1, x4 = -1
  – S = 0.25*(-1) + 0.25*(1) + 0.25*(1) + 0.25*(-1) = 0
• 0 > -0.1, so the output from the ANN is +1
  – So the image is categorised as "bright"
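This example calculation can be reproduced directly; the weights of 0.25 and the threshold of -0.1 are the values used on the slide:

```python
# Bright/dark perceptron: four pixel inputs (+1 white, -1 black),
# all weights 0.25, threshold -0.1 (values from the slide).
def classify(pixels, weights=(0.25, 0.25, 0.25, 0.25), threshold=-0.1):
    s = sum(w * x for w, x in zip(weights, pixels))
    return +1 if s > threshold else -1  # +1 = "bright", -1 = "dark"

print(classify([-1, 1, 1, -1]))  # S = 0 > -0.1, so +1 ("bright")
```

An all-black image gives S = -1 < -0.1 and is classified as "dark", matching the stated rule (two or more white pixels means "bright").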
Worked Example
• Return to the "bright" and "dark" example
• Use a learning rate of η = 0.1
• Suppose we have set random weights: w0 = -0.5, w1 = 0.7, w2 = -0.2, w3 = 0.1, w4 = 0.9
Worked Example
• Use this training example, E, to update weights:
• Here, x1 = -1, x2 = 1, x3 = 1, x4 = -1 as before
• Propagate this information through the network:
– S = (-0.5 * 1) + (0.7 * -1) + (-0.2 * +1) + (0.1 * +1) + (0.9 * -1) = -2.2
• Hence the network outputs o(E) = -1
• But this should have been “bright”=+1
– So t(E) = +1
Calculating the Error Values
• Δ0 = η(t(E)-o(E))x0
= 0.1 * (1 - (-1)) * (1) = 0.1 * (2) = 0.2
• Δ1 = η(t(E)-o(E))x1
= 0.1 * (1 - (-1)) * (-1) = 0.1 * (-2) = -0.2
• Δ2 = η(t(E)-o(E))x2
= 0.1 * (1 - (-1)) * (1) = 0.1 * (2) = 0.2
• Δ3 = η(t(E)-o(E))x3
= 0.1 * (1 - (-1)) * (1) = 0.1 * (2) = 0.2
• Δ4 = η(t(E)-o(E))x4
= 0.1 * (1 - (-1)) * (-1) = 0.1 * (-2) = -0.2
Calculating the New Weights
• w’0 = -0.5 + Δ0 = -0.5 + 0.2 = -0.3
• w’1 = 0.7 + Δ1 = 0.7 + -0.2 = 0.5
• w’2 = -0.2 + Δ2 = -0.2 + 0.2 = 0
• w’3= 0.1 + Δ3 = 0.1 + 0.2 = 0.3
• w’4 = 0.9 + Δ4 = 0.9 - 0.2 = 0.7
New Look Perceptron
• Calculate for the example, E, again:
  – S = (-0.3 * 1) + (0.5 * -1) + (0 * +1) + (0.3 * +1) + (0.7 * -1) = -1.2
• Still gets the wrong categorisation
  – But the value is closer to zero (from -2.2 to -1.2)
  – In a few epochs' time, this example will be correctly categorised
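The arithmetic of this worked example can be checked in a few lines; the weights, inputs, target and learning rate are those from the slides:

```python
# Worked example: one update with the perceptron rule
# delta_i = eta * (t - o) * x_i, then recompute the weighted sum.
eta = 0.1
w = [-0.5, 0.7, -0.2, 0.1, 0.9]   # w0..w4 from the slide
x = [1, -1, 1, 1, -1]             # x0 = 1 (bias input), then x1..x4

s = sum(wi * xi for wi, xi in zip(w, x))      # -2.2
o, t = (1 if s > 0 else -1), 1                # network outputs -1, target is +1

w_new = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
s_new = sum(wi * xi for wi, xi in zip(w_new, x))   # -1.2: closer to correct
print(w_new, s_new)
```

The new weights come out as (-0.3, 0.5, 0, 0.3, 0.7), in agreement with the hand calculation above.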
Boolean Functions
• Take in two inputs (-1 or +1)
• Produce one output (-1 or +1)
• In other contexts, 0 and 1 are used instead
• Example: AND function
  – Produces +1 only if both inputs are +1
• Example: OR function
  – Produces +1 if either input is +1
• Related to the logical connectives from F.O.L.
Boolean Functions as Perceptrons
• Problem: XOR boolean function
– Produces +1 only if inputs are different
– Cannot be represented as a perceptron
– Because it is not linearly separable
Linearly Separable Boolean Functions
• Linearly separable:
  – Can use a line (dotted) to separate +1 and -1
• Think of the line as representing the threshold
  – Angle of the line is determined by the two weights in the perceptron
  – Y-axis crossing is determined by the threshold
Linearly Separable Functions
• Result extends to functions taking many inputs
– And outputting +1 and –1
• Also extends to higher dimensions for outputs
Typical Activation Functions
• F(x) = 1 / (1 + e^(-k Σ wixi))
• Shown for k = 0.5, 1 and 10
• Using a nonlinear function which approximates a linear threshold allows a network to approximate nonlinear functions
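The activation above (called `sigmoid` in the sketch below) can be evaluated for the three slopes shown on the slide:

```python
import math

# Sigmoid activation F(s) = 1 / (1 + e^(-k * s)), where s = sum(w_i * x_i).
def sigmoid(s, k=1.0):
    return 1.0 / (1.0 + math.exp(-k * s))

for k in (0.5, 1, 10):
    # Larger k makes the sigmoid approximate a hard threshold at s = 0.
    print(k, [round(sigmoid(s, k), 3) for s in (-2, 0, 2)])
```

At k = 10 the output is already very close to 0 or 1 away from the origin, illustrating how the smooth function approximates the linear threshold.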
Learning performance
• Network architecture
• Learning method:
– Unsupervised
– Reinforcement learning
– Backpropagation
Unsupervised learning
• No help from the outside
• No training data, no information available on the desired output
• Learning by doing
• Used to pick out structure in the input:
• Clustering
• Reduction of dimensionality (compression)
• Example: Kohonen’s Learning Law
Competitive learning: example
• Example: Kohonen network
• Winner takes all: only the weights of the winning neuron are updated
• Network topology
• Training patterns
• Activation rule
• Neighborhood
• Learning
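A minimal winner-takes-all update can be sketched as follows; the two-neuron layout, the sample input, and the learning rate are illustrative assumptions, and the neighborhood function of a full Kohonen network is omitted:

```python
# Competitive (winner-takes-all) learning: only the neuron whose weight
# vector is closest to the input is updated, moving it toward the input.
def winner(weights, x):
    # Index of the weight vector closest to x (squared Euclidean distance).
    return min(range(len(weights)),
               key=lambda i: sum((wi - xi) ** 2 for wi, xi in zip(weights[i], x)))

def update(weights, x, eta=0.5):
    i = winner(weights, x)
    # Move the winner's weights a fraction eta of the way toward the input.
    weights[i] = [wi + eta * (xi - wi) for wi, xi in zip(weights[i], x)]
    return i

W = [[0.0, 0.0], [1.0, 1.0]]     # two neurons with 2-D weight vectors
update(W, [0.9, 0.8])            # neuron 1 wins and moves toward the input
print(W)
```

Repeated over many inputs, weight vectors drift toward the centres of input clusters, which is how the network "picks out structure" without any desired outputs.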
Reinforcement learning
• Teacher: training data
• The teacher scores the performance of the training examples
• Use performance score to shuffle weights ‘randomly’
• Relatively slow learning due to ‘randomness’
Back propagation
• Desired output of the training examples
• Error = difference between actual & desired output
• Change weight relative to error size
• Calculate the output layer error, then propagate it back to the previous layer
• Improved performance, very common!
Applications
• Prediction: learning from past experience
– pick the best stocks in the market
– predict weather
– identify people with cancer risk
• Classification
– Image processing
– Predict bankruptcy for credit card companies
– Risk assessment
Applications
• Recognition
– Pattern recognition: SNOOPE (bomb detector in U.S. airports)
– Character recognition
– Handwriting: processing checks
• Data association
– Not only identify the characters that were scanned but identify when the scanner is not working properly
Applications
• Data Conceptualization
– infer grouping relationships, e.g. extract from a database the names of those most likely to buy a particular product
• Data Filtering
– e.g. take the noise out of a telephone signal, signal smoothing
• Planning
– Unknown environments
– Sensor data is noisy
– Fairly new approach to planning
Strengths of a Neural Network
• Power: Model complex functions, nonlinearity built into the network
• Ease of use:
  – Learn by example
  – Very little user domain-specific expertise needed
• Intuitively appealing: based on model of biology, will it lead to genuinely intelligent computers/robots?
Neural networks cannot do anything that cannot be done using traditional computing techniques, BUT they can do some things which would otherwise be very difficult.
General Advantages
• Advantages
  – Adapt to unknown situations
  – Robustness: fault tolerance due to network redundancy
  – Autonomous learning and generalization
• Disadvantages
  – Not exact
  – Large complexity of the network structure
• For motion planning?
Applications
• Aerospace
  – High performance aircraft autopilots, flight path simulations, aircraft control systems, autopilot enhancements, aircraft component simulations, aircraft component fault detectors
• Automotive
  – Automobile automatic guidance systems, warranty activity analyzers
• Banking
  – Check and other document readers, credit application evaluators
• Defense
  – Weapon steering, target tracking, object discrimination, facial recognition, new kinds of sensors, sonar, radar and image signal processing including data compression, feature extraction and noise suppression, signal/image identification
• Electronics
  – Code sequence prediction, integrated circuit chip layout, process control, chip failure analysis, machine vision, voice synthesis, nonlinear modeling
Applications
• Financial
– Real estate appraisal, loan advisor, mortgage screening, corporate bond rating, credit line use analysis, portfolio trading program, corporate financial analysis, currency price prediction
• Manufacturing
– Manufacturing process control, product design and analysis, process and machine diagnosis, real-time particle identification, visual quality inspection systems, beer testing, welding quality analysis, paper quality prediction, computer chip quality analysis, analysis of grinding operations, chemical product design analysis, machine maintenance analysis, project bidding, planning and management, dynamic modeling of chemical process systems
• Medical
– Breast cancer cell analysis, EEG and ECG analysis, prosthesis design, optimization of transplant times, hospital expense reduction, hospital quality improvement, emergency room test advisement
Applications
• Robotics
  – Trajectory control, forklift robot, manipulator controllers, vision systems
• Speech
  – Speech recognition, speech compression, vowel classification, text-to-speech synthesis
• Securities
  – Market analysis, automatic bond rating, stock trading advisory systems
• Telecommunications
  – Image and data compression, automated information services, real-time translation of spoken language, customer payment processing systems
• Transportation
  – Truck brake diagnosis systems, vehicle scheduling, routing systems
Properties of ANNs
• Learning from examples
– labeled or unlabeled
• Adaptivity
– changing the connection strengths to learn things
• Non-linearity
– the non-linear activation functions are essential
• Fault tolerance
– if one of the neurons or connections is damaged, the whole network still works quite well
Artificial Neuron Model
[Figure: neuron i with inputs x0 = +1, x1, x2, x3, ..., xm, synaptic weights wi1, ..., wim, bias bi, activation function f, and output ai]
Bias
• ai = f(ni) = f( Σ(j=1..n) wij xj + bi )
• An artificial neuron:
  – computes the weighted sum of its inputs, and
  – if that value exceeds its "bias" (threshold),
  – it "fires" (i.e. becomes active)
Bias
• Bias can be incorporated as another weight clamped to a fixed input of +1.0
• This extra free variable (bias) makes the neuron more powerful.
• ai = f(ni) = f( Σ(j=0..n) wij xj ), with x0 = +1 and wi0 = bi
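Folding the bias into the weights as described gives exactly the same activation; a quick numerical check, with illustrative weights:

```python
# Bias folded into the weights: clamp x0 = +1 and set w_i0 = b_i, so that
# f(sum_{j=1..n} w_ij x_j + b_i) == f(sum_{j=0..n} w_ij x_j).
def net_with_bias(w, x, b):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def net_bias_as_weight(w, x, b):
    # Prepend the bias as weight 0 and a clamped input of +1.
    return sum(wj * xj for wj, xj in zip([b] + w, [1.0] + x))

w, x, b = [0.4, -0.3], [1.0, 2.0], 0.25
print(net_with_bias(w, x, b), net_bias_as_weight(w, x, b))  # identical sums
```

Since the two net inputs are equal, applying any activation f to them gives the same output; this is why the bias can be treated as just another trainable weight.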
Other Activation Functions
Different Network Topologies
• Single layer feed-forward networks
– Input layer projecting into the output layer
[Figure: single layer network, with the input layer projecting directly into the output layer]
Different Network Topologies
• Multi-layer feed-forward networks
– One or more hidden layers; each layer receives input only from the previous layer.
[Figure: 2-layer (1-hidden-layer) fully connected network: input layer, hidden layer, output layer]
Different Network Topologies
• Recurrent networks
– A network with feedback, where some of its inputs are connected to some of its outputs (discrete time).
[Figure: recurrent network with input and output layers, where some outputs are fed back to the inputs]
How to Decide on a Network Topology?
– # of input nodes?
• Number of features
– # of output nodes?
• Suitable to encode the output representation
– transfer function?
• Suitable to the problem
– # of hidden nodes?
• Not exactly known
Examples:
[Figure: network with input units x1 ... x6, hidden units h1 ... h3, and output unit y1; weights wji connect input unit i to hidden unit j, and weights wkj connect hidden unit j to output unit k. Autoassociation / Heteroassociation.]
• hj = g( Σi wji xi ),  y1 = g( Σj wkj hj ),  where g(x) = 1 / (1 + e^(-x))
• g is the sigmoid: it rises from 0 to 1, with g(0) = 1/2
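The forward computation hj = g(Σ wji xi), y1 = g(Σ wkj hj) can be sketched directly; the specific weight values and the 6-3-1 shape below are illustrative assumptions matching the figure's layout:

```python
import math

# Forward pass of a 6-3-1 network with sigmoid g(x) = 1 / (1 + e^(-x)).
def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(weights, inputs):
    # weights: one weight vector per unit in this layer.
    return [g(sum(w * v for w, v in zip(row, inputs))) for row in weights]

W_ji = [[0.1] * 6, [-0.2] * 6, [0.3] * 6]  # hidden layer: 3 units, 6 inputs each
W_kj = [[0.5, -0.5, 0.5]]                  # output layer: 1 unit, 3 hidden inputs

x = [1, 0, 1, 0, 1, 0]
h = layer(W_ji, x)        # hidden activations h1, h2, h3
y = layer(W_kj, h)        # output y1, always in (0, 1)
print(h, y)
```

Because g squashes every net input into (0, 1), the output y1 can be read against the convention on the next slide (near 1 for positive examples, near 0 for negative ones).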
How is a function computed by a Multilayer Neural Network?
Typically, y1 = 1 for a positive example and y1 = 0 for a negative example
Learning in Multilayer Neural Networks
• Learning consists of searching through the space of all possible matrices of weight values for a combination of weights that satisfies a database of positive and negative examples (multi-class as well as regression problems are possible).
• Note that a Neural Network model with a set of adjustable weights defines a restricted hypothesis space corresponding to a family of functions. The size of this hypothesis space can be increased or decreased by increasing or decreasing the number of hidden units present in the network.
The Perceptron Training Rule
One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example. This process is repeated, iterating through the training examples as many times as needed until the perceptron classifies all training examples correctly. Weights are modified at each step according to the perceptron training rule, which revises the weight wi associated with input xi as wi ← wi + η(t - o)xi, where t is the target output, o is the perceptron's output, and η is the learning rate.
Gradient Descent and Delta Rule
The delta training rule is best understood by considering the task of training an unthresholded perceptron; that is, a linear unit for which the output o is given by o = Σi wixi.
In order to derive a weight learning rule for linear units, let us begin by specifying a measure for the training error of a hypothesis (weight vector), relative to the training examples: E(w) = 1/2 Σd (td - od)², where d ranges over the training examples, td is the target output, and od is the linear unit's output for example d.
BACKPROPAGATION Algorithm
EECP0720 Expert Systems – Artificial Neural Networks
Error Function
The Backpropagation algorithm learns the weights for a multilayer network, given a network with a fixed set of units and interconnections. It employs gradient descent to attempt to minimize the squared error between the network output values and the target values for those outputs. We begin by redefining E to sum the errors over all of the network output units:
E(w) = 1/2 Σd Σ(k ∈ outputs) (tkd - okd)²
where outputs is the set of output units in the network, and tkd and okd are the target and output values associated with the kth output unit and training example d.
Architecture of Backpropagation
Backpropagation Learning Algorithm
Backpropagation Learning Algorithm (cont.)
Backpropagation Learning Algorithm (cont.)
Backpropagation Learning Algorithm (cont.)
Backpropagation Learning Algorithm (cont.)
Output
• The response function is normally nonlinear
• Samples include:
  – Sigmoid: f(x) = 1 / (1 + e^(-x))
  – Piecewise linear: f(x) = x if x ≥ θ, f(x) = 0 otherwise
Backpropagation Preparation
• Training Set: a collection of input-output patterns that are used to train the network
• Testing Set: a collection of input-output patterns that are used to assess network performance
• Learning Rate η: a scalar parameter, analogous to step size in numerical integration, used to set the rate of adjustments
Network Error
• Total-Sum-Squared-Error (TSSE):
  TSSE = 1/2 Σ(patterns) Σ(outputs) (desired - actual)²
• Root-Mean-Squared-Error (RMSE):
  RMSE = sqrt( 2 · TSSE / (#patterns · #outputs) )
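Taking TSSE = 1/2 Σ(desired - actual)² over all patterns and outputs, and RMSE = sqrt(2·TSSE / (#patterns · #outputs)), a sketch with illustrative data:

```python
import math

# TSSE = 1/2 * sum over patterns and outputs of (desired - actual)^2
def tsse(desired, actual):
    return 0.5 * sum((d - a) ** 2
                     for dp, ap in zip(desired, actual)
                     for d, a in zip(dp, ap))

# RMSE = sqrt(2 * TSSE / (#patterns * #outputs))
def rmse(desired, actual):
    n_patterns, n_outputs = len(desired), len(desired[0])
    return math.sqrt(2 * tsse(desired, actual) / (n_patterns * n_outputs))

desired = [[1.0, 0.0], [0.0, 1.0]]   # 2 patterns, 2 outputs each
actual  = [[0.8, 0.1], [0.2, 0.9]]
print(tsse(desired, actual), rmse(desired, actual))
```

The factor of 2 inside the RMSE cancels the 1/2 in TSSE, so RMSE is the square root of the mean squared error per output value.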
Face Detection using Neural Networks
[Figure: a neural network is trained on images from a face database and a non-face database]
• Training process: output = 1 for the face database, output = 0 for the non-face database
• Testing process: the trained network classifies a test image as "Face or Non-Face?"
Backpropagation Using Gradient Descent
• Advantages
  – Relatively simple implementation
  – Standard method and generally works well
• Disadvantages
  – Slow and inefficient
  – Can get stuck in local minima, resulting in sub-optimal solutions
Local Minima
[Figure: error surface showing a local minimum and the global minimum]
Other Ways To Minimize Error
• Varying training data
  – Cycle through input classes
  – Randomly select from input classes
• Add noise to training data
  – Randomly change the value of an input node (with low probability)
• Retrain with expected inputs after initial training
  – E.g. speech recognition
Other Ways To Minimize Error
• Adding and removing neurons from layers
  – Adding neurons speeds up learning but may cause loss in generalization
  – Removing neurons has the opposite effect