ch6 ann and ga
-
7/31/2019 ch6 ann and ga
1/104
1
Neural Networks: Definition
Neural computing is the study of networks of adaptable nodes
which, through a process of learning from task examples, store
experiential knowledge and make it available for use.
-
What Are Neural Networks?
A computing model, inspired by the mammalian neural system,
composed of many simple, highly interconnected processing
units.
Neural network models are algorithms for cognitive tasks, such as
learning and optimization, which are in a loose sense based on
concepts derived from research into the nature of the brain.
-
What Are Neural Networks?
A neural network model is a directed graph with the following
properties:
A state variable $n_i$ is associated with each node $i$.
A real-valued weight $w_{ij}$ is associated with each link from node $i$
to node $j$.
A real-valued bias $\theta_i$ is associated with each node $i$.
A transfer function $f_i(n_j, w_{ij}, \theta_i)$ is defined, for each node $i$,
which determines the state of node $i$.
-
What Can ANN Do?
Biological
Modeling the retina
Modeling brain disorders (ADD)
Business
Evaluate probability of oil in geological formation
Identify and filter promotion and job applicants
Mine corporate databases for business rules
Financial
Assessing credit risk
Identify forgeries
Interpret handwritten forms
Predict portfolio and stock values
-
What Can ANN Do?
Manufacturing
Automated robot control systems
Control material flow
Optimize production lines
Quality inspection
Medical
Analyze speech in hearing aids
Diagnose and prescribe treatment by symptoms
Monitor surgery and recovery
Read X-rays and CAT/PET scans
-
What Can ANN Do?
Military
Classify radar and sonar signals
Target acquisition and tracking
Analyze intelligence inputs
Optimizing scarce resources
Signal processing
Adaptive Noise Canceling
Zip Code Reader
Speech Recognition
-
A Brief History
First concepts
Turing 1936
McCulloch & Pitts 1943
Hebb 1949
Early steps 1950s - 1960s
The perceptron
ADALINE and MADALINE
Excessive hype
-
A Brief History
Stunted growth 1969-1981
Perceptrons by Minsky and Papert
Continued work
Renewed interest
The Hopfield model 1982
Backpropagation rediscovered 1985 (first 1974 by
Werbos)
Radial Basis Functions - Broomhead & Lowe 1988
-
A Quick Word About The Brain
-
The Biological Neuron
[Figure: a biological neuron, labelled with the cell body, synapse, dendrites and axon]
-
Computers And The Brain
We do not understand the brain
The ANN model is only loosely based on the brain
The ANN model is metaphoric to the brain
-
Computers vs. Neural Networks
Von Neumann machines | Neural networks
Few strong processors | $\sim 10^{11}$ simple neurons
Serial processing | Parallel processing
Central control | No central control
$10^{-9}$ sec. cycle | $10^{-3}$ sec. cycle
Bit data | Voltage data
Not fault tolerant | Very robust
Fast numeric operations | Slow numeric operations
Slow high-level operations | Fast high-level operations
Learning? | Learning!
-
Building Blocks Of The Model
The processing element
The connections
Learning methods
-
Processing Element Building Block
The basic building block of a neural network is the
processing element (or node or unit).
A generalised node comprises the following elements:
inputs(+bias)
weights
transfer function
combining function
activation function
output(s)
-
The function of a single node
The job of a processing element is to receive a number of
inputs (either from the external world or from other nodes
or from itself) and to distribute a single output (either to
the external world or to other nodes).
-
Some Input Functions
Weighted Summation
$net = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + bias$
where $w_i$ is the weight associated with the connection
between input $x_i$ and the processing element
-
Some Input Functions
Multiplication (or Product)
$net = w_1 x_1 \cdot w_2 x_2 \cdot \dots \cdot w_n x_n$
similar to the weighted summation but the summation is
replaced by the product
Maximum, Minimum, Majority
$net = \max_i (w_i x_i)$
$net = \min_i (w_i x_i)$
$net = 1$ if $\sum_i w_i x_i > 0$, else $-1$
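The combining functions listed so far can be sketched in a few lines of Python. The function names are illustrative only, and reading the majority rule as the sign of the summed weighted inputs is an assumption.

```python
from math import prod

def weighted_sum(w, x, bias=0.0):
    """net = w1*x1 + ... + wn*xn + bias"""
    return sum(wi * xi for wi, xi in zip(w, x)) + bias

def product_input(w, x):
    """net = (w1*x1) * (w2*x2) * ... * (wn*xn)"""
    return prod(wi * xi for wi, xi in zip(w, x))

def maximum_input(w, x):
    """net = largest weighted input"""
    return max(wi * xi for wi, xi in zip(w, x))

def minimum_input(w, x):
    """net = smallest weighted input"""
    return min(wi * xi for wi, xi in zip(w, x))

def majority_input(w, x):
    """Assumed reading: +1 if the weighted inputs sum to more than 0, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
```

With unit weights and inputs restricted to -1 and +1, `majority_input` returns +1 exactly when more inputs are +1 than -1.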
-
Some Activation Functions
Sigmoid
maps an input into a value between zero and one
Linear
where no transformation is applied to the outcome of
the combining function
Hyperbolic tangent
similar to the sigmoid but the mapping is between -1 and
1
Step
where the transfer value equals 1 if the outcome of the
combining function is greater than some threshold,
otherwise it equals 0
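As a minimal sketch, the four activation functions can be written as follows. The names and the default threshold of 0 are assumptions made for illustration.

```python
import math

def sigmoid(net):
    """Maps any input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def linear(net):
    """Leaves the outcome of the combining function unchanged."""
    return net

def tanh_act(net):
    """Like the sigmoid, but maps into the open interval (-1, 1)."""
    return math.tanh(net)

def step(net, threshold=0.0):
    """1 if the combined input exceeds the threshold, otherwise 0."""
    return 1 if net > threshold else 0
```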
-
Some Activation Functions
-
Closer Look At Transfer Functions
Unipolar
Sigmoid
Threshold (step)
Bipolar
Sigmoid
Sign
-
The Connections
During learning, the connection weights are the only part of a
neural network that changes
Connections may be either inhibitory or excitatory
Connection strengths are expressed by weights
-
The role of the weights
Each input or node is connected to a processing element
Graphically this is represented by an arc
Each arc has a weight. The weight simply determines the
influence (or strength) of an input to a processing element
Neuro-computing is concerned with identification of the
correct set of weights
-
An example of a single node
Assume a processing element receives 3 inputs: 1, 0.5, 0.3
If the combining function is the weighted summation and the
weights are: -0.2, 0.04, 2.35
then the result of the combining function is
(1)(-0.2) + (0.5)(0.04) + (0.3)(2.35) = -0.2 + 0.02 + 0.705 = 0.525
-
An example of a single node
If the activation function is
linear, f(x) = x, then the output is 0.525
-
An example of a single node
If the activation function is
sigmoid, f(x) = 1 / (1 + exp(-x)), then the output is 1 / (1 + exp(-0.525)) = 0.628
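The single-node example can be checked numerically. The sketch below evaluates the weighted sum term by term, (1)(-0.2) + (0.5)(0.04) + (0.3)(2.35), and then applies the linear and sigmoid activations; the variable names are illustrative.

```python
import math

inputs = [1.0, 0.5, 0.3]
weights = [-0.2, 0.04, 2.35]

# Combining function: weighted summation (no bias term in this example).
net = sum(w * x for w, x in zip(weights, inputs))

# Activation functions applied to the same net input.
linear_out = net
sigmoid_out = 1.0 / (1.0 + math.exp(-net))

print(round(net, 3), round(sigmoid_out, 3))  # 0.525 0.628
```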
-
Neural Networks Layers
A NN can be constructed using a number of processing
elements
Rather than a chaotic construction it is generally preferable
to build neural networks using layers
A neural network will have an input layer, an output layer and
in between zero, one or more hidden layers
-
Neural Network Layers 2
Depending on where a processing element is placed, it is
categorised as an input, hidden or output processing
element
Typically, but not necessarily, each processing element in
a layer has the same transfer function
A NN with a 4-3-2 configuration is a 2- or 3-layer NN
(depending on whether the input layer is counted) with 4 input nodes, 3
hidden nodes and 2 output nodes
-
The Role of the Input Layer
An input processing element receives input from the external
world and simply sends the actual input to the processing
elements of the next layer
-
The Role of the Hidden Layer
A hidden processing element receives its input from the
nodes of the previous layer and the transformation of the
input is sent to the next layer
A hidden layer may be seen as a pre-processor
-
The Role of the Output Layer
An output processing element delivers to the external world
the representation of the original input after the
transformations have taken place
-
Connectivity Matters
A number of different networks can be constructed; they differ in
terms of the connectivity pattern and the number of layers
Networks with no hidden layers are called single-layer networks
Networks with one or more hidden layers are called multi-layer networks
If all connections lead from input to output then the network is called
a feed-forward network
If there are connections in the opposite direction then it is
called a feedback or recurrent network
-
Artificial Neural Networks Models
Single-layer feedforward
Multi-layer feedforward
Recurrent (feedback)
-
Calculations of a multi-layer feed-forward
neural network
[Figure: worked calculation on a small feed-forward network with nodes x1 to x5; the weights shown include +1, 1.5, -1 and 0.5]
-
Learning Laws
As we saw on the previous slide, the output with the current
weights is wrong if we want to perform AND.
This brings us to the problem of finding the correct set of
weights
The process of identifying the correct set of weights is called
the learning process and it is characterised by a learning
law
-
Learning Laws 2
The purpose of a learning law is to locate the set of weights
which will give correct answers for all the inputs
The learning is achieved by employing an algorithm which
iteratively changes the weights of the connections in
response to every set of inputs until the correct weights
have been located
-
Learning Laws 3
Most learning laws are based on Hebb's rule, which states
that
if two units are simultaneously active, increase the
strength of the connection between them
This rule is the basis for most learning laws used today
(Kohonen learning, Boltzmann learning, Delta rule)
-
Some Learning Rules
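Hebbian learning rule
$\Delta w_{ij} = c \, f(w_i^t x) \, x_j$
Perceptron learning rule
$\Delta w_{ij} = c \, [d_i - \mathrm{sgn}(w_i^t x)] \, x_j$
Delta learning rule
$\Delta w_{ij} = c \, (d_i - o_i) \, f'(net_i) \, x_j$
Widrow-Hoff learning rule
$\Delta w_{ij} = c \, (d_i - w_i^t x) \, x_j$
where $c$ is the learning constant, $x$ the input vector, $w_i$ the weight vector of node $i$, $d_i$ its desired output and $o_i = f(net_i)$ its actual output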
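The four learning rules differ only in the error-like factor that multiplies the input. A rough sketch of the weight changes for a single node follows; the function names are illustrative, c is the learning constant, d the desired output, and f and f_prime are an activation function and its derivative supplied by the caller.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgn(v):
    return 1 if v > 0 else -1

def hebbian_delta(w, x, c, f):
    """Change proportional to the node's own output f(w.x)."""
    return [c * f(dot(w, x)) * xj for xj in x]

def perceptron_delta(w, x, d, c):
    """Change proportional to d - sgn(w.x); zero when the response is correct."""
    return [c * (d - sgn(dot(w, x))) * xj for xj in x]

def delta_rule_delta(w, x, d, c, f, f_prime):
    """Change proportional to (d - o) * f'(net), with o = f(net)."""
    net = dot(w, x)
    return [c * (d - f(net)) * f_prime(net) * xj for xj in x]

def widrow_hoff_delta(w, x, d, c):
    """Change proportional to d - w.x (the delta rule with a linear activation)."""
    return [c * (d - dot(w, x)) * xj for xj in x]
```

Each function returns the list of changes to add to the weight vector; with the identity activation the delta rule and the Widrow-Hoff rule give the same result.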
-
Learning Methods
Supervised approach
a neural network is given a set of inputs and also the
correct output
-
Learning Methods 2
Unsupervised approach
a neural network is given a set of inputs and no outputs.
The network attempts to generate its own classes
-
Learning Methods 3
Reinforcement approach
a neural network is given a set of inputs and no outputs.
The network generates an output and only then is it
told whether the produced output was correct or not
Learn by doing
-
Single-Layer Perceptrons
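Network architecture
[Figure: a single node with inputs x1, x2, x3, weights w1, w2, w3 and bias weight w0]
$y = \mathrm{signum}(net)$ or $y = \mathrm{step}(net)$
$net = \sum_{i=1}^{n} x_i w_i - \theta$
$= \sum_{i=1}^{n} x_i w_i + w_0$, where $w_0 = -\theta$
$= \sum_{i=0}^{n} x_i w_i$, where the sum now starts at $i = 0$ and $x_0 = 1$
signum(net) = 1 if net > 0, else -1
step(net) = 1 if net > 0, else 0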
-
Example I - The AND Function
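[Figure: a two-input perceptron with inputs X1 and X2, weights W1, W2 and W0, and output O; the values 1, 1 and 2 appear in the figure]
1,1 ---> 1
rest ---> 0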
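A single perceptron with the step function can realise AND. In the sketch below the weights w1 = w2 = 1 and w0 = -1.5 are assumed for illustration (they are one valid choice, not necessarily the values in the original figure).

```python
def step(net):
    return 1 if net > 0 else 0

def perceptron_and(x1, x2, w1=1.0, w2=1.0, w0=-1.5):
    """Output is 1 only when x1 + x2 exceeds 1.5, that is, when both inputs are 1."""
    return step(x1 * w1 + x2 * w2 + w0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "--->", perceptron_and(a, b))
```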
-
Single-Layer Perceptrons
If the response is correct, no modification takes place; otherwise the weights are adjusted by
$\Delta w_{ij} = c \, [d_i - \mathrm{sgn}(w_i^t x)] \, x_j$
An entire pass through all of the input training vectors is
called an epoch. When such an entire pass of the training
set has occurred without error, training is complete.
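The epoch-based training loop can be sketched as below. It is a minimal illustration, assuming bipolar targets (-1 or +1), zero initial weights, a learning constant c = 0.5 and the AND function as training set; none of these choices come from the slides.

```python
def sgn(net):
    return 1 if net > 0 else -1

def train_perceptron(samples, c=0.5, max_epochs=100):
    """samples: list of (inputs, desired) pairs with desired in {-1, +1}."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)          # w[0] is the bias weight w0, paired with x0 = 1
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for inputs, d in samples:
            x = [1.0] + list(inputs)
            y = sgn(sum(wi * xi for wi, xi in zip(w, x)))
            if y != d:           # only a wrong response modifies the weights
                errors += 1
                w = [wi + c * (d - y) * xi for wi, xi in zip(w, x)]
        if errors == 0:          # a full epoch without error: training is complete
            return w, epoch
    return w, None               # no error-free epoch within max_epochs

AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, epochs = train_perceptron(AND)
```

Because AND is linearly separable, the loop stops after a finite number of epochs; on a non-separable set it would run until `max_epochs` and return `None` for the epoch count.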
-
Limitations
Perceptron networks have several limitations.
First, the output values of a perceptron can take on only one
of two values (True or False).
Second, perceptrons can only classify linearly separable sets
of vectors. If a straight line or plane can be drawn to
separate the input vectors into their correct categories, the
input vectors are linearly separable and the perceptron will
find the solution. If the vectors are not linearly separable
learning will never reach a point where all vectors are
classified properly.
The most famous example is the boolean XOR problem.
-
The XOR problem
In the 1960s perceptrons created a great deal of interest, until
M. Minsky and S. Papert, "Perceptrons", MIT Press,
Cambridge, MA, 1969
argued that single-layer perceptrons can only be used for toy problems, since
they cannot represent even a simple XOR function
-
The XOR problem 2
The task is to assign a binary input vector to class 0 if the
vector has an even number of 1s, and to class 1 otherwise.
A two-input binary XOR truth table:
x1 x2 output
0 0 0
0 1 1
1 0 1
1 1 0
-
The XOR problem 3
Recall that the output of a perceptron is given as follows:
1 if the weighted input is greater than 0
0 otherwise
The first input of XOR is 0 0 with desired output 0,
hence the weighted input must be less than or equal to zero
in order to get the desired output
0*w1 + 0*w2 + 1*w0 <= 0
w0 <= 0
-
The XOR problem 4
The second input of XOR is 0 1 with desired output as 1
hence the weighted input must be greater than zero in
order to get the desired output
0*w1 + 1*w2 + 1*w0 > 0
w2 + w0 > 0
-
The XOR problem 5
The third input of XOR is 1 0 with desired output as 1
hence the weighted input must be greater than zero in
order to get the desired output
1*w1 + 0*w2 + 1*w0 > 0
w1 + w0 > 0
-
The XOR problem 6
The fourth input of XOR is 1 1 with desired output 0,
hence the weighted input must be less than or equal to zero
in order to get the desired output
1*w1 + 1*w2 + 1*w0 <= 0
w1 + w2 + w0 <= 0
-
The XOR problem 7
In summary, the perceptron must satisfy the following
four inequalities:
w0 <= 0
w2 + w0 > 0
w1 + w0 > 0
w1 + w2 + w0 <= 0
The first inequality tells us that w0 must be less than or equal to
zero. Therefore, for the 2nd and 3rd to hold, w2 and w1 must
both be positive, each greater than -w0. Adding the 2nd and 3rd
gives w1 + w2 + 2*w0 > 0, so w1 + w2 + w0 > -w0 >= 0, which
contradicts the 4th, which says that w1 + w2 + w0
must be negative or zero
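The contradiction can also be illustrated by a brute-force search. The sketch below tries every weight triple on a coarse grid (an arbitrary choice of half-integer values from -5 to 5) and looks for one that reproduces a given truth table with the step function. A grid search is not a proof, but it finds weights for AND and none for XOR, as the inequalities predict.

```python
from itertools import product

def step(net):
    return 1 if net > 0 else 0

def find_weights(truth_table, grid):
    """Return the first (w0, w1, w2) on the grid that reproduces the table, or None."""
    for w0, w1, w2 in product(grid, repeat=3):
        if all(step(w1 * x1 + w2 * x2 + w0) == t for (x1, x2), t in truth_table):
            return w0, w1, w2
    return None

GRID = [i / 2 for i in range(-10, 11)]   # -5.0, -4.5, ..., 5.0
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND:", find_weights(AND, GRID))
print("XOR:", find_weights(XOR, GRID))
```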
-
Linear Separability
For binary inputs and outputs using the step function, the
output is 1 if the net input is positive and 0 otherwise
net_input = 0: for two inputs this equation represents a
line (the decision boundary)
If there are weights such that all of the training input vectors
for which the correct response is 1 lie on one side of
the decision line and all of the training input vectors for
which the correct response is 0 lie on the other side of
the boundary, then the problem is linearly separable
-
Linear Separability
-
The XOR problem 8
The XOR problem is not linearly separable
We cannot use a single-layer perceptron to construct a
straight line that partitions the two-dimensional input
space into two regions, each containing only data
points of the same class
[Figure: the four XOR input points in the X-Y plane; (0,0) and (1,1) belong to class 0, (0,1) and (1,0) to class 1]
-
Multi-Layer Perceptrons
The lack of suitable training methods for multi-layer
perceptrons (MLPs) led to a waning of interest until the
reformulation of the backpropagation training method
Previous work used signum or step activation functions,
which are nondifferentiable; now continuous activation
functions are employed
-
Multi-Layer Perceptrons 2
All nodes (or neurons) perform the same function on
incoming signals
a composite of the weighted sum and a differentiable
nonlinear activation function, together known as the
transfer function
-
Multi Layer Feedforward Networks
The layers that are neither input nor output are called hidden
layers
Hidden layers extract high order statistics and in a way
provide an overall view of the input data
The output of each layer is used as input to the next layer
There is no theoretical limit on connections between
non-neighboring layers
-
MLP Architecture 2-2-1
[Figure: a 2-2-1 MLP with an input level (x1, x2), an intermediate (hidden) level (h1, h2) and an output level (y)]
-
Activation Functions
Logistic function
$f(net) = 1 / (1 + e^{-net})$
Hyperbolic tangent function
$f(net) = \tanh(net/2) = (1 - e^{-net}) / (1 + e^{-net}) = 2 / (1 + e^{-net}) - 1 = (e^{net/2} - e^{-net/2}) / (e^{net/2} + e^{-net/2})$
Identity function
f(net) = net
where net is the weighted input
-
Activation Functions 2
Logistic and hyperbolic tangent functions
approximate the step and signum functions respectively,
but they provide smooth, non-zero derivatives with
respect to the input signals
referred to as squashing functions since the inputs to
these functions are squashed to the range [0,1] or
[-1,1]
referred to as sigmoidal functions because of their
S-shaped curves
the hyperbolic tangent is sometimes referred to as the bipolar
sigmoidal
the logistic is sometimes referred to as the binary
sigmoidal
-
Activation Functions Graphs
[Figure: graphs of the logistic function and the hyperbolic tangent function]
-
Identity Activation Function
Identity function
it is usually employed for nodes of the output layer to
approximate a continuous-valued function not limited to
[0,1] or [-1,1]
such nodes are referred to as linear nodes
[Figure: graph of the identity function]
-
Binary and Bipolar Sigmoid Derivatives
Binary sigmoid:
$f(net) = 1 / (1 + e^{-net})$
$f'(net) = f(net) [1 - f(net)]$
Bipolar sigmoid:
$f(net) = 2 / (1 + e^{-net}) - 1$
$f'(net) = 0.5 [1 + f(net)] [1 - f(net)]$
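Both derivative formulas can be checked against a central-difference approximation. The sketch below is illustrative: the function names and the step size h are assumptions, and the bipolar sigmoid is also compared with tanh(net/2).

```python
import math

def binary_sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def binary_sigmoid_prime(net):
    f = binary_sigmoid(net)
    return f * (1.0 - f)

def bipolar_sigmoid(net):
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

def bipolar_sigmoid_prime(net):
    f = bipolar_sigmoid(net)
    return 0.5 * (1.0 + f) * (1.0 - f)

def numeric_derivative(f, net, h=1e-6):
    """Central-difference estimate of f'(net)."""
    return (f(net + h) - f(net - h)) / (2.0 * h)

for net in (-2.0, -0.5, 0.0, 1.0, 3.0):
    print(net,
          binary_sigmoid_prime(net), numeric_derivative(binary_sigmoid, net),
          bipolar_sigmoid_prime(net), numeric_derivative(bipolar_sigmoid, net))
```

Expressing each derivative through the function value itself is what makes these activations convenient for backpropagation: the derivative costs no extra exponential.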
-
Learning
Learning target:
minimize the difference between actual outputs and target
outputs
Learning rule:
Steepest descent (Back-propagation)
Conjugate gradient method
All optimization methods using first derivative
Derivative-free optimization
-
MLP and the backpropagation algorithm
-
MLP and the backpropagation algorithm
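[Figure: a three-layer network with an input layer (x_k), a hidden layer (h_i) and an output layer (o_j), connected by weights w_ki and w_ij; the signal flows forward, and the error between o_j and the desired output y_j flows backward]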
-
Backpropagation Algorithm
0 Initialise Weights
1 While Stopping condition is false, do steps 2 to 9
-
Backpropagation Algorithm 2
2 For each training pair, do steps 3 to 8
Feedforward pass
3 Each input unit receives an input signal and broadcasts this
signal to all units in the layer above (the hidden units)
4 Each hidden unit sums its weighted input signals, applies
its activation function to compute its output signal and
sends this signal to all units in the layer above (output
units)
5 Each output unit sums its weighted input signals and
applies its activation function to compute its output signal
End of Feedforward Pass
-
Backpropagation Algorithm 3
Backward Pass
6 Each output unit receives a target pattern corresponding
to the input training pattern, computes its error information
term, calculates its weight and bias correction term, and
sends its error information term to units in the layer
below
7 Each hidden unit sums its weighted error information terms (from
units in the layer above), multiplies by the derivative of its
activation function to calculate its error information term, and
calculates its weight and bias correction term
End of Backward pass
-
Backpropagation Algorithm 4
Updating Pass
8 Each output unit updates its bias and weights. Each
hidden unit updates its bias and weights.
End of Updating pass
9 Test stopping criterion
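Steps 0 to 9 can be sketched in plain Python for a network with one hidden layer. Everything specific here is an assumption made for illustration: logistic activations in both layers, the squared error E = 0.5 * sum((t - o)^2), per-pattern updates, a learning rate of 0.5, and a stopping test on the total error.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Steps 3-5: hidden outputs h and network outputs o."""
    h = [sigmoid(sum(w * xk for w, xk in zip(row, x)) + b) for row, b in zip(W1, b1)]
    o = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b) for row, b in zip(W2, b2)]
    return h, o

def error_terms(t, h, o, W2):
    """Steps 6-7: error information terms of the output and hidden units."""
    delta_o = [(tj - oj) * oj * (1.0 - oj) for tj, oj in zip(t, o)]
    delta_h = [hi * (1.0 - hi) * sum(W2[j][i] * delta_o[j] for j in range(len(o)))
               for i, hi in enumerate(h)]
    return delta_o, delta_h

def train_pair(x, t, W1, b1, W2, b2, lr):
    """Steps 3-8 for one training pair; weights and biases are updated in place."""
    h, o = forward(x, W1, b1, W2, b2)
    delta_o, delta_h = error_terms(t, h, o, W2)   # uses the weights before the update
    for j, dj in enumerate(delta_o):              # step 8: output units
        for i, hi in enumerate(h):
            W2[j][i] += lr * dj * hi
        b2[j] += lr * dj
    for i, di in enumerate(delta_h):              # step 8: hidden units
        for k, xk in enumerate(x):
            W1[i][k] += lr * di * xk
        b1[i] += lr * di

def total_error(data, W1, b1, W2, b2):
    return sum(0.5 * (tj - oj) ** 2
               for x, t in data
               for tj, oj in zip(t, forward(x, W1, b1, W2, b2)[1]))

def train(data, n_hidden=2, lr=0.5, max_epochs=10000, tol=0.01, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(data[0][0]), len(data[0][1])
    # Step 0: initialise weights with small random values.
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    W2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
    b2 = [rng.uniform(-1, 1) for _ in range(n_out)]
    for _ in range(max_epochs):                   # step 1
        for x, t in data:                         # step 2
            train_pair(x, t, W1, b1, W2, b2, lr)
        if total_error(data, W1, b1, W2, b2) < tol:   # step 9
            break
    return W1, b1, W2, b2
```

A call such as `train([([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])])` trains a 2-2-1 network on XOR; as with any gradient method, whether it reaches the error tolerance depends on the random initial weights.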
-
Backpropagation Algorithm 5
-
Problems
How to determine the architecture?
How to determine the parameters?
How to get global optima?
... ...
-
GA and ANN
Three levels:
connection weights: introduce an adaptive and global
approach to training
architectures: adapt the topologies to different tasks without
human intervention and thus provide an approach to
automatic ANN design, as both ANN connection weights
and structures can be evolved
learning rules: learning to learn, an adaptive process of
automatic discovery of novel learning rules
-
Evolution of connection weights
Weight training in ANNs is usually formulated as
minimization of an error function, such as the mean
square error between target and actual outputs averaged
over all examples, by iteratively adjusting connection
weights.
BP often gets trapped in a local minimum of the error
function and may fail to find a global minimum if the
error function is multimodal; it cannot be applied directly
if the error function is nondifferentiable.
GA can be used effectively to evolve a near-optimal
set of connection weights globally without
computing gradient information.
-
Typical cycle of the evolution of the
connection weights
1 Decode each individual in the current generation into a set
of connection weights and construct a corresponding ANN
with the weights
2 Evaluate each ANN by computing its total mean square
error between actual and target outputs. The fitness of an
individual is determined by the error. A regularization term
may be included in the fitness function to penalize large
weights.
3 Select parents for reproduction based on their fitness
4 Apply genetic operators, such as crossover and mutation,
to parents to generate offspring, which form the next
generation
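The four-step cycle can be sketched as a small GA that evolves the nine weights of a 2-2-1 network on XOR. The concrete choices below are assumptions made for illustration and do not come from the slides: real-valued chromosomes, truncation selection, one-point crossover, Gaussian mutation, and an elitist copy of the best individual.

```python
import math
import random

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def sigmoid(z):
    z = max(-60.0, min(60.0, z))          # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def decode(chrom):
    """Step 1: 9 genes -> weights of hidden node 1, hidden node 2, output node.
    The weights (and bias) feeding the same node sit next to each other."""
    return chrom[0:3], chrom[3:6], chrom[6:9]

def mse(chrom, data=XOR):
    """Step 2: mean square error of the decoded network (lower is fitter)."""
    h1w, h2w, ow = decode(chrom)
    err = 0.0
    for (x1, x2), t in data:
        h1 = sigmoid(h1w[0] * x1 + h1w[1] * x2 + h1w[2])
        h2 = sigmoid(h2w[0] * x1 + h2w[1] * x2 + h2w[2])
        o = sigmoid(ow[0] * h1 + ow[1] * h2 + ow[2])
        err += (t - o) ** 2
    return err / len(data)

def evolve(pop_size=30, generations=100, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(9)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=mse)
        history.append(mse(pop[0]))
        parents = pop[:pop_size // 2]     # step 3: keep the fitter half as parents
        children = [pop[0][:]]            # elitism: the best individual survives
        while len(children) < pop_size:   # step 4: crossover and mutation
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, 9)
            child = a[:cut] + b[cut:]
            child = [w + rng.gauss(0, 0.3) if rng.random() < 0.2 else w for w in child]
            children.append(child)
        pop = children
    best = min(pop, key=mse)
    return best, history + [mse(best)]
```

Because of the elitist copy, the best error recorded in `history` never increases from one generation to the next; how low it gets depends on the seed and the GA settings.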
-
Representation
Binary or real number
Place the connection weights to the same node next to each other
in the chromosome. Nodes in an ANN are in essence feature extractors and detectors.
Separating inputs to the same node far apart would
increase the difficulty of constructing useful feature
detectors because they might be destroyed by crossover
operators.
Permutation problem: the mapping from the
representation to the actual ANN is many-to-one, since two ANNs that
order their hidden nodes differently in their chromosomes
are still functionally equivalent. This makes the crossover
operator very inefficient at producing good offspring.
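The permutation problem is easy to demonstrate: swapping two hidden nodes, together with their outgoing weights, changes the chromosome but not the function the network computes. The small network below, with tanh hidden units and a linear output, uses made-up weights purely for illustration.

```python
import math

def net_output(hidden, out_weights, out_bias, x):
    """hidden: list of (input_weights, bias) pairs, one per hidden node."""
    h = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b) for ws, b in hidden]
    return sum(w * hi for w, hi in zip(out_weights, h)) + out_bias

def chromosome(hidden, out_weights, out_bias):
    """Flatten the network so that weights to the same node are adjacent."""
    genes = []
    for ws, b in hidden:
        genes.extend(ws + [b])
    return genes + out_weights + [out_bias]

hidden_a = [([0.5, -1.0], 0.1), ([2.0, 0.3], -0.4)]
out_a = [1.5, -0.7]

# Same network with the two hidden nodes listed in the opposite order.
hidden_b = hidden_a[::-1]
out_b = out_a[::-1]

print(chromosome(hidden_a, out_a, 0.2))
print(chromosome(hidden_b, out_b, 0.2))
```

The two printed chromosomes differ, yet `net_output` gives the same value for both networks on any input, so crossover between them recombines genes that play different roles at the same positions.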
-
Comparison between GA and BP
GA is generally better suited to global search than BP. It can be
used to train many different networks regardless of their
architecture, which saves much of the human effort of
developing a different training algorithm for each type of
ANN.
GA makes it easier to generate ANNs with some special
characteristics.
GA is much less sensitive to the initial conditions of training.
There is no clear winner in terms of the best training
algorithm.
-
Hybrid training
Combine GA's global search ability with local search's ability
to fine-tune. GA can be used to locate a good region in the
search space and then a local search procedure is used to find a
near-optimal solution in this region.
-
The evolution of architecture
The architecture of an ANN includes its topological structure,
i.e., connectivity, and the transfer function of each node in
the ANN.
The architecture has a significant impact on a network's
information processing capabilities. Given a learning task,
an ANN with only a few connections and linear nodes may
not be able to perform the task at all due to its limited
capability, while an ANN with a large number of
connections and nonlinear nodes may overfit noise in the
training data and fail to have good generalization ability.
-
Traditional way to design the architecture
There is no systematic way to design a near-optimal
architecture for a given task automatically.
A constructive algorithm starts with a minimal network
(network with minimal number of hidden layers, nodes and
connections) and adds new layers, nodes and
connections when necessary during training.
A destructive algorithm starts with a maximal network
(network with maximal number of hidden layers, nodes
and connections) and deletes unnecessary layers, nodes
and connections during training.
Such structural hill climbing methods are susceptible to
becoming trapped at structural local optima. They only
investigate restricted topological subsets rather than the
complete class of network architecture.
-
Typical cycle of the evolution of
architecture
1 Decode each individual in the current generation into an
architecture.
2 Train each ANN with the decoded architecture by a
predefined learning rule starting from different sets of
random initial connection weights and learning rule
parameters.
3 Compute the fitness of each individual according to the
above training result and other performance criteria such
as the complexity of the architecture.
4 Select parents from the population based on their fitness.
5 Apply search operators to the parents and generate
offspring which form the next generation.
-
The direct encoding scheme
An N×N matrix C = (c(i,j)) can represent an ANN architecture
with N nodes, where c(i,j) indicates presence or absence
of the connection from node i to node j.
Such an encoding scheme can handle both feedforward and
recurrent ANNs.
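A direct encoding can be sketched as a 0/1 connectivity matrix flattened row by row into a chromosome. The helper names and the two example matrices below are made up for illustration; the cycle test uses a standard topological-sort check to tell feedforward from recurrent architectures.

```python
def to_chromosome(C):
    """Concatenate the rows of the connectivity matrix into one bit list."""
    return [bit for row in C for bit in row]

def from_chromosome(bits, n):
    """Rebuild the n x n connectivity matrix from a flat bit list."""
    return [bits[i * n:(i + 1) * n] for i in range(n)]

def is_feedforward(C):
    """True if the graph has no cycles; C[i][j] == 1 means a link from node i to node j."""
    n = len(C)
    indegree = [sum(C[i][j] for i in range(n)) for j in range(n)]
    ready = [j for j in range(n) if indegree[j] == 0]
    visited = 0
    while ready:
        i = ready.pop()
        visited += 1
        for j in range(n):
            if C[i][j]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    ready.append(j)
    return visited == n

# Nodes 0 and 1 are inputs, node 2 is hidden, node 3 is the output.
feedforward = [[0, 0, 1, 1],
               [0, 0, 1, 1],
               [0, 0, 0, 1],
               [0, 0, 0, 0]]

# Same network plus a feedback link from the output back to the hidden node.
recurrent = [row[:] for row in feedforward]
recurrent[3][2] = 1
```

The chromosome length grows with the square of the number of nodes, which is the cost of direct encoding for large networks.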
-
A feedforward ANN
-
A recurrent ANN
-
Notes about direct encoding scheme
It is straightforward to implement.
Training error, training time, complexity can be used in the
fitness function
A large ANN would require a very large matrix and thus
increase the computation time of the evolution. Domain
knowledge can be used to reduce the search space
The permutation problem still exists
-
The indirect encoding scheme
Only some characteristics of an architecture are encoded to
reduce the length of the chromosome. The details about
each connection in an ANN are either predefined according
to prior knowledge or specified by a set of deterministic
development rules.
-
Parametric representation
ANN architectures may be specified by a set of parameters
such as the number of hidden layers, the number of
hidden nodes in each layer, the number of connections
between two layers, etc.
In general the parametric representation method will be most
suitable when we know what kind of architectures we are
trying to find.
-
Example of pattern recognition
Input Output Input Output
0000 00 0100 00
1100 00 1000 00
1001 01 0001 01
1101 01 0101 01
0010 11 1010 11
0110 11 1110 11
0011 10 0111 10
1011 10 1111 10
In fact the first two bits of the input are noise and the output
is the Gray code of the last two bits of the input.
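The target mapping of this example can be written down directly: drop the first two (noise) bits and Gray-encode the remaining two. The function name is illustrative.

```python
def target_output(input_bits):
    """Ignore the first two bits and return the 2-bit Gray code of the last two."""
    value = int(input_bits[2:], 2)
    gray = value ^ (value >> 1)      # standard binary-to-Gray conversion
    return format(gray, "02b")

for sample in ("0000", "1001", "0010", "1111"):
    print(sample, target_output(sample))
```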
-
Chromosome
We use a 16-bit chromosome
The first 2 bits stand for the learning rate: 0.5, 0.25, 0.125,
0.0625
The next 2 bits stand for the momentum: 0.9, 0.8, 0.7, 0.6
The next 2 bits stand for the range of the initial weights: 1,
0.5, 0.25, 0.125
The next 5 bits are used for the 1st hidden layer: the first bit
indicates whether the hidden layer is present and the other 4 bits
give the number of hidden units.
The last 5 bits are used for the 2nd hidden layer: the first bit
indicates whether the hidden layer is present and the other 4 bits
give the number of hidden units.
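Decoding such a chromosome might look like the sketch below. Several details are assumptions because the slides do not specify them: the bit pattern 00 is taken to select the first listed value, the 4 unit bits are read as a plain binary number, and a layer whose presence bit is 0 (or whose unit count is 0) is skipped.

```python
LEARNING_RATES = [0.5, 0.25, 0.125, 0.0625]
MOMENTA = [0.9, 0.8, 0.7, 0.6]
WEIGHT_RANGES = [1.0, 0.5, 0.25, 0.125]

def decode_layer(bits):
    """First bit: is the layer present? Remaining 4 bits: number of hidden units."""
    return int(bits[1:], 2) if bits[0] == "1" else 0

def decode(chromosome):
    if len(chromosome) != 16 or set(chromosome) - {"0", "1"}:
        raise ValueError("expected a string of 16 bits")
    layers = (decode_layer(chromosome[6:11]), decode_layer(chromosome[11:16]))
    return {
        "learning_rate": LEARNING_RATES[int(chromosome[0:2], 2)],
        "momentum": MOMENTA[int(chromosome[2:4], 2)],
        "weight_range": WEIGHT_RANGES[int(chromosome[4:6], 2)],
        "hidden_layers": [n for n in layers if n > 0],
    }

# Hypothetical chromosome: 00 | 01 | 10 | 10001 | 10100
print(decode("0001101000110100"))
```

Under these assumptions the hypothetical chromosome above decodes to a learning rate of 0.5, momentum 0.8, weight range 0.25 and hidden layers of 1 and 4 units.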
-
Evolution and result
Only the first 8 samples are used for evolution.
7 of these 8 samples are used for training the ANN and the remaining
one is used to compute the fitness.
Finally we get a 4-1-4-2 ANN (structure and weights).
To check the final result, we use the other 8 samples
and compare with a 4-16-16-2 ANN trained by BP.
-
Developmental rule representation
Development rules, which are used to construct architectures,
are encoded in chromosomes.
A development rule is usually described by a recursive
equation or a production system.
How to get such a set of rules to construct an ANN? One
answer is to evolve them. We can encode the whole rule
set as an individual (Pittsburgh approach) or encode each
rule as an individual (Michigan approach)
-
Examples of some development rules
-
Development of an ANN architecture
-
Simultaneous evolution of architectures &
weights
-
Evolution of learning rules
An ANN training algorithm may have different performance
when applied to different architectures. The design of
training rules, more fundamentally the learning rules used
to adjust weights, depends on the type of architectures
under investigation. Different variants of the Hebbian
learning rule have been proposed to deal with different
architectures. It is desirable to develop an automatic and
systematic way to adapt the learning rule to an
architecture and the task to be performed. Designing a
learning rule manually often implies that some
assumptions, which are not necessarily true in practice,
have to be made.
-
Typical cycle of the evolution of learning
rule
1 Decode each individual in the current generation into a
learning rule
2 Construct a set of ANNs with randomly generated
architectures and initial connection weights, and train
them using the decoded learning rule.
3 Calculate the fitness of each individual according to the
average training result
4 Select parents from the current generation according to
their fitness
5 Apply search operators to parents to generate offspring
which form the new generation
-
Evolution of algorithm parameters
The adaptive adjustment of BP's parameters through
evolution could be considered the first attempt at the
evolution of learning rules.
Some researchers used a GA to find parameters
for BP while the ANN's architecture was predefined. The
parameters evolved in this case tend to be optimized
towards that architecture rather than being generally
applicable to learning.
Other researchers encoded BP's parameters in
chromosomes together with the ANN's architecture.
-
Evolution of learning rules
The evolution of learning rules has to work on the dynamic
behavior of an ANN.
Trying to develop a universal representation scheme which can
specify any kind of dynamic behavior is clearly
impractical.
Two basic assumptions which have often been made about
learning rules are: 1) weight updating depends only on
local information such as the activation of the input node,
the activation of the output node, the current connection
weight, etc.; 2) the learning rule is the same for all
connections in an ANN
-
Learning rule
A learning rule can be described by the following function
There are three major issues involved in the evolution of
learning rules: 1) determination of a subset of terms
described in the above equation; 2) representation of the
coefficients as chromosomes, and 3) the GA used to
evolve these chromosomes.
-
7/31/2019 ch6 ann and ga
103/104
103
Other combination between GA and ANN
Evolution of input features: finding a near-optimal set of input
features to an ANN
ANN as fitness estimator: the time-consuming fitness
evaluation based on real systems is replaced by fast
fitness evaluation based on ANN
Evolving ANN ensembles: combining different individuals in
the population to form an integrated system is expected to
produce better results.
A general framework for GA and ANN