Introduction to Artificial Neural Networks


Page 1: Introduction to Artificial Neural Networks

Introduction to Artificial Neural Networks

Presented by: Ghayas Ur Rehman
Course Trainer: Dr. Tehseen Jilani
Department of Computer Science
University of Karachi

Neural networks do not perform miracles. But if used sensibly they can produce some amazing results.

Page 2: Introduction to Artificial Neural Networks


Back-propagation network: It is the most widely used architecture and a very popular technique that is relatively easy to implement. It requires a large amount of training data to condition the network before it can be used to predict outcomes.

A back-propagation network includes at least one hidden layer. The approach is known as the "feed-forward / back-propagation" approach.

Limitations: Neural networks do not do well at tasks that are not performed well by people. They lack an explanation facility. Training time can be excessive.

Page 3: Introduction to Artificial Neural Networks


Simple BPNN

Page 4: Introduction to Artificial Neural Networks


BPNN in simple words

Back-propagation is an algorithm that extends the analysis that underpins the delta rule to neural nets with hidden nodes. To see the problem, imagine that Bob tells Alice a story, and then Alice tells Ted. Ted checks the facts, and finds that the story is erroneous. Now, Ted needs to find out how much of the error is due to Bob and how much to Alice. When output nodes take their inputs from hidden nodes, and the net finds that it is in error, its weight adjustments require an algorithm that will pick out how much the various nodes contributed to its overall error. The net needs to ask, "Who led me astray? By how much? And, how do I fix this?" What's a net to do?

Page 5: Introduction to Artificial Neural Networks


When not to use a BPNN? A back-propagation neural network is only practical in certain situations. The following are some guidelines on when you should use another approach:

Can you write down a flow chart or a formula that accurately describes the problem? If so, then stick with a traditional programming method.

Is there a simple piece of hardware or software that already does what you want? If so, then the development time for a NN might not be worth it.

Do you want the functionality to "evolve" in a direction that is not pre-defined?

Page 6: Introduction to Artificial Neural Networks


When not to use a BPNN? (Contd.) Do you have an easy way to generate a significant number of input/output examples of the desired behavior? If not, then you won't be able to train your NN to do anything.

Is the problem very "discrete"? Can the correct answer be found in a look-up table of reasonable size? If so, a look-up table is much simpler and more accurate.

Are precise numeric output values required? NNs are not good at giving precise numeric answers.

Page 7: Introduction to Artificial Neural Networks


When to use a BPNN? Conversely, here are some situations where a BPNN might be a good idea:

A large amount of input/output data is available, but you're not sure how to relate the inputs to the outputs.

The problem appears to have overwhelming complexity, but there is clearly a solution.

It is easy to create a number of examples of the correct behavior.

The solution to the problem may change over time, within the bounds of the given input and output parameters (i.e., today 2+2=4, but in the future we may find that 2+2=3.8).

Outputs can be "fuzzy", or non-numeric.

Page 8: Introduction to Artificial Neural Networks


Back-propagation algorithm

It is the most popular and successful method. Steps to be followed during training:

◦ Select the next training pair (input vector and target output) from the training set.

◦ Present the input vector to the network.

◦ The network calculates its output.

◦ The network calculates the error between the network output and the desired output.

◦ The network back-propagates the error.

◦ Adjust the weights of the network in a way that minimizes the error.

◦ Repeat the above steps for each vector in the training set until the error is acceptable for the whole set.
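As a rough sketch (not code from the slides), the loop below strings these steps together in Python; the network object and its forward, backward, and update_weights methods are hypothetical placeholders for whatever network representation is used.

```python
def train(network, training_set, learning_rate=0.25, tolerance=1e-3, max_epochs=1000):
    """Back-propagation training loop following the steps listed above."""
    for epoch in range(max_epochs):
        total_error = 0.0
        for input_vector, target in training_set:            # select the next training pair
            output = network.forward(input_vector)            # present the input; compute output
            error = [t - o for t, o in zip(target, output)]   # error vs. desired output
            total_error += sum(e * e for e in error)
            gradients = network.backward(error)               # back-propagate the error
            network.update_weights(gradients, learning_rate)  # adjust weights to reduce error
        if total_error <= tolerance:                          # stop once the error is acceptable
            break
    return network
```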

Page 9: Introduction to Artificial Neural Networks

Back-propagation algorithm

Step 1: Feed the inputs forward through the network:

$a^0 = p$

$a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1}), \quad m = 0, 1, \ldots, M-1$

$a = a^M$

Step 2: Back-propagate the sensitivities (errors):

$s^M = -2\,\dot{F}^M(n^M)\,(t - a)$   (at the output layer)

$s^m = \dot{F}^m(n^m)\,(W^{m+1})^T s^{m+1}, \quad m = M-1, \ldots, 2, 1$   (at the hidden layers)

Step 3: Finally, the weights and biases are updated by the following formulas:

$W^m(k+1) = W^m(k) - \alpha\, s^m (a^{m-1})^T$

$b^m(k+1) = b^m(k) - \alpha\, s^m$

(Details on constructing the algorithm and other related issues can be found in the textbook Neural Network Design.)
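To make the three steps concrete, here is a minimal NumPy sketch for a network with a single hidden layer (M = 2) and sigmoid transfer functions; the layer sizes, data, and learning rate are made-up values for illustration, not taken from the slides. Multiplying elementwise by the derivative plays the role of the diagonal matrix $\dot{F}^m(n^m)$ in the equations above.

```python
import numpy as np

def f(n):                          # sigmoid transfer function
    return 1.0 / (1.0 + np.exp(-n))

def f_dot(a):                      # its derivative, written in terms of the activation a
    return a * (1.0 - a)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))   # input -> hidden (2 inputs, 3 hidden units)
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))   # hidden -> output (1 output unit)
alpha = 0.1                                          # learning rate

p = np.array([[0.5], [-0.2]])      # input vector
t = np.array([[1.0]])              # target output

# Step 1: feed the input forward
a0 = p
a1 = f(W1 @ a0 + b1)
a2 = f(W2 @ a1 + b2)

# Step 2: back-propagate the sensitivities
s2 = -2 * f_dot(a2) * (t - a2)     # at the output layer
s1 = f_dot(a1) * (W2.T @ s2)       # at the hidden layer

# Step 3: update weights and biases
W2 -= alpha * s2 @ a1.T
b2 -= alpha * s2
W1 -= alpha * s1 @ a0.T
b1 -= alpha * s1
```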

Page 10: Introduction to Artificial Neural Networks


Network Training

Supervised Learning
◦ The network is presented with the input and the desired output.
◦ Uses a set of inputs for which the desired outputs / classes are known. The difference between the desired and actual output is used to calculate adjustments to the weights of the NN structure.

Unsupervised Learning
◦ The network is not shown the desired output.
◦ The concept is similar to clustering.
◦ It tries to create classifications in the outcome.

Page 11: Introduction to Artificial Neural Networks


Unsupervised Learning
Only input stimuli (parameters) are presented to the network. The network is self-organizing; that is, it organizes itself internally so that each hidden processing element and its weights respond appropriately to a different set of input stimuli.

No knowledge is supplied about the classification of outputs. However, the number of categories into which the network classifies the inputs can be controlled by varying certain parameters in the model. In any case, a human expert must examine the final classifications to assign meaning and judge the usefulness of the results.

Reinforcement Learning
Falls between supervised and unsupervised learning: the network gets feedback from the environment.

Page 12: Introduction to Artificial Neural Networks


Learning (Training) Algorithms
The training process requires a set of properly selected data in the form of network inputs and target outputs. During training, the weights and biases are iteratively adjusted to minimize the network performance function (error). The default performance function is the mean square error. Input data should be independent.

Back-propagation learning algorithm
There are many variations; the most commonly used is the gradient descent algorithm:

$x_{k+1} = x_k - \alpha_k\, g_k$

where $x_k$ is the vector of current weights and biases, $g_k$ is the current gradient, and $\alpha_k$ is the chosen learning rate.
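For illustration only, one such update step in Python (the vectors are made up; in practice $x$ would hold all weights and biases flattened into a single vector):

```python
import numpy as np

# One gradient-descent step: x_{k+1} = x_k - alpha_k * g_k
x = np.array([0.4, -1.2, 0.7])      # current weights and biases
g = np.array([0.05, -0.30, 0.10])   # current gradient of the error w.r.t. x
alpha = 0.1                         # learning rate
x_next = x - alpha * g              # updated weights and biases
```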

Page 13: Introduction to Artificial Neural Networks


Back Propagation Learning Algorithm

It is the most commonly used generalization of the delta rule. The procedure involves two phases:

(i) Forward phase: When the input is presented, it propagates forward through the network to compute output values for each processing element (PE). The current outputs are then compared with the desired outputs and the error is computed.

(ii) Backward phase: The calculated error is now fed backward and the weights are adjusted.

After both phases are complete, a new input is presented for further training.

This technique is slow, can cause instability, and has a tendency to get stuck in local minima, but it is still very popular.

Page 14: Introduction to Artificial Neural Networks


Gradient Descent Algorithm

The idea is to calculate an error each time the network is presented with a training vector (given that we have supervised learning, where there is a target vector) and to perform gradient descent on the error, considered as a function of the weights. There will be a gradient, or slope, for each weight. Thus, we find the weights that give the minimal error.

Typically the error criterion is defined as the square of the difference between the pattern output and the target output (least squared error).

The total error E is then just the sum of the squared pattern errors.

Page 15: Introduction to Artificial Neural Networks


Error function (LMS)

$E = \sum_p E_p = \tfrac{1}{2}\sum_p (t_p - o_p)^2$

where $t_p$ is the target output for the pth component and $o_p$ is the network output for the pth component.

Note: LMS = least mean square.
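A small, illustrative computation of this error for one pattern (the target and output values are made up):

```python
import numpy as np

# E = 1/2 * sum_p (t_p - o_p)^2 for a single pattern
t = np.array([1.0, 0.0, 1.0])    # target outputs t_p
o = np.array([0.8, 0.2, 0.9])    # network outputs o_p
E = 0.5 * np.sum((t - o) ** 2)   # 0.5 * (0.04 + 0.04 + 0.01) = 0.045
```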

Page 16: Introduction to Artificial Neural Networks


This method of weight adjustment is also known as the steepest gradient descent technique, or the Widrow-Hoff rule, and is the most common type. It is also known as the delta rule.
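For a single linear unit the delta rule reduces to $\Delta w_i = \eta\,(t - o)\,x_i$; the sketch below assumes that simple case, with made-up inputs, weights, target, and learning rate.

```python
import numpy as np

# Delta (Widrow-Hoff) rule for one linear unit: each weight moves in
# proportion to the output error (t - o) and its own input x_i.
x = np.array([0.5, -1.0, 0.25])   # inputs to the unit
w = np.array([0.1, 0.4, -0.2])    # current weights
t = 1.0                           # target output
eta = 0.05                        # learning rate

o = np.dot(w, x)                  # linear unit output
w += eta * (t - o) * x            # delta rule update
```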

Page 17: Introduction to Artificial Neural Networks


Network Learning Rules

Hebbian Rule

The first and best-known learning rule was introduced by Donald Hebb. The basic rule is: if a neuron receives an input from another neuron, and if both are highly active (mathematically, have the same sign), the weight between the neurons should be strengthened.

$w_{ij}(t+1) = w_{ij}(t) + \eta\, y_j(t)\, x_i(t)$

where $x_i(t)$ and $y_j(t)$ are the outputs at nodes i and j, and $w_{ij}$ is the weight between nodes i and j.
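A minimal sketch of this rule in Python, using an outer product so that every weight $w_{ij}$ is updated at once (node activities and learning rate are illustrative):

```python
import numpy as np

# Hebbian update w_ij(t+1) = w_ij(t) + eta * y_j(t) * x_i(t):
# a weight grows when the two nodes it connects are active together.
x = np.array([1.0, -0.5, 0.0])   # activities x_i of nodes i
y = np.array([0.8, -0.3])        # activities y_j of nodes j
eta = 0.1                        # learning rate
W = np.zeros((x.size, y.size))   # weights w_ij

W += eta * np.outer(x, y)        # Hebbian weight update for all pairs
```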

Page 18: Introduction to Artificial Neural Networks

Backpropagation: The Math

General multi-layered neural network

[Figure: a fully connected network with an input layer (nodes 0-9), a hidden layer (nodes 0, 1, ..., i), and an output layer (nodes 0, 1); X denotes the input-to-hidden weights (X0,0, X1,0, ..., X9,0) and W the hidden-to-output weights (W0,0, W1,0, ..., Wi,0).]

Page 19: Introduction to Artificial Neural Networks

Backpropagation: The Math

Backpropagation
◦ Calculation of hidden layer activation values

Page 20: Introduction to Artificial Neural Networks

Backpropagation: The Math

Backpropagation
◦ Calculation of output layer activation values
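The activation formulas on these two slides were images and did not come through in the transcript; assuming the usual weighted sums passed through a sigmoid, the standard expressions (using the weight names from the diagram) are:

$h_j = f\Big(\sum_i x_i\, X_{i,j}\Big), \qquad o_k = f\Big(\sum_j h_j\, W_{j,k}\Big), \qquad f(u) = \frac{1}{1 + e^{-u}}$

where $x_i$ are the input values, $h_j$ the hidden-layer activations, and $o_k$ the output-layer activations.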

Page 21: Introduction to Artificial Neural Networks

Backpropagation: The Math

Backpropagation
◦ Calculation of error

$d_k = f(D_k) - f(O_k)$

Page 22: Introduction to Artificial Neural Networks

Backpropagation: The Math

Backpropagation
◦ Gradient Descent objective function

◦ Gradient Descent termination condition
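The formulas here were also images; a standard choice, consistent with the error $d_k$ defined on the previous slide, would be to minimize the summed squared error and stop once it is small enough (the threshold $\epsilon$ is an assumed parameter):

$E = \tfrac{1}{2}\sum_k d_k^2, \qquad \text{terminate when } E < \epsilon \text{ or after a maximum number of iterations.}$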

Page 23: Introduction to Artificial Neural Networks

Backpropagation: The Math

Backpropagation
◦ Output layer weight recalculation (learning rate, e.g. 0.25; error at node k)

Page 24: Introduction to Artificial Neural Networks

Backpropagation: The Math

Backpropagation

◦ Hidden Layer weight recalculation
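The recalculation formulas on these two slides (output-layer and hidden-layer weights) did not survive extraction. One common formulation, assuming sigmoid units, with $\eta$ as the learning rate (e.g. 0.25) and $d_k$ as the error at output node k, is:

$W_{j,k} \leftarrow W_{j,k} + \eta\, \delta_k\, h_j, \qquad \delta_k = d_k\, o_k (1 - o_k)$

$X_{i,j} \leftarrow X_{i,j} + \eta\, \delta_j\, x_i, \qquad \delta_j = h_j (1 - h_j) \sum_k \delta_k\, W_{j,k}$

Equivalent variants exist depending on where the derivative factor is folded in.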

Page 25: Introduction to Artificial Neural Networks

Backpropagation Using Gradient Descent

Advantages
◦ Relatively simple implementation
◦ Standard method and generally works well

Disadvantages
◦ Slow and inefficient
◦ Can get stuck in local minima, resulting in sub-optimal solutions

Page 26: Introduction to Artificial Neural Networks

Local Minima

[Figure: an error surface with a local minimum and the global minimum marked.]

Page 27: Introduction to Artificial Neural Networks

Alternatives To Gradient Descent

Simulated Annealing
◦ Advantages: can guarantee an optimal solution (global minimum)
◦ Disadvantages: may be slower than gradient descent; much more complicated implementation

Page 28: Introduction to Artificial Neural Networks

Alternatives To Gradient Descent

Genetic Algorithms / Evolutionary Strategies
◦ Advantages: faster than simulated annealing; less likely to get stuck in local minima
◦ Disadvantages: slower than gradient descent; memory intensive for large nets

Page 29: Introduction to Artificial Neural Networks

Alternatives To Gradient Descent

Simplex Algorithm
◦ Advantages: similar to gradient descent but faster; easy to implement
◦ Disadvantages: does not guarantee a global minimum

Page 30: Introduction to Artificial Neural Networks

Enhancements To Gradient Descent

Momentum

◦ Adds a percentage of the last movement to the current movement

Page 31: Introduction to Artificial Neural Networks

Enhancements To Gradient Descent

Momentum
◦ Useful for getting over small bumps in the error function
◦ Often finds a minimum in fewer steps
◦ $\Delta w(t) = -\eta\, d\, y + \alpha\, \Delta w(t-1)$

where $\Delta w$ is the change in weight, $\eta$ is the learning rate, $d$ is the error, $y$ differs depending on which layer we are calculating, and $\alpha$ is the momentum parameter.
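A minimal sketch of this momentum update for a single weight (all values are illustrative; $y$ would be the activation feeding into that weight):

```python
# Momentum update: delta_w(t) = -eta * d * y + alpha * delta_w(t-1)
eta, alpha = 0.25, 0.9     # learning rate and momentum parameter
d, y = 0.12, 0.8           # error term and incoming activation
delta_w_prev = -0.02       # weight change from the previous step

delta_w = -eta * d * y + alpha * delta_w_prev   # new weight change
w = 0.5                                         # current weight (illustrative)
w += delta_w                                    # apply the change
```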

Page 32: Introduction to Artificial Neural Networks

Enhancements To Gradient Descent

Adaptive Backpropagation Algorithm

◦ It assigns each weight its own learning rate.
◦ That learning rate is determined by the sign of the gradient of the error function from the last iteration: if the signs are equal, the slope is more likely to be shallow, so the learning rate is increased; the signs are more likely to differ on a steep slope, so the learning rate is decreased.
◦ This speeds up the advancement on gradual slopes.
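A rough sketch of such per-weight adaptive rates (the increase/decrease factors 1.2 and 0.5 are illustrative choices, not values from the slides):

```python
import numpy as np

# Each weight keeps its own learning rate: the rate grows when the current
# and previous gradients agree in sign (shallow slope) and shrinks when
# they disagree (steep slope).
def adapt_rates(rates, grad, prev_grad, up=1.2, down=0.5):
    same_sign = np.sign(grad) == np.sign(prev_grad)
    return np.where(same_sign, rates * up, rates * down)

rates = np.full(3, 0.1)                      # one learning rate per weight
prev_grad = np.array([0.2, -0.1, 0.05])      # gradient from the last iteration
grad = np.array([0.15, 0.08, 0.04])          # current gradient

rates = adapt_rates(rates, grad, prev_grad)  # -> [0.12, 0.05, 0.12]
weights = np.zeros(3)
weights -= rates * grad                      # per-weight gradient step
```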

Page 33: Introduction to Artificial Neural Networks

Enhancements To Gradient Descent

Adaptive Backpropagation
◦ Possible problem: since we minimize the error for each weight separately, the overall error may increase.
◦ Solution: calculate the total output error after each adaptation; if it is greater than the previous error, reject that adaptation and calculate new learning rates.

Page 34: Introduction to Artificial Neural Networks

Enhancements To Gradient Descent

SuperSAB (Super Self-Adapting Backpropagation)

◦ Combines the momentum and adaptive methods.
◦ Uses the adaptive method and momentum as long as the sign of the gradient does not change; this is an additive effect of both methods, resulting in a faster traversal of gradual slopes.
◦ When the sign of the gradient does change, the momentum cancels the drastic drop in learning rate; this allows the function to roll up the other side of the minimum, possibly escaping local minima.

Page 35: Introduction to Artificial Neural Networks

Enhancements To Gradient Descent

SuperSAB
◦ Experiments show that SuperSAB converges faster than gradient descent.
◦ Overall this algorithm is less sensitive (and so is less likely to get caught in local minima).

Page 36: Introduction to Artificial Neural Networks

Other Ways To Minimize Error

Varying training data
◦ Cycle through input classes
◦ Randomly select from input classes

Add noise to training data
◦ Randomly change the value of an input node (with low probability)

Retrain with expected inputs after initial training
◦ E.g. speech recognition

Page 37: Introduction to Artificial Neural Networks

Other Ways To Minimize Error

Adding and removing neurons from layers

◦ Adding neurons speeds up learning but may cause loss in generalization

◦ Removing neurons has the opposite effect

Page 38: Introduction to Artificial Neural Networks


Applications of Backpropagation

1) In image analysis:
   a. Text in image recognition.
   b. Finding oil fields.

2) Source code recognition.

3) Reproducing similar sounds.

4) Robotics.

Page 39: Introduction to Artificial Neural Networks


Code recognizer

Page 40: Introduction to Artificial Neural Networks


Case study

A mad scientist wants to make billions of dollars by controlling the stock market. He will do this by controlling the stock purchases of several wealthy people. The scientist controls information that can be given by Wall Street insiders, and has a device to control how much different people trust each other. Using his ability to plant insider information and control trust between people, he will control the purchases made by wealthy individuals. If purchases can be made that are ideal for the mad scientist, he can gain capital by controlling the market.

Page 41: Introduction to Artificial Neural Networks


Information is planted at the top level to Wall Street insiders. They then relay this information to stock brokers who are their friends. The brokers then relay that information to their favorite wealthy clients who then make trades. The weight for each edge is the amount of trust that person has for the person above them. The more they trust a person, the more likely they are to either pass along information or make a trade based on the information.

Page 42: Introduction to Artificial Neural Networks


Case study (Contd.)

As a mad scientist, you will need to adjust this social network in order to create optimal actions in the marketplace. You do this using your secret Trust 'o' Vac 2000. With it you can increase or decrease each trust weight however you see fit. You then observe the trades made by the rich dudes. If the trades are not to your liking, we consider this error. The more to your liking the trades are, the less error they contain. Ideally, you want to slowly adjust the network so that it gets closer and closer to what you want and contains less error. In general terms, this is referred to as gradient descent.

Page 43: Introduction to Artificial Neural Networks


As you place insider information, you observe the amount of error coming out of your network. If a person is making trades that are rather poor, you need to figure out where they are getting the information to do so. A strong trust (shown by a thick line) indicates where more error is coming from and where larger changes need to be made.

Page 44: Introduction to Artificial Neural Networks


Case study (Contd.)

There are many ways in which we can adjust the trust weights, but we will use a very simple method here. Each time we place some insider information, we watch the trades that come from our rich dudes. If there is a large error coming from one rich dude, then they are getting bad information from someone they trust too much or are not getting good information from someone they should trust more. When the mad scientist sees this, he uses the Trust 'o' Vac 2000 to weaken a strong trust by a little and strengthen a weak trust by a little. Thus, we try to slowly cut off the source of bad information and increase the source of good information going to the rich dudes.

Page 45: Introduction to Artificial Neural Networks


We next have to adjust the trust weights between the CEOs and the brokers. We do this by propagating error backwards: if a strong weight exists between a broker and a rich dude who is making bad purchases on a regular basis, then we can attribute that error to the broker. We can then make the rich dude trust this broker less, and also adjust the weights of trust between the broker and the fat cats in a similar way.

Page 46: Introduction to Artificial Neural Networks


The End

Thanks for your patience.