

The BackPropagation Network: Learning by Example

Copyright © Devin McAuley, 1997.

1. Introduction
2. Background *
   o Simple Learning Machines
   o Why a Hidden Layer?
   o The Credit Assignment Problem for Multilayer Networks
3. The BackPropagation Algorithm *
   o Feedforward Dynamics
   o Gradient Descent
   o Derivation of the BackProp Learning Rule
4. Simulation Issues
   o How to Select Initial Weights
   o Local Minima
   o Learning Rate and Momentum
   o When to Update Weights
   o The Development of Internal Representations
5. Making Predictions: The Jets and Sharks Revisited
6. References
7. Answers to Questions

* These sections contain some mathematics which can be omitted on a first reading if desired.

Introduction

In many real-world situations, we are faced with incomplete or noisy data, and it is important to be able to make reasonable predictions about what is missing from the information available. This can be an especially difficult task when there isn't a good theory available to help reconstruct the missing data. It is in such situations that Backpropagation (BackProp) networks may provide some answers.

A BackProp network consists of at least three layers of units: an input layer, at least one intermediate hidden layer, and an output layer (see Figure 1). In contrast to the IAC and Hopfield networks, connection weights in a BackProp network are one-way. Typically, units are connected in a feed-forward fashion, with input units fully connected to units in the hidden layer and hidden units fully connected to units in the output layer. When a BackProp network is cycled, an input pattern is propagated forward to the output units through the intervening input-to-hidden and hidden-to-output weights.


We can interpret the output of a BackProp network as a classification decision. In Chapter 3, we explored how the IAC network can be used to encode and retrieve information about members of two rival gangs (the Jets and the Sharks). For this problem, we could have instead used a BackProp network to make gang classification judgments (Jets OR Sharks) based on the characteristics of the gang members. A key difference between the BackProp and IAC approaches is learning. IAC networks have fixed weights, which means that for the Jets and Sharks problem, all of the knowledge about the gang members has to be hard-wired into the network. BackProp networks are not limited in this way because they can adapt their weights to acquire new knowledge. In this chapter, we will explore how BackProp networks learn by example and can be used to make predictions.

With BackProp networks, learning occurs during a training phase in which each input pattern in a training set is applied to the input units and then propagated forward. The pattern of activation arriving at the output layer is then compared with the correct (associated) output pattern to calculate an error signal. The error signal for each target output pattern is then backpropagated from the outputs to the inputs in order to appropriately adjust the weights in each layer of the network. After a BackProp network has learned the correct classification for a set of inputs, it can be tested on a second set of inputs to see how well it classifies untrained patterns. Thus, an important consideration in applying BackProp learning is how well the network generalizes.

Figure 1: Classification with a Backpropagation network

The task of the BackProp network shown in Figure 1 is to classify individuals as Jets or Sharks using their age, educational level, marital status, and occupation as clues to what gang they belong to. The training data for the network (Table 1) consists of 16 individuals (8 Jets and 8 Sharks). The network applied to this training data is composed of 12 binary inputs (representing the different characteristics of gang members), 4 hidden units, and 2 output units (Jets and Sharks). As you can see from the information in the table, Jets and Sharks gang members are either in their 20's, 30's, or 40's; have a junior high (J.H.), high school (H.S.), or college education; are single, married, or divorced; and work as pushers, burglars, or bookies. The network is trained to correctly classify the inputs as Jets or Sharks based on this information. For members of the Jets gang, the Jets unit has a target activation of 1.0 and the Sharks unit has a target activation of 0.0. On the other hand, for members of the Sharks gang, the Sharks unit has a target activation of 1.0 and the Jets unit has a target activation of 0.0.

Table 1: Members of the Jets and Sharks gangs (adapted from McClelland and Rumelhart, 1986).

Name       Gang    Age    Education  Marital Status  Occupation
Robin      Jets    30's   College    Single          Pusher
Bill       Jets    40's   College    Single          Pusher
Mike       Jets    20's   H.S.       Single          Pusher
Joan       Jets    20's   J.H.       Single          Pusher
Catherine  Jets    20's   College    Married         Pusher
John       Jets    20's   College    Divorced        Pusher
Joshua     Jets    20's   College    Single          Bookie
Bert       Jets    20's   College    Single          Burglar
Margaret   Sharks  30's   J.H.       Married         Bookie
Janet      Sharks  20's   J.H.       Married         Bookie
Alfred     Sharks  40's   H.S.       Married         Bookie
Gerry      Sharks  40's   College    Married         Bookie
Brett      Sharks  40's   J.H.       Single          Bookie
Sandra     Sharks  40's   J.H.       Divorced        Bookie
Beth       Sharks  40's   J.H.       Married         Pusher
Maria      Sharks  40's   J.H.       Married         Burglar
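The tutorial does not spell out how the 12 binary inputs encode the four attributes; a plausible sketch in Python, assuming one one-hot group of three units per attribute (the names and ordering below are my own, not from the text):

```python
# Hypothetical encoding of Table 1 into the 12 binary inputs:
# three one-hot units for each of the four attributes.
AGE     = ["20's", "30's", "40's"]
EDU     = ["J.H.", "H.S.", "College"]
MARITAL = ["Single", "Married", "Divorced"]
JOB     = ["Pusher", "Burglar", "Bookie"]

def encode(age, edu, marital, job):
    """Return a 12-element binary input vector for one gang member."""
    vec = []
    for value, group in ((age, AGE), (edu, EDU), (marital, MARITAL), (job, JOB)):
        vec += [1 if value == option else 0 for option in group]
    return vec

# Robin (Jets): 30's, College, Single, Pusher
print(encode("30's", "College", "Single", "Pusher"))
```

Exactly one unit in each group is on, so every input vector has four active units; the two target units (Jets and Sharks) take the 1.0/0.0 values described above.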

To load the network in Figure 1 into BrainWave, select the "Jets and Sharks (BackPropagation)" option from the Networks menu. If this option is not included in the Networks menu of your version of BrainWave, you can load the network into the workspace directly from a URL using the Open URL option in the File menu of the Simulator. The URL for the BackPropagation version of the Jets and Sharks example in this tutorial is:

http://psy.uq.edu.au/~brainwav/Manual/BPJetsAndSharks.net


After the Jets and Sharks network has been loaded, the workspace will contain two input sets and two output sets in addition to the network architecture and parameters. The leftmost input and output sets are used for training the network on the information in Table 1. Above the target output set is a global value labelled "Error" which monitors the total summed-squared error (TSS) of the network on the training set.

In order for a BackProp network to learn properly, you need to first randomize its weights. Don't worry, you don't have to do this by hand! To randomize the weights, select the "Randomize Weights and Biases" action in the Actions menu.

To cycle a BackProp network, select an input and an output set, and click the Cycle button. The input activations will then propagate forward to the output layer. To apply a specific pattern to the inputs, click on the label for that pattern in the respective input set, or change the input activations directly (by either toggling units using the toggle tool or selecting units and then editing their activation in the parameter panel).

Exercise 1: Randomize the weights and biases. Cycle the network on each of the input patterns in the training set and inspect the activations on the output units. For each individual, record whether that individual is classified as a Jet or a Shark by comparing the relative activation levels of the Jets and Sharks units.

As you can see by comparing the actual outputs with the correct outputs, the performance of the network is poor. This is because the network has not been trained yet.

Exercise 2: Does the untrained network have a classification bias?

Select the input and output sets that you would like the network to associate, and click on the Learn button. Note that the input and output sets only need to be selected if there is more than one of each. During learning, the weights of the network are adjusted so that the total error on the selected output set is reduced. You are now ready to explore learning in a BackProp network.

When applying BackProp learning, it is convenient to be able to monitor the progress of learning over time. One way to do this is to create a graph of the total error.

Exercise 3: Use the graph tool to graph the output set error. Change the maximum value that can be displayed in the graph from 1 to 10. Now, randomize the weights and biases of the Jets and Sharks network and then train it for 40 epochs. What is the error on the output set after 40 epochs?

Exercise 4: After the network has been trained for 40 epochs, re-test it on the patterns in the training set. How well can the network classify the gang members as Jets or Sharks at this stage? Do any of the associations seem more difficult to learn than others?

Exercise 5: Continue training the network until the total error is less than 0.4. Test the network on the patterns in the training set to confirm that it can make the correct classifications. What is the shape of the error curve (or learning curve) over the course of training?

In this introductory section, we have provided a gentle overview of training and testing BackProp networks in BrainWave, using the Jets and Sharks problem as an example. In the next section, we consider the pattern classification problem in more detail. As we shall discover, two-layer networks consisting of an input and an output layer have specific limitations in the types of problems they can solve. However, by adding an "extra" hidden layer, neural networks can potentially map any set of inputs to any set of outputs. Unfortunately, a serious problem arises concerning how to learn that mapping. The BackProp algorithm was developed by Rumelhart, Hinton and Williams (1986) to address the learning problem for multilayer networks. After having properly motivated BackProp, we examine the feedforward dynamics of a BackProp network and derive the BackProp learning algorithm. Next, we consider some simulation issues in applying BackProp networks to problems. To conclude the tutorial, we return to the Jets and Sharks example to address the question of how a BackProp network can be used to make predictions.

Background

Consider a simple neural network made up of two inputs connected to a single output unit (Figure 2). The output of the network is determined by calculating a weighted sum of its two inputs and comparing this value with a threshold θ. If the net input (net) is greater than the threshold, the output is 1; otherwise it is 0. Mathematically, we can summarize the computation performed by the output unit as follows:

net = w1·I1 + w2·I2
if net > θ then o = 1, otherwise o = 0.

Suppose that the output unit performs a logical AND operation on its two inputs (shown in Figure 2). One way to think about the AND operation is that it is a classification decision. We can imagine that all Jets and Sharks gang members can be identified on the basis of two characteristics: their marital status (single or married) and their occupation (pusher or bookie). We can present this information to our simple network as a 2-dimensional binary input vector, where the first element of the vector indicates marital status (single = 0 / married = 1) and the second element indicates occupation (pusher = 0 / bookie = 1). At the output, the Jets gang members comprise "class 0" and the Sharks gang members comprise "class 1". By applying the AND operator to the inputs, we classify an individual as a member of the Sharks gang only if they are both married AND a bookie; i.e., the output is 1 only when both of the inputs are 1.


Figure 2: A simple two-layer network applied to the AND problem

The AND function is easy to implement in our simple network. Based on the network equations, there are four inequalities that must be satisfied:

w1·0 + w2·0 < θ
w1·0 + w2·1 < θ
w1·1 + w2·0 < θ
w1·1 + w2·1 > θ

Here's one possible solution. If both weights are set to 1 and the threshold θ is set to 1.5, then

(1)(0) + (1)(0) < 1.5 ==> o = 0
(1)(0) + (1)(1) < 1.5 ==> o = 0
(1)(1) + (1)(0) < 1.5 ==> o = 0
(1)(1) + (1)(1) > 1.5 ==> o = 1
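The computation above can be checked in a few lines of Python (a sketch of the threshold unit, using the weights and threshold just given):

```python
def threshold_unit(i1, i2, w1=1.0, w2=1.0, theta=1.5):
    """Two-input threshold unit: output 1 iff the net input exceeds theta."""
    net = w1 * i1 + w2 * i2
    return 1 if net > theta else 0

for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, "->", threshold_unit(i1, i2))
# Only the (1, 1) input exceeds the 1.5 threshold, reproducing AND.
```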

Although it is straightforward to explicitly calculate a solution to the AND problem, an obvious question concerns how the network might learn such a solution. That is, given random values for the weights, can we define an incremental procedure that will converge to a set of weights which implements AND?

Simple Learning Machines

One of the earliest learning networks was proposed by Rosenblatt in the late 1950's. The task of Rosenblatt's "perceptron" was to discover a set of connection weights which correctly classified a set of binary input vectors.

The basic architecture of the perceptron is similar to the simple AND network in the previous example (Figure 2). It consists of a set of input units and a single output unit. As in the simple AND network, the output of the perceptron is calculated by comparing the net input

net = Σi wiIi

with a threshold θ: if the net input is greater than the threshold, the output unit is turned on; otherwise it is turned off.


To address the learning question, Rosenblatt solved two problems. First, it was necessary to define a cost function which measured error. Second, it was necessary to define a procedure which reduced that error by appropriately adjusting each of the weights in the network. This requires a procedure (or learning rule) which assesses the relative contribution of each weight to the total error. Rosenblatt decided to measure error as the difference between the actual output of the network and the target output (0 or 1). The learning rule that Rosenblatt developed based on this error measure can be summarized as follows.

If the input vector is correctly classified (zero error), then the weights are left unchanged, and the next input vector is presented. However, when the input vector is incorrectly classified, there are two cases to consider. If the output unit is 1 when the correct classification is 0, then the threshold is incremented by 1 to make it less likely that the output unit will be turned on if the same input vector is presented again. If the input Ii is 0, then the corresponding weight wi is left unchanged. However, when the input Ii is 1, the corresponding weight is decreased by 1 so that the next time the input vector is presented, that weight is less likely to contribute to a false classification. On the other hand, when the output is turned off when it should be turned on, the opposite changes are made.

Formally, the perceptron learning rule is defined by two equations. The change in the threshold is given by

Δθ = -(tp - op) = -dp

where p specifies the presented input pattern, tp specifies the correct classification of the input pattern, and dp is the difference between the target and actual outputs. The change in the weights is given by

Δwi = (tp - op)Ipi = dpIpi
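The two update equations can be turned into a short training loop. The sketch below is my own (the pattern ordering and the stopping rule are illustrative choices, not from the text):

```python
def train_perceptron(patterns, w, theta, max_epochs=100):
    """patterns: list of (inputs, target). Returns the learned (w, theta)."""
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in patterns:
            net = sum(wi * xi for wi, xi in zip(w, inputs))
            output = 1 if net > theta else 0
            d = target - output                      # d_p = t_p - o_p
            if d != 0:
                mistakes += 1
                theta -= d                           # delta-theta = -d_p
                w = [wi + d * xi for wi, xi in zip(w, inputs)]
        if mistakes == 0:                            # converged: every pattern correct
            break
    return w, theta

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, theta = train_perceptron(AND, w=[-0.5, 0.5], theta=1.5)
print(w, theta)
```

Because AND is learnable by a perceptron, the loop settles on a weight pair and threshold that satisfy all four AND inequalities.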

Exercise 6: Apply the perceptron learning rule to solve the AND problem for w1 = -0.5, w2 = 0.5, and θ = 1.5. Record the weights and threshold after each step of learning, applying the input patterns in the same order as in Figure 2. Make a table of the results. After one step of learning, your table should look like:

Input1  Input2  Target  Output  Weight1  Weight2  Threshold
1       1       1       0       -0.5     0.5      1.5
1       0       0       0       0.5      1.5      0.5

How many steps are required to solve the AND problem?

A remarkable property of the perceptron learning rule is that it is always able to discover a set of weights that correctly classifies its inputs, given that such a set of weights exists. This important result, known as the Perceptron Convergence Theorem (which extends to networks with multiple outputs), fueled much of the early interest in neural networks. Unfortunately, there are limitations to the types of problems that can be solved by a perceptron.


Why a Hidden Layer?

The limitations of the perceptron were documented by Minsky and Papert in their book Perceptrons (Minsky and Papert, 1969). The now classic example of a simple function that cannot be computed by a perceptron (or any two-layer network) is the exclusive-or (XOR) problem (Figure 3).

Figure 3: A simple two-layer network applied to the XOR problem.

A verbal description of the XOR problem (Figure 3) is that the output must be turned on when either of the inputs is turned on, but not when both are turned on. Four inequalities must be satisfied for the perceptron to solve this problem (as was the case for AND):

(0·w1) + (0·w2) < θ  ==>  0 < θ
(0·w1) + (1·w2) > θ  ==>  w2 > θ
(1·w1) + (0·w2) > θ  ==>  w1 > θ
(1·w1) + (1·w2) < θ  ==>  w1 + w2 < θ

However, this is obviously not possible, since both w1 and w2 would have to be greater than θ while their sum w1 + w2 is less than θ. There is an elegant geometric description of the types of problems that can be solved by a perceptron.

Consider three two-dimensional problems: AND, OR, and XOR. If we represent these problems geometrically (Figure 4), the binary input patterns form the vertices of a square. The input patterns 00 and 10 form the bottom left and top left corners of the square, and the patterns 01 and 11 form the bottom right and top right corners of the square.


Figure 4: Geometric representation of AND, OR, and XOR.

From a geometric perspective, the perceptron attempts to solve the AND, OR, and XOR problems by using a straight line to separate the two classes: inputs labelled "0" are on one side of the line and inputs labelled "1" are on the other side. For the AND and OR functions this is easy to do, whereas for XOR it is not. The line separating the two classes is determined by net = θ. For two-dimensional problems such as AND, OR, and XOR, the line corresponds to

(I1·w1) + (I2·w2) = θ

Solving for I2 yields

I2 = -(w1/w2)I1 + (θ/w2).

In higher dimensions, the boundary separating the two classes is a hyperplane:

net = Σi wiIi = θ.

All problems which can be solved by separating the two classes with a hyperplane are called linearly separable. The XOR problem (as presented) is not a linearly separable problem.
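Linear separability can be illustrated with a brute-force search, a sketch of my own (not from the text): try every line net = w1·I1 + w2·I2 = θ over a small grid of weights and thresholds and see whether any of them classifies all four patterns correctly.

```python
import itertools

PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def separable(targets):
    """True if some (w1, w2, theta) on a coarse grid classifies all four patterns."""
    grid = [x / 2 for x in range(-10, 11)]           # -5.0, -4.5, ..., 5.0
    for w1, w2, theta in itertools.product(grid, repeat=3):
        outputs = [1 if w1 * i1 + w2 * i2 > theta else 0 for i1, i2 in PATTERNS]
        if outputs == targets:
            return True
    return False

print(separable([0, 0, 0, 1]))   # AND: a separating line exists (e.g. w1 = w2 = 1, theta = 1.5)
print(separable([0, 1, 1, 0]))   # XOR: no line works, on this grid or any other
```

The grid here is coarse, but for XOR the search would fail on any grid, since the four inequalities above are jointly unsatisfiable.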

However, we can recode the XOR problem in three dimensions so thatit becomes linearly separable (Figure 5).


Figure 5: Three-dimensional version of the XOR problem.

In the three-dimensional version, the first two inputs are exactly the same as in the original XOR, and the third input is the AND of the first two. That is, input3 is on only when input1 and input2 are both on. By adding the appropriate extra feature, it is possible to separate the two classes with the resulting plane. Instead of recoding the input representation, another way to make a problem linearly separable is to add an extra (hidden) layer between the inputs and the outputs. Given a sufficient number of hidden units, it is possible to recode any unsolvable problem into a linearly separable one.

Exercise 7: Assuming θ = 1, find a set of weights which solves the 3D version of the XOR problem.

The Credit Assignment Problem for Multilayer Networks

In the three-dimensional version of the XOR network, we showed that a predetermined extra input feature makes the XOR problem linearly separable. Consider what happens when we instead add an extra layer to the network. The hidden layer provides a pool of units from which features (which help distinguish the inputs) can potentially develop. However, the outstanding question concerns how to learn those features. That is, how can a network develop an internal representation of a pattern? Since there are no target activations for the hidden units, the perceptron learning rule does not extend to multilayer networks. The problem of how to train the hidden-unit weights is an acute problem of credit assignment: how can we determine the extent to which hidden-unit weights contribute to the error at the output when there is no direct error signal for these units? The BackProp algorithm provides a solution to this credit assignment problem.

The Backpropagation Algorithm

The Backpropagation algorithm was first proposed by Paul Werbos in the 1970's. However, it wasn't until it was rediscovered in 1986 by Rumelhart, Hinton, and Williams that BackProp became widely used.


Feedforward Dynamics

When a BackProp network is cycled, the activations of the input units are propagated forward to the output layer through the connecting weights. As in the perceptron, the net input to a unit is determined by the weighted sum of its inputs:

netj = Σi wjiai

where ai is the input activation from unit i and wji is the weight connecting unit i to unit j. However, instead of calculating a binary output, the net input is added to the unit's bias and the resulting value is passed through a sigmoid function:

F(netj) = 1/(1 + e^-(netj + βj)).

The bias term βj plays the same role as the threshold in the perceptron. But unlike the binary output of the perceptron, the output of a sigmoid is a continuous real value between 0 and 1. The sigmoid function is sometimes called a "squashing" function because it maps its inputs onto a fixed range.
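As a quick sketch in Python (the function and argument names are mine):

```python
import math

def sigmoid(net, bias=0.0):
    """Squashing function: maps any real net input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(net + bias)))

print(round(sigmoid(5.0), 3))    # large positive net input: output near 1
print(round(sigmoid(-5.0), 3))   # large negative net input: output near 0
```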

Figure 6: Sigmoid Activation Function

Exercise 8: Calculate the output of the sigmoid function when the net input and bias are zero.

Gradient Descent

Learning in a backpropagation network takes place in two steps. First, each pattern Ip is presented to the network and propagated forward to the output. Second, a method called gradient descent is used to minimize the total error on the patterns in the training set. In gradient descent, weights are changed in proportion to the negative of the error derivative with respect to each weight:

Δwji = -ε [∂E/∂wji]

Weights move in the direction of steepest descent on the error surface defined by the total error (summed across patterns):


E = ½ Σp Σj (tpj - opj)²

where opj is the activation of output unit uj in response to pattern p and tpj is the target output value for unit uj.

Figure 7 illustrates the concept of gradient descent using a single weight. After the error on each pattern is computed, each weight is adjusted in proportion to the calculated error gradient backpropagated from the outputs to the inputs. The changes in the weights reduce the overall error in the network.
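The error measure E can be written down directly (a sketch; the function name is mine, and I assume targets and outputs are given per pattern as lists of output-unit values):

```python
def total_error(targets, outputs):
    """E = 1/2 * sum over patterns p and output units j of (t_pj - o_pj)^2."""
    return 0.5 * sum((t - o) ** 2
                     for t_vec, o_vec in zip(targets, outputs)
                     for t, o in zip(t_vec, o_vec))

# Two patterns, two output units each (e.g. the Jets and Sharks units):
# mathematically 0.5 * (0.04 + 0.01 + 0.09 + 0.01) = 0.075
print(total_error([[1.0, 0.0], [0.0, 1.0]],
                  [[0.8, 0.1], [0.3, 0.9]]))
```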

Figure 7: Typical curve showing the relationship between overall error and changes in a single weight in a network (reproduced from McClelland and Rumelhart, 1988).

Derivation of the BackProp Learning Rule

In order to derive the BackProp learning rule, we use the chain rule to rewrite the error gradient for each pattern as the product of two partial derivatives. The first partial derivative reflects the change in error as a function of the net input; the second reflects the effect of a weight change on the net input. Thus, the error gradient becomes

∂Ep/∂wji = [∂Ep/∂netpj] [∂netpj/∂wji]

Since we know the equation for the net input to a unit, we can calculate the second partial derivative directly:

∂netpj/∂wji = ∂(Σk wjkopk)/∂wji = opi.

We will call the negative of the first partial derivative the error signal:

dpj = -∂Ep/∂netpj.

Thus, the appropriate change in the weight wji with respect to the error Ep can be written as

Δpwji = ε dpjopi

In this equation, the parameter ε is called the learning rate.

The next (and final) task in the derivation of the BackProp learning rule is to determine what dpj should be for each unit in the network. It turns out that there is a simple recursive computation of the d terms which can be calculated by backpropagating the error from the outputs to the inputs. However, to compute dpj we need to again apply the chain rule.

The error signal dpj can be rewritten as the product of two partial derivatives:

dpj = -(∂Ep/∂opj)(∂opj/∂netpj)

Consider the calculation of the second factor first. Since

opj = f(netpj),

we have

∂opj/∂netpj = f'(netpj).

The derivative of the sigmoid function has an elegant derivation.

Exercise 9: Demonstrate that the derivative of the sigmoid activation function with respect to the net input simplifies to f(netpj)(1 - f(netpj)).

To calculate the first partial derivative, there are two cases to consider.

Case 1: Suppose that uj is an output unit of the network. Then it follows directly from the definition of Ep that

∂Ep/∂opj = -(tpj - opj)

If we substitute this back into the equation for dpj, we obtain

dpj = (tpj - opj)f'(netpj)

Case 2: Suppose that uj is not an output unit of the network. Then we again use the chain rule to write

∂Ep/∂opj = Σk [∂Ep/∂netpk] [∂netpk/∂opj]
         = Σk [∂Ep/∂netpk] [∂(Σi wkiopi)/∂opj]
         = Σk [∂Ep/∂netpk] wkj
         = -Σk dpk wkj

so that

dpj = f'(netpj) Σk dpk wkj

Combined, cases 1 and 2 provide a recursive procedure for computingd pj for all units in the network which can then be used to update itsweights.

The backpropagation learning algorithm can be summarized as follows. During learning, the weight on each connection is changed by an amount that is proportional to an error signal d. Using gradient descent on the error Ep, the weight change for the weight connecting unit ui to unit uj is given by

Δpwji = ε dpjapi

where ε is the learning rate and api is the activation of unit i. When uj is an output unit, the error signal dpj is given by case 1 (the base case). Otherwise, dpj is determined recursively according to case 2.
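To make the summary concrete, here is a compact end-to-end sketch. It is not BrainWave's implementation: the 2-2-1 architecture, learning rate, epoch count, and random-restart loop are all illustrative choices of mine. It trains a BackProp network on XOR, the problem that motivated the hidden layer.

```python
import math
import random

DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]   # XOR

def f(x):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def train(seed, n_hidden=2, lr=0.5, epochs=5000):
    rng = random.Random(seed)
    # Small random initial weights and biases (see "How to Select Initial Weights").
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(n_hidden)]
    b_h = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b_o = rng.uniform(-0.5, 0.5)

    def forward(x):
        h = [f(sum(w_h[j][i] * x[i] for i in range(2)) + b_h[j])
             for j in range(n_hidden)]
        o = f(sum(w_o[j] * h[j] for j in range(n_hidden)) + b_o)
        return h, o

    for _ in range(epochs):
        for x, t in DATA:
            h, o = forward(x)
            d_o = (t - o) * o * (1 - o)                  # case 1: output unit
            d_h = [h[j] * (1 - h[j]) * d_o * w_o[j]      # case 2: hidden units
                   for j in range(n_hidden)]
            for j in range(n_hidden):                    # delta-w = lr * d * activation
                w_o[j] += lr * d_o * h[j]
                b_h[j] += lr * d_h[j]
                for i in range(2):
                    w_h[j][i] += lr * d_h[j] * x[i]
            b_o += lr * d_o
    tss = 0.5 * sum((t - forward(x)[1]) ** 2 for x, t in DATA)
    return tss, forward

# Gradient descent can get stuck in a local minimum (see "Local Minima" below),
# so rerandomize the weights and retrain until the total error is small.
for seed in range(20):
    tss, net = train(seed)
    if tss < 0.05:
        break

for x, t in DATA:
    print(x, "->", round(net(x)[1], 2), "(target", str(t) + ")")
```

The weight updates follow the rule derived above (learning rate times error signal times the sending unit's activation), and the restart loop previews the local-minima issue discussed in the next section.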

Simulation Issues

How to Select Initial Weights

Now that we have waded through all of the details of the backpropagation learning equations, let us consider how we should choose the initial weights for our network. Suppose that all of the weights start out at equal values. If the solution to the problem requires that the network learn unequal weights, then having equal weights to start with will prevent the network from learning. To see why this is the case, recall that the error backpropagated through the network is proportional to the value of the weights. If all the weights are the same, then the backpropagated errors will be the same, and consequently all of the weights will be updated by the same amount. To avoid this symmetry problem, the initial weights of the network should be unequal.

Another issue to consider is the magnitude of the initial weights. When the "Randomize Weights and Biases" action is selected from the Actions menu in BrainWave, all of the network weights and biases are initialized to small random values. What advantage is there in randomizing the weights to small values? Reconsider the derivative of the sigmoid activation function. In Exercise 9, we showed that the derivative of the sigmoid is

f'(netpj) = f(netpj)(1 - f(netpj))

Since the weight updates in the BackProp algorithm are proportional to this derivative, it is important to consider how the net input affects its value. The derivative is a maximum when f(netpj) is equal to 0.5 and approaches its minimum as f(netpj) approaches 0 or 1. Thus, weights will be changed most for unit activations closest to 0.5 (the steepest portion of the sigmoid). Once a unit's activation becomes close to 0 or 1, it is committed to being on or off, and its associated weights will change very little. It is therefore important to select small initial weights so that all of the units are uncommitted (having activations that are all close to 0.5, the point of maximal weight change).

Local Minima

Since backpropagation uses a gradient-descent procedure, a BackProp network follows the contour of an error surface, with weight updates moving it in the direction of steepest descent. For simple two-layer networks (without a hidden layer), the error surface is bowl-shaped, and using gradient descent to minimize error is not a problem; the network will always find an errorless solution (at the bottom of the bowl). Such errorless solutions are called global minima. However, when an extra hidden layer is added to solve more difficult problems, the possibility arises of complex error surfaces which contain many minima. Since some minima are deeper than others, it is possible that gradient descent will not find a global minimum. Instead, the network may fall into local minima which represent suboptimal solutions.

Obviously, we would like to avoid local minima when training a BackProp network. In some cases this may be difficult to do. However, in practice it is important to try to assess how frequently and under what conditions local minima occur, and to examine possible strategies for avoiding them. As a general rule of thumb, the more hidden units you have in a network, the less likely you are to encounter a local minimum during training. Although additional hidden units increase the complexity of the error surface, the extra dimensionality also increases the number of possible escape routes.


The concept of a local minimum can be demonstrated with a very simple network consisting of one input unit, one hidden unit, and one output unit (Figure 8). The task of this network, called the 1:1:1 network by McClelland and Rumelhart (1988), is simply to copy the value on the input unit to the output unit.

Figure 8: The 1:1:1 Network (McClelland and Rumelhart, 1988)

The following exercises explore the concept of local minima using the 1:1:1 network. Even in this simple network, it is possible to have multiple minima and get stuck with a subpar solution.

Exercise 10: Select the "New BackPropagation Network" option from the Networks menu to clear the workspace and load the BackPropagation equations. Create the 1:1:1 network shown in Figure 8 and attach global values to each of the weights to monitor their values.

Exercise 11: There are two pairs of input and output patterns in the training set for the copy task (see the input and output sets in Figure 8). Create two sets. The first set (labelled input set) should contain the input unit, and the second set (labelled output set) should contain the output unit. Make sure that the units that you want to be in a set are selected before you create the set. Note: the preferred method for creating input and output sets is through the Actions menu. The "Create Empty Input/Output Set" actions will appropriately label the two sets. Add the 0->0 and 1->1 input/output pairs to the two sets. Use the value tool to attach a global value (labelled "Error") to the output set to monitor the total summed squared error (TSS).

Exercise 12: Once you have created the 1:1:1 network, you are ready to train it. To randomize the network weights and biases, select the "Randomize Weights and Biases" action from the Actions menu. Now train the network for 200 epochs and record the results under the heading "Simulation 1" (constructing a table similar to the one shown below). Record the weight values w1 and w2, the bias values for the hidden and output units, the final error, and whether the network finds


a global or local minimum. Finally, describe how your network solved the copy task in this simulation.

Simulation   Minima   Error   w1     w2    bias1   bias2
1.           Global   0.012   4.75   5.9   -2.31   -2.89

Exercise 13: For the copy task, it turns out that there are two global minima. Repeat Exercise 12 (possibly more than once) to find an alternative solution. Record the results for the second solution in your simulation table.

So far, we haven't found any local minima in the 1:1:1 network. However, if we fix (freeze) the biases on the hidden and output units at zero, we change the system so that instead of two global minima, it now has one local and one global minimum (neither of which is an errorless solution).

Exercise 14: Set the biases on the hidden and output units to zero, and then set the BiasFrozen parameter for these units to true (note that BrainWave does not allow you to change the bias value and freeze it at the same time). Now, rerun Exercise 12 to find the local and global minima in the frozen-bias version of the 1:1:1 network. Record your results in the table as Simulations 3 and 4.
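Why no errorless solution exists once the biases are frozen can be seen by evaluating the error surface directly. The sketch below (stand-alone Python, not BrainWave code; the grid range and step are arbitrary choices) computes the TSS of a bias-free 1:1:1 network over a grid of weight values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tss(w1, w2):
    # bias-free 1:1:1 network:
    #   hidden = sigmoid(w1 * input), output = sigmoid(w2 * hidden)
    # copy task: input 0 -> target 0, input 1 -> target 1
    err = 0.0
    for inp, target in [(0.0, 0.0), (1.0, 1.0)]:
        hidden = sigmoid(w1 * inp)
        output = sigmoid(w2 * hidden)
        err += (target - output) ** 2
    return err

# scan weight space from -10 to 10 in steps of 0.5: the best achievable
# error stays above 0.25, so every minimum of the frozen-bias network
# is a suboptimal solution
grid = [w / 2.0 for w in range(-20, 21)]
best = min(tss(w1, w2) for w1 in grid for w2 in grid)
```

The reason is that with the input clamped to 0 the hidden unit is stuck at 0.5, so the output unit cannot simultaneously produce a small value for input 0 and a large value for input 1 without a bias to shift its operating point.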

Learning Rate and Momentum

The backpropagation algorithm developed in this chapter only requires that the weight changes be proportional to the derivative of the error. The larger the learning rate, the larger the weight changes on each epoch, and the quicker the network learns. However, the size of the learning rate can also influence whether the network achieves a stable solution. If the learning rate gets too large, then the weight changes no longer approximate a gradient-descent procedure (true gradient descent requires infinitesimal steps). Oscillation of the weights is often the result.

Ideally, then, we would like to use the largest learning rate possible without triggering oscillation. This would offer the most rapid learning and the least amount of time spent waiting at the computer for the network to train. One method that has been proposed is a slight modification of the backpropagation algorithm so that it includes a momentum term. Applied to backpropagation, the concept of momentum is that previous changes in the weights should influence the current direction of movement in weight space. This concept is implemented by the revised weight-update rule:

Δw_ji(n+1) = η δ_pj a_pi + α Δw_ji(n)

where the subscript n indexes the learning epoch, η is the learning rate, and α is the momentum parameter. With momentum, once the weights start moving in a particular direction in weight space, they tend to continue moving in that direction. Imagine a ball rolling down a hill that gets stuck in a depression halfway down the hill. If the ball has enough momentum, it will be able to roll through the depression and continue down the hill. Similarly, when applied to weights in a network,


momentum can help the network "roll past" a local minimum, as well as speed learning (especially along long, flat error surfaces).

In BrainWave, the default learning rate is 0.25 and the default momentum parameter is 0.9. When applying backpropagation to a range of problems, you will often want to use much smaller values than these. For especially difficult tasks, a learning rate of 0.01 is not uncommon.
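The effect of the momentum term on a long, flat stretch of the error surface can be seen in a short sketch of the update rule itself (the standard rule from the equation above, not BrainWave's internal code): when the gradient stays roughly constant, the step size accumulates toward lr / (1 - momentum).

```python
def momentum_update(prev_delta, gradient, lr=0.25, momentum=0.9):
    # revised weight-update rule:
    #   new step = learning rate * gradient + momentum * previous step
    return lr * gradient + momentum * prev_delta

# along a flat error surface the gradient is roughly constant,
# so the step grows geometrically toward lr / (1 - momentum)
delta = 0.0
for _ in range(100):
    delta = momentum_update(delta, 1.0)
# with the BrainWave defaults this limit is 0.25 / (1 - 0.9) = 2.5,
# ten times the size of a single plain gradient step
```

This is why removing momentum (setting it to zero) makes learning conspicuously slower on problems with long shallow valleys in the error surface.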

When to Update Weights

Updating the weights in a BackProp network can be done either after the presentation of each pattern (pattern learning), or after all of the patterns in the training set have been presented (epoch learning). Weight updates in BrainWave are by pattern. If the learning rate is small, there is little difference between the two procedures. However, substantial differences can be observed when the learning rate is large, as the derivation of the backpropagation algorithm assumes that the error derivatives are summed over all of the patterns.
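The two procedures can be contrasted with a single linear unit trained on two patterns (a hypothetical stand-alone example, not the BrainWave simulator): with a small learning rate they end up in nearly the same place.

```python
# two (input, target) pairs for a single linear unit y = w * x
patterns = [(1.0, 2.0), (2.0, 3.0)]

def train_by_pattern(lr=0.01, epochs=2000):
    w = 0.0
    for _ in range(epochs):
        for x, t in patterns:              # update after each pattern
            w += lr * (t - w * x) * x
    return w

def train_by_epoch(lr=0.01, epochs=2000):
    w = 0.0
    for _ in range(epochs):
        grad = sum((t - w * x) * x for x, t in patterns)
        w += lr * grad                     # one update per pass through the set
    return w

# with a small learning rate both procedures settle close to the
# least-squares solution w = 1.6; with a large one, the per-pattern
# version wanders because each update chases a single pattern
```

Epoch learning follows the derivation exactly (the error derivatives are summed over all patterns before a step is taken); pattern learning only approximates it, which is harmless when each step is tiny.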

The Development of Internal Representations

Let's now explore how a BackProp network learns to solve the XOR problem. To load the XOR problem into BrainWave, select the "Exclusive-Or Problem (BackPropagation)" option in the Networks menu. After loading, the workspace should look like Figure 9. This is a slightly fancier version of the XOR problem in presentation only. Here, the network must learn to correctly associate different binary combinations of "odor" and "light" stimuli with an expectation for food or an electric shock. Analogous to XOR, food can be expected when there is a light but no odor, or an odor but no light. A shock is to be expected when there is no odor and no light, or both an odor and a light.

Figure 9: The XOR Network


Exercise 15: Randomize the weights and biases, and record the output for each of the input patterns. Explain why the output and hidden unit activations are all so similar. Hint: In your explanation, refer to the strengths of the connection weights and how they affect the dynamics of the sigmoid activation function.

Exercise 16: Create a graph of the total error and train the network for 100 epochs. What is the TSS error after 100 epochs? Continue training the network in 100-epoch steps until the TSS is less than 0.05. Describe the learning curve of the network over the course of training and record the total number of epochs. Note: it is possible that during training your network will get stuck in a local minimum. If this happens, re-randomize the weights and start again.

When the network successfully learns the XOR task, you should notice that it takes a relatively long time to do so. In fact, it may seem as though the network learns very little during the first several hundred epochs of training, but then accelerates its learning very quickly.

BackProp learning of XOR can be conceptualized as a two-stage process. In the first (slow) stage, the network learns to recode the XOR problem so that it is easier to solve. In this slow stage of learning, the network is developing an internal representation of the XOR problem that is linearly separable. Once the XOR problem has been successfully recoded, learning proceeds quickly.

Exercise 17: Record the output and hidden unit activations of the trained network for each of the input patterns. How has the network internally represented the XOR problem?

There are different ways to recode the XOR problem so that it is linearly separable. The solution that the network finds depends on the precise values of the initial weights and biases. So far, we have not considered the effects of changing the learning rate and momentum parameters.

Exercise 18: Try retraining the network with a learning rate of 1.0. Do you see an appreciable change in the speed of learning? What are some possible bad side effects of increasing the learning rate?

Exercise 19: Now try retraining the network after setting the momentum to zero. You will notice that without momentum, BackProp is much slower to solve the XOR problem. What feature(s) of the error surface might explain this?
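The training runs in the exercises above can be approximated outside BrainWave with a small NumPy sketch. Everything here is an illustrative assumption except the 2-2-1 architecture and the default learning rate (0.25) and momentum (0.9) from the text; in particular, this sketch updates weights once per epoch rather than per pattern as BrainWave does, and the seed and epoch count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR training set: (odor, light) -> food (1) vs. shock (0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# small random initial weights; two input, two hidden, one output unit
W1 = rng.uniform(-0.5, 0.5, (2, 2)); b1 = rng.uniform(-0.5, 0.5, 2)
W2 = rng.uniform(-0.5, 0.5, (2, 1)); b2 = rng.uniform(-0.5, 0.5, 1)

lr, mom = 0.25, 0.9                       # BrainWave's defaults
dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)

errors = []
for epoch in range(3000):
    H = sigmoid(X @ W1 + b1)              # forward pass: hidden layer
    O = sigmoid(H @ W2 + b2)              # forward pass: output layer
    errors.append(float(((T - O) ** 2).sum()))    # TSS this epoch
    d_out = (T - O) * O * (1 - O)         # output deltas
    d_hid = (d_out @ W2.T) * H * (1 - H)  # backpropagated hidden deltas
    dW2 = lr * H.T @ d_out + mom * dW2;  db2 = lr * d_out.sum(0) + mom * db2
    dW1 = lr * X.T @ d_hid + mom * dW1;  db1 = lr * d_hid.sum(0) + mom * db1
    W2 += dW2; b2 += db2; W1 += dW1; b1 += db1
# plotting `errors` typically shows the two-stage shape described above:
# a long slow phase while the hidden layer recodes the problem, then a
# rapid drop once the recoded problem is linearly separable
```

Because a run can still land in a local minimum, only the overall downward trend of the error is guaranteed here, not convergence below any particular TSS.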

Making Predictions: The Jets and Sharks Revisited

Now that you have a more in-depth understanding of how BackProp networks learn by example, let's return again to the Jets and Sharks example to address the question of generalization. To do this, we need to examine the ability of a BackProp network to classify input patterns that are not in the training set. If we are confident of a network's


ability to correctly classify novel instances, we can use the network to make classification predictions. In the introductory section, we trained a BackProp network to make gang classification judgments. Here, we test its ability to make such judgments for individuals that are not in the training set.

Exercise 20: Load the BackProp version of the Jets and Sharks example as before, and train it for 40 epochs (repeating Exercise 1).

It is easy to imagine a scenario where we have been contracted to develop a reliable method which helps police identify gang members and which gang they belong to. Suppose that, to help us devise a method for predicting gang membership, the police have given us access to their database of known gang members, and that this database is exactly the data set that we trained the network on. Consider Table 2, which contains data for four individuals not in the training set. We can imagine that these individuals have recently been arrested and that we have access to their age, education, marital status, and occupation, but not to what gang they belong to. However, in order to help with interrogation, the police need to make an educated guess about whether each suspect is a Jet or a Shark. How do we know whether our network will be a reliable predictor for the police? A simple answer is that we can't know for sure.

Name       Gang   Age    Education   Marital Status   Occupation
George     ?      20's   College     Single           Pusher
Linda      ?      40's   J.H.        Married          Bookie
Bob        ?      30's   H.S.        Divorced         Burglar
Michelle   ?      40's   College     Married          Pusher

Table 2: Jets or Sharks?

However, by using the relative activations of the Jets and Sharks units, we can make both strong and weak predictions. When the Jet activation is large (near 1.0) and the Shark activation is small (near 0.0), the network is making a strong prediction that the suspect is a Jet, whereas the reverse prediction would be made when the Jet activation is small and the Shark activation is large. In some cases, the network may be unable to make a strong prediction either way (e.g., when both the Jets and Sharks activations are near 0.5).

Each new piece of information added to the database can be compared with the network's predictions to evaluate how well it generalizes. Demonstrating that the network can successfully generalize suggests that it is likely to be a reliable predictor. And with each mistake, we can hopefully use that piece of information to improve the network's performance in future classifications.
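The decision rule just described can be written down explicitly. The numeric thresholds below (0.8 and 0.2 for a strong prediction) are illustrative assumptions of this sketch, not values given in the tutorial:

```python
def classify(jet_act, shark_act, strong=0.8, weak=0.2):
    # strong prediction: the winning unit is near 1.0 AND the loser is
    # near 0.0 (thresholds here are illustrative assumptions);
    # anything less decisive -- including both activations near 0.5 --
    # comes out as only a weak call
    if jet_act > shark_act:
        strength = "Strong" if jet_act >= strong and shark_act <= weak else "Weak"
        return strength + " Jet"
    strength = "Strong" if shark_act >= strong and jet_act <= weak else "Weak"
    return strength + " Shark"
```

For example, classify(0.95, 0.05) yields a strong Jet prediction, while classify(0.45, 0.55), where neither unit dominates, yields only a weak one.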

The test suite of patterns included in the BrainWave workspace contains the four individuals in Table 2 plus separate test patterns for each characteristic. For example, the 20's input pattern has the 20's unit turned on and all of the rest of the input units turned off. A second output set was included with each output labelled "****" to


indicate that the correct classification for the test patterns is not known.

Exercise 21: Test the network on George, Linda, Bob, and Michelle. For each, record the output activations in a table and label the network's predictions as "Strong Jet", "Weak Jet", "Strong Shark", or "Weak Shark".

Exercise 22: Use the network to make predictions for the 12 remaining test patterns. The output activation for these patterns is a prediction that an individual is a Jet or a Shark based on knowing only a single piece of information (i.e., what their age OR educational level OR marital status OR occupation is, but not more than one of these).

Exercise 23: Based on the results from Exercise 21, explain why strong predictions are made for George and Linda. Suppose that we discover that George is a Shark. How does this new piece of information affect your confidence in the model? Would the network be able to easily learn this new piece of data?

Exercise 24: Similarly, explain why weak predictions are made for Bob and Michelle. Would it be more difficult for the network to learn that Bob is a Jet or that Michelle is a Jet?

This concludes the tutorial on BackProp networks. However, if you would like to explore the BackProp network further, there are several additional examples included in the Networks menu.

References

McClelland, J. L., & Rumelhart, D. E. (Eds.). (1988). Explorations in parallel distributed processing: A handbook of models, programs, and exercises. Cambridge, MA: MIT Press.

Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986). Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1). Cambridge, MA: MIT Press.

The BrainWave homepage.