back prop theory

Upload: neha-baranwal

Post on 03-Jun-2018


TRANSCRIPT


Theory of the Backpropagation Neural Network

Robert Hecht-Nielsen

HNC, Inc., 5501 Oberlin Drive, San Diego, CA 92121, 619-546-8877
and
Department of Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA 92139

Abstract
Backpropagation is currently the most widely applied neural network architecture. The information processing operation that it carries out is the approximation of a mapping or function f : A ⊂ R^n → R^m, from a bounded subset A of n-dimensional Euclidean space to a bounded subset f[A] of m-dimensional Euclidean space, by means of training on examples (x_1, y_1), (x_2, y_2), ..., (x_k, y_k), ... of the mapping's action, where y_k = f(x_k). It is assumed that such examples are generated by selecting x_k vectors randomly from A in accordance with a fixed probability density function ρ(x). This paper presents a survey of the basic theory of the backpropagation neural network architecture covering the areas of: architectural design, performance measurement, function approximation capability, and learning. The survey includes previously known material, as well as some new results: a formulation of the backpropagation neural network architecture to make it a valid neural network (past formulations violated the locality of processing restriction) and a proof that the backpropagation mean squared error function exists and is differentiable. Also included is a theorem showing that any L2 function from [0,1]^n to R^m can be implemented to any desired degree of accuracy with a three-layer backpropagation neural network. Finally, an Appendix presents a speculative neurophysiological model illustrating how the backpropagation neural network architecture might plausibly be implemented in the mammalian brain for cortico-cortical learning between nearby regions of cerebral cortex.

1 Introduction
Without question, backpropagation is currently the most widely applied neural network architecture. This popularity primarily revolves around the ability of backpropagation networks to learn complicated multidimensional mappings. One way to look at this ability is that, in the words of Werbos [49,50,52], backpropagation goes "Beyond Regression".

Backpropagation has a colorful history. Apparently, it was originally introduced by Bryson and Ho in 1969 [5] and independently rediscovered by Werbos in 1974 [52], by Parker in the mid 1980's [41,39,40], and by Rumelhart, Williams, and other members of the PDP group in 1985 [46,44,1]. Although the PDP group became aware of Parker's work shortly after their discovery (they cited Parker's 1985 report in their first papers on backpropagation [59,46]), Werbos' work was not widely appreciated until mid-1987, when it was found by Parker. The work of Bryson and Ho was discovered in 1988 by le Cun [34]. Even earlier incarnations may yet emerge. Notwithstanding its checkered history, there is no question that credit for developing backpropagation into a usable technique, as well as promulgation of the architecture to a large audience, rests entirely with Rumelhart and the other members of the PDP group [45]. Before their work, backpropagation was unappreciated and obscure. Today, it is a mainstay of neurocomputing.

One of the crucial decisions in the design of the backpropagation architecture is the selection of a sigmoidal activation function (see Section 2 below). Historically, sigmoidal activation functions have been used by a number of investigators. Grossberg was the first advocate of the use of sigmoid functions in neural networks [18], although his reasons for using them are not closely related to their role in backpropagation (see [8] for a discussion of the relationship between these two bodies of work). Sejnowski, Hinton, and Ackley [21,45] and Hopfield [23] provided still other reasons for using sigmoidal activation functions, but again, these are not directly related to backpropagation. The choice of sigmoid activation for backpropagation (at least for the PDP group reincarnation of the architecture) was made quite consciously by Williams, based upon his 1983 study of activation functions [60]. As we shall see, it was a propitious choice.

2 Backpropagation Neural Network Architecture
This Section reviews the architecture of the backpropagation neural network. The transfer function equations for each processing element are provided for both the forward and backward passes. First, we recall the definition of a neural network:

Definition 1 A neural network is a parallel, distributed information processing structure consisting of processing elements (which can possess a local memory and can carry out localized information processing operations) interconnected together with unidirectional signal channels called connections. Each processing element has a single output connection which branches ("fans out") into as many collateral connections as desired (each carrying the same signal - the processing element output signal). The processing element output signal can be of any mathematical type desired. All of the processing that goes on within each processing element must be completely local: i.e., it must depend only upon the current values of the input signals arriving at the processing element via impinging connections and upon values stored in the processing element's local memory.

The importance of restating the neural network definition relates to the fact that (as pointed out by Carpenter and Grossberg [8]) traditional forms of the backpropagation architecture are, in fact, not neural networks. They violate the locality of processing restriction. The new backpropagation neural network architecture presented below eliminates this objection, while retaining the traditional mathematical form of the architecture.

The backpropagation neural network architecture is a hierarchical design consisting of fully interconnected layers or rows of processing units (with each unit itself comprised of several individual processing elements, as will be explained below). Backpropagation belongs to the class of mapping neural network architectures, and therefore the information processing function that it carries out is the approximation of a bounded mapping or function f : A ⊂ R^n → R^m, from a compact subset A of n-dimensional Euclidean space to a bounded subset f[A] of m-dimensional Euclidean space, by means of training on examples (x_1, y_1), (x_2, y_2), ..., (x_k, y_k), ... of the mapping, where y_k = f(x_k). It will always be assumed that such examples of a mapping f are generated by selecting x_k vectors randomly from A in accordance with a fixed probability density function ρ(x).


The operational use to which the network is to be put after training is also assumed to involve random selections of input vectors x in accordance with ρ(x). The backpropagation architecture described in this paper is the basic, classical version. Many variants of this basic form exist (see Section 5).

The macro-scale detail of the backpropagation neural network architecture is shown in Figure 1. In general, the architecture consists of K rows of processing units, numbered from the bottom up beginning with 1. For simplicity, the terms row and layer will be used interchangeably in this paper, even though each row will actually turn out to consist of two heterogeneous layers (where the term layer is used to denote a collection of processing elements having the same form of transfer function). The first layer consists of n fanout processing elements that simply accept the individual components x_i of the input vector x and distribute them, without modification, to all of the units of the second row. Each unit on each row receives the output signal of each of the units of the row below. This continues through all of the rows of the network until the final row. The final (Kth) row of the network consists of m units and produces the network's estimate y' of the correct output vector y. For the purposes of this paper it will always be assumed that K ≥ 3. Rows 2 through K - 1 are called hidden rows (because they are not directly connected to the outside world).

Figure 1: Macroscopic Architecture of the Backpropagation Neural Network. The boxes are called units. The detailed architecture of the units is elaborated in Figure 2. Each row can have any number of units desired, except the output row, which must have exactly m units (because the output vector must be m-dimensional).

Besides the feedforward connections mentioned above, each unit of each hidden row receives an error feedback connection from each of the units above it. However, as will be seen below, these are not merely fanned-out copies of a broadcast output (as the forward connections are), but are each separate connections, each carrying a different signal. The details of the individual units (shown as rectangles in Figure 1) are revealed in Figure 2 (which depicts two units on adjacent rows and shows all of their connections). Note that each unit is composed of a single sun processing element and several planet processing elements. Each planet produces an output signal that is distributed to both its sun and to the sun of the previous layer that supplied input to it. Each planet receives input from one of the suns of the previous layer as well as its own sun. As stated above, the hidden row suns receive input from one of the planets of each of the suns on the next higher row. The output row suns receive the correct answer y_i for their component of the output vector on each training trial. As discussed in detail below, the network functions in two stages: a forward pass and a backward pass. A scheduling processing element (not shown) sends signals to each of the processing elements of the network telling it when to apply its processing element transfer function and whether to apply the forward pass part of it or the backward pass part of it. After the transfer function is applied, the output signal is latched to the value determined during the update. This value is therefore constant until the next update. The exact equations of the processing elements of the network are given in Table 1.

The scheduling of the network's operation consists of two sweeps through the network. The first sweep (the forward pass) starts by inserting the vector x_k into the network's first row, the input (or fanout) layer. The processing elements of the first layer transmit all of the components of x_k to all of the units of the second row of the network. The outputs of the units of row two are then transmitted to all of the units of row three, and so on, until finally the m output units (the units of the top, Kth, row) emit the components of the vector y_k' (the network's estimate of the desired output y_k). After the estimate y_k' is emitted, each of the output units is supplied with its component of the correct output vector y_k, starting the second, backward, sweep through the network (the backward pass). The output suns compute their δ_Ki's and transmit these to their planets. The planets then update their Δ_Kij values (or update their weights if enough batching has been carried out) and then transmit the values w_Kij^old δ_Ki to the suns of the previous row. The superscript old indicates that the (non-updated) weight value used on the forward pass is used in this calculation. This process continues until the planets of row 2 (the first hidden layer) have been updated. The cycle can then be repeated. In short, each cycle consists of the inputs to the network bubbling up from the bottom to the top and then the errors percolating down from the top to the bottom.
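To make the two-sweep schedule concrete, here is a minimal sketch in Python/NumPy of one complete cycle for a three-row network, written in ordinary matrix form rather than the sun/planet decomposition; the layer sizes, learning rate, random data, and function names are illustrative assumptions, not part of the paper.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    n, h, m = 4, 8, 2                             # illustrative sizes: input, hidden, output
    W2 = rng.normal(scale=0.1, size=(h, n + 1))   # row-2 (hidden) weights, last column = bias planet
    W3 = rng.normal(scale=0.1, size=(m, h + 1))   # row-3 (output) weights, last column = bias planet

    def forward(x):
        """Forward sweep: fanout layer -> hidden row -> output row."""
        z1 = np.append(x, 1.0)                    # fanout layer output plus the constant bias input
        I2 = W2 @ z1                              # planet products summed by the hidden suns
        z2 = np.append(sigmoid(I2), 1.0)          # hidden sun outputs (sigmoided) plus bias
        I3 = W3 @ z2                              # output sun sums (left unsigmoided here)
        return z1, I2, z2, I3

    def backward(x, y, alpha=0.1):
        """Backward sweep: deltas percolate from the output row down to row 2."""
        z1, I2, z2, I3 = forward(x)
        delta3 = I3 - y                           # output-row delta (factor of 2 absorbed into alpha)
        delta2 = (W3[:, :h].T @ delta3) * sigmoid(I2) * (1 - sigmoid(I2))   # hidden-row delta
        W3[:] -= alpha * np.outer(delta3, z2)     # generalized delta rule updates
        W2[:] -= alpha * np.outer(delta2, z1)

    x_k = rng.random(n)
    y_k = rng.random(m)
    backward(x_k, y_k)                            # one complete forward/backward cycle

With sigmoided output suns, delta3 would pick up an additional s'(I3) factor, matching the output-row entry of Table 1 below.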


Forward Pass

Planet j of processing element i of row l (l = 2, 3, ..., K):
  Input Used: z_(l-1)j, where z_(l-1)0 ≡ 1.0 (the bias input)
  Weight Value Used: w_lij
  Local Memory Values Used: None
  Weight and Local Memory Value Update: None
  Output: w_lij z_(l-1)j

Sun i of a hidden row l (l = 2, 3, ..., K-1):
  Input Used: the outputs w_lij z_(l-1)j of its planets
  Output: z_li = s(I_li), where I_li = Σ_j w_lij z_(l-1)j

Sun i of the output row K:
  Input Used: the outputs w_Kij z_(K-1)j of its planets
  Output: z_Ki = s(I_Ki), where I_Ki = Σ_j w_Kij z_(K-1)j (or simply I_Ki if unsigmoided output is desired)

Backward Pass

Planet j of processing element i of row l:
  Input Used: δ_li (supplied by its sun)
  Weight Value Used: w_lij
  Local Memory Values Used: Δ_lij, count
  Output: w_lij^old δ_li (sent to sun i of row l-1)
  Weight and Local Memory Value Update:
    IF (count = batchsize) THEN { w_lij^new = w_lij^old - α (Δ_lij^old + δ_li z_(l-1)j) / batchsize ; Δ_lij^new = 0 ; count = 1 }
    ELSE { w_lij^new = w_lij^old ; Δ_lij^new = Δ_lij^old + δ_li z_(l-1)j ; count = count + 1 }

Sun i of a hidden row l:
  Input Used: the back-propagated products w_(l+1)pi^old δ_(l+1)p from the planets of row l+1
  Output: δ_li = s'(I_li) Σ_p w_(l+1)pi^old δ_(l+1)p (sent to its planets)

Sun i of the output row K:
  Input Used: the correct output component y_i (supplied on each training trial)
  Output: δ_Ki = -2 (y_i - z_Ki) s'(I_Ki) (sent to its planets; the s' factor is omitted if the output suns are unsigmoided)

where s(I) = 1/(1 + e^(-I)) is the sigmoid function of the network, s' = ds(I)/dI is the first derivative of the sigmoid function, M_l is the number of units on row l, w_lij is the weight of planet j of unit i of row l, z_1i = x_i (the ith component of the input vector x), and the network's output signals z_Ki are equal to the components of the network's output vector y' - the network's estimate of the correct or desired output y (the components of which are supplied to the network for use in learning). Note that the output signals of the output suns can either be sigmoided or unsigmoided, as desired by the user. Further note that primes (') are used to indicate both the correct output signals of the network as well as the first derivative of the sigmoid function.

Table 1: Processing element transfer functions for the backpropagation neural network architecture (planets, hidden suns, and output suns). For each processing element type, the transfer function used on the forward pass and on the backward pass of the network is given.
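The planet's backward-pass bookkeeping in Table 1 - accumulate error terms in the local memory Δ, then apply an averaged weight change once batchsize terms have been collected - can be sketched for a single planet as follows (a minimal illustration; the class name, α value, and batchsize are arbitrary choices, not part of the architecture specification):

    class Planet:
        """One planet: a single weight w plus the local memory (Delta, count) of Table 1."""
        def __init__(self, w, alpha=0.1, batchsize=10):
            self.w = w
            self.alpha = alpha
            self.batchsize = batchsize
            self.Delta = 0.0
            self.count = 1

        def forward(self, z_prev):
            return self.w * z_prev                # forward-pass output: w_lij * z_(l-1)j

        def backward(self, delta, z_prev):
            out = self.w * delta                  # back-propagated product uses the old weight
            if self.count == self.batchsize:      # batch complete: apply the averaged update
                self.w -= self.alpha * (self.Delta + delta * z_prev) / self.batchsize
                self.Delta = 0.0
                self.count = 1
            else:                                 # otherwise just accumulate into local memory
                self.Delta += delta * z_prev
                self.count += 1
            return out                            # sent to the sun of the previous row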

This process is continued until the network reaches a satisfactory level of performance (which is usually tested using data that was not used for training, but which was drawn at random in the same way as the training data), or until the user gives up and tries a new starting weight vector or a new network configuration (see Section 5 below for a discussion of backpropagation network training). After reaching a suitably low level of error the network can then be put through further operational testing to qualify it for deployment. Since training is typically not used in the deployed version, the backward pass can be eliminated - allowing a faster and simpler implementation.

3 Backpropagation Error Surfaces

Given a function f, an associated x-selection probability density function ρ, and a backpropagation network with a given number of layers K and a given number of units M_l per hidden layer, with either linear or sigmoided output units, we need a means for measuring the accuracy of the network's approximation of f as a function of the network's weights.

First, define w (the weight vector of the network) to be the vector whose components are all of the planet weights of the network, starting with the weight of the first planet of the first processing element of layer 2 (the first hidden layer) and ending with the weight of the last planet of the last processing element of the output layer K. To make things simple we shall refer to these weights by a single index, w_1, w_2, ..., w_Q, rather than by their full planet indices. To make our notation simple, the vector (z_K1, z_K2, ..., z_Km) - the network's estimate y' of the correct output for the example x used on a given trial - shall be called B. Note that B is a function of the input vector x and the network weight vector w. Thus, we will write B(x, w). On the kth testing trial the correct answer is y_k (i.e., y_k = f(x_k)). As before, the x_k's are drawn from A in accordance with a fixed probability density function ρ.

Given the above, let F_k = |f(x_k) - B(x_k, w)|². F_k is the square of the approximation error made on the kth testing trial. For the discussion below we shall assume that w is fixed (i.e., batchsize is set to ∞ to shut off learning). Then we define F(w) to be the limiting average of the F_k's over an ever-larger number of testing trials:

F(w) = lim_{N→∞} (1/N) Σ_{k=1}^{N} F_k .

Because the mapping f is bounded, the set of all y_k's is bounded. Thus, it is easy to show that this limit exists, because (for a fixed w) B is a continuous mapping from the compact set A into R^m, and thus the set of all B's is bounded also. Thus, the variance of the random variable F_k is bounded. So, by Kolmogorov's probability theorem [4], the random variable F_k obeys the strong law of large numbers. Therefore, the above F sum must almost surely converge to the expected value F(w) of F_k. We call F(w) the mean squared error function of the network, often shortened to error function. Note that F(w) ≥ 0 because F is the average of non-negative quantities.
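The strong-law argument can be checked numerically: with the weights held fixed, the running average of the F_k's settles down to a single number, the value F(w). A minimal sketch, in which the target function f, the "network" B, and the uniform density are illustrative stand-ins rather than anything taken from the paper:

    import numpy as np

    rng = np.random.default_rng(1)
    f = lambda x: np.sin(2 * np.pi * x)                       # stand-in for the mapping being approximated
    B = lambda x, w: w[0] / (1 + np.exp(-w[1] * x)) + w[2]    # stand-in network output B(x, w)
    w = np.array([1.5, 4.0, -0.75])                           # fixed weights (learning shut off)

    running_sum, estimates = 0.0, []
    for k in range(1, 100001):
        x_k = rng.random()                        # x drawn from rho (uniform on [0, 1])
        F_k = (f(x_k) - B(x_k, w)) ** 2           # squared error on the kth testing trial
        running_sum += F_k
        estimates.append(running_sum / k)         # (1/k) * sum of F_1 .. F_k

    print(estimates[99], estimates[9999], estimates[99999])   # successive estimates of F(w)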


Figure 2: Architectural detail and interaction of two backpropagation network units on two adjacent rows (row l below and row l+1 above). Each unit consists of a sun and several planets. Each hidden row sun sends its output to one planet of each unit of the next row. Each planet that receives an input from the previous row (the bias planets all receive the constant input 1.0 from a single processing element which is not shown) sends a connection back to the sun that supplied that input. It is through these paths that the products of the errors times the weights (w_lij δ_li) are back-propagated.

The error surface of a backpropagation network is the surface defined by the equation F = F(w) in the (Q+1)-dimensional space of vectors (w, F), where Q is the number of dimensions of the vector w (i.e., the number of planets in the network). The variable w ranges over its Q-dimensional space and for each w a non-negative surface height F is defined by F(w). In other words, given any selection of weights w, the network will make an average squared error F(w) in its approximation of the function f. We now consider the shape of this error surface.

As will be shown below, the generalized delta rule learning law used with the backpropagation neural network has the property that, given any starting point w_0 on the error surface that is not a minimum, the learning law will modify the weight vector w so that F(w) will decrease. In other words, the learning law uses examples provided during training to decide how to modify the weight vector so that the network will do a better job of approximating the function f. Given this behavior, the next issue is to assess how valuable this property is.

Until recently, the shape of backpropagation error surfaces was largely a mystery. Two basic facts have emerged so far. First, experience has shown that many backpropagation error surfaces are dominated by flat areas and troughs that have very little slope. In these areas it is necessary to move the weight value a considerable distance before a significant drop in error occurs. Since the slope is shallow, the generalized delta rule has a hard time determining which way to move the weight to reduce the error. Often, great numerical precision (e.g., 32-bit floating point) and patience must be employed to make significant progress.

The other basic fact about error surfaces that has emerged concerns the existence of local minima. Until recently, it was not known for certain whether backpropagation error surfaces have local minima at error levels above the levels of the global minima of the surfaces (due to weight permutations there are always many global minima). Experience suggested that such minima might not exist, because usually when training failed to make downhill progress and the error level was high, it was discovered that further patience (or the use of one of the training augmentations described in Section 5 below) would eventually lead to the weight moving away from what was clearly a shallow spot on the surface and on to a steeper part. Thus, it was somewhat surprising when McInerney, Haines, Biafore, and Hecht-Nielsen [36] discovered a local minimum (at a very high error level) in a backpropagation error surface in June 1988. Finding this local minimum was not easy. It required the use of the Symbolics Corporation's Macsyma mathematics expert system running on the largest Digital Equipment Corporation VAX computer to generate approximately a megabyte of FORTRAN code (the closed-form formulas for the error function and all of its first and second partial derivatives with respect to the weights) and a 12-hour run on a Cray-2 supercomputer (of a program that called this error surface code) to accomplish this discovery. The existence of the minimum was proven by showing that all of the first partial derivatives of the mean squared error function went to zero at this point and that the Hessian of this function (the matrix of second partial derivatives) was strongly positive-definite at this point.
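The verification strategy just described - show that every first partial derivative vanishes and that the Hessian is positive-definite - can be imitated numerically with finite differences when closed-form derivatives are not available. A minimal sketch, assuming only that some scalar error function F(w) can be evaluated at arbitrary weight vectors (the step size, tolerance, and toy example are illustrative):

    import numpy as np

    def grad_and_hessian(F, w0, h=1e-4):
        """Central-difference gradient and Hessian of a scalar function F at w0."""
        Q = len(w0)
        g = np.zeros(Q)
        H = np.zeros((Q, Q))
        for i in range(Q):
            e_i = np.eye(Q)[i] * h
            g[i] = (F(w0 + e_i) - F(w0 - e_i)) / (2 * h)
            for j in range(Q):
                e_j = np.eye(Q)[j] * h
                H[i, j] = (F(w0 + e_i + e_j) - F(w0 + e_i - e_j)
                           - F(w0 - e_i + e_j) + F(w0 - e_i - e_j)) / (4 * h * h)
        return g, H

    def looks_like_local_minimum(F, w0):
        g, H = grad_and_hessian(F, w0)
        eigenvalues = np.linalg.eigvalsh((H + H.T) / 2)   # symmetrize before the eigenvalue test
        return np.max(np.abs(g)) < 1e-6 and np.min(eigenvalues) > 0.0

    # Toy error function whose minimum is at w = (1, -2):
    F_toy = lambda w: (w[0] - 1.0) ** 2 + 3.0 * (w[1] + 2.0) ** 2
    print(looks_like_local_minimum(F_toy, np.array([1.0, -2.0])))   # True
    print(looks_like_local_minimum(F_toy, np.array([0.0,  0.0])))   # False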


In summary, three basic facts are known about backpropagation error surfaces. First, because of combinatorial permutations of the weights that leave the network's input/output function unchanged, these functions typically have huge numbers of global minima (which may lie at infinity for some problems). This causes the error surfaces to be highly degenerate and to have numerous troughs. Secondly, error surfaces have extensive flat areas and troughs with very little slope, so that downhill progress can require many iterations. Thirdly, local minima at error levels above those of the global minima do exist, although they appear to be encountered only rarely in practice.

4 Function Approximation with Backpropagation
The question of what functional forms can be approximated by neural networks has had pivotal importance in the history of neurocomputing. For example, the fact that a narrowly formulated type of perceptron could be shown incapable of implementing the EXCLUSIVE OR logical operation [38] was used in the 1980's as an argument to divert funding from neurocomputing to artificial intelligence. Recently, similar questions have been raised, and claims that little progress has been made on this front [38] have generated concern.

A clear insight into the versatility of neural networks for use in function approximation came with the discovery that a classic theorem of Kolmogorov [30] was actually a statement that for any continuous mapping f : [0,1]^n ⊂ R^n → R^m there must exist a three-layer neural network (having an input or "fanout" layer with n processing elements, a hidden layer with (2n+1) processing elements, and an output layer with m processing elements) that implements f exactly. This result gave hope that neural networks would turn out to be able to approximate the functions that arise in the real world. Kolmogorov's theorem was a first step. The following new result shows that the backpropagation network is itself able to approximate functions of practical interest to any desired degree of accuracy. To make the exposition precise, we start with some background material.

Let [0,1]^n be the closed unit cube in n-dimensional Euclidean space. Given any square-integrable function g : [0,1]^n ⊂ R^n → R (i.e., ∫_{[0,1]^n} |g(x)|² dx < ∞), it can be shown by the theory of Fourier series [12] that the series

g_N(x) = Σ_{k ∈ Z^n, |k_j| ≤ N} c_k e^{2πi k·x},   where c_k = ∫_{[0,1]^n} g(x) e^{-2πi k·x} dx,

converges to g in the sense that

lim_{N→∞} ∫_{[0,1]^n} | g(x) - g_N(x) |² dx = 0.

This is an example of the property of having the integral of the square of the error of approximation of one function by another go to zero. This property is described by the statement that the approximation can be achieved to within any desired degree of accuracy in the mean squared error sense. This leads to a definition.

Given a function f : [0,1]^n ⊂ R^n → R^m, we say that f belongs to L2 (or is L2) if each of f's coordinate functions is square integrable on the unit cube. For functions of this class it is assumed that the x vectors are chosen uniformly in [0,1]^n (relaxing this condition is fairly easy). Clearly, if a vector function of a vector variable is in L2 then each of its components can be approximated by its Fourier series to any desired degree of accuracy in the mean squared error sense. With this background as preamble, the new result is now presented.

Theorem 1  Given any ε > 0 and any L2 function f : [0,1]^n ⊂ R^n → R^m, there exists a three-layer backpropagation neural network that can approximate f to within ε mean squared error accuracy.

Proof: Let ε_1 be the accuracy to which we wish to approximate each of the coordinate functions f_l of f (where f(x) = (f_1(x), f_2(x), ..., f_m(x))) using a three-layer backpropagation neural network. Note that, by virtue of the results from Fourier series theory cited above, given any ε_1 > 0, there exists a positive integer N and coefficients c_lk such that

∫_{[0,1]^n} | f_l(x) - Σ_{k ∈ V} c_lk e^{2πi k·x} |² dx < ε_1,   where V = {k ∈ Z^n : |k_j| ≤ N, j = 1, 2, ..., n}.

We begin by showing that each of the sine and cosine terms required in the Fourier series (i.e., in the real part of the complex Fourier series - the imaginary part vanishes) for f_l can be implemented to any desired degree of absolute accuracy by a subset of a three-layer backpropagation neural network. The idea is to use the input layer units, a contiguous subset of the hidden layer units, and the corresponding output unit - unit l out of a total of m output units - to implement each sine or cosine function. The output units used here do not employ sigmoid functions on their outputs. To carry out this approximation, first note that the input layer, a subset H of the hidden layer (assumed here to be comprised of contiguous hidden units), and the lth output unit can compute any sum of the form

Σ_{i ∈ H} v_li s( Σ_{j=0}^{n} w_ij z_j ),

where the w_ij are the hidden unit weights for j = 1, 2, ..., n, and z_0 ≡ 1 is the bias input to each hidden unit (with bias weight w_i0). Next, we note that each of the arguments of the sine and cosine functions in the Fourier approximation of f_l (namely the terms u = u(k, x) = 2π k·x) is an affine function of the inputs.


Since, by simply adjusting the argument bias weight w_i0 (by -π/2) we can change a sine into a cosine, we shall concern ourselves exclusively with sines. Thus, to absolutely approximate a particular Fourier series sine or cosine function all we need to show is that, given any ε_2 > 0, we can find coefficients v_li and w_ij such that

| sin(u(k, x)) - Σ_{i ∈ H_k} v_li s( Σ_{j=0}^{n} w_ij z_j ) | < ε_2

for each x ∈ [0,1]^n. To show this, choose the w_ij such that

Σ_{j=0}^{n} w_ij z_j = β_i ( u(k, x) - a_i ),   (9)

where the quantities β_i and a_i are arbitrary real constants to be selected below. We are going to dedicate a certain (typically large) subset H_k of hidden layer units (assumed to be contiguous) to the calculation of each sin(u(k,x)) and cos(u(k,x)). So, with this transformation, for each hidden layer unit within H_k we can rewrite the above inequality as

| sin(u) - Σ_{i ∈ H_k} v_li s( β_i (u - a_i) ) | < ε_2.

It is easy to show, given a sufficient number of hidden layer units in H_k, that this inequality can always be satisfied. The first step in this demonstration is to examine the geometrical form of the sum

S(a, β, v_l, x) = Σ_{i ∈ H_k} v_li s( β_i (u - a_i) ),

which, by the above argument, can be computed by the input layer, hidden layer subset H_k, and output unit l. Note that for x ∈ [0,1]^n, u ranges over some closed interval d(k) ≤ u ≤ e(k). It is a closed interval because the affine transformation x ↦ u is continuous and continuous maps preserve compactness and simple connectedness. So, we need to approximate the function sin(u) on the closed interval d ≤ u ≤ e using this sum. To do this, partition the interval as follows:

d = a_{i_1} < a_{i_2} < a_{i_3} < ... < a_{i_P} = e,   (12)

where i_{p+1} = i_p + 1 and ∪_p {i_p} = H_k. By letting β_{i_1} = 0 and by choosing the β_{i_p}, p > 1, to be sufficiently large, it is easy to see that the sum S(a, β, v_l, x) has the geometrical form shown in Figure 3, since each of the terms in the sum is a steep sigmoid (sort of like a step function with microscopically rounded corners) that is essentially equal to 1 for u > a_{i_p} and equal to 0 for u < a_{i_p} (and equal to 0.5 when u = a_{i_p}). Clearly, S has the approximate form of a staircase, where the step lengths are determined by the differences (a_{i_{p+1}} - a_{i_p}) and the step heights are determined by the coefficients v_{l i_p}. By setting the sigmoid gains β_{i_p}, p > 1, to high enough values this basic staircase form can always be achieved, no matter how small the steps become or how many steps are used.

Given the above facts about the geometrical form of the sum S, Figure 4 demonstrates graphically that no matter how small ε_2 > 0 is chosen, a sum S(a, β, v_l, x) can be constructed so that it always remains within the ε_2 error band around sin(u). By starting at the left end of the interval [d, e] and working to the right, successive sigmoid stairsteps can be incrementally added to achieve this. (A numerical sketch of this staircase construction is given at the end of this Section.)

Thus, we have shown that by choosing H_k to have an adequately large number of processing elements and by properly selecting the a, β, and v_l vectors, the partial sum output (due to the members of H_k) will be within ε_2 of sin(u(k,x)) (or cos(u(k,x)), if that is what is being fitted), for all x ∈ [0,1]^n. Since output unit l receives the inputs from the hidden layer units of each of the sine and cosine subsets H_k for every k ∈ V, we can then approximately generate the Fourier series for f_l by simply multiplying all of the sine (or cosine) coefficients v_li for i ∈ H_k by the appropriate combinations of the real and imaginary parts of c_lk (the complex Fourier expansion coefficient of f_l for term k in the expansion). Call these sine and cosine multipliers a(l, k) and b(l, k), respectively, and let ẑ_l(x) be the output signal of output unit l. Thus, we get

F(f_l, x, N) - ẑ_l(x) = Σ_{k ∈ V} a(l, k) [ sin(2π k·x) - S(a, β, v_l, x) ] + b(l, k) [ cos(2π k·x) - S'(a, β, v_l, x) ],   (13)

where F(f_l, x, N) denotes the partial Fourier sum for f_l and the S' sum is that used for the H_k cosine term. Putting all of the above together we get:


Figure 3: Geometrical form of the function S(a, β, v_l, x) with β_{i_1} = 0 and large β_{i_p} for p > 1.

Figure 4: Graphical demonstration that the a, β, and v_l vectors can always be chosen so that the function S(a, β, v_l, x) remains within an error band of width ε_2 around sin(u).

The first inequality follows from Pythagoras' theorem and the second from the fact that f_l can be approximated to within ε_1 by its Fourier series and that the absolute differences between sin(2π k·x) and S(a, β, v_l, x) are always less than ε_2, for all x ∈ [0,1]^n, followed by a second application of Pythagoras' theorem. So, given any desired mean squared error ε > 0, we can select ε_1 and ε_2 sufficiently small so that we will have


which proves the theorem. To relate this result to the error function of Section 3, we need only note that

F(w) = ∫_{[0,1]^n} | f(x) - B(x, w) |² dx

for L2 functions (recall that for functions of this class the x vectors are assumed to be chosen uniformly in [0,1]^n).

The space L2 includes every function that could ever arise in a practical problem. For example, it includes the continuous functions and it includes all discontinuous functions that are piecewise continuous on a finite number of subsets of [0,1]^n. L2 also contains much nastier functions than these, but they are only of mathematical interest.

It is important to realize that although this theorem proves that three layers are always enough, in solving real-world problems it is often essential to have four, five, or even more layers. This is because, for many problems, an approximation with three layers would require an impractically large number of hidden units, whereas an adequate solution can be obtained with a tractable network size by using more than three layers. Thus, the above result should not influence the process of selecting an appropriate backpropagation architecture for a real-world problem - except to provide the confidence that comes from knowing that an appropriate backpropagation architecture must exist.

Finally, although the above theorem guarantees the ability of a multilayer network with the correct weights to accurately implement an arbitrary L2 function, it does not comment on whether or not these weights can be learned using any existing learning law. That is an open question.
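The staircase-of-sigmoids construction used in the proof of Theorem 1 is easy to reproduce numerically: one steep sigmoid per partition point, with step heights equal to the increments of sin(u) between successive partition points. A minimal sketch (the interval, number of steps, and gain β are illustrative choices, not values prescribed by the proof):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))   # clipped to avoid overflow warnings

    d, e = 0.0, 4.0 * np.pi        # interval [d, e] over which u = 2*pi*k.x ranges
    P = 400                        # number of partition points / hidden units in H_k
    beta = 2000.0                  # steep sigmoid gain (large beta -> near step function)
    a = np.linspace(d, e, P)       # partition d = a_1 < a_2 < ... < a_P = e

    # Step heights: each v_p adds the increment of sin between successive partition points,
    # so the running staircase sum tracks sin(u) from left to right.
    targets = np.sin(a)
    v = np.diff(np.concatenate(([0.0], targets)))   # v_1 = sin(a_1), v_p = sin(a_p) - sin(a_{p-1})

    def S(u):
        """Staircase sum of steep sigmoids approximating sin(u) on [d, e]."""
        u = np.atleast_1d(u)[:, None]
        return (v * sigmoid(beta * (u - a))).sum(axis=1)

    u_grid = np.linspace(d, e, 5000)
    max_error = np.max(np.abs(S(u_grid) - np.sin(u_grid)))
    print(f"max |S(u) - sin(u)| on [d, e]: {max_error:.4f}")

Increasing the number of partition points and the gain shrinks the reported maximum error, mirroring the ε_2 error-band argument of Figure 4.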

5 Backpropagation Learning Laws
Most, if not all, of the widely disseminated published derivations of the backpropagation neural network architecture learning law - the generalized delta rule - are flawed or incomplete. (Chapter 8 of [45] provides one of the better derivations, although a highly idiosyncratic notation is used.) This Section provides a careful derivation of the generalized delta rule and a brief discussion of some current research directions in backpropagation learning laws.

The first step in the derivation of the generalized delta rule is to recall that our goal is to move the w vector in a direction so that the value of F will be smaller at the new value. Obviously, the best approach would be to move in the w direction in which F is decreasing most rapidly. Assuming that F is differentiable, ordinary vector analysis tells us that this direction of maximum decrease is given by

-∇_w F(w) = -( ∂F/∂w_1, ∂F/∂w_2, ..., ∂F/∂w_Q ),

where, as in Section 3, it is assumed that there are Q weights in the network (i.e., the vector w is assumed to have Q components). However, before proceeding with this line of thought it is essential to show that F is in fact differentiable.

First, note that because backpropagation networks are made up entirely of affine transformations and smooth sigmoid functions, B(x, w) is a C^∞ function of both x and w (i.e., its partial derivatives of all orders with respect to the components of x and w exist and are continuous) and its derivative limits all converge uniformly on compact sets. Thus F_k, as a function of w, inherits these same properties, since F_k = |f(x_k) - B(x_k, w)|² and f(x_k) is a constant vector. What we will need later on are two facts: first, that F is a differentiable function, and second, that ∇_w F(w) = lim_{N→∞} (1/N) Σ_{k=1}^{N} ∇_w F_k(w) almost surely. These two facts can be simultaneously established. First, fix w and, given an arbitrary example sequence (x_1, x_2, ..., x_k, ...) (as defined in Section 3 above), let

G_i(r, Δ) = (1/⌊1/r⌋) Σ_{k=1}^{⌊1/r⌋} [ F_k(w + Δ u_i) - F_k(w) ] / Δ,

where u_i is the unit basis vector along the ith coordinate axis in weight space, where the real variables r and Δ both range over a compact neighborhood U of zero (with zero removed), and where ⌊u⌋ (the floor function) is the largest integer less than or equal to u. Because the derivatives of F_k all converge uniformly on compact sets in weight space, the limit

lim_{Δ→0} G_i(r, Δ)

converges uniformly on U. Further, as shown in Section 3 above, for each Δ in U the limit

lim_{r→0} G_i(r, Δ)

converges almost surely. Thus, by the elementary theory of iterated limits [31] the double limits

lim_{r→0} lim_{Δ→0} G_i(r, Δ)   and   lim_{Δ→0} lim_{r→0} G_i(r, Δ)

both exist and are equal, almost surely. Thus, we have shown that, almost surely, F is a differentiable function and that ∇_w F(w) = lim_{N→∞} (1/N) Σ_{k=1}^{N} ∇_w F_k(w).

Given these preliminaries, the generalized delta rule is now derived. The basic idea is to derive a formula for calculating -∇_w F(w) = -(∂F/∂w_1, ∂F/∂w_2, ..., ∂F/∂w_Q) by using training examples. To do this, we focus on the calculation of ∂F/∂w_p. We first note that, using the above results,


∂F(w)/∂w_p = lim_{N→∞} (1/N) Σ_{k=1}^{N} ∂F_k(w)/∂w_p ;

but, reverting to full weight indices (where the pth weight has full index lij),

∂F_k(w)/∂w_p = ∂F_k(w)/∂w_lij = (∂F_k/∂I_li)(∂I_li/∂w_lij),

because any functional dependence of F_k on w_lij must be through I_li (see Section 2). If we now define δ_li^k to be ∂F_k/∂I_li and evaluate the formula for I_li, we get

∂F_k(w)/∂w_lij = δ_li^k z_(l-1)j ,

where z_(l-1)j is the output signal of the jth unit of the (l-1)th row on the forward pass of the kth training trial. But,

δ_li^k = ∂F_k/∂I_li = (∂F_k/∂z_li)(∂z_li/∂I_li) = (∂F_k/∂z_li) s'(I_li).

For the output row (l = K), F_k = Σ_j (y_j^k - z_Kj)², so ∂F_k/∂z_Ki = -2 (y_i^k - z_Ki), where y_i^k is the ith component of y_k. Thus, for the output layer units,

δ_Ki^k = -2 (y_i^k - z_Ki) s'(I_Ki).

If row l is a hidden layer, then

∂F_k/∂z_li = Σ_p (∂F_k/∂I_(l+1)p)(∂I_(l+1)p/∂z_li) = Σ_p δ_(l+1)p^k w_(l+1)pi .

Thus, for the hidden layer units,

δ_li^k = s'(I_li) Σ_p δ_(l+1)p^k w_(l+1)pi .

So, substituting these results back into the formula for ∂F/∂w_lij gives

∂F(w)/∂w_lij = lim_{N→∞} (1/N) Σ_{k=1}^{N} δ_li^k z_(l-1)j ,

and the weight update

w^new = w^old - α ∇_w F(w),

where α > 0 is a small constant called the learning rate. This is the generalized delta rule learning law.

The rigorous generalized delta rule is approximated by the learning law used in the backpropagation architecture of Section 2, which substitutes a finite averaging sum of batchsize terms for the infinite-size average of the formula. Since we know that this average converges almost surely, the approximation will clearly be acceptable, assuming batchsize is chosen large enough.

A number of variants of this learning law (and of other parts of the backpropagation neural network architecture) have been developed. One learning law variant in common use is

w_lij^new = w_lij^old - α δ_li^k z_(l-1)j .



The development of new learning laws for backpropagation continues. The general goal is to provide a faster descent to the bottom of the error surface. One line of investigation [48,42,3] is exploring the use of approximations to the pseudoinverse [29] of the Hessian matrix (∂²F/∂w_i∂w_j) to calculate individually variable learning rates (the α values in Equation (34)) for each weight w_lij. The underlying concept is to use Newton's method for finding a place where ∇_w F(w) = 0 (a schematic form of such a step is sketched below). So far, this work shows promise, but a major advance in convergence speed has yet to be realized for arbitrary problems. Another approach that shows promise in early tests is the incremental network growing technique of Ash [2]. This method starts with a small number of units in each hidden layer and adds additional units in response to changes in the error level (as measured using test data).

It is probably reasonable to anticipate that faster learning techniques will be developed in the future as more becomes known about the structure of backpropagation error surfaces, and as these facts are exploited. The advent of such faster learning techniques will correspondingly increase the breadth of applicability of the backpropagation architecture.

Finally, it is important to note that this paper has examined only one (albeit perhaps the most important) variant of the backpropagation neural network architecture and its learning law. Other important variants exist. Examples of these include architectures in which connections skip layers [45], recurrent architectures [45], sigma-pi higher order architectures [45], the second-order learning laws of Parker [39], the bulk learning approach of Shepanski [47], the high learning rate approach of Cater [9], and the reinforcement training concept of Williams [58]. New architectures formed by combining backpropagation with other architectures such as Carpenter and Grossberg's Adaptive Resonance [6,7] and Fukushima's Neocognitron [14,15] can also be anticipated.

Methods for adding weight-dependent terms to a network's error function that create a force pushing the network towards a smaller number of non-zero weights, a smaller number of processing elements, or some other goal have also been demonstrated [51,43]. As these on-line methods and related off-line methods are developed further, it may become possible for backpropagation architectures to become self-adjusting. Statistical tests for determining whether a backpropagation network is overfitting or underfitting a particular training set may well emerge. These may someday allow the development of automated systems that will optimally craft a backpropagation network for a particular data set.
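The Hessian-based line of work mentioned above amounts to replacing the single learning rate α with a Newton-style step built from (a pseudoinverse of) the matrix of second partial derivatives. A minimal sketch of one such step, assuming a gradient vector and Hessian matrix are already available; the damping constant is an illustrative safeguard, not part of the cited methods:

    import numpy as np

    def newton_style_step(w, grad, hessian, damping=1e-3):
        """One Newton-style weight update: w_new = w - pinv(H + damping*I) @ grad.

        With damping -> 0 this seeks a point where grad F(w) = 0; the pseudoinverse
        keeps the step defined even when the Hessian is singular (flat directions).
        """
        H = hessian + damping * np.eye(len(w))
        return w - np.linalg.pinv(H) @ grad

    # Illustrative use with the quadratic F(w) = (w0 - 1)^2 + 3*(w1 + 2)^2:
    w = np.array([0.0, 0.0])
    grad = np.array([2 * (w[0] - 1.0), 6 * (w[1] + 2.0)])
    hessian = np.diag([2.0, 6.0])
    print(newton_style_step(w, grad, hessian))   # lands (almost) on the minimum at (1, -2)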

6 Conclusions
As demonstrated by the above results, backpropagation is a new tool for approximating functions on the basis of examples. White [53] has summarized the value of this neural network architecture by noting that researchers have been searching a long time for an effective set of approximation functions that adapt their functional form to that of the function to be approximated. By employing hierarchical sigmoided adaptive affine combinations, backpropagation achieves this goal by creating a coalition of functions that have their frequencies, phases, and amplitudes adjusted to precisely fit the function being approximated. The concept of approximating arbitrary L2 functions using functional forms that do not depend upon either orthogonality or linear superposition may well turn out to be an important theme not only in neurocomputing, but in mathematics, engineering, and physics as well; and possibly even in neuroscience (see the Appendix below for a speculation in this direction). The most important results about backpropagation are undoubtedly yet to come.

After presenting this paper (including Theorem 1) at the 1988 INNS Meeting in September 1988, it was brought to my attention that White and his colleagues had, almost simultaneously, independently discovered a similar theorem [24]. Coincidentally, it was White's earlier paper with Gallant [16] that led me to the idea for the above theorem. Other papers examining the issue of Theorem 1 that have subsequently been brought to my attention include a paper also presented at INNS-88 by Moore and Poggio [37] and the 1987 doctoral dissertation of le Cun [35]. Recent work by Irie and Miyake [26], le Cun [34], and Becker and le Cun [3] is also relevant. The author wishes to thank the referees for numerous helpful comments and suggestions. This work was partially supported by the SDI Innovative Science and Technology Program under U.S. Army Strategic Defense Command contract DASG60-88-C-0112 to HNC, Inc.

Appendix
Until now it has been extremely difficult to believe that the traditional backpropagation neural network architecture was relevant to neurophysiology at the cellular level (this is not true for non-traditional variants of backpropagation such as those of Parker [41,40]). This difficulty follows from the fact that past constructions of the traditional backpropagation architecture have involved non-local processing - which is believed to be impossible in neural tissue. Since the new architecture presented in Section 2 above eliminates the non-locality of backpropagation and makes it a legitimate neural network, while retaining the traditional mathematical form, it may be sensible to reexamine the possible biological relevance of backpropagation. As a start in this direction this Appendix presents a plausible, but highly speculative, hypothetical neurophysiological implementation of backpropagation. It is important to point out that the neurons involved in this proposed neural circuit are almost certainly also involved in other circuits as well, some of which might well be active at the same time. Thus, this hypothesis does not attempt to account for the totality of cortical function, merely one aspect of it - namely, the learning of associations or mappings between nearby cortical regions.

The hypothesis is attractive in that, as demonstrated in Section 4 above, multilayer networks can learn virtually any desired associative mapping, unlike most of the simple linear associative schemes that have been proposed. (Although, in fairness, some basically linear schemes are more capable than they might initially appear, in particular: the sparse associative memory of Willshaw, et al. [61], as well as the Bidirectional Associative Memory (BAM) of Kosko [32,33] and the Hopfield network [23], at least when the capacity improvement modifications proposed by Haines and Hecht-Nielsen [19] are incorporated.)

The hypothesis presented here is that backpropagation is used in the cerebral cortex for learning complicated mappings between nearby areas of cortex that are interconnected by the axons of shallow pyramid neurons. As is well known [13,28], the white matter axons of small to medium shallow pyramids are short and only go to nearby areas of cortex as elements of the short association tracts [11]. The axon collaterals of larger, deeper pyramids also go to these adjacent cortical regions. But, unlike the shallow pyramids, other collaterals of the deep pyramids make up the major brain fascicles that interconnect distant cortical regions and send signals to extracortical areas.

The backpropagation hypothesis assumes that the forward pass of the network is active almost all of the time. The backward pass is triggered only occasionally - namely, when it is necessary to learn something. It is assumed that this learning operation is triggered by some sort of mismatch detection, attention direction "searchlight", or important event detection function carried out by thalamic tissue (including the LGN, MGN, pulvinar, and the thalamus proper) and/or the thalamic reticular complex, a la the theories of Grossberg [17,18] and Crick [10]. It is assumed that input to cortex from thalamus triggers the backward pass (which then proceeds using Equation (34)). The thalamic input is assumed to modify the behavior of all of the cells of the affected area of cortex, except the deep pyramids - which are exempted, perhaps by means of the action of horizontal cells (which are known to receive thalamic signals and which preferentially synapse with the apical dendrites of the deep pyramids) [11].

The details of the hypothesis are presented in Figure 5. The deep pyramid cells (which are known to carry out an integrative function) are assumed to carry out the feed-forward function of the suns - namely, summing the inputs from the shallow pyramids (which, together with normal stellate cells, are assumed to carry out both the forward and backward pass functions of the planets) and applying a sigmoid function to the sum. Thus,


each deep pyramid and the shallow pyramids that feed it make up a sun and planet unit of Figures 1 and 2. The feedback function of the suns (summing the back-propagated error signals) is assumed to be carried out by cortical basket or basket stellate cells, which synapse with shallow pyramids over a considerable local area (unlike the normal stellates, which only synapse with a small number of shallow pyramids that are very close) [27]. These basket stellates are assumed to receive input preferentially from the cortico-cortical axons of shallow pyramids of nearby cortical areas (i.e., those participating in the short association tracts of the white matter), but not from local inputs. The shallow pyramid/normal stellate planet units are assumed to receive inputs from deep pyramids exclusively. The supposed operation of this neural circuit is now described. Throughout the discussion it is assumed that the transmitted signals are pulse frequency modulated transmissions ranging continuously between zero firing frequency and some maximum firing frequency (corresponding to the asymptotic 0 and 1 outputs of the backpropagation sigmoid). The need to accommodate negative signals for error feedback is ignored here - presumably some sort of basket stellate or normal stellate bias could fix this problem.

The output of the shallow pyramids is then also sent to nearby cortical areas via the same association bundles. The deep pyramid axons then synapse with normal stellate/shallow pyramid planet units.

The planet units weight the incoming basket stellate and deep pyramid signals, adding them in, and update their weights by multiplying the incoming deep pyramid signal times the error signal and adding this to the existing weight. While it is not easy to envision how a couple of cells can carry out these calculations, this is what the hypothesis requires. The shallow pyramids then transmit the product of the basket stellate input (the error for this group of planets) and the weight (after updating). These error signals are then transmitted to the appropriate basket stellate(s) of the next lower row. Note from Figures 1 and 2 that these connections must be very specific (unlike the deep pyramid outputs, which must, in accordance with Figures 1 and 2, be broadly distributed). In particular,

those few basket stellates needed by backpropagation may be statistically meaningless because they are randomly uncorrelated with activity in the target region. These connections might be used to implement other networks at other times. In any case, we can conclude that for this hypothesis to be correct there must be many more shallow pyramids than deep pyramids and that the deep pyramids must have more numerous and more broadly distributed axon collaterals. Clearly, this jibes with the neurophysiological facts. Perhaps this (admittedly crude) cortical backpropagation hypothesis can serve to stimulate some useful thought and discussion.


Figure 5: A possible cerebral neurophysiological implementation of backpropagation (essentially Figure 2, implemented with neurons). In this hypothesis, the deep pyramids (labeled dp) carry out the feedforward functions of the suns. Basket stellates (labeled bs) carry out the feedback error summation of the suns. Normal stellates (labeled ns) and shallow pyramids (labeled sp), working together, carry out the functions of the planets (deep pyramid input weighting and weight modification). Thalamic EEG signals trigger the backward pass (via collaterals not shown) and also activate horizontal cells (labeled h) - which allow the deep pyramids to continue firing during learning by exempting them from the influence of the thalamic signals.


[31] Korn, Granino A., and Korn, Theresa M., Mathematical Handbook for Scientists and Engineers, Second Edition, McGraw-Hill, New York, 1968.

[34] le Cun, Yann, "A Theoretical Framework for Back-Propagation", Technical Report CRG-TR-88-6, Connectionist Research Group, University of Toronto, Canada, September 1988.

[35] le Cun, Yann, "Modeles Connexionnistes de l'Apprentissage", Doctoral Dissertation, University of Pierre and Marie Curie, Paris, 1987.

[36] McInerney, John M., Haines, Karen G., Biafore, Steve, and Hecht-Nielsen, Robert, "Can Backpropagation Error Surfaces Have Local Minima?", Department of Electrical and Computer Engineering, University of California at San Diego, Manuscript, August 1988.

[37] Moore, B., and Poggio, T., paper presented at the 1988 INNS Meeting; abstract in Volume 1 of Neural Networks, 1988.

[38] Minsky, Marvin, and Papert, Seymour, Perceptrons, Expanded Edition, MIT Press, 1988.

[39] Parker, David B., "Optimal Algorithms for Adaptive Networks: Second Order Back Propagation, Second Order Direct Propagation, and Second Order Hebbian Learning", Proc. 1987 IEEE International Conference on Neural Networks, II(593-600), IEEE Press, New York, 1987.

[41] Parker, David B., "Learning-Logic", Technical Report TR-47, Center for Computational Research in Economics and Management Science, MIT, April 1985.

[42] Ricotti, L. P., Ragazzini, S., and Martinelli, G., "Learning of Word Stress in a Suboptimal Second Order Back-Propagation Neural Network", Proc. 1988 IEEE International Conference on Neural Networks, I(355-361), IEEE Press, New York, 1988.

[43] Rumelhart, David E., "Parallel Distributed Processing", Plenary Lecture presented at the 1988 IEEE International Conference on Neural Networks, San Diego, California, July 1988.

[44] Rumelhart, David E., Hinton, Geoffrey E., and Williams, Ronald J., "Learning representations by back-propagating errors", Nature, 323, 533-536, 9 October 1986.

[45] Rumelhart, David E., and McClelland, James L., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vols. I, II & III, MIT Press, 1986 & 1987.

[46] Rumelhart, David E., Hinton, Geoffrey E., and Williams, Ronald J., "Learning Internal Representations by Error Propagation", ICS Report 8506, Institute for Cognitive Science, University of California at San Diego, September 1985.

..., Proc. 1987 IEEE International Conference on Neural Networks, II(619-627), IEEE Press, New York, 1987.

[49] Werbos, Paul J., "Backpropagation: Past and Future", Proc. 1988 IEEE International Conference on Neural Networks, IEEE Press, New York, 1988.

[50] Werbos, Paul J., "Building and Understanding Adaptive Systems: A Statistical/Numerical Approach to Factory Automation and Brain Research", IEEE Trans. on Systems, Man, and Cybernetics, SMC-17, No. 1, 7-20, January/February 1987.

[51] Werbos, Paul J., "Learning How The World Works: Specifications for Predictive Networks in Robots and Brains", Proc. 1987 IEEE International Conference on Systems, Man, and Cybernetics, IEEE Press, New York, 1987.

[52] Werbos, Paul J., "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences", Doctoral Dissertation, Applied Mathematics, Harvard University, November 1974.

..., Proc. 1987 IEEE International Conference on Neural Networks, II(601-608), IEEE Press, New York, 1987.

..., University of California at San Diego, May 1985.