
Nonmonotone Methods for Backpropagation Training with Adaptive Learning Rate

V.P. Plagianakos
University of Patras, Department of Mathematics,
U.P. Artificial Intelligence Research Center–UPAIRC,
GR-26500 Patras, Greece.

M.N. Vrahatis
University of Patras, Department of Mathematics,
U.P. Artificial Intelligence Research Center–UPAIRC,
GR-26500 Patras, Greece.

G.D. Magoulas
University of Athens, Department of Informatics,
U.P. Artificial Intelligence Research Center–UPAIRC,
GR-15771 Athens, Greece.

Abstract

In this paper, we present nonmonotone methods for feedforward neural network training, i.e. training methods in which error function values are allowed to increase at some iterations. More specifically, at each epoch we impose that the current error function value must satisfy an Armijo-type criterion, with respect to the maximum error function value of M previous epochs. A strategy to dynamically adapt M is suggested, and two training algorithms with adaptive learning rates that successfully employ the above mentioned acceptability criterion are proposed. Experimental results show that the nonmonotone learning strategy improves the convergence speed and the success rate of the methods considered.

Introduction

The batch training of a Feedforward Neural Network (FNN) is consistent with the theory of unconstrained optimization and can be viewed as the minimization of the function E; that is, to find a minimizer w* = (w*_1, w*_2, ..., w*_n) ∈ ℝ^n such that:

$$ w^* = \min_{w \in \mathbb{R}^n} E(w), \qquad (1) $$

where E is the batch error measure, defined as the sum-of-squared-differences error function over the entire training set.

The widely used batch Back-Propagation (BP) [28] is a first-order neural network training algorithm, which minimizes the error function using the steepest descent method [8]:

$$ w^{k+1} = w^k - \eta \nabla E(w^k), \qquad (2) $$

where k indicates iterations (k = 0, 1, ...), the gradient vector is usually computed by the back-propagation of the error through the layers of the FNN (see [28]), and η is a constant, heuristically chosen learning rate. Appropriate learning rates help to avoid convergence to a saddle point or a maximum. In practice, a small constant learning rate (0 < η < 1) is chosen in order to secure the convergence of the BP training algorithm and to avoid oscillations in directions where the error function is steep. It is well known that this approach tends to be inefficient [12, 28]. For difficulties in obtaining convergence of BP training algorithms utilizing a constant learning rate see [13, 16]. On the other hand, there are theoretical results that guarantee convergence when the learning rate is constant; in this case the learning rate is proportional to the inverse of the Lipschitz constant, which, in practice, is not easily available [1, 17].
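For illustration, a minimal sketch of update (2) in Python/NumPy, assuming a user-supplied routine grad_E that returns the back-propagated gradient of E as an array (the routine name and parameter values are placeholders, not part of the original formulation):

```python
import numpy as np

def batch_bp(w, grad_E, eta=0.1, epochs=1000):
    """Plain batch BP, update (2): steepest descent with a constant learning rate."""
    for _ in range(epochs):
        w = w - eta * grad_E(w)  # one gradient evaluation per epoch over the whole training set
    return w
```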

The paper is organized as follows: in the next section the class of backpropagation algorithms with adaptive learning rate is presented, and the advantages as well as the disadvantages of these algorithms are discussed. Then, the idea of nonmonotone training is introduced. Afterwards, two methods with adaptive learning rate, which incorporate the proposed nonmonotone learning strategy, are described. Finally, simulation results are presented and the paper ends with some concluding remarks.


Adaptive Learning Rate Algorithms

Several adaptive learning rate algorithms have been proposed to accelerate the training procedure. The following strategies are usually suggested: (i) start with a small learning rate and increase it exponentially if successive epochs reduce the error, or rapidly decrease it if a significant error increase occurs [3, 31]; (ii) start with a small learning rate and increase it if successive epochs keep the gradient direction fairly constant, or rapidly decrease it if the direction of the gradient varies greatly at each epoch [6]; and (iii) give each weight an individual learning rate, which increases if the successive changes in the weights are in the same direction and decreases otherwise [12, 22, 26, 29]. Note that all the above mentioned strategies employ heuristic parameters in an attempt to enforce a monotone decrease of the learning error and to secure the convergence of the training algorithm.

A different approach is based on Goldstein's and Armijo's work on steepest-descent and gradient methods. The method of Goldstein [9] requires the assumption that E is twice continuously differentiable on S(w^0), where S(w^0) = {w : E(w) ≤ E(w^0)} is bounded, for some initial vector w^0. It also requires that η is chosen to satisfy the relation sup ‖H(w)‖ ≤ η^{-1} < ∞ in some bounded region where the relation E(w) ≤ E(w^0) holds, where H(w) denotes the Hessian of E at w. However, the manipulation of the full Hessian is too expensive in computation and storage for FNNs with several hundred weights [4]. Le Cun [14] proposed a technique, based on appropriate perturbations of the weights, for estimating on-line the extreme eigenvalues and eigenvectors of the Hessian without calculating the full matrix H. According to experiments reported in [14], the largest eigenvalue of the Hessian is mainly determined by the FNN architecture, the initial weights, and by short-term, low-order statistics of the training data. This technique could be used to determine η, at the cost of additional presentations of the training set in the early stage of training.

An alternative approach is based on the work of Armijo [1]. Following this approach, the value of the learning rate η is related to the value of the Lipschitz constant L, which depends on the morphology of the error surface. In this case, the BP algorithm takes the form:

$$ w^{k+1} = w^k - \frac{1}{2L} \nabla E(w^k), \qquad (3) $$

and converges to the point w* which minimizes E (see [1] for conditions under which convergence occurs and for a convergence proof).

Nonmonotone Learning and Global Convergence

A training algorithm can be made globally convergent by determining the learning rate in such a way that the error is exactly minimized along the current search direction at each epoch, i.e. E(w^{k+1}) < E(w^k). To this end, an iterative search, which is often expensive in terms of error function evaluations, is required. It must be noted that the above simple condition does not guarantee global convergence for general functions, i.e. convergence to a local minimizer from any initial condition (see [7] for a general discussion on globally convergent methods).

The use of adaptive learning rate algorithms which enforce monotone error reduction using inappropriate values for the critical heuristic learning parameters can considerably slow the rate of training, or even lead to divergence and to premature saturation [15, 27]. Moreover, using heuristics it is not possible to develop globally convergent training algorithms.

To alleviate this situation we propose the following approach, which exploits the accumulated information regarding the M most recent values of the error function to accelerate convergence. The following condition is used to formulate the proposed approach and to define a criterion of acceptance of any weight iterate [10]:

$$ E\bigl(w^k - \eta_k \nabla E(w^k)\bigr) - \max_{0 \le j \le M} E(w^{k-j}) \le -\sigma\, \eta_k \bigl\| \nabla E(w^k) \bigr\|^2, \qquad (4) $$

where M is a nonnegative integer, 0 < σ < 1, and η_k indicates the learning rate at the kth epoch. The above condition allows an increase in the function values without affecting the global convergence properties, as has been proved theoretically in [10, 25] and experimentally in [24]. Experiments indicate that the choice of the parameter M is critical for the implementation and depends on the problem. The following procedure provides an elegant way to dynamically adapt the value of M at each epoch:

$$ M_k = \begin{cases} M_{k-1} + 1, & \Lambda_k < \Lambda_{k-1} < \Lambda_{k-2}, \\ M_{k-1} - 1, & \Lambda_k > \Lambda_{k-1} > \Lambda_{k-2}, \\ M_{k-1}, & \text{otherwise}, \end{cases} \qquad (5) $$

where Λ_k is the local estimation of the Lipschitz constant [17] at the kth iteration, defined as:

$$ \Lambda_k = \frac{\bigl\| \nabla E(w^k) - \nabla E(w^{k-1}) \bigr\|}{\bigl\| w^k - w^{k-1} \bigr\|}. \qquad (6) $$

Note that the local approximation of the Lipschitz constant is easily obtained without any additional error function and gradient evaluations. Obviously, M has to be positive; thus, if Relation (5) gives a non-positive number, the value of M is set to M = 1. Moreover, in practice, we have to enforce an upper limit on the values of M as well. Experimental results indicate that M = 10 is a good choice for the maximum value of M.

If Λ_k increases for two consecutive epochs, the sequence of weight vectors approaches a steep region, so we decrease the value of M to avoid overshooting a possible minimum point. Conversely, when Λ_k decreases for two consecutive epochs, the method possibly enters a valley in the weight space, so we increase the value of M. This allows the method to accept larger learning rates and move faster out of the flat region. Finally, when the value of Λ_k exhibits rather random behavior (increasing and decreasing in consecutive epochs), we leave the value of M unchanged.
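A minimal sketch of Relations (5) and (6) in Python/NumPy follows; the function names, the history handling, and the upper bound M ≤ 10 (the value used later in the experiments) are illustrative choices, not the authors' code:

```python
import numpy as np

def lipschitz_estimate(w, w_prev, grad, grad_prev):
    """Local estimation of the Lipschitz constant, Relation (6)."""
    return np.linalg.norm(grad - grad_prev) / np.linalg.norm(w - w_prev)

def adapt_M(M_prev, lam_k, lam_k1, lam_k2, M_max=10):
    """Adaptive memory size M, Relation (5); lam_k1 and lam_k2 are Lambda_{k-1} and Lambda_{k-2}."""
    if lam_k < lam_k1 < lam_k2:      # Lambda decreasing: likely a flat region, remember more epochs
        M = M_prev + 1
    elif lam_k > lam_k1 > lam_k2:    # Lambda increasing: likely a steep region, remember fewer epochs
        M = M_prev - 1
    else:                            # mixed behaviour: leave M unchanged
        M = M_prev
    return min(max(M, 1), M_max)     # keep M positive and bounded above
```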

Next, we propose a high-level description of an algorithm that employs the above acceptability criterion. The kth iteration of the algorithm consists of the following steps (a code sketch is given after the remarks below):

1: Compute the new learning rate η_k using any learning rate adaptation strategy.

2: Update the weights: w^{k+1} = w^k − η_k ∇E(w^k).

3: If the acceptability condition (4) is fulfilled, store w^{k+1} and terminate; otherwise go to the next step.

4: Use a tuning technique for η_k and return to Step 3.

Remark 1: A simple technique to tune η_k at Step 4 is to decrease the learning rate by a reduction factor 1/q, where q > 1 [21]. We remark here that the selection of q is not critical for successful learning; however, it has an influence on the number of error function evaluations required to satisfy condition (4). The value q = 2 is usually suggested in the literature [1] and indeed it was found to work without problems in the experiments reported in this paper.

Remark 2: The above model constitutes an efficient method to determine an appropriate learning rate without additional gradient evaluations. As a consequence, the number of gradient evaluations (GE) is, in general, less than the number of error function evaluations (FE).
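As an illustration, a sketch of one iteration of the above scheme in Python/NumPy, assuming user-supplied error_fn and grad_fn routines, an initial learning rate eta obtained from Step 1, and the reduction factor q = 2 of Remark 1; the names, the iteration cap, and the returned values are our own illustrative choices:

```python
import numpy as np

def nonmonotone_iteration(w, eta, error_fn, grad_fn, recent_errors,
                          sigma=1e-5, q=2.0, max_reductions=30):
    """Steps 2-4: accept the new weights only if the acceptability condition (4)
    holds with respect to the maximum of the M most recent error values."""
    g = grad_fn(w)                       # one gradient evaluation (GE) per iteration
    g_norm2 = float(np.dot(g, g))
    e_max = max(recent_errors)           # max_{0 <= j <= M} E(w^{k-j})
    for _ in range(max_reductions):
        w_new = w - eta * g              # Step 2: tentative weight update
        e_new = error_fn(w_new)          # one error function evaluation (FE)
        if e_new - e_max <= -sigma * eta * g_norm2:   # condition (4): accept
            return w_new, e_new, eta
        eta /= q                         # Step 4 / Remark 1: reduce eta by the factor 1/q
    return w_new, e_new, eta             # safeguard: give up after max_reductions trials

# recent_errors would hold the last M error values, with M adapted as in Relation (5).
```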

Nonmonotone Training Algorithms with Adaptive Learning Rate

The nonmonotone learning strategy can be incorporated in any training algorithm. In this section, we briefly describe two recently proposed training algorithms with adaptive learning rate, which can be successfully used for nonmonotone backpropagation training.

1) Adapting the rate of learning by locally estimating the Lipschitz constant. The BPVS algorithm [17] exploits the local shape of the error surface by estimating the Lipschitz constant at each epoch and setting the learning rate η_k = 0.5/Λ_k, where Λ_k is the local estimation of the Lipschitz constant L at the kth epoch (see Relation (6)). Thus, when the error surface has steep regions, Λ_k is large and a small value for the learning rate is appropriate in order to guarantee convergence. On the other hand, when the error surface has flat regions, Λ_k is small and a large learning rate is appropriate to accelerate the convergence speed. If η_k = 0.5/Λ_k does not satisfy condition (4), the learning rate tuning technique of Remark 1 is applied. The version of BPVS that provides nonmonotone training is named NMBPVS.

2) Adapting the rate of learning by using the Barzilai and Borwein formula. In a previous work [23] we have proposed a neural network training algorithm called BBP, which is based on the Barzilai and Borwein [2] learning rate update formula:

$$ \eta_k = \frac{\langle \delta^{k-1}, \delta^{k-1} \rangle}{\langle \delta^{k-1}, \psi^{k-1} \rangle}, \qquad (7) $$

where δ^{k-1} = w^k − w^{k-1}, ψ^{k-1} = ∇E(w^k) − ∇E(w^{k-1}), and ⟨·, ·⟩ denotes the standard inner product. The key features of this method are its low storage requirements and inexpensive computations. Moreover, it does not guarantee descent in the error function E. Our experiments in [23, 24] show that this property is valuable in neural network training, because very often the method escapes from local minima and flat valleys where other methods are trapped. To secure that condition (4) is fulfilled, the learning rate tuning technique of Remark 1 is used to reduce relatively large learning rates. This modified training algorithm is named NMBBP.
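A minimal sketch of Formula (7) in Python/NumPy; the zero-denominator safeguard and the fallback value are our own additions, not part of the BBP method as described in [23]:

```python
import numpy as np

def barzilai_borwein_rate(w, w_prev, grad, grad_prev, fallback=0.1):
    """Barzilai-Borwein learning rate, Formula (7)."""
    delta = w - w_prev                  # delta^{k-1}
    psi = grad - grad_prev              # psi^{k-1}
    denom = float(np.dot(delta, psi))
    if abs(denom) < 1e-12:              # guard against division by (near) zero
        return fallback
    return float(np.dot(delta, delta)) / denom
```

The rate returned by (7) can be relatively large; in NMBBP such rates are reduced by the tuning technique of Remark 1 until condition (4) is satisfied.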

Experimental Results

In this section, we evaluate the performance of the proposed algorithms. The algorithms have been tested using the same random initial weights and received the same sequence of input patterns. The weights of the network are updated only after the entire set of patterns to be learned has been presented.

We have fixed the value σ = 10^{-5}. Furthermore, the maximum value of M allowed was M = 10.


For each of the test problems, a figure summarizing the performance of the algorithms for the simulations that reached a solution is presented. A total of 100 simulations for each algorithm has been performed. In Figures 1–6, the x-axis denotes the value of M = 1, ..., 10. For M = 1, we have the original BPVS and BBP methods using Armijo's criterion, while for M = V we have the proposed methods with adaptive M.

Note that the number of error function evaluations (FE) is, in general, larger than the number of gradient evaluations (GE), due to the learning rate acceptability criterion we use. As a consequence, even when our algorithms fail to converge within the predetermined limit of function evaluations, their number of gradient evaluations is smaller than that of the other methods. Keeping in mind that a gradient evaluation is more costly than an error function evaluation (see for example [18], where Møller suggests counting a gradient evaluation as three times as costly as an error function evaluation), one can understand that our methods require fewer floating point operations and are actually much faster.

The font learning problem

For this problem, 26 matrices with the capital letters of the English alphabet are presented to a 35-30-26 FNN (1830 weights, 56 biases). Each letter has been defined in terms of binary values on a grid of size 5 × 7. The network is based on hidden neurons with logistic activations and on linear output neurons. The performance of the methods is exhibited in Figures 1 and 2.

For this problem the weights have been initialized following the Nguyen–Widrow method [19]. The termination condition is a sum-of-squared-differences error less than 0.1, and the limit on error function evaluations was 2000. In this problem the nonmonotone learning strategy accelerates the convergence of both the BPVS and BBP methods. Note that the performance of the methods with adaptive M is at least as good as the average performance of the methods with a predefined M. All the algorithms exhibited very good performance, as they had a complete success percentage (see Figure 3).

Texture classification problem

A total of 12 Brodatz texture images [5] (numbers 3, 5, 9, 12, 15, 20, 51, 68, 77, 78, 79, 93; see Figure 1 in [17]) of size 512 × 512 are acquired by a scanner at 150 dpi. From each texture image 10 subimages of size 128 × 128 are randomly selected, and the co-occurrence method, introduced by Haralick [11], is applied. In the co-occurrence method, the relative frequencies of gray-level pairs of pixels at certain relative displacements are computed and stored in a matrix. As suggested by other researchers [20, 30], the combination of the nearest neighbor pairs at orientations 0°, 45°, 90° and 135° is used in the experiment. A set of 10 sixteen-dimensional training patterns is created from each image. The patterns are presented in a finite sequence C = (c_1, c_2, ..., c_p) of input–output pairs c_p = (u_p, t_p), where u_p are the real-valued input vectors in ℝ^16 and t_p are binary output vectors in ℝ^12, for p = 1, ..., 120, determining the corresponding training patterns. A 16-8-12 FNN (224 weights, 20 biases) is trained to classify the patterns into the 12 texture types. The network is based on neurons with logistic activations and biases, and the weights and biases were initialized with random numbers from the interval (−1, 1).
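For illustration only, a simplified sketch of the co-occurrence accumulation described above; the quantization to 16 gray levels and the displacement convention are assumptions on our part, and the exact feature extraction yielding the 16-dimensional patterns is not reproduced here:

```python
import numpy as np

def cooccurrence_matrix(image, dx, dy, levels=16):
    """Relative frequencies of gray-level pairs at displacement (dx, dy)."""
    # Quantize to `levels` gray levels (assumed choice).
    q = np.floor(image.astype(np.float64) * levels / (image.max() + 1)).astype(int)
    counts = np.zeros((levels, levels))
    h, w = q.shape
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                counts[q[y, x], q[y2, x2]] += 1
    return counts / counts.sum()

# Nearest-neighbour displacements for orientations 0, 45, 90 and 135 degrees
# (assuming the row index increases downwards).
displacements = [(1, 0), (1, -1), (0, -1), (-1, -1)]
```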

Detailed results regarding the training performance of the algorithms are presented in Figures 4 and 5. The termination condition is a classification error CE < 3% [17]; that is, the network correctly classifies at least 117 out of the 120 patterns.

The successfully trained FNNs are tested for their generalization capability, using test patterns from 20 subimages of the same size randomly selected from each image. To evaluate the generalization performance of the FNN, the max rule is used, i.e. a test pattern is considered to be correctly classified if the corresponding output neuron has the greatest value among the output neurons. The methods with variable M exhibited slightly better generalization capabilities than the other methods.

Concluding Remarks

In this paper we have presented a new approach for generating nonmonotone learning rules based on an acceptability criterion. We incorporate information given by the local estimation of the Lipschitz constant to dynamically adapt this criterion, according to the local shape of the error function. The simulation results suggest that the use of this acceptability criterion can significantly accelerate the convergence speed of the training algorithms. The behavior of the nonmonotone algorithms, in the considered experiments, proved to be robust against phenomena such as oscillations due to large learning rates.

References

[1] L. Armijo, Minimization of functions having Lipschitz continuous first partial derivatives, Pacific J. Math., 16, 1–3, (1966).

[2] J. Barzilai and J.M. Borwein, Two point step size gradient methods, IMA J. Numer. Anal., 8, 141–148, (1988).


[3] R. Battiti, Accelerated backpropagation learning: two optimization methods, Complex Systems, 3, 331–342, (1989).

[4] S. Becker and Y. Le Cun, Improving the convergence of the back-propagation learning with second order methods, in Proc. of the 1988 Connectionist Models Summer School, D.S. Touretzky, G.E. Hinton, and T.J. Sejnowski (eds.), 29–37, Morgan Kaufmann, San Mateo, CA, (1988).

[5] P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover, New York, (1966).

[6] L.W. Chan and F. Fallside, An adaptive training algorithm for back-propagation networks, Computer Speech and Language, 2, 205–218, (1987).

[7] J.E. Dennis and R.B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, Englewood Cliffs, NJ, (1983).

[8] P.E. Gill, W. Murray, and M.H. Wright, Practical Optimization, Academic Press, NY, (1981).

[9] A.A. Goldstein, Cauchy's method of minimization, Numer. Math., 4, 146–150, (1962).

[10] L. Grippo, F. Lampariello, and S. Lucidi, A nonmonotone line search technique for Newton's method, SIAM J. Numer. Anal., 23, 707–716, (1986).

[11] R. Haralick, K. Shanmugan, and I. Dinstein, Textural features for image classification, IEEE Trans. Systems, Man and Cybernetics, 3, 610–621, (1973).

[12] R.A. Jacobs, Increased rates of convergence through learning rate adaptation, Neural Networks, 1, 295–307, (1988).

[13] C.M. Kuan and K. Hornik, Convergence of learning algorithms with constant learning rates, IEEE Trans. Neural Networks, 2, 484–488, (1991).

[14] Y. Le Cun, P.Y. Simard, and B.A. Pearlmutter, Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors, in Advances in Neural Information Processing Systems 5, S.J. Hanson, J.D. Cowan, and C.L. Giles (eds.), 156–163, Morgan Kaufmann, San Mateo, CA, (1993).

[15] Y. Lee, S.H. Oh, and M.W. Kim, An analysis of premature saturation in backpropagation learning, Neural Networks, 6, 719–728, (1993).

[16] R. Liu, G. Dong, and X. Ling, A convergence analysis for neural networks with constant learning rates and non-stationary inputs, Proceedings of the 34th Conf. on Decision and Control, New Orleans, 1278–1283, (1995).

[17] G.D. Magoulas, M.N. Vrahatis, and G.S. Androulakis, Effective back-propagation with variable stepsize, Neural Networks, 10, 69–82, (1997).

[18] M.F. Møller, A scaled conjugate gradient algorithm for fast supervised learning, Neural Networks, 6, 525–533, (1993).

[19] D. Nguyen and B. Widrow, Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights, IEEE First International Joint Conference on Neural Networks, 3, 21–26, (1990).

[20] P.P. Ohanian and R.C. Dubes, Performance evaluation for four classes of textural features, Pattern Recognition, 25, 819–833, (1992).

[21] J.M. Ortega and W.C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, NY, (1970).

[22] M. Pfister and R. Rojas, Speeding-up backpropagation: a comparison of orthogonal techniques, in Proc. of the Joint Conference on Neural Networks, Nagoya, Japan, 517–523, (1993).

[23] V.P. Plagianakos, D.G. Sotiropoulos, and M.N. Vrahatis, Automatic adaptation of learning rate for backpropagation neural networks, in Recent Advances in Circuits and Systems, N.E. Mastorakis (ed.), World Scientific, (1998).

[24] V.P. Plagianakos, D.G. Sotiropoulos, and M.N. Vrahatis, A nonmonotone backpropagation training method for neural networks, Proceedings of the 4th Hellenic–European Conference on Computer Mathematics and its Applications, Athens, (1998).

[25] M. Raydan, The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem, SIAM J. Optim., 7, 26–33, (1997).

[26] M. Riedmiller and H. Braun, A direct adaptive method for faster backpropagation learning: the Rprop algorithm, Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA, 586–591, (1993).

[27] A.K. Rigler, J.M. Irvine, and T.P. Vogl, Rescaling of variables in backpropagation learning, Neural Networks, 4, 225–229, (1991).

[28] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, Learning internal representations by error propagation, in D.E. Rumelhart and J.L. McClelland (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1, 318–362, MIT Press, Cambridge, Massachusetts, (1986).

[29] F. Silva and L. Almeida, Acceleration techniques for the back-propagation algorithm, Lecture Notes in Computer Science, 412, 110–119, Springer-Verlag, Berlin, (1990).

[30] J. Strang and T. Taxt, Local frequency features for texture classification, Pattern Recognition, 27, 1397–1406, (1994).

[31] T.P. Vogl, J.K. Mangis, A.K. Rigler, W.T. Zink, and D.L. Alkon, Accelerating the convergence of the back-propagation method, Biological Cybernetics, 59, 257–263, (1988).


Figure 1: Mean FE and GE numbers of the BBP method for the font problem.

Figure 2: Mean FE and GE numbers of the BPVS method for the font problem.

Figure 3: Success percentage of the BBP and BPVS methods for the font problem.

Figure 4: Mean FE and GE numbers of the BBP method for the texture problem.

Figure 5: Mean FE and GE numbers of the BPVS method for the texture problem.

Figure 6: Success percentage of the BBP and BPVS methods for the texture problem.