


Hybrid Evolutionary One-Step Gradient Descent for Training Recurrent Neural Networks

Rohitash Chandra1 and Christian W. Omlin2

1, 2 Department of Computer Engineering, Middle East Technical University, Guzelyurt, Turkish Republic of Northern Cyprus.

Abstract - In this paper, we present a fast, hybrid gradient descent and genetic algorithm for training recurrent neural networks. The hybrid algorithm uses the strengths of genetic algorithms and gradient descent learning in training recurrent neural networks (RNNs) for learning finite automata. In the hybrid algorithm, the chromosomes are evolved using one-step gradient descent combined with genetic evolution. The hybrid algorithm is applied to learning deterministic finite-state automata using recurrent neural networks. The surprising results demonstrate that the hybrid algorithm trains recurrent neural networks faster than training with a regular genetic algorithm alone.

Keywords: Genetic algorithms, Recurrent neural networks, Hybrid algorithm, Gradient descent.

1 Introduction

Although neural networks have performed very well in many real-world application problems, training them can be cumbersome in cases where the data contains noise and is not linearly separable. Traditionally, neural networks have been trained by the error back-propagation algorithm, which employs gradient descent for training. Gradient-based learning algorithms tend to get trapped in local minima, resulting in poor training and generalization performance. To overcome this shortcoming, evolutionary techniques such as genetic algorithms have been used in neural network training, as they alleviate the problem of local minima. Genetic algorithms are evolutionary search techniques; thus, training neural networks with genetic algorithms can be time-consuming. Performance comparisons of the two methods have shown that genetic algorithms generally outperform gradient descent in training feedforward neural networks for real-world application problems [1].

In recent years, training neural networks using hybrid algorithms has gained much interest. A common hybrid technique uses an evolutionary algorithm in the initial training phase; once a certain number of training generations has been reached, the evolutionary search terminates and gradient descent is used for final training. Thus, evolutionary training searches the weight space globally for a promising solution, and gradient descent refines that solution through local optimization. Examples of such hybrid approaches include the combination of genetic algorithms and gradient descent [2, 3], and particle swarm optimization with gradient descent [4].

Combining gradient descent with an evolutionary algorithm for training neural networks allows parallel and continuous global and local search for a solution in weight space. Gradient descent has been successfully embedded in genetic algorithms for image reconstruction [5]. Particle swarm optimization (PSO) has also been combined with evolutionary algorithms (EA) for training recurrent neural networks; the hybrid PSO-EA has been shown to outperform standard EA and PSO algorithms [6].

The strengths of gradient descent and genetic algorithms have motivated us to develop a hybrid GA-GD learning algorithm which exploits their respective strengths while mitigating their weaknesses. In our hybrid algorithm, gradient descent is embedded in a genetic algorithm: the chromosomes are evolved by updating weights with the error back-propagation algorithm. The fitness of each chromosome in the population is calculated after one weight update. The fitness directly affects the selection of parent chromosomes for the crossover operator, which combines components of two parents into a single offspring. We use the roulette wheel method for probabilistically selecting parent chromosomes. Each gene in the offspring is then mutated according to the mutation probability. This hybrid algorithm differs from traditional genetic algorithms in that the evolution of weights is based on gradient information and probabilistic mutation rather than mutation alone. Mutation offers a network the opportunity to escape from a possible local minimum in weight space.

In principle, gradient descent and genetic algorithms can be combined in two ways: genetic evolution using crossover and mutation can be done prior to the gradient descent weight update, or the gradient descent weight update can be performed prior to genetic evolution. In fact, these two methods are equivalent: the first gradient descent weight optimization simply creates a new population; thus, we can think of that population as the initial random population of the hybrid algorithm that first performs crossover and mutation before the gradient descent optimization. There is therefore no need to consider the two hybrid algorithms separately. In the remainder of this paper, we will only consider the hybrid algorithm which uses the gradient descent weight update followed by crossover and mutation operations.


The remainder of the paper is organized as follows: In Section II, we discuss recurrent neural networks, finite-state automata, gradient descent and genetic algorithms for training RNNs. In Section III, we discuss the framework of the hybrid GA-GD algorithm in detail. In Section IV, we show empirical results on how the hybrid GA-GD algorithm outperforms traditional GAs in training RNNs on deterministic finite-state automata. We then conclude the work and discuss directions for future research suggested by the results.

2 Material and Methods

2.1 Recurrent neural networks

Recurrent neural networks have been an important focus of research as they can be applied to difficult problems involving time-varying patterns. Their applications range from speech recognition and financial prediction to gesture recognition [7]-[9]. They have the ability to provide good generalization performance on unseen data but are difficult to train. Recurrent neural networks are dynamical systems, and it has been shown that they can represent deterministic finite-state automata in their internal weight representations [10]. Unlike feedforward neural networks, recurrent neural networks contain feedback connections. They are composed of an input layer, a context layer which provides state information, a hidden layer and an output layer. Each layer contains one or more neurons which propagate information from one layer to the next by computing a non-linear function of the weighted sum of their inputs. Recurrent neural networks maintain information about their past states for the computation of future states and outputs. Popular architectures of recurrent neural networks include first-order recurrent networks [11], second-order recurrent networks [12], NARX networks [13] and LSTM recurrent networks [14]. A detailed study of the vast variety of recurrent neural networks is beyond the scope of this paper. Fig. 1 shows a first-order recurrent neural network with the recurrence from the hidden layer to the context layer. The dynamics of the hidden state neuron activations in a first-order recurrent neural network are given by Equation 1.

$$ S_i(t) = g\left( \sum_{k=1}^{K} V_{ik}\, S_k(t-1) + \sum_{j=1}^{J} W_{ij}\, I_j(t-1) \right) \qquad (1) $$

where $S_k(t)$ and $I_j(t)$ represent the outputs of the state neurons and input neurons, respectively, $V_{ik}$ and $W_{ij}$ represent their corresponding weights, and $g(\cdot)$ is a sigmoidal discriminant function.
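As a concrete illustration, the state update of Equation 1 can be sketched in plain Python. The function name, network sizes and weight values below are illustrative assumptions, not taken from the paper:

```python
import math

def rnn_step(state, inputs, V, W):
    """One update of Equation 1: S_i(t) = g(sum_k V[i][k]*S_k(t-1)
    + sum_j W[i][j]*I_j(t-1)), with g the logistic sigmoid."""
    new_state = []
    for i in range(len(V)):
        total = sum(V[i][k] * state[k] for k in range(len(state)))
        total += sum(W[i][j] * inputs[j] for j in range(len(inputs)))
        new_state.append(1.0 / (1.0 + math.exp(-total)))  # sigmoidal discriminant g(.)
    return new_state

# Two state neurons, one input neuron (weights are illustrative only).
V = [[0.5, -0.3], [0.2, 0.8]]
W = [[1.0], [-1.0]]
state = [0.0, 0.0]
for symbol in [1.0, 0.0, 1.0]:   # present an input sequence one symbol at a time
    state = rnn_step(state, [symbol], V, W)
print(state)
```

Because the new state depends on the previous state through V, the network retains information about past inputs, which is what lets it emulate the state of an automaton.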

Fig. 1 First-order recurrent neural network (input, context, hidden and output layers, with recurrence from the hidden layer to the context layer through a unit delay)

2.2 Finite-state automata for RNN training

Recurrent neural networks are appropriate tools for modeling real-world application problems in speech, signature and gesture recognition, as stated earlier. However, these applications are not well suited for addressing the networks' fundamental issues, such as training algorithms and knowledge representation. These applications come with specific characteristics; for example, speech recognition may require feature extraction, which may hinder the investigation of the networks' fundamental issues, and different applications require different feature extraction techniques. Models such as finite-state automata and their corresponding languages can be viewed as a general paradigm of temporal, symbolic languages. No feature extraction is necessary for recurrent neural networks to learn these languages. The knowledge acquired by recurrent neural networks through learning corresponds well with the dynamics of finite-state automata. The representation of an automaton is a prerequisite for learning its corresponding language; i.e. if the architecture cannot represent a particular automaton, then it will not be able to learn it either.

Finite automata have been used for training recurrent neural networks as they represent dynamical systems. They have also been used to study knowledge representation in recurrent neural networks, and it has been demonstrated through knowledge extraction that RNNs can represent dynamical systems [15]-[17]. Finite-state automata are popular test beds for training recurrent neural networks because the training strings do not need to undergo any feature extraction.

An alphabet ∑ is a finite set of symbols. A formal language is a set of strings of symbols over some alphabet. Simple alphabets, e.g. ∑ = {0, 1}, are typically considered in the study of formal languages since results can easily be extended to larger alphabets. The set of all strings with an odd number of 1s (odd parity), L = {1, 01, 10, 001, 010, 100, 111, …}, is an example of a simple language. The symbol ε is used to denote the null string. The language contains an infinite number of strings.


Fig. 2 A 7-state deterministic finite-state automaton.

A deterministic finite-state automaton (DFA) is defined as a 5-tuple M = (Q, ∑, δ, q1, F), where Q is a finite set of states, ∑ is the input alphabet, δ : Q × ∑ → Q is the next-state function, which defines the state q' = δ(q, σ) reached by the automaton after reading symbol σ in state q, q1 ∈ Q is the initial state of the automaton (before reading any string), and F ⊆ Q is the set of accepting states. The language L(M) accepted by the automaton contains all the strings that bring the automaton to an accepting state. The languages accepted by DFAs are called regular languages. Fig. 2 shows the DFA which will be used for training RNNs using the hybrid evolutionary one-step algorithm. Double circles in the figure show accepting states, while rejecting states are shown by single circles. State 1 is the automaton's start state. The training and testing sets are obtained by presenting strings to this automaton, which gives an output, i.e. rejecting or accepting, depending on the state reached after the last symbol of the string has been presented. For example, the string 0100101 of length 7 leaves the automaton in state 5, which is an accepting state; therefore the output is 1.
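The 5-tuple definition above translates directly into a small simulator. Since the transition table of the 7-state DFA in Fig. 2 is not reproduced in the text, the example below uses a hypothetical two-state odd-parity DFA as a stand-in; the function name and dictionary encoding of δ are our own conventions:

```python
def dfa_accepts(string, delta, start, accepting):
    """Simulate a DFA: follow the next-state function delta from the
    start state over each symbol; accept iff the final state is in F."""
    q = start
    for sym in string:
        q = delta[(q, sym)]
    return q in accepting

# Hypothetical stand-in DFA: accept strings with an odd number of 1s.
delta = {(1, '0'): 1, (1, '1'): 2,
         (2, '0'): 2, (2, '1'): 1}
print(dfa_accepts('0100101', delta, start=1, accepting={2}))  # three 1s -> True
```

Labeling each training string with the automaton's accept/reject output in this way is how the training and testing sets are generated.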

2.3 Gradient descent algorithm for training neural networks

Error back-propagation employs gradient descent learning and is the most popular algorithm for training neural networks. The goal of gradient descent learning is to minimize the sum of squared errors by propagating error signals backward through the network architecture upon the presentation of training samples from the training set. These error signals are used to calculate the weight updates which represent the knowledge learnt from training. A limitation of gradient descent learning is its tendency to get trapped in a local minimum during training, resulting in poor training and generalization performance.

Backpropagation is used for training feedforward networks, while backpropagation-through-time (BPTT) is employed for training recurrent neural networks. BPTT is the spatio-temporal extension of the backpropagation algorithm [18]. The general idea behind BPTT is to unfold the recurrent neural network in time so that it becomes a deep multilayer feedforward network; this can be done by adding a layer for each time step. When unfolded in time, the network has the same behavior as a recurrent neural network for a finite number of time steps. Gradient descent is limited in learning long-term dependencies in recurrent neural networks, as the error gradient decreases significantly over longer sequences [19]. The weight update in gradient descent learning is computed by adding $\Delta w_{ji}$ to the respective weight, as shown in Equation 2:

$$ \Delta w_{ji} = -\alpha \frac{\partial E_d}{\partial w_{ji}} \qquad (2) $$

where $E_d$ is the error on training example d, summed over all output units in the network as shown in Equation 3:

$$ E_d = \frac{1}{2} \sum_{j=1}^{m} \left( d_j - S_j^L \right)^2 \qquad (3) $$
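Equations 2 and 3 can be illustrated for a single sigmoid output neuron; one gradient step moves each weight by $-\alpha\,\partial E_d/\partial w$. The function names and the particular weights, inputs and learning rate below are illustrative assumptions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def one_step_update(w, x, d, alpha):
    """One gradient descent step for a single sigmoid output neuron:
    E_d = 0.5*(d - S)^2 (Equation 3); each weight moves by
    delta_w = -alpha * dE_d/dw (Equation 2)."""
    s = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    # dE_d/dw_i = -(d - s) * s * (1 - s) * x_i for squared error + sigmoid
    grad = [-(d - s) * s * (1.0 - s) * xi for xi in x]
    return [wi - alpha * gi for wi, gi in zip(w, grad)]

def error(w, x, d):
    s = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return 0.5 * (d - s) ** 2

w, x, d = [0.1, -0.2], [1.0, 0.5], 1.0
w_new = one_step_update(w, x, d, alpha=0.5)
print(error(w, x, d), error(w_new, x, d))  # the error decreases after the step
```

This single-step update is exactly the amount of gradient information the hybrid algorithm later injects into each chromosome per generation.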

Here $d_j$ is the desired output for neuron j in the output layer, which contains m neurons, and $S_j^L$ is the network output of neuron j in the output layer L. Fig. 4 shows a high-level framework of BPTT, which employs gradient descent for error back-propagation and weight update.

__________________________
procedure: Gradient Descent for RNN training
  initialize weights and biases
  while (termination condition is not satisfied) do
    i) forward propagate
    ii) back-propagate error through time and do weight update
  end
  load data for testing the RNN
______________________________
Fig. 4 Description of BPTT for Weight Update

2.4 Genetic algorithms for training neural networks

Genetic algorithms provide a learning method motivated by biological evolution [20]. They have been successfully applied to neural network weight updates and to network topology optimization [21]. In recent years, such hybrid approaches to neural network training have gained popularity and have been applied to real-world problems such as job scheduling [22], forecasting [23] and robotics control [24]. The general idea in using genetic algorithms for training neural networks is to encode the weights as chromosomes in a population. The task of the genetic algorithm is then to find the optimal set of weights that best represents the knowledge after the network has been presented with the training data. The fitness function is thus based on the sum of squared errors returned by the network after being presented with the weights encoded in a chromosome. To evaluate the fitness function, each weight encoded in the chromosome is assigned to the respective weight link of the network. The training set of examples is then presented to the network, which propagates the input signals forward, and the sum of squared errors is calculated. In this way, genetic algorithms attempt to find a set of weights which minimizes the error function of the network. Unlike learning with gradient descent, genetic algorithms can help neural networks escape from local minima in weight space. Fig. 5 shows a high-level description of genetic algorithms for training RNNs.

The weights are encoded into chromosomes using either binary or real-numbered weight encoding schemes. In binary encoding, a set of genes corresponds to a certain weight link [25, 26]. The genes are changed into real weight values before being decoded into their respective weight links in order to evaluate the fitness function. Real-number encodings are an alternative approach [27]. In order to use this method, the genetic operators must be changed, as traditional genetic operators are specifically designed for binary chromosomes. One way of altering the genetic operators is as follows: the crossover operator takes two parent chromosomes and creates a single child chromosome by randomly selecting corresponding genetic material from both parents, as shown in Fig. 6. The mutation operator adds a small random real number in the range of [-1, 1] to each gene in the offspring according to the mutation probability.
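The real-valued crossover and mutation operators described above can be sketched as follows; the function names are our own, and the parent chromosomes are illustrative:

```python
import random

def crossover(parent1, parent2):
    """Create one child by randomly taking each gene (weight)
    from either parent, as described for real-valued encodings."""
    return [random.choice(pair) for pair in zip(parent1, parent2)]

def mutate(chromosome, p_mutation):
    """Add a small random real number in [-1, 1] to each gene
    with probability p_mutation."""
    return [g + random.uniform(-1.0, 1.0) if random.random() < p_mutation else g
            for g in chromosome]

p1 = [0.5, -0.1, 0.9, 0.3]
p2 = [-0.4, 0.2, 0.0, 0.7]
child = mutate(crossover(p1, p2), p_mutation=0.5)
print(child)
```

Note that crossover only recombines existing weight values; mutation is the operator that introduces new values, which is why the paper argues mutation carries most of the learning for real-valued encodings.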

Fig. 6 The crossover operator in a genetic algorithm

3 Hybrid Training Algorithm

The strengths and weaknesses of gradient descent and genetic algorithms have been discussed in the previous sections. While genetic algorithms have been shown to overcome the problem of local minima, their drawback is that evolutionary optimization can be time consuming. The evolution of weights can also temporarily direct the network away from the optimal solution. The update of weights and biases in order to minimize the error function is common to both gradient descent and genetic algorithms. In the hybrid approach, the gradient information is used in creating a new population on which genetic operators such as crossover and mutation are applied. Genetic operators combine components from two different solutions according to the respective selection criterion, and therefore a better solution can be obtained. The description of the proposed algorithm is shown in Fig. 7.

In the description of the algorithm in Fig. 7, a population of hypotheses representing weights as chromosomes is initialized with small random real values. Each chromosome is presented to the network, where it is updated using gradient descent, which calculates the weight update according to the error on the input-to-output mappings of the training samples. This weight update is done for one epoch only. Each updated network then becomes part of the new population. Each chromosome in the new population is evaluated according to the fitness function, which is the reciprocal of the objective function (e.g. the sum of squared errors returned by the network). If the termination condition is not satisfied, then the algorithm proceeds with genetic evolution using crossover and mutation: (1) two parents are chosen by the respective selection criterion, such as rank, roulette wheel or tournament selection; (2) an offspring is created from the components of each parent using the crossover operator according to the crossover probability.
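Roulette wheel selection with a reciprocal-error fitness, as used here, can be sketched as follows; the function name and the small smoothing constant guarding against division by zero are our own assumptions:

```python
import random

def roulette_select(population, errors):
    """Pick one chromosome with probability proportional to fitness,
    where fitness is the reciprocal of the sum of squared errors."""
    fitness = [1.0 / (e + 1e-12) for e in errors]   # avoid division by zero
    total = sum(fitness)
    r = random.uniform(0.0, total)                  # spin the wheel
    cumulative = 0.0
    for chrom, f in zip(population, fitness):
        cumulative += f
        if cumulative >= r:
            return chrom
    return population[-1]

population = [[0.1, 0.2], [0.3, -0.4], [0.0, 0.5]]
errors = [0.9, 0.1, 0.5]    # lower error -> higher fitness -> larger wheel slice
parent = roulette_select(population, errors)
print(parent)
```

Low-error chromosomes occupy larger slices of the wheel and are therefore more likely, but not guaranteed, to be chosen as parents.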

____________________________
procedure: Genetic Algorithm for RNN training
  initialize population
  evaluate RNN's fitness
  while (termination condition is not satisfied) do
    i) crossover and mutation
    ii) update population
    iii) evaluate RNN's fitness
  end
  get the optimal chromosome of weights
  load data for testing the RNN
________________________________

Fig. 5 Genetic algorithms for evolution of RNN weights


Then each gene in the chromosome is altered by adding a small random number according to the mutation probability.
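Putting the pieces together, the hybrid procedure (one-epoch gradient update of every chromosome, then fitness-based crossover and mutation) can be sketched as below. The function names are our own, and `one_epoch_update` and `sse` are problem-specific stand-ins supplied by the caller; the toy problem at the end is purely illustrative:

```python
import random

def hybrid_train(population, one_epoch_update, sse, generations, p_mut):
    """Sketch of the hybrid GA-GD loop: one-step gradient descent on
    every chromosome, then roulette selection, crossover and mutation."""
    for _ in range(generations):
        # one-epoch gradient descent weight update for each chromosome
        population = [one_epoch_update(c) for c in population]
        # fitness is the reciprocal of the sum of squared errors
        fitness = [1.0 / (sse(c) + 1e-12) for c in population]
        total = sum(fitness)

        def select():
            r = random.uniform(0.0, total)
            acc = 0.0
            for c, f in zip(population, fitness):
                acc += f
                if acc >= r:
                    return c
            return population[-1]

        # genetic evolution: crossover then probabilistic mutation
        next_pop = []
        for _ in population:
            child = [random.choice(pair) for pair in zip(select(), select())]
            child = [g + random.uniform(-1.0, 1.0) if random.random() < p_mut else g
                     for g in child]
            next_pop.append(child)
        population = next_pop
    return min(population, key=sse)   # best chromosome of weights

# Toy stand-in problem: drive both weights toward zero.
sse = lambda c: sum(g * g for g in c)
one_epoch_update = lambda c: [g - 0.1 * 2 * g for g in c]   # one gradient step on sse
pop = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(10)]
best = hybrid_train(pop, one_epoch_update, sse, generations=20, p_mut=0.1)
print(sse(best))
```

In the paper's setting, `one_epoch_update` would be one epoch of BPTT on the RNN encoded by the chromosome and `sse` the network's sum of squared errors on the training strings.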

4 Results and Discussion

In the following results, we show the performance of training recurrent neural networks with the genetic algorithm alone and with our hybrid training method, respectively. In both cases, we randomly initialize all the genes in the chromosomes in the range of [-1, 1]. From trial experiments, we determined a population size of 40 to give the best results; therefore, this population size is used in all the experiments.

We used different combinations of crossover and mutation probabilities of 0.9 and 0.5. For each combination of crossover and mutation probabilities, we ran 10 experiments. In the implementation of the genetic algorithm, which evolves real-numbered weight values of the network, suitable probabilities of crossover and mutation are important for rapid convergence to a solution. To understand the genetic training process for neural networks, one has to consider that the actual learning takes place during mutation, where there is a significant change in the weight values. The crossover operator does not alter the value of the weights in any way; it only exchanges them between the selected parents. When using a real-valued genetic weight representation, mutation is thus more significant for the learning. Therefore, we ran experiments to find suitable probabilities for crossover and mutation. We used the following hybrid weight update strategy: we construct an RNN from a chromosome, perform the weight update for one epoch only, and then apply probabilistic crossover and mutation.

Note that we used the sum of squared errors from the network as the basis of the fitness function. The crossover operator chooses two parents using roulette wheel selection and creates a child chromosome by probabilistically selecting genes from each parent. The mutation operator adds a small random real number in the range of [-1, 1] to each gene in the chromosome. The maximum training time allowed was 1000 generations. We used 8 neurons in the hidden layer, as this showed successful results for representing the 7-state DFA in trial experiments. The results for training RNNs using the hybrid algorithm are shown in Table 1. Table 2 shows the results for training using a standard GA.

The results clearly demonstrate that our hybrid algorithm outperforms training RNNs with a genetic algorithm alone in terms of training time. The training time was strongly affected by the different combinations of crossover and mutation probabilities. The results are promising, which motivates the application of the hybrid training algorithm to training feedforward neural networks. The contribution of our hybrid training algorithm to solving real-world problems looks promising.

__________________________
procedure: Hybrid Training Algorithm for RNN
  initialize population
  evaluate RNN's fitness
  while (termination condition is not reached) do
    i) crossover and mutation (genetic evolution)
    ii) present each chromosome to the RNN and apply a GD weight update for 1 epoch only (neural weight update)
    iii) update population
    iv) evaluate RNN's fitness
  end
  get the optimal chromosome of weights
  load data for testing the RNN
________________________________

Fig. 7 Description of the proposed hybrid genetic algorithm/gradient descent training algorithm

TABLE 2: GA Training for RNN

Mutation  Crossover  Training Accuracy  Generalization Accuracy  Training time
0.5       0.9        100±0%             100±0%                   118.9±44.1
0.9       0.9        100±0%             100±0%                   63.2±23.4
0.5       0.5        100±0%             100±0%                   86.4±21.2
0.9       0.5        100±0%             100±0%                   77.9±15.6

The 90 percent confidence intervals over 10 experiments with different values of crossover and mutation probability are given. The training time is given as the number of generations; the maximum training time allowed was 1000 generations.

TABLE 1: Hybrid Training Algorithm for RNN

Mutation  Crossover  Training Accuracy  Generalization Accuracy  Training time
0.5       0.9        100±0%             100±0%                   2.7±1.2
0.9       0.9        100±0%             100±0%                   4.2±2.0
0.5       0.5        100±0%             100±0%                   3.3±0.9
0.9       0.5        100±0%             100±0%                   3.5±0.7

The 90 percent confidence intervals over 10 experiments with different values of crossover and mutation probability are given. The training time is given as the number of generations; the maximum training time allowed was 1000 generations.


5 Conclusion

We have presented a simple hybrid algorithm for training recurrent neural networks using a combination of gradient descent and genetic algorithm weight updates. Each chromosome in the population is modified using one step of gradient descent optimization followed by the application of standard crossover and mutation operators. Surprisingly, we have found that this single gradient descent step makes the difference between rapid convergence and non-convergence within 1000 generations when applied to the problem of training recurrent neural networks to behave like deterministic finite-state automata. It would be interesting to see the performance of the hybrid training algorithm on fuzzy finite-state automata in future work. The contribution of our hybrid training algorithm to solving real-world problems looks promising.

6 References

[1] R. S. Sexton and R. E. Dorsey, "Reliable classification using neural networks: a genetic algorithm and backpropagation comparison", Decision Support Systems, vol. 30, 2000, pp. 11-22.

[2] R. Chandra and C. W. Omlin, "Combining genetic and gradient descent learning in recurrent neural networks: an application to speech phoneme classification", Proc. of the International Conference on Artificial Intelligence and Pattern Recognition, Orlando, FL, USA, July 2007, pp. 278-285.

[3] L. Lu and Y.-Q. Zhang, "Evolutionary fuzzy neural networks for hybrid financial prediction", IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 35, no. 2, May 2005, pp. 244-249.

[4] J.-R. Zhang, J. Zhang, T.-M. Lok and M. R. Lyu, "A hybrid particle swarm optimization-backpropagation algorithm for feedforward neural network training", Applied Mathematics and Computation, vol. 185, 2007, pp. 1026-1037.

[5] Liu Mei, Liu WeiDong, Sun DeQing, Chen Guqiao and Liu Huinian, "A new super-resolution image reconstruction method based on hybrid genetic algorithm", Proc. of the 2004 IEEE International Conference on Control Applications, Taipei, Taiwan, 2004, pp. 211-216.

[6] X. Cai, N. Zhang, G. K. Venayagamoorthy and D. C. Wunsch II, "Time series prediction with recurrent neural networks trained by a hybrid PSO-EA algorithm", Neurocomputing, vol. 70, 2007, pp. 2342-2353.

[7] A. J. Robinson, "An application of recurrent nets to phone probability estimation", IEEE Transactions on Neural Networks, vol. 5, no. 2, 1994, pp. 298-305.

[8] C. L. Giles, S. Lawrence and A. C. Tsoi, "Rule inference for financial prediction using recurrent neural networks", Proc. of the IEEE/IAFE Computational Intelligence for Financial Engineering, New York City, USA, 1997, pp. 253-259.

[9] K. Murakami and H. Taguchi, "Gesture recognition using recurrent neural networks", Proc. of the SIGCHI Conference on Human Factors in Computing Systems, Louisiana, USA, 1991, pp. 237-242.

[10] C. L. Giles, C. W. Omlin and K. Thornber, "Equivalence in knowledge representation: automata, recurrent neural networks, and dynamical systems", Proc. of the IEEE, vol. 87, no. 9, 1999, pp. 1623-1640.

[11] P. Manolios and R. Fanelli, "First-order recurrent neural networks and deterministic finite state automata", Neural Computation, vol. 6, no. 6, 1994, pp. 1154-1172.

[12] R. L. Watrous and G. M. Kuhn, "Induction of finite-state languages using second-order recurrent networks", Proc. of Advances in Neural Information Systems, California, USA, 1992, pp. 309-316.

[13] T. Lin, B. G. Horne, P. Tino and C. L. Giles, "Learning long-term dependencies in NARX recurrent neural networks", IEEE Transactions on Neural Networks, vol. 7, no. 6, 1996, pp. 1329-1338.

[14] S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation, vol. 9, no. 8, 1997, pp. 1735-1780.

[15] C. W. Omlin and C. L. Giles, "Constructing deterministic finite-state automata in recurrent neural networks", Journal of the ACM, vol. 43, no. 6, 1996, pp. 937-972.

[16] C. W. Omlin, K. K. Thornber and C. L. Giles, "Fuzzy finite state automata can be deterministically encoded into recurrent neural networks", IEEE Transactions on Fuzzy Systems, vol. 6, 1998, pp. 76-89.

[17] R. L. Watrous and G. M. Kuhn, "Induction of finite-state languages using second-order recurrent networks", Proc. of Advances in Neural Information Systems, California, USA, 1992, pp. 309-316.

[18] P. J. Werbos, "Backpropagation through time: what it does and how to do it", Proc. of the IEEE, vol. 78, no. 10, 1990, pp. 1550-1560.

[19] Y. Bengio, P. Simard and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult", IEEE Transactions on Neural Networks, vol. 5, no. 2, 1994, pp. 157-166.

[20] J. H. Holland, "Genetic algorithms and the optimal allocation of trials", SIAM Journal on Computing, vol. 2, no. 2, 1973, pp. 88-105.

[21] J. H. Ang, K. C. Tan and A. Al-Mamun, "Training neural networks for classification using growth probability-based evolution", Neurocomputing, 2008, doi:10.1016/j.neucom.2007.10.011.

[22] H. Yu and W. Liang, "Neural network and genetic algorithm based hybrid approach to extended job-scheduling", Computers and Industrial Engineering, vol. 39, 2001, pp. 337-356.

[23] H. Niska, T. Hiltunen, A. Karppinen, J. Ruuskanen and M. Kolehmainen, "Evolving the neural network model for forecasting air pollution time series", Engineering Applications of Artificial Intelligence, vol. 17, 2004, pp. 159-167.

[24] G. Capi and K. Doya, "Evolution of recurrent neural controllers using an extended parallel genetic algorithm", Robotics and Autonomous Systems, vol. 52, 2005, pp. 148-159.

[25] P. J. Angeline, G. M. Saunders and J. B. Pollack, "An evolutionary algorithm that constructs recurrent neural networks", IEEE Transactions on Neural Networks, vol. 5, 1994, pp. 54-65.

[26] M. A. Potter and K. A. De Jong, "Evolving neural networks with collaborative species", Proc. of the Summer Computer Simulation Conference, 1995.

[27] M. Negnevitsky, Artificial Intelligence: A Guide to Intelligent Systems, Addison Wesley, 2004.