
Hybrid Recurrent Neural Networks: An Application to Phoneme Classification

Rohitash Chandra
School of Science and Technology
The University of Fiji
Saweni, Lautoka, Fiji.

Christian W. Omlin
Department of Computer Science
The University of Western Cape
Cape Town, South Africa.

Abstract - We present a hybrid recurrent neural network architecture inspired by hidden Markov models, which have been shown to learn and represent dynamical systems. We use genetic algorithms for training the hybrid architecture and show their contribution to speech phoneme classification. We apply Mel frequency cepstral coefficient feature extraction to three pairs of phonemes obtained from the TIMIT speech database and use hybrid recurrent neural networks to model each pair of phonemes. Our results demonstrate that the hybrid architecture can successfully model speech sequences.

Keywords: Genetic algorithms, Hidden Markov models, Phoneme classification, Recurrent neural networks.

1 Introduction

Recurrent neural networks have been an important focus of research as they can be applied to difficult problems involving time-varying patterns. Their applications range from speech recognition and financial prediction to gesture recognition [1]-[3]. Hidden Markov models, on the other hand, have been very popular in the field of speech recognition [4]. They have also been applied to other problems such as gesture recognition [5].

Recurrent neural networks are capable of modelling complicated recognition tasks. They have shown higher accuracy than hidden Markov models in speech recognition on low-quality, noisy data. However, hidden Markov models perform better for large-vocabulary speech recognition. One limitation of hidden Markov models in speech recognition is the assumption that the probability of being in a state at time t depends only on the previous state, i.e. the state at time t-1. This assumption is inappropriate for speech signals, where dependencies often extend through several states; nevertheless, hidden Markov models have performed extremely well for certain types of speech recognition. Recurrent neural networks are dynamical systems, and it has been shown that they can represent deterministic finite automata in their internal weight representations [6].

The structural similarity between hidden Markov models and recurrent neural networks is the basis for constructing the hybrid recurrent neural network architecture. The recurrence equation in a recurrent neural network resembles the equation of the forward algorithm in hidden Markov models. The combination of the two paradigms into a hybrid system may provide better generalization and training performance, which would be a useful contribution to the fields of machine learning and pattern recognition. We have previously introduced a slight variation of this architecture and shown that it can represent dynamical systems [7]. In this paper, we show that the hybrid recurrent neural network architecture can be applied to model speech sequences. We use Mel frequency cepstral coefficients (MFCC) for feature extraction from the TIMIT speech database. We extract three different pairs of phonemes and use hybrid recurrent neural networks for modelling.

Evolutionary optimization techniques such as genetic algorithms have been popular for training neural networks as an alternative to gradient descent learning [8]. It has been observed that genetic algorithms overcome the problem of local minima, whereas in gradient descent search for the optimal solution it may be difficult to drive the network out of local minima, which in turn proves costly in terms of training time. In this paper, we show how genetic algorithms can be used to train the hybrid recurrent neural network architecture inspired by hidden Markov models, and we use them to train hybrid recurrent neural networks for the classification of phonemes extracted from the TIMIT speech database. We end this paper with conclusions from our work and possible directions for future research.


2 Definitions and Methods

2.1 Recurrent Neural Networks

Recurrent neural networks maintain information about their past states for the computation of future states and outputs by using feedback connections. They are composed of an input layer, a context layer which provides state information, a hidden layer and an output layer, as shown in Fig. 1. Each layer contains one or more processing units called neurons, which propagate information from one layer to the next by computing a non-linear function of their weighted sum of inputs.

Popular architectures of recurrent neural networks include first-order recurrent networks [8], second-order recurrent networks [9], NARX networks [10] and LSTM recurrent networks [11]. A detailed study of the vast variety of recurrent neural networks is beyond the scope of this paper; however, we discuss the dynamics of first-order recurrent neural networks, given in (1).

S_i(t) = g\left( \sum_{k=1}^{K} V_{ik} S_k(t-1) + \sum_{j=1}^{J} W_{ij} I_j(t-1) \right)        (1)

where S_k(t) and I_j(t) represent the outputs of the state neurons and input neurons respectively, and V_ik and W_ij represent their corresponding weights. g(.) is a sigmoidal discriminant function. We will use this architecture to construct a hybrid recurrent neural network architecture inspired by hidden Markov models and show that it can learn and model speech sequences.
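As an illustration only, the following is a minimal sketch of the state update in (1), written in Python with NumPy; the layer sizes and weight names mirror the notation above but are otherwise assumptions, not the exact implementation used here.

```python
import numpy as np

def sigmoid(x):
    # g(.) in (1): sigmoidal discriminant function
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(S_prev, I_prev, V, W):
    """One state update of a first-order recurrent network, as in (1).

    S_prev : (K,)   previous state-neuron outputs S_k(t-1)
    I_prev : (J,)   previous inputs I_j(t-1)
    V      : (K, K) context-to-hidden weights V_ik
    W      : (K, J) input-to-hidden weights W_ij
    """
    return sigmoid(V @ S_prev + W @ I_prev)

# Illustrative dimensions; the experiments later use 12 input neurons.
K, J = 12, 12
rng = np.random.default_rng(0)
S = np.zeros(K)
V = rng.uniform(-1, 1, (K, K))
W = rng.uniform(-1, 1, (K, J))
for I_t in rng.normal(size=(5, J)):      # a short input sequence
    S = rnn_step(S, I_t, V, W)
```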

2.2 Hidden Markov Models

A hidden Markov model (HMM) describes a process which goes through a finite number of non-observable states whilst generating a signal that is either discrete or continuous in nature. In a first-order Markov model, the state at time t+1 depends only on the state at time t, regardless of the states at previous times [12]. Fig. 2 shows an example of a Markov model containing three states in a stochastic automaton.

The model probabilistically links the observed signal to the state transitions in the system. The theory provides a means by which:

1. The probability P(O|λ) of a HMM with parameter set λ generating a particular observation sequence O can be calculated, through what is called the forward algorithm.

2. The most likely state sequence the system went through in generating the observed signal can be found, through the Viterbi algorithm.

3. A set of re-estimation formulas can be applied for iteratively updating the HMM parameters given an observation sequence as training data. These formulas strive to maximize the probability of the sequence being generated by the model. The algorithm is known as the Baum-Welch or forward-backward procedure.

The term hidden hints at the process state transition sequence, which is hidden from the observer. The process reveals itself to the observer only through the generated observable signal. A HMM is parameterized through a matrix of transition probabilities between states and output probability distributions for observed signal frames given the internal process state. These probabilities are used in the algorithms mentioned above for achieving the desired results.

Fig. 1. The architecture of a first-order recurrent neural network. The recurrence from the hidden to the context layer is shown. Dashed lines indicate that more neurons can be used in each layer depending on the application.

Fig. 2. A first-order Markov model. π_i is the probability that the system will start in state S_i and a_ij is the probability that the system will move from state S_i to state S_j.
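For concreteness, a minimal sketch of the forward algorithm from item 1 above is given below in Python with NumPy; the variable names (pi, A, B) and the toy numbers follow standard HMM notation and are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def forward_probability(pi, A, B):
    """P(O | lambda) via the forward algorithm.

    pi : (N,)    initial state probabilities
    A  : (N, N)  transition probabilities a_ij
    B  : (T, N)  B[t, j] = b_j(O_t), likelihood of observation O_t in state j
    """
    alpha = pi * B[0]                      # initialisation: alpha_j(1)
    for t in range(1, B.shape[0]):
        alpha = (alpha @ A) * B[t]         # recursion, cf. (2) in Section 3.2
    return alpha.sum()                     # termination: sum over final states

# A tiny 3-state example with 4 observation frames (made-up numbers).
pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
B = np.array([[0.5, 0.2, 0.3],
              [0.4, 0.4, 0.2],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])
print(forward_probability(pi, A, B))
```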

2.3 Evolutionary Training of Recurrent Neural Networks

Genetic algorithms provide a learning method motivated by biological evolution. They are search techniques that can be used both for solving problems and for modelling evolutionary systems [13]. The problem faced by a genetic algorithm is to search a space of candidate hypotheses and find the best hypothesis. The fitness of a hypothesis is a numerical measure of how well it optimizes the problem. The algorithm operates by iteratively updating a pool of hypotheses, called the population. The population consists of many individuals called chromosomes. All members of the population are evaluated by the fitness function in each iteration. A new population is then generated by probabilistically selecting the most fit chromosomes from the current population. Some of the selected chromosomes are added to the new generation while others are selected as parent chromosomes. Parent chromosomes are used for creating new offspring by applying genetic operators such as crossover and mutation. Traditionally, chromosomes are represented as bit strings; however, real-number representations are possible.

In order to use genetic algorithms for training neural networks, we need to represent the problem as chromosomes. Real-valued weights must be encoded in the chromosome rather than binary values. This is done by altering the crossover and mutation operators. The crossover operator takes two parent chromosomes and creates a single child chromosome by randomly selecting corresponding genetic material from both parents. The mutation operator adds a small random real number between -1 and 1 to a randomly selected gene in the chromosome.

In evolutionary neural learning, the task of the genetic algorithm is to find the optimal set of weights in a network which minimizes the error function. The fitness function must reflect the performance of the neural network; thus, the fitness function is the reciprocal of the sum of squared errors of the neural network. To evaluate the fitness function, each weight encoded in the chromosome is assigned to the respective weight link of the network. The training set of examples is then presented to the network, which propagates the information forward, and the sum of squared errors is calculated. In this way, the genetic algorithm attempts to find a set of weights which minimizes the error function of the network. Recurrent neural networks have been trained by evolutionary computation methods such as genetic algorithms, which optimize the weights in the network architecture for a particular problem. Compared to gradient descent learning, genetic algorithms can help the network escape from local minima.
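The following is a minimal sketch of this evolutionary training loop, under our own simplifying assumptions: chromosomes are flat real-valued vectors, fitness is the reciprocal of the sum of squared errors, crossover picks each gene from one of the two parents, and mutation perturbs one gene by a value in [-1, 1]. The fitness-proportionate selection scheme and the population mechanics are illustrative, not the exact procedure used in this work.

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(chrom, forward_fn, X, Y):
    # Reciprocal of the sum of squared errors over the training set.
    sse = sum(np.sum((forward_fn(chrom, x) - y) ** 2) for x, y in zip(X, Y))
    return 1.0 / (sse + 1e-12)

def crossover(p1, p2):
    # Child takes each gene from one of the two parents at random.
    mask = rng.random(p1.size) < 0.5
    return np.where(mask, p1, p2)

def mutate(chrom):
    # Add a small random real number in [-1, 1] to one randomly chosen gene.
    child = chrom.copy()
    child[rng.integers(child.size)] += rng.uniform(-1.0, 1.0)
    return child

def evolve(forward_fn, X, Y, n_genes, pop_size=40, generations=100,
           p_cross=0.7, p_mut=0.1):
    pop = rng.uniform(-7, 7, (pop_size, n_genes))        # random initial weights
    for _ in range(generations):
        fit = np.array([fitness(c, forward_fn, X, Y) for c in pop])
        probs = fit / fit.sum()                          # fitness-proportionate selection
        new_pop = []
        while len(new_pop) < pop_size:
            i, j = rng.choice(pop_size, size=2, p=probs)
            child = crossover(pop[i], pop[j]) if rng.random() < p_cross else pop[i].copy()
            if rng.random() < p_mut:
                child = mutate(child)
            new_pop.append(child)
        pop = np.array(new_pop)
    fit = np.array([fitness(c, forward_fn, X, Y) for c in pop])
    return pop[np.argmax(fit)]                           # best chromosome found
```

The population size, crossover probability, mutation probability and weight range shown as defaults are the values reported later in Section 4.2.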

2.4 Speech Phoneme Classification

A speech sequence contains a huge amount of irrelevant information. In order to model speech, feature extraction is necessary. In feature extraction, useful information is extracted from the speech sequence and then used for modelling. Recurrent neural networks and hidden Markov models have been successfully applied to modelling speech sequences [1, 4]. They have been applied to recognize words and phonemes. The performance of a speech recognition system can be measured in terms of accuracy and speed. Recurrent neural networks are capable of modelling complicated recognition tasks. They have shown higher accuracy than hidden Markov models in recognition on low-quality, noisy data. However, hidden Markov models perform better when it comes to large vocabularies. Extensive research on speech recognition has been done for more than forty years; however, scientists have been unable to implement systems which show excellent performance in environments with background noise.

Mel frequency cepstral coefficients (MFCC) are a useful feature extraction technique, as the Mel filter has characteristics similar to the human auditory system [14]. The human ear performs similar processing before presenting information to the brain. We apply MFCC feature extraction to phonemes obtained from the TIMIT speech database. In MFCC feature extraction, a frame of the speech signal is passed through the discrete Fourier transform to change the signal from the time domain to the frequency domain. The resulting spectrum is then mapped onto the Mel scale using triangular overlapping windows. Finally, we compute the log energy at the output of each filter and take the discrete cosine transform of the Mel log energies, which becomes the vector of MFCC features.
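As an illustrative sketch only (the paper does not name a toolkit, and the library's default windowing may differ from the exact pipeline used here), the same DFT, Mel filterbank, log energy and DCT steps can be reproduced with the librosa library; the 512-sample frame, 256-sample step and 12 coefficients match the settings reported in Section 4.1, and the file path is hypothetical.

```python
import librosa

# Load one phoneme segment (hypothetical path); TIMIT audio is sampled at 16 kHz,
# so a 512-sample frame corresponds to 32 ms.
signal, sr = librosa.load("phoneme_b.wav", sr=16000)

# 12 MFCCs per frame: DFT -> Mel filterbank -> log energy -> DCT.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=12,
                            n_fft=512, hop_length=256)
print(mfcc.shape)   # (12, number_of_frames)
```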

3 Hybrid Recurrent Neural Networks Inspired by Hidden Markov Models

3.1 Motivation

We have stated earlier that the structural similarities between hidden Markov models and recurrent neural networks form the basis for combining the two paradigms into the hybrid architecture. Why is this a good idea? Most often, first-order hidden Markov models are used in practice, in which the state transition probabilities depend only on the previous state. This assumption is unrealistic for many real-world applications of hidden Markov models. Furthermore, the number of states in the hidden Markov model needs to be fixed beforehand for a particular application, and therefore the number of states varies between applications. The theory on recurrent neural networks and hidden Markov models suggests that the combination of the two paradigms may provide better generalization and training performance. Our proposed hybrid recurrent neural network architecture may also be capable of learning higher-order dependencies, and one does not need to fix the number of states as in the case of hidden Markov models.

3.2 Derivation

Consider the equation of the forward procedure for the calculation of the probability of the observation O given the model λ, i.e. P(O|λ), in hidden Markov models. Refer to (2).

\alpha_j(t) = \left[ \sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij} \right] b_j(O_t), \qquad 1 \le j \le N        (2)

where N is the number of hidden states in the HMM, a_ij is the probability of making a transition from state i to j, and b_j(O_t) is the Gaussian distribution for the observation at time t. The calculation in (2) is inherently recurrent and bears resemblance to the recursion of recurrent neural networks, given in (3).

x_j(t) = f\left( \sum_{i=1}^{N} x_i(t-1)\, w_{ij} \right), \qquad 1 \le j \le N        (3)

where f(.) is a non-linearity such as the sigmoid, N is the number of hidden neurons, and w_ij are the weights connecting the neurons with each other and with the input nodes. Equation (1) describes the dynamics of a first-order recurrent neural network, which is combined with the Gaussian distribution feature of (2) to form the hybrid architecture. We replace the subscript j in b_j(O_t), which in hidden Markov models denotes the state, by time t in order to incorporate the feature into recurrent neural networks. Hence the dynamics of the hybrid recurrent network architecture are given by:

S_i(t) = f\left( \left[ \sum_{k=1}^{K} V_{ik} S_k(t-1) + \sum_{j=1}^{J} W_{ij} I_j(t-1) \right] b(O_{t-1}) \right)        (4)

where b(O_{t-1}) is the Gaussian distribution. Note that the time subscript of the Gaussian distribution in (4) differs from the subscript of the Gaussian distribution in (2). The dynamics of hidden Markov models and recurrent networks differ in this respect; however, we can adjust the parameter for time t as shown in (4) in order to map hidden Markov models onto recurrent neural networks. For a single input, the univariate Gaussian distribution is used; refer to (5).

b(O_t) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(O_t - \mu)^2}{2\sigma^2} \right)        (5)

where O_t is the observation at time t, μ is the mean and σ² is the variance. For multiple inputs to the hybrid recurrent network, the multivariate Gaussian for d dimensions is used. Refer to (6).

b(O_t) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (O_t - \mu)^{T} \Sigma^{-1} (O_t - \mu) \right)        (6)

where O_t is a d-component column vector, μ is a d-component mean vector, Σ is a d-by-d covariance matrix, and |Σ| and Σ⁻¹ are its determinant and inverse, respectively.

Fig. 3 shows how the Gaussian distribution of the hidden Markov model is used to build the hybrid recurrent neural network. The output of the multivariate Gaussian function solely depends on the mean, which is a vector equal in size to the input vector. This parameter is represented in the chromosomes together with the weights and biases and is trained by genetic algorithms in order to model speech sequences.

Fig. 3. The architecture of the hybrid recurrent neural networks. The dashed lines indicate that the architecture can represent more neurons in each layer if required. The multivariate Gaussian function is shown by the shaded node.
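To make the mapping in (4)-(6) concrete, here is a minimal sketch of one state update of the hybrid network in Python with NumPy. Following (4), the Gaussian response multiplies the weighted sum inside the non-linearity; the identity covariance and the layer sizes are our own illustrative assumptions, since the experiments train only the mean vector and do not report the remaining settings here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multivariate_gaussian(o, mu, cov):
    # b(O_t) of (6): d-dimensional Gaussian density of the observation o.
    d = o.size
    diff = o - mu
    norm = 1.0 / (((2 * np.pi) ** (d / 2)) * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def hybrid_step(S_prev, I_prev, O_prev, V, W, mu, cov):
    """One update of the hybrid recurrent network, cf. (4).

    The weighted sum of context and input activations is multiplied by the
    Gaussian response b(O_{t-1}) before the sigmoid is applied.
    """
    b = multivariate_gaussian(O_prev, mu, cov)
    return sigmoid((V @ S_prev + W @ I_prev) * b)

# Illustrative sizes: 12 MFCC inputs, a small hidden layer, identity covariance.
K, J = 8, 12
rng = np.random.default_rng(2)
V = rng.uniform(-7, 7, (K, K))
W = rng.uniform(-7, 7, (K, J))
mu = rng.normal(size=J)
cov = np.eye(J)
S = np.zeros(K)
for frame in rng.normal(size=(5, J)):       # a short sequence of MFCC frames
    S = hybrid_step(S, frame, frame, V, W, mu, cov)
```

Note that the observation fed to the Gaussian is the same MFCC frame presented at the input layer, so the Gaussian node acts as a data-dependent gate on the hidden-layer activation.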


4 Empirical Results and Discussion

4.1 MFCC Feature Extraction

We used Mel frequency cepstral coefficients for feature extraction from three pairs of phonemes read from the TIMIT speech database. For each phoneme read, we used a frame size of 512 samples taken every 256 sample points. We applied the MFCC feature extraction technique as discussed previously and obtained 12 MFCC features for each frame of the phoneme. The feature extraction results are shown in Table 1, which gives details of each phoneme pair extracted. The frame size of 512 samples is equal to 32 ms.

Table 1. MFCC Feature Extraction

Phoneme Pair | No. of Training Samples | No. of Testing Samples | Frame Size
b and d      |  645                    |  238                   | 512
k and dx     | 4199                    | 1412                   | 512
q and jh     | 4196                    | 1420                   | 512

The three pairs of phonemes extracted from the TIMIT speech database. The frame size of 512 is equal to 32 ms.

4.2 Training Hybrid Recurrent Neural Networks for Phoneme Classification

In the hybrid recurrent neural network architecture, the neurons in the hidden layer compute the weighted sum of their inputs and multiply it by the output of the corresponding Gaussian function, which takes its inputs from the input layer. The product of the weighted sum and the output of the Gaussian function is then propagated from the hidden layer to the output layer, as shown in Fig. 3.

We obtained the training and testing data sets from the TIMIT database as shown in Table 1. We trained all the parameters of the hybrid architecture, i.e. the weights connecting the input to the hidden layer, the weights connecting the hidden to the output layer and the weights connecting the context to the hidden layer. We also trained the bias weights and the mean vector, a parameter of the multivariate Gaussian distribution function which takes its inputs from the input layer. We used the following hybrid recurrent neural network topology: 12 neurons in the input layer, representing the speech feature input, and 2 neurons in the output layer, representing a pair of phonemes. We experimented with different numbers of neurons in the hidden layer. We ran sample experiments and found that a population size of 40, a crossover probability of 0.7 and a mutation probability of 0.1 gave good genetic training performance; therefore, we used these values for training. We initialized the hybrid architecture with random weights in the range of -7 to 7 and used the genetic algorithm to train these weights. In effect, the genetic algorithm shuffles these weights and finds the optimal set of weights which best learns the training samples.
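As a minimal sketch under our own assumptions about the gene layout (the paper does not specify the ordering), the following shows how all trainable parameters of the hybrid network, namely the three weight matrices, the biases and the Gaussian mean vector, can be flattened into a single real-valued chromosome and recovered again for fitness evaluation.

```python
import numpy as np

# Topology used in the paper: 12 inputs, K hidden neurons, 2 outputs.
J, K, OUT = 12, 12, 2

def decode(chrom):
    """Split a flat chromosome back into the hybrid network's parameters.

    The ordering (input-to-hidden, context-to-hidden, hidden-to-output,
    hidden biases, Gaussian mean) is an illustrative convention.
    """
    sizes = [K * J, K * K, OUT * K, K, J]
    W_in, V, W_out, bias, mu = np.split(chrom, np.cumsum(sizes)[:-1])
    return (W_in.reshape(K, J), V.reshape(K, K),
            W_out.reshape(OUT, K), bias, mu)

n_genes = K * J + K * K + OUT * K + K + J
chromosome = np.random.default_rng(3).uniform(-7, 7, n_genes)   # random initialization
W_in, V, W_out, bias, mu = decode(chromosome)
```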

The maximum training time was 100 generations. We recorded the performance of the network being trained by the genetic algorithm at every generation until the maximum training time was reached. We used the best performance of the network as a stopping criterion for training. Using the information obtained for the stopping criterion, we started training. Once the network performed well up to the stopping criterion, we halted training and presented the network with the testing data set, which was not included during training. The trained hybrid recurrent neural network thus uses the knowledge gained in the training process to make a generalization. The results of all experiments are shown in Table 2. The training and generalization performance is given by the percentage of samples correctly classified in the training and testing sets, respectively.

Table 2. Hybrid Recurrent Neural Networks for Phoneme Classification

Phoneme Pair | Hidden Neurons | Training Time | Training Performance | Generalization Performance
b, d         | 12 | 2 | 87.75% | 82.6%
b, d         | 15 | 5 | 87.75% | 82.6%
k, dx        | 12 | 2 | 87.21% | 80.92%
k, dx        | 15 | 2 | 87.21% | 80.92%
q, jh        | 12 | 2 | 79%    | 75.81%
q, jh        | 15 | 3 | 79%    | 75.81%

The training time is given by the number of generations it takes the hybrid architecture to learn the training examples. The maximum training time is 100 generations. The training and generalization performance is given by the percentage of samples correctly classified by the network.

5 Conclusions

We have successfully combined the strengths of hidden Markov models and recurrent neural networks to construct a hybrid recurrent neural network architecture. The structural similarities between hidden Markov models and recurrent neural networks have been the basis for the successful mapping in the hybrid architecture. We have used genetic algorithms to train the hybrid system of hidden Markov models and recurrent neural networks. Our results show that speech sequences can be learned and modelled by hybrid recurrent neural networks. An open question remains as to the suitability of the hybrid architecture for modelling other real-world problems involving temporal sequences.


6 References

[1] A. J. Robinson, "An application of recurrent nets to phone probability estimation", IEEE Transactions on Neural Networks, Vol. 5, No. 2, 298-305, 1994.

[2] C. L. Giles, S. Lawrence and A. C. Tsoi, "Rule inference for financial prediction using recurrent neural networks", Proc. of the IEEE/IAFE Computational Intelligence for Financial Engineering, New York City, USA, 253-259, 1997.

[3] K. Murakami and H. Taguchi, "Gesture recognition using recurrent neural networks", Proc. of the SIGCHI Conference on Human Factors in Computing Systems: Reaching Through Technology, Louisiana, USA, 237-242, 1991.

[4] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition", Computer Speech and Language, Vol. 12, 75-98, 1998.

[5] T. Kobayashi and S. Haruyama, "Partly-hidden Markov model and its application to gesture recognition", Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 4, p. 3081, 1997.

[6] C. Lee Giles, C. W. Omlin and K. Thornber, "Equivalence in knowledge representation: automata, recurrent neural networks, and dynamical systems", Proc. of the IEEE, Vol. 87, No. 9, 1623-1640, 1999.

[7] R. Chandra and C. W. Omlin, "Evolutionary training of hybrid systems of recurrent neural networks and hidden Markov models", Transactions on Engineering, Computing and Technology: Proc. of the International Conference on Neural Networks, Vol. 15, Barcelona, Spain, 58-63, October 2006.

[8] P. Manolios and R. Fanelli, "First order recurrent neural networks and deterministic finite state automata", Neural Computation, Vol. 6, No. 6, 1154-1172, 1994.

[9] R. L. Watrous and G. M. Kuhn, "Induction of finite-state languages using second-order recurrent networks", Proc. of Advances in Neural Information Processing Systems, California, USA, 309-316, 1992.

[10] T. Lin, B. G. Horne, P. Tino and C. L. Giles, "Learning long-term dependencies in NARX recurrent neural networks", IEEE Transactions on Neural Networks, Vol. 7, No. 6, 1329-1338, 1996.

[11] S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation, Vol. 9, No. 8, 1735-1780, 1997.

[12] E. Alpaydin, Introduction to Machine Learning, The MIT Press, London, 306-311, 2004.

[13] T. M. Mitchell, Machine Learning, McGraw Hill, 1997.