Hybrid Recurrent Neural Networks:
An Application to Phoneme Classification
Rohitash Chandra
School of Science and Technology
The University of Fiji
Saweni, Lautoka, Fiji.

Christian W. Omlin
Department of Computer Science
The University of the Western Cape
Cape Town, South Africa.
Abstract - We present a hybrid recurrent neural network architecture, inspired by hidden Markov models, which has been shown to learn and represent dynamical systems. We use genetic algorithms to train the hybrid architecture and show their contribution to speech phoneme classification. We apply Mel frequency cepstral coefficient feature extraction to three pairs of phonemes obtained from the TIMIT speech database and use hybrid recurrent neural networks to model each pair of phonemes. Our results demonstrate that the hybrid architecture can successfully model speech sequences.

Keywords: Genetic algorithms, Hidden Markov models, Phoneme classification and Recurrent neural networks.
1 Introduction

Recurrent neural networks have been an important focus of research as they can be applied to difficult problems involving time-varying patterns. Their applications range from speech recognition and financial prediction to gesture recognition [1]-[3]. Hidden Markov models, on the other hand, have been very popular in the field of speech recognition [4]. They have also been applied to other problems such as gesture recognition [5].

Recurrent neural networks are capable of modelling complicated recognition tasks. They have shown higher accuracy than hidden Markov models in speech recognition on low-quality, noisy data. However, hidden Markov models perform better when it comes to large-vocabulary speech recognition. One limitation of hidden Markov models in speech recognition is the assumption that the probability of being in a state at time t depends only on the previous state, i.e. the state at time t-1. This assumption is inappropriate for speech signals, where dependencies often extend through several states; nevertheless, hidden Markov models have performed extremely well for certain types of speech recognition. Recurrent neural networks are dynamical systems, and it has been shown that they can represent deterministic finite automata in their internal weight representations [6].
The structural similarity between hidden Markov models and recurrent neural networks is the basis for constructing the hybrid recurrent neural network architecture. The recurrence equation of the recurrent neural network resembles the equation of the forward algorithm in hidden Markov models. The combination of the two paradigms into a hybrid system may provide better generalization and training performance, which would be a useful contribution to the fields of machine learning and pattern recognition. We have previously introduced a slight variation of this architecture and shown that it can represent dynamical systems [7]. In this paper, we show that the hybrid recurrent neural network architecture can be applied to model speech sequences. We use Mel frequency cepstral coefficients (MFCC) for feature extraction from the TIMIT speech database. We extract three different pairs of phonemes and use hybrid recurrent neural networks for modelling.
Evolutionary optimization techniques such as genetic algorithms have been popular alternatives to gradient descent learning for training neural networks [8]. It has been observed that genetic algorithms overcome the problem of local minima, whereas in gradient descent search it may be difficult to drive the network out of a local minimum, which in turn proves costly in terms of training time. In this paper, we show how genetic algorithms can be used to train the hybrid recurrent neural network architecture inspired by hidden Markov models. We use genetic algorithms to train hybrid recurrent neural networks for the classification of phonemes extracted from the TIMIT speech database. We end this paper with conclusions from our work and possible directions for future research.
2 Definitions and Methods

2.1 Recurrent Neural Networks

Recurrent neural networks maintain information about their past states for the computation of future states and outputs by using feedback connections. They are composed of an input layer, a context layer which provides state information, a hidden layer and an output layer, as shown in Fig. 1. Each layer contains one or more processing units called neurons, which propagate information from one layer to the next by computing a non-linear function of the weighted sum of their inputs.

Popular architectures of recurrent neural networks include first-order recurrent networks [8], second-order recurrent networks [9], NARX networks [10] and LSTM recurrent networks [11]. A detailed study of the vast variety of recurrent neural networks is beyond the scope of this paper; however, we discuss the dynamics of first-order recurrent neural networks, given in (1).
$$S_i(t) = g\left( \sum_{k=1}^{K} V_{ik}\, S_k(t-1) + \sum_{j=1}^{J} W_{ij}\, I_j(t-1) \right) \qquad (1)$$
where $S_k(t)$ and $I_j(t)$ represent the output of the state neurons and input neurons respectively, $V_{ik}$ and $W_{ij}$ represent their corresponding weights, and $g(\cdot)$ is a sigmoidal discriminant function. We use this architecture to construct a hybrid recurrent neural network architecture inspired by hidden Markov models and show that it can learn and model speech sequences.
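To make the dynamics in (1) concrete, the following minimal sketch (ours, not from the original paper) implements a single state update in Python with NumPy. The dimensions K and J and the sigmoid g follow (1); the weight values and the length of the input sequence are illustrative placeholders.

```python
import numpy as np

def sigmoid(x):
    """Sigmoidal discriminant function g(.) from (1)."""
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(S_prev, I_prev, V, W):
    """One state update of a first-order recurrent network, as in (1).

    S_prev: state vector S(t-1) of length K
    I_prev: input vector I(t-1) of length J
    V:      K x K context-to-hidden weights
    W:      K x J input-to-hidden weights
    """
    return sigmoid(V @ S_prev + W @ I_prev)

# Illustrative dimensions: K = 3 state neurons, J = 2 inputs.
rng = np.random.default_rng(0)
K, J = 3, 2
V = rng.standard_normal((K, K))
W = rng.standard_normal((K, J))
S = np.zeros(K)                         # initial state S(0)
for I in rng.standard_normal((5, J)):   # a short input sequence
    S = rnn_step(S, I, V, W)
```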
2.2 Hidden Markov Models

A hidden Markov model (HMM) describes a process which goes through a finite number of non-observable states whilst generating a signal that is either discrete or continuous in nature. In a first-order Markov model, the state at time t+1 depends only on the state at time t, regardless of the states at previous times [12]. Fig. 2 shows an example of a Markov model containing three states in a stochastic automaton.

The model probabilistically links the observed signal to the state transitions in the system. The theory provides a means by which:

1. The probability $P(O|\lambda)$ of an HMM with parameter set $\lambda$ generating a particular observation sequence $O$ can be calculated, through what is called the forward algorithm.

2. The most likely state sequence the system went through in generating the observed signal can be found, through the Viterbi algorithm.

3. A set of re-estimation formulas can iteratively update the HMM parameters given an observation sequence as training data. These formulas strive to maximize the probability of the sequence being generated by the model. The algorithm is known as the Baum-Welch or forward-backward procedure.

The term hidden hints at the process state transition sequence, which is hidden from the observer. The process reveals itself to the observer only through the generated observable signal. An HMM is parameterized through a matrix of transition probabilities between states and output probability distributions for observed signal frames given the internal process state. These probabilities are used in the aforementioned algorithms to achieve the desired results.
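As an illustration of the forward algorithm mentioned in point 1 above, here is a minimal sketch in Python/NumPy for a discrete-observation HMM; the notation for $\pi_i$, $a_{ij}$ and $b_j(\cdot)$ follows the text, while the three-state toy model and its probabilities are illustrative values we have assumed.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: P(O | lambda) for a discrete-observation HMM.

    pi:  initial state probabilities pi_i, shape (N,)
    A:   state transition probabilities a_ij, shape (N, N)
    B:   emission probabilities b_j(o), shape (N, M) for M symbols
    obs: observation sequence as symbol indices
    """
    alpha = pi * B[:, obs[0]]          # alpha_j(1) = pi_j * b_j(O_1)
    for o in obs[1:]:
        # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] * b_j(O_t)
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()                 # P(O | lambda)

# A toy 3-state model with 2 observable symbols (illustrative values).
pi = np.array([0.6, 0.3, 0.1])
A  = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1],
               [0.2, 0.3, 0.5]])
B  = np.array([[0.9, 0.1],
               [0.5, 0.5],
               [0.1, 0.9]])
print(forward(pi, A, B, [0, 1, 1, 0]))
```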
Fig. 1. The architecture of a first-order recurrent neural network. The recurrence from the hidden to the context layer is shown. Dashed lines indicate that more neurons can be used in each layer depending on the application.

Fig. 2. A first-order Markov model. $\pi_i$ is the probability that the system will start in state $S_i$ and $a_{ij}$ is the probability that the system will move from state $S_i$ to state $S_j$.
2.3 Evolutionary Training of Recurrent Neural Networks

Genetic algorithms provide a learning method motivated by biological evolution. They are search techniques that can be used both for solving problems and for modelling evolutionary systems [13]. The problem faced by genetic algorithms is to search a space of candidate hypotheses and find the best hypothesis. The fitness of a hypothesis is a numerical measure of how well it optimizes the problem. The algorithm operates by iteratively updating a pool of hypotheses, called the population. The population consists of many individuals called chromosomes. All members of the population are evaluated by the fitness function in each iteration. A new population is then generated by probabilistically selecting the most fit chromosomes from the current population. Some of the selected chromosomes are added to the new generation while others are selected as parent chromosomes. Parent chromosomes are used to create new offspring by applying genetic operators such as crossover and mutation. Traditionally, chromosomes are represented as bit strings; however, real-number representations are also possible.

In order to use genetic algorithms for training neural networks, we need to represent the problem as chromosomes. The real-numbered values of the weights must be encoded in the chromosome rather than binary values. This is done by altering the crossover and mutation operators. The crossover operator takes two parent chromosomes and creates a single child chromosome by randomly selecting corresponding genetic material from both parents. The mutation operator adds a small random real number between -1 and 1 to a randomly selected gene in the chromosome.
In evolutionary neural learning, the task of the genetic algorithm is to find the optimal set of weights in a network which minimizes the error function. The fitness function must reflect the performance of the neural network; thus, the fitness function is the reciprocal of the sum of squared errors of the neural network. To evaluate the fitness function, each weight encoded in the chromosome is assigned to the respective weight link of the network. The training set of examples is then presented to the network, which propagates the information forward, and the sum of squared errors is calculated. In this way, the genetic algorithm attempts to find a set of weights which minimizes the error function of the network. Recurrent neural networks have been trained by evolutionary computation methods such as genetic algorithms, which optimize the weights in the network architecture for a particular problem. Compared to gradient descent learning, genetic algorithms can help the network escape from local minima.
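As a concrete sketch of this training scheme (our illustration, not the authors' code), the loop below evolves a real-valued weight vector with fitness-proportionate selection and the crossover and mutation operators described above. The population size, crossover probability, mutation probability and weight-initialization range match the values reported in Section 4.2; `sse_fn` is a hypothetical placeholder that would decode a chromosome into network weights and return the sum of squared errors over the training set.

```python
import numpy as np

rng = np.random.default_rng(1)

def crossover(p1, p2):
    """Create a single child by randomly taking each gene from one parent."""
    mask = rng.random(p1.size) < 0.5
    return np.where(mask, p1, p2)

def mutate(chrom):
    """Add a small random real number in [-1, 1] to a randomly selected gene."""
    child = chrom.copy()
    child[rng.integers(chrom.size)] += rng.uniform(-1.0, 1.0)
    return child

def evolve(sse_fn, n_weights, pop_size=40, generations=100,
           p_cross=0.7, p_mut=0.1):
    """Evolve a weight vector; fitness = 1 / sum of squared errors."""
    pop = rng.uniform(-7.0, 7.0, (pop_size, n_weights))
    for _ in range(generations):
        fit = np.array([1.0 / (sse_fn(c) + 1e-12) for c in pop])
        # Probabilistically select parent pairs in proportion to fitness.
        idx = rng.choice(pop_size, size=(pop_size, 2), p=fit / fit.sum())
        new_pop = []
        for i, j in idx:
            child = crossover(pop[i], pop[j]) if rng.random() < p_cross else pop[i].copy()
            if rng.random() < p_mut:
                child = mutate(child)
            new_pop.append(child)
        pop = np.array(new_pop)
    fit = np.array([1.0 / (sse_fn(c) + 1e-12) for c in pop])
    return pop[fit.argmax()]           # best chromosome found

# Usage with a toy error function standing in for the network's SSE:
best = evolve(lambda w: np.sum((w - 0.5) ** 2), n_weights=10)
```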
2.4 Speech Phoneme Classification

A speech sequence contains a huge amount of irrelevant information; in order to model it, feature extraction is necessary. In feature extraction, useful information is extracted from speech sequences and then used for modelling. Recurrent neural networks and hidden Markov models have been successfully applied to modelling speech sequences [1, 4]. They have been applied to recognizing words and phonemes. The performance of a speech recognition system can be measured in terms of accuracy and speed. Recurrent neural networks are capable of modelling complicated recognition tasks, and have shown higher recognition accuracy than hidden Markov models on low-quality, noisy data. However, hidden Markov models perform better when it comes to large vocabularies. Extensive research on speech recognition has been carried out for more than forty years; however, scientists have been unable to implement systems which show excellent performance in environments with background noise.

Mel frequency cepstral coefficients (MFCC) are a useful feature extraction technique, as the Mel filter has characteristics similar to the human auditory system [14]; the human ear performs similar processing before presenting information to the brain. We apply MFCC feature extraction to phonemes obtained from the TIMIT speech database. In MFCC feature extraction, a frame of the speech signal is passed through the discrete Fourier transform to convert the signal from the time domain to the frequency domain. The resulting spectrum is then mapped onto the Mel scale using triangular overlapping windows. Finally, we compute the log energy at the output of each filter and then take a discrete cosine transform of the Mel amplitudes, which yields the vector of MFCC features.
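The sketch below (ours, not the authors' code) walks through these steps for a single frame using NumPy and SciPy. The 512-sample frame, 16 kHz sampling rate and 12 coefficients follow the setup used later in this paper; the number of mel filters and the mel-scale formula constants are common defaults assumed here for illustration.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_filters=26, n_coeffs=12):
    """MFCC features for one speech frame, following the steps in the text."""
    n_fft = len(frame)                          # e.g. 512 samples = 32 ms at 16 kHz
    spectrum = np.abs(np.fft.rfft(frame)) ** 2  # DFT: time domain -> power spectrum

    # Triangular overlapping filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge

    # Log energy at each filter output, then DCT of the mel amplitudes.
    log_energy = np.log(fbank @ spectrum + 1e-10)
    return dct(log_energy, type=2, norm='ortho')[:n_coeffs]

# Usage on an illustrative 512-sample frame:
coeffs = mfcc_frame(np.random.randn(512))
```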
3 Hybrid Recurrent Neural Networks Inspired by Hidden Markov Models

3.1 Motivation

We stated earlier that the structural similarities between hidden Markov models and recurrent neural networks form the basis for combining the two paradigms into the hybrid architecture. Why is this a good idea? Most often, first-order hidden Markov models are used in practice, in which the state transition probabilities depend only on the previous state. This assumption is unrealistic for many real-world applications of hidden Markov models. Furthermore, the number of states in the hidden Markov model needs to
be fixed beforehand for a particular application, and the appropriate number of states varies across applications. The theory of recurrent neural networks and hidden Markov models suggests that the combination of the two paradigms may provide better generalization and training performance. Our proposed hybrid recurrent neural network architecture may also be capable of learning higher-order dependencies, and one does not need to fix the number of states as is done in the case of hidden Markov models.
3.2 Derivation

Consider the equation of the forward procedure for calculating the probability of the observation $O$ given the model $\lambda$, i.e. $P(O|\lambda)$, in hidden Markov models, given in (2):
$$\alpha_j(t) = \left[ \sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij} \right] b_j(O_t), \qquad 1 \le j \le N \qquad (2)$$
where $N$ is the number of hidden states in the HMM, $a_{ij}$ is the probability of making a transition from state $i$ to state $j$, and $b_j(O_t)$ is the Gaussian distribution for the observation at time $t$. The calculation in (2) is inherently recurrent and bears resemblance to the recursion of recurrent neural networks, given in (3):
$$x_j(t) = f\left( \sum_{i=1}^{N} x_i(t-1)\, w_{ij} \right), \qquad 1 \le j \le N \qquad (3)$$
where $f(\cdot)$ is a non-linearity such as the sigmoid, $N$ is the number of hidden neurons, and $w_{ij}$ are the weights connecting the neurons with each other and with the input nodes. Equation (1) describes the dynamics of the first-order recurrent neural network, which is combined with the Gaussian distribution feature of (2) to form the hybrid architecture. We replace the subscript $j$ in $b_j(O_t)$, which in hidden Markov models denotes the state, with an index over time in order to incorporate the feature into recurrent neural networks. Hence the dynamics of the hybrid recurrent network architecture are given by:
$$S_i(t) = f\left( \left[ \sum_{k=1}^{K} V_{ik}\, S_k(t-1) + \sum_{j=1}^{J} W_{ij}\, I_j(t-1) \right] b(O_{t-1}) \right) \qquad (4)$$
where $b(O_{t-1})$ is the Gaussian distribution. Note that the time index of $b(O_{t-1})$ in (4) differs from the time index of the Gaussian distribution in (2). The dynamics of hidden Markov models and recurrent networks differ in this respect; however, we can adjust the time parameter as shown in (4) in order to map hidden Markov models into recurrent neural networks. For a single input, the univariate Gaussian distribution is used, given in (5):
$$b(O_t) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(O_t - \mu)^2}{2\sigma^2} \right) \qquad (5)$$
where $O_t$ is the observation at time $t$, $\mu$ is the mean and $\sigma^2$ is the variance. For multiple inputs to the hybrid recurrent network, the multivariate Gaussian for $d$ dimensions is used, given in (6):
$$b_t(O) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (O - \mu)^{\mathsf{T}} \Sigma^{-1} (O - \mu) \right) \qquad (6)$$
where $O$ is a $d$-component column vector, $\mu$ is a $d$-component mean vector, $\Sigma$ is a $d$-by-$d$ covariance matrix, and $|\Sigma|$ and $\Sigma^{-1}$ are its determinant and inverse, respectively.

Fig. 3 shows how the Gaussian distribution of the hidden Markov model is used to build the hybrid recurrent neural network. The output of the multivariate Gaussian function depends solely on the mean, a vector equal in size to the input vector. This parameter is represented in the chromosomes together with the weights and biases, and is trained by genetic algorithms in order to model speech sequences.
Fig. 3. The architecture of the hybrid recurrent neural network. The dashed lines indicate that the architecture can accommodate more neurons in each layer if required. The multivariate Gaussian function is shown by the shaded node.
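To make the mapping concrete, here is a minimal sketch of one hybrid state update combining (4) and (6). Since the paper trains only the mean vector of the Gaussian, we assume an identity covariance matrix for this sketch; the dimensions follow the experimental topology of Section 4.2, and the weight values are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gaussian(o, mu, cov):
    """Multivariate Gaussian b(O) from (6) for a d-dimensional observation."""
    d = len(o)
    diff = o - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def hybrid_step(S_prev, I_prev, V, W, mu, cov):
    """One hybrid state update, as in (4): the weighted sum is multiplied
    by the Gaussian response to the previous observation."""
    b = gaussian(I_prev, mu, cov)
    return sigmoid((V @ S_prev + W @ I_prev) * b)

# Illustrative dimensions: 12 MFCC inputs, 12 hidden/state neurons.
rng = np.random.default_rng(2)
K, J = 12, 12
V, W = rng.standard_normal((K, K)), rng.standard_normal((K, J))
mu = rng.standard_normal(J)       # trainable mean vector (evolved with the weights)
cov = np.eye(J)                   # identity covariance: an assumption of this sketch
S = np.zeros(K)
for I in rng.standard_normal((10, J)):   # a sequence of feature frames
    S = hybrid_step(S, I, V, W, mu, cov)
```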
4 Empirical Results and Discussion

4.1 MFCC Feature Extraction

We used Mel frequency cepstral coefficients for feature extraction from three pairs of phonemes read from the TIMIT speech database. For each phoneme, we used a frame size of 512 samples taken every 256 sample points. We used the MFCC feature extraction technique discussed previously and obtained 12 MFCC features for each frame of the phoneme. The feature extraction results are shown in Table 1, which gives information on each extracted phoneme pair. At the TIMIT sampling rate of 16 kHz, the frame size of 512 samples corresponds to 32 ms.
Table 1
MFCC Feature Extraction

Phoneme Pair   No. of Training Samples   No. of Testing Samples   Frame Size
b and d        645                       238                      512
k and dx       4199                      1412                     512
q and jh       4196                      1420                     512

The three pairs of phonemes extracted from the TIMIT speech database. The frame size of 512 samples is equal to 32 ms.
4.2 Training Hybrid Recurrent Neural Networks for Phoneme Classification

In the hybrid recurrent neural network architecture, the neurons in the hidden layer compute the weighted sum of their inputs and multiply it by the output of the corresponding Gaussian function, which receives its inputs from the input layer. This product is passed through the activation function and propagated from the hidden layer to the output layer, as shown in Fig. 3.
We obtained the training and testing data sets from the TIMIT database as shown in Table 1. We trained all the parameters of the hybrid architecture, i.e. the weights connecting the input to the hidden layer, the hidden to the output layer, and the context to the hidden layer. We also trained the bias weights and the mean vector, a parameter of the multivariate Gaussian distribution function which receives its inputs from the input layer. We used the following hybrid recurrent neural network topology: 12 neurons in the input layer, representing the speech feature input, and 2 neurons in the output layer, representing a pair of phonemes. We experimented with different numbers of neurons in the hidden layer. We ran sample experiments and found that a population size of 40, a crossover probability of 0.7 and a mutation probability of 0.1 gave good genetic training performance; therefore, we used these values for training. We initialized the hybrid architecture with random weights in the range of -7 to 7 and used the genetic algorithm to train these weights. In effect, the genetic algorithm shuffles these weights and finds the optimal set of weights which best learns the training samples.
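The sketch below shows one way all of these parameters could be packed into a single real-valued chromosome for the genetic algorithm and scored by the reciprocal-of-SSE fitness described in Section 2.3. The layer sizes follow the topology above; `run_hybrid` is a hypothetical helper standing in for the forward pass of (4), and the exact ordering of parameters in the chromosome is our assumption.

```python
import numpy as np

# Topology from the experiments: 12 inputs, H hidden neurons, 2 outputs.
J, H, OUT = 12, 12, 2

# Parameter counts: context-to-hidden, input-to-hidden, hidden-to-output
# weights, hidden and output biases, and the Gaussian mean vector.
SIZES = [H * H, H * J, OUT * H, H, OUT, J]

def decode(chrom):
    """Split a flat chromosome into the hybrid network's parameter arrays."""
    parts, start = [], 0
    for n in SIZES:
        parts.append(chrom[start:start + n])
        start += n
    V  = parts[0].reshape(H, H)     # context -> hidden weights
    W  = parts[1].reshape(H, J)     # input -> hidden weights
    U  = parts[2].reshape(OUT, H)   # hidden -> output weights
    bh, bo, mu = parts[3], parts[4], parts[5]
    return V, W, U, bh, bo, mu

def fitness(chrom, sequences, targets):
    """Reciprocal of the sum of squared errors over the training set.
    `run_hybrid` (hypothetical) would apply (4) over a sequence and
    return the output-layer activations for the final frame."""
    V, W, U, bh, bo, mu = decode(chrom)
    sse = sum(np.sum((run_hybrid(seq, V, W, U, bh, bo, mu) - y) ** 2)
              for seq, y in zip(sequences, targets))
    return 1.0 / (sse + 1e-12)

chromosome = np.random.uniform(-7, 7, sum(SIZES))   # random initial weights in [-7, 7]
```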
The maximum training time was 100 generations. We recorded the performance of the network trained by the genetic algorithm at every generation until the maximum training time was reached. We used the best performance of the network as a stopping criterion for training. Using the information obtained for the stopping criterion, we started training. Once the network performed at the level of the stopping criterion, we halted training and presented the network with the testing data set, which was not included during training. The trained hybrid recurrent neural network thus uses the knowledge gained in the training process to make a generalization. The results of all experiments are shown in Table 2. The training and generalization performance is given by the percentage of samples correctly classified in the training and testing sets, respectively.
Table 2
Hybrid Recurrent Neural Networks for Phoneme Classification

Phoneme Pair   Hidden Neurons   Training Time   Training Performance   Generalization Performance
b, d           12               2               87.75%                 82.6%
b, d           15               5               87.75%                 82.6%
k, dx          12               2               87.21%                 80.92%
k, dx          15               2               87.21%                 80.92%
q, jh          12               2               79%                    75.81%
q, jh          15               3               79%                    75.81%

The training time is given by the number of generations it takes the hybrid architecture to learn the training examples; the maximum training time was 100 generations. The training and generalization performance is given by the percentage of samples correctly classified by the network.
5 Conclusions

We have successfully combined the strengths of hidden Markov models and recurrent neural networks to construct a hybrid recurrent neural network architecture. The structural similarities between hidden Markov models and recurrent neural networks have been the basis for the successful mapping in the hybrid architecture. We have used genetic algorithms to train the hybrid system of hidden Markov models and recurrent neural networks. Our results show that speech sequences can be learned and modelled by hybrid recurrent neural networks. An open question remains as to the suitability of the hybrid architecture for modelling other real-world application problems involving temporal sequences.
6 References

[1] A. J. Robinson, "An application of recurrent nets to phone probability estimation", IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298-305, 1994.

[2] C. L. Giles, S. Lawrence and A. C. Tsoi, "Rule inference for financial prediction using recurrent neural networks", Proc. of the IEEE/IAFE Conference on Computational Intelligence for Financial Engineering, New York City, USA, pp. 253-259, 1997.

[3] K. Murakami and H. Taguchi, "Gesture recognition using recurrent neural networks", Proc. of the SIGCHI Conference on Human Factors in Computing Systems: Reaching Through Technology, Louisiana, USA, pp. 237-242, 1991.

[4] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition", Computer Speech and Language, vol. 12, pp. 75-98, 1998.

[5] T. Kobayashi and S. Haruyama, "Partly-hidden Markov model and its application to gesture recognition", Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, p. 3081, 1997.

[6] C. L. Giles, C. W. Omlin and K. K. Thornber, "Equivalence in knowledge representation: automata, recurrent neural networks, and dynamical systems", Proc. of the IEEE, vol. 87, no. 9, pp. 1623-1640, 1999.

[7] R. Chandra and C. W. Omlin, "Evolutionary training of hybrid systems of recurrent neural networks and hidden Markov models", Transactions on Engineering, Computing and Technology: Proc. of the International Conference on Neural Networks, vol. 15, Barcelona, Spain, pp. 58-63, October 2006.

[8] P. Manolios and R. Fanelli, "First-order recurrent neural networks and deterministic finite state automata", Neural Computation, vol. 6, no. 6, pp. 1154-1172, 1994.

[9] R. L. Watrous and G. M. Kuhn, "Induction of finite-state languages using second-order recurrent networks", Advances in Neural Information Processing Systems, California, USA, pp. 309-316, 1992.

[10] T. Lin, B. G. Horne, P. Tino and C. L. Giles, "Learning long-term dependencies in NARX recurrent neural networks", IEEE Transactions on Neural Networks, vol. 7, no. 6, pp. 1329-1338, 1996.

[11] S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[12] E. Alpaydin, Introduction to Machine Learning, The MIT Press, London, pp. 306-311, 2004.

[13] T. M. Mitchell, Machine Learning, McGraw Hill, 1997.