
Hybrid Recurrent Neural Networks: An Application to Phoneme Classification

Rohitash Chandra
School of Science and Technology
The University of Fiji
Saweni, Lautoka, Fiji.

Christian W. Omlin
Department of Computer Science
The University of Western Cape
Cape Town, South Africa.

Abstract - We present a hybrid recurrent neural network architecture inspired by hidden Markov models, which have been shown to learn and represent dynamical systems. We use genetic algorithms for training the hybrid architecture and show their contribution to speech phoneme classification. We apply Mel frequency cepstral coefficient feature extraction to three pairs of phonemes obtained from the TIMIT speech database and use hybrid recurrent neural networks to model each pair of phonemes. Our results demonstrate that the hybrid architecture can successfully model speech sequences.

Keywords: Genetic algorithms, Hidden Markov models, Phoneme classification, Recurrent neural networks.

1 Introduction

Recurrent neural networks have been an important focus of research as they can be applied to difficult problems involving time-varying patterns. Their applications range from speech recognition and financial prediction to gesture recognition [1]-[3]. Hidden Markov models, on the other hand, have been very popular in the field of speech recognition [4]. They have also been applied to other problems such as gesture recognition [5].

Recurrent neural networks are capable of modelling complicated recognition tasks. They have shown higher accuracy than hidden Markov models in speech recognition on low-quality, noisy data. However, hidden Markov models perform better for large-vocabulary speech recognition. One limitation of hidden Markov models in speech recognition is the assumption that the probability of being in a state at time t depends only on the previous state, i.e. the state at time t-1. This assumption is inappropriate for speech signals, where dependencies often extend through several states; nevertheless, hidden Markov models have performed extremely well for certain types of speech recognition. Recurrent neural networks are dynamical systems, and it has been shown that they can represent deterministic finite automata in their internal weight representations [6].

The structural similarity between hidden Markov models and recurrent neural networks is the basis for constructing the hybrid recurrent neural network architecture. The recurrence equation in a recurrent neural network resembles the equation of the forward algorithm in hidden Markov models. The combination of the two paradigms into a hybrid system may provide better generalization and training performance, which would be a useful contribution to the fields of machine learning and pattern recognition. We have previously introduced a slight variation of this architecture and shown that it can represent dynamical systems [7]. In this paper, we show that the hybrid recurrent neural network architecture can be applied to model speech sequences. We use Mel frequency cepstral coefficients (MFCC) for feature extraction from the TIMIT speech database. We extract three different pairs of phonemes and use hybrid recurrent neural networks for modelling.

Evolutionary optimization techniques such as genetic algorithms have been popular for training neural networks as an alternative to gradient descent learning [8]. It has been observed that genetic algorithms overcome the problem of local minima, whereas in gradient descent search for the optimal solution it may be difficult to drive the network out of local minima, which in turn proves costly in terms of training time. In this paper, we show how genetic algorithms can be used to train the hybrid recurrent neural network architecture inspired by hidden Markov models, and we use them to train hybrid recurrent neural networks for the classification of phonemes extracted from the TIMIT speech database. We end this paper with conclusions from our work and possible directions for future research.


2 Definitions and Methods

2.1 Recurrent Neural Networks

Recurrent neural networks maintain information about their past states for the computation of future states and outputs by using feedback connections. They are composed of an input layer, a context layer which provides state information, a hidden layer and an output layer, as shown in Fig. 1. Each layer contains one or more processing units called neurons, which propagate information from one layer to the next by computing a non-linear function of their weighted sum of inputs.

Popular architectures of recurrent neural networks include first-order recurrent networks [8], second-order recurrent networks [9], NARX networks [10] and LSTM recurrent networks [11]. A detailed study of the vast variety of recurrent neural networks is beyond the scope of this paper; however, we discuss the dynamics of first-order recurrent neural networks, given in (1).

S_i(t) = g\left( \sum_{k=1}^{K} V_{ik} S_k(t-1) + \sum_{j=1}^{J} W_{ij} I_j(t-1) \right)        (1)

where S_k(t) and I_j(t) represent the outputs of the state neurons and input neurons respectively, and V_ik and W_ij represent their corresponding weights. g(.) is a sigmoidal discriminant function. We will use this architecture to construct a hybrid recurrent neural network architecture inspired by hidden Markov models and show that it can learn and model speech sequences.
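As an illustration only, the following is a minimal sketch of the state update in (1), written in Python with NumPy; the layer sizes and weight names mirror the notation above but are otherwise assumptions, not the exact implementation used here.

```python
import numpy as np

def sigmoid(x):
    # g(.) in (1): sigmoidal discriminant function
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(S_prev, I_prev, V, W):
    """One state update of a first-order recurrent network, as in (1).

    S_prev : (K,)   previous state-neuron outputs S_k(t-1)
    I_prev : (J,)   previous inputs I_j(t-1)
    V      : (K, K) context-to-hidden weights V_ik
    W      : (K, J) input-to-hidden weights W_ij
    """
    return sigmoid(V @ S_prev + W @ I_prev)

# Illustrative dimensions; the experiments later use 12 input neurons.
K, J = 12, 12
rng = np.random.default_rng(0)
S = np.zeros(K)
V = rng.uniform(-1, 1, (K, K))
W = rng.uniform(-1, 1, (K, J))
for I_t in rng.normal(size=(5, J)):      # a short input sequence
    S = rnn_step(S, I_t, V, W)
```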

2.2 Hidden Markov Models

A hidden Markov model (HMM) describes a process which goes through a finite number of non-observable states whilst generating a signal that is either discrete or continuous in nature. In a first-order Markov model, the state at time t+1 depends only on the state at time t, regardless of the states at previous times [12]. Fig. 2 shows an example of a Markov model containing three states in a stochastic automaton.

The model probabilistically links the observed signal to the state transitions in the system. The theory provides a means by which:

1. The probability P(O|λ) of a HMM with parameter set λ generating a particular observation sequence O can be calculated, through what is called the forward algorithm.

2. The most likely state sequence the system went through in generating the observed signal can be found, through the Viterbi algorithm.

3. A set of re-estimation formulas can be applied for iteratively updating the HMM parameters given an observation sequence as training data. These formulas strive to maximize the probability of the sequence being generated by the model. The algorithm is known as the Baum-Welch or forward-backward procedure.

The term hidden hints at the process state transition sequence, which is hidden from the observer. The process reveals itself to the observer only through the generated observable signal. A HMM is parameterized through a matrix of transition probabilities between states and output probability distributions for observed signal frames given the internal process state. These probabilities are used in the algorithms mentioned above for achieving the desired results.

Fig. 1. The architecture of a first-order recurrent neural network. The recurrence from the hidden to the context layer is shown. Dashed lines indicate that more neurons can be used in each layer depending on the application.

Fig. 2. A first-order Markov model. π_i is the probability that the system will start in state S_i and a_ij is the probability that the system will move from state S_i to state S_j.
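For concreteness, a minimal sketch of the forward algorithm from item 1 above is given below in Python with NumPy; the variable names (pi, A, B) and the toy numbers follow standard HMM notation and are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def forward_probability(pi, A, B):
    """P(O | lambda) via the forward algorithm.

    pi : (N,)    initial state probabilities
    A  : (N, N)  transition probabilities a_ij
    B  : (T, N)  B[t, j] = b_j(O_t), likelihood of observation O_t in state j
    """
    alpha = pi * B[0]                      # initialisation: alpha_j(1)
    for t in range(1, B.shape[0]):
        alpha = (alpha @ A) * B[t]         # recursion, cf. (2) in Section 3.2
    return alpha.sum()                     # termination: sum over final states

# A tiny 3-state example with 4 observation frames (made-up numbers).
pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
B = np.array([[0.5, 0.2, 0.3],
              [0.4, 0.4, 0.2],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])
print(forward_probability(pi, A, B))
```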

2.3 Evolutionary Training of Recurrent Neural Networks

Genetic algorithms provide a learning method motivated by biological evolution. They are search techniques that can be used both for solving problems and for modelling evolutionary systems [13]. The problem faced by a genetic algorithm is to search a space of candidate hypotheses and find the best hypothesis. The fitness of a hypothesis is a numerical measure of how well it optimizes the problem. The algorithm operates by iteratively updating a pool of hypotheses, called the population. The population consists of many individuals called chromosomes. All members of the population are evaluated by the fitness function in each iteration. A new population is then generated by probabilistically selecting the most fit chromosomes from the current population. Some of the selected chromosomes are added to the new generation while others are selected as parent chromosomes. Parent chromosomes are used for creating new offspring by applying genetic operators such as crossover and mutation. Traditionally, chromosomes are represented as bit strings; however, real-number representations are possible.

In order to use genetic algorithms for training neural networks, we need to represent the problem as chromosomes. Real-valued weights must be encoded in the chromosome rather than binary values. This is done by altering the crossover and mutation operators. The crossover operator takes two parent chromosomes and creates a single child chromosome by randomly selecting corresponding genetic material from both parents. The mutation operator adds a small random real number between -1 and 1 to a randomly selected gene in the chromosome.

In evolutionary neural learning, the task of the genetic algorithm is to find the optimal set of weights in a network which minimizes the error function. The fitness function must reflect the performance of the neural network; thus, the fitness function is the reciprocal of the sum of squared errors of the neural network. To evaluate the fitness function, each weight encoded in the chromosome is assigned to the respective weight link of the network. The training set of examples is then presented to the network, which propagates the information forward, and the sum of squared errors is calculated. In this way, the genetic algorithm attempts to find a set of weights which minimizes the error function of the network. Recurrent neural networks have been trained by evolutionary computation methods such as genetic algorithms, which optimize the weights in the network architecture for a particular problem. Compared to gradient descent learning, genetic algorithms can help the network escape from local minima.
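The following is a minimal sketch of this evolutionary training loop, under our own simplifying assumptions: chromosomes are flat real-valued vectors, fitness is the reciprocal of the sum of squared errors, crossover picks each gene from one of the two parents, and mutation perturbs one gene by a value in [-1, 1]. The fitness-proportionate selection scheme and the population mechanics are illustrative, not the exact procedure used in this work.

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(chrom, forward_fn, X, Y):
    # Reciprocal of the sum of squared errors over the training set.
    sse = sum(np.sum((forward_fn(chrom, x) - y) ** 2) for x, y in zip(X, Y))
    return 1.0 / (sse + 1e-12)

def crossover(p1, p2):
    # Child takes each gene from one of the two parents at random.
    mask = rng.random(p1.size) < 0.5
    return np.where(mask, p1, p2)

def mutate(chrom):
    # Add a small random real number in [-1, 1] to one randomly chosen gene.
    child = chrom.copy()
    child[rng.integers(child.size)] += rng.uniform(-1.0, 1.0)
    return child

def evolve(forward_fn, X, Y, n_genes, pop_size=40, generations=100,
           p_cross=0.7, p_mut=0.1):
    pop = rng.uniform(-7, 7, (pop_size, n_genes))        # random initial weights
    for _ in range(generations):
        fit = np.array([fitness(c, forward_fn, X, Y) for c in pop])
        probs = fit / fit.sum()                          # fitness-proportionate selection
        new_pop = []
        while len(new_pop) < pop_size:
            i, j = rng.choice(pop_size, size=2, p=probs)
            child = crossover(pop[i], pop[j]) if rng.random() < p_cross else pop[i].copy()
            if rng.random() < p_mut:
                child = mutate(child)
            new_pop.append(child)
        pop = np.array(new_pop)
    fit = np.array([fitness(c, forward_fn, X, Y) for c in pop])
    return pop[np.argmax(fit)]                           # best chromosome found
```

The population size, crossover probability, mutation probability and weight range shown as defaults are the values reported later in Section 4.2.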

2.4 Speech Phoneme Classification

A speech sequence contains a huge amount of irrelevant information. In order to model speech, feature extraction is necessary. In feature extraction, useful information is extracted from the speech sequence and then used for modelling. Recurrent neural networks and hidden Markov models have been successfully applied to modelling speech sequences [1, 4]. They have been applied to recognize words and phonemes. The performance of a speech recognition system can be measured in terms of accuracy and speed. Recurrent neural networks are capable of modelling complicated recognition tasks. They have shown higher accuracy than hidden Markov models in recognition on low-quality, noisy data. However, hidden Markov models perform better when it comes to large vocabularies. Extensive research on speech recognition has been done for more than forty years; however, scientists have been unable to implement systems which show excellent performance in environments with background noise.

Mel frequency cepstral coefficients (MFCC) are a useful feature extraction technique, as the Mel filter has characteristics similar to the human auditory system [14]. The human ear performs similar processing before presenting information to the brain. We apply MFCC feature extraction to phonemes obtained from the TIMIT speech database. In MFCC feature extraction, a frame of the speech signal is passed through the discrete Fourier transform to change the signal from the time domain to the frequency domain. The resulting spectrum is then mapped onto the Mel scale using triangular overlapping windows. Finally, we compute the log energy at the output of each filter and take the discrete cosine transform of the Mel log energies, which becomes the vector of MFCC features.
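As an illustrative sketch only (the paper does not name a toolkit, and the library's default windowing may differ from the exact pipeline used here), the same DFT, Mel filterbank, log energy and DCT steps can be reproduced with the librosa library; the 512-sample frame, 256-sample step and 12 coefficients match the settings reported in Section 4.1, and the file path is hypothetical.

```python
import librosa

# Load one phoneme segment (hypothetical path); TIMIT audio is sampled at 16 kHz,
# so a 512-sample frame corresponds to 32 ms.
signal, sr = librosa.load("phoneme_b.wav", sr=16000)

# 12 MFCCs per frame: DFT -> Mel filterbank -> log energy -> DCT.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=12,
                            n_fft=512, hop_length=256)
print(mfcc.shape)   # (12, number_of_frames)
```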

3 Hybrid Recurrent Neural Networks Inspired by Hidden Markov Models

3.1 Motivation

We have stated earlier that the structural similarities between hidden Markov models and recurrent neural networks form the basis for combining the two paradigms into the hybrid architecture. Why is this a good idea? Most often, first-order hidden Markov models are used in practice, in which the state transition probabilities depend only on the previous state. This assumption is unrealistic for many real-world applications of hidden Markov models. Furthermore, the number of states in the hidden Markov model needs to be fixed beforehand for a particular application, and therefore the number of states varies between applications. The theory on recurrent neural networks and hidden Markov models suggests that the combination of the two paradigms may provide better generalization and training performance. Our proposed hybrid recurrent neural network architecture may also be capable of learning higher-order dependencies, and one does not need to fix the number of states as in the case of hidden Markov models.

3.2 Derivation

Consider the equation of the forward procedure for the calculation of the probability of the observation O given the model λ, i.e. P(O|λ), in hidden Markov models. Refer to (2).

\alpha_j(t) = \left[ \sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij} \right] b_j(O_t), \qquad 1 \le j \le N        (2)

where N is the number of hidden states in the HMM, a_ij is the probability of making a transition from state i to j, and b_j(O_t) is the Gaussian distribution for the observation at time t. The calculation in (2) is inherently recurrent and bears resemblance to the recursion of recurrent neural networks, given in (3).

x_j(t) = f\left( \sum_{i=1}^{N} x_i(t-1)\, w_{ij} \right), \qquad 1 \le j \le N        (3)

where f(.) is a non-linearity such as the sigmoid, N is the number of hidden neurons, and w_ij are the weights connecting the neurons with each other and with the input nodes. Equation (1) describes the dynamics of a first-order recurrent neural network, which is combined with the Gaussian distribution feature of (2) to form the hybrid architecture. We replace the subscript j in b_j(O_t), which in hidden Markov models denotes the state, by time t in order to incorporate the feature into recurrent neural networks. Hence the dynamics of the hybrid recurrent network architecture are given by:

S_i(t) = f\left( \left[ \sum_{k=1}^{K} V_{ik} S_k(t-1) + \sum_{j=1}^{J} W_{ij} I_j(t-1) \right] b(O_{t-1}) \right)        (4)

where b(O_{t-1}) is the Gaussian distribution. Note that the time subscript of the Gaussian distribution in (4) differs from the subscript of the Gaussian distribution in (2). The dynamics of hidden Markov models and recurrent networks differ in this respect; however, we can adjust the parameter for time t as shown in (4) in order to map hidden Markov models onto recurrent neural networks. For a single input, the univariate Gaussian distribution is used; refer to (5).

b(O_t) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(O_t - \mu)^2}{2\sigma^2} \right)        (5)

where O_t is the observation at time t, μ is the mean and σ² is the variance. For multiple inputs to the hybrid recurrent network, the multivariate Gaussian for d dimensions is used. Refer to (6).

b(O_t) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (O_t - \mu)^{T} \Sigma^{-1} (O_t - \mu) \right)        (6)

where O_t is a d-component column vector, μ is a d-component mean vector, Σ is a d-by-d covariance matrix, and |Σ| and Σ⁻¹ are its determinant and inverse, respectively.

Fig. 3 shows how the Gaussian distribution of the hidden Markov model is used to build the hybrid recurrent neural network. The output of the multivariate Gaussian function solely depends on the mean, which is a vector equal in size to the input vector. This parameter is represented in the chromosomes together with the weights and biases and is trained by genetic algorithms in order to model speech sequences.

Fig. 3. The architecture of the hybrid recurrent neural networks. The dashed lines indicate that the architecture can represent more neurons in each layer if required. The multivariate Gaussian function is shown by the shaded node.
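To make the mapping in (4)-(6) concrete, here is a minimal sketch of one state update of the hybrid network in Python with NumPy. Following (4), the Gaussian response multiplies the weighted sum inside the non-linearity; the identity covariance and the layer sizes are our own illustrative assumptions, since the experiments train only the mean vector and do not report the remaining settings here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multivariate_gaussian(o, mu, cov):
    # b(O_t) of (6): d-dimensional Gaussian density of the observation o.
    d = o.size
    diff = o - mu
    norm = 1.0 / (((2 * np.pi) ** (d / 2)) * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def hybrid_step(S_prev, I_prev, O_prev, V, W, mu, cov):
    """One update of the hybrid recurrent network, cf. (4).

    The weighted sum of context and input activations is multiplied by the
    Gaussian response b(O_{t-1}) before the sigmoid is applied.
    """
    b = multivariate_gaussian(O_prev, mu, cov)
    return sigmoid((V @ S_prev + W @ I_prev) * b)

# Illustrative sizes: 12 MFCC inputs, a small hidden layer, identity covariance.
K, J = 8, 12
rng = np.random.default_rng(2)
V = rng.uniform(-7, 7, (K, K))
W = rng.uniform(-7, 7, (K, J))
mu = rng.normal(size=J)
cov = np.eye(J)
S = np.zeros(K)
for frame in rng.normal(size=(5, J)):       # a short sequence of MFCC frames
    S = hybrid_step(S, frame, frame, V, W, mu, cov)
```

Note that the observation fed to the Gaussian is the same MFCC frame presented at the input layer, so the Gaussian node acts as a data-dependent gate on the hidden-layer activation.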


4 Empirical Results and Discussion

4.1 MFCC Feature Extraction

We used Mel frequency cepstral coefficients for feature extraction from three pairs of phonemes read from the TIMIT speech database. For each phoneme read, we used a frame size of 512 samples taken every 256 sample points. We applied the MFCC feature extraction technique as discussed previously and obtained 12 MFCC features for each frame of the phoneme. The feature extraction results are shown in Table 1, which gives details of each phoneme pair extracted. The frame size of 512 samples is equal to 32 ms.

Table 1. MFCC Feature Extraction

Phoneme Pair | No. of Training Samples | No. of Testing Samples | Frame Size
b and d      |  645                    |  238                   | 512
k and dx     | 4199                    | 1412                   | 512
q and jh     | 4196                    | 1420                   | 512

The three pairs of phonemes extracted from the TIMIT speech database. The frame size of 512 is equal to 32 ms.

4.2 Training Hybrid Recurrent Neural Networks for Phoneme Classification

In the hybrid recurrent neural network architecture, the neurons in the hidden layer compute the weighted sum of their inputs and multiply it by the output of the corresponding Gaussian function, which takes its inputs from the input layer. The product of the weighted sum and the output of the Gaussian function is then propagated from the hidden layer to the output layer, as shown in Fig. 3.

We obtained the training and testing data sets from the TIMIT database as shown in Table 1. We trained all the parameters of the hybrid architecture, i.e. the weights connecting the input to the hidden layer, the weights connecting the hidden to the output layer and the weights connecting the context to the hidden layer. We also trained the bias weights and the mean vector, a parameter of the multivariate Gaussian distribution function which takes its inputs from the input layer. We used the following hybrid recurrent neural network topology: 12 neurons in the input layer, representing the speech feature input, and 2 neurons in the output layer, representing a pair of phonemes. We experimented with different numbers of neurons in the hidden layer. We ran sample experiments and found that a population size of 40, a crossover probability of 0.7 and a mutation probability of 0.1 gave good genetic training performance; therefore, we used these values for training. We initialized the hybrid architecture with random weights in the range of -7 to 7 and used the genetic algorithm to train these weights. In effect, the genetic algorithm shuffles these weights and finds the optimal set of weights which best learns the training samples.
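As a minimal sketch under our own assumptions about the gene layout (the paper does not specify the ordering), the following shows how all trainable parameters of the hybrid network, namely the three weight matrices, the biases and the Gaussian mean vector, can be flattened into a single real-valued chromosome and recovered again for fitness evaluation.

```python
import numpy as np

# Topology used in the paper: 12 inputs, K hidden neurons, 2 outputs.
J, K, OUT = 12, 12, 2

def decode(chrom):
    """Split a flat chromosome back into the hybrid network's parameters.

    The ordering (input-to-hidden, context-to-hidden, hidden-to-output,
    hidden biases, Gaussian mean) is an illustrative convention.
    """
    sizes = [K * J, K * K, OUT * K, K, J]
    W_in, V, W_out, bias, mu = np.split(chrom, np.cumsum(sizes)[:-1])
    return (W_in.reshape(K, J), V.reshape(K, K),
            W_out.reshape(OUT, K), bias, mu)

n_genes = K * J + K * K + OUT * K + K + J
chromosome = np.random.default_rng(3).uniform(-7, 7, n_genes)   # random initialization
W_in, V, W_out, bias, mu = decode(chromosome)
```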

The maximum training time was 100 generations. We recorded the performance of the network being trained by the genetic algorithm at every generation until the maximum training time was reached. We used the best performance of the network as a stopping criterion for training. Using the information obtained for the stopping criterion, we started training. Once the network performed well up to the stopping criterion, we halted training and presented the network with the testing data set, which was not included during training. The trained hybrid recurrent neural network thus uses the knowledge gained in the training process to make a generalization. The results of all experiments are shown in Table 2. The training and generalization performance is given by the percentage of samples correctly classified in the training and testing sets, respectively.

Table 2. Hybrid Recurrent Neural Networks for Phoneme Classification

Phoneme Pair | Hidden Neurons | Training Time | Training Performance | Generalization Performance
b, d         | 12 | 2 | 87.75% | 82.6%
b, d         | 15 | 5 | 87.75% | 82.6%
k, dx        | 12 | 2 | 87.21% | 80.92%
k, dx        | 15 | 2 | 87.21% | 80.92%
q, jh        | 12 | 2 | 79%    | 75.81%
q, jh        | 15 | 3 | 79%    | 75.81%

The training time is given by the number of generations it takes the hybrid architecture to learn the training examples. The maximum training time is 100 generations. The training and generalization performance is given by the percentage of samples correctly classified by the network.

5 Conclusions

We have successfully combined the strengths of hidden Markov models and recurrent neural networks to construct a hybrid recurrent neural network architecture. The structural similarities between hidden Markov models and recurrent neural networks have been the basis for the successful mapping in the hybrid architecture. We have used genetic algorithms to train the hybrid system of hidden Markov models and recurrent neural networks. Our results show that speech sequences can be learned and modelled by hybrid recurrent neural networks. An open question remains as to the suitability of the hybrid architecture for modelling other real-world problems involving temporal sequences.


6 References

[1] A. J. Robinson, "An application of recurrent nets to phone probability estimation", IEEE Transactions on Neural Networks, Vol. 5, No. 2, 298-305, 1994.

[2] C. L. Giles, S. Lawrence and A. C. Tsoi, "Rule inference for financial prediction using recurrent neural networks", Proc. of the IEEE/IAFE Computational Intelligence for Financial Engineering, New York City, USA, 253-259, 1997.

[3] K. Murakami and H. Taguchi, "Gesture recognition using recurrent neural networks", Proc. of the SIGCHI Conference on Human Factors in Computing Systems: Reaching Through Technology, Louisiana, USA, 237-242, 1991.

[4] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition", Computer Speech and Language, Vol. 12, 75-98, 1998.

[5] T. Kobayashi and S. Haruyama, "Partly-hidden Markov model and its application to gesture recognition", Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 4, p. 3081, 1997.

[6] C. Lee Giles, C. W. Omlin and K. Thornber, "Equivalence in knowledge representation: automata, recurrent neural networks, and dynamical systems", Proc. of the IEEE, Vol. 87, No. 9, 1623-1640, 1999.

[7] R. Chandra and C. W. Omlin, "Evolutionary training of hybrid systems of recurrent neural networks and hidden Markov models", Transactions on Engineering, Computing and Technology: Proc. of the International Conference on Neural Networks, Vol. 15, Barcelona, Spain, 58-63, October 2006.

[8] P. Manolios and R. Fanelli, "First order recurrent neural networks and deterministic finite state automata", Neural Computation, Vol. 6, No. 6, 1154-1172, 1994.

[9] R. L. Watrous and G. M. Kuhn, "Induction of finite-state languages using second-order recurrent networks", Proc. of Advances in Neural Information Processing Systems, California, USA, 309-316, 1992.

[10] T. Lin, B. G. Horne, P. Tino and C. L. Giles, "Learning long-term dependencies in NARX recurrent neural networks", IEEE Transactions on Neural Networks, Vol. 7, No. 6, 1329-1338, 1996.

[11] S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation, Vol. 9, No. 8, 1735-1780, 1997.

[12] E. Alpaydin, Introduction to Machine Learning, The MIT Press, London, 306-311, 2004.

[13] T. M. Mitchell, Machine Learning, McGraw Hill, 1997.