
Automatic Speech Recognition using Dynamic Bayesian Networks

Rob van de Lisdonk

Faculty of Electrical Engineering, Mathematics and Computer Science

Delft University of Technology

June 2009


Graduation Committee:
Prof. drs. dr. L.J.M. Rothkrantz
Dr. ir. P. Wiggers
Dr. C. Botha


Abstract

New ideas to improve automatic speech recognition have been proposed that make use of contextual user information such as gender, age and dialect. To incorporate this information into a speech recognition system, a new framework is being developed at the mmi department of the ewi faculty at the Delft University of Technology. This toolkit is called Gaia and makes use of Dynamic Bayesian Networks (dbns). In this thesis a basic speech recognition system was built using Gaia to test whether speech recognition with Gaia and dbns is possible. dbn models were designed for the acoustic model, the language model and the training part of the speech recognizer. Experiments on a small data set showed that speech recognition with Gaia is possible. Other results showed that training with Gaia does not work yet; this issue, as well as the speed of the toolkit, needs to be addressed in future work.


Contents

1 Introduction
  1.1 Research Goal
  1.2 Contents of the report

2 Standard Speech Recognizer Techniques
  2.1 Introduction
  2.2 Hidden Markov Model
  2.3 Acoustic Preprocessing
  2.4 Acoustic Model
  2.5 Language Model
  2.6 Recognition
  2.7 Training
  2.8 More Techniques

3 DBN based automatic speech recognition
  3.1 Bayesian Networks
    3.1.1 Exact Inference
    3.1.2 Approximate Inference
    3.1.3 Learning
  3.2 Dynamic Bayesian Networks
    3.2.1 Exact Inference
    3.2.2 Approximate Inference
    3.2.3 Learning
  3.3 From HMM to DBN

4 Tools
  4.1 Gaia Toolkit
    4.1.1 ObservationFile
    4.1.2 PTable
    4.1.3 XMLFile
    4.1.4 DBN
  4.2 HTK
  4.3 SRILM
  4.4 copy sets and test sets

5 Models
  5.1 Acoustic Model
    5.1.1 Context Files
  5.2 Language Model
    5.2.1 Word Uni-gram
    5.2.2 Word Tri-gram
    5.2.3 Interpolated Word Tri-gram
    5.2.4 Phone Recognizer

6 Implementation
  6.1 PreProcessor
    6.1.1 FileManager
    6.1.2 IOConverter
  6.2 Trainer
  6.3 Recognizer
  6.4 GaiaXMLGeneration
  6.5 lexiconTool
  6.6 createLMtext

7 Experiments and Results
  7.1 Initial Experiments
  7.2 Smaller Data Set
  7.3 Recognition Experiments
  7.4 Pruning Experiments
  7.5 Training Experiments

8 Conclusion and Recommendations


Chapter 1

Introduction

Computers are an integral part of life for most of us. Almost everyone in developed countries has a computer available for personal use. The way we operate and communicate with computers is with a keyboard and mouse. This is not a very natural way of communicating for humans, and we have to learn to use these control mechanisms. Furthermore, heavy use of the keyboard and mouse has led to people suffering from Repetitive Strain Injury (rsi) complaints, another sign that it is not an optimal way of communicating. Speech, on the other hand, is a very natural way of communicating for humans and we are very proficient with it. If computers could understand what we are saying, would that not be an ideal way of operating computers? This is one of the reasons why research is done in Automatic Speech Recognition (asr). Humans and computers, however, are two very different things, and asr is complicated to do correctly.

asr research has been around for some time. The earliest research started around 1936 [17], but at that time the main problem was the lack of computing power. Computers became more powerful and in the 1980s systems were developed that could recognize single words. In the 1990s systems were developed that could recognize continuous speech with a vocabulary of a few thousand words, and today we have systems that do continuous speech recognition for 64k words with a recognition rate of about 95% on read speech. These results, however, are obtained in a controlled environment where the system is adapted to the speaker's voice; at that recognition rate the system is therefore very restricted. The goal of asr research is to eventually create a system that can perfectly recognize natural, fluent and spontaneous speech of anybody in non-laboratory environments in real time.

The current asr systems make use of Hidden Markov Models (hmm)


which are explained in the next chapter. Although these systems can perform well, as stated above, there are new ideas for improving speech recognition for which hmms do not have sufficient modeling power. Wiggers [19] proposes to use context information in speech recognition to increase recognition rates. One of the ideas is to use user knowledge, for example gender, age and dialect, and to switch to specific speech models once these variables are estimated. This seems a good approach because specific systems work much better than general systems. A hmm, however, cannot incorporate multiple models, and when confronted with different speakers hmm based systems use techniques like speaker adaptation or speaker normalization to either adjust the model parameters to the speaker or vice versa. Another context idea is to detect topics and switch to a corresponding language model in which certain words are more likely to occur than others. To test these context ideas a dbn toolkit is needed, but because current toolkits that work with dbns cannot process the large number of states and the amount of data used in speech recognition, a new toolkit is needed. This new software toolkit is being developed at the mmi department of the ewi faculty at the Delft University of Technology. The outline of this toolkit, called Gaia, is discussed in [19], and in this report some details are described.

1.1 Research Goal

One of the goals for which the Gaia toolkit is developed is to create a dbn based, context dependent speech recognizer that should be able to compete with and hopefully outperform a modern hmm based speech recognizer. Because such a hmm system will have had years of development and will have many techniques implemented that improve performance, building such a system with the Gaia toolkit is no small task. My literature study [18] was therefore chosen to be a survey of techniques that current hmm based speech recognizers use to improve performance, so that I got an idea of which techniques may be useful for the dbn system and, if possible within the project time, could implement one of them.

The first step toward the context dependent recognizer is creating the first working basic speech recognizer with the Gaia toolkit, and thus proving that the toolkit is capable of doing speech recognition. Because the Gaia toolkit is not fully developed, this also means that a lot of testing of the Gaia toolkit will be done, so that bugs are found and missing functionality is added. During the project there was also the idea to participate with the dbn speech recognizer in the N-Best competition. This competition


was held among several universities and companies in the Netherlands and Belgium to see what the current state of affairs is on Dutch speech recognizers [14]. However, the dbn system was not finished in time to participate.

To sum up, the goals of this thesis are:

• Research what makes hmms successful as an asr model.

• Design dbn models for the acoustic model, language model and training part of a speech recognizer.

• Design, implement and test a basic speech recognizer using the Gaia toolkit.

• Further develop and test the Gaia toolkit.

1.2 Contents of the report

In this thesis I describe how the dbn speech recognizer was created and how it performed. The basic speech recognition theory and techniques are covered in the second chapter. The hmm model used in such a system is also discussed there. This is useful to gain some understanding of an asr system and to compare it to the dbn model described in the next chapter. In chapter four the external tools that were used are discussed, mainly the Gaia toolkit. The dbn models that I designed for the speech recognizer are explained in chapter five. The next chapter discusses the c++ tools I created to build the speech recognizer. The experiments that were done with the speech recognizer and their results are described in chapter seven, which is followed by the conclusion and a note on future work.


Chapter 2

Standard Speech Recognizer Techniques

In this chapter I give a short introduction to the theory of speech recognition and describe the main parts of a speech recognizer and the 'standard' techniques that are often used. I also describe the hmm model here because most current asr systems use this model. Furthermore, in the next chapter the dbn theory will be discussed, and because it is not that different from hmm theory this chapter will be useful in understanding how the dbn speech recognizer works. In a literature study I researched which 'advanced' techniques hmm based systems use to improve performance; a summary of that report is given in the last section.

2.1 Introduction

Speech recognition can be summarized in one formula:

\hat{W} = \arg\max_{W \in L} P(W \mid O)    (2.1)

Here \hat{W} stands for the recognized word (or sentence), W for a word from the language L, and O is the speech signal or observation. Thus the recognized word is the word from our language that has the highest probability given the observation. The distribution in this form is hard to quantify because the random variables involved may have infinitely many values, so with the help of Bayes' rule we transform it into this formula:

\hat{W} = \arg\max_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)}    (2.2)


Figure 2.1: Simple first order Markov Model that models weather prediction (Rain or Sun)

Because P(O) is the same for all W we can simplify the equation:

\hat{W} = \arg\max_{W \in L} \overbrace{P(O \mid W)}^{\text{observation likelihood}} \; \overbrace{P(W)}^{\text{language model}}    (2.3)

Here we have two new terms that can be better quantified. The probability that the observation is an instantiation of the word can be calculated by the acoustic model (the 'observation likelihood'); the acoustic model is covered in section 2.4, and its use in recognition in section 2.6. The prior probability that the word occurs (for example after another word in a sentence) can be calculated with the language model, which is covered in section 2.5.

2.2 Hidden Markov Model

In our language we can generate an enormous number of possible word sequences, but we use only a small part that is correct according to our grammar. A speech recognizer uses this fact by limiting the possible word sequences it can recognize according to some (usually simple) grammar. It models this grammar using a Hidden Markov Model so that an algorithm can calculate the best possible sequence. It is derived from a Markov Chain, which is a stochastic process with the Markov property. A Markov property of order n means that the present state depends on a finite number n of past states and is independent of all other states. A simple example of a first order Markov Model is shown in figure 2.1. It models a simplified weather prediction. The weather can be rainy (R) or sunny (S) and the prediction for today depends only on the weather yesterday. The transition probabilities (or a_ij) for this model are shown in table 2.1. If it rained yesterday then it will be sunny today with a probability of 0.4.

The difference with a Hidden Markov Model is that a hmm adds hidden variables to this model. In the previous example of the weather prediction


Table 2.1: Probability table belonging to Figure 2.1

                         Today
                     Rain    Sunny
Yesterday   Rain      0.6     0.4
            Sunny     0.2     0.8

Figure 2.2: Simple first order Hidden Markov Model that models weather prediction by an observable newspaper (Wet or Dry)

the only variable was the weather, which is observable. In a hmm the state of the hidden variables can only be determined by looking at observable variables that are influenced by the hidden variables. Suppose, for example, that you are not able to leave your house and therefore cannot observe the weather. The only clue you have about the weather is the newspaper you find on your doormat every day, which is either wet or dry. The weather prediction model now becomes a hmm and is shown in figure 2.2. The R and S variables have become hidden and the new observable variables are Wp (wet paper) and Dp (dry paper); they are shown in grey to indicate that they are observable. They do not influence the weather system but are influenced by it, so the connections are also shown in grey to differentiate them. To predict what the weather will be we can only look at the newspaper. To complete the prediction model we need an additional probability matrix that specifies the relation between the weather and the state of the newspaper; it is given in table 2.2. These probabilities are called the observation probabilities or b_i(o_t).
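As a small illustration of these quantities, the sketch below (not part of the thesis software) stores the transition probabilities a_ij from Table 2.1 together with an observation matrix b_i(o) in the spirit of Table 2.2, and uses them to predict today's weather and the chance of finding a wet newspaper when it rained yesterday. The observation values in the sketch are illustrative assumptions rather than quoted from the table.

```cpp
// Minimal sketch of the weather HMM from Figures 2.1/2.2.
// Transition matrix a[i][j] follows Table 2.1; the observation matrix b[i][k]
// below is an illustrative assumption in the spirit of Table 2.2.
#include <cstdio>

int main() {
    // States: 0 = Rain, 1 = Sunny. Observations: 0 = Wet paper, 1 = Dry paper.
    const double a[2][2] = {{0.6, 0.4},   // P(today | yesterday = Rain)
                            {0.2, 0.8}};  // P(today | yesterday = Sunny)
    const double b[2][2] = {{0.9, 0.1},   // P(paper | weather = Rain)  (assumed)
                            {0.2, 0.8}};  // P(paper | weather = Sunny) (assumed)

    // Suppose it rained yesterday: the belief over yesterday's weather is (1, 0).
    double yesterday[2] = {1.0, 0.0};

    // Predict today's weather: P(today = j) = sum_i P(yesterday = i) * a[i][j].
    double today[2] = {0.0, 0.0};
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            today[j] += yesterday[i] * a[i][j];

    // Probability of observing a wet newspaper: sum_j P(today = j) * b[j][wet].
    double p_wet = today[0] * b[0][0] + today[1] * b[1][0];

    std::printf("P(rain today) = %.2f, P(wet paper) = %.2f\n", today[0], p_wet);
    return 0;
}
```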

In hmm speech recognition we would like to know the word that is spoken, but to the computer this is a hidden variable. It can only observe the acoustically preprocessed sound, but by using this information from the speech signal it can determine which word is spoken using the methods described below.


Table 2.2: Probability table belonging to Figure 2.2

                       Newspaper
                    Wet      Dry
Weather   Rain      0.9      0.2
          Sunny     0.1      0.8

2.3 Acoustic Preprocessing

Before we can use a speech signal for recognition we first need to extract certain information from it and put it into variables to work with. Which information should be extracted from the signal is a decision that affects performance: the performance of the system is bounded by the amount of relevant information extracted from the speech signal. There are, however, methods derived from human hearing characteristics that have proven to give good performance. A well known method, mfcc, is discussed below. Another method that is often used is perceptual linear prediction (plp), which also uses knowledge about human hearing. For more information on plp see [6].

Figure 2.3: A Log Power Spectrum (left) and the same spectrum Mel-Scaled (right)

The most valuable information in the speech signal for speech recognition is the way the spectral shape changes in time. To capture this information the signal is divided into small intervals, e.g. every 10 msec. This is done with a window function; a common one is the Hamming window described in [13]. By multiplying this function with the speech signal we get short speech segments, which are often chosen to be longer than the interval, for example 25 msec. Doing this for the whole signal at different time indices gives us time segments that overlap. This overlapping proves useful for representing the global shape transitions in time. From these samples we compute the discrete power spectrum. This is done by first using a Discrete Fourier Transform to compute the complex spectrum of each windowed sample and then taking the squared magnitude of each coefficient. Because human hearing does not have a linear frequency resolution, a transformation of the power spectrum is used to better match human abilities. Humans have a greater resolution in the lower frequencies of the spectrum and a lower resolution in the higher frequencies. The Mel scale captures this non-linearity and can be used to transform the power spectrum into Mel scale coefficients, see figure 2.3. To transform these coefficients back into the frequency domain an inverse discrete Fourier transform can be used, but this is usually replaced by a more efficient cosine transform which does the same. The coefficients are now called Mel Frequency Cepstral Coefficients (mfcc) and are used directly as input to the speech recognizer (the word cepstral comes from cepstrum, which is the inverse of a spectrum). Often only the first 12 coefficients are used, and to capture the dynamics of the speech input the first and second derivatives are also computed. These derivatives are computed as the difference between two coefficients lying a time index t in the past and in the future, where the second derivative is computed as the difference between the first derivatives. Finally the signal energy is computed as the sum of the speech samples in the time window, and it is also computed for the derivatives. This brings the total number of coefficients for the 10 msec speech input to 39 ((12 + 1) * 3), and these are used as a feature vector for the speech recognizer.
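The following sketch (illustrative only, not the thesis front end) shows the framing, Hamming windowing and delta steps described above for a 25 msec window taken every 10 msec; the DFT, Mel filterbank and cosine transform that produce the 12 cepstral coefficients are only indicated by a comment, and the function names are made up for the example.

```cpp
// Sketch of the front-end steps described above: overlapping 25 ms frames
// every 10 ms, a Hamming window per frame, and delta features computed as
// differences between neighbouring frames.
#include <cmath>
#include <vector>

std::vector<std::vector<double>> frameAndWindow(const std::vector<double>& signal,
                                                int sampleRate) {
    const double PI = 3.14159265358979323846;
    const int frameLen = sampleRate * 25 / 1000;    // 25 msec window
    const int frameShift = sampleRate * 10 / 1000;  // 10 msec step -> overlap
    std::vector<std::vector<double>> frames;
    for (std::size_t start = 0; start + frameLen <= signal.size(); start += frameShift) {
        std::vector<double> frame(frameLen);
        for (int n = 0; n < frameLen; ++n) {
            double hamming = 0.54 - 0.46 * std::cos(2.0 * PI * n / (frameLen - 1));
            frame[n] = signal[start + n] * hamming;
        }
        // Here the power spectrum, Mel filterbank and cosine transform would be
        // applied to obtain e.g. 12 cepstral coefficients plus the frame energy.
        frames.push_back(frame);
    }
    return frames;
}

// Delta features: difference between the coefficients of the frames lying a
// time index t in the past and in the future (here t = 1).
std::vector<std::vector<double>> deltas(const std::vector<std::vector<double>>& c) {
    std::vector<std::vector<double>> d(c.size());
    for (std::size_t t = 0; t < c.size(); ++t) {
        std::size_t prev = (t == 0) ? 0 : t - 1;
        std::size_t next = (t + 1 == c.size()) ? t : t + 1;
        d[t].resize(c[t].size());
        for (std::size_t k = 0; k < c[t].size(); ++k)
            d[t][k] = c[next][k] - c[prev][k];
    }
    return d;
}
```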

2.4 Acoustic Model

From the preprocessing phase we obtained real-valued feature vectors for every time slice of the speech signal. One recognition method is to compare each vector to a database of feature vectors and choose the one that fits best. A standard feature vector from the database would represent part of a word, or a phone (explained below). Comparing the feature vectors will then result in the best fitting word or phone from the language. But because there is a lot of variation in the pronunciation of words, a large number of vectors is needed to represent that for each word. Furthermore, because large lexicons are often used, the number of vectors that need to be stored


is too large and another method is preferred. This method is statistical: instead of standard feature vectors, a (multivariate) probability density function (pdf) is stored. We can then compute how likely it is that the current feature vector from the observation comes from the pdf. In earlier models a pdf was created for each phone. A phone is a sound that words are built from; for example the word 'parsley' consists of the phones '/p/ /aa/ /r/ /s/ /l/ /iy/'. An example of a phone set for Dutch speech can be seen in table 6.2. To capture more detail of the speech, often a sub-phone model is used. Splitting a phone sound into a beginning (on-glide), middle (pure) and end state (off-glide) enables that. For certain phones, like the plosive /p/ in put, this proves useful because the release and stop of this phone are different. The number of pdf models increases with this model by a factor of three.

Another way of improving the phone model comes from the idea that phones influence each other and that, for example, the phone /r/ is different in the word 'translate', where it is between a /t/ and an /@/, and in the word 'parsley', where it is between an /aa/ and an /s/. This is called the tri-phone model because for each phone there is a different representation for all the possible neighbour combinations it has. The word 'translate' consists of the tri-phones: /t+r/ /t-r+@/ /r-@+n/ /@-n+s/ /n-s+l/ /s-l+e/ /l-e+t/ /e-t/. The number of pdf models required to represent all the different tri-phones is the number of phones to the power of three. However, many of those tri-phone combinations will never occur or are very rare (for example /t-t+t/), so systems are often created with a much lower number of tri-phones by clustering similar sounding tri-phones together.
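The mapping from a phone string to tri-phone labels can be sketched as follows (illustrative code, not part of the thesis tools); it reproduces the 'translate' example above.

```cpp
// Expand a phone sequence into tri-phone labels of the form left-centre+right,
// e.g. "t r @ n s l e t" becomes /t+r/ /t-r+@/ ... /e-t/.
#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> toTriphones(const std::vector<std::string>& phones) {
    std::vector<std::string> tri;
    for (std::size_t i = 0; i < phones.size(); ++i) {
        std::string label = phones[i];
        if (i > 0) label = phones[i - 1] + "-" + label;                  // left neighbour
        if (i + 1 < phones.size()) label = label + "+" + phones[i + 1];  // right neighbour
        tri.push_back(label);
    }
    return tri;
}

int main() {
    std::vector<std::string> translate = {"t", "r", "@", "n", "s", "l", "e", "t"};
    for (const std::string& t : toTriphones(translate)) std::cout << t << " ";
    std::cout << "\n";  // prints: t+r t-r+@ r-@+n @-n+s n-s+l s-l+e l-e+t e-t
    return 0;
}
```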

In this project the sub-phone model was used because the tri-phone model requires too much computational power. Because a pdf is used to recognize only one sub-phone and we want to recognize whole words, we need a mechanism to tie the separate sub-phone recognitions together to create phone representations. This is done using Hidden Markov Models (hmm). A hmm is created for each phone, made out of its sub-phones. Such a hmm has state probabilities (the pdfs) for each sub-phone and also transition probabilities for moving from one sub-phone to the next, or to itself, see figure 2.4. The small nodes 1 and 5 in the figure are just there to make tying this hmm to other hmms easier.

For each word a hmm can be created by 'gluing' the hmms of its phones together. Such a hmm could look like figure 2.5.


Figure 2.4: An example three-state HMM

Figure 2.5: An example HMM for the word ‘he’

2.5 Language Model

The language model is used to capture properties of the spoken language and to predict the next word or utterance. It assigns probabilities to sequences of words by means of a probability distribution. In formula 2.3 it is represented by P(W). A commonly used language model is the N-gram. An N-gram is a model that gives probabilities to sequences of words. It assumes that the probability of a word w in a sentence depends only on its n predecessors. For example the bi-gram model (n = 1):

P(w_1 w_2 w_3 \ldots w_n) = P(w_1)\, P(w_2 \mid w_1) \ldots P(w_n \mid w_1 w_2 \ldots w_{n-1})    (2.4)

\approx P(w_1)\, P(w_2 \mid w_1) \ldots P(w_n \mid w_{n-1})    (2.5)

Figure 2.6: Figure 2.4 shown as a Hierarchical HMM

When sequences of one word are used it is called a uni-gram (independent of previous words); more words give a bi-gram, tri-gram and four-gram. The probabilities are simply calculated by counting the occurrences of the sequences in a large corpus. One problem with this method is that if a sequence is not present in the corpus it gets a zero probability, even though the sentence could occur. A solution to this problem is called smoothing, where a small probability is given to each sequence that is not in the corpus [4]. To keep the total probability equal to 1, the probabilities of sequences that do occur are lowered. How these probabilities are calculated differs per smoothing method.
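A bi-gram model of this kind can be sketched as a pair of count tables; the code below (illustrative, not the thesis implementation) uses simple add-one smoothing as a stand-in for the smoothing methods of [4].

```cpp
// Estimate bi-gram probabilities by counting word pairs in a corpus, with
// add-one smoothing so that unseen pairs do not get a zero probability.
#include <map>
#include <string>
#include <vector>

struct BigramModel {
    std::map<std::string, std::map<std::string, int>> pairCount;  // count(w1, w2)
    std::map<std::string, int> unigramCount;                      // count(w1 as history)
    int vocabSize = 0;

    void train(const std::vector<std::vector<std::string>>& sentences) {
        std::map<std::string, bool> vocab;
        for (const auto& s : sentences) {
            for (std::size_t i = 0; i + 1 < s.size(); ++i) {
                ++pairCount[s[i]][s[i + 1]];
                ++unigramCount[s[i]];
                vocab[s[i]] = vocab[s[i + 1]] = true;
            }
        }
        vocabSize = static_cast<int>(vocab.size());
    }

    // P(w2 | w1) with add-one smoothing: (count(w1,w2) + 1) / (count(w1) + |V|).
    double prob(const std::string& w1, const std::string& w2) const {
        int pc = 0, uc = 0;
        auto it = pairCount.find(w1);
        if (it != pairCount.end()) {
            auto jt = it->second.find(w2);
            if (jt != it->second.end()) pc = jt->second;
        }
        auto ut = unigramCount.find(w1);
        if (ut != unigramCount.end()) uc = ut->second;
        return (pc + 1.0) / (uc + vocabSize);
    }
};
```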

2.6 Recognition

In section 2.4 the acoustic model is discussed, which gives the hmm models that represent phones. In section 2.5 the language model is discussed, which models the sequence of words. To get a full asr system the models need to be combined. A hmm is created for each word, which is done by 'gluing' the hmms of its phones together as shown in figure 2.5. Instead of representing this as a regular hmm we can also put this model in a Hierarchical Hidden Markov Model (hhmm), see figure 2.6. The extra nodes 1 and 5 that help tie the hmms together are omitted. The hhmm fits with the idea that we already use a hierarchy in the system (words are made out of phones, which are made out of sub-phones) and will prove to be a nice link to the next chapter. The hhmm is traversed depth-first and when a hmm is finished it jumps back to the point where it was initiated; for example, when the last sub-phone of /h/ has finished, the model tracks back to that phone and continues to /i/.

To calculate the probability that a given speech signal is a sequence of words, represented by a path through the hmms, an algorithm is used.


for every time-slice t of the observation o do
    for every state s in the word under consideration do
        for every transition s' specified in the HMM of the word do
            forward[s', t+1] ← forward[s, t] * a[s, s'] * b[s', o_t]
        end for
    end for
end for
sum all probabilities at time-slice t

Figure 2.7: The Forward Algorithm in pseudo-code

If this is done for all possible sequences then the most likely sequence is the one with the highest probability. An algorithm that is used is a dynamic programming algorithm called the Forward algorithm, see figure 2.7. In this algorithm forward[s, t] stands for the previous path probability, a[s, s'] for the transition probability derived from the language model and b[s', o_t] for the observation likelihood derived from the pdf in that state. It calculates a probability for every possible path through the model and sums them at the end to give a probability for the whole word sequence. This approach, however, uses many unnecessary calculations for speech recognition because only one path through the hmm will match the speech signal. Furthermore, the forward algorithm has to run for each sequence hmm separately. A small variation on the forward algorithm is the Viterbi algorithm, which replaces the sum over all previous paths by the maximum over those paths. It calculates only the probability of the best path and can be run simultaneously on all word sequences in parallel. The algorithm can be visualized as finding the best path through a matrix where the vertical dimension represents the states of the hmm and the horizontal dimension represents the frames of speech (i.e. time), see figure 2.8. Even with the Viterbi algorithm, however, calculating the word probabilities simultaneously for all words can take long because all possible paths are still calculated. Many of those paths will have low probabilities. Pruning can be used to discard these low probability paths and keep the search space smaller without much loss in the quality of the solution.
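A minimal version of the Viterbi recursion over such a matrix could look like the sketch below (illustrative; the observation likelihoods b are given here as a precomputed table rather than evaluated from a pdf).

```cpp
// Viterbi recursion: v[t][s] is the probability of the best path ending in
// state s after t+1 frames; a[s][s'] are transition probabilities and b[s][t]
// is the observation likelihood of frame t in state s.
#include <vector>

double viterbi(const std::vector<std::vector<double>>& a,   // a[s][s']
               const std::vector<std::vector<double>>& b,   // b[s][t]
               const std::vector<double>& prior) {          // P(initial state)
    const std::size_t S = a.size();
    const std::size_t T = b.empty() ? 0 : b[0].size();
    if (T == 0) return 0.0;
    std::vector<std::vector<double>> v(T, std::vector<double>(S, 0.0));

    for (std::size_t s = 0; s < S; ++s)            // initialisation, frame 0
        v[0][s] = prior[s] * b[s][0];

    for (std::size_t t = 1; t < T; ++t)            // recursion over frames
        for (std::size_t s2 = 0; s2 < S; ++s2)
            for (std::size_t s1 = 0; s1 < S; ++s1) {
                double p = v[t - 1][s1] * a[s1][s2] * b[s2][t];
                if (p > v[t][s2]) v[t][s2] = p;    // max instead of the Forward sum
            }

    double best = 0.0;                             // best path probability
    for (std::size_t s = 0; s < S; ++s)
        if (v[T - 1][s] > best) best = v[T - 1][s];
    return best;
}
```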

Figure 2.8: Viterbi search representation

During the search for the best word or sentence many calculations are made, usually multiplications of small numbers. This often leads to numerical underflow because computers can only represent numbers to a certain precision. A simple solution is to work with logarithmic calculations, which replaces a multiplication by a sum. This helps to represent much smaller probabilities, because summation (or subtraction in this case, since the logarithm of a small number is negative) of small numbers leads to small results much more slowly than multiplication. Thus instead of a ∗ b we use log(a) + log(b).
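The effect is easy to demonstrate with an illustrative snippet:

```cpp
// Multiplying many small probabilities underflows to 0.0 in double precision,
// while summing their logarithms stays representable.
#include <cmath>
#include <cstdio>

int main() {
    double prob = 1.0, logProb = 0.0;
    for (int t = 0; t < 1000; ++t) {
        prob *= 1e-5;                // underflows to 0 after a few hundred steps
        logProb += std::log(1e-5);   // stays finite: 1000 * log(1e-5)
    }
    std::printf("product = %g, log-product = %g\n", prob, logProb);
    return 0;
}
```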

2.7 Training

Before we can actually recognize anything we first need to train the recognizer on data that is representative of the speech data it will encounter. In this training the parameters of the hmms that were created beforehand are estimated. The algorithm that is often used is the Forward-Backward algorithm (also known as Baum-Welch). It trains the transition probabilities a_ij and the observation probabilities b_i(o_t) iteratively, by starting with an estimate and using this estimate in the algorithm to calculate a better one. a_ij is estimated as the expected number of transitions from state i to state j, normalized by the expected number of transitions from state i. b_i(o_t) is estimated as the expected number of times the model is in state i while observing o_t, normalized by the expected number of times it is in state i. These estimates are calculated with the help of a forward probability and a backward probability (hence the name). A complete description including formulas can be found in [4].
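In the usual Baum-Welch notation, with γ_t(i) the probability of being in state i at time t and ξ_t(i, j) the probability of a transition from state i to state j at time t (both computed from the forward and backward probabilities), the re-estimation described above reads:

\hat{a}_{ij} = \frac{\sum_t \xi_t(i, j)}{\sum_t \gamma_t(i)}, \qquad \hat{b}_i(o_k) = \frac{\sum_{t:\, o_t = o_k} \gamma_t(i)}{\sum_t \gamma_t(i)}

These are the standard re-estimation formulas for discrete observations; they are given here only to make the verbal description concrete.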

2.8 More Techniques

In a literature study [18] I researched a number of asr systems to find out which other techniques, besides the 'standard' techniques discussed above,


they use to improve performance. A number of techniques are common among those systems. Feature extraction is usually done using mfcc features. Some form of speaker adaptation is often implemented to make the system more robust to different speakers; examples are Speaker Adaptation, where the model parameters are adjusted to the observation, and Speaker Normalization, where the observation is first normalized before being used by the model. The Viterbi algorithm is the most common decoding algorithm and is often combined with beam pruning or histogram pruning. Beam pruning takes the most probable path and prunes all other paths whose probabilities are not within a certain percentage of the best path (a sketch is given below). Histogram pruning sets a threshold on the maximum number of paths in the search space: it orders similar paths into bins and prunes the least probable paths from all bins such that the total number of paths stays below the threshold. The decoding is often done using a two-pass approach; the first decoding pass is not accurate but fast, and the second pass can be more accurate because the search space has been reduced in the first pass. For the language model an N-gram is used most often, and for the acoustic model the context dependent tri-phone model, which was discussed in section 2.4 as the tri-phone model.
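The beam pruning step can be sketched as follows (illustrative code; it is formulated here with log-domain scores, so that 'within a certain percentage of the best path' becomes a fixed margin below the best score).

```cpp
// Keep only the hypotheses whose score lies within a fixed beam of the best
// score after each time-slice; everything else is discarded.
#include <algorithm>
#include <vector>

struct Hypothesis {
    double logScore;  // log probability of the partial path
    // ... backpointers, current state and word history would live here
};

std::vector<Hypothesis> beamPrune(const std::vector<Hypothesis>& hyps,
                                  double beamWidth) {  // margin in log domain
    if (hyps.empty()) return hyps;
    double best = hyps.front().logScore;
    for (const Hypothesis& h : hyps) best = std::max(best, h.logScore);

    std::vector<Hypothesis> kept;
    for (const Hypothesis& h : hyps)
        if (h.logScore >= best - beamWidth)  // within the beam of the best path
            kept.push_back(h);
    return kept;
}
```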

Depending on the type of asr system you are building, some techniques are more useful than others. If you are building a simple asr system that only recognizes a few simple commands, the vocabulary size will not be very big. That means the language model does not have to be big and that enough training data can be found more easily. A state clustering method, which groups the pdfs of similar states together to reduce the model size, is not necessary in that case. When a large vocabulary is implemented for a system that has to recognize continuous speech, state clustering is probably very useful because of the lack of training data for all possible acoustics. If spontaneous speech has to be recognized it might be useful to consider the recognition of filler words (like 'uh') and to exclude them from the sentence; otherwise the grammar represented by the language model will not work properly when such a word is encountered. When training time is an issue, instead of the accurate Forward-Backward algorithm a Viterbi approximation of that algorithm can be used, which estimates a_ij from the most probable path instead of counting all possible paths and normalizing over them. If real-time decoding is not an issue, language models can be used that are slower but more precise, for example tri-grams instead of bi- or uni-grams. Pruning can be discarded or wider beams can be used, and if the Token Passing model [16] is used, which is essentially a different formulation of the Viterbi decoding algorithm using tokens, a larger number of tokens can be allowed. All these requirements are also related to the hardware: if faster hardware is used, more computational power is available in the same time span and more computationally expensive techniques can be used.


Chapter 3

DBN based automatic speech recognition

Research is being done on how speech recognition rates can be improved. [19] proposes to use different sorts of context information in addition to the speech signal, but because hmms are not suitable for incorporating this information he searched for a different model. [11] and [21] both proposed the use of dbns because the expressive power of hmms is limited. Because a hmm is a special case of a dbn there is no loss of expressive power when changing to these models; it actually increases. Furthermore, because dbns are used in more research disciplines, a lot of good algorithms are already available. A last advantage of using dbns shows when you want to build a multi-modal system that uses, for example, speech (audio) and lip reading (video) inputs. This can be combined in a dbn model fairly easily, because a dbn can handle different time scales for the inputs more easily than a hmm.

3.1 Bayesian Networks

To describe what a dbn is and what techniques are available to work with them, I start by describing Bayesian Networks, because they are a more general class of models. An introduction to dbns and their inference techniques can be found in [10] and [19].

A Bayesian Network (bn) is a graphical model that represents the relations between a set of random variables; it represents a joint probability distribution. It consists of a directed, acyclic graph, which shows the (in)dependencies between the variables, and a set of probability distributions that quantify those dependencies. The advantage of having a set of


Figure 3.1: Example Bayesian Network that models a system that predicts whether it is cloudy or not given the state of the grass. It consists of four binary variables: C, whether it is cloudy or not; S, whether the sprinkler is on or off; R, whether it rains; and W, whether the grass is wet or not.

probability distributions instead of one full joint probability distribution is that the set of distributions is often smaller to represent. The number of probabilities in a distribution is exponential in the number of variables: for n binary variables there are 2^n probabilities. Because there are often independence relations between the variables, the individual variable distributions can be made smaller. This can be demonstrated using an example bn from [10], which is shown in figure 3.1. This bn models a system that predicts whether it is cloudy or not given the state of the grass and consists of four binary variables: C, whether it is cloudy or not; S, whether the sprinkler is on or off; R, whether it rains; and W, whether the grass is wet or not. The joint probability distribution of this system would, according to the chain rule of probability, be:

P(C, S, R, W) = P(C)\, P(S \mid C)\, P(R \mid C, S)\, P(W \mid C, S, R)    (3.1)

But because of the conditional independence relations in the model this can be simplified to:

P(C, S, R, W) = P(C)\, P(S \mid C)\, P(R \mid C)\, P(W \mid S, R)    (3.2)

where the set of separate distributions is smaller to represent than the full joint probability distribution.
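As a concrete illustration of the saving: the full joint distribution over the four binary variables needs 2^4 − 1 = 15 free parameters, whereas the factorisation of equation 3.2 needs only 1 + 2 + 2 + 4 = 9 (one for P(C), two each for P(S|C) and P(R|C), and four for P(W|S,R)). The difference grows rapidly as more variables are added.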


Calculating the probability of one or more variables in a bn given some evidence is called inference. Just as the Viterbi or the Forward algorithm is used for hmms, there are inference algorithms for bns. A few of those methods are explained briefly here.

3.1.1 Exact Inference

The simplest and most straightforward inference method is summing the irrelevant variables out of the joint probability distribution. This is a basic technique from probability theory called marginalisation:

P(X_Q \mid X_E) = \frac{1}{P(X_E)} \sum_{X_H} P(X_H, X_Q, X_E)    (3.3)

where X_Q is the set of query variables, X_E is the set of evidence variables and X_H is the set of variables that are neither in the query set nor in the evidence set. Referring to the example from figure 3.1: W is the evidence variable, C is the query variable, and S and R are neither. This straightforward marginalisation, however, is computationally very hard to do directly for most networks. The techniques discussed below use clever ideas to make marginalisation possible on larger networks.

Variable Elimination Variable Elimination is a technique that makes marginalisation more efficient by pushing the sums as far as possible into the calculation when summing out irrelevant variables. This is illustrated using the example from figure 3.1. We obtain the joint probability distribution:

P(W = w) = \sum_{c} \sum_{s} \sum_{r} P(C = c, S = s, R = r, W = w)    (3.4)

which can be rewritten as:

= \sum_{c} \sum_{s} \sum_{r} P(C = c)\, P(S = s \mid C = c)\, P(R = r \mid C = c)\, P(W = w \mid S = s, R = r)    (3.5)

= \sum_{c} P(C = c) \sum_{s} P(S = s \mid C = c) \sum_{r} P(R = r \mid C = c)\, P(W = w \mid S = s, R = r)    (3.6)

The innermost sum is evaluated and a new term is created, which needs to be summed over again.

P(W = w) = \sum_{c} P(C = c) \sum_{s} P(S = s \mid C = c)\, T_1(c, w, s)    (3.7)

T_1(c, w, s) = \sum_{r} P(R = r \mid C = c)\, P(W = w \mid S = s, R = r)    (3.8)

Continuing this way gives:

P(W = w) = \sum_{c} P(C = c)\, T_2(c, w)    (3.9)

T_2(c, w) = \sum_{s} P(S = s \mid C = c)\, T_1(c, w, s)    (3.10)
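The elimination order used above (first R, then S, then C) can be written directly as nested loops. The sketch below is illustrative only: the CPT values are made-up numbers, not taken from the thesis, and the code computes P(W = wet) exactly as in equations 3.7 to 3.10.

```cpp
// Variable elimination for the network of Figure 3.1: eliminate R (T1),
// then S (T2), then C. All CPT values are illustrative.
#include <cstdio>

int main() {
    // Binary variables: index 0 = false, 1 = true.
    const double pC[2] = {0.5, 0.5};                    // P(C)
    const double pS[2][2] = {{0.5, 0.5}, {0.9, 0.1}};   // P(S | C)
    const double pR[2][2] = {{0.8, 0.2}, {0.2, 0.8}};   // P(R | C)
    const double pW[2][2][2] = {                        // P(W | S, R)
        {{1.0, 0.0}, {0.1, 0.9}},    // S = 0: rows R = 0, R = 1
        {{0.1, 0.9}, {0.01, 0.99}}   // S = 1: rows R = 0, R = 1
    };
    const int w = 1;  // query: P(W = wet)

    double pWquery = 0.0;
    for (int c = 0; c < 2; ++c) {
        double t2 = 0.0;                 // T2(c, w) = sum_s P(s|c) * T1(c, w, s)
        for (int s = 0; s < 2; ++s) {
            double t1 = 0.0;             // T1(c, w, s) = sum_r P(r|c) * P(w|s, r)
            for (int r = 0; r < 2; ++r)
                t1 += pR[c][r] * pW[s][r][w];
            t2 += pS[c][s] * t1;
        }
        pWquery += pC[c] * t2;           // sum_c P(c) * T2(c, w)
    }
    std::printf("P(W = wet) = %.4f\n", pWquery);
    return 0;
}
```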

Message Passing Instead of doing one variable marginalisation at a time, a technique called message passing calculates the posterior distributions of all variables in the network given some evidence simultaneously. It is a generalization of the forward-backward algorithm for hmms described briefly in section 2.7. The algorithm works only for tree-shaped graphs, because a cycle would lead to evidence being counted twice. It uses the fact that a variable in the model is independent of the rest of the model given its Markov blanket. A Markov blanket consists of the variable's parents, its children and the parents of its children. Variables receive new information from the neighbours in their Markov blanket, update their beliefs and propagate them back. When done for all variables this process reaches an equilibrium after a number of cycles and results in updated probability distributions for the entire model. Details on this algorithm can be found in [12].

Junction Tree Because the message passing algorithm only works for tree-shaped graphs, another algorithm has been developed that works for models that include cycles. The Junction Tree algorithm [7] creates a new tree-shaped graph that defines the same joint probability distribution as the original, but this new graph enables the use of a message passing algorithm. The new graph is obtained by a change of variables, and the new variables in the graph are cliques of the original variables. The new graph is called a junction tree, and if the connections between the variables are directed the message passing algorithm can be used. Usually, however, it is easier to obtain undirected connections and use an adjusted message passing algorithm.

MAP Because marginalisation is very hard to do for large networks, the Maximum A Posteriori (map) technique uses the max operator to reduce the computational requirements. Using the example again, compared to the joint pdf from equation 3.4, map calculates P(W = w) as:

P(W = w) = \max_{c} \sum_{s} \sum_{r} P(C = c, S = s, R = r, W = w)    (3.11)


The Viterbi algorithm is a special case of map where all summations are replaced by max operators:

P(W = w) = \max_{c} \max_{s} \max_{r} P(C = c, S = s, R = r, W = w)    (3.12)

3.1.2 Approximate Inference

The problem with exact inference is that for many models it is computationally intractable. Therefore methods have been created that approximate the correct inference results but are much faster. Here I describe some of these methods briefly.

Loopy Belief Propagation A straightforward idea is to use the message passing algorithm on graphs even though they have cycles. This is called loopy belief propagation. Because evidence will be counted twice, this method may not converge to a result or may converge to a wrong result. In practice, however, this method gives good results, because in some cases all evidence is counted double such that the effect cancels out [11].

Cutset Conditioning Another method, called cutset conditioning [15], is to instantiate variables in the graph that break up cycles. New graphs are created for each value of the instantiated variable, the message passing algorithm is run on each of them, and marginalisation is then used to combine the results. The downside of this method is that the number of networks grows exponentially with the number of cycles in the network and with the number of possible values of the variables that are instantiated.

Sampling Methods A number of methods exist that do stochastic sampling on the model. They generate a large number of configurations from the probability distribution and then compute the frequency of the relevant configurations. This enables the estimation of the values of the variables in the graph. Logic sampling is a simple approach which starts at the root nodes and their prior probabilities and then follows the arcs of the graph, generating values according to the conditional probabilities, to obtain a configuration. This method is not very good, because many of the generated configurations will not match the evidence and therefore the estimation will take long. Importance sampling improves this method by weighting the generated values by the conditional probabilities according to the evidence.
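A sketch of the logic sampling step on the network of figure 3.1 (illustrative; the CPT values are the same made-up numbers as in the variable elimination sketch above, and no evidence is used):

```cpp
// Logic (forward) sampling: draw values from the root downwards along the
// arcs; the fraction of samples matching a configuration estimates its
// probability. CPT values are illustrative.
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u(0.0, 1.0);

    const double pC = 0.5;                               // P(C = 1)
    const double pS[2] = {0.5, 0.1};                     // P(S = 1 | C)
    const double pR[2] = {0.2, 0.8};                     // P(R = 1 | C)
    const double pW[2][2] = {{0.0, 0.9}, {0.9, 0.99}};   // P(W = 1 | S, R)

    const int N = 100000;
    int wetCount = 0;
    for (int n = 0; n < N; ++n) {
        int c = u(rng) < pC;              // sample the root, then follow the arcs
        int s = u(rng) < pS[c];
        int r = u(rng) < pR[c];
        int w = u(rng) < pW[s][r];
        wetCount += w;
    }
    std::printf("estimated P(W = wet) = %.4f\n", wetCount / double(N));
    return 0;
}
```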


Table 3.1: Four cases of learning for Bayesian Networks

Structure   Observability   Method
Known       Full            Maximum Likelihood Estimation (ML)
Known       Partial         Expectation Maximization Algorithm (EM)
Unknown     Full            Search through model space
Unknown     Partial         EM + Search through model space

3.1.3 Learning

For Bayesian Networks both the structure and the parameters of the probability distributions can be learned, although learning the structure is much more difficult than learning the parameters. Furthermore, the graph can be completely observable or it can contain hidden nodes, which makes learning more difficult. These possibilities lead to four cases, and a possible learning method for each case is given in table 3.1. Because in this project the structure of the model is known and it contains hidden variables, I will discuss Maximum Likelihood Estimation as an introduction and then the Expectation Maximization algorithm.

Maximum Likelihood Estimation When the structure of the model is known and the variables are all observable, learning comes down to finding the parameters of each conditional probability distribution that maximize the likelihood of the training data. If the training set D consists of N independent items, the normalized log-likelihood is:

L = \frac{1}{N} \sum_{i=1}^{m} \sum_{l=1}^{s} \log P(X_i \mid Pa(X_i), D_l)    (3.13)

Assuming that the parameters of the variables are independent of each other, the contribution to the log-likelihood of each variable can be maximized independently. To train the W variable from the example of figure 3.1 we just need to count the number of training events where the grass is wet for each parent configuration and divide it by the number of events with that parent configuration:

P(W = w \mid S = s, R = r) \approx \frac{N(W = w, S = s, R = r)}{N(S = s, R = r)}    (3.14)

where

N(S = s, R = r) = N(W = 0, S = s, R = r) + N(W = 1, S = s, R = r)    (3.15)


For multinomial variables, as in this example, learning amounts to counting occurrences. For Gaussian variables the sample mean and variance need to be computed, and then linear regression is used to estimate the Gaussian mixtures.
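Counting-based ML estimation for the W variable can be sketched as follows (illustrative code; the data layout is assumed):

```cpp
// Maximum likelihood estimation by counting, as in equation 3.14:
// P(W = w | S = s, R = r) is the number of fully observed training cases with
// that configuration divided by the number of cases with parents (s, r).
#include <vector>

struct Sample { int s, r, w; };  // one fully observed training case

double mlEstimate(const std::vector<Sample>& data, int s, int r, int w) {
    int joint = 0, parents = 0;
    for (const Sample& d : data) {
        if (d.s == s && d.r == r) {
            ++parents;               // N(S = s, R = r)
            if (d.w == w) ++joint;   // N(W = w, S = s, R = r)
        }
    }
    return parents > 0 ? double(joint) / parents : 0.0;
}
```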

Expectation Maximization When the structure of the model is known but it contains variables that are not observable, the Expectation Maximization (em) algorithm is used. The idea of this algorithm is that if we somehow knew the values of the hidden variables, learning would be easy, as in the ml case. Therefore expected values for these variables are computed and treated as if they were observed. For the example, equation 3.14 becomes:

P(W = w \mid S = s, R = r) = \frac{E[N(W = w, S = s, R = r)]}{E[N(S = s, R = r)]}    (3.16)

E[N(x)] is the expected number of times that event x occurs in the training data, given the current estimated parameters. It can be computed as:

E[N(x)] = E\left[\sum_{k} I(x \mid D(k))\right] = \sum_{k} P(x \mid D(k))    (3.17)

where I(x | D(k)) is an indicator function that has value 1 if the event occurs in training sample k and 0 otherwise. With the expected counts the parameters are maximized and new expected counts are computed. This iteration leads to a local maximum of the likelihood.

3.2 Dynamic Bayesian Networks

Dynamic Bayesian Networks are an extension of bns that can represent stochastic processes over time. The term dynamic in dbn is a bit misleading, because dbns are usually not assumed to change their structure, although there are cases where this is possible. Because the dbn evolves over time it is represented by two models: the prior model and the transition model. A simple dbn extension of figure 3.1 is shown in figure 3.2. Only the C variable is connected in time, which represents the fact that whether it is cloudy at time t depends on whether it was cloudy at time t − 1. The prior probabilities are shown in table 3.2 and the transition probabilities are shown in table 3.3.


Figure 3.2: Example Dynamic Bayesian Network extended from Figure 3.1. Only the C variable is connected in time, which represents the fact that whether it is cloudy at time t depends on whether it was cloudy at time t − 1.

Table 3.2: Prior probability tables belonging to Figure 3.2

Cloudy
    Yes   No
    0.5   0.5

Rain
                   Yes   No
    Cloudy  Yes    0.6   0.4
            No     0     1

Sprinkler
                   Yes   No
    Cloudy  Yes    0.2   0.8
            No     0.8   0.2

Grass
                      Wet   Dry
    Rain       Yes    1     0
               No     0.3   0.7
    Sprinkler  Yes    1     0
               No     0.4   0.6


Table 3.3: Transitional probability table belonging to Figure 3.2

                            Cloudy today
                           Yes       No
Cloudy yesterday   Yes     0.6       0.4
                   No      0.5       0.5

3.2.1 Exact Inference

In theory all inference methods discussed in the bn section also work for dbns, but then the entire network needs to be unrolled for all time-slices. Even if that size is known beforehand, it will often not fit into the computer's memory, and thus online inference methods were developed that process the network slice by slice.

Frontier algorithm The Frontier algorithm uses a Markov blanket, like the message passing algorithm, in which all the hidden variables d-separate the past from the future. When variables are d-separated in a Bayesian network they are independent. The Markov blanket moves through the network in time, first forward and then backward, and is called the frontier. During its movement variables are added to and removed from it, resulting in the following operations. When moving forward, a variable can be added to the frontier when all its parents are in the frontier; this is done by multiplying its conditional probability distribution onto the frontier. A variable can be removed from the frontier when all its children are in the frontier; this is done by marginalizing it out. When the variable is observed the marginalisation is skipped, because its value is known. When moving backwards, a variable is added to the frontier when all its children are in the frontier; this is done by expanding the domain of the frontier and duplicating the entries in it, once for each value of the variable. A variable is removed from the frontier when all its parents are in the frontier; this is done by multiplying its conditional probability distribution onto the frontier and marginalizing it out. Again, if the variable is observed this marginalisation can be skipped.

Interface algorithm The Frontier algorithm uses all the hidden variables in a slice to d-separate the past from the future. [11] shows that the set of variables that have outgoing arcs to the next time-slice already d-separates the past from the future and that the Frontier algorithm is thus sub-optimal.


The Interface algorithm uses this set, ensures that it is a clique (where each variable is connected to all other variables) by adding arcs, and calls it the interface. The algorithm creates junction trees for each time-slice, including the interface variables from the preceding time-slice. The junction trees can be processed separately, and messages are sent via the interface variables.

Islands algorithm Even though online inference methods store less information than offline methods, it is sometimes still too much to fit in memory, because all the forward messages need to be saved. Instead of saving these messages it is also possible to recompute them at each time-slice. This saves space but increases the computational load enormously. The Islands algorithm [20] chooses a point between these extremes by storing the forward messages at a number of points. That results in a number of subproblems that are solved recursively.

3.2.2 Approximate Inference

Boyen-Koller The Boyen-Koller algorithm [1] approximates the joint probability distribution over the interface in the Interface algorithm by representing it as a set of smaller clusters (marginals) of variables. The requirement that all variables in the interface need to be in a clique is dropped. How accurate the algorithm is depends on the number of clusters used to represent the interface. Using one cluster is equal to exact inference; using more lowers the accuracy but speeds up the algorithm.

Viterbi As with the Forward algorithm for hmms, the inference algorithms can also be used with a Viterbi approximation. When marginalizing, the sum operators are replaced by max operators. The idea behind it is that the most likely path will contain most of the probability mass. This is also called Most Probable Explanation or mpe.

3.2.3 Learning

Learning in a dbn can be done using the em algorithm from the bn section. Instead of running it on the entire network, it is done for each time-slice separately using the Frontier or Interface algorithm. It uses a forward pass in which it stores intermediate results and uses those during the backward pass.


Figure 3.3: How a hmm relates to a dbn. A hmm is shown on the left and is unfolded in time. When this is folded back horizontally you obtain the dbn, where each q state contains all three states of the hmm.

3.3 From HMM to DBN

Because a dbn is a generalization of a hmm we can convert a hmm to a dbn. This can be seen in figure 3.3. A hmm is shown on the vertical axis and it is unfolded to the right, where its states and the possible state transitions are shown. If you fold the states back horizontally the dbn model is obtained. The three states are now enclosed in one variable q, which is shown unrolled for three time indices. The q variable can be in any of the three states, although at time index 1 it starts in state 1, which means that the first time index at which it can be in state 3 is time index 3. The o variable represents the observations that are the input or evidence to this model.

Figure 3.4: The hhmm from figure 2.6 converted to a dbn. It consists of the S variable, which represents the sub-phones or states of the phone, the P variable, which represents the phones, and the W variable, which represents the words. The Fs, Fp and Fw variables are switches that fire when the corresponding variable has reached its end.

In the previous chapter I showed a hhmm of the acoustic model used in speech recognition in figure 2.6. The idea behind that model is that speech recognition can be seen as hierarchical, where words consist of phones and phones consist of sub-phones. This model can also be converted to a dbn, which will look like figure 3.4. It consists of the S variable, which represents the sub-phones or states of the phone, the P variable, which represents the phones, and the W variable, which represents the words. The Fs, Fp and Fw variables are switches that fire when the corresponding variable has reached its end. This model will be discussed in more detail in chapter five, where it will also be shown that it can be simplified.


Chapter 4

Tools

This chapter will describe the tools which were used but not created in this project. It will mainly cover the new Gaia toolkit, but also cover two well-known tools briefly.

4.1 Gaia Toolkit

In this section I will describe the Gaia toolkit from a user perspective, because that is the way I used it. It will be a short description because the entire toolkit is too complex to describe in just one chapter. First a short introduction and global description will be given to give an idea why and how the toolkit is built. After the overview, some specific parts of Gaia that I used are discussed in more detail. Those parts are also interesting for readers who want to use Gaia to build a speech recognizer. It has to be noted however that the Gaia toolkit is still evolving, so this information may become outdated.

I wrote in the introduction chapter that [19] proposes to use context information to improve speech recognition. He also states in that report that there is no current toolkit that can easily accommodate this. The Gaia toolkit was therefore created as a framework for probabilistic temporal reasoning over models with large state spaces. It uses Dynamic Bayesian Networks and can, for example, be used for language modeling and speech recognition. It is written in c++ and consists of 8 libraries, as shown in figure 4.1.

Figure 4.1: The different libraries from the Gaia toolkit and their relations

The Base library contains, among other basic classes, the ObservationFile class and its iterator class, which are discussed below. The Text library has classes that help with textual processing, which can be used in creating language models. Utilities is a library that contains classes that create xml files, handle the file system, do logging and do xml parsing. The Math library contains the mathematical building blocks which are used in the dbn. Real, Domain, Probability and RandomVariable are some of the classes that can be found in this library. Classes that implement different types of distributions, like Gaussian and Multinomial, can be found in the Distributions library. The JTree library contains classes that implement the JunctionTree algorithm, which is used for marginalisation in the dbn. The probability tables used in the dbn are created with the classes from the PTable library. It contains different sorts of tables, including SparsePTable, LazyPTable, DensePTable and DeterministicPTable. The most interesting library for a user is DBN. It contains the classes to create dbn objects and perform training and inference.

The dbn models that can be created with the Gaia toolkit can consist of multiple parts. Figure 3.3 shows two sets of variables which are labeled. Each set belongs to a so-called slice in Gaia and can be repeated a finite or infinite number of times. In the figure the first slice will be repeated once and the second slice an infinite number of times (or as long as there are observations). The toolkit also allows grouping the slices into chapters, which can also be repeated a finite or infinite number of times. This construction allows for elaborate models such that, for example, different inputs with different time scales can be observed.

4.1.1 ObservationFile

ObservationFile is a class that generates the Gaia observation format. It is created with an InputSpecification, which in this project states that the observations consist of 39 real values and one nominal value. Once it is created it can be filled with ObservationVectors that should contain 39 Boost Variant objects and one value that is either zero or one. (Boost is a freely available collection of peer-reviewed C++ libraries [9].) This variant construct makes it possible to use different data types in one observation. This should be useful when context information is used in the speech recognition system, such that you can, for example, extend the 39 real values with a boolean that indicates the gender of the speaker. The observation file is stored in a binary format for fast loading.
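A minimal sketch of this data layout is given below, using boost::variant directly; the Gaia InputSpecification and ObservationVector classes themselves are not shown because their exact signatures are not reproduced here.

    #include <boost/variant.hpp>
    #include <vector>

    // One Gaia-style observation entry: either a real-valued feature or a nominal value.
    using Feature = boost::variant<double, int>;

    // Build one time-slice: 39 real values plus one nominal end-of-input flag.
    // This only sketches the data layout; the actual Gaia classes are not shown here.
    std::vector<Feature> makeObservation(const std::vector<double>& mfcc, bool lastSlice) {
        std::vector<Feature> obs;
        obs.reserve(mfcc.size() + 1);
        for (double v : mfcc) obs.push_back(v);  // the 39 real-valued features
        obs.push_back(lastSlice ? 1 : 0);        // nominal value observed by the EOI variable
        return obs;
    }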

4.1.2 PTable

The PTable class is used to create the multinomial tables for nodes of the model. First a Domain for the table has to be created. The domain specifies which random variables are contained in the probability distribution that is represented by the table. This domain then has to be filled with RandomVariables, which have two indices to position them in the model and a cardinality (the number of states the random variable can be in). When used in a dbn the first index is a relative position that indicates where the variables are situated in the dbn model compared to each other. The highest index means the variable is a child node and is not a parent to any of the variables from the domain. The lowest index means that the variable is a parent node and is not a child to any of the variables from the dbn. The indices in between order the variables in this 'child - parent tree'. The second index is the time-slice the variable comes from: 0 is the current time-slice, -1 is one time-slice in the past. To clarify this, consider the example model in figure 4.2. This is part of a model I used which will be explained later on. Suppose we would like to create a table for the variable Nw. This variable has four parents: W in the current time-slice and Fw, Nw and Fs, all one time-slice in the past. These parents are created as RandomVariables with the following indices: W with indices (0,0), Fw with (0,-1), Nw with (1,-1) and Fs with (2,-1). Fs is the last in the tree because Nw is indirectly (via P) in between Fw and Fs.

Figure 4.2: Example model

Once the domain has been filled the table can be constructed with the domain. After that the table can be filled with ValueVectors, each of which contains a probability and values ordered according to the domain specification. Furthermore, these vectors should be added to the table in such a way that the values are added from low to high. When this is done correctly the vectors can be put into the ptable with the push_back() function. Otherwise the slower Add() function can be used, but with large tables this takes considerably longer because the vectors need to be sorted. To clarify this, consider the example given above from the model in figure 4.2. The table created would have entries in this order: [Nw, W, Fw, Nw, Fs]. To fill the table using the fast push_back() function all variables have to be filled from low values to high. This means that the ValueVector [1,0,0,0,0] would follow [0,99,10,34,0], which would follow [0,99,10,33,0]. This can be done using nested for loops in a program, as shown in the sketch below.
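The following self-contained sketch shows such nested loops for the example domain [Nw, W, Fw, Nw(-1), Fs]; the cardinalities and the probability value are placeholders, and the real tool would pass each combination to PTable::push_back() instead of printing it.

    #include <cstdio>

    int main() {
        // Illustrative cardinalities only, not the real model sizes.
        const int N_NW = 3, N_W = 4, N_FW = 2, N_NW_PREV = 3, N_FS = 2;
        // Enumerate the value vectors [Nw, W, Fw, Nw(-1), Fs] from low to high: the first
        // variable changes slowest, which is the order push_back() expects.
        for (int nw = 0; nw < N_NW; ++nw)
            for (int w = 0; w < N_W; ++w)
                for (int fw = 0; fw < N_FW; ++fw)
                    for (int nwPrev = 0; nwPrev < N_NW_PREV; ++nwPrev)
                        for (int fs = 0; fs < N_FS; ++fs) {
                            double p = 1.0;  // placeholder; compute the real table entry here
                            std::printf("p=%.2f  [%d,%d,%d,%d,%d]\n", p, nw, w, fw, nwPrev, fs);
                            // in the real tool: table.push_back(ValueVector with p and these values)
                        }
        return 0;
    }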

4.1.3 XMLFile

Models and distribution tables are stored in xml format in the Gaia toolkit. When the PTable is filled it can be written to a XMLFile object with a single streaming operator to create an xml file, which can be included in the total dbn model. The XMLFile class is designed as an aid in creating xml files and it supports the writing of tags, attributes and data with a streaming operator. Writing end tags does not need an argument; it closes the corresponding (last) start tag automatically.

4.1.4 DBN

The DBN class is used to create dbn objects. It is constructed from a template with a specific engine. The choices are a BoyenKoller or a Frontier engine, which use those algorithms for inference.

It provides functions to load a dbn object from a file in xml format. Such a file will usually contain multinomial tables generated with the PTable class, but the total model structure has to be created by hand. If a dbn model gets updated by training, the updated model can be written to a file in xml format.

The DBN class has a SetIslands function which gives some control over the memory usage of the Gaia toolkit during inference. The two arguments specify in how many parts the current multinomial table has to be split and after how many milliseconds of observation duration. The idea is that when a multinomial table becomes too large to fit in memory during operation, it is better to split it up into pieces that fit into memory than to do paging between the memory and the hard drive. Another function that can speed up execution is SetPruningBeam(), which sets the parameter, as a percentage, for the pruning beam. Beam pruning uses a percentage beam around the best path to decide which paths to prune.

To prepare the Gaia toolkit for training the StartLearning() function needs to be called, and afterward the EndLearning() function. The first function is called with a phase as argument; these methods thus specify which phase of the model is to be learned. In the dbn model, variables can be set to update only in a specific phase or phases. This can be used to train specific variables only, or to keep a variable static during a training session.

When training the model, some context information is needed for each observation that specifies which phones are spoken. This information needs to be loaded and the function StartSequence() does that. It has to be called separately for each observation, with the current .seq file as argument. It will find all other files in the same directory with the same name (but different extensions) as context information and loads those xml tables. The function EndSequence() has to be called when the learning of the observation is done, to unload the context files.

The function Learn() will do the actual training from an observation. It runs an expectation maximisation algorithm. As arguments it takes two ObservationFileIterator objects, one for the begin position and one for the end. The iterators are created with the observation file as argument and optionally an offset. This way it is possible to use only a part of an observation to train on. A sketch of the resulting call sequence is given below.
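The class and member names in this sketch are taken from the description above, but the constructors, template arguments and exact signatures are assumptions, so it should be read as annotated pseudocode rather than compilable code.

    #include <string>

    // Hypothetical sketch of one training pass with the Gaia DBN class.
    void trainOnObservation(DBN<BoyenKoller>& dbn, const std::string& seqFile, int phase) {
        dbn.StartLearning(phase);            // only variables tagged with this phase are updated
        dbn.StartSequence(seqFile);          // loads the context tables stored next to the .seq file
        ObservationFile obs(seqFile);
        ObservationFileIterator begin(obs);  // optionally constructed with an offset
        ObservationFileIterator end(obs);
        dbn.Learn(begin, end);               // one em pass over this observation
        dbn.EndSequence();                   // unloads the context files
        dbn.EndLearning();
    }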

Because training is a computationally heavy operation, the Gaia toolkit is able to store the training results of a small dataset and combine the results later on. This way it is possible to split a training task up in smaller parts and run them simultaneously on different processors. The WriteAccumulators() and ReadAccumulators() functions do this writing and reading of partial training results. Both should be called within the StartLearning() and EndLearning() calls. The write version takes as arguments two indices and the directory where to store the accumulation files. The first index should be the same for all partial results that need to be accumulated, the second index should be unique for every partial result. The read version has similar arguments.

To do inference on the dbn model the MAP() function can be used if the dbn object has the Frontier engine. If the BoyenKoller engine is used, an MPE() function is available, but at this point this does not work yet. The arguments of the MAP() function are two ObservationFileIterators as in the Learn() function, a result construct and a domain. The domain contains the random variables which need to be observed (for example the word variable) and the result construct consists of a list of observations (one for every time-slice) and the probability for this path. Similar to learning, StartMAP() and EndMAP() need to be called before and after the call to MAP().
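Assuming the same caveats as for the training sketch, decoding one utterance would then look roughly as follows; the name of the result type is an assumption.

    // Hypothetical sketch of decoding one utterance; names from the text above,
    // result type and exact signatures assumed.
    dbn.StartMAP();
    Domain query;   // the random variable(s) to read out, e.g. the word variable W
    Result path;    // per-time-slice values plus the probability of this path (type assumed)
    dbn.MAP(begin, end, path, query);
    dbn.EndMAP();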

4.2 HTK

The Hidden Markov Model ToolKit [5] is a toolkit for building and manipulating hmms and is used primarily, although not exclusively, for speech recognition. It is developed at the Cambridge University Engineering Department. htk consists of a set of library modules written in c. The tools provide sophisticated facilities for speech analysis, hmm training, testing and results analysis. I used a tool called HCopy to do the acoustical analysis of the speech signals. It creates mfccs from the speech signal, which I used to create the Gaia observation files. The reason for not doing the acoustical analysis myself is that it would be far too much work and it is beyond the scope of this thesis.


4.3 SRILM

The sri Language Modeling Toolkit [8] is a toolkit for building and applying statistical language models, primarily for use in speech recognition, statistical tagging and segmentation, and machine translation. It has been under development in the sri Speech Technology and Research Laboratory since 1995. The toolkit consists of a set of c++ libraries, a set of executable programs which perform standard tasks and some scripts that perform minor tasks. I used the ngram and ngram-count programs to create the lm and two scripts to get specific information out of the lm. Some language models that I created with srilm were an interpolated Kneser-Ney discounted tri-gram model on the word level (which was eventually not used because of its large size) and an interpolated 5-gram on the phone level. These lm files were then used to create distributions for the Gaia model.

4.4 copy sets and test sets

These are two small programs I used that help with splitting up large data sets. When combined in a script they can divide a data set in multiple parts, where each part is an integer percentage of the total. Once that is done for one specific file type, they can search for other corresponding files with a different extension and put those together.


Chapter 5

Models

In this chapter I will describe the models developed in this project. Those models can be separated into a language model and an acoustic model, which are described separately below. The complete basic model looks like figure 5.1. The acoustic model part is shown in black, the language model part is shown in grey. In all of the models I created, the acoustic model part is the same and only the language models differ. Figure 5.1 shows the model for two Gaia 'slices' to indicate the relations between the variables in time. The numbers next to the variable names indicate to which Gaia slice the variable belongs. The model starts in the first slice and it moves to the second slice for the second time-slice of the observation. The second slice is repeated for every time-slice of the observation after that. The reason that the first slice of the model is different is that there is no history at the beginning, which is used by variables in the second slice of the model. The meaning of the variables is explained in the next sections.

5.1 Acoustic Model

For the acoustic model we represent each time-slice of speech by an observation variable O. As the processed audio data contains 39-dimensional feature vectors, this O variable has for each of its states a corresponding 39-dimensional Gaussian probability distribution. With these features as input we can calculate, using these pdf's, how likely it is that the time-slice observation corresponds to each of the states of the O variable. Once this is known, each of these likely states corresponding to a small time-slice should be fit into a larger model on the phone level and/or the word level. How the observation variable is linked to phone and word variables is described below.

Figure 5.1: The basic dbn model with the Acoustic model part indicated in black and the Language model part in grey

Figure 5.2: The dbn Acoustic model used for training the S, Fs, M and O variables

The acoustic model as used in the training of the speech recognizer is shown in figure 5.2. This is not exactly the same as the core acoustic part shown in figure 5.1, because we need the extra variables to specify which sounds are processed during training. Because we are only interested in training the acoustics, we pretend that each training sample consists of one 'word' so that the dbn model can remain simple.

The Nw variable represents the position in the sentence. As each phone occupies a position, the first phone is on position 0, the second phone on 1, etc. The P variable represents the possible phones; each of its possible states corresponds to one of the possible phones I used for the annotated train data. The value of P depends on the Nw variable only, because it is known from the annotation data. The table of P consists of a phone for each position of the sentence. This works the same in training and recognition, though in training the word(s) are known and the probability table for P is thus very small. In recognition the words are not known and the probability table will consist of the positions of phones in every word in the lexicon.

Because a sub-phone model is used to better represent speech, the S variable is introduced in the model. The sub-phone model states that a phone is made up of three sub-phones (on-glide, pure, off-glide). These sub-phones are represented by the S variable, which has 3 states per possible phone. Its probability table consists of the transitional probabilities between these sub-phone states. The S variable therefore depends on its previous value, on P and on the variable Fs. A phone is finished if its last sub-phone has finished. When that happens the Fs binary variable is triggered, which signals to S, Nw and EOI. Either the next position in the sentence is considered, or if EOI is also triggered, the end of the input is reached. The EOI variable is also a binary variable; it observes a special flag in the observation file that signals if the current observation has finished. By using this variable it is ensured that the best matching path through the model always reaches the end of the model and does not, for example, stay in one state for all time-slices.

The O variable is linked to the P, S and M nodes. For each combination of states from these variables the O variable has a state which consists of a 39-dimensional probability density function. In the project there were 144 (3 × 48 × 1) (S × P × M) combinations, because the M variable was fixed to one state. This M variable was introduced to incorporate tied mixture modeling in the system. A tied mixture system uses a single set of Gaussian pdf's shared by all observation states (in this project those 144 combinations). Each observation state then uses a set of weights (mixtures) to create its own specific pdf from this set. The M variable would contain these weights.
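The following self-contained sketch shows how such a tied-mixture observation likelihood is evaluated: all states share one pool of diagonal Gaussians and differ only in their weight vectors. The dimensions and values are placeholders, not the 39-dimensional, 144-state system used here.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Log-density of a diagonal-covariance Gaussian (toy dimensions, not 39).
    double logGauss(const std::vector<double>& x,
                    const std::vector<double>& mean,
                    const std::vector<double>& var) {
        const double PI = 3.14159265358979323846;
        double lp = 0.0;
        for (std::size_t d = 0; d < x.size(); ++d) {
            double diff = x[d] - mean[d];
            lp += -0.5 * (std::log(2.0 * PI * var[d]) + diff * diff / var[d]);
        }
        return lp;
    }

    // Tied-mixture likelihood: every observation state shares one pool of Gaussians and
    // only contributes its own weight vector (what the M variable would hold).
    double tiedMixtureLikelihood(const std::vector<double>& x,
                                 const std::vector<std::vector<double>>& poolMeans,
                                 const std::vector<std::vector<double>>& poolVars,
                                 const std::vector<double>& stateWeights) {
        double p = 0.0;
        for (std::size_t m = 0; m < poolMeans.size(); ++m)
            p += stateWeights[m] * std::exp(logGauss(x, poolMeans[m], poolVars[m]));
        return p;
    }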

5.1.1 Context Files

During training it is known which word(s), and thus which phones, are uttered in the observation. To learn those phones in a hmm model, the separate phone hmms are just pasted together like in figure 2.5 and the learning algorithm is run on the resulting hmm. In the dbn model however, all possible phones and states are represented by the P and S variable respectively, in a single model. We thus need to specify in the model which phones are being uttered in the observation. The solution for this in the Gaia toolkit is the use of context files. For each observation file a set of context files is also loaded, which contain the probability distributions for the following variables: Nw, P and EOI.


5.2 Language Model

A language model in speech recognition is used to better predict the sentences that are uttered. It specifies in what order the words in a sentence are likely to appear by assigning probabilities to each word ordering. Less likely word orderings can then be pruned to reduce the search space and computing effort. At the end of the project it became clear that a large vocabulary combined with a complex language model was computationally too heavy for Gaia. I therefore used different language models to test the speech recognizer with, and these are described here. Although I started the project with a complex language model that gradually became simpler, I describe them here in reverse order because it makes the complex models easier to understand. The data for all the language models came from the cgn data. I used the transcriptions of the entire cgn data to create a large text on which srilm computed the language model statistics I needed to fill the Gaia probability tables with.

5.2.1 Word Uni-gram

The uni-gram model on the word level uses no information from the past to calculate probabilities for the word variable. The probabilities are calculated by just counting the word occurrences in a large corpus; depending on how often a word occurs it will get a corresponding probability, high for a frequent word, low for a word that does not occur often. The model is shown in figure 5.3.
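A minimal sketch of this counting step is shown below on a toy corpus; in the project itself the counts were computed by srilm on the cgn transcriptions.

    #include <cstdio>
    #include <map>
    #include <sstream>
    #include <string>

    int main() {
        // Toy corpus; the real counts come from the cgn transcriptions.
        std::string corpus = "de kat zit op de mat de hond ligt op de mat";
        std::map<std::string, int> count;
        int total = 0;
        std::istringstream in(corpus);
        for (std::string w; in >> w; ) { ++count[w]; ++total; }
        for (const auto& kv : count)  // uni-gram probability = relative frequency
            std::printf("P(%s) = %.3f\n", kv.first.c_str(), double(kv.second) / total);
        return 0;
    }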

It looks the same as figure 5.1 except for the two grey nodes. The W variable has all the words from the lexicon as its possible states. The Nw variable is the same as discussed in the acoustic model section, but its probability table here contains all the positions of the entire lexicon because this model is used in recognition. For every state of the W variable (for every word) it has a list of all possible positions inside that word and to which position it should change given the W and Fw variable. The P variable depends on both W and Nw and is explained in the previous section. The reason that it depends on W is that the entire lexicon annotation is contained in the states of P, and it thus needs a value of W (word) and a value from Nw (position) to return the correct phone. The binary variable Fw is triggered when the last phone of a word has finished, that is, when Fs triggers while Nw is in the last position of the word. The EOI variable is only dependent on Fw, which means that the end of input can only be reached after one or more whole words have finished.


Figure 5.3: Word Uni-gram in the total model.


The variable Nw in the first slice just starts at zero for every state of W. In the second slice of the model it depends also on W and on the previous values of Fw, Nw and Fs. If Fs is triggered then Nw goes to the next position, otherwise it will keep the same value as the previous Nw. If Fw is also triggered, Nw is reset to zero. I used the same probability table for the W variable in the first and second slice of the model. Because the W variable depends on the previous W and Fw, it needs those two grey 'dummy' variables from figure 5.3 in the first slice for compatibility. When Fw triggers, the uni-gram probability table is used in the calculations, otherwise the W variable will copy its previous value. In the first slice of the model the dummy Fw is triggered and the dummy W holds no real information because it is not used.

In the experiments I did, the set contained only single-word utterances. Therefore I also created a model which had no Fw variable, because it was not needed and this would make the model a little less complex. With the Fw variable removed the system will not consider utterances of multiple words and thus only output single words. The model looks exactly like figure 5.3 with the Fw variables replaced by the EOI variables and the dummy Fw variable removed.

5.2.2 Word Tri-gram

A commonly used language model in speech recognition is the tri-gram on the word level. The probability of a word depends on the previous two words. Experiments have shown that this model captures enough information to model grammatical sentences reasonably well. I created a tri-gram model with the srilm toolkit and smoothed it with modified Kneser-Ney. Because this model is quite complex to construct in Gaia, it is discussed here in two stages for clarity. The first stage as constructed in Gaia is shown in figure 5.4.

The tri-gram model in figure 5.4 works like the uni-gram model, but because the W variable now depends on its previous values some extra variables are needed to do the calculations right.

Figure 5.4: Part of word Tri-gram model

Because I use a tri-gram model, W is dependent on its values two time-slices in the past. Therefore three slices of the model are shown, of which the first two are used once and the last one is repeated indefinitely. Furthermore, because I used the same W probability tables for the W variable in all time-slices, the W variable in the first two time-slices needs previous values that do not exist. Therefore two dummy variables are added in grey, which hold no real information. In order not to use this 'wrong' information the N and E variables are available. The N variable simply counts how many indices we can look back in time (0, 1 or 2) by updating its value, only if Fw is triggered, to a maximum of 2. The E variable signals the end of the sentence and resets N to 0 if that happens. It is different from the end of input variable EOI because the input can contain multiple sentences. Depending on the value of N, the W variable uses a uni-, bi- or tri-gram model and uses 0, 1 or 2 previous W variables. The W dummy variables are thus never used but are necessary in the xml model for consistency. The third dummy variable is the Fw variable, which signals the W variable in the first slice that a new word begins. If the Fw variable is 0 the W variable will stay in the same state as in the previous time-slice.

The distribution of the E variable also comes from the language model created by srilm. From all uni-, bi- and tri-grams I filtered out the ones which ended in the 'end of sentence' symbol </s>. I thus used the bi-grams with the </s> symbol to create uni-gram probabilities, and the tri-grams with the </s> symbol to create bi-gram probabilities, for the E variable.

The problem with the model thus far is that the W variable is updated on the same time scale as the observations. For every 10 msec there will be a new chapter of the model, where the W variable will usually have the same value as in the previous chapter. The problem occurs when Fw triggers and the tri-gram (or bi-gram) probabilities need to be considered in the calculations. The values of W which are used according to this model are the W values of the previous two 10 ms time-slices (which are usually all the same) instead of the actual previous two words. We thus need a way to store the actual words, and this is done using the model in figure 5.5.


Figure 5.5: Word Tri-gram language model

The difference with figure 5.4 is the addition of the −1W and −2W variables and the corresponding dependencies, which are shown in grey for more clarity. These variables store the actual previous words by copying the value from W to −1W and from −1W to −2W when Fw triggers. The −1W and −2W variables start out in dummy states, but because the N variable makes sure that the tri-gram probabilities are only considered after two Fw triggers, these variables will also have correct values by then. When Fw is not triggered they copy their value to the next time-slice.

The tri-gram probabilities can now be used correctly because W no longer depends on its previous values but on the −1W and −2W variables when Fw is triggered. When Fw is not triggered W still depends on its previous value because it copies that value to the next time-slice.

This model should now work if bi- and tri-gram probabilities are available for all possible word combinations. This is because once two words have been recognized only tri-gram probabilities will be used in the calculations. This is also true for the E variable, because this will also apply to the 'end of sentence' uni- and bi-gram tables created. Because they are obtained from bi- and tri-grams ending with the </s> word, a sentence can only end with a bi-gram that ends in </s>. It is therefore important that all possible combinations are available. This can be achieved using smoothing.


Smoothing

Usually not all possible tri- or bi-grams occur in the training data. Smoothing is used to give these unseen events (word orderings) a small probability instead of zero probability. Because the corpus on which the language model is trained does not contain all possible word sequences, and some of those may occur during speech, we need to reserve some probability for those events by lowering the probabilities of seen events. Once we have probabilities for all word sequences we can fill in the entire probability tables for the W variable.

The smoothing algorithm used in the language model is Modified Kneser-Ney smoothing, which seems to be the best smoothing algorithm currently available according to [2]. It is implemented by srilm and the precise algorithm can be found in [2]. It is an extension of absolute discounting, which is a smoothing technique that subtracts a small constant amount of probability (D) from each non-zero event. It then distributes that total probability mass evenly over unseen events. The amount D can be calculated with equation 5.1, where N1 stands for the number of n-grams that occur exactly once in the training data and N2 for the number that occur exactly twice.

D = N1 / (N1 + 2 · N2)    (5.1)
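As a worked illustration of equation 5.1 (with made-up counts of counts, and ignoring the multiple discounts that the full modified Kneser-Ney estimator in [2] uses):

    #include <cstdio>

    int main() {
        // Toy counts of counts: N1 = number of n-grams seen exactly once in the training
        // data, N2 = number seen exactly twice (made-up values).
        double N1 = 120000.0, N2 = 45000.0;
        double D = N1 / (N1 + 2.0 * N2);  // equation 5.1
        std::printf("absolute discount D = %.3f\n", D);  // prints 0.571 for these counts
        return 0;
    }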

Kneser-Ney builds on the idea that the influence of lower-order distributions (like uni-grams) is only important if the higher-order distributions have only a few counts in the train data. Kneser-Ney smoothing therefore looks at the number of contexts a word appears in, instead of the number of times a word appears. [2] motivates this by the example of the bi-gram 'San Francisco'. Most smoothing methods will assign a too high probability to bi-grams that end in 'Francisco', because 'Francisco' has a high uni-gram count. However, when 'Francisco' appears it is almost always after 'San'. To assign those other bi-grams with 'Francisco' lower probabilities, the uni-gram 'Francisco' should receive a lower probability. The bi-gram 'San Francisco' will not be affected much because it has a high bi-gram probability.

There are two other techniques that also address the problem of unseen events in the training data: backoff and interpolation. Backoff uses a lower-order n-gram probability when no higher order is available. If a tri-gram probability cannot be found for the current word in combination with the previous two words, a bi-gram probability is used. If that also cannot be found the uni-gram probability is used. This approach is however not possible in the Gaia toolkit due to the way Gaia is constructed. Interpolation is discussed in the next section.


5.2.3 Interpolated Word Tri-gram

Interpolation is a technique that is used to better estimate the N-gram probabilities of unseen N-grams. The idea follows from the next example. If the bi-grams 'who are' and 'who art' both do not occur in the training data, they can be given an equal amount of probability by smoothing. The bi-gram 'who are', however, is much more likely to appear, due to the word 'are' being more common than 'art'. Interpolation uses this information by averaging over all available N-grams. The tri-gram P(Wi|Wi−1Wi−2) becomes:

P(Wi | Wi−1, Wi−2) = λ1 P(Wi | Wi−1, Wi−2) + λ2 P(Wi | Wi−1) + λ3 P(Wi)    (5.2)

For the bi-gram example given above this would make the total probability of 'who are' larger, because the uni-gram probability of 'are' is larger than that of 'art' (the bi-gram probability is the same for both). Interpolation can be used in conjunction with smoothing or without. For the example this means that the bi-gram probability is either a small smoothed amount or zero.
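A small illustration of equation 5.2 for this example is given below; the component probabilities and λ weights are made-up numbers.

    #include <cstdio>

    // Interpolated tri-gram estimate as in equation 5.2; probabilities and weights
    // below are illustrative values only.
    double interpolate(double pTri, double pBi, double pUni,
                       double l1, double l2, double l3) {
        return l1 * pTri + l2 * pBi + l3 * pUni;
    }

    int main() {
        // 'who are' vs 'who art': identical (zero) tri- and bi-gram probabilities,
        // but a different uni-gram probability for the last word.
        double pAre = interpolate(0.0, 0.0, 0.020, 0.6, 0.3, 0.1);
        double pArt = interpolate(0.0, 0.0, 0.001, 0.6, 0.3, 0.1);
        std::printf("P(are | context) = %.4f, P(art | context) = %.4f\n", pAre, pArt);
        return 0;
    }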

An interpolation structure can be created in the dbn model as in figure 5.6. The interpolation structure is shown in the model from figure 5.4 instead of the model from figure 5.5 for clarity. The structure is simple: it consists of two λ variables, one connected to the W variable and one connected to the E variable. The λ variable for the W variable has three values, which are the weights for the uni-, bi- and tri-gram probabilities as in equation 5.2. The λ variable for the E variable works the same but has only two values, because the E variable works with bi- and uni-grams only.

The values for the λ weights can be obtained by training this model on data while keeping all other variables static. This can be done easily with the Gaia toolkit.

5.2.4 Phone Recognizer

Figure 5.6: Language model with Interpolation construction

Instead of recognizing words I also created a model that recognizes phones. The advantage of this model is that there is no lexicon needed, which leads to a smaller model, and that it can recognize every possible word; it is not bounded by the lexicon. The disadvantage is that this also means that the model can recognize non-existing words and that the information obtained from the lexicon (which phones are likely to follow each other) is not available. This information can however be obtained by using a language model on the phone variable. Because there are only around 40 phones used, there is enough data in the cgn corpus to train an n-gram without smoothing. The small phone set also enables the use of larger n-grams, because even a trigram would give only 40³ = 64,000 possible combinations.

Figure 5.7 shows two slices of the model, which is enough to explain how the model works without the picture becoming too cluttered. The P, Fs, S, M and O variables are present like in the previous models. The language model is built around the P variable, like the W variable from the tri-gram word model. The −1P and −2P variables store the actual previous values of P in the tri-gram model, and the N variable keeps track of how many previous P values we can look back in time, up to a maximum of two in this case. These variables are now dependent on the Fs variable, because this signals when a phone has finished. When Fs does not signal, P, N and all the −P variables copy their values to the next time-slice. If Fs signals, the N variable increases its value (if possible), the −1P variable copies the value from the P variable and the −2P variable copies the value from the −1P variable. The EOI variable is also connected to the Fs variable, such that at the end of the input only a number of whole phones can be recognized. In the next time-slice the P variable can, according to the value of N, use the correct previous values of P in the calculations of the inference algorithm. Interpolation is done using the same λ construction as described in the previous section, but on the phone level.


Figure 5.7: Phone recognizer with Tri-gram


Chapter 6

Implementation

In this chapter a description of the tools that I created is given. The final products are a number of separate tools, because this modular design allows for flexibility and expansion possibilities. The classes PreProcessor, Trainer and Recognizer are thus not integrated into one program to make a full speech recognizer. For this project that is sufficient, because it is not necessary to do 'real-time' asr. The separation made it possible to run a training cycle while working on another part.

The products are mostly developed platform-independent on Windows and Linux machines, but are mainly tested on Linux because the fast computer available for this project was running Linux. Because debugging with Microsoft Visual Studio was really helpful I also tested on a Windows machine, and therefore most of the tools should also work on Windows. The difference in environments between Linux and Windows made it sometimes annoying to work with simple text output files, because the end-of-line symbols are different for those environments. Some of the output files suffered from this, and the annotations of the Polyphone data, which were created under Windows, also had problems with this. I should have used xml files for the output files from the PreProcessor to be more system-independent, but time constraints have kept me from doing that.

6.1 PreProcessor

The PreProcessor tool is used to pre-process data before it can be used by the Trainer or Recognizer. It can cut long audio data into smaller pieces and converts data to the formats that Gaia uses. This cutting is helpful because the Gaia toolkit works faster on smaller pieces. It has 2 helper classes, which are described in the next sections. How these classes relate to each other and what functions they provide is shown in figure 6.1.

Inputs As input it supports the same audio input as htk, because it uses hcopy to generate .mfc feature files from the audio data. Those .mfc files are then converted to .seq files, which is the format Gaia uses (although the .seq file extension is not specified for the Gaia toolkit). This format is a Gaia ObservationFile object which is written to a binary file. The information flow for this conversion is shown in figure 6.2. As input for cutting the audio and annotation files it supports the .skp files from the cgn data. It therefore only supports cutting for the cgn data. Functionality was created to read the .uem files for cutting the N-Best data, but this has never been tested. As annotation input it supports the cgn .awd and .wrd annotation files and the .lab files from the Polyphone data. Those files will be converted to the .saf (Simple Annotation File) format, which is used to create the Gaia xml files from. The .saf format is kept simple such that more conversion possibilities can be added. It consists of 4 lines in a text file: the first line states the number of words, the second line contains the words, the third line contains the number of phones and the last line contains the phones. A small sketch of writing this format is given below.
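The sketch below writes this 4-line layout; the helper is illustrative and may differ from the PreProcessor's own writer.

    #include <cstddef>
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Write one Simple Annotation File (.saf): number of words, the words,
    // number of phones, the phones.
    void writeSaf(const std::string& path,
                  const std::vector<std::string>& words,
                  const std::vector<std::string>& phones) {
        auto join = [](const std::vector<std::string>& items) {
            std::ostringstream out;
            for (std::size_t i = 0; i < items.size(); ++i) out << (i ? " " : "") << items[i];
            return out.str();
        };
        std::ofstream saf(path);
        saf << words.size() << '\n' << join(words) << '\n'
            << phones.size() << '\n' << join(phones) << '\n';
    }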

Configuration File The tool uses a configuration file to specify a number of options, which are shown in table 6.1. For all the input and output file formats a root directory can be given, which either contains the input files or specifies where the output files should go. More on these directories is discussed in the section on the FileManager class. The OVERWRITE option specifies whether the program overwrites existing output files or skips processing the input file if the output file already exists. If an input file is corrupt and an exception occurs during processing, this input file can be removed from the directory with the REMOVE option. It is not necessary to do this, because at each processing step a new list of input files is created, such that unprocessed files will not be missing. The PMAP option specifies where the phone mapping file is located. This file is an xml file where each possible phone is mapped to an index. The location of the similar lexicon mapping, which maps each word to an index, is specified with the LMAP option. These mappings are needed to generate the correct Gaia context files from the annotation files. These context files are needed in training. For the same reason the options N_PHONEMES, N_WORDS, N_STATES and N_MIXTURES specify the number of these variables respectively. Because different audio data needs different parameters for processing by htk, the configuration file used by htk can be set with the option HTKCONFIGSTRING.

Figure 6.1: Classes of the PreProcessor tool

Figure 6.2: Information flow for conversion from audio to the Gaia format. hcopy is used by the PreProcessor to convert the audio to a feature file

6.1.1 FileManager

FileManager is a class which helps to keep the file managing (where to put input and output files) separated from the other classes. It keeps track of file locations with a map container that uses the filename without extension as key, and the file location path as value. It has a single-file get method and an iterator get method which is used to access all files one by one.

All functions from the PreProcessor require that the map container in the FileManager object contains the locations of the files that need to be processed. To load the container a BuildFileList() method is called. Because this function is very fast compared to the pre-processing steps, it is called for every conversion step separately. This allows the use of the pre-process functions separately and also makes it easy to remove corrupt files from the data in between the pre-processing steps.

FileManager is initiated with the root directories for the different file formats (written in the configuration file): one for the audio files, one for the .mfc files, etc. It then copies the underlying directory structure of the audio root directory to the other root directories. Whenever a file is processed, the resulting file is placed in the same relative place but under the appropriate root directory. In this way all the files corresponding to the same audio file are placed in the same relative place, which simplifies the file managing. This method of storing the files was preferred because we are usually working with large amounts of files, so putting everything in one directory would make things cluttered. Furthermore, no checks are performed to see if the file has the right extension; all files in a certain directory will be processed. Therefore it is important that the directories used for different file formats are different. Before training only the audio and annotation directories exist; the other directories can be created automatically.
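A minimal sketch of this bookkeeping is given below, written against the modern std::filesystem library rather than the project's 2009 code base; the real FileManager implementation may differ.

    #include <filesystem>
    #include <map>
    #include <string>

    namespace fs = std::filesystem;

    // Build the basename -> path map for one root directory (a sketch of what
    // BuildFileList() does).
    std::map<std::string, fs::path> buildFileList(const fs::path& root) {
        std::map<std::string, fs::path> files;
        for (const auto& entry : fs::recursive_directory_iterator(root))
            if (entry.is_regular_file())
                files[entry.path().stem().string()] = entry.path();
        return files;
    }

    // Place an output file at the same relative position under another root directory,
    // as the PreProcessor does for the .mfc, .seq and .saf files.
    fs::path mirrorUnderRoot(const fs::path& input, const fs::path& inputRoot,
                             const fs::path& outputRoot, const std::string& newExtension) {
        fs::path out = outputRoot / fs::relative(input, inputRoot);
        out.replace_extension(newExtension);
        fs::create_directories(out.parent_path());
        return out;
    }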


Table 6.1: Configuration commands for the PreProcessor

AUDIOROOT        Location of the audio files to be processed
MFCROOT          Location where the .mfc files should go
SEQROOT          Location where the .seq files should go
AWDROOT          Location of the automated cgn annotation files
WRDROOT          Location of the hand-annotated cgn files
LABELROOT        Location of the Polyphone annotation files
SAFROOT          Location where the .saf files should go
PTABLEROOT       Location where the Gaia context files should go
SEGROOT          Location of the segmented audio files to be processed
SKPROOT          Location of the automated annotation cgn files used for cutting
SKPWRDROOT       Location of the hand-annotated cgn files used for cutting
AUDIOCROOT       Location of the cut audio files to be processed
ANNOTCROOT       Location of the cut cgn annotation files to be processed
OVERWRITE        With this option TRUE it will overwrite existing output files
REMOVE           With this option TRUE it will remove input files if it cannot process them
PMAP             Location of the phone mapping file
LMAP             Location of the lexicon mapping file
N_PHONEMES       The number of phones used in the system
N_WORDS          The number of words used in the system
N_STATES         The number of states used in the system
N_MIXTURES       The number of mixtures used in the system
HTKCONFIGSTRING  Location of the htk configuration file


6.1.2 IOConverter

IOConverter is a class that has methods for conversion between different file formats. Because the Gaia engine has a specific input format, this conversion is needed. At this time the class supports the following conversions:

• Audio to a htk feature file (with .mfc extension)

• htk feature file to Gaia Observation File (with .seq extension)

• cgn annotation file in either .awd or .wrd format to .saf format

• Polyphone annotation file in .lab format to .saf format

• .saf format to four Gaia context xml files

For the audio conversion htk is used. A configuration file is needed to run the HCopy program, in which the input and output format is specified. The input format should match the audio file specifics and the output format can be chosen freely, because the tool should be able to read other formats according to the htk file header. In the acoustic model I used 39 mfcc features as observations, every 10 ms for a 25 ms interval. The overlapping in time should help to better capture the global shape of the speech signal. The 39 entries of the feature vectors are the same features as described in the second chapter of this report. As a 40th feature I added a boolean value to each time-slice indicating whether it is the last time-slice of the observation (value 1) or not (value 0). This feature is observed by the end of input variable in the model.
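A minimal sketch of appending this 40th feature is shown below; the 39-dimensional frames are assumed to have been read from the htk .mfc file already.

    #include <cstddef>
    #include <vector>

    // Append the end-of-input flag: 1 for the last time-slice of the observation,
    // 0 for all others.
    std::vector<std::vector<double>> addEndOfInputFlag(std::vector<std::vector<double>> frames) {
        for (std::size_t t = 0; t < frames.size(); ++t)
            frames[t].push_back(t + 1 == frames.size() ? 1.0 : 0.0);
        return frames;
    }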

When converting the cgn annotation files and Polyphone annotation files to the .saf format, the annotations are checked for phones or annotated symbols that do not occur in the phone set I use. These sets are shown in table 6.2, which contains the cgn phone set, plus a symbol for silence and a garbage symbol for noise, and the Polyphone set. The example words in the table are Dutch because the table describes phones for the Dutch language. The Polyphone set contains fewer phones than the cgn set, so in some cases the same phone is used. All the unknown symbols in the annotations, which usually represent background and mouth noise, are changed into a noise symbol. This is done such that in training these sounds will not be incorrectly seen as data for a normal phone. Furthermore, when a word ends in a phone which is the same as the first phone of the next word and those phones are pronounced as one sound, this is annotated with an underscore symbol between those phones in the cgn annotations. For the .saf file this is replaced by just one symbol for that phone.


Table 6.2: cgn and Polyphone phone set

Class            Example            CGN symbol   Polyphone symbol
Plosives         put                p            p
                 bas                b            b
                 tak                t            t
                 dak                d            d
                 kat                k            k
                 goal               g            g
Fricatives       fiets              f            f
                 vat                v            v
                 sap                s            s
                 zat                z            z
                 sjaal              S            sh
                 ravage             Z            zj
                 licht              x            x
                 regen              G            g
                 geheel             h            h
Sonorants        lang               N            nn
                 mat                m            m
                 nat                n            n
                 oranje             J            nn j
                 lat                l            l
                 rat                r            r
                 wat                w            w
                 jas                j            j
Short vowels     lip                I            i
                 leg                E            e
                 lat                A            a
                 bom                O            o
                 put                Y            y
Long vowels      liep               i            ie
                 buur               y            yy
                 leeg               e            ee
                 deuk               2            eu
                 laat               a            aa
                 boom               o            oo
                 boek               u            u
Schwa            gelijk             @            at
Diphthongs       wijs               E+           ei
                 huis               Y+           ui
                 koud               A+           ou
Loan vowels      scene              E:           eh
                 freule             Y:           euh
                 zone               O:           oh
Nasal vowels     vaccin             E            eh
                 croissant          A            a
                 conge              O            o
                 parfum             Y            uh
Extra symbols    silence            sil          sil
                 noise              #            -
                 background noise   -            bgs
                 mouth noise        -            mn

The four Gaia context xml files that are created from the .saf file are specific to the acoustic model I used in this project. They are needed for training the model and represent the following. The .tb1 files specify which phones are spoken at which position in the audio file and correspond to the P node in the model. The .tb2 files dictate what the values of Nw are, given the previous Nw, Fs and Fw (when we move to the next position or phone in the file). The .tb3 files state in which state the Nw node starts, which is always position 0. The .tb4 files dictate when the EOI node should trigger and signal the end of the input. Because the annotation also contains silences, which are modeled by a 'sil' phone, it is no problem to act as if the total file only contains one 'word'.

6.2 Trainer

This class does the training of the dbn model. As inputs it uses the dbn model in xml format and the root directory where the training data is located. Furthermore it reads training options from a configuration file, and an output xml file can optionally be specified, otherwise it will overwrite the input xml file. The options read from the configuration file are the Gaia engine, the training phase and the islands values. For training only the BoyenKoller engine is currently supported. The training phase is a variable that can be set in the dbn model for every variable. It specifies in which phases its multinomial tables can be updated by training. It is thus possible to train one variable at a time while keeping the others static. The islands values can help to tune the performance of the Gaia toolkit by dividing up the large multinomial tables used in the calculations. The first value specifies in how many chunks the tables will be divided and the second value specifies after how many milliseconds of input speech that will happen. When there is little memory available this can help avoid a lot of data swapping to the hard drive. The test machine used in the project had enough memory available, so a value was chosen (30 sec) that fits every file entirely. The default is to never divide the tables.

Training can also be done using accumulators. This enables several instances of the program to run simultaneously, each on a part of the data, and write their output to the accumulator files. These files can later be combined to form the total output. This was very useful during the project because the Gaia toolkit only uses a single processor in training, while I had eight processors to work with. This option is enabled by the -A parameter, with as arguments the directory where the accumulators will be stored (that directory can be the same for all instances of the program), an index to indicate to which training session the results belong (this index is the same for all instances of the program) and an index to differentiate between the instances. I used the tools described in section 4.4 to split the data into small sets for each instance.

The Trainer class uses the FileManager class to obtain an iterator for all the files it needs to process. It only processes the .seq files in the directory; the corresponding context files need to be in the same directory as the .seq file. The extensions of the context files are given in the train xml model and in that way they are linked to the variables in the model. If a file cannot be processed it will be skipped.

6.3 Recognizer

The Recognizer class does recognition with a trained dbn model. As inputs it uses a dbn model in xml format, a configuration file with options and a root directory of files to process. All these options work the same as with the Trainer. In the configuration file the engine is currently restricted to the Frontier engine, because the BoyenKoller engine does not have the appropriate algorithms implemented. Output is written to the log file, which can be processed afterward to create a more readable output and to check directly how many of the files were correctly recognized. A prerequisite for this is that the word mapping file or phone mapping file is given as input.


These specify which word or phone corresponds with the indices given as output by the Gaia toolkit. To speed up the recognition process, pruning can be enabled, with the trade-off that a lower recognition rate can be expected. The pruning parameter can be anywhere from 0 to 1, where a value of 0 means that no paths are pruned and a value of 0.5 means that 50% of the possible paths are pruned.

6.4 GaiaXMLGeneration

For the Gaia toolkit a tool was planned to create an entire dbn model given some model specification. A problem with this idea was how to represent a model to the tool, which then has to create an xml version of it. Because the xml structure which is used to specify the model is already fairly compact, it was decided that users would have to create the basic xml structure themselves. Creating a tool that could create a dbn from any xml structure was still considered too large a task for this project. The tool I made has 'crude' functions to help create the Gaia model defined in xml format. It only facilitates the creation of the dbn models described in chapter 5 of this report, although some variables (the number of words, for example) from the models can be changed. It is implemented as a single program that has a list of functions that can generate multinomial tables for the xml model, given some input which can be supplied via the configuration file.

GaiaXMLGeneration can create and load mappings for the lexicon and phones, which are needed because in the Gaia tables only numeric indices can be used. The functions that generate the tables are sorted into tables for the training part and for the recognition part. The tables are created according to the variable descriptions from chapter 5 and as discussed in section 4.1.2. Depending on the language model that is used (specifically the number of words and the order of the n-gram used) they can become very large. This is also the reason why the xml model can consist of multiple files which are included in the main xml file, such that the small main file is easily editable by hand.

One noteworthy function is the one that creates the default table for the O variable. The default table for all observations before training is calculated by taking a subset of the data (cgn or Polyphone) that consists of about 1000 files. By calculating the mean and variance of all the observations in those audio files we obtain a 39-dimensional Gaussian distribution, which is written to an xml file. Using these default values instead of standard values should somewhat reduce the time needed to train the model properly.
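
A minimal sketch of how such a default table could be computed is given below. It assumes the 39-dimensional feature vectors can be read as plain text, one observation per row; Gaia's own observation format would need the toolkit's reader instead.

    # Sketch: estimate one 39-dimensional diagonal Gaussian over a subset of
    # about 1000 feature files and use it as the default distribution for O.
    # Assumption: each file is a whitespace-separated text matrix with one
    # 39-dimensional observation per row (not Gaia's actual .seq format).
    import glob
    import numpy as np

    files = sorted(glob.glob("subset/*.txt"))[:1000]
    all_frames = np.concatenate(
        [np.loadtxt(f).reshape(-1, 39) for f in files], axis=0)

    mean = all_frames.mean(axis=0)       # 39-dimensional mean vector
    variance = all_frames.var(axis=0)    # per-dimension variance (diagonal covariance)
    # These values would then be written into the xml table of the O variable
    # as the untrained default Gaussian.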

6.5 lexiconTool

The reason I created a separate lexicon file instead of using the cgn version was the set of operations I had to perform on the lexicon: removal of words, sorting by size and loading it into a data structure in memory. These operations were easier to do with the simple lexicon file, and it made no sense to store the unused information.

This small tool has a number of functions that can be applied to the lexicon. It is used to convert the cgn lexicon file (the text version) to a simpler version of the lexicon, removing information that was not used in this project. In this simple version each word has one line containing the word, its phonetic transcription in Dutch, in normal Flemish and in formal Flemish.

Example: slepen — s l e p @ — s l e p @ n — s l e p @ —

The phonetic symbols used in the cgn files are discussed in table 6.2.

In this first version of the dbn speech recognizer I used only the Dutch transcriptions when creating the dbn model. When a dialect context variable is added to the dbn model, the Flemish transcriptions can also be used. The other options of the tool are:

• The removal of all doubly listed items from the lexicon. These entries are probably due to different part-of-speech tags for the same words. Because the transcriptions of those double entries are the same, it does not matter which entry is removed.

• The removal or keeping of words from the lexicon given a part of the cgn frequency list. Because the lexicon is not assumed to be ordered, this operation becomes expensive for large lexicons. The total cgn lexicon size is about 150.000 words (without double entries). For this project different lexicon sizes were used, so only the words that occurred most often were kept in these lexicons. The input is either a list of words to keep or a list of words to remove. A minimal sketch of these lexicon operations is given after this list.

• Sorting the lexicon according to the size of the Dutch phonetic transcription (number of phones). This step is necessary because the multinomial tables in Gaia need their indices ordered from low to high. Because one particular multinomial table (the P node) contains the lexicon information, it was necessary to order that information so that those tables could be generated relatively easily and quickly.


• Removing the phonetic transcriptions from the file, leaving a word list. This (sorted) word list was used to create the lexicon mapping, which maps each word in the lexicon to an index in the multinomial.
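
A minimal sketch of these lexicon operations, assuming the simple lexicon format shown in the example above with the fields separated by a '|' character (the actual delimiter may differ):

    # Sketch of the lexiconTool operations: collapse double entries, keep only
    # the words from a frequency list, sort by the length of the Dutch phonetic
    # transcription and produce the word list used for the lexicon mapping.
    # Assumed file format: word|dutch|flemish normal|flemish formal per line.
    def load_lexicon(path):
        entries = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, dutch, fl_norm, fl_form = [p.strip() for p in line.split("|")]
                entries[word] = (dutch, fl_norm, fl_form)   # duplicates collapse here
        return entries

    def keep_words(entries, words_to_keep):
        wanted = set(words_to_keep)            # set lookup keeps the filtering linear
        return {w: t for w, t in entries.items() if w in wanted}

    def sort_by_phone_count(entries):
        # The Dutch transcription is a space-separated phone string; short first.
        return sorted(entries.items(), key=lambda item: len(item[1][0].split()))

    def word_list(sorted_entries):
        # The position in this list becomes the word's index in the multinomial.
        return [word for word, _ in sorted_entries]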

6.6 createLMtext

This small tool is used to create a text file to train a language model on. Its input is a directory, in which it recursively reads all .saf (simple annotation file) files and writes each sentence to a separate line. Because this functionality did not fit in with the idea behind the other tools, it is implemented separately.
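
A minimal sketch of this tool is shown below; the function that extracts sentences from a .saf file is left as a placeholder because the .saf format is not described here.

    # Sketch of createLMtext: recursively collect all .saf files under a root
    # directory and write one sentence per line to a language-model training file.
    # sentences_from_saf() is a placeholder; the .saf format is not parsed here.
    import os

    def sentences_from_saf(path):
        # Placeholder: yield the sentences contained in one simple annotation file.
        raise NotImplementedError

    def create_lm_text(root, out_path):
        with open(out_path, "w", encoding="utf-8") as out:
            for dirpath, _dirs, files in os.walk(root):
                for name in sorted(files):
                    if name.endswith(".saf"):
                        for sentence in sentences_from_saf(os.path.join(dirpath, name)):
                            out.write(sentence + "\n")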


Chapter 7

Experiments and Results

This chapter discusses which experiments were done during the project and what the results of those experiments are. The experiments are described in chronological order to explain how the project developed over time. All experiments were done on a computer running the Ubuntu Linux distribution. The computer specifications can be found in table 7.1.

7.1 Initial Experiments

I started by training the acoustic model (from section 5.1) on a part of the cgn data, because the Gaia toolkit was not fast enough to train on the entire cgn dataset in a reasonable amount of time. The 'comp-o' part of the cgn dataset was chosen to train the model because that part contains the best and clearest speech (read speech). Furthermore, only the Dutch part of that dataset was used: I wanted to create a simple recognizer, and also using the Flemish dialect would make the system more difficult to train properly. This Dutch part of the dataset consists of around 48.000 audio files in .wav format with a total size of 5,4 gb. 80% was used to train the model, 10% was reserved for testing purposes and the remaining 10% to evaluate performance.
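
The split itself is straightforward; a sketch is shown below (the data path is illustrative).

    # Sketch of the 80/10/10 split: shuffle the file list once with a fixed seed
    # and cut it into training, test and evaluation sets.
    import glob
    import random

    files = sorted(glob.glob("cgn/comp-o/nl/**/*.wav", recursive=True))  # illustrative path
    random.Random(42).shuffle(files)

    n = len(files)
    train_set = files[: int(0.8 * n)]
    test_set = files[int(0.8 * n): int(0.9 * n)]
    eval_set = files[int(0.9 * n):]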

Table 7.1: Test machine specifications

Processors   2 x Intel Xeon E5430 (quad-core @ 2,66 GHz)
Memory       16 GB
Hard drive   Samsung F1 HD753LJ (750 GB, 32 MB cache, 7200 RPM)


To speed up the training I cut the audio and annotation files into smaller pieces of one sentence per file, using the detailed cgn annotation files. These detailed files enabled the cutting to be done precisely, so that annotation and audio matched. It is faster to use many smaller files than fewer larger files, because the Gaia toolkit stores an intermediate probability table for each frame when using a forward-backward approach, so more memory is needed for longer files. Furthermore, as those tables are processed the calculations become increasingly hard and thus cost more time. To speed up the training process even more, I split the training data set into 8 parts as described earlier in this report and divided those sets among 8 instances of the training program. The training took about 3 days per training cycle.

Once the model had been trained for 10 cycles I tried to test the recognizer on some of the test files from the 'comp-o' dataset. The idea was to use an 18.000-word lexicon consisting of all the words in the cgn corpus that occurred five or more times. The tests showed that even with a uni-gram word model the time it took to process just one file was too long (a few hours) to test a large test set. Eventually, after decreasing the size of the language model to a lexicon of the 1000 most common words, I was able to process a file in a reasonable amount of time (less than one hour). However, because this lexicon was so small, testing the model on a random set of files from the comp-o set would give very bad results, since most words in those files would not be in the lexicon.

7.2 Smaller Data Set

Because the Gaia toolkit was too slow for a large vocabulary, I decided to use a smaller, different set to test the recognizer on. This set is part of the Dutch Polyphone data set, a corpus consisting of telephone speech [3]. The corpus consists of 8 cds of Dutch telephone speech. As one cd was missing, the remaining corpus contained around 24.000 audio files with a total size of 2.3 gb. The subset I used for testing consisted of 181 files, with just one word spoken in each file. The lexicon for this set consists of 150 words. This set has been used for a practical assignment in the speech recognition course at TU Delft. As a reference to test the dbn speech recognizer against, I used the hmm-based recognizer that is also used in this practical assignment. This asr is created and trained with htk and uses a word loop where each word is optionally preceded and followed by a silence and where each word has the same probability (uni-gram).


Figure 7.1: The dbn model used in the recognition experiments.

A version of the Viterbi algorithm is used to calculate the most likely word(s). The training data for this recognizer was the part of the Polyphone corpus that contains only sentences.

Because the data in the 'comp-o' part of the cgn corpus was not very dissimilar from the Polyphone set, I took the trained model and retrained it on the new data. After 2 training cycles the result of a test was 0%, so I did 10 more training cycles, which took about a day per cycle. A test on the practical set showed that the results were still 0%. Inspection of the parameters led me to believe that they were not correct. It seemed that during training possible paths through the model kept lingering in some state in the middle of a word, and did not go all the way to the end of the model and thus through all states of a word. The solution was to add the new EOI variable, discussed in section 5.1, to the model. This variable observes a new feature, a binary flag signalling the end of the input, which was added to the observation files. This construction makes sure that the model reaches the end state during training and that during recognition only paths that finish a complete word can be the result.
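
The construction of that end-of-input feature can be sketched as follows; how the flag is actually stored in Gaia's observation files is not shown here, this only illustrates the idea.

    # Sketch of the end-of-input feature observed by the EOI variable: every
    # frame gets one extra binary value that is 0, except for the last frame of
    # the file, where it is 1.
    import numpy as np

    def add_eoi_flag(features):
        """features: (n_frames, 39) array -> (n_frames, 40) array with EOI flag."""
        flag = np.zeros((features.shape[0], 1))
        flag[-1, 0] = 1.0
        return np.hstack([features, flag])

    frames = np.random.randn(120, 39)        # stand-in for one utterance
    print(add_eoi_flag(frames)[-1, -1])      # 1.0: the last frame signals end of input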


7.3 Recognition Experiments

Since training a new model (with the EOI variable) from scratch takes some time, I first wanted to test the inference part of the speech recognizer separately. I copied trained parameters from the reference system used in the practical to the dbn model shown in figure 7.1. Those trained parameters represent the pdf parameters of the O variable, the state transition parameters used in the S variable and the state end parameters of the Fs variable. The model differs in two points from the standard model explained in chapter 5. The M variable is removed because no mixtures were used in the experiments, and the link from the P variable to the O variable is gone, which simplifies the model a bit when creating it but does not change its behaviour. Instead of having 3 states (S) for each of the 44 phones (P) and a Gaussian pdf (O) for each combination of those, the changed model has 132 (3 x 44) states, where each phone has 3 of those states and there is one Gaussian pdf for each state.
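
The flattened state numbering can be illustrated with a small sketch; whether Gaia numbers the states phone-major as assumed here is not essential to the idea.

    # Sketch of the flattened numbering used in this experiment: 44 phones with
    # 3 sub-states each become 132 states, each with its own Gaussian pdf.
    N_PHONES = 44
    STATES_PER_PHONE = 3

    def state_index(phone, substate):
        # phone in [0, 43], substate in [0, 2] -> flat state index in [0, 131]
        return phone * STATES_PER_PHONE + substate

    def phone_and_substate(state):
        return divmod(state, STATES_PER_PHONE)

    assert state_index(43, 2) == 131
    assert phone_and_substate(131) == (43, 2)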

Results of the experiments discussed here are presented in table 7.2. The measured times are approximations, because there was no guarantee that the experiments were performed under identical conditions: the test machine was being used by multiple people and sometimes not all processor power was available. This was taken into account when making the approximations. Using the parameters from the htk system, the test on the practical set gave a 10% (18/181) recognition rate. Because the system used in the practical had a 77% (140/181) recognition rate, something must have been different in the dbn system. As the speech in the audio files does not begin and stop exactly at the beginning and end of the file, adding silence phones to the beginning and end of each word annotation seemed a good idea. Using this language model I obtained a recognition rate of 75% (135/181).

The small difference in recognition rate between the dbn system and the htk reference system can be explained by the fact that the htk system allows the first silence to be skipped and that some files do begin immediately with the word being uttered. To test this I first enabled the recognition of multiple words by adding the Fw variable to the model, which is explained in section 5.2.1. This resulted in the same recognition rate of 75% as the model that can only recognize single words. This result shows some robustness, because the task difficulty increased while the results stayed the same. To further improve the model, the silences at the beginning and end of the annotation of each word were removed and a silence was added to the lexicon as a separate word. This experiment resulted in a 0% recognition rate.


Table 7.2: Results practical experiment

System                                                                  Recognition rate   Approx. time per file
htk reference system                                                    77%                2 sec
dbn system, parameters copied from reference system                     10%                22 minutes
dbn system, copied parameters, silence in annotations                   75%                22 minutes
dbn system, copied parameters, silence in annotations, multiple words   75%                22 minutes
dbn system, copied parameters, silence in lexicon, multiple words       0%                 17 minutes


7.4 Pruning Experiments

To test the pruning technique implemented in the Gaia toolkit, I ran the test using the dbn system with the pruning parameter enabled. This dbn system has the trained parameters from htk and silence added to the annotations. The values I tested and the results are shown in table 7.3. It lists the pruning value used, the recognition rate, how many files gave a result and the approximate time it took to complete one file. For experiments where most files did not give a result, the approximate time is the average over the files that did give a result.

The first value tested was 0.5, which means that paths with a probability less than 50% of that of the best possible path are pruned. This value was too strong, because almost all files resulted in no path at all. This can happen because during the forward pass of the inference algorithm all paths that reach the end of a word can be pruned, such that the EOI variable cannot be matched.


Next, values of 1.0 × 10^-3 and 1.0 × 10^-13 were tested; although the recognition rate was still zero, the number of files that gave some result increased. Because there was improvement when using lower values, I then tested a value of 1.0 × 10^-100. This experiment resulted in almost the same recognition rate as the experiment without the pruning parameter, but the measured time the experiment took did not improve. The difference in recognition rate lies in the fact that the pruning experiment resulted in multiple words in a few cases. Although the correct answer was among them, these outputs were counted as incorrect. Experiments with values of 1.0 × 10^-75 and 1.0 × 10^-50 resulted in more files with multiple outputs. As by this time the project was at its end, no further experiments were done to search for the optimal pruning value for this test.

From these results it can be concluded that the pruning technique as currently implemented is not yet very useful: even though it can obtain results similar to those of the system without pruning, no time improvement has been measured. Furthermore, there seems to be a problem with the algorithm, because it sometimes outputs multiple words when it should not. Because there exists no literature on pruning in dbns (and even in hmms it can be considered an experimental technique), the pruning algorithm currently implemented is straightforward. At every time slice during the forward pass it locally selects the best paths, which may not be the globally best paths. Designing a more elaborate pruning algorithm for dbns that can look ahead could be very useful.
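
The local, per-slice behaviour can be illustrated with a small sketch; this is only an illustration of relative pruning as described above, not the Gaia implementation.

    # Sketch of relative pruning: at each time slice, drop the paths whose
    # probability falls below (pruning_value * probability of the best path in
    # that slice). The decision is purely local to the slice.
    def prune_slice(paths, pruning_value):
        """paths: dict mapping path id -> probability for one time slice."""
        if not paths:
            return paths
        limit = pruning_value * max(paths.values())
        return {p: prob for p, prob in paths.items() if prob >= limit}

    # With pruning_value = 0.5 a path at 40% of the best path is dropped, even
    # if it would have overtaken the best path in a later time slice.
    print(prune_slice({"a": 1.0e-5, "b": 4.0e-6, "c": 9.0e-6}, 0.5))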

7.5 Training Experiments

To test the training part of the toolkit, a fresh model as in figure 5.1 was trained on the Polyphone data. Silence phones were added at the beginning and end of each annotation because these were not present, and I expected them to improve training since silences are present at the beginning and end of the audio files. Each training cycle took about 3 days to finish. It turned out, however, that the training results were incorrect: after a few cycles the model parameters were trained such that most states (sub-phones) were immediately skipped, except for a few states belonging to longer, clear phones (for example the /oo/). This behavior is illustrated in figure 7.2, which shows the pdfs belonging to each phone. For simplicity the phones in this example have only one state and thus one associated pdf; in the model each phone (P) has 3 states (S) and thus 3 associated pdfs (O).


Table 7.3: Results pruning parameter

All rows use the dbn system with parameters copied from the reference system.

Pruning value     Recognition rate   Files with output   Approx. time per file
P = 0.5           0%                 6/181               20 minutes
P = 1 × 10^-3     0,006%             18/181              21 minutes
P = 1 × 10^-13    0%                 51/181              23 minutes
P = 1 × 10^-100   73%                181/181             20 minutes
P = 1 × 10^-75    73%                181/181             20 minutes
P = 1 × 10^-50    62%                181/181             20 minutes


Figure 7.2: Simplified example showing the pdf distribution for the phones of an example word. Top: model not trained. Middle: how a trained model could look. Bottom: model after training with Gaia.

The top figure shows how the model would look for the word 'boom' when no training has been done: as each pdf starts with the same untrained parameters, the distributions look the same. The middle figure shows, as a reference, how a trained model could look: the distribution for /b/ is short because it is a plosive phone, and the distribution for /oo/ is long because it is a long vowel. The bottom figure illustrates how the model looked after training with the Gaia toolkit. Some states were assigned all probability mass during training, such that the pdfs of the other states diminished. This behavior became exponentially worse with each training cycle.

A reason for this could be that the data files are too long to be processed by the training algorithm. Training uses the em algorithm, which can only guarantee a local optimum. It could be that this algorithm cannot cope with the task of training a large dbn model on data of this size. It was not possible to cut those files into smaller pieces because no detailed annotation data was available. To test whether training would work on smaller files, I used yet another data set: a subset of the Polyphone corpus in which a few digits are spoken in each file. The lexicon for this set consists of the digits 0 to 9, and the set contains about 2600 files with a total size of 460 mb.


Each training cycle took about 12 hours, but the same problems occurred here. An even smaller set was tested thereafter: a subset of the Polyphone corpus consisting of files that contain either a 'yes' or a 'no'. These 360 files, with a total size of 30 mb, were trained in about 5 minutes, but again the same problems occurred. As by this time the project was at its end, training using the dbn system remains unsuccessful.


Chapter 8

Conclusion and Recommendations

In the introduction I stated the goals for the project as follows:

• Research what makes hmms successful as an asr model.

• Design dbn models for the acoustic model, language model and training part of a speech recognizer.

• Design, implement and test a basic speech recognizer using the Gaia toolkit.

• Further develop and test the Gaia toolkit.

The first goal was reached during the literature study; a summary of its conclusions can be found in section 2.8. I did not have the opportunity to implement any of these techniques in the dbn system, so I make some recommendations about them in this chapter.

In chapter 5 the dbn models designed for the acoustic part of a speech recognizer, for different language models and for training a speech recognizer are discussed. During the project, improvements to the initial models and to the variable distributions were made, of which the addition of the EOI variable is the most important. The models for training the acoustic part and some simple language models have been tested. The dbn system is currently not fast enough to test the larger language models that were designed.

In chapter 6 I discussed the tools that were created to test the designed models and to build a basic speech recognizer, as stated in the third goal. The recognition experiments proved that the inference part of the toolkit works and that speech recognition with dbns is possible.


On the test set the dbn system obtained results comparable to those of a system created with htk.

Training using the dbn system has not been successful; there remains a problem with training that leads to bad results. The model used in training works in the recognition experiments, and the training algorithms from the Gaia toolkit work in tests with generated data done outside this project. Therefore the problem probably lies in the fact that the em training algorithm cannot cope with the large task of training on this data.

The last goal I reached by testing and debugging the programs I created and by providing debug information about problems whose origins lay in the code of the Gaia toolkit. An important improvement made to the toolkit during the project was the inclusion of context files during training. These context files are loaded for each data file separately and contain distributions for variables whose values are unknown in the model, as described in section 5.1.1.

As a recommendation for future work I would prioritize solving the problem regarding training. This is an integral part of the asr system and therefore it should be fixed. A possible solution could be to reduce the size of the training data even further, such that phones are trained separately; this is possible with the cgn annotation data. More elaborate training can be done once the model has been initialized in this way. Another possible solution is the use of a damping factor during training, which outputs, instead of the updated probabilities, an average of the original and the updated probabilities. This should help to slow down the growth of probability mass for the long, clear phones and thus give the other phones a chance to be trained properly.
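
A sketch of such a damped update is given below; the interpolation weight alpha and its value are illustrative, not part of the Gaia toolkit.

    # Sketch of the damping idea: instead of replacing the old parameters with
    # the newly estimated ones after an em cycle, use a weighted average of the
    # two. alpha = 1.0 gives the plain em update; smaller values slow down the
    # growth of probability mass on a few states.
    def damped_update(old_params, new_params, alpha=0.5):
        return [(1.0 - alpha) * old + alpha * new
                for old, new in zip(old_params, new_params)]

    print(damped_update([0.2, 0.8], [0.05, 0.95], alpha=0.5))   # [0.125, 0.875]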


When training does work, the next step could be to increase the speed of the system. This is especially useful for the recognition part, since even a moderately sized language model makes the system very slow for practical use. If the dbn system is going to be compared to popular hmm-based systems, it needs to be able to use much larger language models than those used for these tests. For the training algorithms the slow speed is somewhat less important, because training can already use multiple program instances to split the training task. An increase in speed for the recognition part can be obtained by implementing approximate inference techniques such as sampling methods or the Boyen-Koller algorithm, which can make a trade-off between speed and accuracy. Furthermore, designing a new pruning technique for dbns could increase speed without much loss of accuracy, as it does in hmm systems. A look-ahead technique could be used so that paths that later in the inference algorithm turn out to be the good ones are not pruned. In section 2.8 I discussed techniques that are used to improve hmm-based asr systems. From those techniques it may be useful to implement a two-pass technique to increase speed in the dbn system. As the Boyen-Koller algorithm allows a trade-off between speed and accuracy, it could be used both for a fast first pass and for an accurate second pass on the results of the first pass.

Another recommendation is to improve the usability of creating the xml models as described in section 4.1.2. Creating the xml tables by hand is possible, but getting the ordering of variables and values correct is sometimes difficult; automating this would be really helpful. As future work, the design and implementation of more advanced models that make use of the possibilities of the dbn and the toolkit will be very interesting. Models that use context information or systems that incorporate several input modalities are some examples.


Bibliography

[1] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 33–42, 1998.

[2] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. 1998.

[3] M. Damhuis, T. Boogaart, C. in 't Veld, M. Versteijlen, W. Schelvis, L. Bos, and Louis Boves. Creation and analysis of the Dutch Polyphone corpus. In ICSLP-94, 1994.

[4] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2000.

[5] Cambridge University Engineering Department. http://htk.eng.cam.ac.uk/, March 2009.

[6] Hynek Hermansky. Perceptual linear predictive (plp) analysis of speech. Journal of the Acoustical Society of America, 87:1738–1752, 1990.

[7] Finn V. Jensen. Bayesian Networks and Decision Graphs. Information Science and Statistics. Springer, July 2001.

[8] Star Laboratory. http://www.speech.sri.com/projects/srilm/, September 2008.

[9] Boost C++ Libraries. http://www.boost.org, May 2009.

[10] Kevin Murphy. A brief introduction to graphical models and Bayesian networks. http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html, 1998.

[11] Kevin Patrick Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California, Berkeley, 2002.

[12] Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

[13] B. Plannerer. An introduction to speech recognition. 2005.

[14] The N-Best Project. http://speech.tm.tno.nl/n-best/, June 2006.

[15] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach (Second Edition). Prentice Hall, 2003.

[16] S. J. Young, N. H. Russell, and J. H. S. Thornton. Token passing: A simple conceptual model for connected speech recognition systems, 1989.

[17] Dragon Systems. History of speech recognition and transcription software. http://www.dragon-medical-transcription.com/historyspeechrecognition.html, July 2008.

[18] Rob van de Lisdonk. An overview of techniques used in current hmm-based asr systems. Master's thesis, TU Delft, 2007.

[19] Pascal Wiggers. Modelling Context in Automatic Speech Recognition. PhD thesis, TU Delft, 2008.

[20] G. Zweig and M. Padmanabhan. Exact alpha-beta computation in logarithmic space with application to map word graph construction. In Proceedings of ICSLP'00, 2000.

[21] Geoffrey G. Zweig. Speech Recognition with Dynamic Bayesian Networks. PhD thesis, University of California, Berkeley, 1998.
