7/23/2019 Speech Lab - Project Report
http://slidepdf.com/reader/full/speech-lab-project-report 1/44
Cairo University
Computer Engineering Department
Giza, 12613 EGYPT

Speech Lab
Graduation Project Report
Submitted by
Amr M. Medhat Sameh M. Serag Mostafa F. Mahmoud
In partial fulfillment of the B.Sc. Degree in Computer Engineering
Supervised by
Dr. Nevin M. Darwish
July 2004
ABSTRACT
Speech has long been viewed as the future of computer interfaces, promising
significant improvements in ease of use and enabling the rise of a variety of
speech-recognition-based applications. With the recent advances in speech
recognition technology, computer-assisted pronunciation teaching (CAPT) has
emerged as a tempting alternative to traditional methods, capable of
supplementing or replacing direct student-teacher interaction.
Speech Lab is an Arabic pronunciation teaching system for teaching some of the Holy
Qur'an recitation rules. The objective is to detect the learner's pronunciation errors
and provide diagnostic feedback. The heart of the system is a phone-level HMM-based
speech recognizer. The learner's pronunciation is compared with the teacher's correct
one by identifying the phone insertions, deletions, and substitutions that result
from the recognition of the learner's speech. In this work we focus on some of the
recitation rules targeting pronunciation problems of Egyptian learners.
ACKNOWLEDGEMENT
First and foremost, we would like to thank Dr. Salah Hamid from The Engineering
Company for the Development of Computer Systems (RDI) for his generous and
enthusiastic guidance. Without his insightful and constructive advice and support,
this project would not have been achieved. We are deeply grateful to him and also to
Waleed Nazeeh and Badr Mahmoud for their helpful support.
In this project we made use of a series of lessons for teaching the Holy Qur'an
recitation rules by Sheikh Ahmed Amer; we are very grateful to him for these
wonderful lessons. Besides using their content, they were of great help in shaping
the methodology we worked within in the project.
We are also grateful to all our friends who helped us by recording the data to build the
speaker-independent database. They were really cooperative and helpful. And special
thanks to the artists Mohammed Abdul-Mon'em, Mahmoud Emam and Mohammed
Nour for their wonderful work that added beauty and elegance to our project.
Special thanks must go also to Dr. Goh Kawai for providing us with his valuable
paper on pronunciation teaching.
Finally, we would like to thank our supervisor Dr. Nevin Darwish, our parents and all
who supported us. Thanks to all, and thanks to God.
LIST OF ABBREVIATIONS
ASR Automatic Speech Recognition
CALL Computer-Assisted Language Learning
CAPT Computer-Assisted Pronunciation Teaching
EM Expectation Maximization
HMM Hidden Markov Model
HTK Hidden Markov Model Toolkit
LPC Linear Predictive Coding
MFCC Mel Frequency Cepstral Coefficients
ML Maximum Likelihood
TABLE OF CONTENTS
ABSTRACT...................................................................................................................ii
ACKNOWLEDGEMENT........................................................................................... iii
LIST OF ABBREVIATIONS.......................................................................................iv
1. INTRODUCTION .....................................................................................................1
1.1 Motivation and Justification ................................................................................1
1.2 Problem definition ...............................................................................................1
1.3 Summary of Approach .........................................................................................1
1.4 Report overview...................................................................................................2
2. LITERATURE REVIEW ..........................................................................................3
2.1. Pronunciation Teaching ......................................................................................3
2.1.1 The Need for Automatic Pronunciation Teaching........................................3
2.1.2 Components of Pronunciation to Address ....................................................3
2.1.3 Previous Work ..............................................................................................4
2.1.4 Computer as a Teacher..................................................................................5
2.1.5 Components of an ASR-based Pronunciation Teaching System..................6
2.2 Speech Recognition .............................................................................................7
2.2.1 Speech Recognition System Characteristics.................................................7
2.2.2 Speech Recognition System Architecture.....................................................8
2.3 Phonetics and Arabic Phonology.......................................................................11
2.4 Speech Signal Processing ..................................................................................14
2.4.1 Feature Extraction.......................................................................................14
2.4.2 Building Effective Vector Representations of Speech................................15
2.5 HMM..................................................................................................................16
2.5.1 Introduction.................................................................................................16
2.5.2 Markov Model.............................................................................................17
2.5.3 Hidden Markov Model................................................................................17
2.5.4 Speech recognition with HMM...................................................................18
2.5.5 Three essential problems.............................................................................19
2.5.6 Two important algorithms...........................................................................19
2.6 HTK ...................................................................................................................19
3. DESIGN AND IMPLEMENTATION ....................................................................21
3.1. Approach...........................................................................................................21
3.2. Design ...............................................................................................................22
3.2.1 System Design ............................................................................................22
3.2.2 Database Design..........................................................................................24
3.2.3 Constraints ..................................................................................................24
3.3. Speech Recognition with HTK .........................................................................24
3.3.1 Data preparation..........................................................................................24
3.3.2 Creating Monophone HMMs......................................................................25
3.3.3 Creating Tied-State Triphones....................................................................25
3.3.4 Increasing the number of mixture components...........................................26
3.3.5 Recognition and evaluation.........................................................................26
3.4. Experiments and results ....................................................................................27
3.4.1 Prototype.....................................................................................................27
3.4.2 Speaker-dependent system..........................................................................27
3.4.3 Speaker-independent system.......................................................................28
3.5 Implementation of Other Modules.....................................................................28
3.5.1 Recognizer interface ...................................................................................28
3.5.2 String Comparator.......................................................................................28
3.5.3 Auxiliary Database......................................................................................29
3.5.4 User Profile Analyzer .................................................................................29
3.5.5 Feedback Generator ....................................................................................30
3.5.6 GUI .............................................................................................................30
4. CONCLUSION AND FUTURE WORK ................................................................32
REFERENCES ............................................................................................................33
A. USER MANUAL....................................................................................................35
B. TRAINING DATABASE .......................................................................................37
ARABIC SUMMARY.................................................................................................38
Chapter 1: Introduction
1. INTRODUCTION
1.1 Motivation and Justification
Teaching the Holy Qur'an recitation rules, like pronunciation teaching in general,
can be repetitive, requiring drills and one-to-one attention that is not always
available, especially in large classes or when no teacher is present; it also becomes
very hard for the teacher to detect the learners' mistakes when their numbers are large.
Many systems have been developed for teaching people the recitation rules of the
Holy Qur'an, but the problem with such systems was that they lacked interaction,
as they were based only on the user repeatedly listening to the correct reading and
attempting to imitate it.

Computer-assisted pronunciation teaching (CAPT) techniques are therefore attractive,
as they allow self-paced practice outside a classroom, constant availability, and real
interaction between the learner and the computer without the many-to-one problem
of the classroom.
1.2 Problem definition
The idea, in abstract terms, as shown in figure [1], is to compare the learner's
speech with the correct one and to provide the learner with feedback indicating the
place of the mispronunciation, if any, and guiding him to the correct pronunciation.
[Figure: block diagram in which "Learner's speech" and "Reference speech" enter the "System", which produces "Feedback"]
Figure [1]: Schematic diagram of a typical CAPT system
The system can mainly be viewed as manipulating the learner's speech with an
automatic speech recognition (ASR) system so as to compare it with the reference
utterance and provide the proper feedback.
1.3 Summary of Approach
Most of the work done in this new application of ASR has targeted teaching
pronunciation to learners of a second language. There was also an attempt to build a
CAPT system as a reading tutor for children.
One approach that has been used to detect nonnative pronunciation characteristics
in foreign language speech views the differences between the native and target
languages as phone insertions, deletions, and substitutions. So, a bilingual HMM-based phone
recognizer was used to identify pronunciation errors at the phone level, where
HMMs are trained on the phones of both languages.
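The phone-string comparison that underlies this approach can be sketched with a standard dynamic-programming edit-distance alignment. The snippet below is an illustrative sketch only, not our actual comparator, and the phone labels in the usage example are hypothetical:

```python
def align(ref, hyp):
    """Dynamic-programming edit-distance alignment of two phone
    sequences; returns the list of edit operations that turn the
    reference pronunciation into the recognized one."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # trace back to recover insertions, deletions and substitutions
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            if ref[i - 1] != hyp[j - 1]:
                ops.append(("sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return list(reversed(ops))
```

For example, aligning the hypothetical reference phones ["b", "a", "t"] with the recognized phones ["b", "i", "t"] yields a single substitution of "a" by "i", which is exactly the kind of evidence the teaching system turns into feedback.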
In this project, we present a CAPT system for teaching some of the recitation rules of
the Holy Qur'an, deploying a similar approach and targeting pronunciation problems
of Egyptian learners.
What makes our system somewhat different from others is that the learner in our case
is often able to pronounce all the sounds of the Arabic language perfectly, but he may
not use the correct sounds (or phonemes) in the correct places when reading the
Holy Qur'an. The learner therefore does not face the problem of foreign language
learners, who encounter sounds in the new language that do not exist in their native
language and are harder to train.
1.4 Report overview
We present in chapter 2 the background against which this project is undertaken and
the main tool used in its development.

As for chapter 3, we explain the approach and design of our system and how we
implemented it.

We discuss in chapter 4 the conclusions to be drawn from this project and the future
work we have in mind.
Chapter 2: Literature Review
2. LITERATURE REVIEW
2.1. Pronunciation Teaching
2.1.1 The Need for Automatic Pronunciation Teaching
During the past two decades, the exercise of spoken language skills has received
increasing attention among educators. Foreign language curricula focus on productive
skills with special emphasis on communicative competence. Students' ability to
engage in meaningful conversational interaction in the target language is considered
an important, if not the most important, goal of second language education.
According to Eskenazi, the use of an automatic recognition system to help a user
improve his accent and pronunciation is appealing for at least two reasons: first, it
affords the user more practice time than a human teacher can provide, and second, the
user is not faced with the sometimes overwhelming problem of human judgment of
his production of “foreign” sounds [1]. To appreciate the value of such a system, it
helps to recognize the specific difficulties encountered in pronunciation teaching:
• Explicit pronunciation teaching requires the sole attention of the teacher to a
single student; this poses a problem in a classroom environment.
• Learning pronunciation can involve a large amount of monotonous repetition,
thus requiring a lot of patience and time from the teacher.
• With pronunciation being a psycho-motoric action, it is not only a mental
task but also demands coordination and control over many muscles. Given the
social implications of the act of speaking, it can also mean that students are
afraid to perform in the presence of others.
• In language tests the oral component is costly, time-consuming, and
subjective; therefore an automatic method of pronunciation assessment is
highly desirable.
Additionally, all arguments for the usefulness of CALL systems apply here as well,
such as being available at all times and being cheaper.
All these reasons indicate that computer-based pronunciation teaching is not only
desirable for self-study products but also for products that would complement the
teaching aids available to a language teacher [2].
2.1.2 Components of Pronunciation to Address
The accuracy of pronunciation is determined by both segmental and supra-segmental
features.
The segmental features are concerned with the distinguishable sound units of speech,
i.e. phonemes. A phoneme is also defined as "the smallest unit which can make a
difference in meaning". The set of phonemes of one language can be classified into
broad phonetic subclasses; for example, the most general classification, as we will
see in section 2.4, would be to separate vowels and consonants. Each language is
characterized by its distinctive set of phonemes. When learning a new language,
foreign students can divide the phonemes of the target language into two groups. The
first group contains those phonemes which are similar to ones in their source
language. The second group contains those phonemes which do not exist in the source
language [6].
Teaching the pronunciation of segmental or phonetic features includes teaching the
correct pronunciation of phonemes and the co-articulation of phonemes into higher
phonological units, i.e., teaching phoneme pronunciation first in isolation and then
in context with other phonemes within words or sentences.
The supra-segmental features of speech are the prosodic aspects, which comprise
intonation, pitch, rhythm, and stress. Teaching the pronunciation of prosodic features
includes teaching the following [3]:
• the correct position of stress at word level;
• the alternation of stressed and unstressed syllables, compensation, and vowel
reduction;
• the correct position of sentence accent;
• the generation of adequate rhythm from stress, accent, and phonological
rules;
• the generation of an adequate intonational pattern for the utterance, related
to its communicative functions.
For beginners, phonetic characteristics are of greater importance because these cause
mispronunciations. With increasing fluency, more emphasis should be placed on
teaching prosody. However, the focus here will be on teaching phonetics, since
teaching prosody usually requires a different teaching approach.
2.1.3 Previous Work
Over the last decade several research groups have started to develop interactive
language teaching systems incorporating pronunciation teaching based on speech
recognition techniques. There was the SPELL project from [Hiller, 1993] which
concentrated on teaching pronunciation of individual words or short phrases plus
additional exercises for intonation, stress and rhythm. However, this system
concentrated on one sound at a time, for instance the pair "thin-tin" is used to train the
'th' sound, but it did not check whether the remaining phonemes in the word were
pronounced correctly.
Another early approach, based on dynamic programming and vector quantization by
[Hamada, 1993], is likewise limited to word-level comparisons between recordings
of native and non-native utterances of a word. Therefore, their system required new
recordings of native speech for each new word used in the teaching system. This
system is called a text-dependent system in contrast to a text-independent one, where
the teaching material can be adjusted without additional recordings.
The systems described by [Bernstein, 1990] and [Neumeyer, 1996] were capable of
scoring complete sentences but not smaller units of speech.
The system used by [Rogers, 1994] was originally designed to improve the speech
intelligibility of hearing-impaired people. It was text-dependent and evaluated
isolated word pronunciations only.
The system described by [Eskenazi 1996] was also text dependent and compared the
log-likelihood scores produced by a speaker independent recognizer of native and
non-native speech for a given sentence [2].
The European-funded project ISLE [1998] is another example that aims to develop a
system that improves the English pronunciation of Italian and German native
speakers.
There is also the LISTEN project, an inter-disciplinary research project at
Carnegie Mellon University that develops a novel tool to improve literacy: an
automated Reading Tutor that displays stories on a computer screen and listens to
children read aloud.
Besides all these systems, work has also been done to build tools that support
research in pronunciation assessment. EduSpeak [2000] by SRI International is an
example. It is a speech recognition toolkit that consists of a speech recognition
module and native and non-native acoustic models for adults and children. It also
has scoring algorithms that make use of spectral matching and the duration of sounds.
2.1.4 Computer as a Teacher
The success of an automatic pronunciation training system depends on how well it
performs the role of a human teacher in a classroom. The following are some issues to
be considered if a CAPT system is to assist or even replace teachers:
1. Evaluation
In pronunciation exercises there is no clearly right or wrong answer. A large
number of different factors contribute to the overall pronunciation quality, and these
are also difficult to measure. Hence, the transition from poor to good pronunciation
is a gradual one, and any assessment must also be presented on a graduated scale
using a scoring technique [2].
2. Integration into a complete educational system
For practical applications, any scoring method will have to be embedded within an
interactive language teaching system containing modules for error analysis,
pronunciation lessons, feedback and assessment. These modules can take results from
the core algorithm to give the student detailed feedback about the type of errors which
occurred, using both visual and audio information. For instance, in those cases where
a phoneme gets rejected because of a too poor score, the results of the phoneme loop
indicate what has actually been recognized. This information can then be used for
error correction [2]. [Hiller 1996] presented a useful paradigm for a CALL
pronunciation teaching system called DELTA, consisting of four stages of
learning:
• Demonstrate the lesson audibly.
• Evaluate the student's listening ability with small tests.
• Teach with pronunciation exercises.
• Assess the progress made per lesson.
3. Adaptive Feedback
A perfect CAPT system does not just tell the user blindly "well done" or "wrong,
repeat again!"; it should be more intelligent, like an actual teacher.

In natural conversations, a listener may interrupt the talker to provide a correction
or simply point out the error. But the talker might not understand this message and
may ask the listener for clarification. So, a correctly formed message usually results
from an ensuing dialogue in which meaning is negotiated.
Ideally teachers point out incorrect pronunciation at the right time, and refrain from
intervening too often in order to avoid discouraging the student from speaking. They
also intervene soon enough to prevent errors from being repeated several times and
from becoming hard-to-break habits [5].
So, a perfect system that acts as a real teacher should consider the following [4, 5]:
• Addressing the error precisely, so the part of the word that was mispronounced
should be precisely located within the word.
• The addressed error should be used to modify the native utterance so that the
mispronounced component is emphasized by being louder, longer and possibly
with higher pitch. The student then says the word again and the system
repeats.
• Correcting only when necessary, reinforcing good pronunciation, and avoiding
negative feedback to increase student's confidence.
• The pace of correction, that is, the maximum number of interruptions per unit
of time that is tolerable, should be adapted to fit each student's personality,
since adaptive feedback is important to obtain better results from correction
and to avoid discouraging the student.
2.1.5 Components of an ASR-based Pronunciation Teaching System
The ideal ASR-based CAPT system can be described as a sequence of five phases, the
first four of which strictly concern ASR components that are not visible to the user,
while the fifth has to do with broader design and graphical user interface issues [7].
1. Speech recognition
The ASR engine translates the incoming speech signal into a sequence of words on
the basis of internal phonetic and syntactic models. This is the first and most
important phase, as the subsequent phases depend on its accuracy. It is worth
mentioning that a speaker-dependent system is more appropriate for teaching
foreign language pronunciation [2]. Details of this phase will be presented later.
2. Scoring
This phase makes it possible to provide a first, global evaluation of pronunciation
quality in the form of a score. The ASR system analyzes the spoken utterance that has
been previously recognized. The analysis can be done on the basis of a comparison
between temporal properties (e.g. rate of speech) and/or acoustic properties of the
student’s utterance on one side, and natives’ reference properties on the other side; the
closer the student’s utterance comes to the native models used as reference, the higher
the score will be.
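A minimal sketch of such a global score follows, assuming the recognizer exposes a per-utterance log-likelihood and that native reference statistics (mean and standard deviation) have been collected beforehand; the 25-points-per-deviation mapping is an arbitrary illustration, not a method described in this report:

```python
def global_score(student_ll, native_mean_ll, native_std_ll):
    """Map a recognizer log-likelihood to a 0-100 pronunciation score
    by comparing it against statistics from native reference speakers.
    (Illustrative mapping only; real systems calibrate this carefully.)"""
    # how far the student falls below the native mean, in std deviations
    z = (native_mean_ll - student_ll) / native_std_ll
    score = 100 - 25 * z  # lose 25 points per standard deviation below
    return max(0, min(100, round(score)))
```

A student whose utterance likelihood matches the native mean receives 100, while one two deviations below receives 50, reflecting the principle that closeness to the native models yields a higher score.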
3. Error detection
In this phase the system locates the errors in the utterance and indicates to the learner
where he made mistakes. This is generally done on the basis of so-called confidence
scores that represent the degree of certainty of the ASR system by matching the
recognized individual phones within an utterance with the stored native models that
are used as a reference.
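This localization step can be sketched as a simple threshold over per-phone confidence scores; both the scores and the 0.5 threshold below are invented for illustration:

```python
def flag_errors(phone_scores, threshold=0.5):
    """Return the positions and labels of phones whose confidence
    falls below the threshold, i.e. likely mispronunciations.
    `phone_scores` is a list of (phone, confidence) pairs."""
    return [(i, phone)
            for i, (phone, conf) in enumerate(phone_scores)
            if conf < threshold]
```

The returned positions tell the learner exactly where in the utterance the mistakes occurred.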
4. Error diagnosis
The ASR system identifies the specific type of error that was made by the student and
suggests how to improve it, because a learner may not be able to identify the exact
nature of his pronunciation problem alone. This can be done by resorting to
previously stored models of typical errors that are made by non-native speakers.
5. Feedback presentation
This phase consists in presenting the information obtained during phases 2, 3, and 4 to
the student. It should be clear that while this phase implies manipulating the various
calculations made by the ASR system, the decisions that have to be taken here – e.g.
presenting the overall score as a graded bar, or as a number on a given scale – have to
do with design, rather than with the technological implementation of the ASR system.
This phase is fundamental because the learner will only be able to benefit from all the
information obtained by means of ASR if this is presented in a meaningful way.
2.2 Speech Recognition

Speech recognition is the process of converting an acoustic signal, captured by a
microphone or a telephone, to a set of words. The recognized words can be the final
results, or they can serve as the input to further linguistic processing.
2.2.1 Speech Recognition System Characteristics
Speech recognition systems can be characterized by many parameters, some of the
more important of which are shown in table [1] below [8].
Table [1]: Typical parameters used to characterize the capability of speech
recognition systems
An isolated-word speech recognition system requires that the speaker pause briefly
between words, whereas a continuous speech recognition system does not.
Spontaneous, or extemporaneously generated, speech contains disfluencies, and it is
much more difficult to recognize than speech read from script. Some systems require
speaker enrollment where a user must provide samples of his or her speech before
using them, whereas other systems are said to be speaker-independent, in that no
enrollment is necessary. Some of the other parameters depend on the specific task.
Perplexity indicates the language’s branching power, with low-perplexity tasks
generally having a lower word error rate. Recognition is generally more difficult
when vocabularies are large or have many similar-sounding words. Finally, there are
some external parameters that can affect speech recognition system performance,
including the characteristics of the background noise (Signal to Noise Ratio) and the
type and placement of the microphone [8].
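The notion of perplexity mentioned above can be made concrete: it is the geometric-mean inverse probability per word that a language model assigns to a test sequence, so a uniform choice among N equally likely words gives perplexity N. A short sketch:

```python
import math

def perplexity(word_probs):
    """Perplexity of a word sequence: the geometric-mean inverse
    probability per word, computed from the per-word probabilities
    a language model assigned to the sequence."""
    n = len(word_probs)
    log2_total = sum(math.log2(p) for p in word_probs)
    return 2.0 ** (-log2_total / n)
```

Lower perplexity means less branching at each word, which, as noted above, generally corresponds to a lower word error rate.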
2.2.2 Speech Recognition System Architecture
The process of speech recognition starts with a sampled speech signal. This signal has
a good deal of redundancy because the physical constraints on the articulators that
produce speech - the glottis, tongue, lips, and so on - prevent them from moving
quickly. Consequently, the ASR system can compress information by extracting a
sequence of acoustic feature vectors from the signal. Typically, the system extracts a
single multidimensional feature vector every 10 ms that consists of 39 parameters.
Researchers refer to these feature vectors, which contain information about the local
frequency content in the speech signal, as acoustic observations because they
represent the quantities the ASR system actually observes. The system seeks to infer
the spoken word sequence that could have produced the observed acoustic sequence.
[9]
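This front-end framing can be illustrated as follows; the 25 ms window length is a conventional choice assumed here (the text above only fixes the 10 ms hop between vectors):

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, win_ms=25, hop_ms=10):
    """Slice a waveform into overlapping analysis frames.
    The 25 ms window is a typical choice (an assumption, not stated
    in the text); the 10 ms hop matches the one-feature-vector-
    every-10-ms figure above."""
    win = int(sample_rate * win_ms / 1000)   # samples per window
    hop = int(sample_rate * hop_ms / 1000)   # samples per hop
    n_frames = 1 + max(0, (len(samples) - win) // hop)
    return np.stack([samples[i * hop : i * hop + win]
                     for i in range(n_frames)])
```

Each of these frames would then be reduced to one multidimensional feature vector by the signal analysis described in section 2.4.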
It is assumed that the ASR system knows the speaker's vocabulary in advance. This
restricts the search for possible word sequences to the words listed in the lexicon,
which lists the vocabulary and provides phonemes for the pronunciation of each word.
Language constraints are also used to dictate whether word sequences are equally
likely to occur [9]. Training data are used to determine the values of the language
and phone model parameters.
The dominant recognition paradigm is known as hidden Markov models (HMM). An
HMM is a doubly stochastic model, in which the generation of the underlying
phoneme string and the frame-by-frame, surface acoustic realizations are both
represented probabilistically as Markov processes. Neural networks have also been
used to estimate the frame-based scores; these scores are then integrated into
HMM-based system architectures, in what has come to be known as hybrid systems or
hybrid HMMs [8].
An interesting feature of frame-based HMM systems is that speech segments are
identified implicitly during the search process, rather than explicitly. An
alternative approach is to first identify speech segments, then classify the segments
and use the segment scores to recognize words. This approach has produced
competitive recognition performance in several tasks [8]. Our system will be an
HMM-based one.
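To make this search concrete, here is a toy Viterbi decoder over a discrete-emission HMM; the two states and observation symbols in the usage below are invented for illustration and bear no relation to our actual models:

```python
def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence for an observation sequence under a
    discrete-emission HMM (toy illustration, not a full recognizer).
    All probabilities are supplied as natural logarithms."""
    # delta[s] = best log-probability of any path ending in state s
    delta = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            delta[s] = prev[best] + log_trans[best][s] + log_emit[s][o]
            ptr[s] = best
        back.append(ptr)
    # trace the best final state back through the stored pointers
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

In a real recognizer the search runs over HMM states of phone models chained according to the lexicon and grammar, so recovering the best state path simultaneously segments the signal, which is exactly the implicit segmentation noted above.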
The speech recognition process as a whole can be seen as a system of five basic
components as in figure [2] below: (1) an acoustic signal analyzer which computes a
spectral representation of the incoming speech; (2) a set of phone models (HMMs)
trained on large amounts of actual speech data; (3) a lexicon for converting sub-word
phone sequences into words; (4) a statistical language model or grammar network that
defines the recognition task in terms of legitimate word combinations at the sentence
level; (5) a decoder, which is a search algorithm for computing the best match
between a spoken utterance and its corresponding word string [10].
Figure [2]: Components of a typical speech recognition system.
1. Signal Analysis
The first step, which will be presented in detail later, consists of analyzing the
incoming speech signal. When a person speaks into an ASR device (usually through
a high-quality noise-canceling microphone), the computer samples the analog input
into a series of 16- or 8-bit values at a particular sampling frequency (usually 16 kHz).
These values are grouped together in predetermined overlapping temporal intervals
called "frames". These numbers provide a precise description of the speech signal's
amplitude. In a second step, a number of acoustically relevant parameters, such as
energy, spectral features, and pitch information, are extracted from the speech signal.
During training, this information is used to model that particular portion of the speech
signal. During recognition, this information is matched against the pre-existing model
of the signal [10].
2. Phone Models
The second module is responsible for training a machine to recognize spoken
language by modeling the basic sounds of speech (phones). An HMM can model
either phones or other sub-word units, or it can model words or even whole
sentences. Phones are either modeled as individual sounds, so-called monophones, or
as phone combinations that model several phones and the transitions between them
(biphones or triphones). After comparing the incoming acoustic signal with the
HMMs representing the sounds of language, the system computes a hypothesis based
on the sequence of models that most closely resembles the incoming signal. The
HMM model for each linguistic unit (phone or word) contains a probabilistic
representation of all the possible pronunciations for that unit.
Building HMMs in the training process requires a large amount of speech data of the
type the system is expected to recognize [10].
3. Lexicon
The lexicon, or dictionary, contains the phonetic spelling for all the words that are
expected to be observed by the recognizer. It serves as a reference for converting the
phone sequence determined by the search algorithm into a word. It must be carefully
designed to cover the entire lexical domain in which the system is expected to
perform. If the recognizer encounters a word it does not "know" (i.e., a word not
defined in the lexicon), it will either choose the closest match or return an out-of-
vocabulary recognition error. Whether a recognition error is registered as
misrecognition or an out-of-vocabulary error depends in part on the vocabulary size.
If, for example, the vocabulary is too small for an unrestricted dictation task (say,
less than 3K), the out-of-vocabulary errors are likely to be very high. If the vocabulary
is too large, the chance of misrecognition errors increases because with more similar-
sounding words, the confusability increases. The vocabulary size in most commercial
dictation systems tends to vary between 5K and 60K [10].
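A lexicon lookup with out-of-vocabulary handling, as described above, can be sketched like this. The words and their phonetic spellings are purely hypothetical examples, not entries from any real system dictionary.

```python
# Sketch: a lexicon mapping words to phone sequences, with an
# out-of-vocabulary (OOV) error for unknown words. All entries are
# hypothetical illustrations, not taken from an actual dictionary.

lexicon = {
    "bear": ["b", "eh", "r"],
    "attacked": ["ah", "t", "ae", "k", "t"],
    "him": ["hh", "ih", "m"],
}

def phones_for(word):
    """Return the phonetic spelling of a word, or raise an OOV error."""
    try:
        return lexicon[word]
    except KeyError:
        raise ValueError(f"out-of-vocabulary word: {word!r}")

print(phones_for("bear"))  # ['b', 'eh', 'r']
```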
4. The Language Model
The language model predicts the most likely continuation of an utterance on the basis
of statistical information about the frequency in which word sequences occur on
average in the language to be recognized. For example, the word sequence "A bare
attacked him" will have a very low probability in any language model based on
standard English usage, whereas the sequence "A bear attacked him" will have a
higher probability of occurring. Thus the language model helps constrain the
recognition hypothesis produced on the basis of the acoustic decoding just as the
context helps decipher an unintelligible word in a handwritten note. Like the HMMs,
an efficient language model must be trained on large amounts of data, in this case
texts collected from the target domain.
In ASR applications with constrained lexical domain and/or simple task definition, the
language model consists of a grammatical network that defines the possible word
sequences to be accepted by the system without providing any statistical information.
This type of design is suitable for pronunciation teaching applications in which the
possible word combinations and phrases are known in advance and can be easily
anticipated (e.g., based on user data collected with a system pre-prototype). Because
of the a priori constraining function of a grammar network, applications with clearly
defined task grammars tend to perform at much higher accuracy rates than the quality
of the acoustic recognition would suggest [10].
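The word-sequence statistics described above can be illustrated with a toy bigram model. All counts here are invented solely to show why "a bear attacked him" outscores "a bare attacked him"; a real language model is trained on large text corpora.

```python
# Toy bigram language model with invented counts, illustrating how
# word-sequence statistics favour "bear attacked" over "bare attacked".

from collections import defaultdict

bigram_counts = {
    ("a", "bear"): 50, ("a", "bare"): 5,
    ("bear", "attacked"): 20, ("bare", "attacked"): 0,
    ("attacked", "him"): 30,
}
unigram_counts = defaultdict(int, {"a": 1000, "bear": 60, "bare": 40,
                                   "attacked": 35})

def bigram_prob(prev, word, vocab_size=10000):
    # Add-one (Laplace) smoothing gives unseen bigrams a small probability.
    return (bigram_counts.get((prev, word), 0) + 1) / \
           (unigram_counts[prev] + vocab_size)

def sentence_score(words):
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= bigram_prob(prev, word)
    return score

s_bear = sentence_score(["a", "bear", "attacked", "him"])
s_bare = sentence_score(["a", "bare", "attacked", "him"])
print(s_bear > s_bare)  # True
```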
5. Decoder
The decoder is an algorithm that tries to find the utterance that maximizes the
probability that a given sequence of speech sounds corresponds to that utterance. This
is a search problem, and especially in large vocabulary systems careful consideration
must be given to questions of efficiency and optimization, for example to whether the
decoder should pursue only the most likely hypothesis or a number of them in parallel
(Young, 1996). An exhaustive search of all possible completions of an utterance
might ultimately be more accurate but of questionable value if one has to wait two
days to get a result. Therefore, trade-offs are made to maximize the quality of the
search results while at the same time minimizing CPU usage and recognition time.
2.3 Phonetics and Arabic Phonology
Phonetics studies all the sounds of speech, trying to describe how they are made, to
classify them, and to give some idea of their nature. Phonetic investigation shows that
human beings are capable of producing an enormous number of speech sounds,
because the range of articulatory possibilities is vast, although each language uses
only some of the sounds that are available [11]. Even more importantly, each language
organizes and makes use of the sounds in its own particular way.
The study of the selection that each language makes from the vast range of possible
speech sounds and of how each language organizes and uses the selection it makes is
called Phonology. In other words, Phonetics describes and classifies the speech
sounds and their nature while Phonology studies how they work together and how
they are used in a certain language where differences among sounds serve to indicate
distinctions of meaning [11].
Obviously, not all the differences between speech sounds are significant; moreover,
the difference between two speech sounds can be significant in one language but
not in another. A list of sounds whose differences from one another are
significant can be built up by comparing words of the same language. These
significant or distinctive sounds are the elements of the sound system and are known
as phonemes, whereas the different sounds that do not make any difference are known
as allophones [11].
In Arabic, there are 37 distinct phonemes [12], but when it comes to the Holy Qur'an,
the nature of its rules requires defining a new set of phonemes, since distinguishing
between the correct and wrong ways of reading under some rules cannot be done using
only the standard set of Arabic phonemes. Below is the set of phonemes we defined
for some (not all) of the Qur'an phonemes that we needed for the rules we teach in
our system.
Index Notation
1 /a:/
2 /b/
3 /t/
4 /th/
5
6 /dj/
7 /g/
8 /j/
9 /h:/
10 /kh/
11 /d/
12 /dh/
13
14 /r/
15 /z/
16 /s/
17 /sh/
18 /s:/
19 /d:/
20 /t:/
21 /zh:/
22 /z:/
23 /e/
24 /gh/
25 /f/
26 /q/
27 /k/
28 /l/
29 /m/
30 /n/
31 /h/
32 /w/
33 /y/
34 /a_l/
35 /a_h/
36 /i/
37 /u/
38 /aa_h/
39 /aa_l/
40 /uu/
41 /ii/
Table [2] Phoneme set
Turning to phonetics, there are many classifications of Arabic speech sounds; the
following is a list of these classifications, organized along three different bases
[12, 13, 14].
In the human speech production process, the most basic way to classify speech sounds
is to separate them into two groups of vowels and consonants according to whether or
not they involve significant constriction of the vocal tract.
• Vowels: -----
• Consonants: the rest of Arabic letters.
The second basis of classification is voicing. According to their voicing properties,
Arabic phonemes can be classified into:
• Glottal stop ( ):
• Unvoiced: -----------
• Voiced: the rest of Arabic letters.
The third classification is according to place of articulation:
• From Larynx ( ): -
• From Throat ( ): –
• Velar ( - ):
• From soft-palate ( ): ---
• From hard-palate ( ): – -
• From Gum ( ): --
• Alveolar ( - ): ------
• Dental ( ): --
• Labiodental ( ) :
• Bilabial ( ): -
There is also another secondary classification according to the properties of some
particular phonemes:
• Emphasis ( ): ------
Here, high-emphasis phonemes are: --- and semi-emphasized phonemes are: --
• Sibilant( ): ---
• Extent ( ):
• Spread ( ):
• Deviation ( ): -
• Unrest ( ): ----
• Snuffle( ): -
For the first classification into consonants and vowels, phonemes can be further
divided into the following subclasses:
Consonants are classified by manner of articulation into:
• Plosives or stops ( ): ------ --
• Fricatives ( ): ---------- ----
• Laterals ( ):
• Trills ( ):
• Affricates ( ):
• Nasals ( ): –
• Glides ( ): –
• Liquids ( ): ---
Vowels on the other hand have different classifications; the first is according to the
tongue hump position:
• Back: -
• Mid: -
• Front: -
Another classification of vowels:
• Long vowels: - -
• Short vowels: --
A third classification of vowels:
• IVowels: -
• UVowels: -
• AVowels: -
The last classification for vowels is according to lip rounding:
• With rounding lips: -
• Without rounding: ---
These classifications will be useful later for grouping phonemes with similar
properties when building the recognizer.
2.4 Speech Signal Processing
2.4.1 Feature Extraction
Once a signal has been sampled, we have huge amounts of data, often 16,000 16-bit
numbers per second! We need to find ways to concisely capture the properties of the
signal that are important for speech recognition before we can do much else. Probably
the most important parametric representation of speech is the spectral representation
of the signal, as seen in a spectrogram1 which contains much of the information we
need. We can obtain the spectral information from a segment of the speech signal
using an algorithm called the Fast Fourier Transform. But even a spectrogram is far too complex a representation to base a speech recognizer on. This section describes
some methods for characterizing the spectra in more concise terms [15].
Filter Banks: One way to more concisely characterize the signal is by a filter bank.
We divide the frequency range of interest (say 100-8000Hz) into N bands and
measure the overall intensity in each band. This can be computed using spectral
analysis software (such as the Fast Fourier Transform). In a uniform filter bank, each
frequency band is of equal size. For instance, if we used 8 ranges, the bands might
cover the frequency ranges: 100Hz-1000Hz, 1000Hz-2000Hz, 2000Hz-3000Hz, ...,
7000Hz-8000Hz.
But is it a good representation? We'd need to compare the representations of different
vowels, for example, and see whether the vector reflects differences in these vowels or
not. If we do this, we’ll see there are some problems with a uniform filter bank. So, a
better alternative is to organize the ranges using a logarithmic scale. Another
alternative is to design a non-uniform set of frequency bands that has no simple
mathematical characterization but better reflects the responses of the ear as
determined from experimentation. One very common design is based on perceptual
studies to define critical bands in the spectra. A commonly used critical band scale is
called the Mel scale which is essentially linear up to 1000 Hz and logarithmic after
that. For instance, we might start the ranges at 200 Hz, 400 Hz, 630 Hz, 920 Hz, 1270
Hz, 1720 Hz, 2320 Hz, and 3200 Hz.
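The Mel scale mentioned above is commonly given an analytic form; this particular formula is a widespread convention rather than something stated in the text:

```python
import math

# Standard analytic form of the Mel scale: approximately linear below
# 1 kHz and logarithmic above it. This specific formula is a common
# convention, not one given in the report.

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Band edges equally spaced on the Mel axis bunch together at low
# frequencies and spread out at high ones, as in the ranges above:
edges_hz = [round(mel_to_hz(m)) for m in range(0, 2801, 400)]
print(edges_hz)
```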
LPC: A different method of encoding a speech signal is called Linear Predictive
Coding (LPC). The basic idea of LPC is to represent the value of the signal over some
window at time t, s(t), in terms of an equation of the past n samples, i.e.,
s(t) ≈ a_1·s(t−1) + a_2·s(t−2) + ... + a_n·s(t−n)
1 A spectrogram is an image that represents the time-varying spectrum of a signal. The x-axis
represents time, the y-axis frequency, and the pixel intensity represents the amount of energy in
frequency band y at time x.
Of course, we usually can't find a set of a_i's that gives an exact answer for every
sample in the window, so we must settle for the best approximation, the one that
minimizes the error.
MFCC: Another technique that has proven to be effective in practice is to compute a
different set of vectors based on what are called the Mel Frequency Cepstral
Coefficients (MFCC). These coefficients provide a different characterization of the
spectra than filter banks and work better in practice. To compute these coefficients,
we start with a filter bank representation of the spectra. Since we are using the banks
as an intermediate representation, we can use a larger number of banks to get a better
representation of the spectra. For instance, we might use a Mel scale over 14 banks
(ranges starting at 200, 260, 353, 493, 698, 1380, 1880, 2487, 3192, 3976, 4823,
5717, 6644, and 7595). The MFCCs are then computed using the following formula:
c_i = Σ_{j=1..M} f_j · cos(i·π·(j − 0.5)/M),  for i = 0, 1, ..., N−1
where f_j is the output of the j-th of the M filter banks and N is the desired number
of coefficients. What this is doing is computing a weighted sum over the filter banks
based on a cosine curve. The first coefficient, c0, is
simply the sum of all the filter banks, since i = 0 makes the argument to the cosine
function 0 throughout, and cos(0)=1. In essence it is an estimate of the overall
intensity of the spectrum weighting all frequencies equally. The coefficient c1 uses a
weighting that is one half of a cosine cycle, so computes a value that compares the
low frequencies to the high frequencies. The function for c2 is one cycle of the cosine
function, while for c3 it is one and a half cycles, and so on.
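The cosine-weighted sum over the filter banks can be sketched directly. The 14 filter-bank energies below are invented example values; the weighting follows the description above (c0 equals the sum of the banks, c1 uses half a cosine cycle, and so on):

```python
import math

# Sketch of the cosine-weighted sum over filter-bank outputs described
# above. The 14 filter-bank energies are invented example values.

def mfcc(fbank, num_coeffs):
    """Compute cepstral coefficients c_0 .. c_{num_coeffs-1} from
    filter-bank outputs using a cosine (DCT-like) weighting."""
    m = len(fbank)
    coeffs = []
    for i in range(num_coeffs):
        c = sum(f * math.cos(i * math.pi * (j + 0.5) / m)
                for j, f in enumerate(fbank))
        coeffs.append(c)
    return coeffs

fbank = [3.1, 2.9, 2.5, 2.0, 1.6, 1.3, 1.1,
         1.0, 0.9, 0.8, 0.7, 0.7, 0.6, 0.6]
c = mfcc(fbank, 4)
# c[0] is simply the sum of all banks, since cos(0) = 1 everywhere:
print(abs(c[0] - sum(fbank)) < 1e-9)  # True
```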
2.4.2 Building Effective Vector Representations of Speech
Whether we use the filter bank approach, the LPC approach or any other approach, we
end up with a small set of numbers that characterize the signal. For instance, if we
used the Mel scale to divide the spectrum into eight frequency ranges, we have reduced
the representation of the signal over the 20 ms segment to a vector consisting of eight
numbers. With a 10 ms shift in each segment, we are representing the signal by one of
these vectors every 10 ms. This is certainly a dramatic reduction in the space needed
to represent the signal. Rather than 16,000 numbers per second, we now represent the
signal by 800 numbers per second!
Just using the eight spectral measures, however, is not sufficient for large-vocabulary
speech recognition tasks. Additional measurements are often taken that capture
aspects of the signal not adequately represented in the spectrum. Here are a few
additional measurements that are often used:
Power: It is a measure of the overall intensity. If the segment S_k contains N samples
of the signal, s(0), ..., s(N−1), then the power power(S_k) is computed as follows:
Power(S_k) = Σ_{i=0..N−1} s(i)²
An alternative that doesn't create such a wide difference between loud and soft sounds
uses the absolute value:
Power(S_k) = Σ_{i=0..N−1} |s(i)|
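Both power measures can be transcribed almost literally:

```python
# The two power measures described above, over a segment of N samples.

def power_squared(segment):
    """Sum of squared sample values over the segment."""
    return sum(s * s for s in segment)

def power_abs(segment):
    """Sum of absolute sample values, which compresses the gap
    between loud and soft sounds."""
    return sum(abs(s) for s in segment)

seg = [0.5, -0.5, 0.25, -0.25]
print(power_squared(seg))  # 0.625
print(power_abs(seg))      # 1.5
```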
One problem with direct power measurements is that the representation is very
sensitive to how loud the speaker is speaking. To adjust for this, the power can be
normalized by an estimate of the maximum power. For instance, if P is the maximum
power within the last 2 seconds, the normalized power of the new segment would be
power(Sk )/P. The power is an excellent indicator of the voiced/unvoiced distinction,
and if the signal is especially noise-free, it can be used to separate silence from
low-intensity speech such as unvoiced fricatives. We do not need it alongside MFCCs,
however, since the power is already well estimated by the c0 coefficient.
Power Difference: The spectral representation captures the static aspects of a signal
over the segment, but we have seen that there is much information in the transitions in
speech. One way to capture some of this is to add a measure to each segment that
reflects the change in power surrounding it. For instance, we could set:
PowerDiff(S_k) = power(S_{k+1}) − power(S_{k−1}).
Such a measure would be very useful for detecting stops.
Spectral Shifts: Besides shifts in overall intensity, we saw that frequency shifts in the
formants can be quite distinctive, especially in looking at the effects of consonants
next to vowels. We can capture some of this information by looking at the difference
in the spectral measures in each frequency band. For instance, if we have eight
frequency intensity measures for segment S_k, f_k(1), ..., f_k(8), then we can define the
spectral change for each segment as with the power difference, i.e.,
df_k(i) = f_{k+1}(i) − f_{k−1}(i)
With all these measurements, we would end up with an 18-number vector: the eight
spectral band measures, the eight spectral band differences, the overall power, and the
power difference. This is a reasonable approximation of the types of representations
used in current state-of-the-art speech recognition systems. Some systems add another
set of values that represent the "acceleration", computed by taking the differences
between the df_k values.
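The difference ("delta") measures described above, power difference and spectral shift alike, amount to subtracting the previous frame's value from the next frame's value in each feature dimension. A minimal sketch:

```python
# Sketch of the difference ("delta") measures described above: for each
# interior frame k, subtract the previous frame's feature values from
# the next frame's, per dimension. The frame values are illustrative.

def deltas(frames):
    """frames: list of equal-length feature vectors (lists of floats).
    Returns one difference vector per interior frame."""
    out = []
    for k in range(1, len(frames) - 1):
        out.append([nxt - prv
                    for prv, nxt in zip(frames[k - 1], frames[k + 1])])
    return out

frames = [[1.0, 2.0], [1.5, 2.5], [3.0, 1.0], [3.5, 0.5]]
print(deltas(frames))  # [[2.0, -1.0], [2.0, -2.0]]
```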
2.5 HMM
2.5.1 Introduction
A hidden Markov model (HMM) is a stochastic generative process that is particularly
well suited to modeling time-varying patterns such as speech. HMMs represent
speech as a sequence of observation vectors derived from a probabilistic function of a
first-order Markov chain. Model ‘states’ are identified with an output probability
distribution that describes pronunciation variations, and states are connected by
probabilistic ‘transitions’ that capture durational structure. An HMM can thus be
used as a ‘maximum likelihood classifier’ to compute the probability of a sequence of
words given a sequence of acoustic observations using Viterbi search. The basics of
HMM will be discussed in the following sub-subsections. More information can be
found in [14, 16 and 17].
2.5.2 Markov Model
In order to understand the HMM, we must first look at a Markov model and a
stochastic process in general. A stochastic process specifies certain probabilities of
some events and the relations between the probabilities of the events in the same
process at different times. A process is called Markovian if the probability at one time
is only conditioned on a finite history. Therefore, a Markov model is defined as a
finite state machine which changes state once every time unit. State is a concept used
to help understand the time evolution of a Markov process. Being in a certain state at
a certain time is then the basic event in a Markov process. A whole Markov process
thus produces a sequence of states S= s1, s2 … sT.
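The probability of a whole state sequence in such a Markov model is simply the product of one-step transition probabilities. A small illustration with invented states and numbers:

```python
# A small Markov chain: the probability of a whole state sequence is
# the product of one-step transition probabilities. The states and
# probability values here are invented for illustration.

transitions = {
    ("s1", "s1"): 0.6, ("s1", "s2"): 0.4,
    ("s2", "s2"): 0.7, ("s2", "s3"): 0.3,
    ("s3", "s3"): 1.0,
}

def sequence_prob(states):
    """P(s1, s2, ..., sT) as a product of transition probabilities."""
    p = 1.0
    for a, b in zip(states, states[1:]):
        p *= transitions.get((a, b), 0.0)
    return p

print(sequence_prob(["s1", "s1", "s2", "s3"]))  # 0.6 * 0.4 * 0.3 ≈ 0.072
```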
2.5.3 Hidden Markov Model
The HMM is an extension of a Markov process. A hidden Markov model can be
viewed as a Markov chain where each state generates a set of observations. You only
see the observations, and the goal is to infer the hidden state sequence. For example,
the hidden states may represent words or phonemes, and the observations represent
the acoustic signal. Figure [3] shows an example of such a process, where the six-state
model moves through the state sequence S = 1, 2, 2, 3, 4, 4, 5, 6 in order to generate
the sequence o1 to o6.
Figure [3] The Markov Generation Model
Each time t that a state j is entered, a speech vector ot is generated from the
probability density b j(ot ). Furthermore, the transition from state i to state j is also
probabilistic and is governed by the discrete probability aij .
Thus, we can see that the stochastic process of an HMM is characterized by two sets
of probabilities.
The first set is the transition probabilities, defined as:
a_ij = P(s_t = j | s_{t−1} = i)
This can also be written in matrix form A ={ aij }. For the Markov process itself,
when the previous state is known, there is a certain probability to transit to each of the
other states.
The second is the observation probability where the speech signal is converted into a
time sequence of observation vectors ot defined in an acoustic space. The sequence of
vectors is called an observation sequence O = o1, o2, ..., oT, with each o_t a static
representation of the speech at time t. The observation probability is defined as:
b_j(o_t) = P(o_t | s_t = j)
with its matrix form B = {b_j}.
The composition of the parameters M = (A, B) defines an HMM. (In the HMM
literature there is another set of parameters, the probabilities that the HMM starts in
each state at the initial time, Π = {π_j}.) The model then becomes λ = (A, B, Π),
depending on three parameter sets. However, for cases like ours where the HMM
always starts at the first state (π_1 = 1), this Π can be included in A.
2.5.4 Speech recognition with HMM
The basic way of using HMM in speech recognition is to model different well defined
phonetic units wl (e.g., words or sub-word units or phonemes) in an inventory { wl }
for the recognition task, with a set of HMMs (each with parameter λl ). To recognize a
word w_k from an unknown O is basically to find:
w_k = argmax_{l} P(w_l | O)
The probability P is usually calculated indirectly using Bayes' rule:
P(w_l | O) = P(O | w_l)·P(w_l) / P(O)
Here P(O) is constant for a given O over all possible w_l. The a priori probability
P(w_l) only concerns the language model of the given task, which we assume here to be
constant too. Then the problem of recognition is converted to the calculation of
P(O | w_l). But we use λ_l to model w_l; therefore we actually need to calculate
P(O | λ_l).
We can see that the joint probability of O and S being generated by the model λ can
be calculated as follows:
P(O, S | λ) = P(S | λ) · P(O | S, λ)
where the transitions occurring at different times and in different states are
independent, and therefore:
P(S | λ) = a_{s0 s1} · a_{s1 s2} · ... · a_{s_{T−1} s_T}
And for a given state sequence S, the observation probability is:
P(O | S, λ) = b_{s1}(o1) · b_{s2}(o2) · ... · b_{sT}(oT)
However, in reality the state sequence S is unknown, so one has to sum the
probability P(O, S | λ) over all S in order to get P(O | λ) = Σ_S P(O, S | λ).
2.5.5 Three essential problems
In order to use HMM in ASR, a number of practical problems have to be solved.
1. The evaluation problem: One has to evaluate the value P(O | λ) given only O
and λ, but not S. Without an efficient algorithm, one has to sum over n^T possible
state sequences S, with a total of roughly 2T·n^T calculations, which is impractical.
2. The estimation problem: The values of all λ_l in a system have to be determined
from a set of sample data. This is called training . The problem is how to get an
optimal set of λl that leads to the best recognition result, given a training set.
3. The decoding problem: Given a set of well-trained λ_l and an O with an unknown
identity, one has to find P(O | λ_l) for all λ_l. In the recognition process, for each
single λ_l one hopes, instead of summing over all S, to find a single sequence S_M
that is most likely associated with O. S_M also provides information about the
boundaries between the concatenated phonetic or linguistic units that are most likely
associated with O. The term decoding refers to finding the way that O is coded onto
S. Both the training and recognition processes of a recognition system involve
problem 1.
2.5.6 Two important algorithms
The two important algorithms that solve the essential problems are both named after
their inventors: the Baum-Welch algorithm (Baum et al., 1970) for parameter
estimation in training, and the Viterbi algorithm for decoding in recognition (in some
recognizers the Viterbi algorithm is also used for training).
The essential part of the Baum-Welch algorithm is a so-called expectation-
maximization (EM) procedure, used to overcome the difficulty of incomplete
information about the training data (the unknown state sequence). In the most
commonly used implementation of the EM procedure for speech recognition, a
maximum-likelihood (ML) criterion is used. The solutions of the ML equations give
closed-form formulae for updating the HMM parameters given their old values. In
order to obtain good parameters, a good initial set of parameters is essential, since the
Baum-Welch algorithm only finds a local optimum. However, for speech recognition,
such a solution often leads to sufficiently good performance.
The basic shortcoming of the ML training is that maximizing the likelihood that the
model parameters generate the training observations is not directly related to the
actual goal of reducing the recognition error, which is to maximize the discrimination
between the classes of patterns in speech.
The Viterbi algorithm essentially avoids searching through an unmanageably large
space of HMM states to find the most likely state sequence S_M by using step-wise
optimal transitions. In most cases the state sequence S_M yields satisfactory results
for recognition, but in other cases S_M does not give rise to the state sequence
corresponding to the correct words.
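The step-wise optimal search just described can be sketched as follows for a toy discrete-observation HMM; the model parameters are invented for illustration:

```python
# Sketch of the Viterbi algorithm: instead of summing over all state
# sequences, keep only the best-scoring path into each state at each
# step, then backtrack. Toy HMM; all parameter values are invented.

A  = [[0.7, 0.3],            # transition probabilities
      [0.4, 0.6]]
B  = [{"x": 0.9, "y": 0.1},  # observation probabilities per state
      {"x": 0.2, "y": 0.8}]
pi = [0.5, 0.5]              # initial state probabilities

def viterbi(obs):
    """Return the most likely hidden state sequence for obs."""
    n = len(pi)
    score = [pi[j] * B[j][obs[0]] for j in range(n)]
    backpointers = []
    for o in obs[1:]:
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] * A[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] * A[best_i][j] * B[j][o])
        score = new_score
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))

print(viterbi(["x", "x", "y", "y"]))  # [0, 0, 1, 1]
```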
2.6 HTK
One of the most widely used tools for speech recognition research is the HMM
Toolkit, abbreviated as HTK. It is a well-known, free toolkit for research into
automatic speech recognition and other pattern recognition problems such as
handwriting recognition and face recognition. It has been developed by the Speech,
Vision and Robotics Group at the Cambridge University Engineering Department and
Entropic Ltd [18].
The toolkit consists of a set of modules for building hidden Markov models (HMMs)
which can be called from both the command line and script files. The following are
their main functions:
1. Receiving audio input from the user.
2. Coding the audio files.
3. Building the grammar and dictionary for the application.
4. Attaching the recorded utterances to their corresponding transcriptions.
5. Building the HMMs.
6. Adjusting the parameters of the HMMs using the training sets.
7. Recognizing the user's speech using the Viterbi algorithm.
8. Comparing the testing speech patterns with the reference speech patterns.
In actual processing, HTK first parameterizes the speech data into features of various
forms, such as Linear Predictive Coding (LPC) coefficients and Mel-cepstrum
coefficients. Then it estimates the HMM parameters using the Baum-Welch algorithm
for training. Recognition tests are executed by estimating the best hypothesis from the
given feature vectors and a language model, using the Viterbi algorithm, which finds
the maximum-likelihood state sequence. Results are given as a recognition percentage
as well as the numbers of deletion, substitution and insertion errors.
Chapter 3: Design and Implementation
3. DESIGN AND IMPLEMENTATION
3.1. Approach
The approach we adopted in our system considers the systemic and structural
differences between the learner's utterance and the correct utterance as phone
insertions, deletions and substitutions [19].
This requires a phone recognizer trained on the correct phones and on the wrong ones
that may be inserted or substituted by the learner. Knowledge of phonetics, phonology
and pedagogy is needed to know the different possible mispronunciations of each
phone.
An example of a phone substitution problem in the word " " is shown in figure [4],
where learners usually encounter the problem of the emphatic pronunciation of the
first letter ( ) of this word, which appears in the vowel after it. So, the
correct phone /a_l/ may be replaced with /a_h/ (see the phonology table in section
2.3).
Figure [4] Phone substitution in the word " ": start → /n/ → {/a_l/ | /a_h/} → /r/ → /s:/ → end
Our handling of this rule ( ) considers that both cases of pronouncing
the letter ( ) are represented by the same phone, as the difference
usually appears in the vowel rather than the consonant, although there is a slight
difference in their acoustic properties, except in a few cases such as the letter " ",
which becomes " " when pronounced with emphasis ( ).
Building a suitable database covering all possible right and wrong phones is easy, as
most of the phones in the Holy Qur'an are not new to ordinary Arabic speakers.
With this approach we can detect pronunciation errors for various rules other than
this rule ( ), such as the problems of pronouncing particular letters like "" and " ",
and the rule of )( . But other rules, such as ( ), require a different kind of
handling which we do not deal with in our system.
There is another approach that depends on assessing pronunciation quality, and it may
tolerate more recognition noise [Witt and Young, 2000; Neumeyer et al., 2000;
Franco et al., 2000]. The judgment in this approach is usually required to correlate
well with human judges, which makes it less objective and harder to implement than
our approach, which requires accurate and precise phoneme recognition.
3.2. Design
3.2.1 System Design
Since our system is considered a model from which a bigger and more inclusive
system dealing with the different Qur'an recitation rules can be built, the design of
our system had to be scalable and modular.
Based on the approach mentioned in the previous section, we decided to build our
model for teaching the rule of ( ) for 8 letters. We selected them from
the letters that learners can mispronounce in ( ), so that we can take this sura
as a test for the learner to measure his performance after learning.
A complete scenario explaining how the system works would be the best way to
present the system’s design.
Figure [5] System Design
The first screen that appears to the user is the login screen, used to identify his
profile and determine which lessons he has learned and which he has not. After that,
he takes a session for the new lesson, listening to an explanation of the rule to be
learned, and then he starts training on some words from that lesson, according to the
following scenario, as shown in figure [5].
1- After being asked to repeat a word he has just listened to for training, the user's
utterance is captured via the microphone by the GUI.
2- The utterance is saved in a .WAV file.
3- The file is passed by the GUI to the recognizer.
4- The recognizer performs the decoding process and passes the recognized word,
be it correct or wrong, as it is to the string comparator.
5- The string comparator compares the recognized word with the reference word.
6- The output of the comparison (the pronunciation difference) is then passed to the
User Profile Analyzer.
7- The User Profile Analyzer checks the user profile and determines which
mistakes the user should receive feedback about depending on the lessons he
already passed.
8- The mistakes are then passed to the feedback generator.
9- The feedback generator generates the feedback and passes it to the GUI.
10- The GUI displays the feedback to user.
As the figure shows, the system consists of six main modules other than the GUI; the
following is a brief description of each:
1- Recognizer
After the user's utterance is perceived through the microphone and saved in a .WAV
file, it is passed to an HMM-based phone-level recognizer along with a phone-level
grammar file containing the phones of both the reference word and the expected
mistaken word. The recognizer determines which of the two the utterance is closer to
and outputs a text file containing the phones of the recognized word.
2- Recognizer Interface
The recognizer runs in a DOS shell, which is not a user-friendly interface, especially
for feedback-oriented applications. So an interface was built between our GUI and
the recognizer to overcome this problem.
3- String Comparator
The recognizer passes the phones of the recognized word to the string comparator,
which holds the reference word of the current lesson, compares the two, and
passes the difference at every phone to the User Profile Analyzer.
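Since the comparator must surface phone-level insertions, deletions, and substitutions, its job resembles a sequence diff. The sketch below uses Python's standard difflib rather than the project's C# code, and the phone names are illustrative:

```python
# Minimal sketch of phone-level diffing (insertions, deletions, substitutions)
# using difflib; the actual module compares against the current lesson's
# reference word. Phone names below are illustrative.
from difflib import SequenceMatcher

def phone_diff(reference, recognized):
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(
            None, reference, recognized).get_opcodes():
        if tag == "replace":
            ops.append(("substitution", reference[i1:i2], recognized[j1:j2]))
        elif tag == "delete":
            ops.append(("deletion", reference[i1:i2], []))
        elif tag == "insert":
            ops.append(("insertion", [], recognized[j1:j2]))
    return ops

# e.g. two plain vowels replaced by their emphasized counterparts:
diff = phone_diff(["a_l", "b", "k_l"], ["a_h", "b", "k_h"])
```

Matching phones produce no output; only the mismatched spans are reported, which is exactly what the feedback stages downstream need.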
4- User Profile Analyzer
After the user succeeds in a certain lesson, his profile is updated and the lesson is
added to it; in subsequent lessons he is expected not to repeat mistakes related to
lessons he has already learned. For example, suppose the user has learned only the
lesson teaching ( ): when he tries to recite ( ), he gets feedback only on
his mistakes in ( ). If he then learns and passes another lesson teaching
( ) and recites ( ) again, he gets feedback on his mistakes in both letters, as both are now saved in his profile.
5- Auxiliary Database
On starting a training or testing session, some values must be initialized to control
the session lifetime: for example, which word(s) will appear for the user to utter,
what the reference transcription of those word(s) is, etc. All of this information is
stored in this database.
6- Feedback Generator
After the mistakes are filtered according to the user's knowledge, the feedback
generator analyzes them and determines the suitable method of guiding the user to correct them.
3.2.2 Database Design
As for the rule of ( ), we have chosen 8 letters which are present in
( ). A speech database was built covering the right and wrong pronunciations
of these letters to train the recognizer.
Following the lessons of Sheikh Ahmed Amer, we adopted his methodology: ordinary
Arabic speakers by default pronounce a letter correctly in some words but not in
others, despite their ignorance of the rules, because in some cases the nature of the
word itself forces the speaker to pronounce it correctly.
So, to cover both cases for each letter, the training database was chosen to contain
four words per letter: two containing the letter read with the mistaken
pronunciation, and two other words containing the letter read with the correct one.
For example, for ( ), we have the words ( ):
the first two are usually mistaken, with the user emphasizing ( ), whereas
in the last two the letter is always pronounced correctly. Each speaker reads these
four words per letter three times. A list of all the words used for training can be
found in Appendix B.
3.2.3 Constraints
The design of our system was based on a few assumptions:
Calm environment
The system performs better in a relatively calm environment; noise beyond a
certain level can degrade the recognition accuracy.
Cooperative user
The word to be pronounced is displayed on the screen; the user is expected either
to pronounce it correctly or to mispronounce the letter being taught. The
system does not deal with unexpected words.
Male user
This version of the system has been trained with young male users' voices only, so
to serve female users, new models have to be constructed and trained with female
users' voices.
3.3. Speech Recognition with HTK
In this project, several experiments were done using HTK v3.2.1 to build HMMs for
different recognizers: first a small English digit recognizer to learn and test the
tool, then a small Arabic word recognizer as a first attempt at Arabic speech
recognition, and finally the prototype and the speaker-dependent and speaker-
independent versions of the project core.
In this section, we explain the steps followed to build such recognizers. Details of
using each tool can be found in the HTK manual [18].
3.3.1 Data preparation
Recording the data
The first stage in developing a recognizer is building a speech database for training
and testing. Although HTK provides a tool (HSLab) for recording and labeling data,
we used an easier, more user-friendly program for that purpose: Cool Edit
Pro v2. Speech is recorded via a desktop microphone and sampled at 16 kHz with 16
bits per sample. It is saved in the Windows PCM format as a WAV file. Cool Edit is
then used to segment the data word by word, giving each word a distinct label.
Creating the Transcription Files
To train a set of HMMs, every file of training data must have an associated phone-level transcription. This was done manually by writing the phone-level
transcription of each word, all in a single Master Label File (MLF) in the standard
HTK format.
Coding the data
Speech is then coded using the tool HCopy: the signal is first separated into frames
of 10 ms length, and those frames are then converted into feature vectors (MFCC
coefficients in our case).
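As a rough illustration of the framing step (the MFCC computation itself is left to HCopy), a signal can be cut into overlapping windows advanced by the 10 ms frame shift. The 25 ms analysis window below is a common convention, not a value stated in this report:

```python
import numpy as np

# Sketch of the framing HCopy performs before computing MFCCs: a 16 kHz
# signal cut into windows advanced by a 10 ms shift. The 25 ms window
# length is a common convention, assumed here for illustration.
def frame_signal(signal, sample_rate=16000, shift_ms=10, window_ms=25):
    shift = int(sample_rate * shift_ms / 1000)     # 160 samples per shift
    window = int(sample_rate * window_ms / 1000)   # 400 samples per window
    n_frames = 1 + max(0, (len(signal) - window) // shift)
    return np.stack([signal[i * shift : i * shift + window]
                     for i in range(n_frames)])

frames = frame_signal(np.zeros(16000))  # one second of audio
```

Each row of the result would then be windowed and converted into one MFCC feature vector.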
3.3.2 Creating Monophone HMMs
Creating Flat Start Monophones
The first step in HMM training is to create a prototype model defining the model
topology. In phone-level recognizers, the model usually consists of three emitting
states plus one entry and one exit state.
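As a sketch of this topology, the transition matrix of such a prototype is strictly left-to-right. The probabilities below are illustrative placeholders, not HTK defaults:

```python
import numpy as np

# Sketch of the left-to-right topology of the prototype model: five states
# in HTK terms (entry + three emitting + exit). Every row except the final
# (exit) row must sum to 1; the probabilities are illustrative.
trans = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # entry -> first emitting state
    [0.0, 0.6, 0.4, 0.0, 0.0],   # self-loop or advance
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.3],   # last emitting state -> exit
    [0.0, 0.0, 0.0, 0.0, 0.0],   # exit (non-emitting, no outgoing arcs)
])
```

The zeros below the diagonal encode the left-to-right constraint: a phone model can stay in a state or advance, but never move backwards.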
After that, an HMM seed is generated by the tool HCompV, which initializes the
prototype model with a global mean and variance computed over all the frames in every
feature file. This variance, scaled by a factor (typically 0.01), is also used as a
variance floor to bound the variances estimated in subsequent steps.
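What this flat start amounts to can be sketched in a few lines; the function below is an illustration of the computation, not HCompV itself:

```python
import numpy as np

# Sketch of a flat start: pool every frame of every feature file, compute a
# single global mean and variance, and derive a variance floor as 0.01
# times the global variance (the typical scaling factor mentioned above).
def flat_start(feature_files):
    frames = np.concatenate(feature_files, axis=0)   # all frames, all files
    mean = frames.mean(axis=0)
    var = frames.var(axis=0)
    var_floor = 0.01 * var                           # flooring factor
    return mean, var, var_floor

# Toy data: two "files" of 5 frames x 3 features each.
mean, var, floor = flat_start([np.ones((5, 3)), np.zeros((5, 3))])
```

Every model then starts from this identical mean and variance, which is why the procedure is called a flat start: the first re-estimation pass is what differentiates the models.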
A copy of this seed is placed in a Master Macro File (MMF) called hmmdefs as the
initialization for every model defined in the HMM list. This list contains all the
models used in the recognition task: a model for each phone in the training data
plus a silence model /sil/ for the start and end of every utterance.
Another file called macros is created that contains the variance floor macro and
defines the HMM parameter kind and the vector size.
HMM Parameter Re-estimation
The flat-start monophones are re-estimated using the embedded training version of the
Baum-Welch algorithm, performed by the tool HERest, whereby every model is
re-estimated from the frames labeled with its corresponding transcription.
This tool is used for re-estimation after every modification to the models, and usually
two or three iterations are performed each time.
Fixing silence models
Forward and backward skip transitions are added to the silence model /sil/ to give
it a longer mean duration. This is performed by the HMM editor tool
HHEd.
3.3.3 Creating Tied-State Triphones
To add a measure of context dependency and refine the models, we create triphone
HMMs, where each phone model is conditioned on both its left and right
context.
First, the label editor HLEd is used to convert the monophone transcriptions to an
equivalent set of triphone transcriptions. Then the HMM editor HHEd is used to
create the triphone HMMs from the triphone list generated by HLEd. This HHEd
command has the effect of tying all of the transition matrices in each triphone set,
so that the HMMs of a set share the same set of transition parameters.
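The monophone-to-triphone conversion can be pictured as follows. The phone names come from this project's transcriptions; leaving silence context-independent is a common convention assumed here, not something this report specifies:

```python
# Sketch of monophone-to-triphone conversion: each phone p with left
# context l and right context r becomes "l-p+r". Treating silence as
# context-independent is a common convention, assumed for illustration.
def to_triphones(phones, skip=("sil",)):
    out = []
    for i, p in enumerate(phones):
        if p in skip:
            out.append(p)
            continue
        left = phones[i - 1] if i > 0 and phones[i - 1] not in skip else None
        right = (phones[i + 1]
                 if i + 1 < len(phones) and phones[i + 1] not in skip else None)
        name = p
        if left:
            name = f"{left}-{name}"
        if right:
            name = f"{name}+{right}"
        out.append(name)
    return out

tri = to_triphones(["sil", "a:_h", "a_h", "h:", "sil"])
```

Each distinct triphone name then gets its own HMM, which is why the model inventory grows so quickly and why the tying and clustering steps below are needed.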
The last step is using decision trees based on asking questions about the left
and right contexts of each triphone. Based on the acoustic differences between phones
according to the classifications mentioned in section 2.3, phones are
clustered using these decision trees for further refinement. The decision tree attempts
to find the contexts which make the largest difference to the acoustics and which
should therefore distinguish clusters.
Decision tree state tying is performed by running HHEd using the QS command for
questions where the questions should progress from wide, general classifications
(such as consonant, vowel, nasal, diphthong, etc.) to specific instances of each phone.
3.3.4 Increasing the number of mixture components
The early stages of triphone construction, particularly state tying, are best done with
single Gaussian models, but it is preferable for the final system to consist of multiple
mixture component context-dependent HMMs instead of single Gaussian HMMs
especially for speaker-independent systems. The optimal number of mixture
components can be obtained only by experiment: gradually increasing the number
of components per state and monitoring the performance after each
change.
The tool HHEd is used for increasing mixture components with the command MU.
3.3.5 Recognition and evaluation
After training the HMMs, the tool HVite is used to perform a Viterbi search over a
recognition lattice containing multiple paths for the words to be recognized.
Our grammar file is written at the phone level to provide multiple paths for phone
insertions, deletions, or substitutions of the word to be recognized. It is written
using HTK's BNF notation; the tool HParse then generates a lattice from this
grammar as input to HVite.
As we used a phone-level grammar, our dictionary is just a list of the phones used,
like the following:
sil sil
a:_h a:_h
a:_l a:_l
… etc.
Note that if some paths in the grammar file produce context-dependent triphones
that have no corresponding models in the HMM set, we copy the HMMs of the
corresponding monophones from before tying and add them to the final models
instead of re-training the models on the new triphones.
To test the recognizer performance, we run HVite on the testing data, writing the
output transcriptions for all test files into a Master Label File. We then use the
tool HResults to compare this file with a reference file of the correct
transcriptions; it reports the percentage of correctly recognized words and phones
along with other statistics.
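The two headline figures HResults reports are computed from the counts of deletions (D), substitutions (S), and insertions (I) against the N reference labels, as defined in the HTK book:

```python
# The two headline scores HResults prints, as defined in the HTK book:
# percent correct ignores insertions, percent accuracy penalizes them.
def htk_scores(n_labels, deletions, substitutions, insertions):
    correct = 100.0 * (n_labels - deletions - substitutions) / n_labels
    accuracy = 100.0 * (n_labels - deletions - substitutions
                        - insertions) / n_labels
    return correct, accuracy

# e.g. 200 reference phones, 2 deleted, 4 substituted, 1 inserted:
corr, acc = htk_scores(200, 2, 4, 1)
```

Accuracy is always less than or equal to percent correct, and can even go negative when insertions dominate, which is why it is the stricter of the two figures.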
3.4. Experiments and results
3.4.1 Prototype
To experiment with our approach, we started with a prototype that distinguishes
between only two pairs of words: the wrong and right pronunciations of
the word )( , and the two cases of the word ( ).
We chose these two words specifically to try identifying pronunciation
errors of more than one phone at the same time, as the learner is prone to
mispronounce the letters " ", " ", " " in these two words, where the mispronunciation
appears in the vowel after each of them.
The grammar file of the recognizer is as follows:
( sil ( a:_h a_h h: b_h a_h t: a_h |
a:_l a_l h: b_l a_l t: a_h |
a:_h a_h b k_h aa_h r_h aa_h |
a:_l a_l b k_l aa_l r_h aa_h |
<sil> )
sil )
As the number of words is small, there was no need to make a grammar file for each
pair, so we included them all in a single grammar. There is also a path in the network for a repeated silence /sil/ to absorb any noise.
A speech database for a single speaker was recorded, where each of the four words
was recorded 15 times.
When we trained and tested the HMMs with the whole database, it gave a 100%
recognition result.
But when we divided the data evenly between training and testing, it gave a result of
96.67% correct, as only one word was misrecognized, which is an acceptable result.
3.4.2 Speaker-dependent system
After the prototype experiment, we started building our complete model, first as a
speaker-dependent system.
Based on the words of the database in Appendix B, speech data was recorded by a
single speaker 20 times; 5 of them are the correct pronunciations of the commonly
mispronounced words.
The result was 100% when the testing data was the whole training data.
When we divided the data between training and testing, with the testing data being
20% of the whole database, it gave a result of 95.37%.
3.4.3 Speaker-independent system
We started our speaker-independent experiment with 13 adult male speakers of
almost the same age, most of them recording the speech data 3 times.
The speech data comprised only the 32 words in Appendix B, without the 3 verses of
( ).
With 3 mixture components per state, testing on the training data gave 98.35%
accuracy, plus other non-significant errors that do not affect detecting the errors
of ( ), such as substituting the letter " " with " " or " " in the word ( ).
Such errors were not fully detected, as only a small number of speakers uttered
them in the training data, and they were not our focus.
When testing with separate test data from 6 speakers, with 3 repetitions each, the
results gave an accuracy of 98.76%, again plus other non-significant errors.
When we trained on the verses of ( ), the accuracy was around 50% in both
cases: training them separately and training them with the other words of the
database, with the training data used as the testing data.
A possible reason for this result is that each verse is not a single word, as in the
rest of the training database, but a full sentence, which introduces the problems of
continuous speech recognition, such as stronger context dependency, and therefore
requires more training.
Besides that, some of the transcriptions of the data were a point of disagreement, as
it was hard to decide what the right transcription of the data was, especially when it
is read quickly.
3.5 Implementation of Other Modules
After building the recognizer, the other modules of the system were implemented using
Microsoft Visual C# .NET. The following explains the implementation of each
module in figure [5].
3.5.1 Recognizer interface
This function is performed by the ProcessingAudio() method. The method starts a new
process for the recognizer, passes the appropriate arguments, hides the DOS shell, and
receives the recognizer's output.
If the recognizer succeeds in capturing the audio, it will return no messages, but it will
write the corresponding transcriptions (according to our phonology) in a file.
If the recognizer fails to capture the audio, the feedback generator will be fired, asking
the user to re-utter the word(s).
3.5.2 String Comparator
The transcription file generated by the recognizer is read by the string
comparator, which compares the recognized word with the reference
correct word.
In this transcription file, each phone is stored on a distinct line. So, in the same
manner as the user profile, the file is read and stored in an array-list
(dynamic array) to facilitate comparison with the reference phones stored in another
array-list.
As mentioned before, there are three kinds of pronunciation mistakes: insertion,
substitution, and deletion. So the module was implemented with three methods:
CheckInsertion(), CheckSubstitution(), and CheckDeletion(). The core of all three
methods was implemented, but only CheckSubstitution() was completed and tested,
as the only mistakes we currently handle fall into this kind.
The implementation of the CheckSubstitution() method is as follows: for every
phone named 'fat-ha' in the reference word, if it differs from the corresponding
phone in the recognized word(s) with respect to emphasis ( ), and the previous
phone belongs to a passed lesson or to the current lesson (in the case of training, not
testing), then the feedback generator is fired to report this mistake; otherwise the
mistake is ignored.
By doing so, we are able to detect all mistakes, but we filter the feedback according
to the user's status.
As observed, we search for the specific vowel phone 'fat-ha', as we assumed
that the emphasized consonant phone is the same as the un-emphasized one and that
the difference appears only in the vowels following the consonant. This assumption
is valid for most consonants; consonants outside this assumption are not handled in
our project, but could be handled by simply adding some additional phones.
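A minimal sketch of this check, written in Python rather than the project's C#, with hypothetical phone names and a hypothetical consonant-to-lesson map:

```python
# Sketch of the CheckSubstitution() logic described above: for every
# 'fat-ha' vowel in the reference, flag an emphasis mismatch only if the
# preceding consonant belongs to an active (passed or current) lesson.
# Phone names and the lesson map are illustrative, not the real inventory.
FATHA = {"a_h", "a_l"}  # emphasized / plain short vowel (hypothetical names)

def check_substitution(reference, recognized, lesson_of, active_lessons):
    mistakes = []
    for i, ref in enumerate(reference):
        if ref in FATHA and i < len(recognized) and recognized[i] != ref:
            consonant = reference[i - 1] if i > 0 else None
            # Report only mistakes the user has been taught about.
            if consonant and lesson_of.get(consonant) in active_lessons:
                mistakes.append((i, consonant, ref, recognized[i]))
    return mistakes

# Plain fat-ha expected after "t:", but the emphasized one was recognized:
m = check_substitution(["t:", "a_l"], ["t:", "a_h"], {"t:": 2}, {2})
```

Note how detection and filtering are fused: a mismatch on a consonant whose lesson is not active is simply dropped rather than reported.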
3.5.3 Auxiliary Database
For every lesson, the user's utterance is tested with two words. To decide which
word appears to the user, and to initialize the reference array-list (dynamic
array) for that word, the method TrainWhat() takes the lesson number and the word
number so that the session can be started, and returns the corresponding word.
On generating the feedback, for each mistaken phone we check its lesson to know
whether the user has passed that lesson or not, and display the appropriate feedback.
The method GetLessonNo() implements this using a simple switch case.
Also on generating feedback, we need a mapping between the phones resulting from
running the recognizer and the corresponding Arabic letters, i.e. a de-transcriptor.
The method Corr_Arabic() implements this by taking a string representing the phone
and passing it through a switch case to return the corresponding Arabic letter.
As observed, switch cases are used frequently for the simplicity of implementation
they allow given the small search space; if the search space grew significantly,
another choice would be needed.
3.5.4 User Profile Analyzer
Two methods implement this analyzer. The first is ReadProfile(), which is called at
the beginning of a training or testing session so that feedback is given only
according to the lesson numbers stored in the profile (i.e. the lessons the user has
passed). In the case of training (not testing), the current lesson is considered in
addition to the passed lessons. The second method is UpdateProfile(), which is called
after the user passes a certain lesson so that it is considered afterwards.
ReadProfile() is implemented as follows: the file is read line by line (every lesson
number is stored on a distinct line), and each lesson number is added to an
array-list (dynamic array) to facilitate searching.
UpdateProfile() is implemented as follows: the array-list created in ReadProfile()
is searched to determine whether the lesson has been passed before; if not, it is
appended to the profile.
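The two methods can be sketched as follows, in Python rather than the project's C#; the file name and one-number-per-line format are taken from the description above, while everything else is illustrative:

```python
# Sketch of ReadProfile()/UpdateProfile() as described: one passed-lesson
# number per line; update appends only if the lesson is new. The file
# name is illustrative.
import os
import tempfile

def read_profile(path):
    if not os.path.exists(path):
        return []                       # new user: empty profile
    with open(path) as f:
        return [int(line) for line in f if line.strip()]

def update_profile(path, lesson):
    if lesson not in read_profile(path):   # only append new lessons
        with open(path, "a") as f:
            f.write(f"{lesson}\n")

path = os.path.join(tempfile.mkdtemp(), "user.profile")
update_profile(path, 1)
update_profile(path, 1)   # duplicate: ignored
update_profile(path, 3)
lessons = read_profile(path)
```

The append-only file keeps the profile trivially persistent between sessions, which is all the feedback filtering needs.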
3.5.5 Feedback Generator
This module collects the messages generated by the string comparator module and
displays them in a suitable way to guide the user in correcting his mistakes.
If no messages are collected, appropriate messages are displayed to guide the
user in completing the learning process.
The format of the reported message is something like:
<< >>
where the words between angle brackets vary according to the letter and the type of
mistake that occurred in uttering it. For instance, a mistake in uttering the word ‘ ’
will produce a message in which the word ‘ ’ (emphasized) represents the type of the
mistake in uttering the letter ‘ ’.
An example of messages reporting correct utterance is as follows:
...
هللا ...
A final example given here is for guiding the user through correcting his mistakes;
messages of this type are like:
...
...
The method FeedbackOut() implements this module with the aid of the auxiliary
database.
3.5.6 GUI
Navigation through the project forms, reaching the scenario mentioned above, is
controlled by the GUI.
Our GUI consists of six forms:
frmUserType: consists of two radio buttons and a command button so that the user can
select his type (registered/unregistered).
frmNewUser: if the user is unregistered, he is brought to this form, which contains
a text box and a command button. When the user enters his name, an empty new
profile is created for him.
frmOldUser: if the user is registered, he is brought to this form, which contains a
combo box (drop-down list) of registered users and a command button. The directory
containing users' profiles is searched to fill the combo box.
frmLessons: contains command buttons for choosing a lesson to listen to, and a
command button for performing a test.
frmListening: for playing the lesson chosen in frmLessons, so it contains cassette-
like buttons for performing this function.
frmTraining: for training the user on the lesson he has heard and testing him.
The scenario mentioned above takes place on this form, so it is the most important
form in the project.
One last thing to mention here is that the layout of these forms was drawn using
Microsoft PowerPoint.
Chapter 4: Conclusion and Future Work
4. CONCLUSION AND FUTURE WORK
In this work, we presented a computer-assisted pronunciation teaching system for a
class of the recitation rules of the Holy Quran. Achieving this work required
building background in various disciplines, which made it very interesting.
Handling and detecting pronunciation errors by identifying phone insertions, deletions,
and substitutions has proven feasible and useful for a considerable class of
recitation rules. Extending the system to cover all the words of the Holy Quran could
be done by a procedure that automatically generates all possible phone paths covering
the different pronunciations of each word, together with robust HMMs trained on a
large database.
The HMM Toolkit was an excellent tool for our experiments and is powerful enough
for further research in this area.
The main problem we encountered in our experiments was building the speech
database, as not all speakers pronounced the words as we expected. This led
to problems in writing the appropriate transcriptions for each utterance, and some
data was rejected entirely. Supervising the recording process for many volunteers was
hard to achieve, and would have at least doubled the time required. Even so, the
results are very satisfying and encourage us to continue in this field.
As for the future work of the system, we aim to experiment with another class
of recitation rules, namely ( ), by building a layer above the
recognizer that counts the number of frames of a recognized phone as an indication of
the length of the vowel or semi-vowel.
We also aim to investigate the possibility of making the system Web-based for
distance learning.
REFERENCES
[1] Eskenazi, M. "Detection of foreign speakers' pronunciation errors for second
language training - preliminary results".
[2] Witt, S.M. and Young, S. (1997) "Computer-assisted pronunciation teaching
based on automatic speech recognition", Language Teaching and Language
Technology.
http://svr-www.eng.cam.ac.uk/~smw24/ltlt.ps
[3] Delmonte, R. "A Prosodic Module for Self-Learning Activities".
http://www.lpl.univ-aix.fr/sp2002/pdf/delmonte.pdf
[4] Gu, L. and Harris, G. "SLAP: A System for the Detection and Correction of Pronunciation for Second Language Acquisition Using HMMs".
[5] Eskenazi, M. "Using Automatic Speech Processing for Foreign Language
Pronunciation Tutoring: Some Issues and a Prototype", LLT Journal, Vol. 2, No.
2, January 1999.
http://llt.msu.edu/vol2num2/article3/index.html
[6] Witt, S. Use of Speech Recognition in Computer-assisted Language Learning. PhD
thesis, Cambridge University, 1999.
[7] Neri, A., Cucchiarini, C. and Strik, H. "Automatic Speech Recognition for second language learning: How and why it actually works".
[8] Survey of the State of the Art in Human Language Technology, Center for Spoken
Language Understanding.
http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html
[9] Padmanabhan, M. and Picheny, M. "Large-Vocabulary Speech Recognition
Algorithms", IEEE Computer Magazine, pp 42-50, April 2002.
[10] Ehsani, F. and Knodt, E. "Speech Technology in Computer-Aided Language
Learning: Strengths and Limitations of a New CALL Paradigm", LLT Journal, Vol. 2, No. 1, July 1998.
http://polyglot.cal.msu.edu/llt/vol2num1/article3/
[11] Moreno, D. "Harmonic Decomposition Applied to Automatic Speech
Recognition".
[12] . ""
[13] . " " .
[14] Rabiner, L. and Juang, B. Fundamentals of Speech Recognition, Prentice Hall, 1993.
[15] Allen, J.F. "Signal Processing for Speech Recognition", Lecture Notes of
CSC 248/448: Speech Recognition and Statistical Language Models, Fall
2003, University of Rochester.
http://www.cs.rochester.edu/u/james/CSC248/Lec13.pdf
[16] Rabiner, L. "A tutorial on hidden Markov models and selected applications in
speech recognition", Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[17] Wang, X. "Incorporating Knowledge on Segmental Duration in HMM-Based
Continuous Speech Recognition".
http://www.fon.hum.uva.nl/wang/ThesisWangXue/chapter2.pdf
[18] Young, S. et al. (2002), The HTK book for version 3.2, Cambridge University.
http://htk.eng.cam.ac.uk/
[19] Kawai, G. and Hirose, K. "A method for measuring the intelligibility and
nonnativeness of phone quality in foreign language pronunciation training".
Appendix A: User Manual
A. USER MANUAL
When you run the program, the
first form you meet carries
the title . If this is
your first time using the
program, choose
(1).
In this case you will be
transferred to another form
where you enter your name and
a new profile is created. Otherwise, you can choose
(2).
You will then be transferred to
another form where you can
select your name from the
drop-down list (3) containing all
registered users.
After logging in, and at any
point in the program, you will
not lose sight of the button
(4), which enables you
to re-login as a different user.
The next form contains the list
of lessons to learn, each titled
with the letter that the lesson
teaches (5).
Then you can listen to the
lesson in the voice of Sheikh
Ahmad Amer by choosing the
Play button (6), return to the
previous form by choosing (7),
choose (8) to test
what you learned, or
start your teaching
session by choosing (9).
This takes you to the teaching
session form, where you can
hear the correct pronunciation
of the word (10) by pressing it,
or start recording (11) your
reading of this word. When
you are done recording,
feedback about your reading is
displayed (12), and you can
listen to your own reading by
pressing the Play button (13).
Appendix B: Training Database
B. TRAINING DATABASE
The following is a list of the words used for training; for each letter, the first
two words are usually pronounced wrongly, and the last two words are pronounced
correctly.
LETTER WORDS