7/23/2019 Speech Lab - Project Report
http://slidepdf.com/reader/full/speech-lab-project-report 1/44
Cairo University
Computer Engineering Department
Giza, 12613 EGYPT

Speech Lab
Graduation Project Report
Submitted by
Amr M. Medhat Sameh M. Serag Mostafa F. Mahmoud
In partial fulfillment of the B.Sc. Degree in Computer Engineering
Supervised by
Dr. Nevin M. Darwish
July 2004
ABSTRACT
Speech has long been viewed as the future of computer interfaces, promising
significant improvements in ease of use and enabling the rise of a variety of
speech-recognition-based applications. With the recent advances in speech
recognition technology, computer-assisted pronunciation teaching (CAPT) has
emerged as a tempting alternative to traditional methods, capable of
supplementing or replacing direct student-teacher interaction.
Speech Lab is an Arabic pronunciation teaching system for teaching some of the Holy
Qur'an recitation rules. The objective is to detect the learner's pronunciation errors
and provide diagnostic feedback. The heart of the system is a phone-level HMM-based
speech recognizer. The learner's pronunciation is compared with the teacher's correct
one by identifying the phone insertions, deletions, and substitutions that result
from the recognition of the learner's speech. In this work we focus on some of the
recitation rules targeting pronunciation problems of Egyptian learners.
ACKNOWLEDGEMENT
First and foremost, we would like to thank Dr. Salah Hamid from The Engineering
Company for the Development of Computer Systems (RDI) for his generous and
enthusiastic guidance. Without his insightful and constructive advice and support,
this project would not have been achieved. We are deeply grateful to him and also to
Waleed Nazeeh and Badr Mahmoud for their helpful support.
In this project we made use of a series of lessons for teaching the Holy Qur'an
recitation rules by Sheikh Ahmed Amer; we are very grateful to him for these
wonderful lessons. Besides using their content, they were of great help in shaping
the methodology we worked within in the project.
We are also grateful to all our friends who helped us by recording the data to build the
speaker-independent database. They were really cooperative and helpful. And special
thanks to the artists Mohammed Abdul-Mon'em, Mahmoud Emam and Mohammed
Nour for their wonderful work that added beauty and elegance to our project.
Special thanks must go also to Dr. Goh Kawai for providing us with his valuable
paper on pronunciation teaching.
Finally, we would like to thank our supervisor Dr. Nevin Darwish, our parents and all
who supported us. Thanks to all, and thanks to God.
LIST OF ABBREVIATIONS
ASR Automatic Speech Recognition
CALL Computer-Assisted Language Learning
CAPT Computer-Assisted Pronunciation Teaching
EM Expectation Maximization
HMM Hidden Markov Model
HTK Hidden Markov Model Toolkit
LPC Linear Predictive Coding
MFCC Mel Frequency Cepstral Coefficients
ML Maximum Likelihood
TABLE OF CONTENTS
ABSTRACT...................................................................................................................ii
ACKNOWLEDGEMENT........................................................................................... iii
LIST OF ABBREVIATIONS.......................................................................................iv
1. INTRODUCTION .....................................................................................................1
1.1 Motivation and Justification ................................................................................1
1.2 Problem definition ...............................................................................................1
1.3 Summary of Approach .........................................................................................1
1.4 Report overview...................................................................................................2
2. LITERATURE REVIEW ..........................................................................................3
2.1. Pronunciation Teaching ......................................................................................3
2.1.1 The Need for Automatic Pronunciation Teaching........................................3
2.1.2 Components of Pronunciation to Address ....................................................3
2.1.3 Previous Work ..............................................................................................4
2.1.4 Computer as a Teacher..................................................................................5
2.1.5 Components of an ASR-based Pronunciation Teaching System..................6
2.2 Speech Recognition .............................................................................................7
2.2.1 Speech Recognition System Characteristics.................................................7
2.2.2 Speech Recognition System Architecture.....................................................8
2.3 Phonetics and Arabic Phonology.......................................................................11
2.4 Speech Signal Processing ..................................................................................14
2.4.1 Feature Extraction.......................................................................................14
2.4.2 Building Effective Vector Representations of Speech................................15
2.5 HMM..................................................................................................................16
2.5.1 Introduction.................................................................................................16
2.5.2 Markov Model.............................................................................................17
2.5.3 Hidden Markov Model................................................................................17
2.5.4 Speech recognition with HMM...................................................................18
2.5.5 Three essential problems.............................................................................19
2.5.6 Two important algorithms...........................................................................19
2.6 HTK ...................................................................................................................19
3. DESIGN AND IMPLEMENTATION ....................................................................21
3.1. Approach...........................................................................................................21
3.2. Design ...............................................................................................................22
3.2.1 System Design ............................................................................................22
3.2.2 Database Design..........................................................................................24
3.2.3 Constraints ..................................................................................................24
3.3. Speech Recognition with HTK .........................................................................24
3.3.1 Data preparation..........................................................................................24
3.3.2 Creating Monophone HMMs......................................................................25
3.3.3 Creating Tied-State Triphones....................................................................25
3.3.4 Increasing the number of mixture components...........................................26
3.3.5 Recognition and evaluation.........................................................................26
3.4. Experiments and results ....................................................................................27
3.4.1 Prototype.....................................................................................................27
3.4.2 Speaker-dependent system..........................................................................27
3.4.3 Speaker-independent system.......................................................................28
3.5 Implementation of Other Modules.....................................................................28
3.5.1 Recognizer interface ...................................................................................28
3.5.2 String Comparator.......................................................................................28
3.5.3 Auxiliary Database......................................................................................29
3.5.4 User Profile Analyzer .................................................................................29
3.5.5 Feedback Generator ....................................................................................30
3.5.6 GUI .............................................................................................................30
4. CONCLUSION AND FUTURE WORK ................................................................32
REFERENCES ............................................................................................................33
A. USER MANUAL....................................................................................................35
B. TRAINING DATABASE .......................................................................................37
ARABIC SUMMARY.................................................................................................38
Chapter 1: Introduction
1. INTRODUCTION
1.1 Motivation and Justification
Teaching the Holy Qur'an recitation rules, like pronunciation teaching in general,
can be repetitive, requiring drills and one-to-one attention that is not always
available, especially in large classes or when no teacher is present; it also becomes
very hard for the teacher to detect the learners' mistakes when their numbers are large.
Many systems have been developed for teaching people the recitation rules of the
Holy Qur'an, but the problem with such systems was that they lacked interaction,
as they were based only on the user repeatedly listening to the correct reading and
attempting to imitate it.

Computer-assisted pronunciation teaching (CAPT) techniques are therefore attractive,
as they allow self-paced practice outside a classroom, constant availability, and real
interaction between the learner and the computer without the many-to-one problem
of the classroom.
1.2 Problem definition
The idea, in abstract terms, as shown in figure [1], is to compare the learner's
speech with the correct one and to provide the learner with feedback indicating the
place of the mispronunciation, if any, and guiding him to the correct pronunciation.
[Figure: block diagram in which "Learner's speech" and "Reference speech" enter the "System", which produces "Feedback"]
Figure [1]: Schematic diagram of a typical CAPT system
The system can mainly be viewed as manipulating the learner's speech with an
automatic speech recognition (ASR) system so as to compare it with the reference
utterance and provide the proper feedback.
1.3 Summary of Approach
Most of the work done in this new application of ASR has targeted teaching
pronunciation to learners of a second language. There was also an attempt to build a
CAPT system as a reading tutor for children.
One approach that has been used to detect nonnative pronunciation characteristics
in foreign language speech views the differences between the native and target
languages as phone insertions, deletions, and substitutions. So, a bilingual HMM-based phone
recognizer was used to identify pronunciation errors at the phone level, where
HMMs are trained on the phones of both languages.
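The phone-string comparison that underlies this approach can be sketched with a standard dynamic-programming edit-distance alignment. The snippet below is an illustrative sketch only, not our actual comparator, and the phone labels in the usage example are hypothetical:

```python
def align(ref, hyp):
    """Dynamic-programming edit-distance alignment of two phone
    sequences; returns the list of edit operations that turn the
    reference pronunciation into the recognized one."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # trace back to recover insertions, deletions and substitutions
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            if ref[i - 1] != hyp[j - 1]:
                ops.append(("sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return list(reversed(ops))
```

For example, aligning the hypothetical reference phones ["b", "a", "t"] with the recognized phones ["b", "i", "t"] yields a single substitution of "a" by "i", which is exactly the kind of evidence the teaching system turns into feedback.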
In this project, we present a CAPT system for teaching some of the recitation rules of
the Holy Qur'an, deploying a similar approach and targeting pronunciation problems
of Egyptian learners.
What makes our system somewhat different from others is that the learner in our case
is often able to pronounce all the sounds of the Arabic language perfectly, but he may
not use the correct sounds (or phonemes) in the correct places when reading the
Holy Qur'an. The learner therefore does not face the problem of foreign language
learners, who encounter sounds in the new language that do not exist in their native
language and are harder to train.
1.4 Report overview
We present in chapter 2 the background against which this project is undertaken and
the main tool used in its development.

As for chapter 3, we explain the approach and design of our system and how we
implemented it.

We discuss in chapter 4 the conclusions to be drawn from this project and the future
work we have in mind.
Chapter 2: Literature Review
2. LITERATURE REVIEW
2.1. Pronunciation Teaching
2.1.1 The Need for Automatic Pronunciation Teaching
During the past two decades, the exercise of spoken language skills has received
increasing attention among educators. Foreign language curricula focus on productive
skills with special emphasis on communicative competence. Students' ability to
engage in meaningful conversational interaction in the target language is considered
an important, if not the most important, goal of second language education.
According to Eskenazi, the use of an automatic recognition system to help a user
improve his accent and pronunciation is appealing for at least two reasons: first, it
affords the user more practice time than a human teacher can provide, and second, the
user is not faced with the sometimes overwhelming problem of human judgment of
his production of “foreign” sounds [1]. To appreciate the value of such a system, it
helps to recognize the specific difficulties encountered in pronunciation teaching:
• Explicit pronunciation teaching requires the sole attention of the teacher to a
single student; this poses a problem in a classroom environment.
• Learning pronunciation can involve a large amount of monotonous repetition,
thus requiring a lot of patience and time from the teacher.
• With pronunciation being a psycho-motoric action, it is not only a mental
task but also demands coordination and control over many muscles. Given the
social implications of the act of speaking, it can also mean that students are
afraid to perform in the presence of others.
• In language tests the oral component is costly, time-consuming, and
subjective; therefore an automatic method of pronunciation assessment is
highly desirable.
Additionally, all arguments for the usefulness of CALL systems apply here as well,
such as being available at all times and being cheaper.
All these reasons indicate that computer-based pronunciation teaching is not only
desirable for self-study products but also for products that would complement the
teaching aids available to a language teacher [2].
2.1.2 Components of Pronunciation to Address
The accuracy of pronunciation is determined by both segmental and supra-segmental
features.
The segmental features are concerned with the distinguishable sound units of speech,
i.e. phonemes. A phoneme is also defined as "the smallest unit which can make a
difference in meaning". The set of phonemes of one language can be classified into
broad phonetic subclasses; for example, the most general classification, as we will
see in section 2.4, would be to separate vowels and consonants. Each language is
characterized by its distinctive set of phonemes. When learning a new language,
foreign students can divide the phonemes of the target language into two groups. The
first group contains those phonemes which are similar to ones in their source
language. The second group contains those phonemes which do not exist in the source
language [6].
Teaching the pronunciation of segmental or phonetic features includes teaching the
correct pronunciation of phonemes and the co-articulation of phonemes into higher
phonological units, i.e., teaching phoneme pronunciation first in isolation and then
in context with other phonemes within words or sentences.
The supra-segmental features of speech are the prosodic aspects, which comprise
intonation, pitch, rhythm, and stress. Teaching the pronunciation of prosodic features
includes teaching the following [3]:
• the correct position of stress at word level;
• the alternation of stressed and unstressed syllables, compensation, and vowel
reduction;
• the correct position of sentence accent;
• the generation of adequate rhythm from stress, accent, and phonological
rules;
• the generation of an adequate intonational pattern for the utterance, related
to its communicative functions.
For beginners, phonetic characteristics are of greater importance because these cause
mispronunciations. With increasing fluency, more emphasis should be placed on
teaching prosody. However, the focus here will be on teaching phonetics, since
teaching prosody usually requires a different teaching approach.
2.1.3 Previous Work
Over the last decade several research groups have started to develop interactive
language teaching systems incorporating pronunciation teaching based on speech
recognition techniques. There was the SPELL project from [Hiller, 1993] which
concentrated on teaching pronunciation of individual words or short phrases plus
additional exercises for intonation, stress and rhythm. However, this system
concentrated on one sound at a time, for instance the pair "thin-tin" is used to train the
'th' sound, but it did not check whether the remaining phonemes in the word were
pronounced correctly.
Another early approach, based on dynamic programming and vector quantization by
[Hamada, 1993], is likewise limited to word-level comparisons between recordings
of native and non-native utterances of a word. Therefore, their system required new
recordings of native speech for each new word used in the teaching system. This
system is called a text-dependent system in contrast to a text-independent one, where
the teaching material can be adjusted without additional recordings.
The systems described by [Bernstein, 1990] and [Neumeyer, 1996] were capable of
scoring complete sentences but not smaller units of speech.
The system used by [Rogers, 1994] was originally designed to improve the speech
intelligibility of hearing-impaired people. It was text-dependent and evaluated
isolated word pronunciations only.
The system described by [Eskenazi 1996] was also text dependent and compared the
log-likelihood scores produced by a speaker independent recognizer of native and
non-native speech for a given sentence [2].
The European-funded project ISLE [1998] is another example that aims to develop a
system that improves the English pronunciation of Italian and German native
speakers.
There is also the LISTEN project, an inter-disciplinary research project at
Carnegie Mellon University that develops a novel tool to improve literacy: an
automated Reading Tutor that displays stories on a computer screen and listens to
children read aloud.
Besides all these systems, work has also been done to build tools that support
research in pronunciation assessment. EduSpeak [2000] by SRI International is an
example. It is a speech recognition toolkit that consists of a speech recognition
module and native and non-native acoustic models for adults and children. It also
has scoring algorithms that make use of spectral matching and the duration of sounds.
2.1.4 Computer as a Teacher
The success of an automatic pronunciation training system depends on how well it
performs the role of a human teacher in a classroom. The following are some issues to
be considered if a CAPT system is to assist or even replace teachers:
1. Evaluation
In pronunciation exercises there is no clearly right or wrong answer. A large
number of different factors contribute to the overall pronunciation quality, and these
are also difficult to measure. Hence, the transition from poor to good pronunciation
is a gradual one, and any assessment must also be presented on a graduated scale
using a scoring technique [2].
2. Integration into a complete educational system
For practical applications, any scoring method will have to be embedded within an
interactive language teaching system containing modules for error analysis,
pronunciation lessons, feedback and assessment. These modules can take results from
the core algorithm to give the student detailed feedback about the type of errors which
occurred, using both visual and audio information. For instance, in those cases where
a phoneme gets rejected because of a too poor score, the results of the phoneme loop
indicate what has actually been recognized. This information can then be used for
error correction [2]. [Hiller 1996] presented a useful paradigm for a CALL
pronunciation teaching system called DELTA, consisting of four stages of
learning:
• Demonstrate the lesson audibly.
• Evaluate the student's listening ability with small tests.
• Teach with pronunciation exercises.
• Assess the progress made per lesson.
3. Adaptive Feedback
A perfect CAPT system does not just tell the user blindly "well done" or "wrong,
repeat again!"; it should be more intelligent, like an actual teacher.

In natural conversations, a listener may interrupt the talker to provide a correction
or simply point out the error. But the talker might not understand this message and
may ask the listener for clarification. So, a correctly formed message usually results
from an ensuing dialogue in which meaning is negotiated.
Ideally teachers point out incorrect pronunciation at the right time, and refrain from
intervening too often in order to avoid discouraging the student from speaking. They
also intervene soon enough to prevent errors from being repeated several times and
from becoming hard-to-break habits [5].
So, a perfect system that acts as a real teacher should consider the following [4, 5]:
• Addressing the error precisely, so the part of the word that was mispronounced
should be precisely located within the word.
• The addressed error should be used to modify the native utterance so that the
mispronounced component is emphasized by being louder, longer and possibly
with higher pitch. The student then says the word again and the system
repeats.
• Correcting only when necessary, reinforcing good pronunciation, and avoiding
negative feedback to increase student's confidence.
• The pace of correction, that is, the maximum number of interruptions per unit
of time that is tolerable, should be adapted to fit each student's personality,
since adaptive feedback is important to obtain better results from correction
and to avoid discouraging the student.
2.1.5 Components of an ASR-based Pronunciation Teaching System
The ideal ASR-based CAPT system can be described as a sequence of five phases, the
first four of which strictly concern ASR components that are not visible to the user,
while the fifth has to do with broader design and graphical user interface issues [7].
1. Speech recognition
The ASR engine translates the incoming speech signal into a sequence of words on
the basis of internal phonetic and syntactic models. This is the first and most
important phase, as the subsequent phases depend on its accuracy. It is worth
mentioning that a speaker-dependent system is more appropriate for teaching
foreign language pronunciation [2]. Details of this phase will be presented later.
2. Scoring
This phase makes it possible to provide a first, global evaluation of pronunciation
quality in the form of a score. The ASR system analyzes the spoken utterance that has
been previously recognized. The analysis can be done on the basis of a comparison
between temporal properties (e.g. rate of speech) and/or acoustic properties of the
student’s utterance on one side, and natives’ reference properties on the other side; the
closer the student’s utterance comes to the native models used as reference, the higher
the score will be.
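A minimal sketch of such a global score follows, assuming the recognizer exposes a per-utterance log-likelihood and that native reference statistics (mean and standard deviation) have been collected beforehand; the 25-points-per-deviation mapping is an arbitrary illustration, not a method described in this report:

```python
def global_score(student_ll, native_mean_ll, native_std_ll):
    """Map a recognizer log-likelihood to a 0-100 pronunciation score
    by comparing it against statistics from native reference speakers.
    (Illustrative mapping only; real systems calibrate this carefully.)"""
    # how far the student falls below the native mean, in std deviations
    z = (native_mean_ll - student_ll) / native_std_ll
    score = 100 - 25 * z  # lose 25 points per standard deviation below
    return max(0, min(100, round(score)))
```

A student whose utterance likelihood matches the native mean receives 100, while one two deviations below receives 50, reflecting the principle that closeness to the native models yields a higher score.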
3. Error detection
In this phase the system locates the errors in the utterance and indicates to the learner
where he made mistakes. This is generally done on the basis of so-called confidence
scores that represent the degree of certainty of the ASR system by matching the
recognized individual phones within an utterance with the stored native models that
are used as a reference.
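This localization step can be sketched as a simple threshold over per-phone confidence scores; both the scores and the 0.5 threshold below are invented for illustration:

```python
def flag_errors(phone_scores, threshold=0.5):
    """Return the positions and labels of phones whose confidence
    falls below the threshold, i.e. likely mispronunciations.
    `phone_scores` is a list of (phone, confidence) pairs."""
    return [(i, phone)
            for i, (phone, conf) in enumerate(phone_scores)
            if conf < threshold]
```

The returned positions tell the learner exactly where in the utterance the mistakes occurred.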
4. Error diagnosis
The ASR system identifies the specific type of error that was made by the student and
suggests how to improve it, because a learner may not be able to identify the exact
nature of his pronunciation problem alone. This can be done by resorting to
previously stored models of typical errors that are made by non-native speakers.
5. Feedback presentation
This phase consists in presenting the information obtained during phases 2, 3, and 4 to
the student. It should be clear that while this phase implies manipulating the various
calculations made by the ASR system, the decisions that have to be taken here – e.g.
presenting the overall score as a graded bar, or as a number on a given scale – have to
do with design, rather than with the technological implementation of the ASR system.
This phase is fundamental because the learner will only be able to benefit from all the
information obtained by means of ASR if this is presented in a meaningful way.
2.2 Speech Recognition

Speech recognition is the process of converting an acoustic signal, captured by a
microphone or a telephone, to a set of words. The recognized words can be the final
results, or they can serve as the input to further linguistic processing.
2.2.1 Speech Recognition System Characteristics
Speech recognition systems can be characterized by many parameters, some of the
more important of which are shown in table [1] below [8].
Table [1]: Typical parameters used to characterize the capability of speech
recognition systems
An isolated-word speech recognition system requires that the speaker pause briefly
between words, whereas a continuous speech recognition system does not.
Spontaneous, or extemporaneously generated, speech contains disfluencies, and it is
much more difficult to recognize than speech read from script. Some systems require
speaker enrollment where a user must provide samples of his or her speech before
using them, whereas other systems are said to be speaker-independent, in that no
enrollment is necessary. Some of the other parameters depend on the specific task.
Perplexity indicates the language’s branching power, with low-perplexity tasks
generally having a lower word error rate. Recognition is generally more difficult
when vocabularies are large or have many similar-sounding words. Finally, there are
some external parameters that can affect speech recognition system performance,
including the characteristics of the background noise (Signal to Noise Ratio) and the
type and placement of the microphone [8].
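The notion of perplexity mentioned above can be made concrete: it is the geometric-mean inverse probability per word that a language model assigns to a test sequence, so a uniform choice among N equally likely words gives perplexity N. A short sketch:

```python
import math

def perplexity(word_probs):
    """Perplexity of a word sequence: the geometric-mean inverse
    probability per word, computed from the per-word probabilities
    a language model assigned to the sequence."""
    n = len(word_probs)
    log2_total = sum(math.log2(p) for p in word_probs)
    return 2.0 ** (-log2_total / n)
```

Lower perplexity means less branching at each word, which, as noted above, generally corresponds to a lower word error rate.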
2.2.2 Speech Recognition System Architecture
The process of speech recognition starts with a sampled speech signal. This signal has
a good deal of redundancy because the physical constraints on the articulators that
produce speech - the glottis, tongue, lips, and so on - prevent them from moving
quickly. Consequently, the ASR system can compress information by extracting a
sequence of acoustic feature vectors from the signal. Typically, the system extracts a
single multidimensional feature vector every 10 ms that consists of 39 parameters.
Researchers refer to these feature vectors, which contain information about the local
frequency content in the speech signal, as acoustic observations because they
represent the quantities the ASR system actually observes. The system seeks to infer
the spoken word sequence that could have produced the observed acoustic sequence.
[9]
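This front-end framing can be illustrated as follows; the 25 ms window length is a conventional choice assumed here (the text above only fixes the 10 ms hop between vectors):

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, win_ms=25, hop_ms=10):
    """Slice a waveform into overlapping analysis frames.
    The 25 ms window is a typical choice (an assumption, not stated
    in the text); the 10 ms hop matches the one-feature-vector-
    every-10-ms figure above."""
    win = int(sample_rate * win_ms / 1000)   # samples per window
    hop = int(sample_rate * hop_ms / 1000)   # samples per hop
    n_frames = 1 + max(0, (len(samples) - win) // hop)
    return np.stack([samples[i * hop : i * hop + win]
                     for i in range(n_frames)])
```

Each of these frames would then be reduced to one multidimensional feature vector by the signal analysis described in section 2.4.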
It is assumed that the ASR system knows the speaker's vocabulary in advance. This
restricts the search for possible word sequences to the words listed in the lexicon,
which lists the vocabulary and provides phonemes for the pronunciation of each word.
Language constraints are also used to dictate whether word sequences are equally
likely to occur [9]. Training data are used to determine the values of the language
and phone model parameters.
The dominant recognition paradigm is known as hidden Markov models (HMM). An
HMM is a doubly stochastic model, in which the generation of the underlying
phoneme string and the frame-by-frame, surface acoustic realizations are both
represented probabilistically as Markov processes. Neural networks have also been
used to estimate the frame-based scores; these scores are then integrated into
HMM-based system architectures, in what has come to be known as hybrid systems or
hybrid HMMs [8].
An interesting feature of frame-based HMM systems is that speech segments are
identified implicitly during the search process, rather than explicitly. An
alternative approach is to first identify speech segments, then classify the segments
and use the segment scores to recognize words. This approach has produced
competitive recognition performance in several tasks [8]. Our system will be an
HMM-based one.
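To make this search concrete, here is a toy Viterbi decoder over a discrete-emission HMM; the two states and observation symbols in the usage below are invented for illustration and bear no relation to our actual models:

```python
def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence for an observation sequence under a
    discrete-emission HMM (toy illustration, not a full recognizer).
    All probabilities are supplied as natural logarithms."""
    # delta[s] = best log-probability of any path ending in state s
    delta = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            delta[s] = prev[best] + log_trans[best][s] + log_emit[s][o]
            ptr[s] = best
        back.append(ptr)
    # trace the best final state back through the stored pointers
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

In a real recognizer the search runs over HMM states of phone models chained according to the lexicon and grammar, so recovering the best state path simultaneously segments the signal, which is exactly the implicit segmentation noted above.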
The speech recognition process as a whole can be seen as a system of five basic
components as in figure [2] below: (1) an acoustic signal analyzer which computes a
spectral representation of the incoming speech; (2) a set of phone models (HMMs)
trained on large amounts of actual speech data; (3) a lexicon for converting sub-word
phone sequences into words; (4) a statistical language model or grammar network that
defines the recognition task in terms of legitimate word combinations at the sentence
level; (5) a decoder, which is a search algorithm for computing the best match
between a spoken utterance and its corresponding word string [10].
Figure [2]: Components of a typical speech recognition system.
1. Signal Analysis
The first step, which will be presented in detail later, consists of analyzing the
incoming speech signal. When a person speaks into an ASR device (usually through
a high-quality noise-canceling microphone), the computer samples the analog input
into a series of 16- or 8-bit values at a particular sampling frequency (usually 16 kHz).
These values are grouped together in predetermined overlapping temporal intervals
called "frames". These numbers provide a precise description of the speech signal's
amplitude. In a second step, a number of acoustically relevant parameters, such as
energy, spectral features, and pitch information, are extracted from the speech signal.
During training, this information is used to model that particular portion of the speech
signal. During recognition, this information is matched against the pre-existing model
of the signal [10].
2. Phone Models
The second module is responsible for training a machine to recognize spoken
language by modeling the basic sounds of speech (phones). An HMM can model
either phones or other sub-word units, or it can model words or even whole
sentences. Phones are either modeled as individual sounds, so-called monophones, or
as phone combinations that model several phones and the transitions between them
(biphones or triphones). After comparing the incoming acoustic signal with the
HMMs representing the sounds of language, the system computes a hypothesis based
on the sequence of models that most closely resembles the incoming signal. The
HMM model for each linguistic unit (phone or word) contains a probabilistic
representation of all the possible pronunciations for that unit.
Building HMMs in the training process requires a large amount of speech data of the
type the system is expected to recognize [10].
3. Lexicon
The lexicon, or dictionary, contains the phonetic spelling for all the words that are
expected to be observed by the recognizer. It serves as a reference for converting the
phone sequence determined by the search algorithm into a word. It must be carefully
designed to cover the entire lexical domain in which the system is expected to
perform. If the recognizer encounters a word it does not "know" (i.e., a word not
defined in the lexicon), it will either choose the closest match or return an out-of-
vocabulary recognition error. Whether a recognition error is registered as
misrecognition or an out-of-vocabulary error depends in part on the vocabulary size.
If, for example, the vocabulary is too small for an unrestricted dictation task (say,
less than 3K), the out-of-vocabulary errors are likely to be very high. If the vocabulary
is too large, the chance of misrecognition errors increases because with more similar-
sounding words, the confusability increases. The vocabulary size in most commercial
dictation systems tends to vary between 5K and 60K [10].
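A lexicon lookup with out-of-vocabulary handling, as described above, can be sketched like this. The words and their phonetic spellings are purely hypothetical examples, not entries from any real system dictionary.

```python
# Sketch: a lexicon mapping words to phone sequences, with an
# out-of-vocabulary (OOV) error for unknown words. All entries are
# hypothetical illustrations, not taken from an actual dictionary.

lexicon = {
    "bear": ["b", "eh", "r"],
    "attacked": ["ah", "t", "ae", "k", "t"],
    "him": ["hh", "ih", "m"],
}

def phones_for(word):
    """Return the phonetic spelling of a word, or raise an OOV error."""
    try:
        return lexicon[word]
    except KeyError:
        raise ValueError(f"out-of-vocabulary word: {word!r}")

print(phones_for("bear"))  # ['b', 'eh', 'r']
```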
4. The Language Model
The language model predicts the most likely continuation of an utterance on the basis
of statistical information about the frequency in which word sequences occur on
average in the language to be recognized. For example, the word sequence "A bare
attacked him" will have a very low probability in any language model based on
standard English usage, whereas the sequence "A bear attacked him" will have a
higher probability of occurring. Thus the language model helps constrain the
recognition hypothesis produced on the basis of the acoustic decoding just as the
context helps decipher an unintelligible word in a handwritten note. Like the HMMs,
an efficient language model must be trained on large amounts of data, in this case
texts collected from the target domain.
In ASR applications with constrained lexical domain and/or simple task definition, the
language model consists of a grammatical network that defines the possible word
sequences to be accepted by the system without providing any statistical information.
This type of design is suitable for pronunciation teaching applications in which the
possible word combinations and phrases are known in advance and can be easily
anticipated (e.g., based on user data collected with a system pre-prototype). Because
of the a priori constraining function of a grammar network, applications with clearly
defined task grammars tend to perform at much higher accuracy rates than the quality
of the acoustic recognition would suggest [10].
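The word-sequence statistics described above can be illustrated with a toy bigram model. All counts here are invented solely to show why "a bear attacked him" outscores "a bare attacked him"; a real language model is trained on large text corpora.

```python
# Toy bigram language model with invented counts, illustrating how
# word-sequence statistics favour "bear attacked" over "bare attacked".

from collections import defaultdict

bigram_counts = {
    ("a", "bear"): 50, ("a", "bare"): 5,
    ("bear", "attacked"): 20, ("bare", "attacked"): 0,
    ("attacked", "him"): 30,
}
unigram_counts = defaultdict(int, {"a": 1000, "bear": 60, "bare": 40,
                                   "attacked": 35})

def bigram_prob(prev, word, vocab_size=10000):
    # Add-one (Laplace) smoothing gives unseen bigrams a small probability.
    return (bigram_counts.get((prev, word), 0) + 1) / \
           (unigram_counts[prev] + vocab_size)

def sentence_score(words):
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= bigram_prob(prev, word)
    return score

s_bear = sentence_score(["a", "bear", "attacked", "him"])
s_bare = sentence_score(["a", "bare", "attacked", "him"])
print(s_bear > s_bare)  # True
```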
5. Decoder
The decoder is an algorithm that tries to find the utterance that maximizes the
probability that a given sequence of speech sounds corresponds to that utterance. This
is a search problem, and especially in large vocabulary systems careful consideration
must be given to questions of efficiency and optimization, for example to whether the
decoder should pursue only the most likely hypothesis or a number of them in parallel
(Young, 1996). An exhaustive search of all possible completions of an utterance
might ultimately be more accurate but of questionable value if one has to wait two
days to get a result. Therefore, trade-offs are made to maximize the quality of the
search results while at the same time minimizing CPU usage and recognition time.
2.3 Phonetics and Arabic Phonology
Phonetics studies all the sounds of speech, trying to describe how they are made, to
classify them, and to give some idea of their nature. Phonetic investigation shows that
human beings are capable of producing an enormous number of speech sounds,
because the range of articulatory possibilities is vast, although each language uses
only some of the sounds that are available [11]. Even more importantly, each language
organizes and makes use of the sounds in its own particular way.
The study of the selection that each language makes from the vast range of possible
speech sounds and of how each language organizes and uses the selection it makes is
called Phonology. In other words, Phonetics describes and classifies the speech
sounds and their nature while Phonology studies how they work together and how
they are used in a certain language where differences among sounds serve to indicate
distinctions of meaning [11].
Obviously, not all the differences between speech sounds are significant; moreover,
the difference between two speech sounds can be significant in one language but
not in another. A list of sounds whose differences from one another are
significant can be built up by comparing words of the same language. These
significant or distinctive sounds are the elements of the sound system and are known
as phonemes, whereas the different sounds that do not make any difference are known
as allophones [11].
In Arabic, there are 37 distinct phonemes [12], but when it comes to the Holy Qur'an,
the nature of its rules requires defining a new set of phonemes, since distinguishing
between the correct and wrong ways of reading under some rules cannot be done using
only the standard set of Arabic phonemes. Below is the set of phonemes we defined
for some (not all) of the Qur'an phonemes that we needed for the rules we teach in
our system.
Index Notation
1 /a:/
2 /b/
3 /t/
4 /th/
5
6 /dj/
7 /g/
8 /j/
9 /h:/
10 /kh/
11 /d/
12 /dh/
13
14 /r/
15 /z/
16 /s/
17 /sh/
18 /s:/
19 /d:/
20 /t:/
21 /zh:/
22 /z:/
23 /e/
24 /gh/
25 /f/
26 /q/
27 /k/
28 /l/
29 /m/
30 /n/
31 /h/
32 /w/
33 /y/
34 /a_l/
35 /a_h/
36 /i/
37 /u/
38 /aa_h/
39 /aa_l/
40 /uu/
41 /ii/
Table [2] Phoneme set
Turning to phonetics, there are many classifications of Arabic speech sounds; the
following is a list of these classifications, organized along three different bases
[12, 13, 14].
In the human speech production process, the most basic way to classify speech sounds
is to separate them into two groups of vowels and consonants according to whether or
not they involve significant constriction of the vocal tract.
• Vowels: -----
• Consonants: the rest of Arabic letters.
The second basis of classification is voicing. According to their voicing properties,
Arabic phonemes can be classified into:
• Glottal stop ( ):
• Unvoiced: -----------
• Voiced: the rest of Arabic letters.
The third classification is according to place of articulation:
• From Larynx ( ): -
• From Throat ( ): –
• Velar ( - ):
• From soft-palate ( ): ---
• From hard-palate ( ): – -
• From Gum ( ): --
• Alveolar ( - ): ------
• Dental ( ): --
• Labiodental ( ) :
• Bilabial ( ): -
There is also another secondary classification according to the properties of some
particular phonemes:
• Emphasis ( ): ------
Here, high-emphasis phonemes are: --- and semi-emphasized phonemes are: --
• Sibilant( ): ---
• Extent ( ):
• Spread ( ):
• Deviation ( ): -
• Unrest ( ): ----
• Snuffle( ): -
For the first classification into consonants and vowels, phonemes can be further
divided into the following subclasses:
Consonants are classified by manner of articulation into:
• Plosives or stops ( ): ------ --
• Fricatives ( ): ---------- ----
• Laterals ( ):
• Trills ( ):
• Affricates ( ):
• Nasals ( ): –
• Glides ( ): –
• Liquids ( ): ---
Vowels on the other hand have different classifications; the first is according to the
tongue hump position:
• Back: -
• Mid: -
• Front: -
Another classification of vowels:
• Long vowels: - -
• Short vowels: --
A third classification of vowels:
• IVowels: -
• UVowels: -
• AVowels: -
The last classification for vowels is according to lip rounding:
• With rounding lips: -
• Without rounding: ---
These classifications will be useful later for grouping phonemes with similar
properties when building the recognizer.
2.4 Speech Signal Processing
2.4.1 Feature Extraction
Once a signal has been sampled, we have huge amounts of data, often 16,000 16-bit
numbers per second! We need to find ways to concisely capture the properties of the
signal that are important for speech recognition before we can do much else. Probably
the most important parametric representation of speech is the spectral representation
of the signal, as seen in a spectrogram1 which contains much of the information we
need. We can obtain the spectral information from a segment of the speech signal
using an algorithm called the Fast Fourier Transform. But even a spectrogram is far too complex a representation to base a speech recognizer on. This section describes
some methods for characterizing the spectra in more concise terms [15].
Filter Banks: One way to more concisely characterize the signal is by a filter bank.
We divide the frequency range of interest (say 100-8000Hz) into N bands and
measure the overall intensity in each band. This can be computed using spectral
analysis software (such as the Fast Fourier Transform). In a uniform filter bank, each
frequency band is of equal size. For instance, if we used 8 ranges, the bands might
cover the frequency ranges: 100Hz-1000Hz, 1000Hz-2000Hz, 2000Hz-3000Hz, ...,
7000Hz-8000Hz.
But is it a good representation? We'd need to compare the representations of different
vowels, for example, and see whether the vector reflects differences in these vowels or
not. If we do this, we’ll see there are some problems with a uniform filter bank. So, a
better alternative is to organize the ranges using a logarithmic scale. Another
alternative is to design a non-uniform set of frequency bands that has no simple
mathematical characterization but better reflects the responses of the ear as
determined from experimentation. One very common design is based on perceptual
studies to define critical bands in the spectra. A commonly used critical band scale is
called the Mel scale which is essentially linear up to 1000 Hz and logarithmic after
that. For instance, we might start the ranges at 200 Hz, 400 Hz, 630 Hz, 920 Hz, 1270
Hz, 1720 Hz, 2320 Hz, and 3200 Hz.
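The Mel scale mentioned above is commonly given an analytic form; this particular formula is a widespread convention rather than something stated in the text:

```python
import math

# Standard analytic form of the Mel scale: approximately linear below
# 1 kHz and logarithmic above it. This specific formula is a common
# convention, not one given in the report.

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Band edges equally spaced on the Mel axis bunch together at low
# frequencies and spread out at high ones, as in the ranges above:
edges_hz = [round(mel_to_hz(m)) for m in range(0, 2801, 400)]
print(edges_hz)
```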
LPC: A different method of encoding a speech signal is called Linear Predictive
Coding (LPC). The basic idea of LPC is to represent the value of the signal over some
window at time t, s(t), in terms of an equation of the past n samples, i.e.,
s(t) ≈ a_1·s(t−1) + a_2·s(t−2) + ... + a_n·s(t−n)
1 A spectrogram is an image that represents the time-varying spectrum of a signal. The x-axis
represents time, the y-axis frequency, and the pixel intensity represents the amount of energy in
frequency band y at time x.
Of course, we usually can't find a set of a_i's that gives an exact answer for every
sample in the window, so we must settle for the best approximation, the one that
minimizes the error.
MFCC: Another technique that has proven to be effective in practice is to compute a
different set of vectors based on what are called the Mel Frequency Cepstral
Coefficients (MFCC). These coefficients provide a different characterization of the
spectra than filter banks and work better in practice. To compute these coefficients,
we start with a filter bank representation of the spectra. Since we are using the banks
as an intermediate representation, we can use a larger number of banks to get a better
representation of the spectra. For instance, we might use a Mel scale over 14 banks
(ranges starting at 200, 260, 353, 493, 698, 1380, 1880, 2487, 3192, 3976, 4823,
5717, 6644, and 7595). The MFCCs are then computed using the following formula:
c_i = Σ_{j=1..M} f_j · cos(i·π·(j − 0.5)/M),  for i = 0, 1, ..., N−1
where f_j is the output of the j-th of the M filter banks and N is the desired number
of coefficients. What this is doing is computing a weighted sum over the filter banks
based on a cosine curve. The first coefficient, c0, is
simply the sum of all the filter banks, since i = 0 makes the argument to the cosine
function 0 throughout, and cos(0)=1. In essence it is an estimate of the overall
intensity of the spectrum weighting all frequencies equally. The coefficient c1 uses a
weighting that is one half of a cosine cycle, so computes a value that compares the
low frequencies to the high frequencies. The function for c2 is one cycle of the cosine
function, while for c3 it is one and a half cycles, and so on.
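The cosine-weighted sum over the filter banks can be sketched directly. The 14 filter-bank energies below are invented example values; the weighting follows the description above (c0 equals the sum of the banks, c1 uses half a cosine cycle, and so on):

```python
import math

# Sketch of the cosine-weighted sum over filter-bank outputs described
# above. The 14 filter-bank energies are invented example values.

def mfcc(fbank, num_coeffs):
    """Compute cepstral coefficients c_0 .. c_{num_coeffs-1} from
    filter-bank outputs using a cosine (DCT-like) weighting."""
    m = len(fbank)
    coeffs = []
    for i in range(num_coeffs):
        c = sum(f * math.cos(i * math.pi * (j + 0.5) / m)
                for j, f in enumerate(fbank))
        coeffs.append(c)
    return coeffs

fbank = [3.1, 2.9, 2.5, 2.0, 1.6, 1.3, 1.1,
         1.0, 0.9, 0.8, 0.7, 0.7, 0.6, 0.6]
c = mfcc(fbank, 4)
# c[0] is simply the sum of all banks, since cos(0) = 1 everywhere:
print(abs(c[0] - sum(fbank)) < 1e-9)  # True
```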
2.4.2 Building Effective Vector Representations of Speech
Whether we use the filter bank approach, the LPC approach or any other approach, we
end up with a small set of numbers that characterize the signal. For instance, if we
used the Mel scale to divide the spectrum into eight frequency ranges, we have reduced
the representation of the signal over the 20 ms segment to a vector consisting of eight
numbers. With a 10 ms shift in each segment, we are representing the signal by one of
these vectors every 10 ms. This is certainly a dramatic reduction in the space needed
to represent the signal. Rather than 16,000 numbers per second, we now represent the
signal by 800 numbers per second!
Just using the eight spectral measures, however, is not sufficient for large-vocabulary
speech recognition tasks. Additional measurements are often taken that capture
aspects of the signal not adequately represented in the spectrum. Here are a few
additional measurements that are often used:
Power: It is a measure of the overall intensity. If the segment S_k contains N samples
of the signal, s(0), ..., s(N−1), then the power power(S_k) is computed as follows:
Power(S_k) = Σ_{i=0..N−1} s(i)²
An alternative that doesn't create such a wide difference between loud and soft sounds
uses the absolute value:
Power(S_k) = Σ_{i=0..N−1} |s(i)|
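Both power measures can be transcribed almost literally:

```python
# The two power measures described above, over a segment of N samples.

def power_squared(segment):
    """Sum of squared sample values over the segment."""
    return sum(s * s for s in segment)

def power_abs(segment):
    """Sum of absolute sample values, which compresses the gap
    between loud and soft sounds."""
    return sum(abs(s) for s in segment)

seg = [0.5, -0.5, 0.25, -0.25]
print(power_squared(seg))  # 0.625
print(power_abs(seg))      # 1.5
```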
One problem with direct power measurements is that the representation is very
sensitive to how loud the speaker is speaking. To adjust for this, the power can be
normalized by an estimate of the maximum power. For instance, if P is the maximum
power within the last 2 seconds, the normalized power of the new segment would be
power(Sk )/P. The power is an excellent indicator of the voiced/unvoiced distinction,
and if the signal is especially noise-free, it can be used to separate silence from
low-intensity speech such as unvoiced fricatives. We do not need it alongside MFCCs,
however, since the power is already well estimated by the c0 coefficient.
Power Difference: The spectral representation captures the static aspects of a signal
over the segment, but we have seen that there is much information in the transitions in
speech. One way to capture some of this is to add a measure to each segment that
reflects the change in power surrounding it. For instance, we could set:
PowerDiff(S_k) = power(S_{k+1}) − power(S_{k−1}).
Such a measure would be very useful for detecting stops.
Spectral Shifts: Besides shifts in overall intensity, we saw that frequency shifts in the
formants can be quite distinctive, especially in looking at the effects of consonants
next to vowels. We can capture some of this information by looking at the difference
in the spectral measures in each frequency band. For instance, if we have eight
frequency intensity measures for segment S_k, f_k(1), ..., f_k(8), then we can define the
spectral change for each segment as with the power difference, i.e.,
df_k(i) = f_{k+1}(i) − f_{k−1}(i)
With all these measurements, we would end up with an 18-number vector: the eight
spectral band measures, the eight spectral band differences, the overall power, and the
power difference. This is a reasonable approximation of the types of representations
used in current state-of-the-art speech recognition systems. Some systems add another
set of values that represent the "acceleration", computed by taking the differences
between the df_k values.
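The difference ("delta") measures described above, power difference and spectral shift alike, amount to subtracting the previous frame's value from the next frame's value in each feature dimension. A minimal sketch:

```python
# Sketch of the difference ("delta") measures described above: for each
# interior frame k, subtract the previous frame's feature values from
# the next frame's, per dimension. The frame values are illustrative.

def deltas(frames):
    """frames: list of equal-length feature vectors (lists of floats).
    Returns one difference vector per interior frame."""
    out = []
    for k in range(1, len(frames) - 1):
        out.append([nxt - prv
                    for prv, nxt in zip(frames[k - 1], frames[k + 1])])
    return out

frames = [[1.0, 2.0], [1.5, 2.5], [3.0, 1.0], [3.5, 0.5]]
print(deltas(frames))  # [[2.0, -1.0], [2.0, -2.0]]
```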
2.5 HMM
2.5.1 Introduction
A hidden Markov model (HMM) is a stochastic generative process that is particularly
well suited to modeling time-varying patterns such as speech. HMMs represent
speech as a sequence of observation vectors derived from a probabilistic function of a
first-order Markov chain. Model ‘states’ are identified with an output probability
distribution that describes pronunciation variations, and states are connected by
probabilistic ‘transitions’ that capture durational structure. An HMM can thus be
used as a ‘maximum likelihood classifier’ to compute the probability of a sequence of
words given a sequence of acoustic observations using Viterbi search. The basics of
HMM will be discussed in the following sub-subsections. More information can be
found in [14, 16 and 17].
2.5.2 Markov Model
In order to understand the HMM, we must first look at a Markov model and a
stochastic process in general. A stochastic process specifies certain probabilities of
some events and the relations between the probabilities of the events in the same
process at different times. A process is called Markovian if the probability at one time
is only conditioned on a finite history. Therefore, a Markov model is defined as a
finite state machine which changes state once every time unit. State is a concept used
to help understand the time evolution of a Markov process. Being in a certain state at
a certain time is then the basic event in a Markov process. A whole Markov process
thus produces a sequence of states S= s1, s2 … sT.
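The probability of a whole state sequence in such a Markov model is simply the product of one-step transition probabilities. A small illustration with invented states and numbers:

```python
# A small Markov chain: the probability of a whole state sequence is
# the product of one-step transition probabilities. The states and
# probability values here are invented for illustration.

transitions = {
    ("s1", "s1"): 0.6, ("s1", "s2"): 0.4,
    ("s2", "s2"): 0.7, ("s2", "s3"): 0.3,
    ("s3", "s3"): 1.0,
}

def sequence_prob(states):
    """P(s1, s2, ..., sT) as a product of transition probabilities."""
    p = 1.0
    for a, b in zip(states, states[1:]):
        p *= transitions.get((a, b), 0.0)
    return p

print(sequence_prob(["s1", "s1", "s2", "s3"]))  # 0.6 * 0.4 * 0.3 ≈ 0.072
```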
2.5.3 Hidden Markov Model
The HMM is an extension of a Markov process. A hidden Markov model can be
viewed as a Markov chain where each state generates a set of observations. You only
see the observations, and the goal is to infer the hidden state sequence. For example,
the hidden states may represent words or phonemes, and the observations represent
the acoustic signal. Figure [3] shows an example of such a process, where the six-state
model moves through the state sequence S = 1, 2, 2, 3, 4, 4, 5, 6 in order to generate
the sequence o1 to o6.
Figure [3] The Markov Generation Model
Each time t that a state j is entered, a speech vector ot is generated from the
probability density b j(ot ). Furthermore, the transition from state i to state j is also
probabilistic and is governed by the discrete probability aij .
Thus, we can see that the stochastic process of an HMM is characterized by two sets
of probabilities.
The first set is the transition probabilities, defined as:
a_ij = P(s_t = j | s_{t−1} = i)
This can also be written in matrix form A ={ aij }. For the Markov process itself,
when the previous state is known, there is a certain probability to transit to each of the
other states.
The second is the observation probability where the speech signal is converted into a
time sequence of observation vectors ot defined in an acoustic space. The sequence of
vectors is called an observation sequence O = o1, o2, ..., oT, with each o_t a static
representation of the speech at time t. The observation probability is defined as:
b_j(o_t) = P(o_t | s_t = j)
with its matrix form B = {b_j}.
The composition of the parameters M = (A, B) defines an HMM. (In the HMM
literature there is another set of parameters, the probabilities that the HMM starts in
each state at the initial time, Π = {π_j}.) The model then becomes λ = (A, B, Π),
depending on three parameter sets. However, for cases like ours where the HMM
always starts at the first state (π_1 = 1), this Π can be included in A.
2.5.4 Speech recognition with HMM
The basic way of using HMM in speech recognition is to model different well defined
phonetic units wl (e.g., words or sub-word units or phonemes) in an inventory { wl }
for the recognition task, with a set of HMMs (each with parameter λl ). To recognize a
word w_k from an unknown O is basically to find:
w_k = argmax_{l} P(w_l | O)
The probability P is usually calculated indirectly using Bayes' rule:
P(w_l | O) = P(O | w_l)·P(w_l) / P(O)
Here P(O) is constant for a given O over all possible w_l. The a priori probability
P(w_l) only concerns the language model of the given task, which we assume here to be
constant too. Then the problem of recognition is converted to the calculation of
P(O | w_l). But we use λ_l to model w_l; therefore we actually need to calculate
P(O | λ_l).
We can see that the joint probability of O and S being generated by the model λ can
be calculated as follows:
P(O, S | λ) = P(S | λ) · P(O | S, λ)
where the transitions occurring at different times and in different states are
independent, and therefore:
P(S | λ) = a_{s0 s1} · a_{s1 s2} · ... · a_{s_{T−1} s_T}
And for a given state sequence S, the observation probability is:
P(O | S, λ) = b_{s1}(o1) · b_{s2}(o2) · ... · b_{sT}(oT)
However, in reality the state sequence S is unknown, so one has to sum the
probability P(O, S | λ) over all S in order to get P(O | λ) = Σ_S P(O, S | λ).
2.5.5 Three essential problems
In order to use HMM in ASR, a number of practical problems have to be solved.
1. The evaluation problem: One has to evaluate the value P(O | λ) given only O
and λ, but not S. Without an efficient algorithm, one has to sum over n^T possible
state sequences S, with a total of roughly 2T·n^T calculations, which is impractical.
2. The estimation problem: The values of all λ_l in a system have to be determined
from a set of sample data. This is called training . The problem is how to get an
optimal set of λl that leads to the best recognition result, given a training set.
3. The decoding problem: Given a set of well-trained λ_l and an O with an unknown
identity, one has to find P(O | λ_l) for all λ_l. In the recognition process, for each
single λ_l one hopes, instead of summing over all S, to find a single sequence S_M
that is most likely associated with O. S_M also provides information about the
boundaries between the concatenated phonetic or linguistic units that are most likely
associated with O. The term decoding refers to finding the way that O is coded onto
S. Both the training and recognition processes of a recognition system involve
problem 1.
2.5.6 Two important algorithms
The two important algorithms that solve the essential problems are both named after
their inventors: the Baum-Welch algorithm (Baum et al., 1970) for parameter
estimation in training, and the Viterbi algorithm for decoding in recognition (in some
recognizers the Viterbi algorithm is also used for training).
The essential part of the Baum-Welch algorithm is a so-called expectation-
maximization (EM) procedure, used to overcome the difficulty of incomplete
information about the training data (the unknown state sequence). In the most
commonly used implementation of the EM procedure for speech recognition, a
maximum-likelihood (ML) criterion is used. The solutions of the ML equations give
closed-form formulae for updating the HMM parameters given their old values. In
order to obtain good parameters, a good initial set of parameters is essential, since the
Baum-Welch algorithm only finds a local optimum. However, for speech recognition,
such a solution often leads to sufficiently good performance.
The basic shortcoming of the ML training is that maximizing the likelihood that the
model parameters generate the training observations is not directly related to the
actual goal of reducing the recognition error, which is to maximize the discrimination
between the classes of patterns in speech.
The Viterbi algorithm essentially avoids searching through an unmanageably large
space of HMM states to find the most likely state sequence S_M by using step-wise
optimal transitions. In most cases the state sequence S_M yields satisfactory results
for recognition, but in other cases S_M does not give rise to the state sequence
corresponding to the correct words.
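The step-wise optimal search just described can be sketched as follows for a toy discrete-observation HMM; the model parameters are invented for illustration:

```python
# Sketch of the Viterbi algorithm: instead of summing over all state
# sequences, keep only the best-scoring path into each state at each
# step, then backtrack. Toy HMM; all parameter values are invented.

A  = [[0.7, 0.3],            # transition probabilities
      [0.4, 0.6]]
B  = [{"x": 0.9, "y": 0.1},  # observation probabilities per state
      {"x": 0.2, "y": 0.8}]
pi = [0.5, 0.5]              # initial state probabilities

def viterbi(obs):
    """Return the most likely hidden state sequence for obs."""
    n = len(pi)
    score = [pi[j] * B[j][obs[0]] for j in range(n)]
    backpointers = []
    for o in obs[1:]:
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] * A[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] * A[best_i][j] * B[j][o])
        score = new_score
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))

print(viterbi(["x", "x", "y", "y"]))  # [0, 0, 1, 1]
```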
2.6 HTK
One of the most widely used tools for speech recognition research is the HMM
Toolkit, abbreviated as HTK. It is a well-known, free toolkit for research into
automatic speech recognition and other pattern recognition problems such as
handwriting recognition and face recognition. It has been developed by the Speech,
Vision and Robotics Group at the Cambridge University Engineering Department and
Entropic Ltd [18].
The toolkit consists of a set of modules for building hidden Markov models (HMMs)
which can be called from both the command line and script files. The following are
their main functions:
1. Receiving audio input from the user.
2. Coding the audio files.
3. Building the grammar and dictionary for the application.
4. Attaching the recorded utterances to their corresponding transcriptions.
5. Building the HMMs.
6. Adjusting the parameters of the HMMs using the training sets.
7. Recognizing the user's speech using the Viterbi algorithm.
8. Comparing the testing speech patterns with the reference speech patterns.
In actual processing, HTK first parameterizes the speech data into features of various
forms, such as Linear Predictive Coding (LPC) coefficients and Mel-cepstrum
coefficients. Then it estimates the HMM parameters using the Baum-Welch algorithm
for training. Recognition tests are executed by estimating the best hypothesis from the
given feature vectors and a language model, using the Viterbi algorithm, which finds
the maximum-likelihood state sequence. Results are given as a recognition percentage
as well as the numbers of deletion, substitution and insertion errors.
Chapter 3: Design and Implementation
3. DESIGN AND IMPLEMENTATION
3.1. Approach
The approach we adopted in our system considers the systemic and structural
differences between the learner's utterance and the correct utterance as phone
insertions, deletions and substitutions [19].
This requires a phone recognizer trained on the correct phones and on the wrong ones
that may be inserted or substituted by the learner. Knowledge of phonetics, phonology
and pedagogy is needed to know the different possible mispronunciations of each
phone.
An example of a phone substitution problem in the word " " is shown in figure [4],
where learners usually encounter the problem of the emphatic pronunciation of the
first letter ( ) of this word, which appears in the vowel after it. So, the
correct phone /a_l/ may be replaced with /a_h/ (see the phonology table in section
2.3).
Figure [4] Phone substitution in the word " ": start → /n/ → {/a_l/ | /a_h/} → /r/ → /s:/ → end
Our handling of this rule ( ) considers that both cases of pronouncing
the letter ( ) are represented by the same phone, as the difference
usually appears in the vowel rather than the consonant, although there is a slight
difference in their acoustic properties, except in a few cases such as the letter " ",
which becomes " " when pronounced with emphasis ( ).
Building a suitable database covering all possible right and wrong phones is easy, as
most of the phones in the Holy Qur'an are not new to ordinary Arabic speakers.
With this approach we can detect pronunciation errors for various rules other than
this rule ( ), such as the problems of pronouncing particular letters like "" and " ",
and the rule of )( . But other rules, such as ( ), require a different kind of
handling which we do not deal with in our system.
There is another approach that depends on assessing pronunciation quality, and it may
tolerate more recognition noise [Witt and Young, 2000; Neumeyer et al., 2000;
Franco et al., 2000]. The judgment in this approach is usually required to correlate
well with human judges, which makes it less objective and harder to implement than
our approach, which requires accurate and precise phoneme recognition.
3.2. Design
3.2.1 System Design
Since our system is considered a model from which a bigger and more inclusive
system dealing with the different Qur'an recitation rules can be built, the design of
our system had to be scalable and modular.
Based on the approach mentioned in the previous section, we decided to build our
model for teaching the rule of ( ) for 8 letters. We selected them from
the letters that learners can mispronounce in ( ), so that we can take this sura
as a test for the learner to measure his performance after learning.
A complete scenario explaining how the system works would be the best way to
present the system’s design.
Figure [5] System Design
The first screen that appears to the user is the login screen, used to identify his
profile and determine which lessons he has learned and which he has not. After that,
he takes a session for the new lesson, listening to an explanation of the rule to be
learned, and then he starts training on some words from that lesson, according to the
following scenario, as shown in figure [5].
1- After being asked to repeat a word he has just listened to for training, the user's
utterance is captured via the microphone by the GUI.
2- The utterance is saved in a .WAV file.
3- The file is passed by the GUI to the recognizer.
4- The recognizer performs the decoding process and passes the recognized word,
be it correct or wrong, as it is to the string comparator.
5- The string comparator compares the recognized word with the reference word.
6- The output of the comparison (the pronunciation difference) is then passed to the
User Profile Analyzer.
7- The User Profile Analyzer checks the user profile and determines which
mistakes the user should receive feedback about depending on the lessons he
already passed.
8- The mistakes are then passed to the feedback generator.
9- The feedback generator generates the feedback and passes it to the GUI.
10- The GUI displays the feedback to user.
As the figure shows, the system consists of six main modules other than the GUI; the
following is a brief description of each:
1- Recognizer
After the user's utterance is perceived through the microphone and saved in a .WAV
file, it is passed to an HMM-based phone-level recognizer along with a phone-level
grammar file containing the phones of both the reference word and the expected
mistaken word. The recognizer determines which of the two the utterance is closer to
and outputs a text file containing the phones of the recognized word.
2- Recognizer Interface
The recognizer runs in a DOS shell, which is not a user-friendly interface, especially
for feedback-oriented applications. So an interface was built between our GUI and
the recognizer to overcome this problem.
3- String Comparator
The recognizer passes the phones of the recognized word to the string comparator,
which holds the reference word of the current lesson, compares the two, and
passes the difference at every phone to the User Profile Analyzer.
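Since the comparator must surface phone-level insertions, deletions, and substitutions, its job resembles a sequence diff. The sketch below uses Python's standard difflib rather than the project's C# code, and the phone names are illustrative:

```python
# Minimal sketch of phone-level diffing (insertions, deletions, substitutions)
# using difflib; the actual module compares against the current lesson's
# reference word. Phone names below are illustrative.
from difflib import SequenceMatcher

def phone_diff(reference, recognized):
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(
            None, reference, recognized).get_opcodes():
        if tag == "replace":
            ops.append(("substitution", reference[i1:i2], recognized[j1:j2]))
        elif tag == "delete":
            ops.append(("deletion", reference[i1:i2], []))
        elif tag == "insert":
            ops.append(("insertion", [], recognized[j1:j2]))
    return ops

# e.g. two plain vowels replaced by their emphasized counterparts:
diff = phone_diff(["a_l", "b", "k_l"], ["a_h", "b", "k_h"])
```

Matching phones produce no output; only the mismatched spans are reported, which is exactly what the feedback stages downstream need.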
4- User Profile Analyzer
After the user succeeds in a certain lesson, his profile is updated and the lesson is
added to it; in subsequent lessons he is expected not to repeat mistakes related to
lessons he has already learned. For example, suppose the user has learned only the
lesson teaching ( ): when he tries to recite ( ), he gets feedback only on
his mistakes in ( ). If he then learns and passes another lesson teaching
( ) and recites ( ) again, he gets feedback on his mistakes in both letters, as both are now saved in his profile.
5- Auxiliary Database
On starting a training or testing session, some values must be initialized to control
the session lifetime: for example, which word(s) will appear for the user to utter,
what the reference transcription of those word(s) is, etc. All of this information is
stored in this database.
6- Feedback Generator
After the mistakes are filtered according to the user's knowledge, the feedback
generator analyzes them and determines the suitable method of guiding the user to correct them.
3.2.2 Database Design
As for the rule of ( ), we have chosen 8 letters which are present in
( ). A speech database was built covering the right and wrong pronunciations
of these letters to train the recognizer.
Following the lessons of Sheikh Ahmed Amer, we adopted his methodology: ordinary
Arabic speakers by default pronounce a letter correctly in some words but not in
others, despite their ignorance of the rules, because in some cases the nature of the
word itself forces the speaker to pronounce it correctly.
So, to cover both cases for each letter, the training database was chosen to contain
four words per letter: two containing the letter read with the mistaken
pronunciation, and two other words containing the letter read with the correct one.
For example, for ( ), we have the words ( ):
the first two are usually mistaken, with the user emphasizing ( ), whereas
in the last two the letter is always pronounced correctly. Each speaker reads these
four words per letter three times. A list of all the words used for training can be
found in Appendix B.
3.2.3 Constraints
The design of our system was based on a few assumptions:
Calm environment
The system performs better in a relatively calm environment; noise beyond a
certain level can degrade the recognition accuracy.
Cooperative user
The word to be pronounced is displayed on the screen; the user is expected either
to pronounce it correctly or to mispronounce the letter being taught. The
system does not deal with unexpected words.
Male user
This version of the system has been trained with young male users' voices only, so
to serve female users, new models have to be constructed and trained with female
users' voices.
3.3. Speech Recognition with HTK
In this project, several experiments were done using HTK v3.2.1 to build HMMs for
different recognizers: first a small English digit recognizer to learn and test the
tool, then a small Arabic word recognizer as a first attempt at Arabic speech
recognition, and finally the prototype and the speaker-dependent and speaker-
independent versions of the project core.
In this section, we explain the steps followed to build such recognizers. Details of
using each tool can be found in the HTK manual [18].
3.3.1 Data preparation
Recording the data
The first stage in developing a recognizer is building a speech database for training
and testing. Although HTK provides a tool (HSLab) for recording and labeling data,
we used an easier, more user-friendly program for that purpose: Cool Edit
Pro v2. Speech is recorded via a desktop microphone and sampled at 16 kHz with 16
bits per sample. It is saved in the Windows PCM format as a WAV file. Cool Edit is
then used to segment the data word by word, giving each word a distinct label.
Creating the Transcription Files
To train a set of HMMs, every file of training data must have an associated phone-level transcription. This was done manually by writing the phone-level
transcription of each word, all in a single Master Label File (MLF) in the standard
HTK format.
Coding the data
Speech is then coded using the tool HCopy: the signal is first separated into frames
of 10 ms length, and those frames are then converted into feature vectors (MFCC
coefficients in our case).
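As a rough illustration of the framing step (the MFCC computation itself is left to HCopy), a signal can be cut into overlapping windows advanced by the 10 ms frame shift. The 25 ms analysis window below is a common convention, not a value stated in this report:

```python
import numpy as np

# Sketch of the framing HCopy performs before computing MFCCs: a 16 kHz
# signal cut into windows advanced by a 10 ms shift. The 25 ms window
# length is a common convention, assumed here for illustration.
def frame_signal(signal, sample_rate=16000, shift_ms=10, window_ms=25):
    shift = int(sample_rate * shift_ms / 1000)     # 160 samples per shift
    window = int(sample_rate * window_ms / 1000)   # 400 samples per window
    n_frames = 1 + max(0, (len(signal) - window) // shift)
    return np.stack([signal[i * shift : i * shift + window]
                     for i in range(n_frames)])

frames = frame_signal(np.zeros(16000))  # one second of audio
```

Each row of the result would then be windowed and converted into one MFCC feature vector.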
3.3.2 Creating Monophone HMMs
Creating Flat Start Monophones
The first step in HMM training is to create a prototype model defining the model
topology. In phone-level recognizers, the model usually consists of three emitting
states plus one entry and one exit state.
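As a sketch of this topology, the transition matrix of such a prototype is strictly left-to-right. The probabilities below are illustrative placeholders, not HTK defaults:

```python
import numpy as np

# Sketch of the left-to-right topology of the prototype model: five states
# in HTK terms (entry + three emitting + exit). Every row except the final
# (exit) row must sum to 1; the probabilities are illustrative.
trans = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # entry -> first emitting state
    [0.0, 0.6, 0.4, 0.0, 0.0],   # self-loop or advance
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.3],   # last emitting state -> exit
    [0.0, 0.0, 0.0, 0.0, 0.0],   # exit (non-emitting, no outgoing arcs)
])
```

The zeros below the diagonal encode the left-to-right constraint: a phone model can stay in a state or advance, but never move backwards.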
After that, an HMM seed is generated by the tool HCompV, which initializes the
prototype model with a global mean and variance computed over all the frames in every
feature file. This variance, scaled by a factor (typically 0.01), is also used as a
variance floor to bound the variances estimated in subsequent steps.
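What this flat start amounts to can be sketched in a few lines; the function below is an illustration of the computation, not HCompV itself:

```python
import numpy as np

# Sketch of a flat start: pool every frame of every feature file, compute a
# single global mean and variance, and derive a variance floor as 0.01
# times the global variance (the typical scaling factor mentioned above).
def flat_start(feature_files):
    frames = np.concatenate(feature_files, axis=0)   # all frames, all files
    mean = frames.mean(axis=0)
    var = frames.var(axis=0)
    var_floor = 0.01 * var                           # flooring factor
    return mean, var, var_floor

# Toy data: two "files" of 5 frames x 3 features each.
mean, var, floor = flat_start([np.ones((5, 3)), np.zeros((5, 3))])
```

Every model then starts from this identical mean and variance, which is why the procedure is called a flat start: the first re-estimation pass is what differentiates the models.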
A copy of this seed is placed in a Master Macro File (MMF) called hmmdefs as the
initialization for every model defined in the HMM list. This list contains all the
models used in the recognition task: a model for each phone in the training data
plus a silence model /sil/ for the start and end of every utterance.
Another file called macros is created that contains the variance floor macro and
defines the HMM parameter kind and the vector size.
HMM Parameter Re-estimation
The flat-start monophones are re-estimated using the embedded training version of the
Baum-Welch algorithm, performed by the tool HERest, whereby every model is
re-estimated from the frames labeled with its corresponding transcription.
This tool is used for re-estimation after every modification to the models, and usually
two or three iterations are performed each time.
Fixing silence models
Forward and backward skip transitions are added to the silence model /sil/ to give
it a longer mean duration. This is performed by the HMM editor tool
HHEd.
3.3.3 Creating Tied-State Triphones
To add a measure of context dependency and refine the models, we create triphone
HMMs, where each phone model is conditioned on both its left and right
context.
First, the label editor HLEd is used to convert the monophone transcriptions to an
equivalent set of triphone transcriptions. Then the HMM editor HHEd is used to
create the triphone HMMs from the triphone list generated by HLEd. This HHEd
command has the effect of tying all of the transition matrices in each triphone set,
so that the HMMs of a set share the same set of transition parameters.
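The monophone-to-triphone conversion can be pictured as follows. The phone names come from this project's transcriptions; leaving silence context-independent is a common convention assumed here, not something this report specifies:

```python
# Sketch of monophone-to-triphone conversion: each phone p with left
# context l and right context r becomes "l-p+r". Treating silence as
# context-independent is a common convention, assumed for illustration.
def to_triphones(phones, skip=("sil",)):
    out = []
    for i, p in enumerate(phones):
        if p in skip:
            out.append(p)
            continue
        left = phones[i - 1] if i > 0 and phones[i - 1] not in skip else None
        right = (phones[i + 1]
                 if i + 1 < len(phones) and phones[i + 1] not in skip else None)
        name = p
        if left:
            name = f"{left}-{name}"
        if right:
            name = f"{name}+{right}"
        out.append(name)
    return out

tri = to_triphones(["sil", "a:_h", "a_h", "h:", "sil"])
```

Each distinct triphone name then gets its own HMM, which is why the model inventory grows so quickly and why the tying and clustering steps below are needed.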
The last step is using decision trees based on asking questions about the left
and right contexts of each triphone. Based on the acoustic differences between phones
according to the classifications mentioned in section 2.3, phones are
clustered using these decision trees for further refinement. The decision tree attempts
to find the contexts which make the largest difference to the acoustics and which
should therefore distinguish clusters.
Decision tree state tying is performed by running HHEd using the QS command for
questions where the questions should progress from wide, general classifications
(such as consonant, vowel, nasal, diphthong, etc.) to specific instances of each phone.
3.3.4 Increasing the number of mixture components
The early stages of triphone construction, particularly state tying, are best done with
single Gaussian models, but it is preferable for the final system to consist of multiple
mixture component context-dependent HMMs instead of single Gaussian HMMs
especially for speaker-independent systems. The optimal number of mixture
components can be obtained only by experiment: gradually increasing the number
of components per state and monitoring the performance after each
change.
The tool HHEd is used for increasing mixture components with the command MU.
3.3.5 Recognition and evaluation
After training the HMMs, the tool HVite is used to perform a Viterbi search over a
recognition lattice containing multiple paths for the words to be recognized.
Our grammar file is written at the phone level to provide multiple paths for phone
insertions, deletions, or substitutions of the word to be recognized. It is written
using HTK's BNF notation; the tool HParse then generates a lattice from this
grammar as input to HVite.
As we used a phone-level grammar, our dictionary is just a list of the phones used,
like the following:
sil sil
a:_h a:_h
a:_l a:_l
… etc.
Note that if some paths in the grammar file produce context-dependent triphones
that have no corresponding models in the HMM set, we copy the HMMs of the
corresponding monophones from before tying and add them to the final models
instead of re-training the models on the new triphones.
To test the recognizer performance, we run HVite on the testing data, writing the
output transcriptions for all test files into a Master Label File. We then use the
tool HResults to compare this file with a reference file of the correct
transcriptions; it reports the percentage of correctly recognized words and phones
along with other statistics.
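The two headline figures HResults reports are computed from the counts of deletions (D), substitutions (S), and insertions (I) against the N reference labels, as defined in the HTK book:

```python
# The two headline scores HResults prints, as defined in the HTK book:
# percent correct ignores insertions, percent accuracy penalizes them.
def htk_scores(n_labels, deletions, substitutions, insertions):
    correct = 100.0 * (n_labels - deletions - substitutions) / n_labels
    accuracy = 100.0 * (n_labels - deletions - substitutions
                        - insertions) / n_labels
    return correct, accuracy

# e.g. 200 reference phones, 2 deleted, 4 substituted, 1 inserted:
corr, acc = htk_scores(200, 2, 4, 1)
```

Accuracy is always less than or equal to percent correct, and can even go negative when insertions dominate, which is why it is the stricter of the two figures.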
3.4. Experiments and results
3.4.1 Prototype
To experiment with our approach, we started with a prototype that distinguishes
between only two pairs of words: the wrong and right pronunciations of
the word )( , and the two cases of the word ( ).
We chose these two words specifically to try identifying pronunciation
errors of more than one phone at the same time, as the learner is prone to
mispronounce the letters " ", " ", " " in these two words, where the mispronunciation
appears in the vowel after each of them.
The grammar file of the recognizer is as follows:
( sil ( a:_h a_h h: b_h a_h t: a_h |
a:_l a_l h: b_l a_l t: a_h |
a:_h a_h b k_h aa_h r_h aa_h |
a:_l a_l b k_l aa_l r_h aa_h |
<sil> )
sil )
As the number of words is small, there was no need to make a grammar file for each
pair, so we included them all in a single grammar. There is also a path in the network for a repeated silence /sil/ to absorb any noise.
A speech database for a single speaker was recorded, where each of the four words
was recorded 15 times.
When we trained and tested the HMMs with the whole database, it gave a 100%
recognition result.
But when we divided the data evenly between training and testing, it gave a result of
96.67% correct, as only one word was misrecognized, which is an acceptable result.
3.4.2 Speaker-dependent system
After the prototype experiment, we started building our complete model, first as a
speaker-dependent system.
Based on the words of the database in Appendix B, speech data was recorded by a
single speaker 20 times; 5 of them are the correct pronunciations of the commonly
mispronounced words.
The result was 100% when the testing data was the whole training data.
When we divided the data between training and testing, with the testing data being
20% of the whole database, it gave a result of 95.37%.
3.4.3 Speaker-independent system
We started our speaker-independent experiment with 13 adult male speakers of
almost the same age, most of them recording the speech data 3 times.
The speech data comprised only the 32 words in Appendix B, without the 3 verses of
( ).
With 3 mixture components per state, testing on the training data gave 98.35%
accuracy, plus other non-significant errors that do not affect detecting the errors
of ( ), such as substituting the letter " " with " " or " " in the word ( ).
Such errors were not fully detected, as only a small number of speakers uttered
them in the training data, and they were not our focus.
When testing with separate test data from 6 speakers, with 3 repetitions each, the
results gave an accuracy of 98.76%, again plus other non-significant errors.
When we trained on the verses of ( ), the accuracy was around 50% in both
cases: training them separately and training them with the other words of the
database, with the training data used as the testing data.
A possible reason for this result is that each verse is not a single word, as in the
rest of the training database, but a full sentence, which introduces the problems of
continuous speech recognition, such as stronger context dependency, and therefore
requires more training.
Besides that, some of the transcriptions of the data were a point of disagreement, as
it was hard to decide what the right transcription of the data was, especially when it
is read quickly.
3.5 Implementation of Other Modules
After building the recognizer, the other modules of the system were implemented using
Microsoft Visual C# .NET. The following explains the implementation of each
module in figure [5].
3.5.1 Recognizer interface
This function is performed by the ProcessingAudio() method. The method starts a new
process for the recognizer, passes the appropriate arguments, hides the DOS shell, and
receives the recognizer's output.
If the recognizer succeeds in capturing the audio, it will return no messages, but it will
write the corresponding transcriptions (according to our phonology) in a file.
If the recognizer fails to capture the audio, the feedback generator will be fired, asking
the user to re-utter the word(s).
3.5.2 String Comparator
The transcription file generated by the recognizer is read by the string
comparator, which compares the recognized word with the reference
correct word.
In this transcription file, each phone is stored on a distinct line. So, in the same
manner as the user profile, the file is read and stored in an array-list
(dynamic array) to facilitate comparison with the reference phones stored in another
array-list.
As mentioned before, there are three kinds of pronunciation mistakes: insertion,
substitution, and deletion. So the module was implemented with three methods:
CheckInsertion(), CheckSubstitution(), and CheckDeletion(). The core of all three
methods was implemented, but only CheckSubstitution() was completed and tested,
as the only mistakes we currently handle fall into this kind.
The implementation of the CheckSubstitution() method is as follows: for every
phone named 'fat-ha' in the reference word, if it differs from the corresponding
phone in the recognized word(s) with respect to emphasis ( ), and the previous
phone belongs to a passed lesson or to the current lesson (in the case of training, not
testing), then the feedback generator is fired to report this mistake; otherwise the
mistake is ignored.
By doing so, we are able to detect all mistakes, but we filter the feedback according
to the user's status.
As observed, we search for the specific vowel phone 'fat-ha', as we assumed
that the emphasized consonant phone is the same as the un-emphasized one and that
the difference appears only in the vowels following the consonant. This assumption
is valid for most consonants; consonants outside this assumption are not handled in
our project, but could be handled by simply adding some additional phones.
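A minimal sketch of this check, written in Python rather than the project's C#, with hypothetical phone names and a hypothetical consonant-to-lesson map:

```python
# Sketch of the CheckSubstitution() logic described above: for every
# 'fat-ha' vowel in the reference, flag an emphasis mismatch only if the
# preceding consonant belongs to an active (passed or current) lesson.
# Phone names and the lesson map are illustrative, not the real inventory.
FATHA = {"a_h", "a_l"}  # emphasized / plain short vowel (hypothetical names)

def check_substitution(reference, recognized, lesson_of, active_lessons):
    mistakes = []
    for i, ref in enumerate(reference):
        if ref in FATHA and i < len(recognized) and recognized[i] != ref:
            consonant = reference[i - 1] if i > 0 else None
            # Report only mistakes the user has been taught about.
            if consonant and lesson_of.get(consonant) in active_lessons:
                mistakes.append((i, consonant, ref, recognized[i]))
    return mistakes

# Plain fat-ha expected after "t:", but the emphasized one was recognized:
m = check_substitution(["t:", "a_l"], ["t:", "a_h"], {"t:": 2}, {2})
```

Note how detection and filtering are fused: a mismatch on a consonant whose lesson is not active is simply dropped rather than reported.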
3.5.3 Auxiliary Database
For every lesson, the user's utterance is tested with two words. To decide which
word appears to the user, and to initialize the reference array-list (dynamic
array) for that word, the method TrainWhat() takes the lesson number and the word
number so that the session can be started, and returns the corresponding word.
On generating the feedback, for each mistaken phone we check its lesson to know
whether the user has passed that lesson or not, and display the appropriate feedback.
The method GetLessonNo() implements this using a simple switch case.
Also on generating feedback, we need a mapping between the phones resulting from
running the recognizer and the corresponding Arabic letters, i.e. a de-transcriptor.
The method Corr_Arabic() implements this by taking a string representing the phone
and passing it through a switch case to return the corresponding Arabic letter.
As observed, switch cases are used frequently for the simplicity of implementation
they allow given the small search space; if the search space grew significantly,
another choice would be needed.
3.5.4 User Profile Analyzer
Two methods implement this analyzer. The first is ReadProfile(), which is called at
the beginning of a training or testing session so that feedback is given only
according to the lesson numbers stored in the profile (i.e. the lessons the user has
passed). In the case of training (not testing), the current lesson is considered in
addition to the passed lessons. The second method is UpdateProfile(), which is called
after the user passes a certain lesson so that it is considered afterwards.
ReadProfile() is implemented as follows: the file is read line by line (every lesson
number is stored on a distinct line), and each lesson number is added to an
array-list (dynamic array) to facilitate searching.
UpdateProfile() is implemented as follows: the array-list created in ReadProfile()
is searched to determine whether the lesson has been passed before; if not, it is
appended to the profile.
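The two methods can be sketched as follows, in Python rather than the project's C#; the file name and one-number-per-line format are taken from the description above, while everything else is illustrative:

```python
# Sketch of ReadProfile()/UpdateProfile() as described: one passed-lesson
# number per line; update appends only if the lesson is new. The file
# name is illustrative.
import os
import tempfile

def read_profile(path):
    if not os.path.exists(path):
        return []                       # new user: empty profile
    with open(path) as f:
        return [int(line) for line in f if line.strip()]

def update_profile(path, lesson):
    if lesson not in read_profile(path):   # only append new lessons
        with open(path, "a") as f:
            f.write(f"{lesson}\n")

path = os.path.join(tempfile.mkdtemp(), "user.profile")
update_profile(path, 1)
update_profile(path, 1)   # duplicate: ignored
update_profile(path, 3)
lessons = read_profile(path)
```

The append-only file keeps the profile trivially persistent between sessions, which is all the feedback filtering needs.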
3.5.5 Feedback Generator
This module collects the messages generated by the string comparator module and
displays them in a suitable way to guide the user in correcting his mistakes.
If no messages are collected, appropriate messages are displayed to guide the
user in completing the learning process.
The format of the reported message is something like:
<< >>
where the words between angle brackets vary according to the letter and the type of
mistake that occurred in uttering it. For instance, a mistake in uttering the word ‘ ’
will produce a message in which the word ‘ ’ (emphasized) represents the type of the
mistake in uttering the letter ‘ ’.
An example of messages reporting correct utterance is as follows:
...
هللا ...
A final example given here is for guiding the user through correcting his mistakes;
messages of this type are like:
...
...
The method FeedbackOut() implements this module with the aid of the auxiliary
database.
3.5.6 GUI
Navigation through the project forms, reaching the scenario mentioned above, is
controlled by the GUI.
Our GUI consists of six forms:
frmUserType: consists of two radio buttons and a command button so that the user can
select his type (registered/unregistered).
frmNewUser: if the user is unregistered, he is brought to this form, which contains
a text box and a command button. When the user enters his name, an empty new
profile is created for him.
frmOldUser: if the user is registered, he is brought to this form, which contains a
combo box (drop-down list) of registered users and a command button. The directory
containing users' profiles is searched to fill the combo box.
frmLessons: contains command buttons for choosing a lesson to listen to, and a
command button for performing a test.
frmListening: for playing the lesson chosen in frmLessons, so it contains cassette-
like buttons for performing this function.
frmTraining: for training the user on the lesson he has heard and testing him.
The scenario mentioned above takes place on this form, so it is the most important
form in the project.
One last thing to mention here is that the layout of these forms was drawn using
Microsoft PowerPoint.
Chapter 4: Conclusion and Future Work
4. CONCLUSION AND FUTURE WORK
In this work, we presented a computer-assisted pronunciation teaching system for a
class of the recitation rules of the Holy Quran. Achieving this work required
building background in various disciplines, which made it very interesting.
Handling and detecting pronunciation errors by identifying phone insertions, deletions,
and substitutions has proven feasible and useful for a considerable class of
recitation rules. Extending the system to cover all the words of the Holy Quran could
be done by a procedure that automatically generates all possible phone paths covering
the different pronunciations of each word, together with robust HMMs trained on a
large database.
The HMM Toolkit was an excellent tool for our experiments and is powerful enough
for further research in this area.
The main problem we encountered in our experiments was building the speech
database, as not all speakers pronounced the words as we expected. This led
to problems in writing the appropriate transcriptions for each utterance, and some
data was rejected entirely. Supervising the recording process for many volunteers was
hard to achieve, and would have at least doubled the time required. Even so, the
results are very satisfying and encourage us to continue in this field.
As for the future work of the system, we aim to experiment with another class
of recitation rules, namely ( ), by building a layer above the
recognizer that counts the number of frames of a recognized phone as an indication of
the length of the vowel or semi-vowel.
We also aim to investigate the possibility of making the system Web-based for
distance learning.
REFERENCES
[1] Eskenazi, M. "Detection of foreign speakers' pronunciation errors for second
language training - preliminary results".
[2] Witt, S.M. and Young, S. (1997) "Computer-assisted pronunciation teaching
based on automatic speech recognition", Language Teaching and Language
Technology.
http://svr-www.eng.cam.ac.uk/~smw24/ltlt.ps
[3] Delmonte, R. "A Prosodic Module for Self-Learning Activities".
http://www.lpl.univ-aix.fr/sp2002/pdf/delmonte.pdf
[4] Gu, L. and Harris, G. "SLAP: A System for the Detection and Correction of Pronunciation for Second Language Acquisition Using HMMs".
[5] Eskenazi, M. "Using Automatic Speech Processing for Foreign Language
Pronunciation Tutoring: Some Issues and a Prototype", LLT Journal, Vol. 2, No.
2, January 1999.
http://llt.msu.edu/vol2num2/article3/index.html
[6] Witt, S. Use of Speech Recognition in Computer-assisted Language Learning. PhD
thesis, Cambridge University, 1999.
[7] Neri, A., Cucchiarini, C. and Strik, H. "Automatic Speech Recognition for second language learning: How and why it actually works".
[8] Survey of the State of the Art in Human Language Technology, Center for Spoken
Language Understanding.
http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html
[9] Padmanabhan, M. and Picheny, M. "Large-Vocabulary Speech Recognition
Algorithms", IEEE Computer Magazine, pp 42-50, April 2002.
[10] Ehsani, F. and Knodt, E. "Speech Technology in Computer-Aided Language
Learning: Strengths and Limitations of a New CALL Paradigm", LLT Journal, Vol. 2, No. 1, July 1998.
http://polyglot.cal.msu.edu/llt/vol2num1/article3/
[11] Moreno, D. "Harmonic Decomposition Applied to Automatic Speech
Recognition".
[12] . ""
[13] . " " .
[14] Rabiner, L. and Juang, B. Fundamentals of Speech Recognition, Prentice Hall, 1993.
[15] Allen, J.F. "Signal Processing for Speech Recognition", Lecture Notes of
CSC 248/448: Speech Recognition and Statistical Language Models, Fall
2003, University of Rochester.
http://www.cs.rochester.edu/u/james/CSC248/Lec13.pdf
[16] Rabiner, L. "A tutorial on hidden Markov models and selected applications in
speech recognition", Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[17] Wang, X. "Incorporating Knowledge on Segmental Duration in HMM-Based
Continuous Speech Recognition".
http://www.fon.hum.uva.nl/wang/ThesisWangXue/chapter2.pdf
[18] Young, S. et al. (2002), The HTK book for version 3.2, Cambridge University.
http://htk.eng.cam.ac.uk/
[19] Kawai, G. and Hirose, K. "A method for measuring the intelligibility and
nonnativeness of phone quality in foreign language pronunciation training".
Appendix A: User Manual
A. USER MANUAL
When you run the program, the
first form you meet carries
the title . If this is
your first time using the
program, choose
(1).
In this case you will be
transferred to another form
where you enter your name and
a new profile is created. Otherwise, you can choose
(2).
You will then be transferred to
another form where you can
select your name from the
drop-down list (3) containing all
registered users.
After logging in, and at any
point in the program, you will
not lose sight of the button
(4), which enables you
to re-login as a different user.
The next form contains the list
of lessons to learn, each titled
with the letter that the lesson
teaches (5).
Then you can listen to the
lesson in the voice of Sheikh
Ahmad Amer by choosing the
Play button (6), return to the
previous form by choosing (7),
choose (8) to test
what you learned, or
start your teaching
session by choosing (9).
This takes you to the teaching
session form, where you can
hear the correct pronunciation
of the word (10) by pressing it,
or start recording (11) your
reading of this word. When
you are done recording,
feedback about your reading is
displayed (12), and you can
listen to your own reading by
pressing the Play button (13).
Appendix B: Training Database
B. TRAINING DATABASE
The following is a list of the words used for training; for each letter, the first
two words are usually pronounced wrongly, and the last two words are pronounced
correctly.
LETTER WORDS