
CHAPTER 1

INTRODUCTION

1.1 OVERVIEW

Speech is the primary means of human communication, and the idea of building machines that mimic human behaviour, particularly the ability to speak naturally and to respond properly to spoken language, has intrigued engineers and scientists for centuries. This has made speech processing and Automatic Speech Recognition (ASR) by machines one of the most attractive areas of research over the past five decades (Biing-Hwang (Fred) Juang and Lawrence Rabiner 2005). ASR is the extraction of linguistic information from an utterance of speech: a real-time, computer-based transcription system that converts spoken language into a sequence of words. This technology enables the computer to communicate with humans by detecting spoken words and following voice commands (Anusuya and Katti 2009).

In its early stages, speech recognition technology was used by people with physical disabilities who often find typing difficult, painful or impossible. It also helps people with spelling difficulties, including those with dyslexia, to produce words that are always correctly spelled. Now that the technology has become more sophisticated, its application areas have widened to wherever a human-machine interface is required, such as telephone networks, query-based information retrieval, creating documents from dictation, medical transcription, language translation, railway reservation, supermarkets, etc. (Anusuya and Katti 2009).

Although many technological improvements have been made in ASR, recognition accuracy is still far from human levels (Alejandro Acero 1990). When speech recognition systems are applied in real-world applications, where there is no possibility of controlling the acoustic environment, a mismatch can arise between the testing and training conditions, causing degradation in performance (Sankar et al. 1996). This happens when the system is not designed with the variability of the on-field environment in mind.

Benzeghiba et al. (2007) have given a detailed review of the different kinds of variabilities in ASR. Two broad groups can be defined: extrinsic (non-speaker-related) and intrinsic (speaker-related) variabilities. Environmental noise and transmission artifacts are two examples of extrinsic variabilities.

Besides varying accents, speaking styles and speaking rates, age and emotional state, it is the shape of the vocal tract that intrinsically contributes to the variability of speech signals representing the same textual content (Florian Muller and Alfred Mertins 2011). The problems originating from different Vocal Tract Lengths (VTLs) become especially apparent in speaker-independent ASR systems (Benzeghiba et al. 2007). The goal of ASR is to have speech as a medium of interaction between man and machine, and it is desirable that an ASR system be robust to these unwanted variabilities (Alan Oppenheim 1969).


Bearing these issues in mind, this research work focuses on the implementation of a noise-resilient, speaker-independent system for continuous speech recognition. Mel-Frequency Cepstral Coefficients (MFCCs) and auditory-transform-based features called Cochlear Filter Cepstral Coefficients (CFCCs), which resemble the peripheral auditory system, have been applied as feature extraction algorithms. To improve the noise robustness of MFCC and CFCC under mismatched training and testing conditions, these features are enhanced by the application of a wavelet-based denoising algorithm called adaptive wavelet thresholding.

Also, to overcome the effects of inter-speaker variability originating from different VTLs, a feature extraction method based on the principle of invariant integration is applied in this research work. This method integrates regular nonlinear functions of the features over the transformation group for which invariance should be achieved. These features are referred to as Invariant-Integration Features (IIFs).

The basic steps involved in the proposed speech recognition algorithm are given in Figure 1.1. As preprocessing of the speech signal is considered a crucial step in the development of a robust and efficient speech or speaker recognition system, preprocessing of the input speech signal is applied as an initial step. In the feature extraction stage, the MFCC features are extracted first, and then an auditory-transform-based feature extraction algorithm is applied. The features generated from the auditory-transform-based feature extraction are named CFCCs.


Figure 1.1 Basic steps involved in the proposed speech recognition algorithm (block diagram: Input Speech Signal → Pre-processing → Feature Extraction (MFCC / CFCC) → Speech Enhancement by Adaptive Wavelet Thresholding → Speaker Invariant Feature Extraction using Invariant Integration Method → Training/Recognition using Neural Network (FFNN) → Recognized Speech)

After feature extraction, denoising is performed to enhance the speech features. In this research work, the wavelet-based denoising algorithm called adaptive wavelet thresholding is applied. The features are then made robust against VTL changes by the application of invariant integration. Finally, the features are trained and recognized using neural networks.

As the dimensionality of the resultant feature vectors after invariant integration is large, the features need to be optimized and classified. For classification, the Feature-Finding Neural Network (FFNN) has been used. The FFNN consists of a feature-extracting network and a linear classifier, a linear single-layer perceptron, that classifies the features in order to recognize the words. Tino Gramss and Hans Werner Strube (1990) applied this classification system to isolated word recognition and showed that the FFNN is faster in recognition than classical Hidden Markov Model (HMM) and Dynamic Time Warping (DTW) recognizers while yielding similar recognition rates. For the optimization of features, the substitution algorithm (Tino Gramss 1991) is applied in this work.
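For illustration, the following sketch shows a substitution-style feature selection loop around a simple least-squares linear classifier, in the spirit of the FFNN and the substitution algorithm; the scoring criterion, the classifier and the data matrices (X_train, y_train, X_val, y_val) are assumptions made for the sketch and do not reproduce the exact formulation of Gramss (1991).

```python
# Hedged sketch of substitution-based feature selection with a linear classifier.
import numpy as np

def linear_accuracy(X_tr, y_tr, X_va, y_va, cols):
    """Train a least-squares linear classifier on the selected feature columns
    and return its word-classification accuracy on the validation set."""
    targets = np.eye(int(y_tr.max()) + 1)[y_tr]                  # one-hot targets
    A = np.hstack([X_tr[:, cols], np.ones((len(X_tr), 1))])      # add bias column
    W, *_ = np.linalg.lstsq(A, targets, rcond=None)              # closed-form training
    A_va = np.hstack([X_va[:, cols], np.ones((len(X_va), 1))])
    return float(np.mean(np.argmax(A_va @ W, axis=1) == y_va))

def substitution_selection(X_tr, y_tr, X_va, y_va, k=20, iterations=100, seed=0):
    """Keep k features; repeatedly drop the least useful one and try a new candidate."""
    rng = np.random.default_rng(seed)
    n = X_tr.shape[1]
    selected = list(rng.choice(n, size=k, replace=False))
    best = linear_accuracy(X_tr, y_tr, X_va, y_va, selected)
    for _ in range(iterations):
        # feature whose removal hurts the validation score the least
        drop = max(range(k), key=lambda i: linear_accuracy(
            X_tr, y_tr, X_va, y_va, selected[:i] + selected[i + 1:]))
        candidate = int(rng.integers(n))
        if candidate in selected:
            continue
        trial = list(selected)
        trial[drop] = candidate                                  # substitute and re-score
        score = linear_accuracy(X_tr, y_tr, X_va, y_va, trial)
        if score > best:
            selected, best = trial, score
    return selected, best
```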

Both the proposed MFCC- and CFCC-based speech recognition systems have been evaluated in a task where the acoustic conditions of training and testing are mismatched, i.e., the training data set has been recorded under clean conditions while the testing data sets were mixed with different types of background noise at various noise levels. The systems have also been tested with matching and mismatching VTLs. In this work, the Speech Separation Challenge database has been used for training and testing, and the experiments have been carried out in Matlab 7.10.

1.2 NEED FOR ROBUST SPEECH RECOGNITION AND ITS IMPORTANCE

'Robustness' is required in order to maintain good speech recognition accuracy even when the quality of the input speech is corrupted, or when the acoustical, articulatory or phonetic characteristics of speech differ between training and use. Some of the recognized obstacles are acoustical degradation caused by additive noise, the effects of linear filtering, nonlinearities in transduction or transmission, impulsive interfering sources, and diminished accuracy caused by changes in articulation produced by the presence of high-intensity noise sources (Richard Stern 1997). Other sources of variation are speaker-to-speaker differences, variations in speech rate, co-articulation, context, and dialect. When the training and the testing conditions differ, the performance of speaker-independent systems also starts to degrade. The invariance of recognition performance under such disturbances is called robustness.

Speech recognition systems have become much more robust in recent years with respect to both speaker variability and acoustical variability. In addition to achieving speaker independence, many current systems can also automatically compensate for modest amounts of acoustical degradation caused by the effects of unknown noise and unknown linear filtering (Pedro Moreno et al. 1995).

As speech recognition and spoken language technologies are transferred to real-world applications, the need for greater robustness in recognition technology is becoming increasingly apparent (Meenakshi Sharma and Salil Khare 2009).

With the rapid development of voice communication and information systems, efficient interactions between users and terminals or remote database systems are required, and robust speaker-dependent and speaker-independent speech recognition technology makes these possible. The first systems are available that can compensate for modest amounts of acoustical degradation caused by unknown noise and unknown linear filtering. Still, the performance of even the best state-of-the-art systems deteriorates heavily under the adverse conditions mentioned above. This is one of the main reasons that prevent ASR from being used in everyday situations, so increased robustness is still a very desirable property in ASR. There exist three different approaches to achieve this goal (Wynand Harmse 2004):


In the first approach, the disturbances are removed from the speech signal before features that carry speech-relevant information are extracted. A number of methods exist to deal with additive or convolutive noise (such as spectral subtraction, processing with the Ephraim-Malah algorithm, or inverse filtering). One downside of such processing is that the application of these techniques produces artifacts in the speech signal, for example due to wrong estimation of the noise signal.
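As an illustration of this first approach, the sketch below implements plain magnitude spectral subtraction with SciPy's STFT; the assumption that the first few frames contain only noise and the flooring constant are illustrative choices, and this is not the enhancement method used in this work.

```python
# Hedged sketch of magnitude spectral subtraction (in the spirit of Boll 1979).
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, floor=0.02):
    # 25 ms frames with a 10 ms hop at fs = 16 kHz (nperseg=400, noverlap=240)
    f, t, Z = stft(noisy, fs=fs, nperseg=400, noverlap=240)
    mag, phase = np.abs(Z), np.angle(Z)
    # noise magnitude estimated from the first frames, assumed speech-free
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # subtract the noise estimate and apply a spectral floor
    clean_mag = np.maximum(mag - noise_mag, floor * noise_mag)
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs,
                        nperseg=400, noverlap=240)
    return enhanced
```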

Another approach is to design a robust feature extraction in which the features are as invariant as possible under adverse acoustical conditions. This approach has been pursued in this research work. In the third approach, the classifier is designed to cope with a large variety of noise signals. This is achieved by training multiple acoustical models with speech under different noise conditions. The problems with this approach are the increase in computational cost and the demand for memory, as well as the automatic selection of the appropriate model depending on the actual acoustical situation.

1.3 GENERAL FEATURE EXTRACTION MODELS IN SPEECH RECOGNITION

Feature extraction is the process of obtaining different features such as power, pitch, and vocal tract configuration from the speech signal, and it is the first crucial component in automatic speech processing. Generally speaking, successful front-end features should carry enough discriminative information for classification or recognition, fit well with the back-end modeling, and be robust with respect to changes of the acoustic environment. At a high level, most speech feature extraction methods fall into the following two categories:

i. Modeling the human voice production system

ii. Modeling the peripheral auditory system.


1.3.1 Human Auditory-System Based Feature Extraction in Speech Recognition

The imitation of the human hearing system is a promising research direction towards improving feature robustness (Qi Li and Yan Huang 2010). Motivated by the fact that the human auditory system outperforms current machine-based systems for acoustic signal processing, much research has been devoted to developing high-performance systems. The traveling waves of the basilar membrane in the cochlea and its impulse response have been measured and reported in the literature (Aage Moller 1977, Sellick et al. 1982). Moreover, basilar membrane tuning and auditory filters have also been studied (Roy Patterson 1976, Brian Moore et al. 1990, Bin Zhou 1995, Kuansan Wang and Shihab Shamma 1995, Dennis Barbour and Xiaoqin Wang 2003). Many electronic and mathematical models have been defined to simulate the traveling wave, the auditory filters, and the frequency responses of the basilar membrane (James Flanagan 1972, Richard Lyon and Carver Mead 1988, James Kates 1991, James Kates 1993, Liu et al. 1992).

1.3.1.1 Human auditory-system

The human ear, as shown in Figure 1.2, has three sections: the outer ear, the middle ear and the inner ear (Xuedong Huang et al. 2001). The outer ear consists of the external visible part and the external auditory canal that forms a tube along which sound travels. This tube is about 2.5 cm long and is closed by the eardrum at the far end. When air pressure variations reach the eardrum from the outside, it vibrates and transmits the vibrations to the bones adjacent to its opposite side (William Tecumseh Sherman Fitch 1994). The eardrum vibrates at the same frequency (alternating compression and rarefaction) as the incoming sound pressure wave. The middle ear is an air-filled space or cavity about 1.3 cm across and about 6 cm³ in volume. Air travels to the middle ear cavity along the tube (when opened) that connects the cavity with the nose and the throat. The mammalian inner ear is a spiral structure, the cochlea (snail), consisting of three fluid-filled chambers, or scalae: the scala vestibuli, the scala media, and the scala tympani. The oval window shown in Figure 1.2 (Ben Clopton and Francis Spelman 2003) is a small membrane at the bony interface to the inner ear (cochlea). Since the cochlear walls are bony, the energy is transferred by the mechanical action of the stapes into an impression on the membrane stretching over the oval window.

Figure 1.2 The structure of the human ear

The relevant structure of the inner ear for sound perception is the cochlea, which communicates directly with the auditory nerve, conducting a representation of sound to the brain. The cochlea is a spiral tube about 3.5 cm long, which coils about 2.6 times.

The spiral is divided, primarily by the basilar membrane running lengthwise, into two fluid-filled chambers. The cochlea can be roughly regarded as a filter bank whose outputs are ordered by location, so that a frequency-to-place transformation is accomplished. The filters closest to the cochlear base respond to the higher frequencies and those closest to its apex respond to the lower frequencies.

1.3.1.2 Existing methods

Based on the concept of the cochlea, many feature extraction algorithms have been developed for speech recognition, such as MFCCs, the Fourier transform, the wavelet transform and others (Qi Li 2009).

The Fourier transform is the most popular transform for converting signals from the time domain to the frequency domain. However, it has a fixed time-frequency resolution and its frequency distribution is restricted to be linear. These limitations generate problems in audio and speech processing, such as pitch harmonics, computational noise, and sensitivity to background noise. The wavelet transform, on the other hand, provides flexible time-frequency resolution, but it also has notable problems. First, no existing wavelet is capable of closely mimicking the impulse responses of the basilar membrane, so it cannot be directly used to model the cochlea or carry out related computation. Additionally, even though forward and inverse continuous wavelet transforms are defined for continuous variables, there is no numerical computational formula for a real Inverse Continuous Wavelet Transform (ICWT); no such function exists even in commercial wavelet packages. The Discrete Wavelet Transform (DWT) has been applied in speech processing, but its frequency distribution is limited to the dyadic scale, which is different from the scale in the cochlea.

Perceptual Linear Predictive (PLP) analysis is another peripheral auditory-based approach. Based on the FFT output, it uses several perceptually motivated transforms, including the Bark frequency scale, equal-loudness pre-emphasis, and cubic-root amplitude compression (Hynek Hermansky 1990). The Relative Spectra (RASTA) technique was further developed to filter the time trajectory and suppress constant factors in the spectral components (Hynek Hermansky and Nelson Morgan 1994). RASTA has often been cascaded with PLP feature extraction to form the RASTA-PLP features. Comparisons between MFCC and RASTA-PLP have been reported by Grimaldi and Cummins (2008). Both MFCC and RASTA-PLP features are based on the Fourier transform. The Fourier transform has a fixed time-frequency resolution and a well-defined inverse transform, and fast algorithms exist for both the forward and the inverse transform. Despite its simplicity and efficient computational algorithms, when applied to speech processing the time-frequency decomposition mechanism of the Fourier transform differs from that of the hearing system. First, it uses fixed-length windows, which generate pitch harmonics over the entire speech bands. Second, its individual frequency bands are distributed linearly, which is different from the distribution in the human cochlea, so further warping is needed to convert to the Bark, Mel, or other scales (Qi Li and Yan Huang 2011).

Gammatone filter banks (Johannesma 1972) have been proposed to model the impulse responses of the basilar membrane and have been used to decompose time-domain signals into different frequency bands. However, there is no mathematical proof of how to synthesize the decomposed multichannel signals back into a time-domain signal. Although some suggestions on resynthesis have been given in plain language (Mitchel Weintraub 1985), or simply at the conceptual level, there remain no details or mathematical proof to validate the accuracy and computational efficiency. In Hohmann (2002), a Gammatone-based transform with analysis and synthesis was presented, but the filter bank derives a complex-valued output, which is not only different from the real cochlea but further complicates its implementation.
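For illustration, a simple gammatone analysis filterbank can be built with SciPy (scipy.signal.gammatone, available from SciPy 1.6 onward); the sketch below only performs the analysis step, with illustrative center frequencies rather than a true ERB spacing, and deliberately omits any synthesis stage, which is exactly the part that lacks a rigorous formulation as discussed above.

```python
# Hedged sketch of a gammatone analysis filterbank (analysis only, no synthesis).
import numpy as np
from scipy import signal

fs = 16000
center_freqs = np.geomspace(100, 6000, num=24)          # illustrative spacing
speech = np.random.default_rng(0).standard_normal(fs)   # stand-in for one second of speech

subbands = []
for fc in center_freqs:
    b, a = signal.gammatone(fc, ftype="iir", fs=fs)     # 4th-order digital gammatone filter
    subbands.append(signal.lfilter(b, a, speech))
subbands = np.stack(subbands)                           # (n_bands, n_samples) analysis output
```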


However, an auditory-based transform with both forward and inverse transforms for digital computers is needed for many audio applications, such as noise reduction, hearing aids, coding, speech and music synthesis, and speaker and speech recognition. To apply the concept of the auditory system to audio signal processing, Qi Li (2009) formulated an auditory-based transform inspired by the traveling waves in the cochlea; this transform provides a new platform for research on robust feature extraction. It is designed to be a simple and fast transform for real applications, as an alternative to the Fourier transform and the wavelet transform. Based on this auditory-based transform, Qi Li and Yan Huang (2010, 2011) developed an auditory-based feature extraction algorithm for robust speaker identification. Under mismatched acoustic conditions, this feature consistently performs better than the MFCC features.

1.3.2 Human Voice Production System Based Feature Extraction in Speech Recognition

A general problem in speaker-independent ASR is the high variability that is inherent in human speech. The problems originating from different VTLs become especially apparent under mismatched training-testing conditions. For example, if children use an ASR system whose acoustic models have only been trained with adult data, the recognition performance degrades significantly compared to the performance for adult users. Therefore, speaker-independent ASR systems often use speaker-adaptation techniques to reduce the influence of speaker-related variabilities.

A common model of human speech production is the source-filter model (Xuedong Huang et al. 2001). In this model, the source corresponds to the air stream originating from the lungs, and the filter corresponds to the vocal tract, which is located between the glottis and the lips and is composed of different cavities. The locations of the vocal tract's resonance frequencies (the "formants") shape the overall short-time spectrum and define the phonetic content. The spectral effects of different VTLs have been widely studied (Benzeghiba et al. 2007). An important observation is that, while the absolute formant positions of individual phones are speaker specific, their relative positions for different speakers are somewhat constant. A relation that describes this observation is obtained by considering a uniform tube model with length l. Here, the resonances occur at frequencies F_i = c(2i − 1)/(4l), i = 1, 2, 3, …, where c is the speed of sound (John Deller et al. 1993). Using this model, the spectra S_A and S_B of the same utterance spoken by two speakers A and B with different VTLs are related by a scaling factor α, which is also known as the frequency-warping factor (Florian Muller and Alfred Mertins 2011):

S_A(ω) = S_B(αω)                                                (2.1)
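To make the scaling relation in Equation (2.1) concrete, the short sketch below evaluates the uniform-tube resonances for two illustrative VTLs (roughly the 14 cm and 17 cm adult averages quoted in Section 1.3.2.1) and shows that every resonance pair is related by the same factor; the numbers are only an illustration, not data from this work.

```python
# Worked example of the uniform-tube model: F_i = c * (2i - 1) / (4 * l).
c = 343.0                 # speed of sound in air, m/s
l_a, l_b = 0.14, 0.17     # illustrative vocal tract lengths in metres

for i in (1, 2, 3):
    f_a = c * (2 * i - 1) / (4 * l_a)
    f_b = c * (2 * i - 1) / (4 * l_b)
    print(f"resonance {i}: {f_a:7.1f} Hz vs {f_b:7.1f} Hz, ratio {f_a / f_b:.3f}")

# Every ratio equals l_b / l_a ≈ 1.214, i.e. a single frequency-warping factor
# (or its reciprocal ≈ 0.82, depending on which speaker is the reference)
# relates the two spectra, as assumed in Equation (2.1).
```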

In a typical speaker-independent ASR task, the value of α lies between 0.8 and 1.2. This intrinsic variability has a negative effect on the recognition rate of speaker-independent ASR systems. Though Equation (2.1) is only a rough approximation of the real relationship between spectra from speakers with different VTLs, methods that try to achieve speaker independence for an ASR system commonly take this relationship as their fundamental assumption. A time-frequency analysis of the speech signal is usually the first operation in an ASR feature extraction stage, after possible preprocessing steps such as pre-emphasis or noise cancellation. This analysis tries to simulate the human auditory system up to a certain degree, and different methods have been proposed. As is done for the computation of the well-known MFCCs, a basic approach is the use of the Fast Fourier Transform (FFT) applied to windowed short-time signals, whose output is weighted by a set of triangular bandpass filters in the spectral domain (Xuedong Huang et al. 2001). Another common filterbank approach uses gammatone filters (Patterson et al. 1992), which were shown to fit the impulse response of the human auditory filters well. Both types of time-frequency analysis have in common that they locate the center frequencies of the filters evenly spaced on nonlinear, auditorily motivated scales. In the case of MFCCs the Mel scale is used (Steven Davis and Paul Mermelstein 1980), and in the case of a gammatone filterbank the Equivalent Rectangular Bandwidth (ERB) scale is used (Patterson et al. 1992, Moore and Glasberg 1996, Kentaro Ishizuka et al. 2006). Different works make use of the observation that both the Mel and the ERB scale approximately map the spectral scaling described in Equation (2.1) to a translation along the subband-index space of the time-frequency analysis.
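As a concrete illustration of this FFT-plus-triangular-filterbank pipeline, the following minimal sketch computes MFCCs with the librosa library; the file name, sampling rate and analysis parameters are illustrative and are not the configuration used in this work.

```python
# Minimal MFCC extraction sketch with illustrative parameters.
import librosa

signal, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical input file
mfcc = librosa.feature.mfcc(
    y=signal, sr=sr,
    n_mfcc=13,          # number of cepstral coefficients kept
    n_fft=400,          # 25 ms analysis window at 16 kHz
    hop_length=160,     # 10 ms frame shift
    n_mels=26)          # triangular Mel filters applied to the FFT magnitude
print(mfcc.shape)       # (13, number_of_frames)
```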

1.3.2.1 Human voice production system

The vocal tract is a fundamental component of the human speech production system. The gender and the age of the individual speaker are the two factors that determine the average VTL (Louis-jean Boe et al. 2006). The VTL is (in mammals) the distance from the glottis to the outer portion of the lips, shown as a dark line in Figure 1.3 (William Tecumseh Sherman Fitch 1994). The place where the vocal folds come together is called the glottis. The length of the vocal tract is strongly constrained by body size, lip size and the position of the larynx. Since the length of the vocal tract controls (all other things being equal) the dispersion of formants in the vocal tract transfer function, formant dispersion should provide a readily available acoustic cue to body size. The VTL thus represents a source of variability between individual speakers. On average, the VTL is about 14 cm for adult women and 17 cm for adult men (Louis-jean Boe et al. 2006).


Figure 1.3 Side view of the human vocal tract

1.3.2.2 Existing methods

Methods that try to achieve speaker-invariant feature extraction can be roughly grouped into three categories (Florian Muller and Alfred Mertins 2011). These groups act on different stages of the ASR process and may often be combined within the same ASR system. One group tries to normalize the features after the extraction (Lee and Rose 1998, Welling et al. 2002) by estimating the implicit warping factors of the utterances; these techniques are commonly referred to as VTL Normalization (VTLN) methods. A second group of methods adapts the acoustic models to the features of each utterance (Leggetter and Woodland 1995). The use of Maximum-Likelihood Linear Regression (MLLR) methods is part of most state-of-the-art recognition systems nowadays, and it was shown by Pitz and Ney (2005) that certain types of VTLN methods are equivalent to constrained MLLR. Thus, a third group of methods tries to generate features that are independent of the warping factor (Welling et al. 1999, Rademacher et al. 2006, Monaghan et al. 2008).

The concept of computing features that are independent of the VTL has been taken up by several works in the past, and different methods have been proposed. Leon Cohen (1993) introduced the scale transformation, which was further investigated for its applicability in the field of ASR by Umesh et al. (1999). Its use in ASR is motivated by the relationship given in Equation (2.1). One property of the scale transformation is that the magnitudes of the transforms of two scaled versions of one and the same signal are identical; thus, the magnitudes can be regarded as scaling-invariant. The scale cepstrum, which has the same invariance property, was also introduced by Umesh et al. (1999). The scale transformation is a special case of the Mellin transformation. Roy Patterson (2000) described a so-called Auditory Image Model (AIM) that was extended with the Mellin transformation by Irino and Patterson (2002). Further studies of the Mellin transformation have been conducted, for example, by Antonio De Sena and Davide Rocchesso (2005).

Various works rely on the assumption that the effects of inter-speaker variability caused by VTL differences are mapped to translations along the subband-index space of an appropriate filter bank analysis (Alfred Mertins and Jan Rademacher 2005, 2006, Monaghan et al. 2008, Florian Muller et al. 2009, Florian Muller and Alfred Mertins 2009, 2010, Rademacher et al. 2006).

Alfred Mertins and Jan Rademacher (2005, 2006) used a wavelet transformation for the time-frequency analysis and proposed so-called VTL-Invariant (VTLI) features based on auto- and cross-correlations of wavelet coefficients. Jan Rademacher et al. (2006) showed that a gammatone filterbank instead of a wavelet filterbank leads to higher robustness against VTL changes.

Methods that extract a time-frequency representation of an input signal for ASR tasks commonly locate the frequency centers of the analysis filters on auditorily motivated scales such as the Mel or ERB scale. Using these scales, it was shown (Monaghan et al. 2008, Umesh et al. 1996) that VTL changes approximately lead to translations in the subband-index space of these time-frequency representations. This can be exploited for the computation of features that are invariant to translation (Jan Rademacher et al. 2006, Monaghan et al. 2008), and the invariance can lead to an increase in robustness against VTL changes. The determination of invariants is well founded in the fields of mathematics and physics, and practical methods for the retrieval of invariants against rotation and translation have been applied especially in the field of pattern recognition.

The cyclic autocorrelation of a sequence and the modulus of the Discrete Fourier Transform (DFT) fall under this type of transform. A general class of translation-invariant transforms, called the class CT, was introduced by Wagh and Kanetkar (1977) and further investigated in the field of pattern recognition by Hans Burkhardt and Xaver Muller (1980). Another class of transformations, investigated in the field of speaker-independent speech recognition, is known as the Generalized Cyclic Transformations (GCT) (Florian Muller et al. 2009). Instances of this class have been successfully used in the field of pattern recognition and for feature extraction in ASR systems.
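The translation invariance of these two transforms is easy to verify numerically; the small check below (purely illustrative, not part of the proposed system) confirms that a circular shift of a subband profile leaves both the DFT modulus and the cyclic autocorrelation unchanged.

```python
# Numerical check of translation invariance for |DFT| and cyclic autocorrelation.
import numpy as np

x = np.random.default_rng(1).standard_normal(32)   # any subband profile
x_shift = np.roll(x, 5)                            # circularly translated copy

dft_mag = np.abs(np.fft.fft(x))
dft_mag_shift = np.abs(np.fft.fft(x_shift))

# cyclic autocorrelation = inverse DFT of the power spectrum
autocorr = np.fft.ifft(np.abs(np.fft.fft(x)) ** 2).real
autocorr_shift = np.fft.ifft(np.abs(np.fft.fft(x_shift)) ** 2).real

print(np.allclose(dft_mag, dft_mag_shift))         # True
print(np.allclose(autocorr, autocorr_shift))       # True
```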

One of these general methods integrates regular nonlinear functions of the features over the transformation group for which invariance should be achieved. This method is commonly known as "invariant integration". It was shown by Florian Muller and Alfred Mertins (2009) in large-vocabulary phoneme recognition experiments that invariant-integration-based feature sets lead to better recognition results than the standard MFCCs under matching training and testing conditions, and that these features outperform the MFCCs in cases in which the training and testing conditions differ with respect to the mean VTL.
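A minimal sketch of this idea is given below: a monomial built from a few selected subband values is averaged over a range of translations along the subband index, so that a spectral shift (i.e., a VTL change under the assumption above) changes the feature value only marginally. The choice of monomial components, exponents and integration window is purely illustrative and does not reproduce the exact feature sets of Muller and Mertins.

```python
# Hedged sketch of an invariant-integration feature over the translation group.
import numpy as np

def iif(spectrum, channels, exponents, max_shift):
    """spectrum: 1-D array of subband magnitudes for one frame.
    channels / exponents: subbands entering the monomial and their powers.
    max_shift: translations -max_shift..+max_shift to integrate over."""
    total, count = 0.0, 0
    for j in range(-max_shift, max_shift + 1):
        idx = [c + j for c in channels]
        if min(idx) < 0 or max(idx) >= len(spectrum):
            continue                  # skip shifts that fall off the band edges
        total += np.prod([spectrum[i] ** e for i, e in zip(idx, exponents)])
        count += 1
    return total / count              # average of the monomial over the group

frame = np.abs(np.random.default_rng(0).standard_normal(40))   # stand-in subband frame
print(iif(frame, channels=[3, 7, 12], exponents=[1, 2, 1], max_shift=4))
```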

1.4 SPEECH ENHANCEMENT BY DENOISING

In speech processing applications such as mobile communications, speech recognition and hearing aids, speech has to be processed in the presence of background noise. Therefore, the problem of removing uncorrelated noise components from noisy speech, i.e., speech enhancement, has been widely studied in the past and still remains an important issue in speech research. Speech enhancement techniques have been employed to improve the quality and intelligibility of noise-corrupted speech and/or the speech recognition performance. The performance of such applications depends strongly on how well the noise is removed (Sumithra and Thanushkodi 2009).

1.4.1 Existing Methods

The problem of denoising consists of removing noise from a corrupted signal without altering the signal itself. Generally, noise sources are classified as additive and convolutional. The former very often dominates in real-world applications, and the Spectral Subtraction (SS) approach has been a very popular solution for it (Steven Boll 1979, Michael Berouti et al. 1979, Sunil Kamath and Philipos Loizou 2002, Yasser Ghanbari et al. 2004, Yasser Ghanbari and Mohammad Reza Karami Mollaei 2004, Sumithra and Thanushkodi 2009). To subtract the noise components from the input noisy speech, the SS algorithm has to estimate the statistics of the additive noise in the frequency domain. Under low Signal-to-Noise Ratio (SNR) conditions, a spectral flooring process is usually applied to prevent over-subtraction. However, all such processes very often produce some unnatural residual noise in the enhanced speech, the so-called musical noise, due to the inevitable random tone peaks generated in the time-frequency spectrogram. Previous studies have pointed out that this perceivable residual noise can be effectively alleviated by considering the masking effect of the human auditory system (Dionysis Tsoukalas et al. 1997, Nathalie Virag 1999), i.e., the residual noise will not be perceived if it is below the masking thresholds of human auditory functions.

In recent years, several alternative approaches, such as signal subspace methods (Yariv Ephraim and Harry Van Trees 1995, Mark Klein and Peter Kabal 2002), have been proposed for enhancing degraded speech. In the subspace method, the estimation of the signal subspace dimension is difficult for unvoiced periods and transitional regions. Existing approaches to this task also include traditional methods such as spectral subtraction and Ephraim-Malah filtering (Yariv Ephraim and David Malah 1984). A drawback of these techniques is the necessity to estimate the noise or the SNR, which can be a strong limitation when recording with non-stationary noise and in situations where the noise cannot be estimated. The Fourier domain has long been the method of choice for suppressing noise; recently, methods based on the wavelet transformation have become increasingly popular. Wavelets provide a powerful tool for nonlinear filtering of signals contaminated by noise. Stephane Mallat and Wen Liang Hwang (1992) have shown that effective noise suppression may be achieved by transforming the noisy signal into the wavelet domain and preserving only the local maxima of the transform. Alternatively, a reconstruction that uses only the large-magnitude coefficients has been shown to approximate the uncorrupted signal well.


In other words, noise suppression is achieved by thresholding the wavelet transform of the contaminated signal. The method of wavelet threshold denoising is based on the principle of multiresolution analysis: the discrete detail coefficients and the discrete approximation coefficients are obtained by multi-level wavelet decomposition. Wavelet thresholding is a simple, nonlinear technique that operates on one wavelet coefficient at a time. In its most basic form, each coefficient is compared against a threshold; if the coefficient is smaller than the threshold it is set to zero, otherwise it is kept or modified. Replacing the small, noisy coefficients by zero and applying the inverse wavelet transform to the result can lead to a reconstruction that retains the essential signal characteristics with less noise.
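A minimal sketch of this basic procedure is given below using PyWavelets and the universal threshold of Donoho (1995); the wavelet, the decomposition depth and the single global threshold are illustrative choices and differ from the adaptive, subband-dependent thresholding developed later in this work.

```python
# Hedged sketch of basic wavelet-threshold denoising with a global soft threshold.
import numpy as np
import pywt

def wavelet_denoise(noisy, wavelet="db8", level=4):
    coeffs = pywt.wavedec(noisy, wavelet, level=level)        # multi-level DWT
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745            # noise estimate from finest details
    thr = sigma * np.sqrt(2.0 * np.log(len(noisy)))           # universal threshold
    denoised = [coeffs[0]]                                    # keep approximation coefficients
    denoised += [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[:len(noisy)]       # inverse DWT, trimmed to input length
```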

David Donoho (1995) introduced wavelet thresholding (shrinkage) as a powerful tool for denoising signals degraded by additive white noise, and more recently a number of attempts have been made to use perceptually motivated wavelet decompositions coupled with various thresholding and estimation methods (Ing Yann Soon et al. 1997, Sungwook Chang et al. 2002, Hamid Sheikhzadeh and Hamid Reza Abutalebi 2001, Jong Won Seok and Keun Sung Bae 1997). The best-known thresholding methods in the literature are soft and hard thresholding; compared with hard thresholding, soft thresholding is more efficient in denoising. Although the application of wavelet shrinkage to speech enhancement has been reported in the literature (Jong Won Seok and Keun Sung Bae 1997, Yasser Ghanbari and Mohammad Reza Karami Mollaei 2006, Michael Johnson et al. 2007, Sumithra et al. 2009), many problems are yet to be resolved for a successful application of the method to speech signals degraded by real environmental noise types.

Most techniques that use wavelet thresholding for speech enhancement suffer from a main problem, namely the detection of the voiced/unvoiced segments of the speech signal (Hamid Sheikhzadeh and Hamid Reza Abutalebi 2001, Jong Won Seok and Keun Sung Bae 1997). For incorrectly classified segments, the enhancement performance decreases drastically. The other controversial subjects affecting the enhancement performance are the thresholding function and the threshold value.

In general, a small threshold value leaves behind all the noisy coefficients, so the resulting denoised speech signal may still be noisy. On the other hand, a large threshold value sets more coefficients to zero, which over-smooths the signal, destroys details and may introduce distortion and artifacts in the reconstructed signal.

There are some defects in the basic wavelet thresholding method when it is faced with noisy speech corrupted by real-world noises. The basic method assumes that the noise spectrum is white. However, not only does white noise hardly exist in real environments, but colored noises also have to be handled in most practical systems. Therefore, the basic wavelet shrinkage does not yield good speech quality and cannot remove non-stationary noises.

The next problem is the shrinkage of the unvoiced speech segments, which contain many noise-like speech components; this leads to degradation of the quality of the enhanced speech. The use of a single threshold for all wavelet packet bands is not reasonable, and the use of classic thresholding functions such as the hard and soft thresholding functions often brings about time-frequency discontinuities. Therefore, a threshold value should be found that adapts to the characteristics of the different subbands. In general, adaptive approaches have been found to be more effective than their global counterparts. Wavelet-based techniques using coefficient thresholding (Stephane Mallat and Wen Liang Hwang 1992) and using adaptive thresholding (Sumithra and Thanushkodi 2009) have also been applied to speech enhancement.

To solve the problem of the poor intelligibility of speech signals processed with a fixed wavelet threshold, a speech enhancement method based on adaptive wavelet thresholds was presented by Zhang Jie et al. (2009). In this method, the type of additive noise is first ascertained according to the differences in spectrum amplitude between white noise (including colored noise with a flat spectrum amplitude) and colored noise with a varying spectrum amplitude. Since the Lipschitz exponent varies with the type of noise and speech, different kinds of adaptive threshold functions of the wavelet transform are used to enhance the noisy speech signals according to the type of noise.

1.5 PROBLEM DEFINITION

Based on the literature survey, the following problems have been identified:

- In the computation of Mel-Frequency Cepstral Coefficients, the Fast Fourier Transform is used to produce a spectrum on a linear frequency scale. As the FFT has a fixed time-frequency resolution, and because of its linear frequency distribution, the performance of MFCC-based speech processing systems is affected by background noise.

- There is a need for a new algorithm that works better than MFCC under mismatching conditions.

- The recognition rates of noise-robust MFCCs and of the auditory-transform-based feature extraction algorithm are still far from human performance, and this is because of speaker variations.

- Speaker-independent speech recognition systems have to be designed for diverse training-testing conditions with respect to the mean VTL.

In order to overcome these difficulties, this research work attempts to implement a speech recognition system that is robust against noise and speaker variation.

1.6 OBJECTIVES

- To design and implement a new feature extraction algorithm based on the human auditory system for the improvement of the recognition rate of the speech recognition system.

- To design and implement a noise enhancement system combined with the human-auditory-system-based feature extraction algorithm for the improvement of the noise robustness of the speech recognition system.

- To design and implement a speech recognition system that is robust against noise and speaker variance.

1.7 MOTIVATION

This research work has been carried out with the motivation of designing an interactive, ubiquitous teaching robot. Today, the number of people going in for higher education worldwide remains low (National Science Board 2012), and even after obtaining higher degrees many are not interested in choosing teaching as their profession, being drawn instead towards the software industry (National Science Board 2012). At the same time, there is limited expertise in cutting-edge technologies to educate students. These difficulties of the traditional educational field can be addressed by incorporating emerging ubiquitous technology into education. Such an approach can be applied to enhance the quality of higher education and support learners, and quality enhancement ensuring better performance can be achieved by making interaction possible between the robot teacher and the students or learners. Based on these considerations, this research work attempts to design the speech recognition part of an interactive, ubiquitous teaching robot.

1.8 ORGANIZATION OF THE THESIS

A detailed review of the literature on auditory-transform-based feature extraction, wavelet thresholding for speech enhancement and VTL-invariant algorithms for feature extraction is given in Chapter 2.

Preliminary work done on MFCC, auditory-transform-based feature extraction, VTL invariance algorithms and the FFNN is included in Chapter 3.

MFCC with adaptive wavelet thresholding and the invariant integration algorithm is presented in Chapter 4. Prior to that, the existing method of MFCC-based feature extraction and its noise robustness are discussed. The implementation of the Enhanced Mel-Frequency Cepstral Coefficient Invariant Integration Features (EMFCCIIFs) based feature extraction method has been compared with standard MFCC features, and the results are tabulated. The results show that, under mismatched conditions, the EMFCCIIFs perform consistently better than the standard MFCC features.

The existing auditory-based feature extraction algorithm and the proposed auditory-based feature extraction algorithm with adaptive wavelet thresholding and invariant integration are explained in Chapter 5. The enhanced features are then tested under different noise conditions and under matching and mismatching VTL conditions. The performances of CFCC and the Enhanced Cochlear Filter Cepstral Coefficient Invariant Integration Features (ECFCCIIFs) are compared. Experiments show that, under mismatched conditions, the ECFCCIIFs perform consistently better than the existing CFCC features. The performances of ECFCCIIFs and EMFCCIIFs are also compared; the results show that the recognition accuracies of ECFCCIIFs are higher than the corresponding EMFCCIIF accuracies.

Conclusions derived from this research work, along with the scope for future work, are summarized in Chapter 6.