

Noise Robust Novel Approach to Speech Recognition

    Swapnil D. Daphal (PG Student), Elect. & Telecommunication Department

    SKN College of Engineering, Vadgaon (Bk.) Pune, India.

    [email protected]

    Prof. Sonal K. Jagtap (Asst. Prof.), Elect. & Telecommunication Department

    SKN College of Engineering, Vadgaon (Bk.) Pune, India.

    [email protected]

    Abstract: Most practical speech recognition (SR) methods depend on the feature extraction scheme used in the implementation. The performance of these SR systems is strongly affected by the presence of noise. Passing the speech signal through a cochlear filter bank (CFB) prior to feature extraction reduces the impact of noise on the system. In this paper, a noise robust approach to feature extraction, the cochlear filter bank with zero crossings as a feature, is discussed. A comparative analysis of the CFB and Mel frequency cepstral coefficient (MFCC) approaches to feature extraction, in terms of recognition accuracy (RA), is presented. The results imply that the former approach fits the experiments better in the presence of noise.

    Index Terms: Speech recognition, cochlear filter bank, recognition accuracy, MFCC, feature extraction.

    INTRODUCTION

    Speech is an important tool used by human beings to carry vital information and express ideas. Speech is produced by the controlled movement of the internal organs that form the vocal tract. State of the art speech recognition systems are equipped with different algorithms for generating an optimal response. Automatic speech recognition involves the identification of words spoken by an individual. The host platform carrying the speech recognition system is unaware of the language; when a word is uttered, the speech recognition system analyses the features of the speech signal and issues the command to recognize the word. The user application then determines the further action necessary to perform the task. The speech recognition system involves different modules, and each of these has its significance in the overall operation.

    The process of speech recognition is highly affected by the surroundings in which the system resides. Noise is the major obstacle that degrades the best possible response of the system. Feature extraction plays a significant role in the optimality of the system response. Mel frequency cepstral coefficients, the most popular spectral parameters used in recognition, give better recognition accuracy in a clean environment. The cochlear filter bank with zero crossings as a feature is another interesting approach to feature extraction; it offers a great deal of improvement in recognition accuracy in noisy environments.

    In this paper, a comparative analysis of these two methods, MFCC and CFB, is given. Section II gives a literature survey of various feature extraction techniques, and Section III gives the detailed architecture of the MFCC and CFB. Section IV gives an overview of the classification method used in the implementation. Experimentation and results are discussed in Section V. Conclusions are reported in Section VI.

    LITERATURE SURVEY

    The speech recognition process includes a front end processing part that converts a speech signal into features useful for further faithful processing. The feature extraction stage plays an important role in obtaining from the speech signal features that are comparatively insensitive to talker and channel variability. The features obtained after this crucial step decrease the data rate in the later parts of speech recognition and reduce the redundancy residing in the speech signal. A wide range of feature extraction techniques depends on standard signal processing techniques, such as linear predictive coding, filter banks, or cepstral techniques. Some novel methods based on human perception of the speech signal are also involved [1]. Feature extraction based upon auditory model systems has shown better performance than conventional signal processing schemes [2]. A multilayered artificial neural network on linear predictive coefficients for speech recognition also yielded results of finite accuracy and recognition performance; this scheme relates the time varying nature of the speech signal to its optimal performance [3]. Seneff's model operates well for real time applications; however, the time domain nature of the model makes the approach more computationally expensive than available frequency domain techniques [4]. Another model, the ensemble interval histogram (EIH) developed by Ghitza, uses a physiologically based linear filter bank and level crossing detectors in its implementation and provides better real time recognition accuracy with less computational effort [5]. Data reduction techniques reduce the feature sets and increase the success rate of the system in speech recognition. Principal component analysis (PCA) and linear discriminant analysis (LDA) are some of the methods known to reduce the data size while ensuring a better success rate [6]. Chandawan and Siwat Suksri have reviewed the importance of support

    2014 International Conference on Electronic Systems, Signal Processing and Computing Technologies
    978-1-4799-2102-7/14 $31.00 © 2014 IEEE. DOI 10.1109/ICESC.2014.55

Fig. 1 Block diagram of the speech recognition system.

vector machines and principal component analysis with mel frequency cepstral coefficients (MFCC) in improving recognition rates [7]. Alain Biem and Shigeru Katagiri carefully investigated experiments with discriminative feature extraction and provided a novel recognition oriented analysis of the filter banks [8]. A multiresolution based feature extraction method is suggested by Ramalingam for adverse environments [9]. That the loss of information could be minimized by optimizing the parameters of the mel-cepstrum was put forward by Chulhee Lee [10]. Figure 1 gives the schematic representation of the speech recognition system.

FEATURE EXTRACTION METHODS

Speech recognition is highly dependent on the methods adopted for feature extraction. Mel frequency cepstral coefficients (MFCC) and zero crossings obtained after cochlear filter bank processing are the methods selected for feature extraction in this section. Figure 2 explains the various building blocks of the MFCC; the flow of the feature extraction scheme clearly reveals the implementation issues of the MFCC. Firstly, the input speech signal is pre-emphasized to artificially amplify the high frequencies. Speech signals are non-stationary signals, meaning that the transfer function of the vocal tract, which generates them, changes with time, though it changes gradually. Pre-emphasis of the input speech is given by (1):

    y(n) = x(n) - a x(n-1),  0.9 <= a <= 1.0    (1)

It is safe to assume that the signal is piecewise stationary. Thus the speech signal is framed with 30 ms to 32 ms frames with an overlap of 20 ms. The need for the overlap is that information may be lost at the frame boundaries, so frame boundaries need to lie within another frame. Mathematically, framing is equivalent to multiplying the signal with a series of sliding rectangular windows. The problem with rectangular windows is that the power contained in the side lobes is significantly high and therefore may give rise to spectral leakage. In order to avoid this we use a Hamming window, given by (2):

    w(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)),  0 <= n <= N-1    (2)

The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of the frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of the frame. If w(n) is the window and x(n) is the input speech signal, then the windowed signal is given by (3):

    y(n) = x(n) w(n)    (3)

The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm to implement the Discrete Fourier Transform (DFT), which is defined on the set of N samples; each complex number Y(k) represents the magnitude and phase of that frequency component in the original signal. The DFT is represented by (4):

    Y(k) = sum_{n=0..N-1} y(n) exp(-j 2*pi*k*n / N),  k = 0, 1, ..., N-1    (4)

Fig. 2 The general flow diagram of the generation of the MFCC.

The computational complexity of the FFT is N log(N). The value of N could be varied between 512 and 1024; the size of the FFT contributes to the computational complexity.

As mentioned above, psychophysical studies have shown that human perception of the frequency content of sounds, for speech signals, does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the mel scale. The mel frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz:

    mel(f) = 2595 log10(1 + f/700)    (5)

To make the frequency estimates more robust and less sensitive to slight variations in the input, the squared magnitude of the output of the mel filter bank is converted to a log scale. Phase information does not play a vital role in speech analysis, so taking the logarithm leads to optimum results.
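The MFCC front end described above (pre-emphasis, framing, Hamming windowing, FFT, and the mel mapping) can be sketched in a few lines of NumPy. The frame length, hop, and pre-emphasis coefficient below are illustrative choices consistent with the text, not values fixed by the paper:

```python
import numpy as np

def mel(f):
    # Mel scale: linear below ~1 kHz, logarithmic above (Eq. 5)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mfcc_front_end(x, fs, frame_ms=30, hop_ms=10, nfft=512, alpha=0.97):
    # Pre-emphasis (Eq. 1): y(n) = x(n) - a*x(n-1)
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    flen = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    # Overlapping frames, Hamming window (Eq. 2), magnitude spectrum (Eq. 4)
    win = np.hamming(flen)
    frames = [y[i:i + flen] * win for i in range(0, len(y) - flen + 1, hop)]
    return np.abs(np.fft.rfft(frames, nfft))

fs = 8000
t = np.arange(fs) / fs
spec = mfcc_front_end(np.sin(2 * np.pi * 440 * t), fs)
print(spec.shape)          # (num_frames, nfft//2 + 1)
print(round(mel(1000.0)))  # ~1000 mels at the 1 kHz reference point
```

In a full pipeline the magnitude spectrum would next be weighted by triangular mel filters before the log and DCT steps described in the text.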

In this final step, we convert the log mel spectrum back to the time domain. The result is called the mel frequency cepstral coefficients (MFCC). The cepstral representation of the spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT).
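The DCT step can be sketched directly from its definition; the 26 filter-bank energies below are toy values, and retaining 13 coefficients follows the text:

```python
import numpy as np

def cepstra(logmel, ncep=13):
    # DCT-II of the log mel filter-bank energies; keeping the first
    # ncep coefficients yields the MFCC vector for one frame
    M = len(logmel)
    n = np.arange(M)
    basis = np.cos(np.pi * np.outer(np.arange(ncep), n + 0.5) / M)
    return basis @ logmel

logmel = np.log(np.linspace(1.0, 5.0, 26))   # 26 toy log filter-bank energies
c = cepstra(logmel)
print(c.shape)   # (13,)
```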

MFCC gives the acoustic representation of the speech but does not contain any energy information residing in the speech. In order to add the energy feature and capture the change in the cepstral features, 13 delta features and 13 double delta features are added. For a signal y, the energy between time instants t1 and t2 is given by (6):

    E = sum_{t=t1..t2} y(t)^2    (6)

The difference between frames is indicated by the delta features; with the delta and double delta features appended, each frame is described by 39 features, the delta being given by (7) for a frame.

    d(t) = (c(t+1) - c(t-1)) / 2    (7)

Cochlear filter banks are another important feature extraction tool, producing features that remain distinctive in a noisy and contaminated environment. Zero crossings occur during speech production irrespective of the level of the utterance [13]. Figure 3 shows the realization of the cochlear filter bank and its various parts. A travelling wave filter, a velocity transformation filter T(z), and a second filter are the important building blocks of the cochlear filter bank.

A properly uttered speech segment is applied to the travelling wave filter. The captured voice propagates in a cascade manner through the various sections of the travelling wave filter, which possesses low pass filtering characteristics; the cut-off frequency is different for each section [14]. The travelling wave filter of the mth degree is mathematically described by its transfer function Hm(z).
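Both feature types discussed above, the delta features of (7) and the zero-crossing counts taken from filter-bank channel outputs, can be sketched as follows (toy inputs; the actual cochlear filter sections described by Hm(z) are not modeled here):

```python
import numpy as np

def delta(c):
    # Delta features per (7): d(t) = (c(t+1) - c(t-1)) / 2,
    # with the first and last frames replicated at the edges
    p = np.pad(c, ((1, 1), (0, 0)), mode='edge')
    return (p[2:] - p[:-2]) / 2.0

def zero_crossings(x):
    # Count sign changes in a channel output; zero-crossing positions are
    # largely preserved under additive noise, motivating their use [13]
    s = np.sign(x)
    s[s == 0] = 1
    return int(np.sum(s[1:] != s[:-1]))

cep = np.arange(12.0).reshape(4, 3)            # 4 frames, 3 toy coefficients
feats = np.hstack([cep, delta(cep), delta(delta(cep))])
print(feats.shape)                             # static + delta + double delta

fs = 8000
tone = np.sin(2 * np.pi * 100 * np.arange(fs) / fs)
print(zero_crossings(tone))                    # about 2 crossings per cycle
```

With 13 cepstral coefficients per frame instead of the 3 toy ones, the same stacking yields the 39-dimensional vectors used in the paper.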

There are two basic strategies for solving q-class problems with SVMs.

    Multi-class SVMs: One to Others. Take the training samples with the same label as one class and all the others as the other class; the task then becomes a two-class problem. For the q-class problem (q > 2), q SVM classifiers are formed, denoted SVM_i, i = 1, 2, ..., q. For a testing sample x,

    the decision value f_i(x) = w_i · x + b_i can be obtained by using each SVM. The testing sample x belongs to the jth class where

    f_j(x) = max_{i=1..q} f_i(x)

    Multi-class SVMs: Pair-wise SVMs. In the pair-wise approach, q(q-1)/2 machines are trained for the q-class problem. The pair-wise classifiers are arranged in trees, where each tree node represents an SVM. A bottom-up tree, similar to the elimination tree used in tennis tournaments, was originally proposed for recognition of 3D objects and was later applied to face recognition. Regarding the training effort, the one-to-others approach is preferable since only q SVMs have to be trained, compared to q(q-1)/2 SVMs in the pair-wise approach. However, at runtime both strategies require the evaluation of q-1 SVMs [10]. Recent experiments on people recognition show similar classification performance for the two strategies.
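A minimal sketch of the one-to-others decision rule; the weight vectors and biases below are toy values standing in for trained SVM parameters:

```python
import numpy as np

# Toy decision functions for a q=3 class problem: f_i(x) = w_i . x + b_i
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.5])

def classify(x):
    # Evaluate all q one-to-others SVMs and pick the largest decision value
    scores = W @ x + b
    return int(np.argmax(scores))

print(classify(np.array([2.0, 0.1])))    # class 0 wins
print(classify(np.array([-1.0, -1.0])))  # class 2 wins
```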

    RESULTS AND DISCUSSIONS

    The speech recognition system performance is measured in terms of the recognition accuracy. The analysis is performed for the various feature extraction schemes discussed above. The testing platform used for the analysis is a PC loaded with MATLAB. The databases used are: a self-created digit database containing utterances of the digits 0 to 9 from different speakers; an alphabet database containing utterances of the alphabets A to Z from different speakers; and the NOIZEUS database, which contains different utterances with various noises. The analysis is performed for different noisy environments including babble, car, airport and subway noise, and a statistical survey is given for the different approaches. The number of parameters plays a significant role in system performance. The speech recognition experiments reported in this paper were performed on a speech corpus that used material from the self-created Digit database, the Alphabet database and four real-world noises from the NOIZEUS database. The Digit database contains over 150 samples of digit sequences spoken by 15 male speakers. The Alphabet database contains over 676 samples of alphabets spoken by 13 female speakers. We used all recordings from the self-created databases. The speech files were initially filtered with the modified Intermediate Reference System (IRS) filters specified by ITU-T P.862 and combined with the NOIZEUS corpus. The filtering process allowed us to add the noise extracted from the NOIZEUS database to the clean speech from the self-created

    database without affecting the spectrum of the speech signals. The NOIZEUS database contains 30 sentences corrupted by eight different real-world noises at 0 dB, 5 dB, 10 dB, and 15 dB SNR. In the analysis we have taken four of them: multi-talker babble, car, airport, and subway noise. Noise extraction was carried out by subtracting the clean utterances in this database from the noisy ones. The recordings from the Digit and Alphabet corpora were contaminated with the different noise types at four SNR levels (namely 15 dB, 10 dB, 5 dB and 0 dB). We computed the energy of each recording from the Digit and Alphabet databases and added the noise at the selected SNR according to the signal energy level. Different combinations of the training and testing datasets were used for the analysis of the performance of the speech recognition system.
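Mixing noise at a chosen SNR according to the signal energy, as described above, can be sketched as follows (a generic sketch; it does not reproduce the IRS filtering or the NOIZEUS extraction step):

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    # Scale noise so that 10*log10(E_signal / E_noise) equals snr_db
    e_s = np.sum(clean ** 2)
    e_n = np.sum(noise ** 2)
    g = np.sqrt(e_s / (e_n * 10 ** (snr_db / 10.0)))
    return clean + g * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * np.arange(8000) / 80)     # 1 s toy "speech" tone
noisy = add_noise_at_snr(clean, rng.standard_normal(8000), 5.0)

# Verify the achieved SNR from the residual
resid = noisy - clean
snr = 10 * np.log10(np.sum(clean ** 2) / np.sum(resid ** 2))
print(round(snr, 1))   # 5.0
```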

    MFCC is the baseline approach presented in the literature. Its performance is highly dependent on parameters such as the choice of training and testing datasets, the number of filters, the window, the codebook size in the vector quantization, the FFT size and the feature vectors.

    The cochlear filter bank approach to feature extraction simulates the behavior of the human cochlea. The performance of the CFB depends on the number of channels used and the noise environment.

    The experiments performed with different numbers of training and testing datasets clearly indicate that the recognition accuracy varies for different combinations. Table 1 gives the recognition accuracy of the MFCC approach with the Digit database. Another experiment performed on the Digit database implies the importance of the proper selection of the training and testing datasets. Table 2 gives the recognition accuracy of the various combinations of training and testing datasets with the cochlear filter bank approach. The recognition accuracy tends to increase with the number of testing and training datasets; the RA is highest with 15 training and testing datasets.

    To validate the optimal number of training and testing datasets, the experiments were repeated with the Alphabet database, with the combinations increased from 15 to 25. Table 3 gives the statistical analysis of the experiment. It shows that as the number of training and testing datasets is increased, the recognition accuracy increases up to a certain level, after which it shows a slight degradation in performance with further increase in the datasets. Further experiments with the same database and cochlear filter banks as the feature extraction algorithm validate that the optimal number of training and testing sets is 15. Table 4 gives the comparative performance of the different combinations of training and testing datasets; the value reaches its maximum when the number of combinations is chosen to be 15.

    The performance of both algorithms is evaluated under different noisy conditions. MFCC and its derivatives as well as the cochlear filter bank approach gave significant results. Figure 4 shows the variation in recognition accuracy with signal to noise ratio for MFCC (13), MFCC (26), MFCC (39), and CFB. The noise selected from the NOIZEUS database is babble noise. Similarly, recognition accuracy is evaluated and


illustrated for car noise, airport noise and subway noise in Fig. 5, Fig. 6 and Fig. 7 respectively. The comparisons show that the recognition accuracy of the cochlear filter banks may be found to be more significant in the presence of noise than that of the mel frequency cepstral coefficients.

TABLE 1. RECOGNITION ACCURACY FOR VARIOUS DATASETS OF DIGIT DATABASE WITH MFCC

Training \ Testing      5       8       10      13      15
5                       63.15   67.23   59.34   71.5    78.0
8                       67.9    68.9    70.2    71.8    79.5
10                      72.5    77.2    81.6    83.8    84.8
13                      76.5    78.5    85.6    86.8    86.5
15                      75.6    80.0    87.4    87.5    88.33

TABLE 2. RECOGNITION ACCURACY FOR VARIOUS DATASETS OF DIGIT DATABASE WITH CFB

Training \ Testing      5       8       10      13      15
5                       57.7    64.2    71.2    72.1    70.4
8                       61.2    62.5    72.3    75.9    77.4
10                      68.5    69.2    69.4    74.7    78.8
13                      68.8    70.5    71.4    75.7    76.3
15                      69.8    71.8    73.5    78.5    85.5

TABLE 3. RECOGNITION ACCURACY FOR VARIOUS DATASETS OF ALPHABET DATABASE WITH MFCC

Training \ Testing      5       10      15      20      25
5                       60.1    70.6    73.8    71.2    70.4
10                      63.8    74.1    74.9    73.4    77.2
15                      74.5    82.6    91.5    84.6    87.5
20                      77.2    79.7    84.1    80.4    87.8
25                      80.4    80.9    85.7    88.8    83.4

TABLE 4. RECOGNITION ACCURACY FOR VARIOUS DATASETS OF ALPHABET DATABASE WITH CFB

Training \ Testing      5       10      15      20      25
5                       54.5    69.8    72.8    70.2    71.5
10                      60.1    63.8    71.2    72.8    78.3
15                      70.4    80.0    89.5    84.2    85.3
20                      76      79.4    83.7    78.1    86.8
25                      79.3    80.2    84.2    79.8    81.4

Fig. 4 The variation in the recognition accuracy for different approaches in presence of babble noise.

The statistical representation of the various feature extraction techniques is given in the following figures. The effect of the noise on the recognition accuracy is clearly seen; it varies significantly with the noise.

Fig. 5 The variation in the recognition accuracy for different approaches in presence of car noise.

Fig. 6 The variation in the recognition accuracy for different approaches in presence of airport noise.


Fig. 7 The variation in the recognition accuracy for different approaches in presence of subway noise.

CONCLUSIONS

The two approaches for feature extraction, namely mel frequency cepstral coefficients and cochlear filter banks, have shown significant results. The recognition accuracy is found to be optimal for 15 combinations of the training and testing datasets. MFCC shows better recognition accuracy in clean and less noisy environments; however, the recognition degrades as the noise increases. The cochlear filter bank approach to feature extraction shows improved recognition accuracy as the noise increases. To validate the noise robustness of the latter approach, various noises were added to the speech signal, and the experiments performed imply that the cochlear filter bank may work better in noisy environments.

ACKNOWLEDGMENT

It is my pleasure to get this opportunity to thank my beloved and respected guide Prof. Sonal K. Jagtap, who imparted valuable basic knowledge of electronics, specifically related to speech processing. We are grateful to the Department of Elect. and Telecommunication, SKNCOE, Pune for providing infrastructural facilities and moral support.

REFERENCES

[1] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, no. 4, pp. 357-366, 1980.
[2] Gondhiraj R. and Sathidevi P. S., "Auditory Based Wavelet Packet Filterbank for Speech Recognition using Neural Network," pp. 666-673, 2007.
[3] S. Seneff, "A computational model for the peripheral auditory system: Application to speech recognition research," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, Tokyo, 1986, pp. 1983-1986.
[4] Woojay Jeon and Biing-Hwang Juang, "Speech Analysis in a Model of the Central Auditory System," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 6, August 2007.
[5] O. Ghitza, "Auditory nerve representation as a basis for speech processing," in Advances in Speech Signal Processing (S. Furui and M. M. Sondhi, Eds.). New York: Marcel Dekker, 1992, pp. 453-486.
[6] M. J. Hunt and C. Lefebvre, "A comparison of several acoustic representations for speech recognition with degraded and undegraded speech," in Proc. Int. Conf. Acoust., Speech, Signal Processing, Glasgow, 1989, pp. 262-265.
[7] Chandawan and Siwat Suksri, "Speech recognition using MFCC," in Proc. Int. Conf. on Computer Graphics, Simulation and Modeling, 2012, pp. 135-138.
[8] Alain Biem, "An Application of Discriminative Feature Extraction to Filter-Bank-Based Speech Recognition," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 2, pp. 96-110, February 2001.
[9] Ramalingam Hariharan, "Noise Robust Speech Parameterization Using Multiresolution Feature Extraction," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 8, pp. 856-866, November 2001.
[10] Chulhee Lee, "Optimizing Feature Extraction for Speech Recognition," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 1, pp. 80-88, January 2003.
[11] V. Tyagi and C. Wellekens, "On desensitizing the Mel-Cepstrum to spurious spectral components for Robust Speech Recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP 05), 2005, vol. 1, pp. 529-532.
[12] Finnian Kelly and Naomi Harte, "A Comparison of Auditory Features for Robust Speech Recognition," EUSIPCO, August 2010.
[13] Chok-Ki Chan, "Speech Recognition Based on Zero-crossing Rate and Energy," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 33, no. 10, 1985.
[14] James M. Kates, "A Time-domain Digital Cochlear Model," IEEE Transactions on Signal Processing, vol. 39, no. 12, Dec 1991, pp. 2573-2592.
[15] Jorge Bernal-Chaves, Carmen Pelaez-Moreno, Ascension Gallardo-Antolin and Fernando Diaz de Maria, "Multiclass SVM-based isolated-digit recognition using a HMM-guided segmentation," in Proc. ISCA Tutorial and Research Workshop on Non-Linear Speech Processing, Barcelona, 19-22 April 2005, pp. 137-144.
    294