Final Report on Speech Recognition Project
Ceren Burçak Dağ
040100531
Introduction
• This project aims to design the pre-processing, clustering, and classifier blocks of a speech recognition system. The computations are implemented in C/C++ by the author, and the visualization materials are generated in MATLAB. Documentation of the code is given in the appendix.
Pre-Processing Block
Silence trimmed
RMS applied
Hanning windowed
FFT taken
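The plots for each step above exist only in the slide images. As a rough sketch of what the RMS and Hanning steps compute, in C++ (the report's implementation language, but not the author's actual code; the per-frame normalization details are assumptions), one frame could be processed as:

```cpp
#include <cmath>
#include <vector>

// Normalize a frame by its RMS value, then apply a Hann (Hanning) window
// so the segment tapers smoothly to zero at both ends before the FFT.
std::vector<double> preprocess_frame(const std::vector<double>& frame) {
    const std::size_t N = frame.size();
    double energy = 0.0;
    for (double s : frame) energy += s * s;
    const double rms = std::sqrt(energy / static_cast<double>(N));

    std::vector<double> out(N);
    const double pi = 3.14159265358979323846;
    for (std::size_t n = 0; n < N; ++n) {
        // Hann window: w[n] = 0.5 * (1 - cos(2*pi*n / (N-1))).
        const double w = 0.5 * (1.0 - std::cos(2.0 * pi * n / (N - 1)));
        out[n] = (rms > 0.0 ? frame[n] / rms : 0.0) * w;
    }
    return out;
}
```

The windowed frame would then be passed to the FFT routine (e.g., the one from Numerical Recipes [1]).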
Relation between Mel and Hertz scales
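The plotted relation follows the mel scale of Stevens, Volkmann and Newman [2]. The slide's exact formula is only in the image; the commonly used closed form, m = 2595 log10(1 + f/700), is an assumption here and is sketched below:

```cpp
#include <cmath>

// Hz -> mel, using the common closed-form approximation of the mel scale.
double hz_to_mel(double hz) {
    return 2595.0 * std::log10(1.0 + hz / 700.0);
}

// mel -> Hz, the inverse mapping.
double mel_to_hz(double mel) {
    return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0);
}
```

By construction, 1000 Hz maps to approximately 1000 mel; filter-bank edge frequencies are typically chosen equally spaced on the mel axis and mapped back to Hz with `mel_to_hz`.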
Triangular filters
Cepstrum Analysis and Homomorphic Deconvolution
• Nonlinear signal processing technique.
• Useful in speech processing and recognition applications.
• Bogert, Healy and Tukey defined cepstrum and quefrency in 1963.
• Oppenheim (1964) defined homomorphic systems.
• "The transformation of a signal into its cepstrum is actually a homomorphic transformation that maps the convolution into addition."
• Let x[n] be a sampled signal composed of the sum of a signal v[n] and an echo (a shifted and scaled copy) of it:
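The equation itself appears only in the slide image; in the standard notation of [5], the echo model presumably reads:

$$x[n] = v[n] + \alpha\, v[n - n_0] = v[n] * \big(\delta[n] + \alpha\, \delta[n - n_0]\big),$$

where $\alpha$ is the echo gain and $n_0$ the echo delay.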
Since convolution in the time domain corresponds to multiplication in the frequency domain,
Take the magnitude of both sides,
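The corresponding equations are only in the slide image; they are presumably the DTFT of the echo model and its magnitude:

$$X(e^{j\omega}) = V(e^{j\omega})\big(1 + \alpha e^{-j\omega n_0}\big),$$

$$\big|X(e^{j\omega})\big| = \big|V(e^{j\omega})\big|\,\big|1 + \alpha e^{-j\omega n_0}\big|.$$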
The nonlinear operation used in computing the cepstrum is the logarithm, so take the logarithm of both sides.
Since the logarithm of a product is the sum of the logarithms of its factors,
Define:
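The equations and the definition are only in the slide image; reconstructing them in standard form, the logarithm turns the product into a sum,

$$\log\big|X(e^{j\omega})\big| = \log\big|V(e^{j\omega})\big| + \log\big|1 + \alpha e^{-j\omega n_0}\big|,$$

and the definition presumably introduces the log-magnitude spectra, $\hat{X}(e^{j\omega}) \triangleq \log\big|X(e^{j\omega})\big|$ and $\hat{V}(e^{j\omega}) \triangleq \log\big|V(e^{j\omega})\big|$.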
To return to the time domain, we apply the inverse DTFT (I-DTFT).
Finally, one obtains the following quefrency-domain equation, the cepstrum:
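The quefrency-domain equation is only in the slide image; in standard form, the (real) cepstrum is the inverse DTFT of the log-magnitude spectrum:

$$c[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log\big|X(e^{j\omega})\big|\, e^{j\omega n}\, d\omega,$$

so the convolution of the original model has become an addition of cepstra (the subscript names below are ours, not the slide's): $c_x[n] = c_v[n] + c_e[n]$, where $c_e[n]$ is the cepstrum of the echo term.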
Speech Production Model based on Cepstrum Analysis
• Voiced sounds are produced by exciting the vocal tract with quasi-periodic pulses of air flow caused by the opening and closing of the glottis.
• Fricative sounds are produced by forming a constriction somewhere in the vocal tract and forcing air through the constriction so that the turbulence is created and therefore producing a noise-like excitation.
• Plosive sounds are produced by completely closing the vocal tract, building up pressure behind the closure, and then suddenly releasing the pressure.
Figure 17: Discrete-time speech production model, picture courtesy of Oppenheim, Discrete-Time Signal Processing, [5].
Parameters in the model
• 1. The coefficients of V(z), the mathematical representation of the vocal tract, which is simply a general IIR filter; the locations of its poles and zeros shape the sound.
• 2. The mode of excitation of the vocal tract system: a periodic impulse train or random noise.
• 3. The amplitude of the excitation signal.
• 4. The pitch period of the excitation for voiced speech, i.e., the reciprocal of the fundamental frequency of the voiced sound.
Let us assume that the model is valid and fixed over a short time interval (on the order of 10 ms), so we can apply cepstrum analysis to a short segment of length L (= 1024) samples.
Apply a window w[n] to the resulting signal so that it tapers smoothly to zero at both ends. The input to the homomorphic system will then be:
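The equation is only in the slide image; with voiced speech modeled as the excitation p[n] convolved with the vocal-tract response v[n], the windowed input presumably reads:

$$x[n] = w[n]\,\big(p[n] * v[n]\big).$$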
If we further assume w[n] varies slowly with respect to the variations of v[n], the cepstrum analysis reduces to,
If p[n] is a train of impulses,
By applying cepstrum analysis, we obtain the following equation.
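The resulting equation is only in the slide image; reconstructing it from the stated assumptions: with the window varying slowly, $x[n] \approx p_w[n] * v[n]$ where $p_w[n] = w[n]\,p[n]$, so the cepstrum is additive:

$$\hat{c}_x[n] = \hat{c}_{p_w}[n] + \hat{c}_v[n].$$

For an impulse train with pitch period $N_0$, $\hat{c}_{p_w}[n]$ contributes peaks at quefrencies $n = N_0, 2N_0, \ldots$, while $\hat{c}_v[n]$ is concentrated near $n = 0$; this separation in quefrency is what makes the excitation and the vocal-tract contributions distinguishable.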
MFCC and delta coefficients calculation
Clustering and Classification
• K-Means clustering is applied to each training file to generate the confusion matrix and tables.
• KNN is applied to recognize some test words.
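The K-Means and KNN code is the author's own C/C++ and is not reproduced in this transcript. A minimal KNN sketch in C++ (Euclidean distance, majority vote; the textbook formulation [9], not the author's implementation) might look like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// One labeled training point: a feature vector (e.g., MFCCs) and its class.
struct Sample {
    std::vector<double> features;
    int label;
};

// Classify a query vector by majority vote among its k nearest
// training samples under the Euclidean distance.
int knn_classify(const std::vector<Sample>& train,
                 const std::vector<double>& query, std::size_t k) {
    // Distance from the query to every training sample.
    std::vector<std::pair<double, int>> dist;  // (distance, label)
    for (const Sample& s : train) {
        double d2 = 0.0;
        for (std::size_t i = 0; i < query.size(); ++i) {
            const double diff = s.features[i] - query[i];
            d2 += diff * diff;
        }
        dist.push_back({std::sqrt(d2), s.label});
    }
    // Keep only the k nearest, then take a majority vote over their labels.
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
    std::map<int, int> votes;
    for (std::size_t i = 0; i < k; ++i) ++votes[dist[i].second];
    int best = -1, best_count = 0;
    for (const auto& v : votes)
        if (v.second > best_count) { best = v.first; best_count = v.second; }
    return best;
}
```

In the report's setting each test word would be mapped to its feature vectors and voted on against the labeled training vectors.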
Vowels, unequal a-priori probabilities
Vowels, equal a-priori probabilities, each has 97 feature vectors
Vowels, equal a-priori probabilities, each has 194 feature vectors
Consonants, unequal a-priori probabilities
Consonants, equal a-priori probabilities, each has 194 feature vectors
Confusion table for consonants
KNN classification
References

• [1] W. Press, S. Teukolsky, W. Vetterling, B. Flannery, Numerical Recipes in C++: The Art of Scientific Computing, 2002.
• [2] S. S. Stevens, J. Volkmann, E. B. Newman, A Scale for the Measurement of the Psychological Magnitude Pitch, J. Acoust. Soc. Am., vol. 8, issue 3, pp. 185-190, 1937.
• [3] X. Huang, A. Acero, H. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall PTR, New Jersey, 2001.
• [4] L. Muda, M. Begam, I. Elamvazuthi, Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques, Journal of Computing, vol. 2, issue 3, pp. 138-143, 2010.
• [5] A. V. Oppenheim, R. W. Schafer, Discrete-Time Signal Processing, 3rd edition, Pearson International.
• [6] S. B. Davis, P. Mermelstein, Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, Haskins Laboratories, Status Report on Speech Research, 1980.
• [7] J. Ye, Speech Recognition Using Time Domain Features from Phase Space Reconstructions, PhD thesis, Marquette University, Wisconsin, US, 2004.
• [8] B. Plannerer, An Introduction to Speech Recognition, 2005.
• [9] R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, John Wiley & Sons, 2000.
• [10] L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall Signal Processing Series.
• [11] H. Artuner, The Design and Implementation of a Turkish Speech Phoneme Clustering System, PhD thesis, Hacettepe University, Turkey, 1994.