Final Report on Speech Recognition Project
Ceren Burçak Dağ
040100531
Introduction
• This project aims to design the pre-processing, clustering, and classifier blocks of a speech recognition system. The computations are implemented in C/C++ by the author, and the visualization materials are generated in MATLAB. Documentation of the code is given in the appendix.
Pre-Processing Block
Silence trimmed
RMS applied
Hanning windowed
FFT taken
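The plots for each step above exist only in the slide images. As a rough sketch of what the RMS and Hanning steps compute, in C++ (the report's implementation language, but not the author's actual code; the per-frame normalization details are assumptions), one frame could be processed as:

```cpp
#include <cmath>
#include <vector>

// Normalize a frame by its RMS value, then apply a Hann (Hanning) window
// so the segment tapers smoothly to zero at both ends before the FFT.
std::vector<double> preprocess_frame(const std::vector<double>& frame) {
    const std::size_t N = frame.size();
    double energy = 0.0;
    for (double s : frame) energy += s * s;
    const double rms = std::sqrt(energy / static_cast<double>(N));

    std::vector<double> out(N);
    const double pi = 3.14159265358979323846;
    for (std::size_t n = 0; n < N; ++n) {
        // Hann window: w[n] = 0.5 * (1 - cos(2*pi*n / (N-1))).
        const double w = 0.5 * (1.0 - std::cos(2.0 * pi * n / (N - 1)));
        out[n] = (rms > 0.0 ? frame[n] / rms : 0.0) * w;
    }
    return out;
}
```

The windowed frame would then be passed to the FFT routine (e.g., the one from Numerical Recipes [1]).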
Relation between Mel and Hertz scales
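The plotted relation follows the mel scale of Stevens, Volkmann and Newman [2]. The slide's exact formula is only in the image; the commonly used closed form, m = 2595 log10(1 + f/700), is an assumption here and is sketched below:

```cpp
#include <cmath>

// Hz -> mel, using the common closed-form approximation of the mel scale.
double hz_to_mel(double hz) {
    return 2595.0 * std::log10(1.0 + hz / 700.0);
}

// mel -> Hz, the inverse mapping.
double mel_to_hz(double mel) {
    return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0);
}
```

By construction, 1000 Hz maps to approximately 1000 mel; filter-bank edge frequencies are typically chosen equally spaced on the mel axis and mapped back to Hz with `mel_to_hz`.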
Triangular filters
Cepstrum Analysis and Homomorphic Deconvolution
• Nonlinear signal processing technique.
• Useful in speech processing and recognition applications.
• Bogert, Healy and Tukey defined cepstrum and quefrency in 1963.
• Oppenheim (1964) defined homomorphic systems.
• "The transformation of a signal into its cepstrum is actually a homomorphic transformation that maps the convolution into addition."
• Let x[n] be a sampled signal composed of the sum of a signal v[n] and an echo (a shifted and scaled copy) of it:
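The equation itself appears only in the slide image; in the standard notation of [5], the echo model presumably reads:

$$x[n] = v[n] + \alpha\, v[n - n_0] = v[n] * \big(\delta[n] + \alpha\, \delta[n - n_0]\big),$$

where $\alpha$ is the echo gain and $n_0$ the echo delay.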
Since convolution in the time domain corresponds to multiplication in the frequency domain,
Take the magnitude of both sides,
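The corresponding equations are only in the slide image; they are presumably the DTFT of the echo model and its magnitude:

$$X(e^{j\omega}) = V(e^{j\omega})\big(1 + \alpha e^{-j\omega n_0}\big),$$

$$\big|X(e^{j\omega})\big| = \big|V(e^{j\omega})\big|\,\big|1 + \alpha e^{-j\omega n_0}\big|.$$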
The nonlinear operation used in computing the cepstrum is the logarithm, so take the logarithm of both sides.
Since the logarithm of a product is the sum of the logarithms of its factors,
Define:
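The equations and the definition are only in the slide image; reconstructing them in standard form, the logarithm turns the product into a sum,

$$\log\big|X(e^{j\omega})\big| = \log\big|V(e^{j\omega})\big| + \log\big|1 + \alpha e^{-j\omega n_0}\big|,$$

and the definition presumably introduces the log-magnitude spectra, $\hat{X}(e^{j\omega}) \triangleq \log\big|X(e^{j\omega})\big|$ and $\hat{V}(e^{j\omega}) \triangleq \log\big|V(e^{j\omega})\big|$.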
To return to the time domain, we apply the inverse DTFT (I-DTFT).
Finally, one obtains the following quefrency-domain equation, the cepstrum:
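The quefrency-domain equation is only in the slide image; in standard form, the (real) cepstrum is the inverse DTFT of the log-magnitude spectrum:

$$c[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log\big|X(e^{j\omega})\big|\, e^{j\omega n}\, d\omega,$$

so the convolution of the original model has become an addition of cepstra (the subscript names below are ours, not the slide's): $c_x[n] = c_v[n] + c_e[n]$, where $c_e[n]$ is the cepstrum of the echo term.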
Speech Production Model based on Cepstrum Analysis
• Voiced sounds are produced by exciting the vocal tract with quasi-periodic pulses of air flow caused by the opening and closing of the glottis.
• Fricative sounds are produced by forming a constriction somewhere in the vocal tract and forcing air through the constriction so that the turbulence is created and therefore producing a noise-like excitation.
• Plosive sounds are produced by completely closing the vocal tract, building up pressure behind the closure, and then suddenly releasing the pressure.
Figure 17: Discrete-time speech production model, picture courtesy of Oppenheim, Discrete-Time Signal Processing, [5].
Parameters in the model
• 1. The coefficients of V(z), the mathematical representation of the vocal tract, which is simply a general IIR filter; the locations of its poles and zeros shape the sound.
• 2. The mode of excitation of the vocal tract system: a periodic impulse train or random noise.
• 3. The amplitude of the excitation signal.
• 4. The pitch period of the excitation for voiced speech, i.e., the reciprocal of the fundamental frequency of the voiced sound.
Let us assume that the model is valid and fixed over a short time interval (on the order of 10 ms), so we can apply cepstrum analysis to a short segment of length L (= 1024) samples.
Apply a window w[n] to the resulting signal so that it tapers smoothly to zero at both ends. The input to the homomorphic system will then be:
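The equation is only in the slide image; with voiced speech modeled as the excitation p[n] convolved with the vocal-tract response v[n], the windowed input presumably reads:

$$x[n] = w[n]\,\big(p[n] * v[n]\big).$$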
If we further assume w[n] varies slowly with respect to the variations of v[n], the cepstrum analysis reduces to,
If p[n] is a train of impulses,
By applying cepstrum analysis, we obtain the following equation.
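The resulting equation is only in the slide image; reconstructing it from the stated assumptions: with the window varying slowly, $x[n] \approx p_w[n] * v[n]$ where $p_w[n] = w[n]\,p[n]$, so the cepstrum is additive:

$$\hat{c}_x[n] = \hat{c}_{p_w}[n] + \hat{c}_v[n].$$

For an impulse train with pitch period $N_0$, $\hat{c}_{p_w}[n]$ contributes peaks at quefrencies $n = N_0, 2N_0, \ldots$, while $\hat{c}_v[n]$ is concentrated near $n = 0$; this separation in quefrency is what makes the excitation and the vocal-tract contributions distinguishable.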
MFCC and delta coefficients calculation
Clustering and Classification
• K-Means clustering is applied to each training file to generate the confusion matrix and tables.
• KNN is applied to recognize some test words.
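The K-Means and KNN code is the author's own C/C++ and is not reproduced in this transcript. A minimal KNN sketch in C++ (Euclidean distance, majority vote; the textbook formulation [9], not the author's implementation) might look like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// One labeled training point: a feature vector (e.g., MFCCs) and its class.
struct Sample {
    std::vector<double> features;
    int label;
};

// Classify a query vector by majority vote among its k nearest
// training samples under the Euclidean distance.
int knn_classify(const std::vector<Sample>& train,
                 const std::vector<double>& query, std::size_t k) {
    // Distance from the query to every training sample.
    std::vector<std::pair<double, int>> dist;  // (distance, label)
    for (const Sample& s : train) {
        double d2 = 0.0;
        for (std::size_t i = 0; i < query.size(); ++i) {
            const double diff = s.features[i] - query[i];
            d2 += diff * diff;
        }
        dist.push_back({std::sqrt(d2), s.label});
    }
    // Keep only the k nearest, then take a majority vote over their labels.
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
    std::map<int, int> votes;
    for (std::size_t i = 0; i < k; ++i) ++votes[dist[i].second];
    int best = -1, best_count = 0;
    for (const auto& v : votes)
        if (v.second > best_count) { best = v.first; best_count = v.second; }
    return best;
}
```

In the report's setting each test word would be mapped to its feature vectors and voted on against the labeled training vectors.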
Vowels, unequal a-priori probabilities
Vowels, equal a-priori probabilities, each has 97 feature vectors
Vowels, equal a-priori probabilities, each has 194 feature vectors
Consonants, unequal a-priori probabilities
Consonants, equal a-priori probabilities, each has 194 feature vectors
Confusion table for consonants
KNN classification
References

• [1] W. Press, S. Teukolsky, W. Vetterling, B. Flannery, Numerical Recipes in C++: The Art of Scientific Computing, 2002.
• [2] S. S. Stevens, J. Volkmann, E. B. Newman, A Scale for the Measurement of the Psychological Magnitude Pitch, J. Acoust. Soc. Am., vol. 8, issue 3, pp. 185-190, 1937.
• [3] X. Huang, A. Acero, H. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall PTR, New Jersey, 2001.
• [4] L. Muda, M. Begam, I. Elamvazuthi, Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques, Journal of Computing, vol. 2, issue 3, pp. 138-143, 2010.
• [5] A. V. Oppenheim, R. W. Schafer, Discrete-Time Signal Processing, 3rd edition, Pearson International.
• [6] S. B. Davis, P. Mermelstein, Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, Haskins Laboratories, Status Report on Speech Research, 1980.
• [7] J. Ye, Speech Recognition Using Time Domain Features from Phase Space Reconstructions, PhD thesis, Marquette University, Wisconsin, US, 2004.
• [8] B. Plannerer, An Introduction to Speech Recognition, 2005.
• [9] R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, John Wiley & Sons, 2000.
• [10] L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall Signal Processing Series.
• [11] H. Artuner, The Design and Implementation of a Turkish Speech Phoneme Clustering System, PhD thesis, Hacettepe University, Turkey, 1994.