

ISSN: 2321-7782 (Online) | e-ISJN: A4372-3114 | Impact Factor: 6.047
Volume 5, Issue 7, July 2017
International Journal of Advance Research in Computer Science and Management Studies
Research Article / Survey Paper / Case Study
Available online at: www.ijarcsms.com

Isolated Word Recognition and Sign Language Detection using LPC and MFCC

Rubal¹
Computer Science and Engineering
Maharaja Agrasen University
Baddi – India

Vineet Mehan²
Computer Science and Engineering
Maharaja Agrasen University
Baddi – India

Abstract: Speech technology and systems in human computer interaction have witnessed a steady and important advancement over the last two decades. Today, speech technologies are commercially available for a boundless but interesting range of tasks. These technologies permit machines to respond correctly and consistently to human voices, and provide useful and valuable services. The aim of this paper is to investigate algorithms for speech recognition on isolated words and then link the voice data with Indian sign language gestures. The authors programmed the speech recognition algorithms and their comparative study in MATLAB, and used a combination of multiple algorithms to achieve a high accuracy rate of speech detection. Two major algorithms are used in this paper: one is based on Linear Predictive Coding (LPC), and the other uses Mel-Frequency Cepstrum Coefficients (MFCC). The classification techniques used are the Gaussian Mixture Model (GMM) and k-nearest neighbor (KNN). Two conclusions can be drawn from the results: first, MFCC has poor performance with short recordings and, second, LPC has better performance with short recordings.

Keywords: Automatic Speech Recognition (ASR), Gaussian Mixture Model (GMM), Linear Predictive Coding (LPC), mel-frequency cepstral coefficients (MFCC), k-nearest neighbor (KNN).

I. INTRODUCTION

Speech is the most natural way of communication. It provides an efficient means of man-machine communication. Generally, transfer of information between human and machine is accomplished via keyboard, mouse, etc., but humans can speak more quickly than they can type. Speech input offers high-bandwidth information and relative ease of use [14]. Speech recognition can be defined as the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. It is done by means of an algorithm implemented as a computer program. Speech processing is one of the exciting areas of signal processing. In this paper, we present a speech recognition system for isolated words. Linear Predictive Coding (LPC) and Mel-Frequency Cepstrum Coefficients (MFCC) are used to train and recognize the speech, and GMM is used to classify the features from the speech utterances. The core of all speech recognition systems consists of a set of statistical models representing the various sounds of the language to be recognized. Speech has temporal structure and can be encoded as a sequence of spectral vectors spanning the audio frequency range; LPC and MFCC are both feature extraction techniques [11]. Apart from the introduction in Section I, the paper is organized as follows: Section II presents the speech recognition system architecture, Sections III and IV describe feature extraction and the MFCC processor, Section V describes LPC, Section VI presents the implementation strategy, Section VII gives the results, and Section VIII concludes the paper.

Page 2: Isolated Word Recognition and Sign Language Detection ... · Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known

Rubal et al., International Journal of Advance Research in Computer Science and Management Studies

Volume 5, Issue 7, July 2017 pg. 137-145

© 2017, IJARCSMS All Rights Reserved ISSN: 2321-7782 (Online) Impact Factor: 6.047 e-ISJN: A4372-3114 138 | P a g e

II. SPEECH RECOGNITION SYSTEM ARCHITECTURE

The speech recognition system architecture is shown in Fig. 1. It consists of two modules, a training module and a testing module. The training module generates the system model, which is used during testing.

A. Preprocessing:

A speech signal is an analog waveform that cannot be directly processed by digital systems. Thus preprocessing is necessary to transform the input speech into a form the recognizer can process. To accomplish this, the speech input is digitized, i.e. sampled. The sampled speech signal is then passed through a first-order filter to spectrally flatten the signal. This process, known as pre-emphasis, increases the magnitude of higher frequencies with respect to the magnitude of lower frequencies. The next step is to divide the speech signal into frames, with frame size ranging from 10 to 25 milliseconds and an overlap of 50%−70% between consecutive frames [15].
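A minimal sketch of the pre-emphasis step in Python/NumPy; this is illustrative rather than the authors' MATLAB code, and the coefficient 0.95 is an assumed, typical value:

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1].

    Boosts the magnitude of higher frequencies relative to lower ones,
    spectrally flattening the sampled speech signal. alpha is assumed
    here; values around 0.9-1.0 are common in practice.
    """
    return np.append(x[0], x[1:] - alpha * x[:-1])
```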

B. Feature Extraction:

The goal of feature extraction is to find a set of properties of an utterance that have acoustic correlations to the speech signal, i.e. parameters that can somehow be computed or estimated through processing of the speech waveform. Such parameters are termed features. The feature extraction process is expected to discard information irrelevant to the task (e.g. silence, noise) while keeping the useful information. It includes measuring some important characteristics of the signal such as energy or frequency response (signal measurement), augmenting these measurements with some perceptually meaningful derived measurements (signal parameterization), and statistically conditioning these numbers to form observation vectors [16].

Fig. 1 Speech Recognition System Architecture (speech signal → preprocessing → feature extraction → phonetic unit recognition, supported by acoustic modelling and language modelling → decoded message)

C. Model Generation:

The model is generated using approaches such as Hidden Markov Models (HMM), Artificial Neural Networks (ANN), Dynamic Bayesian Networks (DBN), Support Vector Machines (SVM), and hybrid methods (i.e. combinations of two or more approaches), as is done in this paper with the combination of LPC, MFCC, GMM and KNN.

D. Speech Transcription:

The speech transcription component recognizes the test samples based on the acoustic properties of the words.

III. SPEECH FEATURE EXTRACTION

The purpose of this stage is to convert the speech waveform to a set of features (at a considerably lower information rate) for further analysis [18]. This is often referred to as the signal-processing front end. The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Fig. 2. When examined over a sufficiently short period of time (between 5 and 100 ms), its characteristics are fairly stationary. However, over long periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.



Fig. 2 Example of speech signal

A wide range of possibilities exist for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and is described in this paper.

IV. MEL-FREQUENCY CEPSTRUM COEFFICIENTS PROCESSOR

A block diagram of the structure of an MFCC processor is given in Fig. 3. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion. Signals sampled at this rate can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear [19]. In addition, MFCCs are shown to be less susceptible to the variations mentioned above than the speech waveforms themselves.

Fig. 3 Block diagram of the MFCC processor (continuous speech → frame blocking → windowing → FFT → mel-frequency wrapping → cepstrum, yielding the mel cepstrum)

Frame blocking

In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames being separated by M (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by N − M samples, and so on. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to roughly 30 ms of windowing and facilitates the fast radix-2 FFT) and M = 100 [14].
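The frame blocking described above can be sketched as follows (Python/NumPy; N = 256 and M = 100 are the typical values quoted in the text, and the handling of trailing samples is a simplifying assumption):

```python
import numpy as np

def frame_blocking(x, N=256, M=100):
    """Block a 1-D signal into frames of N samples with hop M (M < N):
    frame i covers samples [i*M, i*M + N), so consecutive frames overlap
    by N - M samples. Assumes len(x) >= N; trailing samples that do not
    fill a whole frame are dropped in this simple sketch."""
    num_frames = 1 + (len(x) - N) // M
    return np.stack([x[i * M : i * M + N] for i in range(num_frames)])
```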

Windowing

The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as $w(n),\ 0 \le n \le N-1$, where $N$ is the number of samples in each frame, then the result of windowing is the signal



$$y_l(n) = x_l(n)\,w(n), \qquad 0 \le n \le N-1$$

Typically the Hamming window is used, which has the form:

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$
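As a sketch, the Hamming window above and its application to the blocked frames look like this in NumPy (np.hamming(N) implements the same formula):

```python
import numpy as np

def hamming(N):
    """w(n) = 0.54 - 0.46*cos(2*pi*n / (N - 1)), 0 <= n <= N-1."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))

def window_frames(frames):
    """Compute y_l(n) = x_l(n) * w(n) for every frame, tapering each
    frame toward zero at its edges; frames has shape (num_frames, N)."""
    return frames * hamming(frames.shape[1])
```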

Fast Fourier Transform (FFT)

The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm for implementing the Discrete Fourier Transform (DFT), which is defined on the set of N samples $\{x_n\}$ as follows:

$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, 2, \ldots, N-1$$

In general the $X_k$ are complex numbers and we only consider their absolute values (frequency magnitudes). The resulting sequence $\{X_k\}$ is interpreted as follows: positive frequencies $0 \le f < F_s/2$ correspond to values $0 \le n \le N/2$, while negative frequencies $-F_s/2 < f < 0$ correspond to $N/2 + 1 \le n \le N-1$. Here, $F_s$ denotes the sampling frequency. The result after this step is often referred to as the spectrum or periodogram.
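Since the input frames are real-valued and only magnitudes are kept, a one-sided FFT is sufficient in practice; a minimal NumPy sketch:

```python
import numpy as np

def magnitude_spectrum(windowed_frames):
    """|DFT| of each frame. np.fft.rfft returns the N//2 + 1 bins for
    the non-negative frequencies of a real-valued input, which is all
    that is needed once absolute values are taken."""
    return np.abs(np.fft.rfft(windowed_frames, axis=1))
```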

Mel-Frequency Wrapping

As mentioned above, psychophysical studies have shown that human perception of the frequency content of sounds, for speech signals, does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the "mel" scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz.
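Although the paper does not print it, a common analytic form of this scale (used in many MFCC implementations, and consistent with 1000 Hz mapping to 1000 mels) is:

$$\text{mel}(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)$$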

Fig. 4 An example of a mel-spaced filterbank (filter gain plotted against frequency in Hz)

One approach to simulating the subjective spectrum is to use a filter bank, spaced uniformly on the mel scale (see Fig. 4). That filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The number of mel spectrum coefficients, K, is typically chosen as 20. Note that this filter bank is applied in the frequency domain, so it simply amounts to applying the triangle-shaped windows of Fig. 4 to the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins have overlap) in the frequency domain.
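A sketch of such a triangular, mel-spaced filter bank in Python/NumPy; K = 20 follows the text, while n_fft and fs are assumed example parameters:

```python
import numpy as np

def hz_to_mel(f):
    # Common analytic mel scale (an assumption; see the formula above)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(K=20, n_fft=256, fs=10000):
    """K triangular bandpass filters whose centers are spaced uniformly
    on the mel scale; each row weights the n_fft//2 + 1 one-sided FFT
    bins, so applying the bank is a matrix product with the spectrum."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), K + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((K, n_fft // 2 + 1))
    for k in range(1, K + 1):
        lo, mid, hi = bins[k - 1], bins[k], bins[k + 1]
        fbank[k - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[k - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    return fbank
```

The mel power spectrum of a frame is then `mel_filterbank() @ (magnitude ** 2)`, giving the K coefficients used in the cepstrum step below.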



Cepstrum

In this final step, we convert the log mel spectrum back to time. The result is called the mel-frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that result from the last step as $\tilde{S}_k,\ k = 1, 2, \ldots, K$, we can calculate the MFCCs, $\tilde{c}_n$, as

$$\tilde{c}_n = \sum_{k=1}^{K} (\log \tilde{S}_k)\,\cos\!\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right], \qquad n = 0, 1, \ldots, K-1$$

Note that we exclude the first component, $\tilde{c}_0$, from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.
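The DCT above transcribes directly into code; a minimal sketch, where n_coeffs = 12 is an assumed, typical choice:

```python
import numpy as np

def mel_cepstrum(mel_power, n_coeffs=12):
    """MFCCs c~_n = sum_{k=1..K} log(S~_k) * cos[n * (k - 1/2) * pi / K].
    Starts at n = 1, excluding c~_0 (the mean term) as the text
    prescribes."""
    K = len(mel_power)
    k = np.arange(1, K + 1)
    log_s = np.log(mel_power)
    return np.array([np.sum(log_s * np.cos(n * (k - 0.5) * np.pi / K))
                     for n in range(1, n_coeffs + 1)])
```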

V. LINEAR PREDICTIVE CODING (LPC)

LPC is a powerful, robust, accurate, reliable and popular tool for speech recognition, compression and synthesis. The main objective of LPC is frame-based analysis of the input speech signal to generate observation vectors [21]. It is a very simple method and belongs to the spectral analysis family.

The LPC technique can provide an estimate of the poles of the vocal tract transfer function. In LPC, each sample is approximated as a linear combination of past samples. In order to implement LPC and generate the features, the input speech signal needs to pass through a pre-emphasizer; the output of the pre-emphasizer acts as the input to frame blocking, where the signal is blocked into frames of N samples.

The next step after frame blocking is windowing, where each frame is windowed in order to reduce signal discontinuity at the beginning and end of every frame; the Hamming window is a typical example. After windowing, each windowed frame is autocorrelated, with the highest autocorrelation lag corresponding to the order of the LPC analysis, and finally the LPC coefficients are derived [26].

Fig. 5 Block Diagram of LPC implementation
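Below is a sketch of this analysis for one windowed frame, using the autocorrelation method and the Levinson-Durbin recursion; the order p = 12 is an assumed, typical value, as the paper does not state the order it used:

```python
import numpy as np

def lpc(frame, p=12):
    """LPC coefficients a_1..a_p for one windowed frame: the model
    approximates x[n] by a linear combination of the p past samples."""
    # Autocorrelation of the frame for lags 0..p
    r = np.array([np.dot(frame[: len(frame) - i], frame[i:])
                  for i in range(p + 1)])
    a = np.zeros(p + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, p + 1):                      # Levinson-Durbin
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                             # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * np.concatenate((a[i - 1:0:-1], [1.0]))
        err *= 1.0 - k * k
    return a[1:]                                   # predictor coefficients
```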

VI. IMPLEMENTATION STRATEGIES

We came up with the idea of implementing the theories studied so far regarding LPC, GMM, KNN and MFCC. The following algorithm explains most of the implementation strategy [27]; a compact sketch of the recognition pipeline is given after the list. The steps of the proposed algorithm are described as follows:

1. Start


2. Load the speech dataset.

3. Apply the MFCC algorithm to extract speech features from the training dataset.

4. Apply GMM for classification of the MFCC features.

5. Extract MFCC features for the test dataset.

6. Classify the MFCC test data using the Mahalanobis distance.

7. Similarly, apply the LPC algorithm to extract speech features.

8. Apply GMM for classification of the LPC features.

9. Apply KNN with the Mahalanobis distance to the LPC features.

10. Extract LPC features for the test data.

11. Display the output.

12. Stop.
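As referenced above, here is a compact sketch of the recognition pipeline (steps 2–6, and identically for the LPC branch), using scikit-learn's GaussianMixture in place of the authors' MATLAB code; the function names, the diagonal covariance and the 4 mixture components are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

WORDS = ["A", "B", "C", "Five", "Point", "V"]

def train_word_models(train_feats, n_components=4):
    """train_feats[word] is a (num_frames, num_coeffs) array of MFCC
    (or LPC) feature vectors pooled over that word's training takes.
    One GMM is fitted per word."""
    return {w: GaussianMixture(n_components=n_components,
                               covariance_type="diag",
                               random_state=0).fit(train_feats[w])
            for w in WORDS}

def recognize(models, test_frames):
    """Label a test utterance with the word whose GMM gives the highest
    average log-likelihood over its frames. (The paper additionally
    applies a KNN / Mahalanobis-distance step, omitted here.)"""
    return max(models, key=lambda w: models[w].score(test_frames))
```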

As described in the algorithm above, the speech dataset is loaded first; this dataset comprises voice samples of individuals who spoke six words, namely A, B, C, Five, Point and V. Each word was spoken 25 times, i.e. each individual gave 150 voice samples. Voice data of three persons were used, for a total of 450 voice samples. Each person gave voice samples in continuous mode, so each sample is of its own type; because of this variation in the dataset, the system can recognize speech at a high accuracy rate. The individuals who provided the data samples are named Rubal, F and Neeraj; the voice samples of the first person were recorded in a real, i.e. noisy, environment, whereas the data of the latter two persons were recorded in a silent or controlled environment [29].

In the training section we fed images of Indian sign language hand gestures in JPEG format; the system is trained using these datasets. These gestures were collected from the internet using the Google search engine. In the testing section the authors stored images of Indian sign language gestures made with their own hands, so as to make a clear difference between training data and testing data.

Next, MFCC is applied over the dataset and training set to extract the features of the voice samples. GMM is then used for classification of the MFCC features, after which the Mahalanobis algorithm is used to calculate the distances of the MFCC vectors. The same process is applied to the testing dataset. In the next step, LPC feature extraction is carried out, GMM is applied over the LPC features for classification, and the Mahalanobis distance is used for distance calculation. The training and testing datasets both undergo the same procedure.

The algorithm explained above can be understood more easily from a flow chart; Fig. 6 shows the flow chart for the implementation of the algorithm designed above.

An automatic speech recognition system enables a machine to hear, comprehend and respond to the speech signal provided by the speaker as input. The methodology of the speech recognition system can be understood from the flowchart in Fig. 6. The input to the system is a speech signal, which is first divided into smaller components known as frames; this is also known as the analysis part. Feature extraction is then performed, and useful features are extracted from the voice signal.

After feature extraction, the feature vectors are divided into two categories: training and testing [30]. In training, speech modeling is performed, and in testing, pattern matching is done. During the recognition phase the test speech data is matched against the training models and the result is given according to the best match. Thus, the speech recognition system can be divided mainly into four phases.


1. Analysis 2. Feature Extraction 3. Modeling 4. Testing or Matching

Fig. 6 Flow chart for implementation strategy (Start → Load Dataset → Load Testing/Training Data → Apply MFCC / Apply LPC → Apply GMM → Apply Mahalanobis algorithm → Display Output)

VII. RESULTS

This section describes the recognition accuracy achieved by the proposed methodology. Table 1 shows the comparative study of the proposed work with the other algorithms, namely the MFCC algorithm, the LPC algorithm and the correlation method. To find the recognition accuracy (R), the following formula is used:

R = (No. of times word recognized / Total no. of trials) × 100
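For example, in Table 1 the MFCC row for speaker Rubal has 1 + 1 + 1 + 4 + 0 + 1 = 8 correct recognitions out of 6 × 5 = 30 trials, giving R = (8 / 30) × 100 = 26.67.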

TABLE 1 Accuracy Rate determined by the system

User Name | Trials per word | Recognized by MFCC (A, B, C, Five, Point, V) | Recognized by LPC (A, B, C, Five, Point, V) | Accuracy Rate of MFCC (R) | Accuracy Rate of LPC (R)
Rubal | 5 | 1, 1, 1, 4, 0, 1 | 5, 5, 5, 5, 5, 5 | 26.67 | 100
F | 5 | 0, 0, 1, 0, 1, 1 | 5, 5, 5, 5, 5, 5 | 10 | 100
Neeraj | 5 | 5, 5, 5, 5, 5, 5 | 5, 4, 4, 4, 5, 5 | 100 | 90

VIII. CONCLUSION

Automatic Speech Recognition has been under scrutiny for many years, but completely accurate and efficient systems have still not been created. In this paper we have studied the speech recognition system in depth, along with feature extraction techniques like LPC and MFCC. It was observed that each technique has its own merits and demerits. Such systems have many limitations: they are affected by background noise and then give less efficient results, they cannot identify speech from multiple users due to speech overlap, and they also have problems detecting the accent and pronunciation of speakers. It can also be concluded from the study that the majority of work in speech recognition has been done for the English language, and comparatively less work has been carried out for other languages such as Arabic and Indian languages.



It can also be observed from the study that recognition for the English language is the most accurate, and hence its recognition rate is higher compared to other languages. Instead of implementing single feature extraction techniques for ASR, there is a need to develop combinations of two or more techniques, i.e. hybrid techniques, which will make systems more reliable and robust and provide more accurate results. This paper includes a comparison between two feature extraction techniques, along with a detailed study of them; its main aim is to give researchers working in this area an understanding of the differences between these commonly used feature extraction techniques. We managed to get an accuracy of around 90% using LPC and around 40% for MFCC with GMM classification; comparing the two, the performance of LPC is better.

ACKNOWLEDGEMENT

The author would first like to express very profound gratitude to her parents and to her friends for providing her with unfailing support and continuous encouragement throughout her years of study and through the process of researching and writing this paper. This accomplishment would not have been possible without them.

References

1. Deng, Li; O'Shaughnessy, Douglas, "Speech Processing: A Dynamic and Optimization-Oriented Approach", Marcel Dekker, pp. 41–48, ISBN 0-8247-4040-8, 2003.
2. Tai, Hwan-Ching; Chung, Dai-Ting, "Stradivari Violins Exhibit Formant Frequencies Resembling Vowels Produced by Females", Savart Journal, 1(2), June 14, 2012.
3. Min Xu et al., "HMM-based audio keyword generation", in Kiyoharu Aizawa, Yuichi Nakamura, Shin'ichi Satoh (eds.), Advances in Multimedia Information Processing – PCM 2004: 5th Pacific Rim Conference on Multimedia, Springer, ISBN 3-540-23985-5, 2004.
4. Sahidullah, Md.; Saha, Goutam, "Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition", Speech Communication, 54(4): 543–565, doi:10.1016/j.specom.2011.11.004, May 2012.
5. Fang Zheng, Guoliang Zhang and Zhanjiang Song, "Comparison of Different Implementations of MFCC", J. Computer Science & Technology, 16(6): 582–589, 2001.
6. S. Furui, "Speaker-independent isolated word recognition based on emphasized spectral dynamics", 1986.
7. European Telecommunications Standards Institute, "Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms", Technical standard ES 201 108, v1.1.3, 2003.
8. T. Ganchev, N. Fakotakis, and G. Kokkinakis, "Comparative evaluation of various MFCC implementations on the speaker verification task", in 10th International Conference on Speech and Computer (SPECOM 2005), Vol. 1, pp. 191–194, 2005.
9. Meinard Müller, "Information Retrieval for Music and Motion", Springer, p. 65, ISBN 978-3-540-74047-6, 2007.
10. V. Tyagi and C. Wellekens, "On desensitizing the Mel-Cepstrum to spurious spectral components for Robust Speech Recognition", in Acoustics, Speech, and Signal Processing, 2005 Proceedings (ICASSP '05), IEEE International Conference on, Vol. 1, pp. 529–532, 2005.
11. P. Mermelstein, "Distance measures for speech recognition, psychological and instrumental", in Pattern Recognition and Artificial Intelligence, C. H. Chen, Ed., pp. 374–388, Academic, New York, 1976.
12. S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences", IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), pp. 357–366, 1980.
13. J. S. Bridle and M. D. Brown, "An Experimental Automatic Word-Recognition System", JSRU Report No. 1003, Joint Speech Research Unit, Ruislip, England, 1974.
14. Nelson Morgan, Hervé Bourlard and Hynek Hermansky, "Automatic Speech Recognition: An Auditory Perspective", in Steven Greenberg and William A. Ainsworth (eds.), Speech Processing in the Auditory System, Springer, p. 315, ISBN 978-0-387-00590-4, 2004.
15. L. C. W. Pols, "Spectral Analysis and Identification of Dutch Vowels in Monosyllabic Words", Doctoral dissertation, Free University, Amsterdam, The Netherlands, 1966.
16. R. Plomp, L. C. W. Pols, and J. P. van de Geer, "Dimensional analysis of vowel spectra", J. Acoustical Society of America, 41(3): 707–712, 1967.
17. Yuhuan Zhou, Jinming Wang and Xiongwei Zhang, "Research on Adaptive Speaker Identification Based on GMM", 2009 International Forum on Computer Science-Technology and Applications, IEEE, 2009.
18. Hiroshi Matsumoto and Masanori Moroto, "Evaluation of mel-LPC cepstrum in a large vocabulary continuous speech recognition", IEEE, 2001.
19. Kartiki Gupta and Divya Gupta, "An analysis on LPC, RASTA and MFCC techniques in Automatic Speech Recognition System", IEEE, 2016.
20. Jose B. Trangol Curipe and Abel Herrera Camacho, "Feature Extraction Using LPC-Residual and Mel Frequency Cepstral Coefficients in Forensic Speaker Recognition", International Journal of Computer and Electrical Engineering, Vol. 5, No. 1, February 2013.
21. Abhishek Dixit, Abhinav Vidwans and Pankaj Sharma, "Improved MFCC and LPC Algorithm for Bundelkhandi Isolated Digit Speech Recognition", International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), 2016.
22. Anuradha P. Nair, Shoba Krishnan and Zia Saquib, "MFCC Based Noise Reduction in ASR Using Kalman Filtering", Conference on Advances in Signal Processing (CASP), IEEE, 2016.
23. M. K. Brown and L. R. Rabiner, "On the Use of Energy in LPC-Based Recognition of Isolated Words", The Bell System Technical Journal, Vol. 61, No. 10, December 1982.
24. Saptarshi Boruah and Subhash Basishtha, "A Study on HMM based Speech Recognition System", IEEE International Conference on Computational Intelligence and Computing Research, 2013.
25. Kuldeep Kumar and R. K. Aggarwal, "Hindi speech recognition system using HTK", International Journal of Computing and Business Research, ISSN (Online): 2229-6166, Volume 2, Issue 2, May 2011.
26. Bishnu Prasad Das and Ranjan Parekh, "Recognition of Isolated Words using Features based on LPC, MFCC, ZCR and STE, with Neural Network Classifiers", International Journal of Modern Engineering Research (IJMER), Vol. 2, Issue 3, May-June 2012, pp. 854–858, ISSN: 2249-6645.
27. Helping website, https://sites.google.com/site/autosignlan, Automatic Sign Language Detection.
28. Helping website, http://www.ifp.uiuc.edu/~minhdo/teaching/speaker_recognition, DSP Mini-Project: An Automatic Speaker Recognition System.
29. Helping website, https://www.wikipedia.com/speechrecognition, Speech Recognition and its history.
30. Suma Swamy and K. V. Ramakrishnan, "An efficient Speech Recognition System", Computer Science & Engineering: An International Journal (CSEIJ), Vol. 3, No. 4, August 2013.

AUTHOR(S) PROFILE

Rubal received the B.Tech. degree in Computer Science and Technology from Uttar Pradesh Technical University in 2014. She is pursuing the M.Tech. in Computer Science and Technology at Maharaja Agrasen University, Baddi, Himachal Pradesh, India.

Vineet Mehan received the B.Tech. degree in Information Technology from Kurukshetra University in 2003. He received the M.E. degree in Computer Science and Engineering from NITTTR, Panjab University in 2008. He completed the Ph.D. degree in Computer Science and Engineering at Dr. B. R. Ambedkar National Institute of Technology, Jalandhar in 2016. His research interests include Image Processing, Watermarking and Cryptography.