

International Journal of Electronics and Communication Engineering & Technology (IJECET)
Volume 6, Issue 9, Sep 2015, pp. 82-96, Article ID: IJECET_06_09_010
Available online at http://www.iaeme.com/IJECETissues.asp?JType=IJECET&VType=6&IType=9
ISSN Print: 0976-6464 and ISSN Online: 0976-6472
© IAEME Publication

COMPARATIVE STUDY OF LPCC AND FUSED MEL FEATURE SETS FOR SPEAKER IDENTIFICATION USING GMM-UBM

Anagha S. Bawaskar
Dept. of Electronics & Telecommunication, M. E. S. College of Engineering, Pune, India

Prabhakar N. Kota
Dept. of Electronics & Telecommunication, M. E. S. College of Engineering, Pune, India

ABSTRACT

Biometric identifiers are measurable characteristics used to label and describe individuals. They span both physiological and behavioral characteristics. Physiological characteristics relate to the shape of the body; examples include, but are not limited to, fingerprint, palm, hand geometry, iris, and retina. Behavioral characteristics relate to a person's patterns of behavior, including, but not limited to, typing rhythm and voice. Biometric technology, which analyzes such human body characteristics, is now well established, and speaker recognition is one of its most active tasks. Speech-related tasks include language recognition, speech recognition, and speaker recognition. Speaker recognition, broadly, is the task of identifying the correct speaker from a group of people; it is further classified into speaker identification and speaker verification. This paper concentrates on speaker identification: the aim is to identify the correct speaker from given speech samples. Features are extracted from these samples and used for modeling. The standard TIMIT database is used for identification. The paper covers several feature extraction algorithms: Mel Frequency Cepstral Coefficients (MFCC), Inverse Mel Frequency Cepstral Coefficients (IMFCC), and Linear Predictive Cepstral Coefficients (LPCC). The term Fusion refers to the combination of two of these algorithms, MFCC and IMFCC.


The results of Fusion and LPCC are compared; from the results, it is seen that on average Fusion performs better than LPCC.

Index Terms: Gaussian Mixture Models (GMM), Inverted Mel Frequency Cepstral Coefficients (IMFCC), Linear Predictive Cepstral Coefficients (LPCC), Mel Frequency Cepstral Coefficients (MFCC), Universal Background Model (UBM)

Cite this Article: Anagha S. Bawaskar and Prabhakar N. Kota. Comparative Study of LPCC and Fused Mel Feature Sets For Speaker Identification Using GMM-UBM, International Journal of Electronics and Communication Engineering & Technology, 6(9), 2015, pp. 82-96. http://www.iaeme.com/IJECET/issues.asp?JType=IJECET&VType=6&IType=9

1. INTRODUCTION

In recent decades, interest in security systems has risen steadily, and various biometric schemes are used for security purposes. Biometrics refers to technologies that measure and analyze human body characteristics. Many biometric methods exist for authentication, such as face recognition, eye retina and iris recognition, fingerprint, DNA, and hand measurements. To this list of well-known methods can be added speech signal processing. Speech is one of the natural forms of communication, and speech recognition has applications ranging from voice identification on ordinary personal computers to biometric and forensic systems. There are two main techniques in speech processing: speaker recognition and speech recognition; this paper focuses on speaker recognition. Speaker recognition is further divided into speaker identification and speaker verification.

Speaker identification determines which registered speaker produced an utterance, whereas speaker verification accepts or rejects the identity claim of a speaker. Speaker identification is thus a 1:N matching problem, while speaker verification is a 1:1 matching problem. In this paper a text-independent speaker identification system is used. In speaker identification, specific characteristics of the voice are extracted from a given voice sample of a speaker; this is known as feature extraction. The speaker model is then trained and stored in the system database. Extracting features from the speaker's voice yields specific information about that voice, called feature vectors. These vectors represent speaker-specific information based on one or more of the following: the vocal tract, the excitation source, and behavioral traits. Speaker recognition systems use a set of scores to enhance the probability and reliability of the recognizer. Before feature extraction, the system goes through a pre-processing stage. Pre-processing plays an important role in speaker identification and is considered good practice: it reduces the amount of variation in the database that carries no important information about the speech, removing the irrelevant information.

The algorithms used for feature extraction are Mel Frequency Cepstral Coefficients (MFCC), Inverse Mel Frequency Cepstral Coefficients (IMFCC), and Linear Predictive Cepstral Coefficients (LPCC); in this paper, feature extraction is done using all of these algorithms. Researchers have found speaker-specific information complementary to MFCC, which is captured by the Inverse Mel Frequency Cepstral Coefficients (IMFCC). This complementary information is used for combining the score models along with MFCC, and the result is named the Fused Mel Feature Set; such models are simply mathematical representations of the system under study [1]. An inverted filter bank is used to capture this complementary information from the high-frequency part of the energy spectrum: IMFCC captures the information that MFCC neglects. The features are modeled using a Gaussian Mixture Model with a Universal Background Model (GMM-UBM). All feature extraction algorithms used in this paper are based on Gaussian filters, and the results are verified on the standard TIMIT database.

The final results compare LPCC against the Fused Mel Feature Set, and the more accurate results are noted. The next sections of this paper cover the Fused Mel Feature Set using Gaussian filters and Linear Predictive Cepstral Coefficients using Gaussian filters, followed by comparative results for both.

2. FEATURE EXTRACTION AND FILTER DESIGN

The goal of feature extraction is to represent a speech signal by a finite number of measures. Features are simply a representation of the spectrum of the speech signal in each window frame. The cepstral vectors are derived from a filter bank designed according to some model of the auditory system [2]. Most feature extraction methods use a standard triangular filter bank: triangular filters, which simulate the characteristics of the human ear, filter the spectrum of the speech signal. This has disadvantages, however. Triangular filters impose a sharp, crisp partition of the energy spectrum, and because of this some information is lost. In this paper, Gaussian filters are used instead. Using Gaussian filters rather than triangular filters avoids the crisp, sharp transitions in the energy spectrum, giving a smoother transition from one sub-band to the next. Because of this property, a degree of correlation is always maintained, both from the midpoints at the base of the triangular filters and from their end points. Mathematical calculations with Gaussian filters are also simple. For these advantages over triangular filters, Gaussian filters are used here. The motivation for using Mel Frequency Cepstral Coefficients is that the auditory response of the human ear resolves frequencies nonlinearly. The mapping from linear frequency to Mel frequency is defined as [3]

f_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right)    (1)

where f_{mel} is the subjective pitch in Mel corresponding to the frequency f, measured in Hz.
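As a quick illustration (not part of the original paper), the warping of equation (1) and the inverse used later in equation (5) can be computed directly; the sketch below assumes Python with NumPy.

import numpy as np

def hz_to_mel(f):
    # Warp linear frequency in Hz to the Mel scale, equation (1)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(f_mel):
    # Inverse warping, equation (5)
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

# 1 kHz maps to roughly 1000 Mel, and the mapping round-trips exactly
print(hz_to_mel(1000.0))             # ~999.99
print(mel_to_hz(hz_to_mel(4000.0)))  # 4000.0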

MFCCs are among the most popular parameterization methods used by researchers in the speech technology field. They have the benefit of capturing the phonetically important characteristics of speech. MFCCs are band-limited in nature and can easily be adapted to applications such as telephone speech.


Generally, feature extraction using MFCC uses a triangular filter bank. The triangular filter tapers asymmetrically and does not provide correlation between a sub-band and its nearby spectral components, so some information is lost. Using Gaussian filters avoids these drawbacks and losses: Gaussian filters taper toward both ends and provide correlation between sub-bands and their nearby spectral components [4].

IMFCC is another feature extraction technique; it captures the complementary information present in the high-frequency part of the spectrum. Figure 1 shows the steps involved in extracting both Gaussian MFCC and IMFCC features. Let the input speech signal be y(n), n = 1, ..., M, representing a preprocessed frame of the signal. First the signal y(n) is converted to the frequency domain by a DFT, which yields the energy spectrum. This is followed by the Gaussian filter bank block.

Figure 1 Steps involved in extraction of Gaussian MFCC and IMFCC [5]

Mathematically, the Gaussian filter is written as

\psi_i^{g}(k) = \exp\left( - \frac{(k - k_{b_i})^2}{2 \sigma_i^2} \right)    (2)

where k is the coefficient index in the N-point DFT, k_{b_i} is the boundary point at the base of the i-th triangular filter, taken as the mean of the i-th Gaussian filter, and \sigma_i is the standard deviation (the square root of the variance), written as

\sigma_i = \frac{k_{b_{i+1}} - k_{b_i}}{\gamma}    (3)

where \gamma is the parameter by which the variance is controlled.

Figure 2 Filterbank design [5]


Page 5: COMPARATIVE STUDY OF LPCC AND FUSED MEL FEATURE SETS … › ... › IJECET_06_09_010.pdf · Comparative Study of LPCC and Fused Mel Feature Sets For Speaker Identification Using

Anagha S. Bawaskar and Prabhakar N. Kota

http://www.iaeme.com/IJECET/index.asp 86 [email protected]

Figure 2 shows two plots in a single figure: one for the triangular filter and the other for the Gaussian filter, drawn for a single value of sigma; plots for other values of sigma can be drawn in the same way. Figures 3 and 4 show the individual responses of the Gaussian filter banks for MFCC and IMFCC.

Figure 3 Mel scale Gaussian filter bank [6]

Figure 4 Inverted Mel scale Gaussian filter bank [6]

Mathematically, the boundary points of the Gaussian MFCC filter bank can be written as

k_{b_i} = \frac{M_s}{F_s} \, f_{mel}^{-1}\left( f_{mel}(f_{low}) + i \, \frac{f_{mel}(f_{high}) - f_{mel}(f_{low})}{Q + 1} \right)    (4)

where M_s is the number of points in the DFT, F_s is the sampling frequency, f_{low} and f_{high} are the low- and high-frequency boundaries of the filter bank, Q is the number of filters in the bank, and f_{mel}^{-1} is the inverse of the transformation in (1):

f_{mel}^{-1}(f_{mel}) = 700 \left( 10^{f_{mel}/2595} - 1 \right)    (5)
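A minimal sketch of the Gaussian filter bank implied by equations (1)-(5) is given below, assuming Python/NumPy; the defaults (Q = 20 filters, a 512-point DFT, gamma = 2.0) are illustrative assumptions, and equation (3) is applied as reconstructed above.

import numpy as np

def gaussian_mel_filterbank(Q=20, Ms=512, Fs=16000, f_low=0.0, f_high=8000.0, gamma=2.0):
    # Mel warping and its inverse, equations (1) and (5)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Q+2 points uniformly spaced on the Mel scale give the boundary
    # points k_{b_i} of equation (4), expressed as DFT bin indices
    mel_pts = np.linspace(mel(f_low), mel(f_high), Q + 2)
    k_b = (Ms / Fs) * mel_inv(mel_pts)
    k = np.arange(Ms // 2 + 1)               # DFT coefficient indices
    bank = np.zeros((Q, k.size))
    for i in range(1, Q + 1):
        sigma = (k_b[i + 1] - k_b[i]) / gamma                 # equation (3)
        bank[i - 1] = np.exp(-((k - k_b[i]) ** 2) / (2.0 * sigma ** 2))  # equation (2)
    return bank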

The inverted Mel scale filter bank structure is obtained by flipping the original filter bank around the midpoint of the frequency range under consideration:

\hat{\psi}_i(k') = \psi_{Q+1-i}\left( \frac{M_s}{2} + 1 - k' \right)    (6)

where \psi_i(k') is the original MFCC filter bank response.
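Under the same assumptions as the earlier sketch, the flip of equation (6) reduces, for a discrete bank, to reversing both the filter order and the DFT index axis (up to the one-bin shift in (6)):

def invert_filterbank(bank):
    # bank has shape (Q, Ms/2 + 1), as returned by gaussian_mel_filterbank;
    # reversing both axes realizes the Q+1-i filter reordering and the
    # Ms/2 + 1 - k' index flip of equation (6)
    return bank[::-1, ::-1]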



These filter banks are imposed on the energy spectrum obtained by taking the Fast Fourier transform of the preprocessed signal, as follows:

e_i^{g,MFCC} = \sum_{k=1}^{M_s/2} |Y(k)|^2 \, \psi_i^{g,MFCC}(k)    (7)

where \psi_i(k) is the respective filter response and |Y(k)|^2 is the energy spectrum.

Figure 5 Response \psi_i(k) of a typical Mel scale filter, with boundary points k_{b_{i-1}}, k_{b_i}, k_{b_{i+1}} on the DFT coefficient index axis [5]

Finally, the DCT is taken on the log filter bank energies \{\log[e(i)]\}_{i=1}^{Q}, and the final MFCC coefficients can be written as

C_m^{g,MFCC} = \sqrt{\frac{2}{Q}} \sum_{l=0}^{Q-1} \log\left[ e^{g,MFCC}(l+1) \right] \cos\left[ \frac{\pi m (2l+1)}{2Q} \right]    (8)

where 0 \le m \le R-1 and R is the desired number of cepstral features. The same procedure extracts the IMFCC features as well [4], denoted

C_m^{g,IMFCC} = \sqrt{\frac{2}{Q}} \sum_{l=0}^{Q-1} \log\left[ e^{g,IMFCC}(l+1) \right] \cos\left[ \frac{\pi m (2l+1)}{2Q} \right]    (9)
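Equations (7)-(9) can be sketched as follows, assuming a bank array from the earlier Gaussian filter bank sketch and an illustrative R = 13 cepstral features; the same routine yields MFCC or IMFCC depending on which bank is supplied.

import numpy as np

def gaussian_cepstra(frame, bank, R=13):
    # Energy spectrum |Y(k)|^2 of the preprocessed frame
    K = bank.shape[1]                        # Ms/2 + 1 bins
    Y = np.fft.rfft(frame, n=2 * (K - 1))
    e = bank @ (np.abs(Y) ** 2)              # filter bank energies, equation (7)
    Q = bank.shape[0]
    l = np.arange(Q)
    m = np.arange(R).reshape(-1, 1)
    # DCT of the log energies, equations (8)-(9)
    return np.sqrt(2.0 / Q) * (np.cos(np.pi * m * (2 * l + 1) / (2.0 * Q)) @ np.log(e))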

3. LINEAR PREDICTIVE CEPSTRAL COEFFICIENTS (LPCC)

The predictor coefficients are determined by minimizing the squared differences between the actual speech samples and the linearly predicted values; this yields a unique set of parameters. In practice, the actual predictor coefficients are never used as-is because of their high variance; they are transformed into a more robust set of parameters known as cepstral coefficients. The procedure for extracting the LPCC is otherwise analogous to that of MFCC and IMFCC, and a Gaussian filter bank is used here as well.

Figure 6 Block diagram of LPCC algorithm [7]

(Figure 6 pipeline: Speech sequence → Pre-emphasis and Hamming window → Linear predictive analysis → Cepstral analysis → LPCC)


Pre-emphasis and Hamming Window

The first block is pre-emphasis; the input signal is passed to it as the first step of the algorithm. The idea of pre-emphasis is to spectrally flatten the speech signal and equalize the inherent spectral tilt of speech [8]. Pre-emphasis is implemented by a first-order FIR digital filter whose transfer function is

H_p(z) = 1 - \alpha z^{-1}    (10)

where \alpha is a constant with a typical value of 0.97.

After pre-emphasis, the speech signal is subdivided into frames. This process is the same as multiplying the entire speech sequence by a windowing function,

s_m[n] = s[n] \, w[n - m]    (11)

where s[n] is the entire speech sequence, s_m[n] is a windowed speech frame at time m, and w[n] is the windowing function.

The typical length of a frame is about 20-30 milliseconds. In the above equation, m is the time shift, or step size, of the windowing function; a new frame is obtained by shifting the windowing function to a subsequent time, and the amount of shifting is typically 10 milliseconds. The shape of the windowing function is important: a rectangular window is not recommended, since it causes severe spectral distortion (leakage) in the speech frames [9], so other windowing functions, which minimize the spectral distortion, should be used. One of the most commonly used windows is the Hamming window,

w[n] = 0.54 - 0.46 \cos\left( \frac{2 \pi n}{N - 1} \right)    (12)

In the above equation, N is the length of the windowing function. After Hamming

windowing, the speech frame is passed to the next stage for further processing.
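A hedged sketch of equations (10)-(12) follows; the frame and shift lengths of 25 ms and 10 ms are illustrative choices within the ranges stated above.

import numpy as np

def preemphasize(s, alpha=0.97):
    # First-order FIR pre-emphasis H_p(z) = 1 - alpha*z^-1, equation (10)
    return np.append(s[0], s[1:] - alpha * s[:-1])

def frames(s, Fs=16000, frame_ms=25, shift_ms=10):
    # Hamming-windowed overlapping frames, equations (11)-(12)
    N = int(Fs * frame_ms / 1000)        # 20-30 ms frame length
    step = int(Fs * shift_ms / 1000)     # ~10 ms shift
    w = 0.54 - 0.46 * np.cos(2.0 * np.pi * np.arange(N) / (N - 1))
    count = 1 + (len(s) - N) // step
    return np.stack([s[m * step: m * step + N] * w for m in range(count)])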

Linear predictive analysis

In human speech production, the shape of the vocal tract governs the nature of the sound produced. The main idea, based on the basic speech production model, is that the vocal tract can be modeled by an all-pole filter; the predictor coefficients are simply the coefficients of this all-pole filter, and they correspond to the smooth envelope of the log spectrum of speech. The central idea behind LPC is that a given speech sample can be approximated as a linear combination of past speech samples: LPC models the signal s(n) as a linear combination of its past values and the present input (vocal cord excitation). If the signal is represented only in terms of a linear combination of its past values, the difference between the real and predicted output is called the prediction error, and LPC finds the coefficients by minimizing this prediction error.

The cepstrum is the inverse transform of the log of the magnitude of the spectrum. It is useful for separating convolved signals (like the source and filter in the speech production model): the log operation separates the vocal tract transfer function from the voice source, since the vocal tract filter has slow spectral variations while the excitation signal has rapid spectral variations. Cepstral coefficients generally provide more efficient and robust coding of speech information than LPC coefficients.


Figure 7 LPCC [10]

The predictor coefficients are rarely used directly as features; they are transformed into the more robust Linear Predictive Cepstral Coefficient (LPCC) features. The LPC are obtained using the Levinson-Durbin recursive algorithm; this is known as LPC analysis. The difference between the actual and the predicted sample value is termed the prediction error or residual [11] and is given by

e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k \, s(n-k)    (13)

e(n) = \sum_{k=0}^{p} a_k \, s(n-k), \quad a_0 = 1    (14)

Optimal predictor coefficients minimize this mean square error E. At the minimum value of E,

\frac{\partial E}{\partial a_k} = 0, \quad k = 1, 2, \ldots, p    (15)

Differentiating and equating to zero gives

R a = r    (16)

where

a = [a_1 \; a_2 \; \ldots \; a_p]^T    (17)

r = [r(1) \; r(2) \; \ldots \; r(p)]^T    (18)

and R is the Toeplitz symmetric autocorrelation matrix given by

R = \begin{bmatrix} r(0) & r(1) & \ldots & r(p-1) \\ r(1) & r(0) & \ldots & r(p-2) \\ \vdots & & & \vdots \\ r(p-1) & r(p-2) & \ldots & r(0) \end{bmatrix}

This equation can be solved for the predictor coefficients using the Levinson-Durbin algorithm as follows:

E^{(0)} = r[0]    (19)

k_i = \frac{ r[i] - \sum_{j=1}^{i-1} a_j^{(i-1)} \, r[|i-j|] }{ E^{(i-1)} }    (20)


where 1 \le i \le p, and

a_i^{(i)} = k_i    (21)

a_j^{(i)} = a_j^{(i-1)} - k_i \, a_{i-j}^{(i-1)}    (22)

E^{(i)} = (1 - k_i^2) \, E^{(i-1)}    (23)

The above set of equations is solved recursively for i = 1, 2, ..., p. The final solution is given by

a_m = a_m^{(p)}, \quad 1 \le m \le p    (24)

where the a_m are the linear predictive coefficients (LPC).
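The recursion of equations (19)-(24) can be sketched as follows; an illustrative order p = 12 is assumed, and the autocorrelation sequence r is estimated directly from the windowed frame.

import numpy as np

def levinson_durbin(frame, p=12):
    # Autocorrelation sequence r(0)..r(p) of the windowed frame
    r = np.array([np.dot(frame[: len(frame) - k], frame[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)      # a[1..p] hold the current predictor coefficients
    E = r[0]                 # equation (19)
    for i in range(1, p + 1):
        # Reflection coefficient, equation (20)
        k_i = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E
        a_prev = a.copy()
        a[i] = k_i                                        # equation (21)
        a[1:i] = a_prev[1:i] - k_i * a_prev[i - 1:0:-1]   # equation (22)
        E = (1.0 - k_i ** 2) * E                          # equation (23)
    return a[1:], E          # LPC a_1..a_p (equation (24)) and the gain term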

Cepstral Analysis.

In reality, the actual predictor coefficients are never used in recognition, since they typically show high variance; they are transformed into a more robust set of parameters known as cepstral coefficients. Before defining the cepstral coefficients, consider the definition of the cepstrum: a cepstrum is the result of taking the Fourier transform of the logarithm of the estimated spectrum of a signal. The three types of cepstrum are the power cepstrum, the complex cepstrum, and the real cepstrum; among them, the power cepstrum in particular finds application in the analysis of human speech. The name cepstrum was derived from the word spectrum by reversing its first four letters.

The input speech signal goes through preprocessing, then feature extraction, and then modeling. Preprocessing reduces the complexity of operating on the speech signal by reducing the number of samples to operate on: it is very difficult to work on a huge set of samples, so instead of working on such a large set, operations are restricted to a frame of sufficiently reduced length. After this signal conditioning, i.e. pre-processing, the speech signal goes through the feature extraction stage, where the coefficients are calculated using the DCT.

Mathematically,

Ceps = DCT\left( \log \left| FFT(y_{window}) \right| \right)    (25)
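As an illustrative one-liner, equation (25) might be realized as below; SciPy's DCT-II is an assumption here (any DCT implementation works), and the random vector merely stands in for a windowed speech frame.

import numpy as np
from scipy.fftpack import dct

y_window = np.random.randn(400)   # stand-in for one windowed speech frame
# Equation (25): DCT of the log magnitude spectrum (small epsilon guards log(0))
ceps = dct(np.log(np.abs(np.fft.fft(y_window)) + 1e-12), norm='ortho')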

The principal advantage of cepstral coefficients is that they are generally decorrelated, which allows diagonal covariances to be used. One minor problem is that the higher-order cepstral coefficients are numerically quite small, which results in a very wide range of variances when going from the low- to the high-order cepstral coefficients. Cepstral coefficients can be used to separate the excitation signal (which contains the words and the pitch) from the transfer function (which contains the voice quality), and the cepstrum can be seen as information about the rate of change in the different spectrum bands. The recursive relation between the predictor coefficients and


cepstral coefficients is used to convert the LP coefficients (LPC) into LP cepstral coefficients c_k:

c_0 = \ln \sigma^2    (26)

c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m} \, c_k \, a_{m-k}, \quad 1 \le m \le p    (27)

c_m = \sum_{k=m-p}^{m-1} \frac{k}{m} \, c_k \, a_{m-k}, \quad m > p    (28)

where \sigma^2 is the gain term in the LP analysis and d is the number of LP cepstral coefficients.
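A sketch of the recursion in equations (26)-(28), with a holding a_1..a_p and gain standing for the \sigma^2 term of equation (26):

import numpy as np

def lpc_to_lpcc(a, gain, d):
    # a holds a_1..a_p (0-indexed), gain is sigma^2 from the LP analysis
    p = len(a)
    c = np.zeros(d)
    c[0] = np.log(gain)      # equation (26)
    for m in range(1, d):
        # Shared summation of equations (27) and (28); the k range
        # max(1, m-p)..m-1 covers both the m <= p and m > p cases
        acc = sum((k / m) * c[k] * a[m - k - 1] for k in range(max(1, m - p), m))
        c[m] = acc + (a[m - 1] if m <= p else 0.0)
    return c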

4. GAUSSIAN MIXTURE MODEL (GMM) AND UNIVERSAL BACKGROUND MODEL

The text-independent speaker recognition system used in this paper uses the GMM-UBM approach for modeling. Two models are developed: a target speaker model and an impostor model (the UBM). This approach has the generalization ability to handle unseen acoustic patterns [12].

In a biometric system, the GMM is commonly used as a parametric model of the probability distribution of continuous measurements or features; in a speaker identification system, the features are generally vocal tract features. GMMs are typically used for text-independent speaker identification, where there is no prior knowledge of what the speaker will say, so modeling is generally done with GMMs. A Gaussian mixture model is a weighted sum of M component Gaussian densities, as given by [13]:

p(x|\lambda) = \sum_{i=1}^{M} w_i \, g(x|\mu_i, \Sigma_i)    (29)

where x is a D-dimensional continuous-valued data vector (i.e. measurements or features), w_i, i = 1, ..., M, are the mixture weights, and g(x|\mu_i, \Sigma_i), i = 1, ..., M, are the component Gaussian densities. Each component density is a D-variate Gaussian function of the form

g(x|\mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)' \, \Sigma_i^{-1} \, (x - \mu_i) \right\}    (30)

with mean vector \mu_i and covariance matrix \Sigma_i. The mixture weights satisfy the constraint \sum_{i=1}^{M} w_i = 1. The complete Gaussian mixture model is parameterized by the mean vectors, covariance matrices and mixture weights from all component densities. These parameters are collectively represented by the notation

\lambda = \{ w_i, \mu_i, \Sigma_i \}, \quad i = 1, \ldots, M    (31)


For a sequence of T training vectors X = \{x_1, \ldots, x_T\}, the GMM likelihood, assuming independence between the vectors, can be written as

p(X|\lambda) = \prod_{t=1}^{T} p(x_t|\lambda)    (32)

For utterances with T frames, the log-likelihood of a speaker model \lambda_s is

L_s(X) = \log p(X|\lambda_s) = \sum_{t=1}^{T} \log p(x_t|\lambda_s)    (33)

For speaker identification, the value of L_s(X) is computed for all speaker models \lambda_s enrolled in the system, and the owner of the model that generates the highest value is returned as the identified speaker. During the training phase, the feature vectors are used to train the models with the Expectation-Maximization (EM) algorithm, which iteratively updates each of the parameters in \lambda with a consecutive increase in the log-likelihood at each step.
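A minimal scoring sketch of equations (29), (32) and (33) is given below; it assumes diagonal covariances (common in practice, though equation (30) is written for full covariances) and a dictionary models mapping each enrolled speaker to (weights, means, variances).

import numpy as np

def log_gaussian(x, mu, var):
    # Log of equation (30) with diagonal covariances, evaluated for all M
    # components at once; mu and var have shape (M, D)
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)

def log_likelihood(X, w, mu, var):
    # L_s(X) of equation (33): per-frame log of the weighted sum in
    # equation (29), summed over the T frames of X
    return sum(np.logaddexp.reduce(np.log(w) + log_gaussian(x, mu, var)) for x in X)

def identify(X, models):
    # Return the enrolled speaker whose model scores highest
    return max(models, key=lambda s: log_likelihood(X, *models[s]))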

GMMs are generally used for text-independent speaker identification, and the drawbacks of previous systems are overcome by using GMM-UBM: it reduces the modeling cost, being not as expensive as a plain GMM per speaker, and there is no need for a big phoneme or vocabulary database; GMM is also more advantageous here than HMM. The basic idea of the UBM is to capture the general characteristics of a population and then adapt them to an individual speaker. More precisely, the UBM is a model, used among other application areas in biometric systems, against which a person-independent feature characteristic is compared with the person-specific feature model when deciding acceptance or rejection. The UBM can also be described simply as a GMM trained on a large set of speakers.

The UBM is trained with the EM algorithm on its own training data. For the speaker recognition process, it fulfills two main roles: it is the a priori model for all target speakers when applying Bayesian adaptation to derive speaker models, and it helps to compute the log-likelihood ratio much faster by selecting, for each frame, the best Gaussians for which the likelihood is relevant. This work proposes to use the UBM as a guide to discriminative training of speakers [14].

5. COMPARATIVE RESULTS OF FUSED MEL FEATURE SETS AND LINEAR PREDICTIVE CEPSTRAL COEFFICIENTS

The main focus of this paper is the fusion of the two algorithms MFCC and IMFCC, and the comparison of the fused results with those obtained from LPCC: the identification accuracy obtained from the Fused Mel Feature Set is compared with that of the Linear Predictive Cepstral Coefficients, and the better of the two gives the more accurately identified speaker for the database used. A system performs better when two or more combined classifiers are supplied with information that is complementary in nature; to obtain higher identification accuracy, the MFCC and IMFCC features, which are complementary to each other, can be fused together. There are many possible combination rules, such as product, sum, minimum, maximum, median, and average. The sum rule outperforms the other combinations and is most resilient to estimation errors.

Figures 8 and 9 show the block diagrams of the Fused Mel Feature Set and of LPCC with the GMM-UBM modeling technique.


From Figures 8 and 9 we can see that the system includes training and testing for both the Fused Mel Feature Set and the LPCC feature set. The implementation is done on the TIMIT database. The TIMIT corpus is one of the standard databases used by many researchers for speaker identification, and this paper also uses it; the subset used here comprises 16 speakers.

Figure 8 Steps involved in speaker identification system (fused Mel features sets) [5]

Figure 9 Speaker identification system (LPCC) [6]

The recordings are from 8 dialect regions. Each speaker has 10 utterances, for a total of 160 recorded sentences (10 recordings per speaker). The audio format is .wav: single channel, 16 kHz sampling, 16-bit samples, PCM encoding.

The features are extracted using the Gaussian Mel scale filter bank, and the feature vectors are trained using the Expectation-Maximization algorithm. As the diagrams show, a separate model is created for each speaker [5]. In the testing step, features are extracted from the incoming test signal and the likelihood of these features under each speaker model is determined; the likelihoods for MFCC and IMFCC as well as for LPCC are computed. Two separate block diagrams are drawn for the Fused Mel Feature Sets and for LPCC. In the first diagram, a uniform weighted sum rule is adopted to fuse the scores from the two classifiers:

S_{com}^{i} = w \, S_{MFCC}^{i} + (1 - w) \, S_{IMFCC}^{i}    (34)


where S_{com}^{i} is the combined score of MFCC and IMFCC, S_{MFCC}^{i} and S_{IMFCC}^{i} are the scores generated by the MFCC and IMFCC models respectively, and w is the fusion coefficient. Along similar lines, the scores for LPCC are calculated and denoted S_{LPCC}^{i}.

The accuracies for Fusion and LPCC are calculated and compared. The weights and the number of mixtures can be changed to different values to test the system for the optimum result.
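Equation (34) itself is a one-line rule; the sketch below assumes the per-speaker scores have already been computed, and the default w = 0.5 is purely illustrative.

def fused_score(s_mfcc, s_imfcc, w=0.5):
    # Uniform weighted sum rule of equation (34); w is the fusion coefficient
    return w * s_mfcc + (1.0 - w) * s_imfcc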

Table I shows the performance of the proposed system for different weights and numbers of mixtures. As stated, the standard TIMIT database of 16 speakers is used, divided for training and testing: the UBM is built from 5 speakers and the GMMs from the remaining 11 speakers. The background model is generated by the UBM. The filter constant alpha is kept at 0.97. Accuracy is calculated on the basis of false positives and false negatives: in a false positive, a false speaker is accepted as a true one, while in a false negative, a true speaker is rejected as an impostor. The formula for accuracy is

Accuracy (%) = 100 - (FP + FN) * 100 / (M*N)

where M*N is the size of the confusion matrix.
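For completeness, the accuracy formula above translates directly to code; fp, fn, m and n stand for the false-positive count, false-negative count, and confusion-matrix dimensions.

def accuracy_percent(fp, fn, m, n):
    # Accuracy formula above: FP/FN counted against the M*N confusion matrix
    return 100.0 - (fp + fn) * 100.0 / (m * n)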

Table I. Comparative results for different numbers of mixtures and weights for the proposed system

No. of     Score threshold=0.6     Score threshold=0.77    Score threshold=0.8     Score threshold=0.97
Mixtures   Fusion(%)  LPCC(%)      Fusion(%)  LPCC(%)      Fusion(%)  LPCC(%)      Fusion(%)  LPCC(%)
4          92.56      84.29        92.56      85.95        92.56      85.95        91.73      86.77
8          94.21      86.77        92.56      86.77        92.56      87.60        92.56      87.60
16         92.56      71.07        95.04      73.55        95.04      74.38        92.56      76.85

Figure 10 Graphical Representation of table 1

From the table we can see the accuracy percentages obtained for different numbers of mixtures and different score thresholds. As the threshold increases, the accuracy generally increases, but in every case the accuracy of Fusion is better than that of LPCC: the performance of the fused system exceeds the performance of LPCC. The maximum performance is 95.04%, giving good identification with limited errors.


6. CONCLUSION

Many methods have been used for feature extraction, including MFCC and IMFCC. These two algorithms individually work well and give good accuracy, and since IMFCC helps MFCC improve its accuracy further, the two are combined into the Fused Mel Feature Set. The Gaussian Mixture Model is evaluated here for speaker identification, and performance is increased by fusing the complementary information. As shown in the table, the accuracy calculated for Fusion reaches 95.04% at score thresholds 0.77 and 0.8, which is better than the 73.55% and 74.38% obtained for LPCC at the same thresholds. Further enhancement may be achieved by changing the modeling technique and by trying various combinations of weights.

Future work may include applying the same database approach to develop a real-time application, and developing the system using an artificial neural network based approach.

REFERENCES

[1] J. Kittler, M. Hatef, R. Duin and J. Matas, On Combining Classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), pp. 226-239, March 1998.

[2] Mukesh Rana and Saloni Miglani, Performance Analysis of MFCC and LPCC Techniques in Automatic Speech Recognition, International Journal of Engineering and Computer Science, 3(8), pp. 7727-7732, August 2014.

[3] Sridha Sridharan and Eddie Wong, Comparison of Linear Prediction Cepstrum Coefficients and Mel-Frequency Cepstrum Coefficients for Language Identification, Proceedings of the International Symposium on Intelligent Multimedia, Video and Speech Processing, pp. 95-98, 2-4 May 2001.

[4] Sandipan Chakroborty and Goutam Saha, Improved Text-Independent Speaker Identification using Fused MFCC & IMFCC Feature Sets Based on Gaussian Filter, International Journal of Signal Processing, 5(1), pp. 11-19, 2009.

[5] R. Shantha Selva Kumari, S. Selva Nidhyananthan and Anand, Fused Mel Feature Sets Based Text-Independent Speaker Identification using Gaussian Mixture Model, International Conference on Communication Technology and System Design, Procedia Engineering, 30, pp. 319-326, 2012.

[6] Anagha S. Bawaskar and Prabhakar N. Kota, Speaker Identification Based on MFCC and IMFCC Using GMM-UBM, International Organization of Scientific Research (IOSR Journals), 5(2), pp. 53-60, March-April 2015.

[7] Octavian Cheng, Waleed Abdulla and Zoran Salcic, Performance Evaluation of Front-End Processing for Speech Recognition Systems, School of Engineering Report, The University of Auckland, Electrical and Computer Engineering, 2005.

[8] L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice Hall, Inc., Upper Saddle River, New Jersey, 22 April 1993.

[9] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.

[10] Pallavi P. Ingale and S. L. Nalbalwar, Novel Approach to Text Independent Speaker Identification, International Journal of Electronics and Communication Engineering & Technology, 3(2), 2012, pp. 87-93.

[11] Wen-Wen Chang, Time Frequency Analysis and Wavelet Transform Tutorial: Time-Frequency Analysis for Voiceprint (Speaker) Recognition, National Taiwan University.

[12] S. Pazhanirajan and P. Dhanalakshmi, EEG Signal Classification using Linear Predictive Cepstral Coefficient Features, International Journal of Computer Applications, 73(1), 2013.

[13] Yi-Hsiang Chao, W.-H. Tsai and Hsin-Min Wang, Discriminative Feedback Adaptation for GMM-UBM Speaker Verification, 6th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 1-4, 16-19 Dec. 2008.

[14] Manan Vyas, A Gaussian Mixture Model Based Speech Recognition System Using Matlab, Signal & Image Processing: An International Journal (SIPIJ), 4(4), August 2013.

[15] Amr Rashed, Fast Algorithm for Noisy Speaker Recognition Using ANN, International Journal of Computer Engineering & Technology, 5(2), 2014, pp. 12-18.

[16] Viplav Gautam, Saurabh Sharma, Swapnil Gautam and Gaurav Sharma, Identification and Verification of Speaker Using Mel Frequency Cepstral Coefficient, International Journal of Electronics and Communication Engineering & Technology, 3(2), 2012, pp. 413-423.

[17] N. Scheffer and J.-F. Bonastre, UBM-GMM Driven Discriminative Approach for Speaker Verification, Speaker and Language Recognition Workshop (IEEE Odyssey), pp. 1-7, 28-30 June 2006.