
Support Vector Machine Classifier for Pattern Recognition

Mohammad Farhan

Institute of Engineering and Computing Sciences, University of Science and Technology, Bannu

Pakistan [email protected]

Mujeeb Abdullah

Institute of Engineering and Technology (IET) University of Science and Technology Bannu

Pakistan [email protected]

Ghulam Kassem Int’l Centre of Wireless Collaborative Research

Shanghai Institute of Microsystem and Information Technology (SIMIT), Shanghai, P.R. China

[email protected]

Siddique Akbar Institute of Engineering and Technology (IET)

Gomal University D.I.Khan Pakistan

[email protected]

Abstract—Automatic speech recognition is carried out using Mel-frequency cepstral coefficients (MFCC). Filters spaced linearly at low frequencies and logarithmically at higher frequencies are used to capture the characteristics of speech. Multi-layer perceptrons (MLP) approximate continuous and non-linear functions. High-dimensional patterns are not permitted because of the eigen-decomposition required in high-dimensional image space and the degeneration of the scatter matrices for small sample sizes. Generalization, dimensionality reduction and margin maximization are controlled by minimizing the weight vectors. Results show good pattern recognition by the SVM algorithm with a Mercer kernel.

Keywords- Mel-frequency Cepstral Coefficient (MFCC), Linear Discriminant Analysis (LDA), k-Nearest Neighbor (kNN) classifier, Support Vector Machine (SVM), kernel, Mercer kernel

I. INTRODUCTION

Automatic speech recognition (ASR) has improved remarkably over the last decade and is in use in a diversity of domains. Such a system consists of 1) a feature extractor and 2) a classifier. Feature extraction is a linear process that reduces the dimensionality of the feature vectors to a low level. LDA gives optimum performance due to its built-in Fisher discriminant [1]. It is carried out by the Linear Discriminant Analysis (LDA) algorithm, which transforms parameter vectors into another feature space. Support Vector Machine (SVM) classifiers were introduced by Boser, Guyon and Vapnik at COLT-92 [4-6]; they use supervised learning techniques to classify data based on previous training inputs [7]. The SVM algorithm performs pattern classification and is suitable for non-linear formulations as well. By introducing weight functions, the separability criterion can be controlled. Object classes that appear close to each other in the output space and cause misclassification are supposed to be more heavily weighted in the input space [2]. The eigen-decomposition in high-dimensional image space and the degenerate scatter matrices for small sample sizes often

create problems in pattern recognition, and Mel-frequency cepstral coefficients (MFCC) provide support here [2]. SVMs are systems that use hyperplanes, i.e. a hypothesis space of linear functions in a high-dimensional feature space. The hyperplanes are trained with a learning algorithm that optimizes a learning bias based on statistical learning theory. SVMs have been tested for handwriting and face recognition, general pattern classification and regression-based applications [11,12]. Even with their complex hierarchy and design, SVMs provide good results compared with neural networks. The SVM method of classification follows supervised learning, involving an identification stage referred to as feature extraction, and produces favorable outputs. The biggest advantage of SVMs over neural networks is that they are easy to train and scale to complex high-dimensional data, at the expense of choosing a kernel function [2,12].

II. BACKGROUND

In pattern recognition, techniques using LDA optimize a low-dimensional representation of objects and focus on discriminant feature extraction [1]. Object classes appearing closer to each other in the output space are supposed to be more heavily weighted in the input space [2]. Discriminant analysis is a statistical technique that classifies objects into exclusive groups based on a set of measurable features. Mel-frequency cepstral coefficients (MFCC) are frequently used for speech processing [5]. Filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetic characteristics of speech. Machine learning is a part of artificial intelligence that makes computers learn from past experience and provided training data, and helps machines make decisions. SVMs are supervised learning techniques used to classify data based on previous training inputs [7]. Hyperplanes are trained with a learning algorithm. SVMs do not over-generalize the training data, whereas neural networks might [3]. Neural networks have offered good results for supervised and unsupervised


learning, but there is still much to do. Feed-forward and recurrent networks have been used in multi-layer perceptrons (MLP), which approximate functions exhibiting continuous and non-linear behavior [14]. A large number of hyperplanes can separate the data into classes, so maximum-margin classifiers are used. SVMs classify and approximate the training data and generalize better on given data sets [3,12]. By introducing an alternative loss function, SVMs are capable of solving regression problems [9,12]. Natural-text documents can be classified according to predefined categories. Kernels provide an ideal feature mapping for constructing a Mercer kernel [15]. SVMs are easier than neural networks to train and scale to complex high-dimensional data; the only problem is to find a good kernel function to guide the SVM [2,12]. MFCC also extracts characteristics from speech data that are distinct for each word and are generally used to differentiate between a vast set of unique words.
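As an illustration of the regression remark above, the following minimal sketch (not taken from the paper) uses the epsilon-insensitive loss of support vector regression, here through scikit-learn's SVR, to fit a noisy non-linear target; the data and parameter values are arbitrary assumptions.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(60)   # noisy non-linear target

# RBF kernel; epsilon defines the insensitive tube around the regression curve.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
svr.fit(X, y)
print("number of support vectors:", svr.support_vectors_.shape[0])
print("prediction at x = 2.5:", svr.predict([[2.5]]))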

III. THEME

Classification accuracy can be improved if weighting functions are introduced in LDA. High-dimensional patterns do not allow this scheme because of the sample size and the cost of computing the eigen-decomposition of the matrices. MFCC, which possesses good accuracy, is used for feature extraction and is combined with LDA as a further feature-reduction technique. MFCC extracts characteristics that are distinct for each word from the speech data and is used to differentiate between a vast set of words. Analogue speech signals are converted into digital form by sampling, to represent the original signal, and quantization, to obtain a digital signal. SVM systems use hyperplanes trained with an optimized learning algorithm. SVMs are basically classifiers but are also used for regression. Feed-forward and recurrent networks are used in multi-layer perceptrons (MLP) that approximate continuous and non-linear functions. For the kernel trick, the algorithm is expressed in terms of inner products of the data set, and the original data is passed through a non-linear mapping. Classification comprises training and testing of data; prediction is performed on data instances by the SVM. By introducing an alternative loss function, linear and non-linear regression problems can be solved with the SVM. Non-linear data is mapped into a high-dimensional feature space.
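The pipeline described in this section can be sketched as follows. This is only an illustrative outline under assumed settings: the file names, word labels, sampling rate and MFCC configuration are hypothetical, and librosa/scikit-learn stand in for whatever tools the authors actually used.

import numpy as np
import librosa
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

def mfcc_vector(path, sr=8000, n_mfcc=13):
    """Load a word utterance and summarise it as a fixed-length MFCC vector."""
    signal, sr = librosa.load(path, sr=sr)            # sampling / quantization
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                          # average over frames

# Hypothetical training data: 5 utterances for each of 5 spoken words.
paths  = [f"data/word{w}_{k}.wav" for w in range(1, 6) for k in range(5)]
labels = [w for w in range(1, 6) for _ in range(5)]

X = np.vstack([mfcc_vector(p) for p in paths])
y = np.array(labels)

lda = LinearDiscriminantAnalysis(n_components=4)      # at most C - 1 = 4 directions
X_lda = lda.fit_transform(X, y)

clf = SVC(kernel="rbf", C=1.0).fit(X_lda, y)          # SVM on the reduced features
print("training accuracy:", clf.score(X_lda, y))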

IV. METHODOLOGY

The sample mean is defined as

\tilde{m}_i = \frac{1}{n_i} \sum_{x \in C_i} W^{T} x, \qquad i = 1, 2    (1)

If the variance within the projected classes is small, then a large difference between the projected means implies well-separated classes. Simply increasing this difference by rescaling w cannot give an adequate separation between the samples of the projected classes. The spread of the projected samples measures the variation within each class, which in turn enables the distance between the projected means to be compared with the variation of the available data. The scatter S_i and the corresponding cost function are defined as:

\tilde{S}_i^{2} = \sum_{x \in C_i} \left( W^{T} x - \tilde{m}_i \right)^{2}, \qquad i = 1, 2    (2)

J = \frac{\left| \tilde{m}_1 - \tilde{m}_2 \right|^{2}}{\tilde{S}_1^{2} + \tilde{S}_2^{2}}    (3)

J can be expressed in terms of the projection direction vector w for N-dimensional data and the C-class case. J is derived from the within-class scatter matrix S_W and the between-class scatter matrix S_B, which are defined by

S_B = \sum_{i=1}^{C} n_i \, (m_i - m)(m_i - m)^{T}    (4)

S_W = \sum_{i=1}^{C} S_i    (5)

where m is the mean of the training images and n_i is the number of samples in the i-th class. If the number of samples is smaller than the dimensionality of the sample data, i.e. the Small Sample Size (SSS) case, the within-class matrix S_W becomes singular. The N-dimensional matrix S_W has at most P - C non-zero eigenvalues for a data set of P N-dimensional training samples representing C classes, so when N is greater than P - C the matrix S_W is singular. S_B has maximum rank C - 1, since it is generated as the sum of C matrices of rank one of which only C - 1 are independent. The generalized eigenvectors and eigenvalues of the within-class and between-class scatter matrices are obtained via

S_B V = S_W V \Lambda    (6)

where V and Λ are the eigenvectors and eigenvalues respectively. The eigenvectors are ordered by decreasing eigenvalue and the first C - 1 of them are retained to form the Fisher basis vectors. The non-centered images are projected onto the Fisher basis by taking the dot product of each image with each of the Fisher basis vectors.
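A minimal NumPy sketch of Eqs. (1)-(6) is given below: it builds the class means, the within- and between-class scatter matrices, and solves the generalized eigenproblem for the Fisher basis. The small ridge term is an added assumption to cope with the singular S_W of the small-sample-size case; it is not part of the derivation above.

import numpy as np

def fisher_basis(X, y, reg=1e-6):
    """X: (P, N) training vectors, y: (P,) class labels. Returns an (N, C-1) Fisher basis."""
    classes = np.unique(y)
    N = X.shape[1]
    m = X.mean(axis=0)                                   # global mean
    S_W = np.zeros((N, N))
    S_B = np.zeros((N, N))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                             # class mean, cf. Eq. (1)
        d = Xc - mc
        S_W += d.T @ d                                   # within-class scatter, Eqs. (2), (5)
        diff = (mc - m)[:, None]
        S_B += Xc.shape[0] * (diff @ diff.T)             # between-class scatter, Eq. (4)
    # Generalized eigenproblem of Eq. (6); regularise S_W against singularity (SSS case).
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W + reg * np.eye(N), S_B))
    order = np.argsort(evals.real)[::-1]
    return evecs[:, order[:len(classes) - 1]].real       # keep the C-1 Fisher vectors

# usage: W = fisher_basis(X, y); projected = X @ W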

For a linear SVM, the data is separated with a hyperplane; the method extends to non-linear boundaries via the kernel trick [12]. Correct classification of all data requires

i) w·x_i + b ≥ +1, if y_i = +1

ii) w·x_i + b ≤ -1, if y_i = -1

iii) y_i (w·x_i + b) ≥ 1, for all i

where w is the weight vector. Provided that the training data is of good quality and every test vector lies within a radius r of some training vector, the selected hyperplane is the one farthest from the data. Optimizing the resulting quadratic objective under these linear constraints yields the decision function

f(x) = \sum_{i} \alpha_i y_i (x_i \cdot x) + b    (7)
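The decision function of Eq. (7) and the constraints (i)-(iii) can be checked numerically with the following sketch. The toy data, the value of C and the use of scikit-learn's SVC are illustrative assumptions; dual_coef_ stores the products alpha_i * y_i for the support vectors.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],        # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-3.0, -2.0]]) # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

def decision(x):
    # f(x) = sum_i alpha_i y_i <x_i, x> + b, summed over the support vectors.
    return float((clf.dual_coef_ @ (clf.support_vectors_ @ x) + clf.intercept_)[0])

for xi, yi in zip(X, y):
    print(yi, round(decision(xi), 3),
          "satisfies y*f(x) >= 1:", yi * decision(xi) >= 1 - 1e-6)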


Figure: Feature space representation [12, 13]

For data with a linear structure, a separating hyperplane suffices to divide the data; in most cases, however, the data is non-linear and not separable [12,13]. Kernels map the data non-linearly into a high-dimensional space in which it becomes linearly separable [7]. SVMs have been successfully applied to handwriting and face recognition, and to general pattern classification and regression applications. If pixel maps are used, SVMs give accurate results compared with more complicated neural networks [2]. Neural networks use traditional Empirical Risk Minimization (ERM) whereas SVMs use the Structural Risk Minimization (SRM) principle and show good results [8,9]: SRM minimizes an upper bound on the risk, while ERM minimizes the error on the training data. SVMs are mainly used as classifiers but have recently also been used for regression [10,14]. After transforming the data into a feature space, a similarity measure can be defined on the basis of the dot product; a good choice of feature space makes pattern recognition easier [7]. The kernel defines this mapping:

K(x_1, x_2) = \phi(x_1) \cdot \phi(x_2)    (8)
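The following sketch shows two standard Mercer kernels of the form in Eq. (8). For the degree-2 polynomial kernel the feature map phi can be written out explicitly, so the identity K(x1, x2) = phi(x1)·phi(x2) can be verified directly; the example vectors are arbitrary.

import numpy as np

def polynomial_kernel(x1, x2, degree=2, c=1.0):
    return (x1 @ x2 + c) ** degree

def gaussian_kernel(x1, x2, gamma=0.5):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def phi_poly2(x, c=1.0):
    # Explicit feature map of the degree-2 polynomial kernel in two dimensions.
    return np.array([x[0]**2, x[1]**2,
                     np.sqrt(2) * x[0] * x[1],
                     np.sqrt(2 * c) * x[0], np.sqrt(2 * c) * x[1], c])

a, b = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(a, b), phi_poly2(a) @ phi_poly2(b))   # identical values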

As noted above, SVMs use the Structural Risk Minimization (SRM) principle rather than the Empirical Risk Minimization (ERM) of traditional neural networks, and they have been applied successfully to handwriting recognition, face recognition and general pattern classification and regression [8-10]. The kernel trick allows the SVM to extrapolate non-linear boundaries.

Kernel functions allow operations to be carried out in the input space instead of the complex high-dimensional feature space. Based on reproducing-kernel Hilbert spaces [12], they implicitly map the input attributes into the feature space. The Gaussian and polynomial kernels are the standard choices, and Mercer kernels provide an ideal feature mapping [11,13]. The low-frequency components of speech signals carry more information. The Mel scale is a measurement unit of sound pitch that is linear at frequencies below 1 kHz and logarithmic above 1 kHz. A speech signal is the convolution of the glottal pulse train with the vocal-tract impulse response. MFCC is essentially a de-convolution algorithm used to extract the vocal-tract impulse response: the cepstrum is obtained by transforming the speech signal so that the two components appear as a sum, and Mel scaling highlights the parts of the speech signal that carry the important information.

Frequency (Mel scaled) = 2595 \log_{10}(1 + f(\mathrm{Hz})/700)    (9)

The Mercer kernels provide an ideal feature mapping [11,13]. Each projected training vector is subtracted from the projected test sound vector and squared to get the distance; the sum of each column and its square root are calculated to obtain a row matrix, from which the mean and the minimum value are taken. Before classification, the analogue speech signal is converted to digital and quantized via x[n] = x(nT), where n is the sample number, T the sampling period and 1/T = Fs the sampling frequency in samples per second. The Discrete Fourier Transform pair and the complex cepstrum x̂[n] are defined as:

X(k) = \sum_{n=0}^{N-1} x(n) \, W_N^{kn}    (10)

W_N = e^{-j 2\pi / N}    (11)

\hat{x}[n] = \mathrm{IDFT}\{\log(\mathrm{DFT}\{h[n] * u[n]\})\} = \hat{h}[n] + \hat{u}[n]    (12)

where x̂[n], ĥ[n] and û[n] are the complex cepstra of x, h and u respectively [10].

Framing: the digital signal is segmented into a number of small frames of 20-40 milliseconds. Human speech exhibits quasi-stationary behavior over short durations, so the non-stationary signal is segmented into quasi-stationary frames. The transition from frame to frame is smoothed by making each frame overlap the preceding one by a predefined amount.
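A minimal sketch of the cepstral deconvolution in Eqs. (10)-(12) follows. For numerical simplicity it uses the log magnitude (real cepstrum) rather than the complex logarithm, and the "vocal tract" and "excitation" signals are synthetic toys, so this is only an illustration of the idea, not the authors' implementation.

import numpy as np

def cepstrum(x):
    spectrum = np.fft.fft(x)                       # DFT, Eq. (10)
    log_mag = np.log(np.abs(spectrum) + 1e-12)     # log magnitude (avoids log 0)
    return np.real(np.fft.ifft(log_mag))           # inverse DFT -> cepstrum

n = np.arange(256)
h = np.exp(-0.05 * n)                 # toy "vocal tract" impulse response
u = np.zeros(256); u[::64] = 1.0      # toy periodic excitation (glottal pulses)
x = np.convolve(h, u)[:256]           # speech model: x[n] = h[n] * u[n]

# In the cepstral domain the two components add, so the slowly varying vocal-tract
# part occupies the low-time bins while the excitation shows up as a pitch peak.
c = cepstrum(x)
print("pitch-related cepstral peak near bin:", 1 + np.argmax(c[1:128]))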

Windowing: The commonly used Hamming window is

w(n) = \begin{cases} 0.54 - 0.46 \cos\!\left( \dfrac{2\pi n}{N-1} \right), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}    (13)
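The framing, windowing and Mel-scaling steps just described can be sketched as follows. The 25 ms frame length, 10 ms hop and 8 kHz sampling rate are assumed values chosen for illustration; Eq. (13) gives the Hamming window and Eq. (9) the Mel mapping.

import numpy as np

def mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)          # Mel mapping, Eq. (9)

def frames(signal, frame_len, hop):
    """Overlapping frames; each frame overlaps the previous one by frame_len - hop samples."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

fs = 8000
frame_len, hop = int(0.025 * fs), int(0.010 * fs)         # 25 ms frames, 10 ms hop
window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))  # Eq. (13)

signal = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)     # 1 s test tone
windowed = frames(signal, frame_len, hop) * window        # framing + windowing
print(windowed.shape, "frames;  1 kHz on the Mel scale:", round(mel(1000.0), 1))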


Generally, the Gaussian and polynomial kernels are used [11] to map non-linear data into a high-dimensional space, but here Mercer kernels are used for an ideal feature mapping [13]. k-Nearest Neighbor: the projected training vectors are subtracted from the projected test sound vector and squared to get the distances; the sum of each column, its square root, and the mean and minimum value of the resulting array are calculated. The uttered portion of the speech signal is passed as an argument to the MFCC function to obtain the Mel-frequency cepstral coefficients, and the same process is repeated in each iteration for the total number of speech samples (25 in this case). All feature vectors in a class are centered by subtracting the class mean from each feature vector. The centered feature vectors are multiplied by their respective transposes and added to form the within-class scatter matrix and obtain the centered image; the sum of the scatter matrices gives the covariance matrix, which is the within-class scatter matrix. The centered means are multiplied by their transposes and added to form the between-class matrix. Eigenvalues and eigenvectors are computed and rearranged in descending order of eigenvalue to form the LDA eigenspace. Each feature vector of the audio data is multiplied by the transpose of the LDA eigenspace to project it. The index position of the minimum entry of the array is then found, with 0.3 * mean(array) used as the reference value. If the index position is between 0-5 the test sound file is matched with the first class (ONE in our case); if it is between 6-10 it is matched with the second class and the recognized word is TWO, and so on. The sound file chosen from the testing set was observed to be the 5th sound file, as there are two test sound files for each word.
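The nearest-neighbour decision rule described above can be sketched as follows. The helper name, the layout of five training samples per word and the word list are assumptions made for illustration; the distance computation follows the subtract-square-sum-root procedure in the text.

import numpy as np

def recognise(test_vec, train_proj, lda_basis, samples_per_class=5,
              words=("ONE", "TWO", "THREE", "FOUR", "FIVE")):
    """train_proj: (P, d) projected training vectors; lda_basis: (N, d) LDA eigenspace."""
    proj = test_vec @ lda_basis                      # project the test sound vector
    diffs = train_proj - proj                        # subtract from every training row
    dists = np.sqrt(np.sum(diffs ** 2, axis=1))      # squared, summed, square-rooted
    idx = int(np.argmin(dists))                      # position of the minimum distance
    return words[idx // samples_per_class]           # map index range to a word class

A call such as recognise(test_mfcc, projected_train, lda_basis) would then return the recognized word.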

V. RESULTS

The audio data matrix and the sound file matrix are shown in Table I and Table II. The LDA stage uses five classes per person, each with five samples. The centered feature vectors are multiplied by their respective transposes and added to form the within-class scatter matrix. The means of the five classes are centered by subtracting the total mean from each class mean. Each class mean is subtracted from every feature vector of that class to obtain the centered image, and the scatter matrix of each class is found by multiplying this matrix by its transpose. The sum of the scatter matrices gives the covariance matrix, which is the within-class scatter matrix. The MFCC features are multiplied by the transpose of the LDA eigenspace to project them, as shown in Table III; the LDA projection matrix is given in Table IV. All five classes of data are distinct and linearly separable. The projection of the test file in LDA space and the recognized word are shown in Fig. 1-3. The sound file chosen from the testing set was identified as the 5th sound file, as there are two test sound files for each word. The results show that SVM efficiency is reduced when the dimensionality of the training data is increased, and that LDA and SVM used together produce very good results, as shown in Fig. 4. The SVM applied to the audio data before LDA and the LDA projection of the training data are shown in Fig. 5 and Fig. 6 respectively.
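A hedged sketch of the comparison behind these results is given below: cross-validating an SVM on the raw feature matrix and on its LDA projection. The variables X and y are assumed to hold the audio feature matrix and word labels built earlier; the printed accuracies depend entirely on the data and are not the figures reported in this paper.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def compare(X, y, folds=5):
    """Mean cross-validated accuracy of an RBF SVM on raw vs. LDA-projected features."""
    raw = cross_val_score(SVC(kernel="rbf"), X, y, cv=folds)
    with_lda = cross_val_score(
        make_pipeline(LinearDiscriminantAnalysis(), SVC(kernel="rbf")), X, y, cv=folds)
    print("SVM on raw features :", raw.mean())
    print("SVM after LDA       :", with_lda.mean())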

VI. CONCLUSIONS

LDA is used as a feature extractor along with the k-Nearest Neighbor (kNN) classifier. The efficiency depends upon the choice of kernel. If the noise frequencies are cancelled and a clean signal is taken into account, the method applies well to sound signals. Speech recognition was carried out on real test data. With the raw audio data from the sound file as input, the SVM gives a prediction accuracy of 60%; with the LDA-projected data as input, it gives an accuracy of 100%. This shows that LDA and SVM together produce fully accurate results, an improvement of 40 percentage points in the prediction rate. It also shows that SVM efficiency is reduced when the dimensionality of the training data is increased, and that LDA and SVM used jointly produce accurate results.

TABLE I. AUDIO DATA MATRIX OF FEATURE VECTORS

TABLE II. SOUND FILE MATRIX


TABLE III. LDA EIGEN SPACE

TABLE IV. LDA PROJECTION MATRIX

Figure 1. Test file projection in LDA-space

Figure 2. Test file projection

Figure 3. Word recognized

Figure 4. SVM after LDA Cross-validation

Figure 5. SVM to Audio data before LDA

Figure 6. LDA projection train

REFERENCES

[1] Belhumeur P., Hespanha J., and Kriegman D., “Eigenfaces vs. Fisherfaces”, 1997.

[2] Belhumeur P., Hespanha J., and Kriegman D., “Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection”, IEEE Trans. PAMI, 19(7), pp. 711-720, 1997.

[3] Nello Cristianini and John Shawe-Taylor, “An Introduction to Support Vector Machines and other Kernel-based Learning Methods”, Cambridge University Press, 2000.

[4] Burges C., “A tutorial on support vector machines for pattern recognition”, Data Mining and Knowledge Discovery, vol. 2, Kluwer Academic Publishers, Boston, 1998.

[5] V. Vapnik, S. Golowich, and A. Smola, “Support vector method for function approximation, regression estimation, and signal processing”, in M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pp. 281–287, Cambridge, MA, 1997, MIT Press.

[6] C. Cortes and V. Vapnik, “Support vector networks”, Machine Learning, 20:273–297, 1995.

[7] N. Heckman, “The theory and application of penalized least squares methods or reproducing kernel Hilbert spaces made easy”, 1997.


[8] David M. Skapura, “Building Neural Networks”, ACM Press, 1996.

[9] M. Farhan, “Investigation of Support Vector Machine as Classifier”, MS Thesis, Nottingham University, Malaysia Campus, 2010.

[10] Tom Mitchell, “Machine Learning”, McGraw-Hill Computer science series, 1997.

[11] E. Osuna, R. Freund, and F. Girosi, “An improved training algorithm for support vector machines”, in J. Principe, L. Giles, N. Morgan, and E. Wilson, editors, Neural Networks for Signal Processing VII — Proceedings of the 1997 IEEE Workshop, pp. 276–285, New York, USA, 1997.

[12] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoér, “Theoretical foundations of the potential function method in pattern recognition learning”, Automation and Remote Control, 25:821–837, 1964.

[13] N. Aronszajn, “Theory of reproducing kernels”, Trans. Amer. Math. Soc., 68:337–404, 1950.

[14] M. O. Stitson and J. A. E. Weston, “Implementational issues of support vector machines”, Technical Report CSD-TR-96-18, Computational Intelligence Group, Royal Holloway, University of London, 1996.

[15] Duda R. and Hart P., “Pattern Classification and Scene Analysis”, Wiley, New York, 1973.

[16] V. Vapnik, “The Nature of Statistical Learning Theory”, Springer, N.Y., 1995.

[17] J. P. Lewis, “Tutorial on SVM”, CGIT Lab, 2004.
