
2012 1st International Conference on Recent Advances in Information Technology (RAIT), Dhanbad, India, 15-17 March 2012

1st Int’l Conf. on Recent Advances in Information Technology | RAIT-2012 |

978-1-4577-0697-4/12/$26.00 ©2012 IEEE

Noise Classification Using Gaussian Mixture Models

Hitesh Anand Gupta
Electronics and Communication Engineering, Birla Institute of Technology, Mesra
Ranchi, India
[email protected]

Vinay M Varma
Electronics and Communication Engineering, Birla Institute of Technology, Mesra
Ranchi, India
[email protected]

Abstract: Gaussian Mixture Models (GMMs) have proven effective in modeling speech and other acoustic signals. In this study, we use GMMs to model four different noise sources, viz. subway, babble, car and exhibition. The expectation-maximization algorithm is used to fit the models. Further, we present the ‘threshold’ method, which uses the energy coefficient of the Mel-Frequency Cepstral Coefficient (MFCC) vector to identify the frames containing only noise (no speech) data.

Keywords: GMM, Expectation Maximization, MFCC, threshold method, Noise Classification.

I. INTRODUCTION

The problem of classification deals with recognizing patterns in data that separate dissimilar objects while grouping similar ones. Hence, the key to classification lies in the features extracted from the data. When this idea is applied to acoustic noise, the features readily at hand are the time-varying signal and its frequency spectrum.

Specifically, the problem here is to classify the different acoustic noises that hinder normal speech processing (e.g. speech recognition), such as the sound of a running bus or train, the constant whirr of a fan, or even the random noise of a machine gun firing in the background. Notably, each of these noise classes is distinct to the human ear, i.e. we can tell these noises apart.

Previously, the noise classification problem has been approached using neural networks [1] and line spectral frequencies (LSFs) [2]. Hidden Markov Models (HMMs) have also been applied in this area with varying results [3]. Here we approach noise classification as a pre-processing utility in a speech recognition module [4], using Gaussian Mixture Models (GMMs) for classification.

Using the time-varying signal itself as the feature is burdensome, considering the high sampling rate of speech recordings. The FFT does bring this high data rate down to a lower scale, yet it still falls short of being an ideal feature. Mel-Frequency Cepstral Coefficients (MFCCs) are presently a widely accepted and used feature: the time-varying signal is windowed, and the data in each window is reduced to 13 coefficients.

The training data for each class is used to find the statistical characteristics of the data, i.e. a mean, a variance and hence an ideally defined Gaussian. To increase the modeling power of this statistical description, the defined Gaussians are clustered to form a Gaussian Mixture Model for each noise class; specifically, the expectation-maximization algorithm is used to fit the GMMs. Once this is done, testing is carried out to verify the classification. The testing uses a maximum-likelihood estimator as a measure of similarity between a trained class model and the given data.
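The single-Gaussian statistics and the likelihood measure described above can be sketched as follows (a minimal NumPy illustration; the frame matrix is synthetic and the function names are ours, not the authors'):

```python
import numpy as np

def gaussian_stats(frames):
    """Estimate a single diagonal Gaussian from an (n_frames, n_dims) matrix."""
    mu = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6  # variance floor to avoid degenerate dims
    return mu, var

def log_density(x, mu, var):
    """Log-density of one frame x under the diagonal Gaussian (mu, var)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Synthetic stand-in for the MFCC frames of one noise class.
frames = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=(1000, 13))
mu, var = gaussian_stats(frames)
```

A frame close to the class mean scores a higher log-density than a distant one, which is the basis of the maximum-likelihood comparison used at test time.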

This paper presents a solution to the noise classification problem applied to the Aurora 2 database. Section II covers the different methods of classification along with a brief introduction to the Aurora 2 data. In Section III, the general experimental procedure is described. Finally, in Section IV, the results and a detailed analysis are discussed.

II. METHODS OF CLASSIFICATION

The Aurora 2 data comprises the TIDigits data artificially corrupted with different kinds of noise. The details of the Aurora 2 data can be found in [5].

The train data has samples at SNRs of 5 dB to 20 dB for Subway, Babble, Car and Exhibition noise-corrupted speech, while the testa data has samples of the same noise classes at SNRs of -5 dB to 20 dB. Testb has four other noise classes, namely Airport, Train, Street and Restaurant, while testc has one noise class each from testa and testb.

A. Classification using the entire ‘noise+speech’ data

Training is done on a concatenated feature file of all train data at a particular SNR. Since this data contains both noise and speech, the classifier is trained on the mixture of the two.


B. Offset method for noise extraction

Usually a recording has long silences at the start and end of the utterance; i.e., if we take the first and last ‘n’ frames of the data, we can approximate this extracted data as noise only. An offset window is applied to the original train data of a particular SNR, and the extracted frames are concatenated to form the new train data.
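The offset extraction can be sketched as follows (a NumPy illustration; the offset length n = 20 and the utterance matrices are hypothetical choices, not values from the paper):

```python
import numpy as np

def offset_frames(feats, n=20):
    """Keep only the first and last n frames of an utterance's (n_frames x 13)
    feature matrix, approximating the leading/trailing silence as noise-only
    data.  n = 20 is an arbitrary illustrative choice."""
    return np.vstack([feats[:n], feats[-n:]])

# Concatenate the extracted noise frames over several stand-in utterances.
utts = [np.zeros((300, 13)), np.ones((250, 13))]
train = np.vstack([offset_frames(u) for u in utts])
```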

C. Threshold method for noise extraction

This method is a more generalized form of noise extraction. Here, we specify a threshold on the 13th coefficient of the MFCC features of the clean train data and record the frame numbers whose values are lower than this threshold. (The energy of a signal is carried in the 13th coefficient and takes a low value during silence in the clean data.) The frames at these frame numbers in the noisy data are then concatenated, giving us our training data.
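A minimal sketch of the threshold selection, assuming the energy coefficient is stored in the last of the 13 MFCC columns (the function name, threshold value and data are illustrative):

```python
import numpy as np

def noise_frames_by_threshold(clean_feats, noisy_feats, threshold):
    """Select frames whose energy coefficient in the CLEAN data falls below
    `threshold`, and return the corresponding frames of the NOISY data."""
    energy = clean_feats[:, -1]           # 13th coefficient = energy
    idx = np.flatnonzero(energy < threshold)
    return noisy_feats[idx]

clean = np.zeros((100, 13))
clean[:30, -1] = 5.0                      # 30 high-energy "speech" frames
noisy = np.arange(100, dtype=float).reshape(-1, 1).repeat(13, axis=1)
noise_only = noise_frames_by_threshold(clean, noisy, threshold=1.0)
```

The frames selected in the clean stream index directly into the parallel noisy stream, which is why this method needs both versions of the data.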

D. Multivariate Training

As a refinement to the training method, the training set is a concatenated matrix of the feature files for all the different noise levels (SNRs of -5, 0, 5, 10, 15 and 20 dB) together. This can be used in combination with any of the above-mentioned methods.
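The multivariate concatenation amounts to stacking the per-SNR feature matrices (a NumPy sketch; the matrices here are stand-ins for the actual feature files):

```python
import numpy as np

# Stack the feature matrices from every SNR level into one training matrix.
snr_levels = [-5, 0, 5, 10, 15, 20]
feats_by_snr = {snr: np.full((50, 13), float(snr)) for snr in snr_levels}
multivariate_train = np.vstack([feats_by_snr[s] for s in snr_levels])
```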

III. EXPERIMENTAL PROCEDURE

A. Feature Extraction

MFCC features were extracted beforehand for each file of the test and train data using the HTK toolkit [6]. The MFCC was a 13-dimensional vector with the 13th coefficient being the energy (MFCC_0_D_Z_A).

B. Training

A concatenated train-data matrix is formed following one of the methods described in Section II. Training was done with the maximum number of iterations set to the default of 100. The GMM means were initialized to randomly selected samples from the given data. The number of mixtures was kept at 5, and only diagonal covariance matrices were used. The trained model was saved to an object file.
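The training step can be sketched as a diagonal-covariance EM loop under the stated settings of 5 mixtures, up to 100 iterations and random-sample initialization (a NumPy sketch of our own, not the authors' code; the demo data is synthetic):

```python
import numpy as np

def fit_diag_gmm(X, n_mix=5, n_iter=100, seed=0):
    """EM for a diagonal-covariance GMM over an (n_frames, n_dims) matrix."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=n_mix, replace=False)]   # random-sample init
    variances = np.tile(X.var(axis=0) + 1e-6, (n_mix, 1))
    weights = np.full(n_mix, 1.0 / n_mix)
    for _ in range(n_iter):
        # E-step: per-frame, per-component log joint, then responsibilities.
        log_p = (np.log(weights)
                 - 0.5 * np.log(2 * np.pi * variances).sum(axis=1)
                 - 0.5 * ((X[:, None, :] - means) ** 2 / variances).sum(axis=2))
        resp = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True))
        # M-step: re-estimate weights, means and diagonal variances.
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / n
        means = resp.T @ X / nk[:, None]
        variances = resp.T @ (X ** 2) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances

# Fit a 5-mixture model on stand-in "feature" data.
demo = np.random.default_rng(42).normal(size=(400, 13))
weights, means, variances = fit_diag_gmm(demo)
```

The diagonal covariance keeps the per-iteration cost linear in the feature dimension, which is the usual choice for MFCC-based models.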

C. Testing

The test data was broken into sentences, each of sentence length (Len_sent) 500 frames; Len_sent is the minimum amount of data the classifier needs to make a decision. Each sentence of the test data matrix (from the 4 matrices corresponding to the 4 noise types) was then used to calculate the average log-likelihood of the sentence with respect to each trained model, and the model giving the maximum average log-likelihood determined the output class. After iterating this procedure over the entire test data, the results were tabulated, and a plot of accuracy over varying test SNR (SNR vs. Accuracy, or SVA) was made for each noise type to find trends in the accuracy.
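The decision rule can be sketched as follows (a NumPy illustration; the toy one-component models stand in for the trained noise GMMs and are not values from the paper):

```python
import numpy as np

def diag_gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of frames X under a diagonal GMM."""
    log_p = (np.log(weights)
             - 0.5 * np.log(2 * np.pi * variances).sum(axis=1)
             - 0.5 * ((X[:, None, :] - means) ** 2 / variances).sum(axis=2))
    return np.logaddexp.reduce(log_p, axis=1).mean()

def classify(sentence, models):
    """Pick the noise class whose model gives the highest average log-likelihood."""
    return max(models, key=lambda c: diag_gmm_loglik(sentence, *models[c]))

# Toy one-component "GMMs": (weights, means, variances) per class.
models = {
    "subway": (np.array([1.0]), np.zeros((1, 13)), np.ones((1, 13))),
    "car":    (np.array([1.0]), np.full((1, 13), 8.0), np.ones((1, 13))),
}
sentence = np.full((500, 13), 7.5)   # a 500-frame test sentence near "car"
```

Averaging the per-frame log-likelihood over the sentence makes the score independent of the sentence length, so models trained on different amounts of data remain comparable.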

IV. EXPERIMENTAL RESULTS

Classification of ‘noise+speech’ signals is complicated mostly by the Babble noise, which is essentially a mixture of many speech signals. So it is expected that when the entire utterance is tested, there will be many discrepancies, especially with Babble. Table I summarizes the results obtained using the various methods described above. An analysis of each method is given here, together with the corresponding accuracy plots (SVA).

Fig. 1. SVA_full_testa

A. Full data (‘noise+speech’)

Fig. 1 is a plot of the accuracy of the classifier trained at 5 dB and tested with data of varying SNR. A general conclusion from this plot is that the accuracy decreases with increasing test SNR. The Car class deteriorates at a higher rate than Subway and Exhibition. On closer inspection, Babble behaves strikingly differently from the other classes: it starts off with low accuracy but is highly accurate at higher SNRs.

The first point can be rationalized by the increasing strength of the speech in the signal, which causes more confusion in the result. In general, a dialogue or monologue contains silence (or here, noise only) for just a fraction of the recorded time. Hence, a sentence length of 100 means many sentences consist mostly of speech, and the deterioration in accuracy is therefore expected.

The second point is also an anomaly caused by the speech signal itself. As Babble is an example of the “cocktail party” scenario [7], the higher SNRs give good results due to the similarity of the signals; at low SNRs the same property acts negatively due to the lack of loud speech.

When trained with data at only one particular SNR, the accuracy on the test data fell considerably for utterances at SNRs of 15 dB and 20 dB, particularly for data mixed with car noise. To overcome this drawback, we trained the model using the data from all SNRs from 5 dB to 20 dB together.

As can be seen from Fig. 2, the results improve considerably when the classifier is trained with multivariate data rather than 5 dB data alone. Subway and Exhibition maintain 100% accuracy up to 15 dB, after which they deteriorate slightly. Car shows much better results than before, where its deterioration was considerably greater. Babble is almost unaffected by the procedure and continues to give near-100% accuracy.

Fig. 2. SVA_full_testa_multi

B. Offset method

It was noticed that as the offset length increased, the overall accuracy improved until the first speech syllable was encountered, after which it deteriorated to the ‘noise+speech’ levels. Due to the high variation in the offset period across the data, further study was not conducted.

C. Threshold method

Fig. 3 shows the SNR vs. Accuracy plot when training with the normal train data and testing with the testa data while using the threshold method of extraction. The training used only the 5 dB train data, and the sentence length considered was 100. The accuracy, especially for Car, clearly improves over that obtained when training and testing with the ‘noise+speech’ data.

Fig. 4 is the corresponding plot for multivariate train data with the threshold method. The accuracies, even at higher SNRs, are very high, which leads us to conclude that this method is the best of those considered for classifying the noise.

D. Other test data sets (testb and testc)

Fig. 5 was plotted to check the integrity of the trained classifier: the classifier was trained on the train data classes, viz. Car, Babble, Exhibition and Subway, creating four models. These models were then tested with the testb data, which, as described above, contains four different noise classes. One class that shows a fair amount of recognition is Babble, which is discussed further in the next sub-section.

Fig. 3. SVA_onlynoise_testa

Fig. 4. SVA_onlynoise_testa_multi

Fig. 6 is a plot of the same models tested with the testc data. Testc has only two noise classes, one from testa and the other from testb, and as expected the noise class from testa was recognized successfully.

E. Babble - A Detailed Analysis

Due to its high likeness to speech, babble is a peculiar type of noise. Since our aim is to differentiate between four

particular kinds of noise, babble, being one of them, is simple to classify at low SNR. At higher SNRs, babble continues to give good results due to its homogeneous, speech-like nature, while other noises, particularly car, degrade rapidly (car noise is proposed to be the mildest compared to subway and exhibition). This is shown in Fig. 7.

Fig. 5. SVA_mix5_len100_testanoise_testbnoise_multi

Fig. 6. SVA_mix5_len100_testanoise_testbnoise_multi

TABLE I. CLASSIFICATION ACCURACY BY METHOD, SNR AND NOISE CLASS

Method                               SNR (dB)   Subway   Babble   Car     Exhibition   Overall
Full                                 -5         100%     86.3%    98%     100%         96%
                                     0          100%     97.1%    99.8%   100%         99.2%
                                     5          99.9%    99.1%    98.8%   99.7%        99.4%
                                     10         99.6%    99.8%    90.6%   98.1%        97%
                                     15         97.5%    99.6%    65.3%   92%          88.5%
                                     20         90.2%    99.8%    34.3%   81%          76.2%
Full Multi                           -5         100%     92%      100%    100%         98%
                                     0          100%     95.4%    98.8%   100%         98.8%
                                     5          99.9%    97.8%    99.8%   99.9%        99.3%
                                     10         99.8%    98.7%    98.7%   99.4%        99.2%
                                     15         99.6%    98.8%    96.9%   98.3%        98.4%
                                     20         96.8%    99.4%    91.1%   96.1%        95.8%
Only noise (Threshold method)        -5         100%     95.8%    99.2%   100%         98.7%
                                     0          100%     99.6%    100%    100%         99.9%
                                     5          100%     100%     100%    100%         100%
                                     10         100%     100%     100%    100%         100%
                                     15         100%     100%     99.4%   100%         99.8%
                                     20         100%     100%     89.3%   100%         97.3%
Only noise Multi (Threshold method)  -5         100%     96.3%    100%    100%         99.6%
                                     0          100%     99.8%    100%    100%         99.9%
                                     5          100%     100%     100%    100%         100%
                                     10         100%     100%     100%    100%         100%
                                     15         100%     100%     100%    100%         100%
                                     20         100%     100%     100%    100%         100%


Fig. 7. SVA_ mix5_len100_nobabble

V. CONCLUSION

Noise classification results using GMMs have been presented. MFCCs were used as features, the GMMs were fitted with the expectation-maximization algorithm, and the noise class was determined by maximum likelihood. The best results were observed for multivariate data with the threshold method. The noise class Babble was also studied, and it was found to cause misclassifications due to its high similarity to speech. Thus, it has been shown that GMMs are robust modeling tools for non-speech classification.

The offset method has the disadvantage of having to wait for the entire utterance before classifying. It also has a variable offset length, which may differ with each utterance and dataset. The threshold method gives results as soon as the minimum-length condition is satisfied, though the threshold value has to be assumed or evaluated beforehand. Further, the threshold method requires the clean data as well as the noise-corrupted speech data.

ACKNOWLEDGMENT

First and foremost, we would like to thank Prof. S. Umesh, PhD for providing us with an opportunity to work with him at Indian Institute of Technology, Madras. We would also like to thank Vikas Joshi and Raghavendra Bilgi without whom we wouldn’t have been able to do any of this.

REFERENCES

[1] B. D. Barkana and I. Saricicek, “Environmental noise source classification using neural networks,” in Proc. Seventh International Conference on Information Technology: New Generations (ITNG), pp. 259-263, April 2010.

[2] K. El-Maleh, A. Samouelian, and P. Kabal, “Frame-level noise classification in mobile environments,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’99), vol. 1, pp. 237-240, March 1999.

[3] P. Gaunard, C. G. Mubikangiey, C. Couvreur, and V. Fontaine, “Automatic classification of environmental noise events by hidden Markov models,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 1998), vol. 6, pp. 3609-3612, 1998.

[4] M. Fujimoto and Y. A. Riki, “Robust speech recognition in additive and channel noise environments using GMM and EM algorithm,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’04), vol. 1, pp. 941-944, 2004.

[5] H. G. Hirsch and D. Pearce, “The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions,” in Proc. ISCA ITRW ASR2000, September 2000.

[6] HTK Toolkit: Open Source Speech Recognition Software, http://htk.eng.cam.ac.uk/

[7] S. Haykin and Z. Chen, “The cocktail party problem,” Neural Computation, vol. 17, pp. 1875–1902, Sep 2005.